In this bite, you will learn about the process of testing hypotheses, how to form a scientific hypothesis and the meaning of p-values, significance levels and effect sizes.
What is hypothesis testing?
Hypothesis testing is a way of using inferential statistics to draw conclusions about a population based on a representative sample. Whatever your study’s design (experiment, observation, survey), collecting data from the entire population of interest would be extremely costly and usually close to impossible. Instead, we draw a sample from the population in focus and test our hypothesis on it, which allows us to generalize our findings to the population from which the sample was drawn.
The three steps involved in hypothesis testing are described below.
Stating the null and alternative hypotheses
Hypotheses are educated guesses about the condition of the world around us. Hypotheses are statements that should be both testable and falsifiable in some way.
The first step in hypothesis testing is to formulate the null hypothesis. The null hypothesis proposes that there is no relationship, association or effect among the phenomena you are investigating. It is a statement about a parameter of the population (for example, the mean) that is assumed to be true. You can think of hypothesis testing as a bit like ‘innocent until proven guilty’: we retain the status quo (the null hypothesis) until the evidence strongly suggests otherwise. That is why, when carrying out hypothesis testing, we seek to falsify the null hypothesis, which in turn lets us accept the alternative hypothesis. The alternative hypothesis is what you, as a researcher, think actually explains the phenomena. It directly contradicts the null hypothesis: it claims that the actual value of a population parameter is less than, greater than or not equal to the value stated in the null hypothesis. Below is an example of a research question, together with null and alternative hypotheses constructed to test it.
Say you were interested in the degree to which mindfulness impacts scores on an end-of-term statistics exam. You suspect that mindfulness will positively impact students’ grades, so you construct hypotheses to test this:
Null hypothesis (H0): Practicing mindfulness will have no effect on test scores compared to not practicing mindfulness
Alternative hypothesis (H1): Practicing mindfulness will positively affect test scores compared to not practicing mindfulness
In order to test our hypotheses, we must first collect data. We draw a sample of students from the class and randomly assign each of them to one of two groups: practice mindfulness for 1 hour a week for the semester, or do not practice mindfulness. Of course, we can’t conduct an experiment with all students in the world, so we work with a sample and then use it to make inferences about the population. This is the nature of hypothesis testing.
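The random assignment step described above can be sketched in a few lines of Python. This is only an illustration: the student labels, the sample size of 20 and the helper name `assign_groups` are assumptions made up for the example, not details of the study.

```python
import random


def assign_groups(students, seed=None):
    """Randomly split a list of students into two equally sized groups."""
    rng = random.Random(seed)      # a seeded generator makes the split reproducible
    shuffled = list(students)      # copy, so the original list is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical sample of 20 students drawn from the class
students = [f"student_{i:02d}" for i in range(1, 21)]

mindfulness_group, control_group = assign_groups(students, seed=42)
print("Mindfulness group:", mindfulness_group)
print("Control group:    ", control_group)
```

Shuffling before splitting means that every student has the same chance of ending up in either group, which is what lets us attribute a later difference in scores to the mindfulness practice rather than to how the groups were chosen.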
Setting level of significance and performing statistical analysis
Now that our hypotheses have been constructed and we have gathered the relevant data, we need to set a criterion for deciding when we can reliably conclude that the null hypothesis is not true and that the alternative hypothesis is. Similar to requiring evidence beyond reasonable doubt, the criterion lets you reject the null hypothesis and accept the alternative hypothesis while keeping the risk of error small.
The criterion, known as the significance level, is usually set at <0.05, but is occasionally set more stringently at <0.01. It is the threshold against which the p-value is compared. The p-value is the probability of observing results at least as extreme as those obtained in your sample, assuming that the null hypothesis is true, and a result is called statistically significant when its p-value falls below the criterion. With a criterion of <0.05 you are prepared to accept that 5% of the time (5 times in 100) you will find a statistically significant difference, effect or relationship in your sample even though none exists in the population from which your sample was drawn (that is, when the null hypothesis is true). Take a look at the following video for a good explanation of p-values. You can also take a look at this article, which explains significance levels in a bit more depth.
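To see what the 5% figure means in practice, here is a small simulation sketch. It repeatedly draws two groups from the same population (so the null hypothesis is true by construction), tests the difference in means, and counts how often the result comes out "significant". Everything in it is an assumption chosen for illustration: the exam-score mean of 70 and standard deviation of 10, the group size of 30, and the use of a simple two-sample z-test with known standard deviation in place of whichever test your own study would use.

```python
import random
from statistics import NormalDist, mean


def two_sample_z_p_value(a, b, sigma):
    """Two-sided p-value for a difference in means, assuming sigma is known."""
    standard_error = sigma * (1 / len(a) + 1 / len(b)) ** 0.5
    z = (mean(a) - mean(b)) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(alpha=0.05, n_experiments=2000, n_per_group=30, seed=0):
    """Share of experiments with p < alpha when the null hypothesis is true."""
    rng = random.Random(seed)
    mu, sigma = 70.0, 10.0   # hypothetical exam-score distribution
    significant = 0
    for _ in range(n_experiments):
        # Both groups come from the SAME population: there is no real effect
        group_a = [rng.gauss(mu, sigma) for _ in range(n_per_group)]
        group_b = [rng.gauss(mu, sigma) for _ in range(n_per_group)]
        if two_sample_z_p_value(group_a, group_b, sigma) < alpha:
            significant += 1
    return significant / n_experiments


rate = false_positive_rate()
print(f"Share of 'significant' results under a true null: {rate:.3f}")
```

The printed share should land close to 0.05. Lowering `alpha` to 0.01 makes such false alarms rarer, which is the sense in which the 0.01 criterion is more stringent.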
Interpreting your p-value and drawing conclusions
Our analysis of the data reveals that the mean exam score for those who practiced mindfulness was 81, whilst the mean score for those who did not was 63. This alone doesn’t settle our hypotheses: all it tells us is that, in our sample, those who practiced mindfulness did better in the exam than those who did not. It says nothing about the population. To make inferences about the population we need to refer to the p-value to determine whether or not we should reject the null hypothesis in favour of the alternative hypothesis. Remember, our p-value tells us the probability of observing results at least as extreme as ours if the null hypothesis were true, and is an indication of the strength of our evidence against the null hypothesis. Let’s say we run a statistical test of the hypotheses stated earlier and obtain a p-value of 0.03. What does this tell us about our hypotheses? It means that, if the null hypothesis were true, there would be a 3% chance of obtaining results at least as extreme as the ones we did. Having set our criterion at a significance level of <0.05, we are able to reject the null hypothesis (that practicing mindfulness has no impact on stats exam performance) and accept the alternative hypothesis (that practicing mindfulness positively impacts stats exam performance). We can say that our results are statistically significant, as the p-value is below our set significance level.
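One way to make the p-value concrete is a permutation test, sketched below. It is a swapped-in technique chosen because it needs only the Python standard library and mirrors the definition of a p-value directly; it is not the test used to obtain the 0.03 in the example. The scores are invented for illustration (chosen only so that the group means come out at 81 and 63), so the p-value it prints will not be 0.03.

```python
import random


def average(values):
    return sum(values) / len(values)


def permutation_p_value(treatment, control, n_permutations=10000, seed=0):
    """One-sided p-value for 'treatment mean is greater than control mean'."""
    rng = random.Random(seed)
    observed = average(treatment) - average(control)
    pooled = list(treatment) + list(control)
    n_treatment = len(treatment)
    at_least_as_large = 0
    for _ in range(n_permutations):
        # If the null hypothesis is true, group labels are interchangeable,
        # so reshuffling them shows which differences arise by chance alone
        rng.shuffle(pooled)
        difference = average(pooled[:n_treatment]) - average(pooled[n_treatment:])
        if difference >= observed:
            at_least_as_large += 1
    # Adding 1 to both counts keeps the estimate from being exactly zero
    return (at_least_as_large + 1) / (n_permutations + 1)


# Invented exam scores, for illustration only
mindfulness_scores = [78, 85, 90, 72, 81, 88, 76, 79, 84, 77]   # mean 81
control_scores = [60, 66, 58, 71, 63, 55, 69, 62, 64, 62]       # mean 63

alpha = 0.05
p_value = permutation_p_value(mindfulness_scores, control_scores)
print(f"p-value: {p_value:.4f}")
print("Reject the null hypothesis" if p_value < alpha else "Fail to reject the null hypothesis")
```

The p-value here is simply the share of reshuffled datasets whose difference in means is at least as large as the one observed, which is the "at least as extreme, assuming the null hypothesis is true" idea expressed as a count.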
“Statistical significance is the least interesting thing about the results. You should describe the results in terms of measures of magnitude – not just, does a treatment affect people, but how much does it affect them.”
-Gene V. Glass
A little note on effect sizes
Although we have concluded that our results are statistically significant and that we can therefore reject the null hypothesis, this conclusion tells us nothing about the magnitude of the observed effect, only that it exists. Depending on your chosen analysis, the effect size will usually be presented alongside the results. In our case, as we have run a t-test, we should look at the associated ‘Cohen’s d’ to determine the effect size. If your test is statistically significant it is always a good idea to report the effect size as well as the p-value, as together they tell the reader both 1) the strength of your evidence for rejecting the null hypothesis and 2) how large the observed differences, effects or relationships are between the groups you are investigating.
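If your software does not report Cohen’s d, it can be computed by hand as the difference between the two group means divided by their pooled standard deviation. The sketch below does this with the Python standard library. The function name `cohens_d` and the exam scores are assumptions made up for illustration, so the resulting value describes the invented data only.

```python
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    # Pooled variance weights each sample variance by its degrees of freedom
    pooled_variance = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / pooled_variance ** 0.5


# Invented exam scores, for illustration only
mindfulness_scores = [78, 85, 90, 72, 81, 88, 76, 79, 84, 77]
control_scores = [60, 66, 58, 71, 63, 55, 69, 62, 64, 62]

d = cohens_d(mindfulness_scores, control_scores)
print(f"Cohen's d: {d:.2f}")
```

As a rough guide, Cohen’s conventional benchmarks treat values of d around 0.2 as small, 0.5 as medium and 0.8 as large, although what counts as a meaningful effect depends on the field.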