# Hypothesis testing explained

This post is dedicated to all those who still believe that *R in a nutshell* is a
book for learning statistics… Let’s get started with the easiest question.

### What is a hypothesis?

In statistics, having a hypothesis means that we believe that the value of a parameter,
for instance the mean or the variance of a distribution, is close to a certain number.
Like any statement, it can be correct or completely wrong.
The data will tell us which of the two is better supported.
In order to overcome the trickiness of the concept let’s define some
terminology that might be helpful in the next paragraphs. Before the actual
analysis, scientists usually formulate some questions they would like to
answer. Those questions are technically referred to as *hypotheses*. Given the
null hypothesis $H_0$, the alternative hypothesis $H_1$ and the
sample $X = (X_1, \dots, X_n)$, the *rejection region* (also called
**critical region**) is defined as the region $C$ such that if $H_0$ is
accepted, $X \notin C$. Similarly, if $H_0$ is
rejected, the data do belong to $C$.

Probably the most classical way to explain hypothesis testing is by referring to the Gaussian distribution with both mean and variance known. Let me oversimplify the problem of attacking cancer with statistical analysis tools.

Imagine that there are some reasons to believe that gene RSPC is
responsible for a type of cancer. Moreover, doctors have samples of patients
who are affected by cancer (control). The control sample has a mean value for
gene RSPC, $\mu_0$. The idea behind hypothesis testing is that if
another group of patients has a mean value $\mu$ that is close enough to $\mu_0$, the hypothesis $H_0: \mu = \mu_0$ will be accepted and
the group will be labeled as *affected*. Conversely, if the mean value is not
close enough to $\mu_0$ then $H_1: \mu \neq \mu_0$ should
be accepted instead and the group will be labeled as *not at risk*. In terms
of the critical region, the aforementioned hypotheses are saying that if the
sample under investigation belongs to the critical region, we had better reject
$H_0$.

No need to be a genius to conclude that if the sample does not belong to the critical region, we have a good reason to accept $H_0$ instead. Fine!

Let’s define the critical region and we are done. Under the
assumption of a Gaussian distribution with known variance $\sigma^2$, the
critical region is

$$C = \lbrace X : |\bar{X} - \mu_0| > c \rbrace,$$

where $\bar{X}$ is an estimator of the mean (the sample mean) and $\mu_0$ is the mean claimed by the null hypothesis $H_0$. What
is $c$ then? $c$ indicates how far from the given mean we allow
the estimated mean to go before we call the difference
*significant*. That’s why it depends on two other concepts I’m introducing
now: the **type I error** and the **significance level** $\alpha$. The
former represents the error we could make by rejecting $H_0$ when it is
true. The latter is an upper bound of the probability of such an error. The
probability to reject $H_0$, erroneously, should be kept under control
and should not exceed $\alpha$. Therefore,

$$P_{\mu_0}\left(|\bar{X} - \mu_0| > c\right) = \alpha.$$

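Am I annoying if I say that again? $P_{\mu_0}\left(|\bar{X} - \mu_0| > c\right)$ is the probability that the given sample belongs to the
critical region (and $H_0$ should be rejected for that) but imagine that
a prophet told us in a whisper that $H_0$ is true and must be accepted
instead (that’s what the subscript $\mu_0$ stands for). Does it sound enough like an
error? That’s what statisticians call **type I error**. Under the assumption
of normal distribution, the test statistic

$$Z = \frac{\sqrt{n}\,(\bar{X} - \mu_0)}{\sigma}$$

follows the standard normal distribution when $H_0$ is true, where $n$ is the sample size and $\sigma$ the known standard deviation.
Therefore the probability
we are looking for is

$$P\left(|Z| > \frac{c\sqrt{n}}{\sigma}\right) = \alpha$$

and in conclusion

$$2\,P\left(Z > \frac{c\sqrt{n}}{\sigma}\right) = \alpha.$$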

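From the definition of the cumulative distribution function and the area under the normal curve, $P(Z > z_{\alpha/2}) = \alpha/2$, where $z_{\alpha/2}$ is the point that leaves probability $\alpha/2$ in the upper tail of the standard normal distribution. Therefore, the threshold $c$ satisfies $\frac{c\sqrt{n}}{\sigma} = z_{\alpha/2}$ and

$$c = \frac{z_{\alpha/2}\,\sigma}{\sqrt{n}}.$$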

Almost there. A test with significance $\alpha$ should reject $H_0$ if

$$\frac{\sqrt{n}\,|\bar{X} - \mu_0|}{\sigma} > z_{\alpha/2}.$$
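For those who prefer code to formulas, here is a minimal sketch of that two-sided rejection rule. It assumes Python 3.8+ (for `statistics.NormalDist`) and a known standard deviation. The function name `z_test_rejects` and the expression levels are hypothetical, made up only to have something to run.

```python
from math import sqrt
from statistics import NormalDist, fmean


def z_test_rejects(sample, mu0, sigma, alpha=0.05):
    """Two-sided z-test with known sigma: True means 'reject H0: mu = mu0'."""
    n = len(sample)
    x_bar = fmean(sample)                            # sample mean
    statistic = sqrt(n) * abs(x_bar - mu0) / sigma   # sqrt(n) * |x_bar - mu0| / sigma
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)     # upper-tail point z_{alpha/2}
    return statistic > z_crit


# Hypothetical expression levels, made up for illustration only.
group_a = [4.9, 5.2, 5.1, 4.8, 5.0, 5.3, 4.7, 5.0]
group_b = [6.1, 5.9, 6.4, 6.0, 6.2, 5.8, 6.3, 6.1]

print(z_test_rejects(group_a, mu0=5.0, sigma=0.5))  # sample mean is about mu0
print(z_test_rejects(group_b, mu0=5.0, sigma=0.5))  # sample mean is far from mu0
```

The first group sits right on the hypothesized mean, so the test has no reason to reject; the second group is more than one unit away with only eight observations, which is far beyond the threshold at the 5% level.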

Whenever the analyst has more or less confidence
in the truth of her hypotheses, she can
transfer that feeling to the significance level $\alpha$ and tighten or relax it
accordingly. In fact, a much stricter hypothesis test would be conducted with a
very small $\alpha$. This translates into being less tolerant of
the probability of a *type I error*.
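A quick way to convince yourself that $\alpha$ really controls the type I error is to simulate it. The sketch below draws many samples for which the null hypothesis is true by construction and counts how often the test rejects anyway. The sample size, number of trials and seed are arbitrary assumptions of mine, and `type_one_error_rate` is a hypothetical helper name.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def type_one_error_rate(alpha, mu0=0.0, sigma=1.0, n=10, trials=5000, seed=0):
    """Fraction of samples drawn under H0 that the z-test rejects anyway."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # z_{alpha/2}
    c = z_crit * sigma / sqrt(n)                   # half-width of the acceptance band
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]  # H0 is true here
        if abs(fmean(sample) - mu0) > c:
            rejections += 1
    return rejections / trials


for alpha in (0.10, 0.05, 0.01):
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    print(f"alpha={alpha:.2f}  z_crit={z_crit:.3f}  "
          f"simulated type I rate={type_one_error_rate(alpha):.3f}")
```

The simulated rejection rate should land close to each chosen $\alpha$, and the critical value grows as $\alpha$ shrinks: a stricter test demands a sample mean further from the hypothesized one before it rejects.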

The beauty of mathematical statistics
consists in the capability of explaining the same concepts in so many
different ways. Very often, academic papers and research studies in general
exploit the concept of **p-value**. Once the test statistic is computed
from the data (in the example above it is $\sqrt{n}\,|\bar{X} - \mu_0|/\sigma$) we might be interested in evaluating
the probability that the absolute value of a random draw from the standard normal distribution is
greater than our statistic. That probability is what they call the *p-value*.
If that probability is greater than a given threshold, then the null hypothesis $H_0$ should be accepted.

Which threshold? The significance level $\alpha$, of course!
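Here is a minimal sketch of the p-value for the two-sided z-test with known standard deviation, again assuming Python 3.8+ for `statistics.NormalDist`. The name `two_sided_p_value` and the measurements are hypothetical. The decision compares the p-value against the significance level: reject the null hypothesis only when the p-value falls below it.

```python
from math import sqrt
from statistics import NormalDist, fmean


def two_sided_p_value(sample, mu0, sigma):
    """P(|Z| > observed statistic) for Z ~ N(0, 1), assuming known sigma."""
    n = len(sample)
    statistic = sqrt(n) * abs(fmean(sample) - mu0) / sigma
    return 2 * (1 - NormalDist().cdf(statistic))


# Hypothetical measurements, made up for illustration only.
sample = [5.4, 5.1, 5.6, 5.2, 5.5, 5.3, 5.0, 5.3]
alpha = 0.05
p = two_sided_p_value(sample, mu0=5.0, sigma=0.5)
decision = "reject H0" if p < alpha else "do not reject H0"
print(f"p-value = {p:.4f} -> {decision}")
```

Comparing the p-value with the significance level gives the same verdict as comparing the statistic with the critical value; it is the same test read from the other side.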

### Notes

The simplification about cancer above is just ridiculous, I know. I’ve also read quite a number of papers in which they claim to govern the complexity of some aspects of (some types of) cancer with hypothesis testing, which I also find ridiculous.
