A Reluctant Ombudsman: January 2016

Sunday, 3 January 2016

On Null Hypothesis Significance Testing, P values and the Scientific Method

Hypothesis Testing

Hypothesis testing is essential in science to determine the presence of an effect. A commonly used technique is null hypothesis significance testing (NHST), which asks whether the data in your sample look like they came from the distribution you would expect under the null hypothesis, i.e. whether your data differ from what would be considered 'normal' when no effect is present, and assigns a p value to that comparison. If the mean of your data sample is different from the mean expected under the null, this might tell you that your data is not simply a random sample but that an effect (your variable) is present.

We conduct Hypothesis Testing by comparing our alternative hypothesis against a null hypothesis. You either reject or fail to reject the null hypothesis (double negatives can be used in statistics - not implausible, failed to reject, etc.).

Failing to reject the null hypothesis - This does not mean that the null hypothesis is true, only that this sample does not show that the alternative is true. Not rejecting a position like the null hypothesis does not mean that we're saying it is correct.

Rejecting the null hypothesis - This does not mean that the null hypothesis is false/not true. Neither does it mean that the alternative is true. It just means that the data in this sample would be unusual if the null were true. The data in another sample might not be.

These two points above are important to understand because when we look for effects in data, particularly noisy data which might be influenced by a lot of factors, you cannot simply reduce the act of spotting an effect to rejecting or failing to reject a null hypothesis. This is because the null hypothesis is almost always false.

When you sample from a population, it will be a coincidence indeed if you get the exact same means in both your experimental and control samples. I think the more important question to ask is how much of an effect is present and under what conditions will it vary. Statistics is no substitute for thinking. You need to decide what an important effect is.
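A rough sketch of why "how much of an effect" matters more than "is the null false". It assumes a two-group z test with equal group sizes and a known standard deviation of 1, and the true difference of 0.1 standard deviations is a made-up value chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(diff, sd, n_per_group):
    """Two-sided p value for a difference in means between two equal groups.

    Assumes a z test with a known, shared standard deviation (an
    illustrative simplification, not a recommendation).
    """
    standard_error = sd * sqrt(2 / n_per_group)
    z = diff / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical tiny effect: 0.1 standard deviations, held fixed throughout.
tiny_effect = 0.1
p_by_n = {
    n: two_sided_p(tiny_effect, sd=1.0, n_per_group=n)
    for n in (10, 100, 1000, 10000)
}

for n, p in p_by_n.items():
    print(f"n per group = {n:>5}: p = {p:.4f}")
```

The effect is identical in every row; only the sample size changes, and the p value shrinks with it. Whether a difference of 0.1 standard deviations matters is still something you have to decide.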

Other points to remember -

- If you have a research question, circle around the problem, address it in different ways. Don't frame it in one specific manner and pin your conclusions on a null hypothesis to be tested.

- Hypothesis testing does not have to be applied to all questions. You can have one-off events worth studying that do not need falsification.

- It's OK to conceive your hypothesis after you have conducted research but it should be before you have analysed data statistically (more on this later).

- Hypothesis tests are always about population parameters, never about sample statistics. We always use the sample data to hypothesise about the population mean, not the sample mean.

- Hypothesis testing and significance testing are different things. Hypothesis testing, or null hypothesis testing, is about rejecting or failing to reject a null hypothesis; significance testing is about assigning a p value. We commonly use these two together in a hybrid called NHST, which is controversial.

Null Hypothesis Significance Testing (NHST) and P values

In order to conduct a hypothesis test, we usually set a significance level, a threshold on which we decide whether to reject or fail to reject the null hypothesis. This is how the NHST methodology works, but it has drawbacks, like a dependence on the p value. A p value is supposed to quantify the strength of evidence against the null hypothesis. It tells you how unusual the occurrence would be if it was due to chance.

The p value is the probability of observing a sample statistic like the mean being at least as extreme/favourable as it is in this sample, given our assumptions of the population mean.

p value = P(sample mean being as extreme | assumption about population mean)

It is simply a tail probability under a standardised normal distribution, the kind you read off a Z score table (you can find it using the pnorm function in R). For example, if you test two groups of people and group A gets 5 and group B gets 7 and you want to see if their scores are significantly different from each other, you subtract one score from the other and get 2, and then decide if this is significantly different from your null value, whatever it is (probably 0), given a certain standard error (remember that a test statistic is essentially an estimate divided by the error in that estimate).
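The group A versus group B example can be sketched in a few lines of Python. The standard error of 0.8 is an assumption invented purely for illustration (the example gives none), and the standard library's NormalDist().cdf stands in for pnorm.

```python
from statistics import NormalDist

mean_a, mean_b = 5.0, 7.0      # group scores from the example
null_difference = 0.0          # the null value, "probably 0"
standard_error = 0.8           # ASSUMPTION: invented for illustration

observed_difference = mean_b - mean_a

# Test statistic: the estimate divided by the error in that estimate.
z = (observed_difference - null_difference) / standard_error

# NormalDist().cdf plays the role of pnorm in R.
p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"difference = {observed_difference}, z = {z:.2f}, p = {p_two_sided:.4f}")
```

Change the assumed standard error and the same difference of 2 moves in and out of significance, which is why the number on its own cannot tell you whether the difference matters.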

One way to do this is to be so immersed in your subject matter, to be such a complete expert with full contextual knowledge, that you know subjectively if a difference of 2 really matters, if it really translates to real world significance. Remember that real world and statistical significance are two different things.

In statistical significance, you would run your test statistic against a normalised distribution, assuming it follows one, and your data might just be deemed significant if you get a low p value. The low p value is supposed to tell you that the probability of getting a difference of at least 2 is low, i.e. that it sits far out in one tail of the normal distribution, given a null default.

There are a few drawbacks to using p values as indications of significance. This paper shows us the harmful effects of using NHST and confusing statistical significance with real life significance but I've included my own notes below.

- Significance testing tells you more about the quality of your study (variation and sample size) than about your effect size which is more important. Andy Field has written a very easy-to-follow chapter on this topic.

- As I said before, p values are the probability of observing what you observed given a null default, but the default is never null. The null hypothesis might always be false since two groups rarely have the same mean. How then do you make sense of how probable your data is?

- The p value is conditional on the null hypothesis. It is not a statement about underlying reality. Even if it is accurate, the p value is a statement about data when the null is true, it cannot be a statement about data when the null is false.

- A p value is not the probability of the null hypothesis being true or false. The p value is the probability of extreme data conditional on a null hypothesis.

- It is not the probability of a hypothesis conditional on the data. P values tell us about our data based on assumptions of no effect, but we want a statement of hypotheses based on our data. To infer the latter from a p value is to commit the logical fallacy of inverting conditionals.

- P values do not tell you if the result you obtained was due to chance, they tell you if the result was consistent with being due to chance.

- p values do not tell you the probability of false positives. The sig level (not the p value) is the type I error rate i.e P(Type 1 error) or P(reject | H0 is true).

- This paper does a good job of expanding on my points above, listing a lot of the common misconceptions about p values and NHST. Highly recommended.

- If you're studying a non-stable process that spits out random values, p values are not meaningful b/c they are path dependent. In these cases, the p value isn't meaningful b/c it is a summary of data that has not happened, under assumptions that further data will follow a certain distribution.

- People use 0.05 as a significance level, but need to remember that hypothesis tests at that level are designed to call a set of data sig. 5% of the time when the null is true.

- Many studies show that you have a very good chance of getting a significant result that isn't really significant with a significance level of 0.05 (about 30% of the time). This paper in particular does a good job of explaining the high false discovery rate using a significance level of 0.05 and compares it to the screening problem, and this article summarises the points well. You can use a lower level like 0.001, but it really is up to you to decide what is statistically significant.
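The screening-problem arithmetic behind that last point can be sketched as below. The power of 0.8 and the assumption that only 10% of tested hypotheses describe a real effect are illustrative guesses of mine, not figures taken from the papers linked above; the function name is likewise hypothetical.

```python
def false_discovery_rate(alpha, power, prior_real):
    """Share of 'significant' results that are false positives.

    alpha      -- significance level, P(reject | null is true)
    power      -- P(reject | effect is real)
    prior_real -- fraction of tested hypotheses where the effect is real
    """
    false_positives = alpha * (1 - prior_real)
    true_positives = power * prior_real
    return false_positives / (false_positives + true_positives)


# ASSUMPTIONS: power and prior are illustrative guesses, not from any study.
for alpha in (0.05, 0.001):
    fdr = false_discovery_rate(alpha, power=0.8, prior_real=0.1)
    print(f"alpha = {alpha}: false discovery rate = {fdr:.1%}")
```

With these made-up inputs, roughly a third of the significant results at 0.05 are false positives, and the share drops sharply at 0.001. This is also the inverted conditional from earlier: the significance level is P(reject | null is true), whereas the false discovery rate is P(null is true | reject).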

The Scientific Method

All of this tells me that it is best, when tackling a problem, to go back to the philosophical foundations of why we do things.

Note that you only create a theory or hypothesis after you have evidence. Theories have to be based on evidence, preferably good data-driven evidence. You can't first make up a theory and then look for evidence to confirm or falsify your theory. This is how superstitions and pseudoscience are created. A deliberately vague theory will never be confirmed or falsified, only made to look unlikely. While quantifying how likely or unlikely the existence of an effect is, is the point of science, doing so is a waste of everyone's time if the effect was made up to begin with, so don't do this.

If you see something weird you can't explain, you don't automatically give it a name. That's merely classifying a phenomenon, putting it in a box that represents what you already know of the universe, which is incomplete. And your classification system or model or framework could be wrong. You need to do more. It is best to sit on the fence, admit your ignorance, and keep exploring, digging and asking questions of your phenomenon, all the while building better and better models to explain it and make predictions. This is preferable to classifying your phenomenon in terms of some pre-existing narrative that fits your own socio-cultural context, which would be a failure of critical reasoning.

I see this all the time. Once people identify with a narrative, everything they see will serve to strengthen that narrative. Supporters of a political party do not support that party because the evidence led them to support that party, they do so because of other reasons, like values that they identify with. But once the decision is made, evidence doesn't matter. We are slaves to narratives. Everything that follows is confirmation bias.

We use models because of their usefulness, not because they are correct. It seems to me that the best way to tackle a scientific question or puzzle is to first do exploratory research, just lots of multiple comparisons, or A-B testing, and obviously we wouldn't use p values here. We look at our exploratory data, at possible trends we see and that might or might not be true, that might reflect some underlying connections, and then create hypotheses based on what we've found in the data.

Here is where we switch from exploratory to confirmatory research. To confirm or falsify our hypotheses, we need to run experiments, which can involve hypothesis testing. And we have to gather new data for this. We cannot use the same data set for both exploratory and confirmatory research as that would be cheating ourselves and would not be scientific.

We pre-register our experiments so we can't change our minds later and claim we were always looking for what we ended up finding. Changing course after seeing the data is what is called the garden of forking paths, researcher degrees of freedom or p hacking. You can only test 1 hypothesis, not test 20 and then report only 1, or drop one condition so you get a sig. p value of < .05.
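A back-of-the-envelope sketch of why testing 20 hypotheses and reporting 1 is a problem. It assumes every null is true and that the tests are independent, which real analyses often are not; the function name is hypothetical.

```python
def chance_of_false_alarm(n_tests, alpha=0.05):
    """P(at least one significant result) when every null is true.

    Assumes the tests are independent of each other.
    """
    return 1 - (1 - alpha) ** n_tests


for n_tests in (1, 5, 20):
    chance = chance_of_false_alarm(n_tests)
    print(f"{n_tests:>2} tests: {chance:.0%} chance of at least one p < .05")
```

Under these assumptions, 20 tests give you better than even odds of at least one 'significant' result with no effect anywhere.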

There are really millions of variables that can correlate significantly with each other. Which is why we get significant correlations when we generate hundreds of 10 number strings of random numbers and then compare two strings. When you compare enough variables, you will find significant results. This is noise. This is just how large data works, or data without theory, or data with a theory that is ad hoc or made up and not evidence based. This is how superstition works. You need to look beyond this, to see if any of these correlations or effects are consistent and not merely noise.
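The random-strings experiment is easy to try. This is a minimal sketch, assuming normally distributed random numbers and a two-sided test at 0.05; the cut-off of 0.632 for |r| with 10 pairs (8 degrees of freedom) comes from standard critical-value tables, and the seed and the count of 200 strings are arbitrary choices.

```python
import random
from itertools import combinations
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# |r| needed for two-sided p < .05 with 10 pairs (8 degrees of freedom),
# taken from standard critical-value tables.
CRITICAL_R = 0.632

rng = random.Random(2016)  # arbitrary seed so the run is repeatable
strings = [[rng.gauss(0, 1) for _ in range(10)] for _ in range(200)]

# Compare every string with every other string.
pairs = list(combinations(strings, 2))
significant = sum(abs(pearson_r(a, b)) > CRITICAL_R for a, b in pairs)
fraction_significant = significant / len(pairs)

print(f"{significant} of {len(pairs)} pairs look 'significant' "
      f"({fraction_significant:.1%})")
```

None of these strings has anything to do with any other, yet a steady trickle of pairs clears the bar, at about the rate the significance level promises.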

So we conduct our confirmatory research, get our results, and then replicate to see if the results hold. Replication helps us confirm that the effect is real and wasn't just a coincidence. Also, keep in mind that if your hypothesis was based on a solid non-noisy phenomenon or theory that you had good reason to believe existed or was true, then replication should merely help ascertain this one way or another and not be a threat to you. It should all be part of the process of good science. If your effect was made up to begin with, or was noisy, then no amount of replication is going to help falsify something that never should have been investigated in the first place. In this sense, the original experiment bears no special status over and above the replication. They both need to be treated the same.

----------------------------------------------------------------------------------------

This then is 3 different experiments that we have conducted to find one effect. And where do p values come in? I think you can use them for confirmatory research, but only to tell you about your sample data distribution, about the probability that the data is consistent with chance, under repeated attempts. But you cannot use p values to tell you about your hypotheses. From what we've seen, p values cannot do that. They were not set up for that purpose and they don't work that way. You should be able to tell what a truly significant result is in your study without p values, or by looking at other statistics. Or maybe using Bayesian statistics.

On Happiness

I've been thinking about happiness recently, which is probably something that someone who is truly happy wouldn't do. Happy people don't think about or look for happiness. They merely live out their happy lives as normal. But over-thinking things is part of who I am, and it brings me an extreme sense of satisfaction, which I suppose is different to happiness but still important.

I meet a lot of expats in London. International working professionals here on a contract. They all come here for a change, to lead a better life, to make more money, to travel and see new places, or other reasons that they claim bring happiness. And I wonder how many of them are happy. Whether this is a useful question to ask is something I'll get to later. But let's say it is. Let's say happiness is important. Do people who move here for work end up happier than they were in their own countries? I'm not sure. A lot of them feel like they're merely chasing happiness, like they're still searching for something that they'll never find, or that they've only found temporarily until another happiness goal catches their fancy. I'm not sure.

There's this TED talk that says that happiness is mostly the quality of our relationships with other people, and I'm inclined to agree with this from the point of view of my own personal context. I personally derive a lot of happiness from good close personal relationships and shared experiences with family and friends, though I also think that other factors help - like having low expectations about certain things, having a pragmatic view about bad things that happen to you, having a positive attitude towards everything, and not tying your ambitions and career goals to happiness. Work for money, create for love, right?

This other talk separates happiness into synthetic and real. Synthetic happiness comes from doing what you are told will bring you happiness, accepting things you cannot change, and rationalising bad things as normal and happy. Also, people like things more when they think they're going to lose them. It defines real happiness as when we get whatever we want, which is something I don't get because we never get what we want and will constantly be striving from one happiness goal to another i.e one temporary island of happiness to another temporary island of happiness. It could just be semantics, but this isn't real happiness to me, this is just temporary contentment. But I guess this is happiness to a lot of people in the western world, who feel like they need to be in control of every aspect of their lives, and that control brings happiness. I take the other view, which is that since so much is out of your control, you can only be happy by letting go of it all and just doing things you enjoy without hurting people, and taking everything else in your stride without imagining that the universe is conspiring against you. Which is where the synthetic happiness comes in.

Then there's 'The Geography of Bliss' by Eric Weiner. A somewhat humorous look at why people in some countries are generally happier than others. Some of Weiner's book is of course typical western narrative tropes and hyperbole - Columbus, China's greed is bad, etc., but I picked up a lot of interesting points. Weiner visits the happiest countries on Earth to find out what makes them happy, while not confusing correlation/association with causation. Just because happy nations are characterised by certain factors doesn't mean these are causal factors, it could be the other way around.

The happy countries -

- The Dutch have things taken care of, and have permissive attitudes towards sex, drugs, etc.

- The Swiss are less tolerant than the Dutch, they have rules, boredom and nature. They are not ecstatically joyful, but content. They also have cleanliness, punctuality, things taken care of, they don't provoke envy in others, but suppress envy by hiding their wealth. They are surrounded by beauty and nature. They trust their neighbours, and have a sense of history and where they're from. They have fewer choices.

- The Bhutanese don't have unrealistic expectations. They don't try to be happy or try to achieve it. They don't talk about or analyse it. They don't ask themselves if they will cease to be so. Ignorance is bliss. There is also a lot of death, which gives you a different perspective on life. You develop a new way of seeing things after living with it. They are poor, but that doesn't matter. Money is only a means to an end. It is trust in people and institutions. Material wealth doesn't become so important.

- The Qataris leave everything to God. Maybe happiness comes from beliefs, not necessarily religious beliefs. They belong to one tribe with many rules, that allows you to have no rules outside it because you just won a lottery and can do anything with the money. You are happy as long as you are a high ranking member of this tribe. You don't need ambition or high expectations. The money takes care of everything. If this culture-less life is to your liking, you are happy. But money isn't everything - it has diminishing returns. You will always crave something else.

- The Icelanders are naive. They are free to try and to fail. They have a connection to their language. They are a small country, feel kinship to each other, protective of their well-being. Enjoy writing. Not affected by SAD. Have multiple identities, no envy of others. Suppress envy by sharing everything with others. A sense of self actualisation and the freedom to do what you want. They are free to share ideas without copyright. Self-delusion might be good - there's no one to tell you not to do something or express yourself. They constantly fail and create rubbish, but are happy doing so.

- The Thais have mai pen lai (never mind), jai yen (let it go), sanuk (fun). They have fun at work instead of the American work hard, play hard mentality. Their fun is interspersed throughout the day rather than regimented and taken too seriously. They don't take things too seriously. They don't think about things like happiness too much. Ignorance is bliss? They smile a lot.

The unhappy countries -

- The Moldovans have a lot of envy, are relatively poor compared to their European neighbours - poverty breeds envy of others' riches - there's also lack of trust - if something goes wrong, it is not their responsibility to fix. There's a feeling of powerlessness, helplessness.

The somewhat happy countries -

- The British believe in muddling through, getting by. They are reserved, not tactless, are afraid of offending people, don't hug, are a country of grumps. Does culture impede happiness? I don't think it's that simple. Having lived in England and Scotland, I think people here are definitely happy, they just don't show it (btw, don't ever introduce yourself right away in an English pub - rookie mistake). But I'm not sure why they would rank lower than the other countries.

- The Indians are a mixed lot. The ones who are happy believe that life is an act, and don't take it too seriously. New tech cities are both the problem and the solution. People have long long work hours, poor work life balance, and then special workshops and ashrams to fix them. Calcutta's poor are happier than America's poor - stronger family ties? (Btw, flattery can get you an interview in India, and much else). He says nothing about unhappy people in India. I guess it could be a lack of trust in your neighbour and public institutions. All the happiest people I know in India derive happiness from relationships in their communities, but not necessarily within communities. Indian diversity can be comforting, but I think people's biases and ingroup-outgroup mentality combined with their narrow-mindedness about culture can serve to create distrust and hate.

- The Americans are constantly searching for happiness. Their unhappiness could come from unrealistic expectations. Self help books teach them to look inwards not outwards towards relationships that really matter. Maybe you need to commit to a place or people to be happy, you can't always have one foot out the door.

What happiness isn't -

To quote the book, "Happiness is not feeling like you need to be somewhere else or doing something else." But I think that's your other goals, which are fleeting and constantly changing. I think it's fine to have them, we all have career and self-fulfilment goals and wants, and striving to accomplish them is fine, but our success or failure in said exercise shouldn't make a difference to our happiness, if in fact happiness is more important.

It's not about ambition or success. Failure might happen despite your best laid plans, and while success can bring you satisfaction, I feel it's the journey, the striving for success that brings you happiness.

Knowledge doesn't necessarily make you happier, though it has other obvious advantages. So is ignorance bliss? Not necessarily, in my opinion. It's not about knowledge vs ignorance w.r.t happiness. Neither is a factor, your happiness depends on other things.

It isn't about money or material wealth. Money helps, but just a little, it doesn't guarantee lasting happiness. Law of diminishing returns.

What happiness is -

We constantly try to synthesise happiness, we think it is something to be found. Perhaps it is more a thing to be created, or a state to be evolved into. To my understanding, it is having close personal healthy relationships with friends and family, living in a society with a lot of trust, and no envy, uncertainty or fear, and finally, having a pragmatic outlook on life, understanding that events are unpredictable, but having something to look forward to and doing your best anyway, about having a sense of not-wanting. Living among a homogenous society with reliable public institutions and like-minded people also helps. The closer knit the community the happier you are, as long as you subscribe to the cultural mores of that community. Tough luck if you don't. Perhaps that's why people move away. To me, certain environmental conditions also matter, like living in clean cool quiet surroundings with access to good food and being intellectually stimulated.
