In very many studies, “statistical significance” is translated to mean “true”. But I’m here to tell you that’s far, far off the mark, for several very good reasons. So in the absence of more robust information, it’s just one more way that a claim of “evidence basis” can be complete BS.
Emily Oster is one of my favourite people and writes for parentdata.org (she’s big on data, in case you hadn’t guessed). She recently wrote one of the best pieces I’ve seen for really getting across the abject meaninglessness of “statistical significance” in so many studies.
Check it out - great graphs that really get her points across.
Publication Bias
You and I have seen this term lots of times, and we take it to mean that bias determines whether a paper gets published or not, rather than the quality of the paper. So what’s this bias?
A lot of papers get published each year, and almost all of them claim some significant positive outcome. But many, many other papers are submitted and never published, and a lot of these report negative or null outcomes for the exact same type of study.
So the “evidence basis” that is claimed for some interventions can be nonsense because it takes no account of the mass of unpublished studies that show no such thing. (And keep in mind a lot of these papers, while claiming positive outcomes, actually show outcomes that are pretty tepid anyway. They’re not evidence, they’re “on the nose”.)
By selecting primarily for positive outcomes, academic publishing is currently responsible for grievously misleading health professionals in relation to “evidence”. And some of the dirty shenanigans that go on, favours traded, fear of lawsuits from eminent researchers who’ve been caught out publishing bunk and object to being outed, and so on, make the whole problem even worse.
So before you get sucked in by “evidence basis” go hunting for a good debunking or two. There are real scientists out there who see through this stuff like a freshly-cleaned window pane.
Pure Chance
Sure, I’ve seen the coin tossing analogy used before, but not so powerfully as Oster describes it and charts it. (Incidentally, in the same article she also described the notorious dead salmon that achieved statistical significance for brain activity in relation to images it was shown while resting in an fMRI machine.)
Anyway, back to coin tossing. Oster ran a fake study to determine whether 25 people who ate a packet of green M&Ms tossed more heads than 25 people who ate a packet of blue M&Ms. I think you’d agree the colour of the M&Ms was not causal; indeed, nothing could be causal, because the head/tail outcomes were randomly generated by computer.
There was a tiny difference (58% vs 54%, pretty much what you’d expect when the result is pure chance). But to get a bigger picture, Oster ran the experiment another 99 times. Coin tossing being what it is, most runs showed insignificant differences between the groups, but 4 of them came up very much “statistically significant”, with p < 0.05.
What this means is that purely by chance, if there are enough studies, at least some of them are going to be “statistically significant” and these are the ones that are more likely to get published. And they can be no more relevant than a coin toss.
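If you’d like to see this for yourself, here’s a minimal sketch of the same idea in Python. It’s my own toy version, not Oster’s actual setup; the group sizes match her description, but the number of flips per person and the choice of test are just assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

N_PEOPLE = 25        # people per group ("green" vs "blue" M&Ms)
N_FLIPS = 20         # coin flips per person (assumed; Oster's exact setup may differ)
N_EXPERIMENTS = 100  # how many times the whole fake study is repeated

significant = 0
for _ in range(N_EXPERIMENTS):
    # Every flip is a fair coin -- M&M colour has no effect whatsoever.
    green = rng.binomial(N_FLIPS, 0.5, size=N_PEOPLE)  # heads per "green" person
    blue = rng.binomial(N_FLIPS, 0.5, size=N_PEOPLE)   # heads per "blue" person

    # Compare mean heads per person between the two groups.
    _, p = stats.ttest_ind(green, blue)
    if p < 0.05:
        significant += 1

print(f"'Statistically significant' results: {significant} out of {N_EXPERIMENTS}")
# Expect roughly 5 -- exactly the false-positive rate the 0.05 threshold allows.
```

Run it a few times and you’ll typically see a handful of “significant” results per 100 experiments, with no causation anywhere in sight.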
Why Do I Go On About This Stuff?
What I’m trying to do here is play a small part in bringing the practice of psychology out of the world of unvalidated hypothetical frameworks and uninformative research, and into reality, where dud interventions are seen for what they are, and more of my colleagues begin to use interventions that are reliable, predictable, efficacious and enduring in their effect. That, in turn, is what real clinical significance looks like.
If you feel as I do, please support this newsletter, and the associated free training, as far as you are able, whether by the much-appreciated act of sharing, or by helping to fund it.
See you soon!
Great reminder! I also tell my students that sometimes we can get a highly significant finding quite consistently, e.g., one with a probability of occurring by chance of less than one in 1,000 (p < .001), but we have to ask ourselves whether it is meaningful. As demonstrated in this link, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4456887/, we are talking about a highly reliable difference of about 20 msec in simple reaction times between men (faster) and women (slower). I cannot think of anything we do as human beings where this would be truly meaningful. So we have to be cautious. Also, sample sizes can affect significance levels, and so on. A BASIC understanding cannot be ignored. I think you did a very nice job of that basic understanding.
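To put rough numbers on that point, here is a small sketch using made-up but plausible reaction-time values (not the data from the linked paper) showing how a difference of only about 20 msec becomes overwhelmingly “significant” once the samples are large enough, even though it is far too small to matter in everyday life.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical simple reaction times (ms): a real but tiny 20 ms gap in means,
# with person-to-person spread of ~50 ms (illustrative numbers only).
n = 5000  # per group -- large samples make even trivial differences "significant"
men = rng.normal(loc=280, scale=50, size=n)
women = rng.normal(loc=300, scale=50, size=n)

t, p = stats.ttest_ind(men, women)
print(f"t = {t:.1f}, p = {p:.2e}")                                # p is astronomically small
print(f"Mean difference: {women.mean() - men.mean():.1f} ms")     # still only ~20 ms
```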