Does organic food cause autism? Could Nicolas Cage movies make you more likely to drown? Six ways to misuse statistics

Back in the 1940s, before the polio vaccine was invented, the disease caused a lot of anxiety among parents of small children. How could you reduce your child’s risk of contracting this nasty illness? Some misguided public health experts apparently recommended avoiding ice cream, thanks to a study that showed a correlation between ice cream consumption and polio outbreaks. This study fortunately was BS. Yes, there was a correlation between ice cream consumption and polio outbreaks, but that was because both were common in the summer months. The authors of the study had mistaken correlation (ice cream consumption and polio are more common at the same time) for causation (ice cream increases your risk of disease).

Medical researchers often trawl through data sets to try to figure out what environmental factors cause chronic disease. Unfortunately, these kinds of studies sometimes make the same kinds of mistakes as the ice cream and polio study. Doctor and researcher John Ioannidis got a lot of people all riled up when he claimed in 2005 that “most published research findings are false”. As controversial as his main claim might be, he was absolutely right when he pointed out there are some serious problems with the way statistics are often used, and some medical research studies are misleading or flawed as a result. Popular science articles in the media and on the Internet compound this problem by ignoring the limitations of the study they’re reporting. Fortunately you don’t have to be a math major to spot these kinds of problems; just some basic critical thinking skills will do. Here are six ways people sometimes misuse statistics and how to spot them.

1) Assuming correlation = causation.

Does spending money on science cause suicide? Clearly it does. Look at the numbers!

And what about the remarkable correlation between deaths in swimming pools and Nicolas Cage movies? Come on. Surely you now realize that Nicolas Cage movies cause drowning?

And my favorite: clearly the real culprit behind autism is the increased consumption of organic food! Look guys the numbers prove it! It’s science!

(For more of these entertaining but ridiculous correlations see this website.)

As you can see, just because there’s a correlation between two things doesn’t mean the one causes the other. There may be a third factor involved. Think about the ice cream and polio study. Ice cream and polio were correlated because there was a third hidden factor the study ignored (the summer) contributing to both. Statisticians call a third hidden variable a confounding factor. Or sometimes you can wind up with a correlation like the Nicolas Cage vs. drowning thanks to sheer luck of the draw — random chance.
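To see how a hidden third variable can manufacture a correlation, here is a minimal simulation sketch. Every number in it is invented for illustration (the share of summer weeks, the ice cream units, the case counts); it is not data from the polio study. In the simulation, ice cream and polio never influence each other. Summer simply pushes both up.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def simulate_weeks(n_weeks=2000, seed=0):
    """Made-up weekly records: (is_summer, ice_cream_sold, polio_cases)."""
    rng = random.Random(seed)
    weeks = []
    for _ in range(n_weeks):
        summer = rng.random() < 0.25                       # the hidden third variable
        ice_cream = rng.gauss(100 if summer else 40, 10)   # invented units
        polio = rng.gauss(30 if summer else 5, 5)          # invented case counts
        weeks.append((summer, ice_cream, polio))
    return weeks

weeks = simulate_weeks()
overall = pearson([w[1] for w in weeks], [w[2] for w in weeks])

# Hold the confounder fixed by looking at summer weeks only.
summer_weeks = [w for w in weeks if w[0]]
within_summer = pearson([w[1] for w in summer_weeks], [w[2] for w in summer_weeks])

print(f"all weeks:   r = {overall:.2f}")
print(f"summer only: r = {within_summer:.2f}")
```

Across all weeks the correlation comes out strongly positive, but once you look only within summer it hovers near zero. That is the signature of a confounding factor: the relationship disappears when the third variable is held constant.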

2) Data dredging.

Data dredging is a problem in medical research. Let me make up a totally hypothetical example to show you how this works. (And be forewarned I’m going to make this as ridiculous as possible.)

Let’s say you select a random sample of one thousand people and do a survey with two questions: 1) have you seen a Nicolas Cage movie in the last 365 days and 2) as of this moment on a scale from one to twenty how intense is your desire to drown yourself in a freshwater pool? (That freshwater part is important, I think.) Let’s say the average desire for drowning of the Nicolas Cage watchers was 12 while the average desire for drowning of the non-Cage watchers was 10. So on average the people who watched Nick Cage reported a desire to drown themselves 1.2 times as strong! OMG! But wait… I just randomly picked a sample of a thousand people. If I picked a different sample would I get a different answer? How do I know this result isn’t just luck of the draw?

What many researchers in medical science will do is calculate a p-value. The best way to explain this is with a picture.

[Figure: The mechanism of statistical testing.]

In most human populations, many traits will follow a kind of bell-curve distribution like the one above. In our Nicolas Cage example, let’s say the desire for drowning is on the x-axis and the number of people with that expressed desire is on the y. We’re assuming that if you went through the whole population and graphed how many people had a desire for drowning of say 8 or 9 or 10 etc., you’d end up with a bell-shaped curve like the one shown with the same average seen in our non-Cage watching group. So the p-value answers this question: if watching Cage really makes no difference, what is the chance of accidentally drawing a sample with an average desire to drown of 12 or more? In other words, could we by chance have drawn a sample up near the far end of the bell curve?
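If you’d like to see the “could this be luck of the draw?” question answered by brute force, here is a rough sketch. Note that it swaps in a different technique, a permutation test, rather than the bell-curve calculation described above: it shuffles the Cage/non-Cage labels many times and counts how often chance alone produces a gap at least as big as the one observed, in either direction. The survey scores are invented, and `fake_scores` and `permutation_p_value` are hypothetical helper names.

```python
import random

def permutation_p_value(group_a, group_b, n_shuffles=2000, seed=0):
    """Share of label shuffles whose mean gap is at least the observed gap."""
    rng = random.Random(seed)
    n_a, n_b = len(group_a), len(group_b)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # pretend the group labels mean nothing
        gap = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if gap >= observed:
            hits += 1
    return hits / n_shuffles

def fake_scores(center, n, rng):
    """Invented 1-to-20 survey answers scattered around `center`."""
    return [min(20, max(1, round(rng.gauss(center, 4)))) for _ in range(n)]

rng = random.Random(1)
cage = fake_scores(12, 200, rng)      # made-up Cage watchers
no_cage = fake_scores(10, 200, rng)   # made-up non-watchers

p = permutation_p_value(cage, no_cage)
print(f"estimated p-value: {p}")
```

A small p-value here means shuffled labels almost never reproduce a gap that large. It does not say anything about why the gap exists.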

Medical science researchers have arbitrarily chosen a cutoff for the p-value of 5% or 0.05. In other words, if the difference between group A (people who eat meat, people who watch Nicolas Cage, people who work with chemical Y etc.) and group B is big enough that the p-value is less than 5% we say “this difference is statistically significant”. This has become a time-honored convention but it suffers from some obvious problems. For one, think about that 5% chance for a minute. It sounds small but really it’s not. If I do twenty independent studies with random samples of the exact same size and I do them all the same way, and there is no real difference to find, there is roughly a 64% chance I’ll get at least one result that looks statistically significant but really isn’t. I’ll get a result that looks significant but was really just caused by luck of the draw.

This is why data dredging is a problem. Some folks will go through datasets and do multiple p-value tests for multiple possible correlations. So they’ll test to see whether Disease X in their sample is correlated with consumption of A or B or C or… and, well, the more of these statistical tests you do, the more likely it is you’ll come up with a result that looks significant but was really just caused by chance. Once they find a result they report it without bothering to mention in the paper “we tested for twenty different correlations and this is the only one we found”. If you’re going to use multiple tests like that you need to use a p-value threshold much lower than 5%, but because 5% has become The Convention We Use in Medical Research some folks will stick with 5% anyway. It’s difficult to know how common this problem is because typically the papers that did this don’t say they did it. Which is sort of frustrating.
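The arithmetic behind this is short enough to sketch. The snippet below assumes the tests are independent and that none of the effects is real, which is a simplification. It also shows one common way to lower the threshold, the Bonferroni correction (divide 5% by the number of tests); the post doesn’t name a method, so treat that choice as mine.

```python
def chance_of_false_positive(n_tests, alpha=0.05):
    """P(at least one p < alpha) over independent tests when nothing is real."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test cutoff that keeps the overall false-positive chance near alpha."""
    return alpha / n_tests

for k in (1, 5, 20, 100):
    print(
        f"{k:>3} tests: "
        f"chance of a fluke = {chance_of_false_positive(k):.1%}, "
        f"Bonferroni cutoff = {bonferroni_threshold(k):.4f}"
    )
```

With twenty tests the chance of at least one fluke is about 64%, and the corrected cutoff drops from 0.05 to 0.0025.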

Which brings me to our next problem…

3) Small sample sizes.

Sample sizes are important for many reasons. For one thing, in a bigger sample you can detect smaller differences more reliably. A difference of 2 in our Nicolas Cage desire-for-drowning study probably wouldn’t give us a p-value less than 5% if our sample size was 6 (this goes back to the way p-values are calculated). If the sample was much larger and the averages were still the same, however, this would be a significant result. So you need big samples to detect small differences.
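Here is a rough sketch of why the same difference of 2 can be insignificant in a tiny sample and significant in a big one. It uses a simple two-sample z-test and assumes a standard deviation of 4 in both groups; that number is invented, and for a sample as small as six a t-test would be the more appropriate tool, so read the small-sample figure as a ballpark only.

```python
from statistics import NormalDist

def two_sample_p_value(mean_diff, sd, n_per_group):
    """Two-sided z-test p-value for a gap between two equal-sized groups."""
    standard_error = sd * (2 / n_per_group) ** 0.5
    z = abs(mean_diff) / standard_error
    return 2 * NormalDist().cdf(-z)

# Same gap of 2 points, same assumed spread, different sample sizes.
for n in (3, 30, 500):
    p_value = two_sample_p_value(mean_diff=2, sd=4, n_per_group=n)
    print(f"n per group = {n:>3}: p = {p_value:.4g}")
```

With three people per group the p-value is well above 0.05; with five hundred per group it is far below. The difference between the groups never changed, only the amount of evidence behind it.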

Remember how with Nicolas Cage we assumed the average desire for drowning in our sample was also the average for the whole population? Right. Well, that’s only going to work if our sample is big enough and was truly randomly selected. With small samples, there’s a much worse danger of ending up with a nonsense result that looks significant but will not be true in other samples. When you see a study that makes conclusions about the general population based on a correlation in a very small sample….be cautious.

It’s a little like if I flipped a coin twice and came up with tails both times. Would that prove it was weighted? No, it wouldn’t. If I flip it a hundred times and I get 90 tails, though, now that’s a lot more interesting.
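You can put numbers on the coin example with the binomial formula. This little sketch computes the chance that a fair coin gives at least a certain number of tails; `prob_at_least` is just a name I picked.

```python
from math import comb

def prob_at_least(k, n, p=0.5):
    """Chance of k or more tails in n flips when each flip lands tails with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

two_of_two = prob_at_least(2, 2)            # tails both times in two flips
ninety_of_hundred = prob_at_least(90, 100)  # 90 or more tails in a hundred flips

print(f"2 tails in 2 flips:     {two_of_two}")
print(f"90+ tails in 100 flips: {ninety_of_hundred:.3g}")
```

A fair coin comes up tails twice in a row a quarter of the time, so that proves nothing. Ninety or more tails out of a hundred has a chance of less than one in a quadrillion, which is why the bigger sample is so much more convincing.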

On a related note, it’s also important that the sample look like the population you want to study. Many (if not most) psychology studies, for example, use college undergrads because these studies are done by folks on university campuses, and, well…finding undergrads on a university campus is easy. But does a sample of undergrads really represent the general population?

4) Assuming that a small p-value means your hypothesis is correct.

The p-value does NOT tell you the chance that a hypothesis is false or true. There’s a lot of confusion on this unfortunately, even among scientists who should know better. This graph below from a Nature article about why p-values are a flawed way of testing results illustrates why.
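One way to see why a small p-value isn’t the same as “probably true” is to run Bayes’ rule on a simplified setup. This sketch is not a reproduction of the Nature graph. It assumes you know how plausible the hypothesis was before the study (`prior`), that the study would detect a real effect 80% of the time (`power`), and that the cutoff is 5%. All three are inputs I chose for illustration.

```python
def prob_effect_is_real(prior, power=0.8, alpha=0.05):
    """Chance a 'significant' result reflects a real effect (Bayes' rule)."""
    true_positives = power * prior          # real effects that get detected
    false_positives = alpha * (1 - prior)   # null effects that pass by luck
    return true_positives / (true_positives + false_positives)

for prior in (0.5, 0.1, 0.01):
    posterior = prob_effect_is_real(prior)
    print(f"prior {prior:>4}: chance the finding is real = {posterior:.0%}")
```

With these assumed numbers, a hypothesis that started out as a coin flip ends up about 94% likely to be real after a significant result, while a 1-in-100 long shot is still only about 14% likely. Same p-value cutoff, very different conclusions.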

5) Small effect sizes.

The “effect size” is the difference between your two groups. With the Nicolas Cage and drowning study, for example, the effect size was 2 or 20% (the Cage watching sample had on average a 20% greater desire to drown themselves).

All too often, people act like a small effect size means a lot more than it does because it’s “statistically significant”. A result can be statistically significant and yet still be meaningless because the effect size is so small. If I invented a drug that increases average life expectancy by thirty minutes if taken every day for the rest of your life…would you take it? If I told you the average IQ of all Californians was 0.05 points higher than the average IQ of all Arizonans…does that really mean Californians are smarter?
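To make the IQ example concrete, here is a small sketch. It assumes IQ scores have a standard deviation of 15 (the usual scaling) in both states, and it uses Cohen’s d, a standard way of expressing a difference in units of standard deviation that the post itself doesn’t mention. It then asks how many people per state a simple z-test would need before a 0.05-point gap crosses the 5% line.

```python
from math import ceil
from statistics import NormalDist

def cohens_d(mean_diff, sd):
    """Effect size: the gap measured in standard deviations."""
    return mean_diff / sd

def n_per_group_for_significance(mean_diff, sd, alpha=0.05):
    """Smallest group size at which this gap reaches p <= alpha (two-sided z-test)."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return ceil(2 * (z_crit * sd / mean_diff) ** 2)

d = cohens_d(mean_diff=0.05, sd=15)
n_needed = n_per_group_for_significance(mean_diff=0.05, sd=15)

print(f"Cohen's d: {d:.4f}")
print(f"people needed per state: {n_needed:,}")
```

The gap is about 0.003 standard deviations, and it takes roughly 690,000 people per state before it counts as “statistically significant”. With a big enough sample almost any difference becomes significant, which says nothing about whether it matters.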

6) Generalizing about a group based on the average for that group.

Unlike the other five ways to misuse statistics on the list, this isn’t something you see in scientific papers very often, but in the popular press…it’s all over the place.

Take this article from WebMD for example about supposed differences between men and women; I selected this randomly from a quick Google search. (You should know BTW before I say anything else that some of the research in this article is hotly disputed, although the article curiously forgets to mention that. In fact this article is a wonderful example of how to disguise unproven stuff as hard science.) The article repeatedly says things like “girls outperform boys in” or “boys generally demonstrate superiority over female peers in”. But what they’re really saying is “in this one study that has not been reproduced we found there was a difference on average between girls and boys. The effect size was so small it was meaningless. Also, we’re going to assume correlation = causation and assume therefore that this difference which may or may not even exist in other studies is genetically based because that plays to our pre-existing bias.”

See how much fun it can be to lie with statistics? But my real points here are these. First, most of the time this article doesn’t give you the info you need to evaluate its claims; it just says “this is the way things are”. (How big was the sample size? The effect size? Has the study been reproduced with another sample? How was the study designed? etc.) Just as important, it makes generalizations about a group based on an average. If I tell you that Californians are on average smarter than Arizonans, does it follow that any Californian you meet is smarter than any Arizonan? Of course not. There are undoubtedly some Arizonans who are smarter than many Californians. The difference in IQ that we’re talking about is an average. But many popular press articles will take average differences between groups and assume they apply to all members of that group. They will say, for example, that “women are more like X” or “Californians are more like Y” when what they mean is “on average in this one study women were more like X or Californians were more like Y”. It’s extremely important to remember everyone is an individual. People are not walking averages.
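Here is a quick sketch of how little an average difference tells you about two individuals. It assumes scores in both groups follow a bell curve with the same standard deviation of 15, which is my simplification, and asks: if you pick one random Californian and one random Arizonan, how often does the Californian score higher?

```python
from statistics import NormalDist

def prob_a_beats_b(mean_diff, sd):
    """Chance a random member of group A outscores a random member of group B."""
    # The gap between two independent normal scores is itself normal,
    # with mean `mean_diff` and standard deviation sd * sqrt(2).
    gap = NormalDist(mu=mean_diff, sigma=sd * 2**0.5)
    return 1 - gap.cdf(0)

for diff in (0.05, 3, 15):
    chance = prob_a_beats_b(diff, sd=15)
    print(f"average gap of {diff:>5} points: A wins {chance:.1%} of the time")
```

For the 0.05-point gap the answer is barely distinguishable from a coin flip, about 50.1%. Even a gap of a full standard deviation leaves the “lower” group winning roughly one matchup in four.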

So…how can we ever know whether a correlation really is causation? How do we figure it out?

First, try to rule out all possible confounding factors. A good study will look for possible confounding factors and try to correct for them or rule them out. If the polio and ice cream people had bothered to ask whether summer was a possible confounding factor (a hidden third variable that causes both A and B so A and B look like they are correlated) they would have solved their problem.

Second, if you’re dealing with a sample, say a randomly selected group of people, try another sample and see if your correlation still holds true. In other words, is this reproducible? This is perhaps the most important thing to do.

Third, try to think of a way the first thing could cause the other thing to happen and devise an experiment to test that. Let’s say you find a correlation between consumption of a specific chemical and cancer rates. If the structure of the chemical is such that any chemist can tell you it will react with DNA, or if you can show it causes mutations in cells in a petri dish (the Ames test), or if you feed it to mice and they unfortunately turn up with cancer, or if you can show that liver cells in a test tube will convert it into a known or likely carcinogen, now you have a really strong case that your correlation really is causation. Now you can show the chemical probably is causing the correlation seen in your study: its chemical/biochemical reactivity shows you it’s doing what the correlation suggests it is. Unfortunately, for many disorders like Alzheimer’s, where no one has any idea what’s going on biochemically to cause the disease, that’s not yet possible.

107 thoughts on “Does organic food cause autism? Could Nicolas Cage movies make you more likely to drown? Six ways to misuse statistics”

  1. Thanks.
    Good stuff, I’ll try to use it in the vain hope of changing people’s minds, even though it’s been widely proven that over 50% of people asked already had their minds made up beforehand. Thanks again.

  2. Good info for the statistically-challenged. However, I believe you have mischaracterized Dr. Sandler’s research. “In 1941, Dr. Benjamin P. Sandler, M.D., published “The production of neuronal injury and necrosis with the virus of poliomyelitis in rabbits during insulin hypoglycemia,” a largely ignored report of his experiments which demonstrated that the poliovirus can only attack neurons suffering from insulin-induced hypoglycemia.”
    Considering how the high sugar, high carb diet has been given a serious thumbs down for a number of medical conditions, I’m not sure Dr. Sandler deserves your scorn.
    Dr. Sandler might deserve a thumbs up if you look at the results of his campaign in his hometown.
    From his report: “My experimental work with rabbits had been published in January, 1941, in the American Journal of Pathology. Polio has been prevalent every year since then and it reached epidemic proportions in 1944 and 1946. In the summer of 1944 I wrote to a public health agency and suggested that the people in epidemic areas be advised to adhere to a sugarless and starchless diet for the duration of the epidemic. However, no action was taken.” And
    “One of the puzzling characteristics of polio has been its prevalence in warm weather. Many people cut down on protective foods such as meats, fish, and poultry because of a mistaken idea that a “light” diet is better for them in warm weather. And they increase the consumption of cooling foods and beverages, most of them heavily sweetened. It is this increase in consumption of sugar that produces a lowering of blood sugar and thereby a lowering of the body’s resistance to the poliovirus.” and when he spoke out in his hometown in 1948:
    “One of the striking effects was the immediate improvement in morale. Parents felt that they were doing something constructive instead of just standing by and hoping the disease would not strike their homes. Store sales of sugar, candy, ice cream, cakes, soft drinks, and the like, dropped sharply and remained at low level for the rest of the summer.”
    So did the level of new cases of Polio in his hometown.
    “Up until August 4, 1948, the city of Asheville had 55 cases of polio. If one assumes arbitrarily that the peak had been reached on that date, one could have expected about 55 cases during the decline until the end of the year, since in general during polio epidemics the number of cases following the peak is about equal to the number of cases preceding the peak. However, instead of 55 cases there were only 21 new cases in Asheville from August 4 to December 31.

    Actually, however, in the southeastern United States, polio epidemic peaks are usually reached during early September. If the epidemic had been allowed to run its course without the diet story, there might have been around 75 cases in Asheville by the first week in September (a conservative estimate), with a similar number following the peak. Thus there could have been a total of 150 cases in Asheville for the entire season. Actually, there were 76 cases for the entire season, or about half the expected number.”
    I don’t know the full truth of this situation, but it is obvious that by concentrating on the ice cream aspect you have tried to make Dr. Sandler look foolish to make a point, something I think is undeserved after just twenty minutes of Internet research. Advocating a healthy diet certainly was a “Do no harm” suggestion, worthy of praise even if not truly protective against polio. Correlation often has roots in causation. The search for those connections is the lifeblood of Science.

  3. Such a coincidence that I ran into this read after just disputing an article on why there are no Amish children with autism (they said it’s because they kill them at birth, since they are a burden with no support to the community). Seriously, autism isn’t usually detected at birth… People are so naive..lol

  4. Thank you for this. As someone that once wanted to drown themself after watching a Nick Cage movie (fresh or salt water would have sufficed), I now know I can be comfortable rationalizing this with data.

  5. Thanks for this, I enjoyed reading it and I think you’re right about statistics being misinterpreted a lot. I object to your implications about Nicolas Cage movies causing people to want to drown, however. You say there’s no obvious correlation but even that suggests that he’s less than a good actor, and I resent that. I resent that.

  6. Great post. I beat my head against a wall every time I teach entry level research classes. I feel like a salmon swimming upstream against a tide of folksy thinking. I agree with allthoughtswork – This is Fox News stuff.

  7. The concern now should be the open data trend, where governments are putting out the raw statistical data that they collected so that citizens or whoever can freely create their own analyses for their own projects.

    I wince at the thought of mixing data sets that were each collected for different purposes. This deserves a blog post by someone like yourself.

  8. Reblogged this on aviewthroughthespyglass and commented:
    Just came across this blog and it reminded me of my last statistics unit on graphical design. So many people get statistics wrong because they aren’t thinking back to the research question or simply find statistics too daunting to properly take the time to understand where the numbers are coming from. No matter what field of work you are in, it is important to have a basic understanding of the difference between correlation and causation. Getting it wrong has serious knock-on effects!

  9. Nicely put. Unfortunately, most people don’t have a deep enough background to understand (much less analyze) statistical data properly (remember the cartoon caption “ten out of every eight adults don’t understand math/ratios/fractions…”). I’d rather see the raw data than the conclusions… especially political poll claims.
    Compounding the problem is the internet. The vast amount of ‘information’ out here almost guarantees you can find a study supporting your point. While in studies this should not be a factor, in the greater world of application (read: reporting to an unsuspecting world) skewed data and conclusions should be expected (sample size 6, 95% confidence level)
    Thanks for the clearly worded explanation.
    Phred
    Posted: Information Overload, Confidence Underload

  10. This is awesome. Do you know of a Bayesian way to estimate “p-value”, or some probability of true significance? I’m thinking of some way of using the p-value and a prior of how likely the hypothesis is in reality…

  11. I guess the same problems apply to big data where n=all. I’ve read some breathless and enthusiastic accounts of how big data is going to change everything, but I’m not convinced that people won’t continue to ask the wrong questions and draw the wrong conclusions.

  12. Really enjoyed reading this, am studying information management and this highlighted so many areas that I’m researching. Thanks for the information provided, it’s much appreciated.

  13. I should mention that it is true that I drown people every time I watch a Nicolas Cage movie. I’ve never made a pie chart of it before, but I can’t seem to think of a time when I haven’t done so. It started with Raising Arizona and peaked around Face Off. So… The data doesn’t lie.

  14. I have never liked statistics, and am currently putting together an informational pamphlet trying to use reason alone to illustrate ideas that might otherwise employ statistics. I will see how successful I am.

  15. Sophistry is not new; it is just going to a new audience. I wish that the misrepresentations were more attributable to ignorance than they are to greed. After all, the ignorant might still give a damn.

  16. As someone educated in science who is shamefully ignorant of statistics (I can thank my university for that), I found this immensely helpful and informative. Love the way that you broke everything down with easier to digest examples.

  17. Thank you for sharing this. I’ve been meaning to write about it, but you just presented it so efficiently. I’m a molecular biologist and that issue is even trickier in our field. People get so caught up with the number of cells pooled to get a result, they often forget that the PROPORTION it represents is so insignificant etc.
    Great job!

  18. Great read. This is a huge problem I see in education research, tons of people basically lying (knowingly and unknowingly) with statistics. I also hate when people cite research without ever sharing the actual research.

    I’m not really sure how to get others to talk about ed research and its data more effectively, but I hope I can find out soon!🙂

  19. I’m an empiricist. For me, statistics needs to show a cause/effect relationship experimentally. I get Nicolas Cage to make 10 movies a year, then stop for 5 years and observe the effect on drownings. Repeat process. This also relates to past findings: the correlation might have been produced by a common factor. In a good economy, more Nicolas Cage movies are made and more people relax in water and drown. Just throwing two numbers together means little; a good narrative is needed as well.

  20. Beautiful. This post should be required reading for every high school student around the world. One of the best courses I ever took was how to lie with statistics (corresponded well with my corporate finance course).
