
Figures Don't Lie, But Liars Do Figure

Posted May 22, 2022 by Ray Patrick

“Correlation Does Not Imply Causation”

If you’ve so much as walked past a Stats 101 course while the door was open, you’ve heard this phrase before. It’s a favorite among statisticians because human nature tempts us to fall into the post hoc logical trap quite easily. It’s also a perennial favorite on R*ddit for a different reason: this little incantation allows you to protect your pet opinions from common-sense people who notice patterns in real life. (As a bonus, you get a little midwit street cred for employing some sciencey words.) For example:

R*DDITOR A. “Religion makes you dumb lol”

R*DDITOR B. “I don’t know about that. A lot of my college professors were religious people, and they were definitely not dumb.”


Firing back at R*dditor B with “correlation does not imply causation” is a bad use of the phrase. In fact, if correlation doesn’t imply causation, R*dditor A has no basis for painting all religious people with the same brush based on the behavior of the few he may know.

The Perils of Simple Regression

Simple linear regression is possibly the most basic statistical model: a linear regression with a single explanatory variable. It is fitted to a two-dimensional collection of sample points, each pairing an independent variable with a dependent variable (conventionally plotted on the X and Y axes). The model attempts to fit the data to a straight line, defined by a slope and an intercept.

Okun’s Law: Dependent variable (GDP growth) is assumed linear with the independent variable (unemployment rate).

Obviously, real collected data will have some variation and will not fall exactly on the predicted straight line. The coefficient of determination, R² (“R-squared”), is the proportion of the variation in the dependent variable that is predictable from the independent variable. R² lies on the interval [0, 1]; the theoretically perfect fit would have an R² of 1, whereas totally random noise would have an R² close to 0. High R² values indicate a good fit. A good fit suggests that the underlying simple regression model may be reflecting the actual truth.
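To make the slope, intercept, and R² concrete, here is a minimal sketch of a simple regression fit in plain Python. The function name `fit_simple_regression` and both data sets are made up for illustration (a noisy straight line and pure noise, generated from a fixed seed); none of it comes from a real study.

```python
import random
from statistics import fmean

def fit_simple_regression(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x.

    Returns (slope, intercept, r_squared).
    """
    x_mean, y_mean = fmean(xs), fmean(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    # R^2 = 1 - (variation left over after the fit) / (total variation in y)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_mean) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot

# Synthetic data, invented for this example only.
rng = random.Random(0)
xs = [i * 0.2 for i in range(50)]
ys_linear = [3.0 - 0.5 * x + rng.gauss(0.0, 0.3) for x in xs]  # line plus small noise
ys_noise = [rng.gauss(0.0, 1.0) for _ in xs]                   # unrelated to x

for label, ys in (("noisy line", ys_linear), ("pure noise", ys_noise)):
    slope, intercept, r2 = fit_simple_regression(xs, ys)
    print(f"{label}: slope={slope:.2f}, intercept={intercept:.2f}, R^2={r2:.2f}")
```

The noisy line should come back with an R² close to 1 and the pure noise with an R² close to 0. In a finite sample the noise case will not be exactly 0, which is one more reason not to read too much into a small R².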

Of course, this kind of model can get you in trouble if you happen to choose an independent variable that’s really just a proxy for the actual cause. For instance, “Study: College-educated people more likely to stay married” is misleading: actually, college education is a proxy for mid- to high-IQ and a low time preference (the ability to delay gratification); these factors probably make people more successful in relationships.
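The proxy problem is easy to reproduce with made-up numbers. In the sketch below, everything is hypothetical: `cause` is a hidden trait, `proxy` merely tracks it, and `outcome` depends only on `cause`. Regressing the outcome on the proxy alone gives a healthy slope anyway. Once the cause is partialled out of both variables (regress each on the cause, then regress residuals on residuals), the proxy's slope collapses toward zero. That residual-on-residual slope is the coefficient the proxy would get in a regression that includes both variables.

```python
import random
from statistics import fmean

def slope_and_residuals(xs, ys):
    """OLS fit of y on x; returns (slope, residuals of y after the fit)."""
    x_mean, y_mean = fmean(xs), fmean(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sxx
    intercept = y_mean - slope * x_mean
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    return slope, residuals

# Synthetic data, invented for this example only.
rng = random.Random(1)
n = 2000
cause = [rng.gauss(0.0, 1.0) for _ in range(n)]            # hidden true cause
proxy = [c + rng.gauss(0.0, 0.5) for c in cause]           # tracks the cause, has no effect of its own
outcome = [2.0 * c + rng.gauss(0.0, 1.0) for c in cause]   # depends only on the cause

# Naive model: outcome regressed on the proxy alone.
naive_slope, _ = slope_and_residuals(proxy, outcome)

# Adjusted model: remove the cause from both variables, then regress what is left.
_, proxy_resid = slope_and_residuals(cause, proxy)
_, outcome_resid = slope_and_residuals(cause, outcome)
adjusted_slope, _ = slope_and_residuals(proxy_resid, outcome_resid)

print(f"slope on proxy alone:            {naive_slope:.2f}")
print(f"slope on proxy, cause held fixed: {adjusted_slope:.2f}")
```

The catch in real life is that the hidden cause is often not measured at all, so the adjustment is not always available; the naive slope is all a headline writer ever sees.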

Clickbait Headlines

Follow the Science!

Remember what I said earlier about human nature causing us to be easy prey for the post hoc fallacy? It turns out that most people are pretty bad at interpreting statistics - even in deceptively “simple” studies. This leads to the phenomenon of clickbait headlines that sensationalize supposed “findings” that are actually poor interpretations of the data. (Related to this is Betteridge’s Law of Headlines: “Any headline that ends in a question mark can be answered by the word no.”)

Does Ice Cream Cause Polio? According to this Study - It Might

Can’t you just see that as a HuffPo headline?

It’s likely that nobody reading this is old enough to remember when polio was an extremely terrifying thing. Poliomyelitis is a major viral illness that can enter the central nervous system. As the virus multiplies within nerve tissue, muscles are denervated, causing asymmetrical weakness, sensitivity to touch, difficulty swallowing, loss of reflexes, and even paralysis. Although paralysis sometimes reverses as the disease abates, the effects can leave behind skeletal deformities such as joint tightening, club foot, or scoliosis. You probably got the inactivated poliovirus vaccine (IPV) as a small child, as I did. These days, thanks to vaccination, polio has been virtually eradicated in the developed world. We barely even think of it today. However, in the mid-20th century, when the disease was endemic and no vaccine existed, people were understandably quite afraid of it. They would have been vulnerable to fear induced by dubious statistics.

In the late 1940s, a nationwide study conducted over several years found a high correlation between the incidence rate of new cases of polio among children in a community and per capita ice cream consumption in that community. (Equivalently, a simple regression model, using ice cream consumption to predict the rate of occurrence of new polio cases, had a high coefficient of determination.) Had the results of this study been quoted the way clickbait headlines typically quote them, they would have touched off a mass panic. (“What’ll we do?! It’s in the ice cream!”)

Fortunately for those of us who like ice cream, a re-examination of the data showed that the high values of both variables occurred in communities where the study collected data in the summertime, and the low values of both occurred in communities where the data was collected during the winter. Polio – which we now know to be a communicable viral infection – spreads more easily when children gather in heterogeneous groups in relatively unsanitary conditions, i.e., it spreads more easily during summer vacation than when the children are in school. The high correlation in no way provided evidence that ice cream consumption causes or promotes polio epidemics.
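The re-examination described above amounts to splitting the data by season and checking the correlation inside each group. Here is a minimal sketch of that idea. It does not use the study's data (which I don't have); the numbers are entirely invented, and they assume that season drives both ice cream consumption and polio incidence while ice cream has no effect on polio at all.

```python
import random
from statistics import fmean

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    x_mean, y_mean = fmean(xs), fmean(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Synthetic communities, invented for this example only.
# Each row is (surveyed_in_summer, ice_cream_per_capita, polio_incidence), arbitrary units.
rng = random.Random(42)
communities = []
for _ in range(400):
    summer = rng.random() < 0.5
    ice_cream = (11.0 if summer else 5.0) + rng.gauss(0.0, 1.0)  # season drives ice cream
    polio = (12.0 if summer else 6.0) + rng.gauss(0.0, 1.0)      # season drives polio; ice cream does not
    communities.append((summer, ice_cream, polio))

def correlation(rows):
    return pearson_r([row[1] for row in rows], [row[2] for row in rows])

r_all = correlation(communities)
r_summer = correlation([row for row in communities if row[0]])
r_winter = correlation([row for row in communities if not row[0]])

print(f"all communities: r = {r_all:.2f}")
print(f"summer only:     r = {r_summer:.2f}")
print(f"winter only:     r = {r_winter:.2f}")
```

Pooled together, the two variables look tightly linked. Within a single season the link should all but vanish, which is what you would expect if season, not ice cream, is doing the work.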


Topics: statistics