Questioning Random Statistics
I see and hear it almost every day: someone misinterprets data or makes a false assumption, often using terrible data, spreads the stats, and suddenly the masses are shouting about this incredible new and insightful statistic. What a load of crap this usually is. In my mind I go through all the possible holes: did they question a random enough population? Was there any duplication of data? Is there just correlation and not causation? Did they define their variables in a way that changes the meaning of the result? So I’ve embarked on what I think is a noble quest, one that will probably be ignored and ultimately make people think I’m a curmudgeon.
Exhibit A
Let’s start with this lovely quiz, which I found floating around LinkedIn one day. No supporting data whatsoever, yet I see hundreds of people commenting, giddy with excitement at how silly all of us are for not getting the answer right. I, for one, don’t think 96% of the population can’t count the squares, but there is only one way to prove this little stat false: analyze the data. Now, before I get started, if you’re thinking “But the average LinkedIn user is smarter than the rest of the population. You can’t compare their answers and assume a random sample”…then great for you. I’m about to analyze the only data I have, though. Why? Because I want to teach people how to question crappy statistics…if you’re already asking the right questions, then read on for pure enjoyment while I crack this one wide open.
It turns out that by altering the URL of the post I can change how many comments are shown on the page. I’ll show 1,000 comments at a time, download the HTML to a local file, and repeat until I’ve downloaded all the comments on the post. At the time of this writing, I was able to save 4,628 comments to HTML. To extract the comments: Python to the rescue, or more specifically, Beautiful Soup to the rescue. If you haven’t used Beautiful Soup before, it’s a library that makes it easy to parse HTML (think XPath-style selectors and the like). Inspecting the page, I can see that every comment is marked with class="commentary", so with an extremely simple Python script leveraging Beautiful Soup I can extract just the HTML elements with that class and output the results (the user comments) to a .csv file.
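A minimal sketch of that extraction script looks something like the following. The file paths and function names here are placeholders, not the exact script I ran; the essential part is `find_all(class_="commentary")`, which grabs every element carrying that class.

```python
import csv
from bs4 import BeautifulSoup

def extract_comments(html):
    # Parse the saved page and pull the text of every element
    # marked with class="commentary" (one per user comment).
    soup = BeautifulSoup(html, "html.parser")
    return [el.get_text(strip=True) for el in soup.find_all(class_="commentary")]

def comments_to_csv(html_path, csv_path):
    # Read a downloaded HTML file and write one comment per CSV row.
    with open(html_path, encoding="utf-8") as f:
        comments = extract_comments(f.read())
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for comment in comments:
            writer.writerow([comment])
```

Run `comments_to_csv` once per saved HTML file and concatenate the resulting CSVs, and you have every comment in one place.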
Then it’s time to get quick and dirty and throw the results into Excel. What I’m really interested in seeing is how many times people guess each of the answers from 6 squares through 26 squares. A simple formula counts how many times each answer shows up; since I don’t want to count the 7 in an answer of 17, and since I don’t want to count, say, a 9 when someone is explaining how they got their answer of 15, I only use comments that lead with a number and count only that leading number. And the results are in! So how many squares are there? I should be able to tell from the data. If 96% of people answer incorrectly (according to the original image), then the correct answer is the one that is guessed only 4% of the time. And there’s the issue with crappy stats like this: that would make the answer most likely 17, since 2.7% of people (the share closest to 4%) think there are 17 squares. But I count 16 (yes, I’m right).
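The counting rule above (only comments that start with a number count, and only that leading number is tallied) is easy to sketch in Python instead of Excel. The sample comments below are made up purely to show the rule in action:

```python
import re
from collections import Counter

# A comment only counts if it *starts* with a number, and only that leading
# number is tallied; so "17" never inflates the count for 7, and a 9 buried
# inside an explanation of "15" is ignored.
LEADING_NUMBER = re.compile(r"^\s*(\d+)")

def tally_answers(comments, low=6, high=26):
    counts = Counter()
    for comment in comments:
        m = LEADING_NUMBER.match(comment)
        if m:
            n = int(m.group(1))
            if low <= n <= high:  # only plausible square counts
                counts[n] += 1
    return counts

# Hypothetical comments; "I think 24" is skipped because it
# doesn't lead with a number, and the 9 inside the third one is ignored.
sample = ["16, obviously", "I think 24", "15 because 9 + 6", "17"]
counts = tally_answers(sample)
```

From `counts`, the share of each answer is just its count divided by the total number of countable comments.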
Furthermore – yes, there’s more – if we understand that the answer is actually 16, or 15 if you’re skeptical, then the statistic stated with the question is waaaaaaay off! Only about 70% of people fail this quiz, not the suggested 96%. If you believe that the LinkedIn population is truly that much smarter than the rest of the population then ok…maybe the number of people that fail is slightly closer to 96%; but I believe that we have used a random enough sample.
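The arithmetic behind that failure rate is simple: everyone whose answer isn’t accepted as correct failed the quiz. The answer shares below are hypothetical, just to illustrate the calculation; the real figures come from the tally of actual comments.

```python
# Hypothetical answer shares (fractions of all countable comments),
# NOT the real distribution from the LinkedIn comments.
shares = {15: 0.05, 16: 0.30, 17: 0.027, 24: 0.40, 25: 0.10, 20: 0.123}

def failure_rate(shares, correct):
    # Everyone whose answer is not in the accepted set failed the quiz.
    return 1.0 - sum(shares.get(answer, 0.0) for answer in correct)

strict = failure_rate(shares, {16})       # 16 is the only right answer
lenient = failure_rate(shares, {15, 16})  # give credit for 15 as well
```

If roughly 30% of commenters guess the right answer, the failure rate is about 70%, nowhere near the claimed 96%.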
As you can see, the point is that we should all question and verify statistics more often!