Posted by: Chris | February 6, 2012

A Poverty of Statistical Thinking

As part of Google’s new company-wide privacy policy, the search engine is disclosing some of the information it stores on users to tailor ads.  Specifically, it lets you see which categories Google slots your browsing and emailing histories into (for me, tv shows, software, video games, and colleges came up most commonly), as well as the search algorithm’s best guess for your age and gender.  You can check out your specific results here.

However, this last result, displaying one’s expected gender, has kicked up some bizarre, though sadly not unexpected, controversy.  Because Google’s algorithm aligns with certain gender stereotypes (men like gadgets, women like cooking), some people have decided that it is inherently sexist.

For instance, a distaff comic fan, who bemoans Google’s inability to “understand geek women,” complains:

Google, we’re not going to get mad at the seemingly sexist algorithms in use, we’re just going to tell you this is something you need to fix. Also, Susana does not want you to try and sell her men’s pants because she searches for electronics online. Below are my results. Even with something stereotypically female in my list, “shopping – apparel – footwear,” I am still thought to be a man because of my love for entertainment? I’m just not sure how this makes sense for us, or you, Google.

Likewise, a tech writer of the fairer sex harrumphs:

[The algorithm] gives a pretty sad indication of how technology is still clearly defined as a “male” activity.

This wouldn’t be so problematic if it weren’t an entire industry that employs both men and women and is a giant source of innovation and wealth creation. I wonder what my female friends in finance, medical or the legal professions see? It would be disheartening if Google classified those surfing categories as male.

So ladies, get out there and spout off about tech. Let’s show those demographic wunderkinds at Google that there are a few of us (in the 25-33 age range, thank you!) that think packets, semiconductors and programming are an equal-opportunity category.

Of course, these results are not “something that [Google] needs to fix,” nor would a few more women posting on tech matters alter things more than slightly.  In determining gender, Google looked for trends in male and female browsing patterns and inferred specific results from those general patterns.  Since technology and comic books have a huge demographic skew towards men, it is entirely justifiable that someone whose internet history is dominated by these topics would be presumed male.  Individual variations from the expected gender are, at this level, unavoidable but uninteresting, as they do not in any way confound the general presumption in favor of one gender or another.  Indeed, if Google correctly identified these women as such in spite of their internet usage patterns, one would have more grounds to complain about the statistical generalization.

Now, one argument that could be made is that the gender skew on these topics is not strong enough for Google to hold a definitive presumption one way or the other, but, barring access to Google’s raw data, I think we can tentatively settle on the side of Google understanding how to do math.  Either way, the existence of a few individual miscategorizations would not be sufficient evidence against Google’s classification schema, no matter how strong the covariance between gender and nerdy topics.
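To make the logic concrete, here is a minimal sketch of how a guess like this could be formed with Bayes’ rule.  Every number in it is invented for illustration, the `posterior_male` function and the category skews are my own assumptions, and nothing here reflects how Google’s actual system works.  It treats each browsing category as independent evidence (the “naive Bayes” simplification) and asks how likely a user is to be male given what they browse.

```python
# Illustrative only: all probabilities below are hypothetical,
# and this is not a description of Google's actual method.

def posterior_male(prior_male, likelihoods):
    """Naive Bayes estimate of P(male | observed categories).

    likelihoods is a list of (P(category | male), P(category | female))
    pairs, one per category seen in the user's history.
    """
    p_m = prior_male
    p_f = 1.0 - prior_male
    for p_cat_given_m, p_cat_given_f in likelihoods:
        p_m *= p_cat_given_m
        p_f *= p_cat_given_f
    # Normalize so the two posteriors sum to one.
    return p_m / (p_m + p_f)

# Assumed skews for three categories in one user's history.
observed = [
    (0.30, 0.10),  # electronics: assumed to skew male
    (0.20, 0.05),  # comics: assumed to skew male
    (0.10, 0.25),  # footwear shopping: assumed to skew female
]

p = posterior_male(0.5, observed)
guess = "male" if p > 0.5 else "female"
print(f"P(male | history) = {p:.3f}, best guess: {guess}")
```

With these made-up inputs the posterior comes out to about 0.83, so “male” is the best available guess even with the footwear category pulling the other way.  It also means roughly one user in six with this exact history would be a woman.  Her miscategorization is what the model predicts will sometimes happen, not evidence that the model is broken.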

However, that the first response people reach for, when confronted with fairly noncontroversial statistical data, is to cry sexism and bias provides evidence (albeit only anecdotal) for the paucity of statistical thinking in general, even amongst otherwise intelligent people.  The rapidity with which people move from generalities (on average, dogs are bigger than cats) to absolutes (all dogs are bigger than cats), or attempt to refute the former with specific examples (my tabby is bigger than your Chihuahua), always surprises me.  Razib Khan had a post a while back on this subject which bears repeating:

1) There are some basic tools which many intelligent people are just plain ignorant of.  On this weblog one reader, a medical doctor, was surprised by Bayes’ rule, but immediately understood its relevance.  This happens to all of us, there are amazing tools out there which we haven’t encountered for a variety of reasons.  And most of us are happy to pick up the tools if we see clear utility.
2) But, there’s another problem, and that is the fact that statistical and probabilistic thinking is a real damper on “intellectual” conversation.  By this, I mean that there are many individuals who wish to make inferences about the world based on data which they observe, or offer up general typologies to frame a subsequent analysis.  These individuals tend to be intelligent and have college degrees.  Their discussion ranges over topics such as politics, culture and philosophy.  But, introduction of questions about the moments about the distribution, or skepticism as to the representativeness of their sample, and so on, tends to have a chilling affect on the regular flow of discussion.  While the average human being engages mostly in gossip and interpersonal conversation of some sort, the self-consciously intellectual interject a bit of data and abstraction (usually in the form of jargon or pithy quotations) into the mix. But the raison d’etre of the intellectual discussion is basically signaling and cuing; in other words, social display.  No one really cares about the details and attempting to generate a rigorous model is really beside the point.  Trying to push the N much beyond 2 or 3 (what you would see in a college essay format) will only elicit eye-rolling and irritation.

I suspect this results in part because the intuition involved in a lot of statistics requires one to be steeped deeply in a culture of probabilistic thinking and, barring an education in the natural sciences with a strong research focus, this is something that is hard to encounter naturally.  For instance, in my current quant-heavy program, there seem to be a number of people who are floundering with some of the logic of statistics, even those who come from a strong math/economics background, because they never imbibed, down to a gut level, this specific reference frame.

Of course, the foofaraw over this subject is nothing but a boon for Google.  The ire over miscategorization will likely impel many to provide the search company with better and more individual information than it would have received had it simply sent out an innocent request.


