Wednesday, August 9, 2017

Gender Differences and Statistical Distributions


With respect to the recent Google memo, intelligent discussion of the topic may be impeded by bad math employed by both "sides" of the debate.

First, and probably more prevalent, are those who misunderstand the memo author's claim and assume that he's using the average male and average female as proxies for all males and females. He clearly is not, and his memo shows this in notional graph form.

The purple and green lines represent statistical distributions of a given trait (e.g. math skill) across a population. The purple (women?) and green (men?) distributions have identical shapes in this example, but the green curve is shifted to the right, indicating a higher average value of the notional Trait for green (men). Even with that higher average, the graph shows a large overlap in the curves: for any minimum value of the Trait, there will be some number of men (area under the green curve) and a smaller number of women (area under the purple curve) who have the Trait at that minimum value or greater.
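
To make that overlap concrete, here's a minimal sketch using two hypothetical normal distributions (all numbers are invented; the green curve gets a higher mean and the same spread):

```python
# Hypothetical sketch: two normal distributions with identical shape,
# the green one shifted right, as in the memo's notional graph.
from scipy.stats import norm

green = norm(loc=0.5, scale=1.0)   # "men": higher average (invented values)
purple = norm(loc=0.0, scale=1.0)  # "women": same shape, lower average

for cutoff in [0.0, 1.0, 2.0]:
    # survival function = area under the curve at or above the cutoff
    g, p = green.sf(cutoff), purple.sf(cutoff)
    print(f"cutoff {cutoff}: green {g:.3f}, purple {p:.3f}, ratio {g/p:.2f}")
```

Note that even in this simple case, the green-to-purple ratio above the cutoff grows as the cutoff rises; that detail will matter later.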

As a first approximation, the memo goes on to argue that such statistical distributions may explain the differing rates of representation of different genders in Google's workforce.

For example, from this link, we see the following data, which has been qualitatively consistent for many years.

[Table: average SAT math scores by gender and ethnicity]

If math skill, as measured by an SAT test, correlates well with skills applicable at IT companies, then we might expect those companies to hire more Asian males, Asian females, white males, and white females, in that order, for example. But, of course, the US population is not made up of equal numbers of those four groups, so it's relevant to compare IT hiring against the overall US population. (Note: this ignores the fact that some hiring of foreign workers takes place, but examining that detail isn't necessary to understand the basic point made in the memo.)

Looking specifically at Google, women hold 19% of technical jobs (2016), while approximately 32% of those jobs are held by Asian workers and 57% by white workers. Since non-Hispanic whites make up approximately 62% of the US population, whites are actually slightly underrepresented at Google, still ignoring the math skill results. If we assume, for the sake of argument, that Google's overall gender ratio (0.81 : 0.19) holds across ethnicities, then white men hold about 46% of technical jobs and white women about 11%. White men make up only 31% of the US population, so 46% counts as overrepresentation. The effect is much greater for Asian men, however, who hold approximately 26% of tech jobs at Google while making up about 3% of the US population; likewise, Asian women hold 6% of Google tech jobs, compared to 3% of the population. In other words, Asian men are employed in Google IT jobs at more than 800% of the level their population share alone would predict, Asian women at about 200%, white men at about 150%, and white women at about 35%.
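
These percentages are simple ratios; a short script reproduces the arithmetic, given the shares and the 81:19 assumption above:

```python
# Back-of-envelope reproduction of the figures above. The per-ethnicity
# gender split uses the overall 81:19 ratio, for the sake of argument.
google_share = {"White men": 0.57 * 0.81, "White women": 0.57 * 0.19,
                "Asian men": 0.32 * 0.81, "Asian women": 0.32 * 0.19}
us_share = {"White men": 0.31, "White women": 0.31,
            "Asian men": 0.03, "Asian women": 0.03}

for group, share in google_share.items():
    print(f"{group}: {share:.0%} of Google tech jobs vs "
          f"{us_share[group]:.0%} of US population -> "
          f"{share / us_share[group]:.0%} of expected level")
```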

Qualitatively, these four groups' results are consistent with the ranking of the groups' averages in the SAT math scores table. This is essentially one of the memo's major points - that skewed ethnic or gender representation at Google should be expected.

Limitations


There are multiple problems with the memo's content, as well as with the simplified example I provided above. One problem is that the SAT math scores, and the notional graph of Trait distributions in the memo, only describe gross population features. Published test results may be simple averages (as shown in the table above), or perhaps an average and standard deviation for an assumed normal distribution. But two numbers plus a distribution type (e.g. "normal") are only a rough approximation of the entire distribution. In reality, distributions of traits can have shapes unlike the ones shown in the notional memo graph above. For some analyses this may be irrelevant, and a normal distribution may describe the results adequately.

There's reason to suspect, however, that Google's IT workforce may be an exception here. IT staff in general are not drawn from an entire population; in terms of a trait like math skill, they will generally come only from the upper echelons of ability, likely all above the average. Google's staff, specifically, are likely to come exclusively from the very highest performers on our measured traits. Consider these new notional distributions.

[Graph: notional overlapping male (blue) and female (pink) math score distributions]

In the above graph, I've represented male (blue) and female (pink) math scores with overlapping distributions; the male curve has the higher average score. For most minimum values of "mathiness" one could pick, there is a greater area under the blue curve than under the pink curve, suggesting that we should expect more men than women to meet that minimum criterion. However, recall that Google isn't sourcing employees from near the average. The "upper tail" of a distribution (aka the "right tail") is the part of the curve at the highest values on the X (Mathiness) axis. I have shown a possible zoomed-in view of that portion of the curves in a second plot:

[Graph: zoomed view of the upper tails, where the curves cross]

This is completely hypothetical. What it attempts to illustrate, however, is that since Google is an elite employer, it is interested in a region of the distributions that may not be well represented by the overall male-female distributions shown first. It's entirely possible, for example, that the male distribution is not strictly normal; it may not be symmetric to the left and right of its average (and I have drawn it to be asymmetric). If that's the case, then even though most of the male mathiness curve sits at higher values than the female curve, the curves may cross near their upper tails. If so, for very high minimum values of mathiness (i.e. Google's elite standards), we might expect more women than men to meet the criterion. This is what my second (zoomed) graph depicts.
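
Here's one hypothetical way to realize that picture in numbers: a skew-normal "male" distribution with a higher mean but a compressed right tail, against a normal "female" distribution. Every parameter below is invented purely for illustration:

```python
# Hypothetical illustration of curves crossing in the upper tail:
# a skew-normal "male" distribution with a HIGHER mean but a thinner
# right tail vs. a normal "female" distribution. Parameters are invented.
from scipy.stats import norm, skewnorm

female = norm(loc=100, scale=15)
male = skewnorm(a=-5, loc=120, scale=20)  # negative skew: short right tail

print(f"means: male {male.mean():.1f}, female {female.mean():.1f}")
for cutoff in [110, 130, 145]:
    m, f = male.sf(cutoff), female.sf(cutoff)
    print(f"cutoff {cutoff}: male {m:.2e}, female {f:.2e} -> "
          f"{'more men' if m > f else 'more women'}")
```

With these invented parameters, men outnumber women above moderate cutoffs, but the curves cross and women outnumber men at the most elite cutoffs - exactly the behavior sketched in the zoomed graph.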

Is this the case? That's a harder question to answer than the one about the entire male and female populations. Many studies have quantified overall gender differences in traits, but most aren't oriented toward the top 1%, or top 0.1%, of the population, so that highest echelon may not be well characterized by existing research. If the second graph is representative, then the Google memo's conclusions are completely wrong. For that reason, caution should be used in applying general population results to such an elite company as Google.

Other Data


Do I believe the Google memo's main premise is wrong? No, I don't. But I'm also not confident. In the absence of better data, I'm willing to use the general population math data as a starting point. But disclaimers about its applicability should be made, and the Google memo did not make them.

However, one reason I think the upper tail data would likely still show an expected male advantage in math is that we do have more sophisticated statistics available for IQ. (Whether math scores or IQ are better predictors of ability at a software company is another question.) IQ results suggest that while male and female average IQs are similar, the male distribution is wider, such that at very high levels of IQ, males are disproportionately well-represented. If math or other software skills mimic IQ, then the memo's premise may hold.
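
A minimal sketch of that variance effect, with invented numbers (equal average IQs, a slightly wider male distribution):

```python
# Invented numbers: identical average IQ, slightly wider male spread.
# Equal means still produce growing male overrepresentation in the tail.
from scipy.stats import norm

male, female = norm(loc=100, scale=15), norm(loc=100, scale=14)
for iq in [130, 145, 160]:
    ratio = male.sf(iq) / female.sf(iq)
    print(f"IQ >= {iq}: male-to-female ratio {ratio:.1f}")
```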

As one more simple data point, I tallied the entire 2017 graduating class of Caltech. Caltech has the highest SAT scores among US colleges and is almost exclusively STEM majors, so it may help better characterize the "upper tails". I used the commencement program to count students with relevant degrees: I counted any major with "comput" in its name, and also minors, so long as the student's major didn't indicate another likely career choice. For example, a "physics major, computer science minor" was counted, while a "biology major, computational neural systems minor" was not.

Among the students with majors likely desired by a software company, women comprised approximately 27% of the graduates, compared to about 40% of the graduating class overall, suggesting women may be choosing, or excelling at, other STEM majors more than computer fields. At 27% female, however, the Caltech class is above the current US average of 18-20% female among "computer science" majors. This could mean that although more men would still be expected near the upper tails, the distributions narrow there, shrinking the gender gap seen in the broader population (where women make up less than 20% of computer science majors).

One problem with Caltech is its extremely small size, so to be useful, this same tally should be performed for several years running. Nevertheless, without considering upper tails, Google's current figure of 19% female IT staff looks almost entirely expected. If the Caltech result is more representative of the upper tail, though, then perhaps we should expect closer to 27% female IT workers at an elite company like Google.

There are many other limitations to the results in the Google memo, and to my presentation here; I've only attempted to identify a couple strictly related to statistics. Neither is a proper scientific presentation, but merely an attempt to further a delicate conversation with the addition of some data.

Tuesday, May 9, 2017

The Comey Letter

Below is a tweetstorm, blogified. Forgive the numbered formatting, please.

1. Nate Silver's been doubling down on this take recently (+ Comey just got the axe), so I think it's time to recap why I think he's so wrong

https://twitter.com/NateSilver538/status/862065731432849408

2. IMO, error attributing HRC's fall to 10/28 Comey letter comes down to:
a) math error
b) Nate covering for a previous statement about polls

3. Two days prior, Silver boldly claimed that those saying the race was tightening were cherrypicking:

https://twitter.com/NateSilver538/status/791403889451040768 

4. In reality, the race was tightening, which should have been clear on 10/26, & was very clear 2-3 days later (before any post-Comey polls)

5. Nate not seeing this, IMO, gave him a reason to attribute the shift to a later event, despite polling not really supporting his claim.

6. Now, the math. Silver takes raw polling from pollsters, then applies various adjustments meant to improve on it, e.g. his “polls+”

7. One problem is that in this case, there doesn’t appear to be a strong reason to favor this approach. The final polls-only & polls+ projections:

[Image: 538's final polls-only and polls+ projections]

8. Hindsight is 20/20 of course, but we now know who won. It’s hard to make the case that polls+, or other derivative measures, were better.

9. Silver actually acknowledged this during the 2016 election season, but is now relying on derivative data to make his Comey case.

10. To derive “win probabilities”, 538 uses things like economic data, state polling, & also some “smoothing” algorithms for polling data.

11. The smoothing algorithm here, I believe, may be part of the problem. Smoothing is like curve-fitting: extracting one value from many noisy data points.

12. A simple example is using a 5-day average of polling data, versus a 10-day average. There are advantages to both.

13. But, if you want to attribute influence to events occurring on single days, longer running averages make it harder to see that.
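
A toy illustration of that tradeoff, with synthetic numbers (not actual polling data), using pandas rolling means:

```python
# Toy illustration (synthetic data): a one-day polling event is visible
# in a 5-day average but heavily diluted in a 10-day one.
import pandas as pd

polls = pd.Series([3.0] * 20)  # a flat HRC +3 lead for 20 days...
polls[10] = 10.0               # ...with a single +10 outlier on day 10

print(f"peak of 5-day avg:  {polls.rolling(5).mean().max():.1f}")   # 4.4
print(f"peak of 10-day avg: {polls.rolling(10).mean().max():.1f}")  # 3.7
```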

14. To analyze this event, I used raw data from realclearpolitics.com, with different chart customizations.
http://www.realclearpolitics.com/epolls/2016/president/us/general_election_trump_vs_clinton_vs_johnson_vs_stein-5952.html

15. Here is the polling for a 4-way race on 10/29, the first day w/ *any* post-Comey polling. HRC +2.6%

16. I looked at each poll in that day’s average, & calculated that at most 2% of respondents could have read about the Comey Letter.

17. Using a very generous assumption, I believe the absolute largest lead HRC could have had *before* the letter is therefore 2.7%.

18. The polls closed on election day at HRC +3.3% (top right on previous graph). No indication her lead decreased after the letter.
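
Roughly, the bound works like this (the 5-point swing among exposed respondents is my generous assumption, chosen to show the shape of the calculation):

```python
# Rough reconstruction of the bound. The 5-point swing among exposed
# respondents is an assumed, deliberately generous number.
observed_lead = 2.6   # HRC's lead in the 10/29 average
exposed = 0.02        # at most 2% of respondents polled post-letter
assumed_swing = 5.0   # generous anti-HRC swing among those exposed

max_effect = exposed * assumed_swing  # 0.1 points on the topline
print(f"pre-letter lead at most {observed_lead + max_effect:.1f}%")  # 2.7%
```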

19. Of course, before 11/8, there was also another FBI statement that nothing much was found. But treating the 2 events as one combined effect is the most applicable approach.

20. Where Silver & others go wrong is that the polling averages would have dropped around 10/29 even w/o the Comey Letter. Why? The 10/24 polls.

21. RealClearPolitics shows you all the polls in their average, and we see that on 10/24, HRC had several outstanding results.

22. But, after 5 days, those +14 and +9 polls drop out of the 5-day running average. A drop was imminent for that reason. Compare:

23. HRC at +3.8% on 10/28 becomes HRC +2.6% on 10/29. But, only 2% of respondents in 10/29 data could have heard about the letter.

24. This steep drop could not have been about the letter, and had almost everything to do w/ strong HRC polls leaving the average.
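
A toy version of that window effect (illustrative numbers only, not the actual 10/24-10/29 polls):

```python
# Toy version of the window effect (illustrative numbers): strong polls
# aging out of a 5-poll average cause a drop even when every NEW poll
# is positive for HRC.
through_1028 = [14, 9, 1, -2, -3]  # includes the 10/24 outliers
through_1029 = [1, -2, -3, 2, 3]   # outliers aged out; new polls +2, +3

avg = lambda xs: sum(xs) / len(xs)
print(f"avg through 10/28: {avg(through_1028):+.1f}")  # +3.8
print(f"avg through 10/29: {avg(through_1029):+.1f}")  # +0.2
```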

25. Looking at another metric, we have individual favorability polling data for both HRC & Trump. HRC before and after 10/29:

[Charts: HRC favorability before and after 10/29]

26. Again, we see in the favorability data a set of very good polls (for HRC) on 10/24.

27. Now, for Trump pre-Comey & post-Comey. The clear swing Silver sees is just in the noise:

[Chart: Trump favorability pre-Comey and post-Comey]

28. Back to “smoothing”. Look at Trump’s favorability data before & after the *election*. A clear post-election bump, as per usual:

29. In statistics, sometimes it’s necessary to choose valid endpoints. In this case, the election (result) is a clear break in the data.

30. If you plot Trump’s number before & after the election, smoothed w/ one function, you see a gradual change:

31. The previous dataset, w/ heavy smoothing, shows Trump at -21.3% favorability on 10/29.

32. But, take the same dataset & end it on election day (left). Then, look at Trump’s net fav on 10/29 (right):

[Charts: the same dataset ending on election day (left), and Trump's 10/29 net favorability (right)]

33. In the previous chart, you see Trump’s net favorability on 10/29 is -23.3%. A full 2pt difference just based on dataset endpoints!

34. So, by using heavy smoothing and data after the election, you can be tricked into seeing effects of the election way back on 10/29!

35. If you isolate the post-election data, use less smoothing, & work w/ the raw polling data, you see the “Comey effect” was mostly noise.
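
A synthetic sketch of that endpoint effect (all values invented): a favorability series that's flat before the election and steps up after it, run through the same smoother with and without the post-election data:

```python
# Synthetic sketch of the endpoint effect: favorability is flat before
# the election, then steps up. A wide centered smoother bleeds the
# post-election bump backwards into pre-election dates.
import numpy as np
import pandas as pd

days = np.arange(40)  # day 20 = election day (all values invented)
series = pd.Series(np.where(days < 20, -23.3, -17.0))

def smooth(s):
    # wide centered moving average (shorter windows near the edges)
    return s.rolling(31, center=True, min_periods=1).mean()

full = smooth(series)           # smoothed across the election
cut = smooth(series.iloc[:20])  # same smoother, dataset ends at election

print(f"10 days pre-election, smoothed w/ post-election data: {full[10]:.1f}")
print(f"10 days pre-election, dataset ending on election day: {cut[10]:.1f}")
```

The two numbers differ by well over a point, purely from the choice of dataset endpoint.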

36. I feel a little uncomfortable criticizing Silver here, b/c he’s clearly advanced the state of the art in his field. But this was bad.

37. To be fair, I also had a prediction of my own to “defend”. I’ll let the reader decide who has more at stake, tho.

38. As a follow-on, while I haven't seen a precise description of the 538 math (may not be public), I know they attempt to use econ data ...

39. What you can see is a dip in stock prices immediately after the Comey letter. I suspect this may explain some of Nate's conclusion here.

40. However, just like the polling, we see that stock prices not only recovered from their post-Comey lows, but finished (11/8) above their 10/28 level.

41. But the notion that stock prices even translate into votes, & should be tracked *in addition to* polls, is highly speculative. Same w/ the jobs rpt.

42. In short, I think Nate is too enamored w/ his unique methodology, & puts too much faith in its power, especially its news responsiveness.