Fire! Or Maybe Not. A case of knotty data and reverse discrimination, by Mark Chussil

In this essay we will analyze a difficult problem that’s in the news. It’s about how our beliefs and assumptions guide our analysis of a knotty problem. It’s relevant to anyone who works with data, even if — or perhaps especially if — answers seem obvious.

My home town, New Haven, Connecticut, has been in the news. Not because it’s my home town. Rather, because of a reverse discrimination lawsuit (Ricci v. DeStefano) brought by firefighters against the city. The case is in the news also because Judge Sonia Sotomayor, who’s been nominated for the U.S. Supreme Court, joined her colleagues on the Second Circuit Court of Appeals in a unanimous decision backing the city.

Ricci v. DeStefano is now before the Supreme Court. (Update: for their decision, please see the end of this essay.) I thought it might be nice to bring some cold, rigorous thinking to the hot, emotional case.

The case concerns tests used to promote firefighters to lieutenant or captain. Although the legal arguments focus on which parties have which rights and obligations, the core question (which we might hope is relevant) is whether the tests actually were discriminatory.

Here are the numbers:

  • 41 people passed the captain’s exam: 25 white, 8 black, and 8 Hispanic. The city would have to promote the 9 people with the top scores. 7 were white, 2 were Hispanic, none were black.
  • 77 people passed the lieutenant’s exam: 43 white, 19 black, and 15 Hispanic. The city would have to promote the 10 people with the top scores. All were white.

The city contends that the apparently too-high percentage of whites being promoted by their test scores is evidence of racial discrimination, and so they threw out the test results. The firefighters say that throwing out the test results reverse-discriminates against those who scored well on a fair test.

I decided to calculate whether the high percentage of white promotions was statistically “too” high. If it is statistically unlikely that whites would do so well and non-whites not so well, we’d have evidence that the tests might have been discriminatory (the city’s position). If the odds are high that the results could happen by chance or merit, we’d have evidence that the tests were not discriminatory (the firefighters’ position).

Note that analysis can say nothing about the intentions behind the tests. Note also that analysis cannot prove that the tests were discriminatory (or not) in some absolute-truth sense. Still, if the odds strongly favor one side or the other, that should count for something. A reasonable person would and should draw different conclusions about X if the odds of X are 1% and if the odds of X are 99%.

I wrote a computer program that looked at every possible way to fill 9 slots from 41 people (the captain’s exam) and every possible way to fill 10 slots from 77 people (the lieutenant’s exam). Then, it counted how many of those possible ways matched the actual racial distribution of the results.

There are 350,343,565 possible combinations of 9 winners on the captain’s exam. Of them, 13,459,600, or 3.8%, had 7 whites, 0 blacks, and 2 Hispanics. Another way to look at the results is how many had 7 whites and 2 non-whites. Under that test, 57,684,000 combinations match 7 whites and 2 non-whites, or 16.5%.
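The counts above can be reproduced without listing every combination. Here is a minimal Python sketch, assuming (as the calculation does) that every set of 9 winners is equally likely. The function name `ways_to_pick` is made up for illustration, and this closed-form shortcut is not necessarily how the program described above worked.

```python
from math import comb

def ways_to_pick(group_sizes, picks):
    """Count the ways to choose picks[i] people from a group of group_sizes[i], for every group."""
    total = 1
    for size, k in zip(group_sizes, picks):
        total *= comb(size, k)
    return total

# Captain's exam: 25 white, 8 black, and 8 Hispanic candidates passed; 9 would be promoted.
all_combos = comb(41, 9)
exact_split = ways_to_pick([25, 8, 8], [7, 0, 2])   # 7 white, 0 black, 2 Hispanic
white_vs_rest = ways_to_pick([25, 16], [7, 2])      # 7 white, 2 non-white

print(all_combos, exact_split, white_vs_rest)
print(f"{exact_split / all_combos:.1%}", f"{white_vs_rest / all_combos:.1%}")
```

The three counts printed should match the 350,343,565, 13,459,600, and 57,684,000 quoted above.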

In statistical analysis, 5% is a common threshold for “significance”; that is, a result is called significant if chance alone would produce it no more than 1 time in 20. (More-stringent analysis uses 1%.) Thus, the 3.8% figure falls below the threshold and supports the city’s case, and the 16.5% figure does not.

Then there’s the lieutenant’s test. There are (this is not an exaggeration or a joke) 1,096,993,404,430 possible combinations of 10 winners out of 77 people. Of that trillion-plus, 1,917,334,783 fit the 10 whites, 0 non-whites outcome. The odds of that are far below 1%; to be exact, the odds are 0.17%. That suggests it was not an accident that 10 whites got the 10 top scores. Why it happened — the test itself, the scoring, self-selection among those who took the test, something else — is a different question, about which neither the analysis nor I make any statement. (I didn’t mention merit as a reason why it happened. We’ll come back to that.) All we can say is that there’s only 1 chance in almost 600 that such an outcome would occur by chance. That’s like guessing a coin toss correctly 9 times in a row: it can happen but you wouldn’t bet on it. Those results support the city’s case strongly.
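For readers who want to check the lieutenant’s numbers, here is a small Python sketch of the same arithmetic. It assumes every set of 10 winners is equally likely, and the variable names are illustrative only.

```python
from math import comb, log2

total = comb(77, 10)       # every possible set of 10 winners among the 77 who passed
all_white = comb(43, 10)   # sets drawn only from the 43 white candidates who passed

p = all_white / total
print(f"probability of 10 white winners: {p:.2%}")
print(f"about 1 chance in {1 / p:.0f}")
print(f"comparable to this many correct coin-toss guesses in a row: {log2(1 / p):.1f}")
```

The ratio works out to roughly 1 in 572, which is consistent with “almost 600” and a little rarer than 9 correct coin tosses in a row (1 in 512).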

That said, 1-in-600 odds don’t prove there is discrimination. Those test results could happen by chance, especially if multiple cities use the same test or the same city uses the test multiple times. It’s like winning the lottery if you play enough times or dying on an airplane if you fly enough times. And again, there’s the question of merit.
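To see how quickly a rare outcome becomes likely when a test is used many times, here is a rough Python sketch. It assumes each use of the test is independent and carries the same 1-in-600 chance, which is a simplification; the function name is hypothetical.

```python
def chance_at_least_once(p, n_tests):
    """Probability that an event with per-test chance p occurs at least once in n_tests independent tests."""
    return 1 - (1 - p) ** n_tests

p = 1 / 600  # the rounded odds from the lieutenant's exam
for n in (1, 10, 100, 600):
    print(n, f"{chance_at_least_once(p, n):.1%}")
```

Under those assumptions, the chance of seeing the outcome at least once climbs from well under 1% for a single test to better than even after several hundred tests.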

It gets more complicated. We know that New Haven took pains to create a test that wouldn’t racially discriminate. Assuming that they were sincere and at least partially effective in their efforts, that should raise our confidence that the test results happened by chance (the firefighters’ position), not by discrimination (the city’s). How much should we raise our confidence that the test was not discriminatory? I don’t know. A place to start might be to compare the results of the contested exam with the results of previous tests, or to look at other cities’ test results.

Here’s a different complication: how do we define or discern discrimination? Presumably it would show up as an unfair boost, not unlike steroids, rather than as a blatant gift. Let’s try an experiment. What if, for instance, whites were surreptitiously given slightly higher scores on the tests than blacks or Hispanics? I don’t know how that would be done, but let’s assume that there was a clever way. The average score on the lieutenant’s test for whites was 71.8, for blacks 63.8, for Hispanics 63.6. An unfair boost is one of several possible explanations for that difference. The existence of the difference does not prove the difference was unfair or even statistically reliable, though it begs to be studied more. Regardless, what if we split the difference on the averages and subtract 4 points from every white candidate’s score? How many of the top 10 scorers would be white?

Answer: still 10.

Subtracting 4 points from the whites’ scores is arbitrary, and it could be argued that it introduces clear bias in an attempt to eliminate assumed bias. Even so, 4 points doesn’t change the promotion list. That’s an argument in favor of the firefighters who brought the lawsuit, though it doesn’t prove much.

What if we completely erase the average differences among the groups by subtracting 8 points from the whites’ scores? We’d have 6 white winners and 4 black winners, an argument in favor of the city. And we still haven’t proved much.

Neither the 4-point experiment nor the 8-point experiment proves anything about the presence or absence of discrimination. They merely show the sensitivity of the promotions to presumed systematic bias of a certain number of points. The experiment is about the size of the arbitrary subtraction, not about discrimination. The experiment comes down to whether the experimenter believes that 4 points, or 8 points, or 0 points, or 2.736 points, or 12.345 points, is the right adjustment for the differences in average scores. It’s an analysis based on an assumption. (If there are data that support a real adjustment, that’d be another matter entirely. I don’t know if any such data exist.)
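The mechanics of that sensitivity experiment can be sketched in a few lines of Python. The scores below are invented toy numbers, not the New Haven data, and the function name `whites_in_top` is hypothetical; the point is only to show how the promotion list shifts as the assumed adjustment grows.

```python
def whites_in_top(candidates, penalty, slots):
    """Subtract `penalty` from each white candidate's score, then count whites among the top `slots`.

    candidates is a list of (group, score) pairs. Ties keep their input order.
    """
    adjusted = [
        (score - penalty if group == "white" else score, group)
        for group, score in candidates
    ]
    adjusted.sort(key=lambda pair: pair[0], reverse=True)
    return sum(1 for _, group in adjusted[:slots] if group == "white")

# Made-up toy scores for illustration only (NOT the actual exam results).
toy = [
    ("white", 90), ("white", 86), ("white", 84), ("black", 83),
    ("hispanic", 81), ("white", 80), ("black", 78),
]

for penalty in (0, 4, 8):
    print(penalty, whites_in_top(toy, penalty, slots=3))
```

As in the essay, the answer depends entirely on the size of the penalty the experimenter chooses to assume.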

Let’s try another approach. Forty-three whites passed the lieutenant’s exam, along with 19 blacks and 15 Hispanics. Given the much larger number of whites, we’d expect that there would be more variation among them than within either of the other two groups. That happened: there was a wider range of scores among whites. Some of those scores were at the top end. That supports the firefighters’ case.
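A quick simulation illustrates why a larger group tends to produce a higher top score even when every group is drawn from the same distribution. This is a sketch under loud assumptions: scores are modeled as normally distributed, and the mean of 65 and standard deviation of 8 are invented for illustration, not estimated from the exam.

```python
import random

def average_top_score(group_size, trials=2000, mean=65.0, sd=8.0, seed=0):
    """Average, over many simulated groups, of the highest score in a group of `group_size`."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(rng.gauss(mean, sd) for _ in range(group_size))
    return total / trials

# Group sizes from the lieutenant's exam: 15 Hispanic, 19 black, 43 white.
for size in (15, 19, 43):
    print(size, round(average_top_score(size), 1))
```

With identical underlying distributions, the group of 43 should show the highest average top score simply because it has more draws.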

At last, here’s the merit issue I’ve been promising you. Calculating the 1-in-600 odds started with the implicit and more-or-less invisible assumption that all the people who passed the test got the same score. It’s a direct consequence of treating each of the trillion-plus combinations of winners as equally probable. (Did you spot that assumption? I didn’t until I got pretty deep into my analysis.) In effect, my calculations answered the question “how probable is it that equally qualified people of different races would produce 10 white winners (the lieutenant’s test) or 7 white and 2 Hispanic winners (the captain’s test)?” But how do we know people’s qualifications? That’s what the tests are supposed to reveal. And if the tests reveal merit, then the test results would be right, by definition. But I don’t know if they do (perhaps someone else does); some tests work and some tests don’t. For now, we can only make assumptions.
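The hidden assumption can be made visible with a simulation. In the Python sketch below, the 10 winners are drawn at random from the 77 who passed, which is exactly what “equally qualified” means in the calculation; the simulated frequency should then land close to the exact figure. The function name, trial count, and seed are arbitrary choices for illustration.

```python
import random
from math import comb

def simulate_all_white(trials=100_000, seed=1):
    """Fraction of random draws of 10 winners from 77 in which all 10 are white."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Candidates numbered 0..42 stand for the 43 white passers; 43..76 for everyone else.
        winners = rng.sample(range(77), 10)
        if all(w < 43 for w in winners):
            hits += 1
    return hits / trials

exact = comb(43, 10) / comb(77, 10)
estimate = simulate_all_white()
print(f"exact {exact:.3%}, simulated {estimate:.3%}")
```

If the candidates are not equally qualified, random draws are the wrong model, and neither the simulation nor the exact figure tells us how likely the actual result was.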

Presumably the New Haven Fire Department believes the tests measure something of value. On the other hand, presumably no one believes the tests are perfect. So what should we conclude?

Alas, our analysis is not conclusive. If we assume that the candidates were equally qualified, more or less, the 1-in-600 calculation is pretty compelling in favor of the city. Ditto if we assume the tests and scoring were slanted, intentionally or not, toward the white candidates. On the other hand, the wider variation in the larger group, the less-than-overwhelming 1-in-26 odds on the captain’s test, and the city’s previous efforts to ensure fair tests argue in favor of the suing firefighters.

Most important, there’s the question of whether the tests measure merit. If they do, the firefighters’ case is strong. If they don’t, the city’s case is strong.

The bottom line: no definitive answer yet.

Let’s put New Haven aside and up-level the discussion.

When I ran my computer program, I was sure of my conclusion: New Haven is right, the firefighters are wrong. As I wrote this essay, though, I thought about questions my dear readers might ask, and those questions made me think. I questioned my methods, assumptions, and conclusions. I went back and forth as I crunched numbers every way I could imagine short of making this essay my career. So much for cold, rigorous thinking; instead, I got a cold, hard dose of unwanted humility as my conclusion morphed into I don’t know. More data might help, especially information about the validity of the tests. I don’t have those data, and even if I did, there’s only so much time for analysis before we must make decisions. That’s true in government and in business.

That I have not come to a clear conclusion about Ricci v. DeStefano doesn’t mean this exercise (and your faithful reading) has been useless. Quite the contrary; we can come to conclusions about how we come to conclusions. This exercise has:

  • Made me less likely to jump to a tempting conclusion based on a short stack of factoids.
  • Reminded me that numbers don’t speak for themselves (1-in-600 odds), that analysis reflects basic assumptions (equally probable combinations of winners), and that statistical processes are always at work (wider variations in larger groups).
  • Taught me to actively, deliberately look for contrary data and ideas and to keep asking how my analysis could be wrong or incomplete.
  • Helped me formulate different ways to solve analytic problems. For example, when there is no clear conclusion, ask a different question. “Which mistake would I rather make” is a good one. “What would a smart reader say” is another.
  • Shown me that even though “proof” may be an unattainable standard, we can avoid simplistic answers.
  • Proven to me that inconclusive data can lead to more-thorough thinking.

Both sides care about the truth, and we have learned that the data we’ve seen so far are not sufficient to tell us the truth. We have learned that we can debate the promotion patterns and test scores as long as our voices hold out but those data alone do not reveal the truth. We have learned that making a decision relying solely on the data we’ve seen so far will be at least partially the triumph of persuasion or ideology. Finally (and this is both important and exciting), we have learned what else we need — data on the validity of the tests themselves — to make a decision using reason and analysis.

Update. On June 29, 2009, the Supreme Court ruled in favor of the firefighters in a 5-4 decision written by Justice Anthony Kennedy. The Wall Street Journal reported that Justice Kennedy said employers “must show a ‘strong basis in evidence’ before ignoring results of employment-related tests.” As we saw in this essay, the evidence was mixed, and whether discrimination had been present or absent, it would have been difficult to prove or disprove.

Update. August 5, 2010: Judge Halts New York City From Using Exam Results to Hire Firefighters. Here is the district court’s full decision. Of particular interest are the tight “bunching” of test scores near the top of the range and the use of rank ordering to select candidates to hire. In other words, is a score of 98 actually better than a score of 97?

Related. February 6, 2012: Race and Death Penalty Juries, in the New York Times.

Further reading
Steven D. Levitt and Stephen J. Dubner, Freakonomics.
Leonard Mlodinow, The Drunkard’s Walk.
John Allen Paulos, Innumeracy.
Jay Russo and Paul Schoemaker, Decision Traps.
Nassim Nicholas Taleb, Fooled by Randomness.
See also Marvelous Techniques.

Babette Bensoussan

Great essay. Found it to be an excellent example of how we get caught up in analysis. My concern is that most people end up paralysed because they can’t come to a conclusion that they assume has to be fair and equitable! Well done… Mark

Mark Chussil

Thanks, Babette! You make a very interesting point. It hadn’t occurred to me that some people will get stuck when they get an answer that violates their idea of what a good answer should look like. What do we do, as caring human beings, when an analysis leads in a direction we find unpalatable? I find the flip side interesting too, that people will persist with strong opinions that are not supported by data. Sometimes analysis shows that we just don’t know an answer, at least not yet. As Carl Sagan said, it’s okay to reserve judgment until the evidence is in.