Weapons of Math Destruction (2016) by Cathy O'Neil

Solid work of practical ethics, covering the rights and wrongs of applied statistics in general, but particularly our mass, covert automation of business logic in schools, universities, policing, sentencing, recruitment, health, voting... Much to admire: she is a quant (an expert high-stakes modeller) herself, understands the deep potential of modelling, and prefaces her negative examples with examples of great models, methods of math construction: the moneyball wonks, the FICO men.
in mid-2011, when Occupy Wall Street sprang to life in Lower Manhattan, I saw that we had work to do among the broader public. Thousands had gathered to demand economic justice and accountability. And yet when I heard interviews with the Occupiers, they often seemed ignorant of basic issues related to finance...
They are lucky to have her!

A 'Weapon of Math Destruction' is a model which is unaccountably damaging to many people's lives. She doesn't cash out her criteria ("Opacity, Scale, and Damage"), but here's an enumeration.

Standardisation - the agreement of units and interfaces that everyone can use - is one of those boring but absolutely vital and inspiring human achievements. WMDs are a dark standardisation.

In her account, a model doesn't have to be predatory to be a WMD: for instance the recidivism estimator was first formulated by a public-spirited gent.

She says "efficiency" when she often means "accuracy, and thereby efficiency". This makes sense rhetorically, because she doesn't want to give the predatory models a halo effect from the less backhanded word "accurate". She also lards the text with "Big Data".

The writer Charles Stross, asked recently to write the shortest possible horror story, responded: "Permanent Transferrable Employee Record". O'Neil makes us consider that this could be easily augmented with "including employee medical histories".

She covers most of the objections I would make to an unsubtle author:
It is true, as data boosters are quick to point out, that the human brain runs internal models of its own, and they’re often tinged with prejudice or self-interest. So its outputs—in this case, teacher evaluations—must also be audited for fairness. And these audits have to be carefully designed and tested by human beings, and afterward automated. In the meantime, mathematicians can get to work on devising models to help teachers measure their own effectiveness and improve.

Models optimise for truth; she wants fairness instead. But can you be fair without knowing the truth?
She is against optimising for profit, though truth is one argument of the profit function.

She does not include inaccuracy as a named criterion for WMDs, but her discussions sometimes require it. This is maybe the core shortcoming of the book: it doesn't wrestle much with the hard tradeoff involved in modelling unfair situations, e.g. living in a bad neighbourhood, which increases your risks and insurance costs through no fault of your own. She comes down straightforwardly on the side of the direct "make the model pretend it isn't there" diktat.

But then she notes a case where fairness wrongly trumped accuracy, value-added teacher rating:
The teacher scores derived from the tests measured nothing. This may sound like hyperbole. After all, kids took tests, and those scores contributed to Clifford’s. That much is true. But Clifford’s scores, both his humiliating 6 and his chest-thumping 96, were based almost entirely on approximations that were so weak they were essentially random.

The problem was that the administrators lost track of accuracy in their quest to be fair. They understood that it wasn’t right for teachers in rich schools to get too much credit when the sons and daughters of doctors and lawyers marched off toward elite universities. Nor should teachers in poor districts be held to the same standards of achievement. We cannot expect them to perform miracles.

So instead of measuring teachers on an absolute scale, they tried to adjust for social inequalities in the model. Instead of comparing Tim Clifford’s students to others in different neighborhoods, they would compare them with forecast models of themselves. The students each had a predicted score. If they surpassed this prediction, the teacher got the credit. If they came up short, the teacher got the blame. If that sounds primitive to you, believe me, it is.

My preferred measure would be to not prevent models from being rational, but instead make transfers to the victims of empirically unfair situations. (This looks pointlessly indirect, but price theory, with its account of the harms of messing with prices, is one of the few replicated parts of economics.) My measure has the advantage of not requiring a massive interpretative creep of regulation: you just see what the models do as black boxes and then levy justice taxes after.
Statistically speaking, in these attempts to free the tests from class and color, the administrators moved from a primary to a secondary model. Instead of basing scores on direct measurement of the students, they based them on the so-called error term — the gap between results and expectations. Mathematically, this is a much sketchier proposition. Since the expectations themselves are derived from statistics, these amount to guesses on top of guesses. The result is a model with loads of random results, what statisticians call “noise.”
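The "guesses on top of guesses" point can be made concrete with a toy simulation. This is only a sketch under made-up assumptions, not the model any school district actually used: every number (teacher effect spread, pupil noise, forecast noise, class sizes) and every function name here is hypothetical. Each teacher's score is the mean residual, actual minus predicted, over their class, and we ask how well one year's scores agree with the next year's for the same teachers.

```python
import random
import statistics


def pearson(xs, ys):
    """Plain Pearson correlation, written out to avoid version-specific helpers."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def yearly_scores(true_effects, class_size, rng, student_sd=15.0, forecast_sd=10.0):
    """One year of teacher scores, each the mean residual (actual - predicted)."""
    scores = []
    for effect in true_effects:
        residuals = []
        for _ in range(class_size):
            baseline = rng.gauss(50.0, 10.0)  # hypothetical "expected" level of the pupil
            actual = baseline + effect + rng.gauss(0.0, student_sd)
            predicted = baseline + rng.gauss(0.0, forecast_sd)  # the forecast is itself a guess
            residuals.append(actual - predicted)
        scores.append(statistics.fmean(residuals))
    return scores


def year_to_year_correlation(class_size, n_teachers=300, effect_sd=2.0, seed=0):
    """Correlation between two years of scores for the same simulated teachers."""
    rng = random.Random(seed)
    true_effects = [rng.gauss(0.0, effect_sd) for _ in range(n_teachers)]
    year1 = yearly_scores(true_effects, class_size, rng)
    year2 = yearly_scores(true_effects, class_size, rng)
    return pearson(year1, year2)


if __name__ == "__main__":
    for size in (25, 400):
        print(size, round(year_to_year_correlation(size), 2))
```

Under these assumptions a class of 25 gives scores that barely agree from one year to the next, while an unrealistically large class of 400 gives fairly stable ones. The teacher effect is the same in both cases; only the amount of averaging changes.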
Past unfairness gives the prior. Marx's critique of emp.
Innocent people surrounded by criminals get treated badly, and criminals surrounded by a law-abiding public get a pass. And because of the strong correlation between poverty and reported crime, the poor continue to get caught up in these digital dragnets. The rest of us barely have to think about them.
She flip-flops between thinking that false positives are the problem, and that any positives based on uncomfortable variables are the problem. Surprisingly big evidential gaps:
By 2009, it was clear that the lessons of the market collapse had brought no new direction to the world of finance and had instilled no new values. The lobbyists succeeded, for the most part, and the game remained the same: to rope in dumb money. Except for a few regulations that added a few hoops to jump through, life went on.

Engrossing car crash of American higher education. She credits the US News ranking with creating the whole mess though, which can't be right. Certainly it had some effect, and could fix some of its harm by including tuition fee size as a negative factor.

even those who claw their way into a top college lose out. If you think about it, the college admissions game, while lucrative for some, has virtually no educational value. The complex and fraught production simply re-sorts and reranks the very same pool of eighteen-year-old kids in newfangled ways. They don’t master important skills by jumping through many more hoops or writing meticulously targeted college essays under the watchful eye of professional tutors. Others scrounge online for cut-rate versions of those tutors. All of them, from the rich to the working class, are simply being trained to fit into an enormous machine—to satisfy a WMD. And at the end of the ordeal, many of them will be saddled with debt that will take decades to pay off. They’re pawns in an arms race, and it’s a particularly nasty one.

Predictive models are, increasingly, the tools we will be relying on to run our institutions, deploy our resources, and manage our lives. But as I’ve tried to show throughout this book, these models are constructed not just from data but from the choices we make about which data to pay attention to—and which to leave out. Those choices are not just about logistics, profits, and efficiency. They are fundamentally moral...

Big Data processes codify the past. They do not invent the future. Doing that requires moral imagination, and that’s something only humans can provide. We have to explicitly embed better values into our algorithms, creating models that follow our ethical lead.

Quick, sharp-toothed and filling. Extra half a point for requiring no technical background; it proves clear thought.


Other highlighted passages:
Anywhere you find the combination of great need and ignorance, you’ll likely see predatory ads.

All of these data points were proxies. In his search for financial responsibility, the banker could have dispassionately studied the numbers (as some exemplary bankers no doubt did). But instead he drew correlations to race, religion, and family connections. In doing so, he avoided scrutinizing the borrower as an individual and instead placed him in a group of people — what statisticians today would call a “bucket.” “People like you,” he decided, could or could not be trusted.

Fair and Isaac’s great advance was to ditch the proxies in favor of the relevant financial data, like past behavior with respect to paying bills. They focused their analysis on the individual in question—and not on other people with similar attributes. E-scores, by contrast, march us back in time. They analyze the individual through a veritable blizzard of proxies. In a few milliseconds, they carry out thousands of “people like you” calculations. And if enough of these “similar” people turn out to be deadbeats or, worse, criminals, that individual will be treated accordingly.

From time to time, people ask me how to teach ethics to a class of data scientists. I usually begin with a discussion of how to build an e-score model and ask them whether it makes sense to use “race” as an input in the model. They inevitably respond that such a question would be unfair and probably illegal. The next question is whether to use “zip code.” This seems fair enough, at first. But it doesn’t take long for the students to see that they are codifying past injustices into their model. When they include an attribute such as “zip code,” they are expressing the opinion that the history of human behavior in that patch of real estate should determine, at least in part, what kind of loan a person who lives there should get.

In other words, the modelers for e-scores have to make do with trying to answer the question “How have people like you behaved in the past?” when ideally they would ask, “How have you behaved in the past?” The difference between these two questions is vast. Imagine if a highly motivated and responsible person with modest immigrant beginnings is trying to start a business and needs to rely on such a system for early investment. Who would take a chance on such a person? Probably not a model trained on such demographic and behavioral data.
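A tiny illustration of the gap between the two questions, as a sketch only: the applicants, zip codes and repayment rates below are invented, and the two scoring functions are hypothetical stand-ins rather than anything a real e-score vendor is known to use. The "bucket" score gives everyone the average of their zip code; the individual score uses only the person's own record.

```python
from collections import defaultdict
from statistics import fmean

# Invented data: each applicant's own share of bills paid on time.
APPLICANTS = [
    {"name": "A", "zip": "11111", "on_time_rate": 0.98},
    {"name": "B", "zip": "11111", "on_time_rate": 0.50},
    {"name": "C", "zip": "11111", "on_time_rate": 0.55},
    {"name": "D", "zip": "22222", "on_time_rate": 0.70},
    {"name": "E", "zip": "22222", "on_time_rate": 0.95},
    {"name": "F", "zip": "22222", "on_time_rate": 0.93},
]


def individual_scores(applicants):
    """'How have you behaved in the past?': score from the person's own record."""
    return {a["name"]: a["on_time_rate"] for a in applicants}


def bucket_scores(applicants):
    """'How have people like you behaved?': score is the zip-code average."""
    by_zip = defaultdict(list)
    for a in applicants:
        by_zip[a["zip"]].append(a["on_time_rate"])
    zip_mean = {z: fmean(rates) for z, rates in by_zip.items()}
    return {a["name"]: zip_mean[a["zip"]] for a in applicants}


if __name__ == "__main__":
    own = individual_scores(APPLICANTS)
    bucket = bucket_scores(APPLICANTS)
    for name in ("A", "D"):
        print(name, "own:", own[name], "bucket:", round(bucket[name], 2))
```

In this toy data applicant A has the best personal record of anyone but lives in the zip code with the worse average, so the bucket score ranks A below D, whose own record is much weaker.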

should note that in the statistical universe proxies inhabit, they often work. More times than not, birds of a feather do fly together. Rich people buy cruises and BMWs. All too often, poor people need a payday loan. And since these statistical models appear to work much of the time, efficiency rises and profits surge. Investors double down on scientific systems that can place thousands of people into what appear to be the correct buckets. It’s the triumph of Big Data.
This is not to say that personnel departments across America are intentionally building a poverty trap, much less a racist one. They no doubt believe that credit reports hold relevant facts that help them make important decisions. After all, “The more data, the better” is the guiding principle of the Information Age. Yet in the name of fairness, some of this data should remain uncrunched.


Insurance is an industry that draws on the majority of the community to respond to the needs of an unfortunate minority. In the villages we lived in centuries ago, families, religious groups, and neighbors helped look after each other when fire, accident, or illness struck. In the market economy, we outsource this care to insurance companies, which keep a portion of the money for themselves and call it profit.

Mistaken in the US: the "loss ratio" of US insurance as a whole is greater than 100%, i.e. it is loss-making except for the financial return on held premiums.
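For the arithmetic behind that remark, here is a minimal sketch with invented figures (none of these numbers are real industry data, and the function names are hypothetical). It uses the usual definition of the loss ratio as claims divided by premiums; the ratio that also counts running expenses is usually called the combined ratio.

```python
def loss_ratio(claims_paid, premiums_earned):
    """Claims as a fraction of premiums; above 1.0 means paying out more than is taken in."""
    return claims_paid / premiums_earned


def net_result(premiums_earned, claims_paid, expenses, investment_income):
    """Underwriting result plus the return earned on premiums held before claims are paid."""
    underwriting = premiums_earned - claims_paid - expenses
    return underwriting + investment_income


if __name__ == "__main__":
    # Invented example: 100 in premiums, 103 in claims, 8 earned on held premiums.
    print(loss_ratio(103.0, 100.0))
    print(net_result(100.0, 103.0, 0.0, 8.0))
```

With these made-up numbers the insurer loses money on underwriting alone and is only in the black because of the investment return.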


Campaigns scoring voters for ad targeting is new to me:
The campaigns use similar analysis to identify potential donors and to optimize each one. Here it gets complicated, because many of the donors themselves are carrying out their own calculations. They want the biggest bang for their buck. They know that if they immediately hand over the maximum contribution the campaign will view them as “fully tapped” and therefore irrelevant. But refusing to give any money will also render them irrelevant. So many give a drip-feed of money based on whether the messages they hear are ones they agree with. For them, managing a politician is like training a dog with treats. This training effect is all the more powerful for contributors to Super PACS, which do not limit political contributions.

The campaigns, of course, are well aware of this tactic. With microtargeting, they can send each of those donors the information most likely to pry more dollars from their bank accounts. And these messages will vary from one donor to the next.

Occasional data-poor hyperbole:
Workers often don’t have a clue about when they’ll be called to work. They are summoned by an arbitrary program. Scheduling software also creates a poisonous feedback loop. Consider Jannette Navarro. Her haphazard scheduling made it impossible for her to return to school, which dampened her employment prospects and kept her in the oversupplied pool of low-wage workers. The long and irregular hours also make it hard for workers to organize or to protest for better conditions. Instead, they face heightened anxiety and sleep deprivation, which causes dramatic mood swings and is responsible for an estimated 13 percent of highway deaths. Worse yet, since the software is designed to save companies money, it often limits workers’ hours to fewer than thirty per week, so that they are not eligible for company health insurance. And with their chaotic schedules, most find it impossible to make time for a second job. It’s almost as if the software were designed expressly to punish low-wage workers and to keep them down.

The solution for the statisticians at St. George’s—and for those in other industries—would be to build a digital version of a blind audition eliminating proxies such as geography, gender, race, or name to focus only on data relevant to medical education. The key is to analyze the skills each candidate brings to the school, not to judge him or her by comparison with people who seem similar...

we’ve seen time and again that mathematical models can sift through data to locate people who are likely to face great challenges, whether from crime, poverty, or education. It’s up to society whether to use that intelligence to reject and punish them—or to reach out to them with the resources they need. We can use the scale and efficiency that make WMDs so pernicious in order to help people. It all depends on the objective we choose.
But how do we know what's relevant to medical education, except by correlation discovery?

It would also be a cinch to pump up the income numbers for graduates. All colleges would have to do is shrink their liberal arts programs, and get rid of education departments and social work departments while they’re at it, since teachers and social workers make less money than engineers, chemists, and computer scientists. But they’re no less valuable to society.