August 08, 2003
Statistical Prediction Rules More Accurate Than Experts

J.D. Trout & Michael Bishop, writing in an essay entitled "50 Years of Successful Predictive Modeling Should be Enough: Lessons for Philosophy of Science", argue that we continue to rely too heavily on the individual judgements of experts to make important decisions in domains where automated Statistical Prediction Rules would yield more accurate results. (PDF format)

In 1954, Paul Meehl wrote a classic book entitled Clinical Versus Statistical Prediction: A Theoretical Analysis and a Review of the Evidence. Meehl asked a simple question: Are the predictions of human experts more reliable than the predictions of actuarial models? For the comparison to be fair, both the experts and the models had to make their predictions on the basis of the same evidence (i.e., the same cues). Meehl reported on 20 such experiments. Since 1954, every unambiguous study that has compared the reliability of clinical and actuarial predictions (i.e., Statistical Prediction Rules, or SPRs) has supported Meehl’s conclusion. So robust is this finding that we might call it The Golden Rule of Predictive Modeling: When based on the same evidence, the predictions of SPRs are more reliable than the predictions of human experts.
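To make the idea concrete, here is a minimal sketch of what a Statistical Prediction Rule typically is: a fixed formula, often just a weighted sum of coded cues with a cutoff, applied mechanically to every case. The cue names, weights and cutoff below are invented for illustration (loosely echoing the electroshock example later in the excerpt); they are not from Meehl or from Trout and Bishop, and a real SPR would derive its weights from outcome data.

```python
# Illustrative sketch of a Statistical Prediction Rule (SPR).
# The cues, weights, and cutoff are made up; a real rule fits its
# weights to past outcomes (e.g., by regression) and is then applied
# mechanically, with no case-by-case human adjustment.

def spr_predict(cues, weights, cutoff):
    """Combine coded cues with fixed weights and apply a cutoff."""
    score = sum(weights[name] * value for name, value in cues.items())
    return ("favorable" if score >= cutoff else "unfavorable"), score

# A hypothetical patient, coded on the same cues an expert would see:
patient = {
    "married": 1,             # 1 = married, 0 = not
    "years_of_distress": 2,   # duration of psychotic distress
    "insight_rating": 4,      # clinician's 1-5 rating of the patient's insight
}

# Placeholder weights for illustration only.
weights = {"married": 1.0, "years_of_distress": -0.5, "insight_rating": 0.8}

prediction, score = spr_predict(patient, weights, cutoff=2.0)
print(prediction, round(score, 2))  # -> favorable 3.2
```

The point of the comparison is that the expert and the formula see exactly the same coded cues; the formula simply combines them the same way every time.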

It is our contention that The Golden Rule of Predictive Modeling has been woefully neglected. Perhaps a good way to begin to undo this state of affairs is to briefly describe ten of its instances. This will give the reader some idea of the range and robustness of the Golden Rule.

1. A SPR that takes into account a patient’s marital status, length of psychotic distress, and a rating of the patient’s insight into his or her condition predicted the success of electroshock therapy more reliably than a hospital’s medical and psychological staff members (Wittman 1941).

2. A model that used past criminal and prison records was more reliable than expert criminologists in predicting criminal recidivism (Carroll 1982).

3. On the basis of a Minnesota Multiphasic Personality Inventory (MMPI) profile, clinical psychologists were less reliable than a SPR in diagnosing patients as either neurotic or psychotic. When psychologists were given the SPR’s results before they made their predictions, they were still less accurate than the SPR (Goldberg 1968).

4. A number of SPRs predict academic performance (measured by graduation rates and GPA at graduation) better than admissions officers. This is true even when the admissions officers are allowed to use considerably more evidence than the models (DeVaul et al. 1957), and it has been shown to be true at selective colleges, medical schools (DeVaul et al. 1957), law schools (Dawes, Swets and Monahan 2000, 18) and graduate school in psychology (Dawes 1971).

5. SPRs predict loan and credit risk better than bank officers. SPRs are now standardly used by banks when they make loans and by credit card companies when they approve and set credit limits for new customers (Stillwell et al. 1983).

6. SPRs predict newborns at risk for Sudden Infant Death Syndrome (SIDS) much better than human experts (Lowry 1975; Carpenter et al. 1977; Golding et al. 1985).

7. Predicting the quality of the vintage for a red Bordeaux wine decades in advance is done more reliably by a SPR than by expert wine tasters, who swirl, smell and taste the young wine (Ashenfelter, Ashmore and Lalonde 1995).
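The Bordeaux result is a nice illustration of how simple these rules tend to be: Ashenfelter's model is a linear function of a few weather measurements and the vintage's age, all available long before the wine matures. The sketch below keeps that general form, but the coefficients are placeholders chosen for illustration, not the fitted values published in Ashenfelter, Ashmore and Lalonde (1995).

```python
# Sketch of a linear vintage-quality rule in the spirit of Ashenfelter's
# Bordeaux model. Coefficients are placeholders, not the published fit;
# see Ashenfelter, Ashmore and Lalonde (1995) for the actual regression.

def vintage_quality_score(age_years, growing_season_temp_c,
                          harvest_rain_mm, winter_rain_mm):
    """Higher is better: warm growing seasons and wet winters help,
    rain at harvest hurts, and older vintages score a little higher."""
    return (0.02 * age_years
            + 0.60 * growing_season_temp_c
            - 0.004 * harvest_rain_mm
            + 0.001 * winter_rain_mm)

# Two hypothetical vintages, scored from weather records alone:
print(vintage_quality_score(10, 17.5, 100, 600))  # warm year, dry harvest
print(vintage_quality_score(10, 16.0, 300, 400))  # cool year, wet harvest
```

No swirling, smelling or tasting enters into it; everything the rule needs is measurable decades in advance.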

The writers cite additional examples not excerpted here. Paul Meehl thinks we should place more trust in models for decision-making.

Upon reviewing this evidence in 1986, Paul Meehl said: “There is no controversy in social science which shows such a large body of qualitatively diverse studies coming out so uniformly in the same direction as this one. When you are pushing [scores of] investigations [140 in 1991], predicting everything from the outcomes of football games to the diagnosis of liver disease and when you can hardly come up with a half dozen studies showing even a weak tendency in favor of the clinician, it is time to draw a practical conclusion” (Meehl 1986, 372-3).

The writers discuss why humans are reluctant to admit that their subjective judgement has a high error rate.

Resistance to the SPR findings runs very deep, and typically comes in the form of an instance of Peirce’s Problem. Peirce (1878, 281-2) raised what is now the classic worry about frequentist interpretations of probability: How can a probability claim (say, the claim that 99 out of 100 cards are red) be relevant to a judgment about a particular case (whether the next card will be red)? After all, the next card will be red or not, and the other 99 cards can’t change that fact. Those who resist the SPR findings are typically quite willing to admit that in the long run, SPRs will be right more often than human experts. But their (over)confidence in subjective powers of reflection leads them to deny that we should believe the SPR’s prediction in some particular case.
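A quick way to see why "this particular case might be different" does not rescue the expert: when the two predictors disagree on a case, your chance of being right on that very case is just the hit rate of whichever one you side with. The toy simulation below assumes illustrative accuracies (90% for the SPR, 80% for the expert, with independent errors); those figures are not from the paper.

```python
# Toy illustration of Peirce's worry: even though every case is a one-off,
# siding with the more reliable predictor wins more of the one-offs.
# The 0.9 and 0.8 accuracies are assumptions for illustration only.

import random

random.seed(0)
N = 100_000
spr_correct = expert_correct = 0

for _ in range(N):
    outcome = random.random() < 0.5                    # the particular case
    spr_says = outcome if random.random() < 0.9 else not outcome
    expert_says = outcome if random.random() < 0.8 else not outcome
    spr_correct += (spr_says == outcome)
    expert_correct += (expert_says == outcome)

print(f"always follow the SPR:    {spr_correct / N:.3f}")    # ~0.900
print(f"always follow the expert: {expert_correct / N:.3f}")  # ~0.800
```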

The writers go on to discuss why experts have excessive confidence about their abilities and how they underestimate their rate of errors when making judgements.

On a practical personal level, what can we do to get better diagnoses and better advice? Try to get direct access to automated decision-making systems, and when that is not possible, seek out experts who use such systems routinely. Given that most experts in most fields will remain unwilling to use such systems for the foreseeable future, we will have to continue to rely on flawed human judgement the vast bulk of the time.

Vermont doctor Lawrence L. Weed has developed an expert system for medical diagnosis called the Problem Knowledge Coupler. See this Boston Globe report on Dr. Weed and the reception that the Problem Knowledge Coupler has received in the medical community.

Humans, Weed argues, cannot consistently process all of the information needed to diagnose and treat a complicated problem. The more information the physician gets about a patient, the more complex the task becomes. A doctor working without software to augment the mind, he argues, is like a scientist working without a microscope to augment the eye.

Some accomplished physicians and scientists who have explored ways to use artificial intelligence to diagnose patients say that it is impossible with today's technology. Many other doctors strongly oppose the mere concept, calling software incapable of matching their expertise; computers merely get in the way, they argue. But a small band of physicians and Weed's company's biggest customer, the US Department of Defense, have begun to use the Knowledge Couplers, and an early study suggests that their patients are healthier for it. If the software catches on, Weed's ideas may forever change the way doctors make decisions, removing much of the mystery and leaving us, the patients, with more control over our care. Weed's supporters say the medical industry will one day recognize the genius behind the software, much as it recognized the promise of Weed's first major innovation, which changed medicine four decades ago.

Training of large numbers of experts by universities has probably had the perverse effect of increasing the number of people running around making highly confident but wrong judgements. But the tendency to overlook our errors and to place excessive confidence in our subjective judgements is something that all humans suffer from to varying degrees. Unfortunately, few people receive much training in statistics or in methods of making more rational judgements, and a great deal of the potential of expert systems goes unrealized because people are unwilling to acknowledge how much such systems could help them.

Randall Parker, 2003 August 08 12:38 PM  Expert Systems


Comments
David Weisman said at August 8, 2003 5:32 PM:

But part of the point is that experts can sometimes seek out information an expert system might not know to look for, or an expert might glean more 'information' from raw observations than a program.

Not that expert systems might not outperform experts in certain fields, but this test seems biased against some things experts may still do better than many systems.

Randall Parker said at August 8, 2003 10:06 PM:

David, one of the points made in the full article is that in the vast majority of the studies examined, even when experts have more information than the models, the models do a better job than the experts do.

Rob said at August 10, 2003 10:05 AM:

I think we will see much more of this in the future. Humans hate to lose out to machines, but we keep doing it. Human brains have a knack for making errors such as confirmation bias, anchoring, etc, and machines don't, but for some reason we are always more comfortable with an error-prone human in control.

Nancy Lebovitz said at August 11, 2003 12:46 AM:

I wonder how the statistical systems do compared to the best experts rather than compared to the average experts.

Bob said at August 11, 2003 9:02 AM:

Nancy,

If you read the pdf, you will find this gem:

9. In predicting the presence, location and cause of brain damage, a SPR outperformed experienced clinicians and a nationally prominent neuropsychologist (Wedding 1983).

And this one too:

Dawes recounts another vivid example. He was presenting a finding in which a SPR had outperformed various medical doctors in predicting the severity of disease and death. In the question period, “the dean of a prestigious medical school stated during the question period that ‘if you had studied Dr. So-and-so, you would have found that his judgments of severity of the disease process would have predicted the survival time of his patients.’ I could not say so, either publicly or privately, but I knew that the physician involved in fact was Dr. So-and-so…” (Dawes 2000, 151).

And finally:

1. A common complaint against the SPR findings begins by noting that whenever humans are found to be less reliable than SPRs, humans are typically forced to use only evidence that can be quantified (since that’s the only evidence that SPRs can use). The allegation is that this rigs the competition in favor of the SPRs, because experts are not permitted to use the kinds of qualitative evidence that could prompt use of the experts’ “human experience”, “intuition”, “wisdom”, “gut feelings” or other distinctly subjective human faculties. Besides the fact that this is an expression of hope rather than a reason to doubt the SPR findings, this complaint is bogus. It is perfectly possible to quantitatively code virtually any kind of evidence that is prima facie non-quantitative so that it can be utilized in SPRs. For example, the SPR that predicts the success of electroshock therapy employs a rating of the patient’s insight into his or her condition. This is prima facie a subjective, non-quantitative variable in that it relies on a clinician’s diagnosis of a patient’s mental state. Yet, clinicians can quantitatively code their diagnoses for use in a SPR.
2. A legitimate worry about SPRs has come to be known as the “broken leg” problem. Consider an actuarial formula that accurately predicts an individual’s weekly movie attendance. However, if we knew that the subject was in a cast with a broken leg, it would be wise to discard the actuarial formula (Dawes, Faust and Meehl, 1989). While broken leg problems will inevitably arise, it is difficult to offer any general prescriptions for how to deal with them. The reason is that in studies in which experts are given SPRs and are permitted to override them, the experts inevitably find more broken leg examples than there really are. In fact, such experts predict less reliably than they would have if they’d just used the SPR (Goldberg 1968, Sawyer 1966, Leli and Filskov 1984).

Nancy and David, you are demonstrating the human biases and behaviours described in the study.

Bill said at August 11, 2003 9:38 AM:

Bob,

I have a concern related to your #1 and to the feasibility of using SPRs. If clinicians trust themselves more than they trust SPRs and if they are being asked to code subjective information for use by an SPR, how does one stop the clinicians from learning, over time, to shade their subjective input in order to lead the SPR towards their own faulty conclusions?

Flippantly: "I KNOW this patient needs immediate treatment X. If I have to give this patient a 'very severe' on this stupid SPR input form to get him treatment X, then that is what I will do!"

Notice, you can't really get out of this by reference to study results --- this becomes an issue only when the SPR's judgement is dispositive in the treatment decision.

Bob said at August 16, 2003 5:19 PM:

Hi Bill,

It's not my #1; I quoted it from the pdf Randall cited. Assuming a flat maximum, the doctor might not be able to skew the result sufficiently in some cases and would have to skew the measurement so far in other cases as to throw his or her intellectual dishonesty in his or her face.

The doctor is more likely to prescribe a harmless and useless treatment. As long as the doctor orders the tests to confirm the SPR's diagnosis, the patient will still benefit.

M.Kanagaraj said at March 22, 2004 4:15 AM:

WHICH IS BETTER, clinical judgements OR statistical predictions?

Thinker said at September 15, 2004 8:10 PM:

Bluntly, an expert system has some good points and some bad points. But expert systems should NOT make judgements. But they are, and the goal is not patient care, it is cost cutting.

With a vengeance.

'Expert' systems run by managed care organizations - often headed by bean counters with no relevant medical experience - are ruining medicine.

These glorified software tools that rely on statistical modeling to dictate treatment options to doctors are causing major problems.

Sure, they have managed to reduce the average time spent with patients down to six to eight minutes each. For doctors who do not stay current with established practice, they sometimes help.

But what happens when a patient does not clearly articulate his or her symptoms in the time allotted? Or when the system does not know important facts (more often than not)? A machine cannot 'intuit' what was left out. Medical problems are way too complicated for a machine.

And even if they were not, often a machine could not possibly have the context it would need to make the appropriate suggestions. Computers with the real-world smarts to begin to address those issues are at least 20 or 30 years away.

Sure, they have reduced the cost to HMOs by dictating what is sometimes 'cost effective' treatment - for most patients - but there are enough statistical 'outliers' to make the system fail spectacularly much more often than it should.
Also, the emphasis is on profit, not effective care. Sometimes this results in a delaying of adequate treatment until the patient is dead.

Because of this blind rush towards cost-cutting, they have created the situation mentioned above where doctors (or, in the sterile, commoditized language of managed care - 'service providers') have to exaggerate a patient's medical situation in order to obtain payment for treatment.

This kind of brutal, cost-driven environment is also driving young people away from medicine and the medical sciences.

yogakaruppiah said at March 11, 2007 1:39 AM:

which is better clinical judgement or statistical predictions

Mario said at October 9, 2017 12:00 PM:

Hi Bob,

You wrote "If you read the pdf, you will find this gem ..." followed by three quotes. Did you know that the second quote is not present in the PDF, which is the published paper? It seems it is only in the draft.

It seems that, ironically, by quoting three examples to affirm your belief, you are behaving like the clinicians Dawes is quoted as describing in a different anecdote that precedes the second one in the draft.

Then everyone generalizes from N examples, for some N>=1.

Mario

“Everyone generalizes from one example. At least, I do.” –Vlad Taltos (Issola, Steven Brust)
