J.D. Trout and Michael Bishop, in an essay entitled "50 Years of Successful Predictive Modeling Should be Enough: Lessons for Philosophy of Science" (PDF format), argue that we still rely too heavily on the individual judgements of experts for important decisions in areas where Statistical Prediction Rules implemented on computers would yield more accurate results.
In 1954, Paul Meehl wrote a classic book entitled, Clinical Versus Statistical Prediction: A Theoretical Analysis and Review of the Literature. Meehl asked a simple question: Are the predictions of human experts more reliable than the predictions of actuarial models? To be a fair comparison, both the experts and the models had to make their predictions on the basis of the same evidence (i.e., the same cues). Meehl reported on 20 such experiments. Since 1954, every non-ambiguous study that has compared the reliability of clinical and actuarial predictions (i.e., Statistical Prediction Rules, or SPRs) has supported Meehl’s conclusion. So robust is this finding that we might call it The Golden Rule of Predictive Modeling: When based on the same evidence, the predictions of SPRs are more reliable than the predictions of human experts.
It is our contention that The Golden Rule of Predictive Modeling has been woefully neglected. Perhaps a good way to begin to undo this state of affairs is to briefly describe ten of its instances. This will give the reader some idea of the range and robustness of the Golden Rule.
1. A SPR that takes into account a patient’s marital status, length of psychotic distress, and a rating of the patient’s insight into his or her condition predicted the success of electroshock therapy more reliably than a hospital’s medical and psychological staff members (Wittman 1941).
2. A model that used past criminal and prison records was more reliable than expert criminologists in predicting criminal recidivism (Carroll 1982).
3. On the basis of a Minnesota Multiphasic Personality Inventory (MMPI) profile, clinical psychologists were less reliable than a SPR in diagnosing patients as either neurotic or psychotic. When psychologists were given the SPR’s results before they made their predictions, they were still less accurate than the SPR (Goldberg 1968).
4. A number of SPRs predict academic performance (measured by graduation rates and GPA at graduation) better than admissions officers. This is true even when the admissions officers are allowed to use considerably more evidence than the models (DeVaul et al. 1957), and it has been shown to be true at selective colleges, medical schools (DeVaul et al. 1957), law schools (Dawes, Swets and Monahan 2000, 18) and graduate school in psychology (Dawes 1971).
5. SPRs predict loan and credit risk better than bank officers. SPRs are now standardly used by banks when they make loans and by credit card companies when they approve and set credit limits for new customers (Stillwell et al. 1983).
6. SPRs predict newborns at risk for Sudden Infant Death Syndrome (SIDS) much better than human experts (Lowry 1975; Carpenter et al. 1977; Golding et al. 1985).
7. Predicting the quality of the vintage for a red Bordeaux wine decades in advance is done more reliably by a SPR than by expert wine tasters, who swirl, smell and taste the young wine (Ashenfelter, Ashmore and Lalonde 1995).
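To make concrete what a Statistical Prediction Rule looks like, here is a minimal sketch in Python. It assumes the simplest common form, a fixed weighted sum of cues compared against a cutoff. The cue names, weights, and cutoff are hypothetical and chosen only for illustration; they are not taken from any of the studies cited above.

```python
from typing import Mapping


def spr_score(cues: Mapping[str, float],
              weights: Mapping[str, float],
              intercept: float = 0.0) -> float:
    """Weighted sum of the cues. The same formula is applied to every case."""
    return intercept + sum(weights[name] * cues[name] for name in weights)


def spr_predict(cues: Mapping[str, float],
                weights: Mapping[str, float],
                intercept: float = 0.0,
                cutoff: float = 0.0) -> bool:
    """Predict a positive outcome when the score reaches the cutoff."""
    return spr_score(cues, weights, intercept) >= cutoff


# Hypothetical weights for an admissions-style rule (made up for illustration).
WEIGHTS = {"prior_grades": 0.6, "test_score": 0.4}

# Hypothetical applicant; cue values are assumed to be standardized.
applicant = {"prior_grades": 1.0, "test_score": -0.5}

if __name__ == "__main__":
    print("score:", round(spr_score(applicant, WEIGHTS), 3))
    print("predicted success:", spr_predict(applicant, WEIGHTS))
```

The point of the sketch is that the rule sees the same cues a human expert would see but combines them the same way every time, with no room for case-by-case adjustment.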
The writers cite additional examples not excerpted here. Paul Meehl thinks we should place more trust in models for decision-making.
Upon reviewing this evidence in 1986, Paul Meehl said: “There is no controversy in social science which shows such a large body of qualitatively diverse studies coming out so uniformly in the same direction as this one. When you are pushing [scores of] investigations [140 in 1991], predicting everything from the outcomes of football games to the diagnosis of liver disease and when you can hardly come up with a half dozen studies showing even a weak tendency in favor of the clinician, it is time to draw a practical conclusion” (Meehl 1986, 372-3).
The writers discuss why humans are reluctant to admit that their subjective judgement has a high error rate.
Resistance to the SPR findings runs very deep, and typically comes in the form of an instance of Peirce’s Problem. Peirce (1878, 281-2) raised what is now the classic worry about frequentist interpretations of probability: How can a probability claim (say, the claim that 99 out of 100 cards are red) be relevant to a judgment about a particular case (whether the next card will be red)? After all, the next card will be red or not, and the other 99 cards can’t change that fact. Those who resist the SPR findings are typically quite willing to admit that in the long run, SPRs will be right more often than human experts. But their (over)confidence in subjective powers of reflection leads them to deny that we should believe the SPR’s prediction in some particular case.
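The arithmetic behind the card example can be spelled out in a few lines of Python. This is only a sketch: the 99-in-100 figure comes from the passage above, while the function name `expected_accuracy`, the `override_rate` parameter, and the assumption that a hunch to override is unrelated to the actual card are hypothetical choices made for illustration.

```python
def expected_accuracy(p_red: float, override_rate: float) -> float:
    """Expected share of correct calls for a judge who predicts "red"
    by default but, on a fraction `override_rate` of cards, follows a
    hunch and predicts "not red" instead.

    Assumption (hypothetical): the hunch is unrelated to the actual card.
    """
    correct_when_following = (1.0 - override_rate) * p_red
    correct_when_overriding = override_rate * (1.0 - p_red)
    return correct_when_following + correct_when_overriding


if __name__ == "__main__":
    for rate in (0.0, 0.1, 0.5):
        accuracy = expected_accuracy(0.99, rate)
        print(f"override on {rate:.0%} of cards: {accuracy:.3f} expected accuracy")
```

Under that assumption, every "particular case" in which the judge departs from the base rate lowers expected accuracy, which is why agreeing with the long-run result while rejecting individual predictions does not hold together. If a judge's hunches really did carry information about the card, the calculation would change; the studies quoted above suggest that experts' overrides rarely do.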
The writers go on to discuss why experts have excessive confidence about their abilities and how they underestimate their rate of errors when making judgements.
On a practical, personal level, what can we do to get better diagnoses and better advice? Try to get direct access to automated decision-making systems and, when that is not possible, seek out experts who use such systems routinely. Since most experts in most fields are unwilling to use such systems, for the foreseeable future we will have to keep relying on flawed human judgement the vast bulk of the time.
Dr. Lawrence L. Weed of Vermont has developed an expert system for medical diagnosis called the Problem Knowledge Coupler. See this Boston Globe report on Dr. Weed and on the reception the Problem Knowledge Coupler has received in the medical community.
Humans, Weed argues, cannot consistently process all of the information needed to diagnose and treat a complicated problem. The more information the physician gets about a patient, the more complex the task becomes. A doctor working without software to augment the mind, he argues, is like a scientist working without a microscope to augment the eye.
Some accomplished physicians and scientists who have explored ways to use artificial intelligence to diagnose patients say that it is impossible with today's technology. Many other doctors strongly oppose the mere concept, calling software incapable of matching their expertise; computers merely get in the way, they argue. But a small band of physicians and Weed's company's biggest customer, the US Department of Defense, have begun to use the Knowledge Couplers, and an early study suggests that their patients are healthier for it. If the software catches on, Weed's ideas may forever change the way doctors make decisions, removing much of the mystery and leaving us, the patients, with more control over our care. Weed's supporters say the medical industry will one day recognize the genius behind the software, much as it recognized the promise of Weed's first major innovation, which changed medicine four decades ago.
The training of large numbers of experts by universities has probably had the perverse effect of increasing the number of people making highly confident but wrong judgements. But the tendency to overlook our own errors and to place excessive confidence in our subjective judgements is one that all humans share to varying degrees. Unfortunately, few people receive much training in statistics or in methods for making more rational judgements, and a great deal of the potential of expert systems goes unrealized because people are unwilling to acknowledge how much such systems could help them.
Randall Parker, 2003 August 08 12:38 PM | Expert Systems