Saturday 17 November 2012

Why machine learning (without expert input) may be doomed to fail

With the advent of ‘big data’ there has been a presumption (and even excitement) that machine learning, coupled with statistical analysis techniques, will reveal new insights and deliver better predictions in a wide range of important applications. This perception is reinforced by the impressive machine-intelligence results that organisations like Google and Amazon routinely achieve purely from the massive datasets that they collect.

But for many critical risk analysis problems (including most types of medical diagnosis and almost every case in a court of law) decisions must be made where there is little or no direct historical data to draw upon, or where relevant data is difficult to identify. The challenges are especially acute when the risks involve novel or rare systems and events (e.g. think of novel project planning, predicting events like accidents, terrorist attacks, and cataclysmic weather events). In such situations we need to exploit expert judgement. This latter point is now increasingly widely understood. However, what is less well understood is that, even when large volumes of data exist, pure data-driven machine learning methods alone are unlikely to provide the insights required for improved decision-making. In fact more often than not such methods will be inaccurate and totally unnecessary.

To see a simple example of why, read the story here.

Tuesday 23 October 2012

The impact of multiple possible test results on disease diagnosis

In our new book, we cite the famous Harvard Medical School experiment where doctors and medical students were asked the following question.

"One in a thousand people has a prevalence for a particular heart disease. There is a test to detect this disease. The test is 100% accurate for people who have the disease and is 95% accurate for those who don't (this means that 5% of people who do not have the disease will be wrongly diagnosed as having it)."

The answer (as explained here) is a bit less than 2% (which is interesting because most of the people in the study gave the answer as 95%).

Now a reader has posed the following problem:

 I understand the 2% answer of a random person having the heart disease. What is the probability of that person having the disease if a 2nd test comes back positive? What about a 3rd test?

I have added this as an exercise to Chapter 6 of the book and provided a full answer using a Bayesian network. In summary, if the tests are genuinely independent (each with the same level of accuracy), then when two tests are positive the probability of the disease rises to 28.592%, and when three tests are positive it rises to 88.899%.
However, the tests may be dependent on each other, and it is also possible that there are personal features of the patient that lead to common test errors. These two situations (with particular assumed prior probabilities for the dependencies) lead to a far lower probative value for the multiple positive test results. As shown in the solution, we get:
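
For readers who want to check the independent-tests figures themselves, here is a minimal Python sketch that applies Bayes' theorem directly (the function name is my own; the dependent-tests scenarios discussed below require the full Bayesian network in the book, with its assumed priors, so they are not reproduced here):

```python
# Posterior probability of disease after n independent positive tests,
# using the figures from the Harvard question: prior 1 in 1000,
# sensitivity 100%, specificity 95%.

def posterior_after_positives(prior, sensitivity, specificity, n_positive_tests):
    """P(disease | n independent positive test results)."""
    p_pos_given_disease = sensitivity ** n_positive_tests            # 1.0 for a perfect test
    p_pos_given_no_disease = (1 - specificity) ** n_positive_tests   # 0.05 per test
    numerator = prior * p_pos_given_disease
    denominator = numerator + (1 - prior) * p_pos_given_no_disease
    return numerator / denominator

for n in (1, 2, 3):
    p = posterior_after_positives(prior=0.001, sensitivity=1.0,
                                  specificity=0.95, n_positive_tests=n)
    print(f"{n} positive test(s): P(disease) = {p:.3%}")

# 1 positive test(s): P(disease) = 1.963%
# 2 positive test(s): P(disease) = 28.592%
# 3 positive test(s): P(disease) = 88.899%
```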

When the tests are directly dependent and two are positive the probability of disease only increases to 3.229%; with all three positive the probability only increases to 4.002%.

When there is a common source of error and two tests are positive, the probability of disease increases to 13.805%; with all three tests positive, it increases to 64.023%.

Tuesday 19 June 2012

Prosecutor fallacy again in media reporting of David Burgess DNA case

There are numerous reports today of the trial of David Burgess, who is accused of killing Yolande Waddington in the 1960s. Burgess had been ruled out as a suspect at the time of the crime because his blood type was found not to match the blood on a sweater belonging to Yolande. However, new DNA analysis has found that Burgess's DNA profile does match that of blood found on a sack that was at the crime scene.

Following on from previous blog postings, it is clear that reporters are once again committing the prosecutor fallacy, even though it appears not to have been committed in court (although, as I explain below, I believe that other errors were made in court). For example, the Sun provides a classic example of the prosecutor fallacy:
Scientists said the chances of the DNA on the sack not belonging to him were less than “one in a billion”. 
However, both the Mail and Guardian report what I assume were the actual words used by the forensic scientist Mr Price in court:
... the probability of obtaining this result if it is due to DNA from an unknown person who is unrelated to David Burgess is smaller than one in a billion, a thousand million.
Ignoring the fact that all kinds of testing/cross-contamination errors have not been factored into the random match probability of one in a billion, there is nothing wrong with the above statement, because if we let:

  • H be the hypothesis "DNA found at scene does not belong to defendant or a relative".
  • E be the evidence "DNA found is a match to defendant".
Then the probability of E given H (which is what is stated in the above quote) is indeed one in a billion.
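
To make the distinction concrete, here is a minimal Python sketch (my own illustration, not anything presented in court) using the odds form of Bayes' theorem. The one-in-a-billion figure is the likelihood P(E | H); converting it into P(H | E) requires prior odds for H, which must come from all the other evidence in the case. The prior odds values below are hypothetical, and relatives (who have a much higher match probability) are ignored:

```python
# Odds form of Bayes: posterior odds = prior odds x likelihood ratio.
# The quoted one-in-a-billion figure is P(E | H), not P(H | E).

p_E_given_H = 1e-9        # random match probability for an unrelated person
p_E_given_not_H = 1.0     # a match is certain if the DNA really is Burgess's
likelihood_ratio = p_E_given_H / p_E_given_not_H   # = 1e-9

for prior_odds in (1_000_000, 1_000, 1):           # hypothetical odds in favour of H before the DNA evidence
    posterior_odds = prior_odds * likelihood_ratio # Bayes in odds form
    p_H_given_E = posterior_odds / (1 + posterior_odds)
    print(f"prior odds {prior_odds:>9,}:1 for H  ->  P(H | E) = {p_H_given_E:.2e}")
```

The point of the sketch is simply that the posterior probability of H varies by orders of magnitude depending on the prior; it only equals one in a billion in the special case where the prior odds are even.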

But what is VERY interesting is that Burgess was ruled out in the original investigation because his blood type did NOT match the sample from the scene. To explain the 'change' we get the following quote in the Guardian article:
"Mr Price said the initial test on the bloodstained sweater may have been flawed and that the difference between Burgess’s blood type and that found on the sweater could be due to a mistake in the process that was known to occur sometimes."
In other words, the forensic scientist claims that the lack of a positive blood match first time round was "due to a mistake in the process", but he appears never to consider the possibility of a mistake in the process leading to a positive DNA match. Perhaps conveniently for the CPS, it appears the sweater has somehow got 'lost' (curious that this crucial crime-scene item should vanish, whereas the sack, which was never tested originally, should remain), so there was no attempt to test the DNA of the blood on the sweater.

If the latest ruling from the USA is anything to go by, there is going to be even less chance of questioning the accuracy of DNA and other types of forensic analysis in future.

Friday 8 June 2012

DNA Cold-case: The prosecutors' fallacy just will not go away


The Daily Mail has an interesting report about the trial of John Molt, who is accused of being the masked rapist who attacked a 15-year-old girl. The report says Molt was "caught after his father provided a DNA sample 12 years later".

In a previous report about the Stephen Lawrence trial I noted that the media was wrongly attributing statements to expert witnesses - reporting them as if they had made the prosecutors' fallacy when in reality they did not. However, in this latest case the Daily Mail report seems to be directly quoting the prosecuting barrister:
Carolyn Gardiner, prosecuting at Chelmsford Crown Court, said: ‘The probability of that semen not coming from Jon Molt is one in a billion. Members of the jury, just think what a huge number that is.’
If that quote is accurate then it would be hard to find a more blatant example of the prosecutors' fallacy and a more ill-informed way of presenting it (and that's even if we assume that the one in a billion DNA random match probability is roughly correct, whereas it is almost certainly massively underestimated). 

The fact is that even very clever people continue to make the prosecutors' fallacy and it will continue to bias judgements. Shortly before taking on his current high-profile role no less a judicial luminary than Lord Justice Leveson told me that he still did not understand the fallacy.

Yet, at a recent meeting I attended at the Home Office (about using Bayes in court), a senior barrister asserted that 'no lawyer would EVER make the prosecutors' fallacy because it is now so widely known and understood'. This same barrister (along with others at the meeting) also ridiculed the point I had made about the Stephen Lawrence case, namely that if media reporters with considerable legal knowledge made the fallacy when interpreting what the experts had said, then it is almost certain that the jury misinterpreted the evidence in the same way. In other words, although the prosecutor fallacy was not stated in court, it may still have been made in the jury's decision-making. The barrister claimed that 'no jury member sitting in court listening to the evidence would possibly make that mistake'.

Friday 20 April 2012

News on DNA evidence

The Australian High Court has ruled that it is OK to state - as a percentage - the probability of a person NOT sharing a given DNA profile with a defendant. The case in question involved Yusuf Aytugrul, who was convicted of murdering his ex-girlfriend (Bayrak). The court ruled that he was not unfairly prejudiced by the way the DNA evidence was presented to the jury. At Aytugrul's trial in 2009, expert evidence was presented of a match between the accused's DNA and mitochondrial DNA taken from a hair found on Bayrak's thumbnail. Specifically, the expert stated that 99.9 percent of people do not share the DNA profile with the accused. Full report here.

Unfortunately, what was not discussed in the Appeal was whether the 99.9% figure was correct or meaningful. Since mitochondrial DNA is known to have special problems, the 99.9% figure is almost certainly not correct. People are beginning to realise that not all DNA evidence is especially probative. 'Matches' involving mixture DNA and/or low-template DNA may not tell you very much at all, especially given the possibility of cross-contamination and subjectivity in the analysis. I am currently working on a case where this is a particular concern. Itiel Dror's recent paper Subjectivity and bias in forensic DNA mixture interpretation is a must-read, as it highlights the extent to which the experts' results are subjective and influenced by contextual bias.
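
Even taking the 99.9% figure at face value, the percentage framing can make weak evidence sound overwhelming. Here is a back-of-the-envelope Python sketch; the population figure is my own illustrative assumption, not something from the case:

```python
# "99.9% of people do not share the profile" is the same statement as
# "1 in 1,000 people DO share it" - which sounds rather less impressive.

exclusion_rate = 0.999                  # figure quoted at trial
match_prob = 1 - exclusion_rate         # i.e. roughly 1 in 1,000 people share the profile
population = 20_000_000                 # illustrative population (roughly Australia's)

expected_matches = population * match_prob
print(f"Random match probability: about 1 in {1 / match_prob:,.0f}")
print(f"Expected matching people in a population of {population:,}: about {expected_matches:,.0f}")
# -> tens of thousands of people besides the accused would be expected to share the profile
```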

More general concerns about the validity of a range of forensic evidence are covered in this very interesting programme.

Tuesday 27 March 2012

Judea Pearl wins Turing Award

Judea Pearl, who has done more than anybody to develop Bayesian networks and causal reasoning, has won the 2011 Turing Award for work on AI reasoning.

We are also delighted to announce that Judea has written the Foreword for our forthcoming book.

The danger of p-values and statistical significance testing

I have just come across an article in the Financial Times (it is not new - it was published in 2007) titled "The Ten Things Everyone Should Know About Science". Although the article is not new, the source where I found the link to it is, namely right at the top of the home page for the 2011-12 course on Probabilistic Systems Analysis at MIT. In fact the top bullet point says:
The concept of statistical significance (to be touched upon at the end of this course) is considered by the Financial Times as one of " The Ten Things Everyone Should Know About Science".
The FT article does indeed list "Statistical significance" as one of the ten things, along with: Evolution, Genes and DNA, Big Bang, Quantum Mechanics, Relativity, Radiation, Atomic and Nuclear Reactions, Molecules and Chemical Reactions, and Digital data. That is quite illustrious company, and in the sense that it helps promote the importance of correct probabilistic reasoning I am delighted. However, as is fairly common, the article assumes that 'statistical significance' is synonymous with p-values. The article does hint at the fact that there might be some scientists who are sceptical of this approach when it says:
Some critics claim that contemporary science places statistical significance on a pedestal that it does not deserve. But no one has come up with an alternative way of assessing experimental outcomes that is as simple or as generally applicable.
In fact, that first sentence is a gross understatement, while the second is simply not true. To see why the first sentence is a gross understatement, look at this summary (which explains what p-values are) that appears in Chapter 1 of our forthcoming book (you can see full draft chapters of the book here). To see why the second sentence is not true, look at this example from Chapter 5 of the book (which also shows why Bayes offers a much better alternative). Also look at this (taken from Chapter 10), which explains why the related 'confidence intervals' are not what most people think (and how this dreadful approach can also be avoided using Bayes).
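
To make the distinction concrete, here is a toy illustration of my own (not taken from the book): an outcome that passes the conventional p < 0.05 threshold while a simple Bayesian comparison still leaves the null hypothesis quite plausible.

```python
# Toy example: 527 heads in 1000 tosses of a possibly biased coin.
from math import comb, log, exp

n, k = 1000, 527

# Frequentist: one-sided p-value, P(at least k heads | coin is fair)
p_value = sum(comb(n, x) * 0.5**n for x in range(k, n + 1))
print(f"one-sided p-value under H0 (fair coin): {p_value:.3f}")   # roughly 0.047, i.e. 'significant'

# Bayesian: compare H0 (p = 0.5) against a modest alternative H1 (p = 0.55), equal priors
loglik_h0 = k * log(0.5) + (n - k) * log(0.5)
loglik_h1 = k * log(0.55) + (n - k) * log(0.45)
bayes_factor_h1_vs_h0 = exp(loglik_h1 - loglik_h0)
posterior_h0 = 1 / (1 + bayes_factor_h1_vs_h0)                    # with equal prior odds
print(f"posterior probability of the fair coin: {posterior_h0:.2f}")   # roughly 0.4
```

The p-value is the probability of data at least this extreme given the null hypothesis; it is not the probability of the null hypothesis given the data, which is what decision-makers actually need.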

Hence it is very disappointing that an institution like MIT should be perpetuating the myths about this kind of significance testing. The ramifications of this myth have had (and continue to have) a profound negative impact on all empirical research. The book "The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives (Economics, Cognition & Society)" by Ziliak and McCloskey (The University of Michigan Press, 2008) provides extensive evidence of flawed studies and results published in reputable journals across all disciplines. It is also worth looking at the article "Why Most Published Research Findings Are False". Not only does this mean that 'false' findings are published, but also that more scientifically rigorous empirical studies are rejected because the authors have not performed the dreaded significance tests demanded by journal editors or reviewers.

This is something we see all the time, and I can share an interesting anecdote on this. I was recently discussing a published paper with its author. The paper was specifically about using the Bayesian Information Criterion to determine which model was producing the best prediction in a particular application. The Bayesian analysis was the 'significance test' (only a lot more informative). Yet at the end of the paper was a section with a p-value significance-test analysis that was redundant and uninformative. I asked the author why she had included this section, as it rather undermined the value of the rest of the paper. She told me that the paper she submitted did not have this section, but that the journal editors had demanded a p-value analysis as a requirement for publishing the paper.

Thursday 26 January 2012

Prosecutor Fallacy in Stephen Lawrence case?

If you believe the 'respectable' media reporting of the Stephen Lawrence case then there were at least two blatant instances of the prosecutor fallacy committed by expert forensic witnesses.

For example the BBC reported forensic scientist Edward Jarman as saying that
"..the blood stain on the accused's jacket was caused by fresh blood with a one-in-a-billion chance of not being victim Stephen Lawrence's."
Numerous other reports contained headlines with similar claims about this blood match. The Daily Mail, for example, states explicitly:
".. the chances of that blood belonging to anyone but Stephen were rated at a billion to one, the Old Bailey heard."
Similar claims were made about the fibres found on the defendants' clothing that matched those from Stephen Lawrence's clothes.

If the expert witnesses had indeed made the assertions as claimed in the media reports then they would have committed a well-known probability fallacy which, in this case, would have grossly exaggerated the strength of the prosecution evidence. The fallacy - called the prosecutor fallacy - has a long history (see, for example, our reports here and here, as well as a summary explanation here). Judges and lawyers are expected to ensure the fallacy is avoided because of its potential to mislead the jury. In practice the fallacy continues to be made in courts.

However, despite the media reports, the fallacy was not made in the Stephen Lawrence trial. The experts did not make the assertions that the media claimed they made. Rather, it was the media who misunderstood what the experts said, and it was they who made the prosecutor fallacy. What is still disturbing is that, if media reporters with considerable legal knowledge made the fallacy, then it is almost certain that the jury misinterpreted the evidence in the same way. In other words, although the prosecutor fallacy was not stated in court, it may still have been made in the jury's decision-making.

To understand the fallacy and its impact formally, suppose:
  • H is the hypothesis "the blood on Dobson's jacket did not come from Stephen Lawrence"
  • E is the evidence "blood found matches Lawrence's DNA".

What the media stated is that the forensic evidence led to the probability of H (given E) being a billion to one.

But, in fact, the forensic experts did not (and could not) conclude anything about the probability of H (given E). What the media have done is confuse this probability with the (very different) probability of E given H.

What the experts were stating was that (provided there was no cross-contamination or errors made) the probability of E given H is one in a billion. In other words, what the experts were asserting was

"The blood found on the jacket matches that of Lawrence and such a match is found in only one in a billion people. Hence the chances of seeing this blood match evidence is one in a billion if the blood did not come from Lawrence".

In theory (since there are about 7 billion people on the planet) there should be about seven people (including Lawrence) whose DNA would match that of the blood found. If none of the others could be ruled out as the source of the blood on the jacket, then the probability of H given E is not one in a billion, as stated by the media, but 6 out of 7. This highlights the potential enormity of the fallacy.

Even if we could rule out everyone who never came into contact with Dobson, that would still leave, say, 1000 people. In that case the probability of H given E is about one in a million. That is, of course, a very small probability, but the point is that it is a very different probability from the one the media stated.
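
Here is a back-of-the-envelope Python sketch of those two figures, following the same counting as above (a uniform prior over whoever could have left the blood, and no allowance for testing errors); it is purely illustrative:

```python
# H = "the blood on the jacket did not come from Stephen Lawrence"
# E = "the blood found matches Lawrence's DNA"

random_match_prob = 1e-9   # the one-in-a-billion figure

# Scenario 1: anyone on the planet is a possible source.
world_population = 7_000_000_000
matching_people = round(world_population * random_match_prob)    # about 7, including Lawrence
p_H_given_E = (matching_people - 1) / matching_people            # other matchers / all matchers
print(f"Whole-world scenario: P(H | E) = {matching_people - 1}/{matching_people} = {p_H_given_E:.2f}")

# Scenario 2: only ~1000 people who came into contact with Dobson are possible sources.
plausible_sources = 1_000
expected_other_matchers = plausible_sources * random_match_prob  # one in a million
p_H_given_E = expected_other_matchers / (1 + expected_other_matchers)
print(f"1000-contact scenario: P(H | E) = {p_H_given_E:.2e}  (about one in a million)")
```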

The main reason why the fallacy keeps on being repeated (by the media at least) in these kinds of cases is that people cannot see any real difference between a one-in-a-billion probability and a one-in-a-million probability (even though the latter is 1000 times more likely). They are both considered 'too small'.

Finally, it is also important to note that the probabilities stated were almost meaningless because of the simplistic assumptions (grossly favourable to the prosecution case) that there was no possibility of either cross-contamination of the evidence or errors in its handling and DNA testing. The massive impact such error possibilities have on the resulting probabilities is explained in detail here.