Friday, 17 January 2020

Understanding Bayes' theorem

As part of my presentation at the Wolfson Institute of Preventive Medicine today I got some audience participation using the Mentimeter tool. One of the things I did was to test the participants' understanding of Bayes before and after the seminar. I posed this question*:

The results were very interesting. Before the seminar the 'average' probability answer was 76% (but note the variation in the distribution).

After the seminar, the average was 9.4%.

The correct answer is just below 0.5%:
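The slide with the full working isn't reproduced here, but the calculation behind the answer is just Bayes' theorem applied to a diagnostic-test scenario. A minimal sketch, using purely illustrative numbers (a prevalence of 1 in 10,000, a sensitivity of 99.9% and a false positive rate of 2% are assumptions chosen to give a posterior just below 0.5%, not the actual figures from the question):

```python
# Bayes' theorem for a diagnostic test: P(disease | positive test)
def posterior(prevalence, sensitivity, false_positive_rate):
    true_pos = prevalence * sensitivity            # P(disease and positive)
    false_pos = (1 - prevalence) * false_positive_rate  # P(no disease and positive)
    return true_pos / (true_pos + false_pos)

# Illustrative (hypothetical) numbers: the low base rate dominates,
# so even a very accurate test yields a small posterior probability.
p = posterior(prevalence=0.0001, sensitivity=0.999, false_positive_rate=0.02)
print(f"{p:.2%}")  # about 0.5%
```

This is exactly the pattern behind the 76% answers above: people substitute the test's accuracy for the posterior and ignore the base rate.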

*Based on example from:  Neapolitan, Richard, Xia Jiang, Daniela P. Ladner, and Bruce Kaplan. 2016. “A Primer on Bayesian Decision Analysis With an Application to a Kidney Transplant Decision.” Transplantation 100 (3): 489–96.

Wednesday, 11 December 2019

Problems with DNA mixed profile evidence: the case of Florencio Jose Dominguez

I have written many times before about the potential problems when using the likelihood ratio (LR) as a measure of probative value of evidence. The problems are especially acute when the evidence consists of a tiny sample of DNA for which there are at least two people contributing - often referred to as a low template mixed DNA profile. Over the last year I have been working with lawyer Matthew Speradelozzi on a case in San Diego that challenged the use of new statistical analyses for such a mixed profile. The case was settled Friday when Florencio Jose Dominguez (who was sentenced to 50 years to life for a 2008 murder) was released after pleading guilty to a reduced charge.

The major controversy involves what is called probabilistic genotyping software (in this case STRmix from ESR), which claims to be able to analyse low template mixtures and determine the most likely contributing profiles by taking account of information like the relative peak heights at loci on the electropherogram (epg), the graph that DNA analysts use to decide which components (alleles) are present in a sample. The DNA analysts first determine the number of contributors in the mixture and then provide a LR that compares the probability of the evidence assuming the suspect is one of the contributors against the probability of the evidence assuming that none of the contributors is related to the suspect. While probabilistic genotyping software can be effective when the ‘sizes’ of the different contributors are very different, it is much less effective when they are not (as with Dominguez, who was claimed to be one of at least two unknown contributors of a similar ‘size’).

Moreover, in contrast to single profile DNA cases, where the only residual uncertainty is whether a person other than the suspect has the same matching DNA profile, it is possible for all the genotypes of the suspect’s DNA profile to appear at each locus of a DNA mixture even though none of the contributors has that DNA profile. In fact, in the absence of other evidence, it is possible to have a very high LR for the hypothesis ‘suspect is included in the mixture’ even though the posterior probability that the suspect is included is very low. Yet in such cases a forensic expert will generally still report a high LR as ‘strong support for the suspect being a contributor’, which is potentially highly misleading. We have submitted a paper describing this and many other issues relating to the reliability of probabilistic genotyping software and will report on it here in due course.
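The point that a high LR need not mean a high posterior probability is just the odds form of Bayes' theorem: posterior odds = LR × prior odds. A minimal sketch with purely hypothetical numbers (the prior here is an assumption for illustration only, not a figure from the case):

```python
# Odds form of Bayes' theorem: posterior odds = LR * prior odds
def posterior_probability(lr, prior):
    prior_odds = prior / (1 - prior)
    post_odds = lr * prior_odds
    return post_odds / (1 + post_odds)

# Hypothetical: a LR of 10,000 sounds like 'strong support', but if the
# prior probability that this suspect contributed is 1 in 10 million
# (e.g. nothing else links him to the sample), the posterior is tiny.
p = posterior_probability(lr=10_000, prior=1e-7)
print(f"{p:.4f}")  # about 0.001, i.e. almost certainly NOT a contributor
```

The LR on its own says nothing about the posterior; reporting it as 'strong support for the suspect being a contributor' conflates the two.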

ESR have issued their own statement.


Friday, 6 December 2019

Simpson's paradox again

I have a post about our paper on Simpson's paradox (we wrote this in 2015 but only just uploaded it to arXiv). The full paper is here.

The paradox is covered extensively in both “The Book of Why” by Pearl and Mackenzie (see my review) and also David Spiegelhalter’s “The Art of Statistics: How to Learn from Data” (see my review). Spiegelhalter's book contains a particularly good example of Cambridge University admissions data:

Overall the acceptance rate was higher for men than for women, but in each subject the rate was higher for women than for men. This is explained by the observation that women were more likely to apply for those subjects where the overall acceptance rates were lower. In other words the relevant causal model is this one:
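The arithmetic of the paradox is easy to reproduce. A minimal sketch with made-up numbers (not the actual Cambridge data): women have the higher acceptance rate within each subject, yet the lower rate overall, because they apply disproportionately to the more competitive subject.

```python
# (accepted, applied) by subject and sex -- illustrative numbers only
data = {
    "Subject A (high acceptance)": {"men": (80, 100), "women": (9, 10)},
    "Subject B (low acceptance)":  {"men": (2, 10),   "women": (30, 100)},
}

# Within each subject, women's rate beats men's
for subject, by_sex in data.items():
    rates = {sex: acc / app for sex, (acc, app) in by_sex.items()}
    print(subject, rates)

# But pooled over subjects, men's rate beats women's
totals = {
    sex: sum(d[sex][0] for d in data.values()) / sum(d[sex][1] for d in data.values())
    for sex in ("men", "women")
}
print(totals)
```

Here women's applications are concentrated in Subject B, where almost everyone is rejected, which drags their pooled rate below the men's.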

See also: Doctoring Data

Monday, 25 November 2019

Bayesian networks for cybersecurity risk

Our new paper describing a Bayesian network approach to cybersecurity (with lead author PhD student Jiali Wang) has been published in Computers & Security.

The print version will appear Feb 2020, but the online version is available now:

An open access pre-publication version is also available for download.


Monday, 7 October 2019

Bayesian networks research on treating injured soldiers gains DoD funding

The research which this new US DoD funding supports is the continuation of a long-term collaboration between the RIM (Risk and Information Management) Group at Queen Mary (with William Marsh taking the lead) and the Trauma Sciences Centre, led by surgeon Col Nigel Tai.

The underlying AI decision support is provided by causal Bayesian Networks. Two of the previous models can be accessed and run online at

Institute of Applied Data Science seminar: "Why machine learning from big data fails"


On 3 October Norman Fenton gave a seminar, "Why machine learning from big data fails – and what to do about it", at the Institute for Applied Data Science, Queen Mary University. Here are the PowerPoint slides for his presentation.

Tuesday, 24 September 2019

Naked Statistical Evidence

Consider this hypothetical scenario:
All 100 prisoners in a prison participate in a riot, and 99 of them participate in attacking and killing a guard (the other returned to his cell briefly after the riot). With the guard dead, all 100 prisoners then escape. The next day one of the prisoners is captured and charged with participating in the murder of the guard. While admitting to participating in the riot, the prisoner claims that he was the one who was not involved in attacking the guard. In the absence of any other evidence there is a 99% probability that the prisoner is guilty. Is this sufficient to convict?
The latest episode of the evidence podcast "Excited Utterance" has an excellent interview with our colleague Christian Dahlman of Lund University about this kind of "naked statistical evidence", available on iTunes, and also here:

Christian contrasts the above kind of naked statistical evidence with forensic evidence, such as a footprint found at a crime scene whose pattern 'matches' that of a shoe worn by the suspect. Whereas the causal link between the statistical evidence and guilt goes from the former to the latter, the causal link between the forensic evidence and guilt goes from the latter to the former:

This difference is central to the recent paper about the 'opportunity prior' that we co-authored with Christian. The fact that the suspect was at the prison means that he had the 'opportunity' to participate in the killing and that the prior probability for guilt given the naked statistical evidence is 99%.
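In the odds form of Bayes' theorem, that 99% opportunity prior is simply the starting point for any further updating. A minimal sketch with a hypothetical extra piece of evidence (the LR of 0.01 favouring innocence is an assumption for illustration, not a figure from the paper):

```python
# Update a prior with a likelihood ratio: posterior odds = LR * prior odds
def update(prior, lr):
    odds = (prior / (1 - prior)) * lr
    return odds / (1 + odds)

prior = 0.99  # 99 of the 100 prisoners with the opportunity attacked the guard

# Hypothetical exculpatory evidence 100 times more likely if he is innocent
p = update(prior, lr=0.01)
print(f"{p:.3f}")  # about 0.497: even one strong item of evidence
                   # drags the 99% prior down to a coin flip
```

The point is that the naked statistical evidence fixes the prior, but it does not settle the posterior once other evidence enters the picture.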

Christian talks about his latest paper, and at the end of the interview (24:50), he defends the Bayesian approach to legal evidence against attacks from some legal scholars (this is something we also did in our recent paper on countering the ‘probabilistic paradoxes in legal reasoning’ with Bayesian networks).