Improving public understanding of probability and risk with special emphasis on its application to the law. Why Bayes theorem and Bayesian networks are needed
Most people watching this weekend's Premier League matches were genuinely concerned about the penalty decisions that were awarded against Spurs (v Newcastle) and for Manchester Utd (v Brighton).
As a result, today there have been several posts sharing the above "Team Fortune Ladder" from UREF. This is a record of the net VAR decisions that went in favour of each team during last season (2019-20). And it seems that the 'good fortune' of Man Utd and the 'bad fortune' of Spurs is a continuation of a pattern from last season, with Man Utd clear at the top with a net total of 17 decisions in their favour and Spurs clear at the bottom with a net 12 against.
But are these results due to luck or to bias? We can do a simple statistical analysis (full details below for those interested) to see whether the assumption that there was no bias is realistic. Let's assume there were on average 2 VAR decisions per game last season. Then, for any given team, this means that over the season they were involved in 76 VAR decisions (as each team played 38 games). With the 'no bias' assumption we would expect VAR decisions to 'balance out', i.e. that the net total of decisions in favour should be 0 (like Sheffield Utd in the table). However, because of inevitable statistical variation some teams will have a positive net number and some a negative one; we'd expect most teams to be within roughly plus or minus 9. In fact, we can calculate the probability that any given team would end up over the season with a net total of more than 16 decisions in favour. The answer is just 2.5%, i.e. this is highly unlikely. Similarly, we can calculate the probability that any given team would end up over the season with a net total of more than 11 decisions against them. The answer is just over 10%, i.e. this is also unlikely.
However, we have to take account of the fact that there are 20 teams. It turns out that the probability that at least one team will end up with a net total of more than 16 decisions in their favour is actually 40%. So, in fact, this is not that unlikely; we would expect to see something like this happen in about 2 out of every 5 seasons. Similarly, the probability that at least one team will end up with a net total of more than 11 decisions against them is about 89%, which is very likely, i.e. we would expect to see something like this happen in about 9 out of 10 seasons.
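For those who want to check the numbers, here is a minimal sketch of the calculation. It assumes each of a team's 76 VAR decisions is independently a 50/50 call under 'no bias', and it treats the 20 teams as independent, which is only an approximation (a decision in favour of one team is a decision against another):

```python
# 'No bias' model: decisions in a team's favour X ~ Binomial(76, 0.5),
# and the net total is (in favour) - (against) = 2*X - 76.
import math
from scipy.stats import binom

n, p, teams = 76, 0.5, 20

sd_net = 2 * math.sqrt(n * p * (1 - p))   # about 8.7, so most teams within +/- 9

# Net total > 16 in favour means at least 47 of the 76 decisions in favour.
p_lucky = binom.sf(46, n, p)              # P(X >= 47), roughly 2.5%

# Net total > 11 against means at most 32 of the 76 decisions in favour.
p_unlucky = binom.cdf(32, n, p)           # P(X <= 32), just over 10%

# Probability that at least one of the 20 teams is that lucky / that unlucky.
p_any_lucky = 1 - (1 - p_lucky) ** teams      # about 40%
p_any_unlucky = 1 - (1 - p_unlucky) ** teams  # about 89%

print(sd_net, p_lucky, p_unlucky, p_any_lucky, p_any_unlucky)
```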
So, while individually Man Utd were exceptionally lucky and Spurs very unlucky, the table for 2019-20 provides only limited evidence of bias. However, the fact that the trend in favour of Man Utd and against Spurs has continued into the 2020-21 season means that this conclusion could change once I add the new data (I understand Spurs have a net 5 against in just 3 games this season). Moreover, it is interesting to note that, a while ago, in published work done with my colleague Anthony Constantinou we did find real evidence of referee bias in favour of Man Utd.
And let's not forget this - from Spurs against Man Utd in 2005 (before VAR) - that was ruled to be not over the line.
See also:
Constantinou, A. C., Fenton, N. E., & Pollock, L. (2014). "Bayesian networks for unbiased assessment of referee bias in Association Football". Psychology of Sport & Exercise, 15(5), 538–547. http://dx.doi.org/10.1016/j.psychsport.2014.05.009. Pre-publication draft here.
The Government's official COVID data are on the website https://coronavirus.data.gov.uk. These are the data I have been using for my regular updates of cases per 1000 people tested (because this crucial plot is not on the government website).
I discovered today that the data on hospital admissions are not only flawed but also reveal a systematic problem with any data based on "suspected" COVID cases. The website says:
“Wales include suspected COVID-19 patients while the other nations include only confirmed cases”
By looking at the raw data (such as that above, which includes the most recent complete data for all 4 nations) it is clear that Wales is massively overestimating the number of COVID hospital admissions. Note that Scotland consistently has about twice the number of new confirmed cases as Wales, yet Wales typically has about TWENTY times as many admissions as Scotland. In fact, based on the above data, whereas in England, Scotland and NI on average around 4% of COVID cases are admitted to hospital, for Wales the figure is 61%.
Indeed, in July (not shown above but you can find it all on the website) when there were almost zero hospital admissions in the rest of the UK (and very few new cases) Wales was typically reporting 50-100 new COVID hospital admissions every day.
Unless Wales is routinely admitting people who should not be hospitalized, we can conclude that at least 90% of the Wales 'COVID' admissions were not COVID at all and that 'suspected' COVID cases are generally not COVID. With UK admissions currently being recorded as around 250 per day, including around 60 in Wales, this means the real UK admissions number should be reduced to less than 200.
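As a rough back-of-the-envelope check of those figures (using only the numbers quoted above, and assuming Wales's true admission rate is similar to the rest of the UK):

```python
# Figures quoted in the post: ~4% of confirmed cases admitted in England/Scotland/NI,
# but 61% of Welsh confirmed cases recorded as 'COVID admissions'.
rate_elsewhere = 0.04
rate_wales = 0.61

# If Wales's true admission rate matched the rest of the UK, the share of its
# recorded 'COVID' admissions that are genuinely COVID would be roughly:
genuine_fraction = rate_elsewhere / rate_wales            # about 6.5%
print(f"Share of Welsh 'COVID' admissions that look genuine: {genuine_fraction:.1%}")
print(f"Share that do not: {1 - genuine_fraction:.1%}")   # over 90%

# Adjusting the UK daily total quoted above (~250 per day, ~60 of them in Wales):
uk_daily, wales_daily = 250, 60
adjusted_uk = uk_daily - wales_daily + wales_daily * genuine_fraction
print(f"Adjusted UK daily admissions: about {adjusted_uk:.0f}")   # just under 200
```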
There are two lessons to be drawn from this:
Any graphs/analysis of hospital admissions should exclude Wales
Any data about suspected COVID cases (whether with respect to hospital admissions, deaths, or anything else) should be completely ignored or treated as massively exaggerated
There has been much recent debate - and controversy - about the impact of false positives in Covid testing. As I showed in this video: if the current rate of infection is 1 in 200 (i.e. 1 in every 200 people is currently infected with the virus) and if a person is selected at random to be tested then, if that person tests positive, there is actually only about a 1 in 6 chance (less than 17%) that the person actually has the virus. This assumes the test has a 2% false positive rate (i.e. 2 out of every 100 people who don't have the virus will wrongly test positive) and a 20% false negative rate (i.e. 20 out of every 100 people with the virus will wrongly test negative).
Obviously the 1 in 200 pre-test (i.e. 'prior') probability assumption is critical. A person who is tested because they have been in contact with somebody confirmed as having the virus will have a much higher pre-test probability of having the virus. If we assumed it was 50% then, if that person tests positive, there is a 97% chance they have the virus.
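Both of these numbers come straight from Bayes' theorem; here is a minimal sketch of the calculation:

```python
# P(infected | positive test) via Bayes' theorem.
def prob_infected_given_positive(prior, false_pos_rate, false_neg_rate):
    sensitivity = 1 - false_neg_rate
    p_positive = prior * sensitivity + (1 - prior) * false_pos_rate
    return prior * sensitivity / p_positive

# Randomly selected person: prior 1 in 200, 2% false positive rate, 20% false negative rate.
print(prob_infected_given_positive(1 / 200, 0.02, 0.20))   # about 0.17, i.e. roughly 1 in 6

# Known contact of a confirmed case: prior 50%, same test characteristics.
print(prob_infected_given_positive(0.5, 0.02, 0.20))       # about 0.97-0.98
```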
The BMJ have produced an excellent infographic which allows you to adjust all of the key parameters (namely the pre-test probability, false positive rate, and false negative rate). However, there is a severe limitation. The graphic does not allow you to enter pre-test probabilities of less than 1% (as I found out when I tried to enter the value 0.5% that I had used in my video - it automatically rounded it up to 1%). This is a curious limitation, given that the current infection rate is widely believed to be much lower than 1%; if it was 1% this would mean 680,000 people in the UK were infected right now, i.e. not including those who were previously infected (if it was that high this would confirm the belief of many that the virus goes unnoticed in most people).
Moreover, it is also very curious that the default setting in the BMJ infographic has the pre-test probability set at a ludicrously high 80%. Even for a person with symptoms who has been in contact with a person with Covid this is actually too high (see this post and video). With that prior assumption somebody testing positive is, of course, almost certain to have the virus.
By focusing on the notion that people getting tested have a relatively high pre-test probability of having the virus, an article in the Huffington Post uses the BMJ infographic to hammer those claiming that most people testing positive do not have the virus. For example, they suggest a scenario where the pre-test probability is 20% and the false positive rate is 1%. With these assumptions, somebody testing positive has a 94% chance of having the virus.
In reality there is massive uncertainty about all three of these parameters, as explained in this article. Very early on during this crisis we argued (see also the other links below) that a more intelligent approach to data collection and analysis was needed to learn these parameters; in particular, there was a need to consider causal models to explain the observed data. A basic causal model showed that it was critical to distinguish between people who had no, mild and severe symptoms, both when recording those being tested and when recording those testing positive. Yet there are no publicly available data on those being tested which make these distinctions (we just have 'total tested' per day), and neither do we have them for those testing positive; all we can do is make some crude inferences based on the number hospitalised, but even then this published daily number includes all patients hospitalised with Covid, not just those hospitalised because of Covid. And, even worse, there are fundamental errors in some of the UK data on hospital admissions.
There is even less truly known about the accuracy of the various tests (*see the statement below by my colleague Dr Scott McLachlan on this issue) because - in the absence of any 'gold standard' test for Covid - there is no way to determine this accuracy. And there have been no independent evaluations of test accuracy.
There is plenty of anecdotal evidence that most people testing positive either really don't have the virus or will be totally unaffected by it. For example, as part of the standard testing of professional footballers earlier this week 18 out of 25 of Leyton Orient's players - and many of their staff - tested positive. None of these people had - or will have - any symptoms at all. The same is true of the many other footballers and staff (including many older managers/trainers) who have tested positive and the hundreds of Scottish University students who tested positive this week.
That is why I have been posting on Twitter updates of my cases (i.e. positive tests) per 1000 tested graph as a contrast to the naive cases-only graph which all the media post (obviously cases have been rising as the number of tests has been rising). This simply takes the (separate) daily cases (i.e. those testing positive) and number tested from https://coronavirus.data.gov.uk (the Government website) and divides the former by the latter:
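For anyone who wants to reproduce that graph, here is a minimal sketch of the calculation. It assumes you have downloaded a daily CSV from the dashboard; the file name and column names ('newCases', 'newPeopleTested') are hypothetical placeholders and should be changed to match whatever export you actually use:

```python
import pandas as pd

# Daily data downloaded from https://coronavirus.data.gov.uk (column names are placeholders).
df = pd.read_csv("uk_daily_data.csv", parse_dates=["date"]).sort_values("date")

# Positive tests per 1000 people tested, per day.
df["cases_per_1000_tested"] = 1000 * df["newCases"] / df["newPeopleTested"]

ax = df.plot(x="date", y="cases_per_1000_tested", legend=False)
ax.set_ylabel("Positive tests per 1000 people tested")
ax.figure.savefig("cases_per_1000_tested.png")
```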
It is curious that the website produces all kinds of plots but not this most obvious and informative one. As we explained here, this shows that, contrary to the Government claims made this week, there is no real evidence of exponential growth in the virus.
People have been responding by saying 'ah, but in April we were only testing those hospitalised with severe symptoms'. This is generally true (in fact the testing 'strategy' - i.e. who primarily gets tested - has changed several times, which is why we argued long ago that it needs to be factored into a causal model), so the proportion of positives among those tested in April was obviously a lot higher than now. However, it is also the case that the proportion of people now testing positive who will be totally unaffected by the virus (whether they have it or not) is much higher. That is why we need to distinguish between mild, severe and asymptomatic cases. We really need to see the plot of severe cases and, as I mentioned above, hospitalisations are the best approximation we have for that. However, that measure is also compromised for the reasons explained above, and by the fact that we are now entering the normal flu season, when hospital admissions inevitably rise significantly.
*On the issue of what is known about test accuracy, Dr Scott McLachlan says:
During the last five months we have collected every preprint and journal publication that we could locate on COVID-19 testing with rt-PCR and antibodies. The issues of false positives (FP) and false negatives (FN) are more complicated than test product developers, some academic authors, and the mass and social media have presented.
First, discussion of a single FP or FN rate completely misses the fact that there are multiple tests from different vendors being used at the moment. The NHS alone are using at least five different primary rt-PCR tests. Each has a different manufacturer or seeks a different RNA target, and therefore has a different sensitivity and specificity profile. What we do not yet have is independent laboratory verification of the manufacturers' claimed sensitivity and specificity. As well as those five there is a range of perhaps three or four others, including the DNANudge cartridge test that comes from the UCL company that also markets DNA NFC identity wristbands in malls such as White City Westfield. Their test documentation focuses predominantly on their claims of near-zero FNs - because FNs have been the leading subject of much media and academic literature in recent weeks - and brushes over the fact that, by their own numbers, they produce around 3% FPs.
Second, we have very little in the way of credible independent third-party verification for any of the COVID tests. Everything we get at the moment either consists of self-validation by the manufacturer's lab (which should not be accepted wholesale without independent verification, but sadly has been during COVID times), or of poorly constructed literature from well-meaning medics who, in one example, used an unvalidated PCR test as the standard by which to assess the accuracy of chest CT/CXR for diagnosing COVID (for the record, the CT did far better than the PCR test they used as the *cough* gold standard… and they ended up acknowledging the PCR test produced FNs and FPs at far higher levels than expected).
As best as we have been able to identify from the literature collected that has assessed the rt-PCR and lab-based antibody tests:
FNs are occurring at a rate of between 3% and 30% for rt-PCR COVID-19 tests, depending on the test type, manufacturer, and the lab that ran the tests.
FPs are occurring at a rate of between 0.8% (the lowest value accepted in the literature) and 7.9% (in a recent EU-based preprint) for rt-PCR COVID-19 tests.
Sensitivity/specificity for rt-PCR COVID-19 tests ranges from 87% to 100%, depending on whose test it is and whether they performed their own testing or wrote their own academic report.
The antibody tests have an accuracy of somewhere between 30% and 93%, again depending on whose antibody test you review, whether it was IgG or IgM, and whether they averaged the score of all antibodies the test assayed for or reported them individually. Antibody tests tended to be really good at identifying one antibody (often IgG), and less accurate or specific for the other (most often IgM).
A short paper by Neil et al uses Bayesian analysis to examine the latest (up to 22 Sept) Covid data to determine whether there is evidence to support the Government claim of an exponential 'second wave'. It concludes that, despite the number of tests done, there remains insufficient solid evidence to support any claim that there is an exponential increase. There is no reason to panic.
Infection prevalence between April and September (week 1 is 12 Aug and week 39 is 19 Sept)
I have written many times before on this blog about the benefits and (extreme) limitations of using the likelihood ratio (LR) to determine the strength of forensic match evidence - such as DNA evidence. The LR is the probability of the evidence (i.e. 'finding a match') under the prosecution hypothesis divided by the probability of the evidence under the defence hypothesis.
When a forensic expert calculates a high LR for the prosecution hypothesis "DNA found at the crime scene matches the DNA profile of the suspect", they typically conclude that ‘this provides strong support for the prosecution hypothesis’. However, it is well known that a high LR does not necessarily mean a high (posterior) probability that the hypothesis is true, because this depends on the prior probability of the hypothesis. Assuming that it does is an example of what is called the prosecutor's fallacy; judges are expected to warn both lawyers and expert witnesses when this mistake is made in court. What is less well understood is that, in order to draw any rational conclusions at all about the probative value of a high LR, the defence hypothesis used to determine the LR has to be the negation of the prosecution hypothesis (formally, we say the hypotheses must be mutually exclusive and exhaustive). So, in the above example the defence hypothesis has to be ‘the DNA does not come from the suspect’. Yet in practice it is common to use ‘the DNA comes from a person unrelated to the defendant’, which of course is not the negation of the prosecution hypothesis. When the defence hypothesis is not the negation, a high LR could actually mean the exact opposite of what is claimed: it could provide more support for the hypothesis that the DNA does not come from the suspect than it does for the prosecution hypothesis.
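To see why the prior matters so much even when the hypotheses are mutually exclusive and exhaustive, here is a minimal numerical sketch (the numbers are purely illustrative, not taken from any real case):

```python
# Posterior probability of the prosecution hypothesis from the prior and the LR,
# using the odds form of Bayes' theorem (valid only when the two hypotheses are
# mutually exclusive and exhaustive).
def posterior_prob(prior_prob, likelihood_ratio):
    prior_odds = prior_prob / (1 - prior_prob)
    posterior_odds = likelihood_ratio * prior_odds
    return posterior_odds / (1 + posterior_odds)

# An LR of 10,000 sounds overwhelming, but if the prior probability that the DNA
# came from the suspect is 1 in a million (say, no other evidence against them),
# the posterior probability is only about 1%.
print(posterior_prob(1e-6, 10_000))   # roughly 0.01
```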
While the problem of using the LR with hypotheses that are not mutually exclusive and exhaustive is known (but not widely understood) for 'single' DNA profiles (i.e. those where the DNA found can only possibly come from a single person), it is even more serious for DNA mixture profiles (i.e. those where the DNA is a mixture from at least two people). However, for mixture profiles, not having mutually exclusive and exhaustive hypotheses when using the LR is just one of several serious problems. A new paper shows the extent to which a very high LR for a ‘match’ from a DNA mixture profile – typically computed from probabilistic genotyping software – can be misleading even if the hypotheses are mutually exclusive and exhaustive. The paper shows that, in contrast to single-profile DNA ‘matches’, where the only residual uncertainty is whether a person other than the suspect has the same matching DNA profile, it is possible for all the genotypes of the suspect’s DNA profile to appear at each locus of a DNA mixture even though none of the contributors has that DNA profile. In the absence of other evidence, the paper shows it is possible to have a very high LR for the hypothesis ‘the suspect is included in the mixture’ even though the posterior probability that the suspect is included is very low. Yet in such cases a forensic expert will generally still report a high LR as ‘strong support for the suspect being a contributor’.
The problems are especially acute where there are very small amounts of DNA in the mixture (so-called low template DNA abbreviated to LTDNA). In certain circumstances, the use of the LR may have led lawyers and jurors into grossly overestimating the probative value of a LTDNA mixed profile ‘match’.
Following on from my analysis of the trend in Covid deaths (and using the same dataset), here is a plot (for each day since 8 April*) of the number of new cases per 1000 people tested. Contrary to what is being shown by the media (see below), this is the plot that should be used as the basis for decisions about if and when new social distancing or lockdown rules are needed.
When we consider that a higher proportion of new cases now (compared to the first 2 months) are minor or totally asymptomatic**, I find it rather incredible that the following graph of just new cases (which strongly suggests a 'second wave') is the one used by almost all media outlets. This graph does not take account of the increase in the number of people tested. Neither does it take account of the decrease in the proportion of deaths per case, nor of the false positive test rate (and so it further exaggerates the scale of the problem):
Of course this graph conveniently shows 'strong evidence' of a 'second wave' - something which better fits the narrative of a lot of influential people.
*This is the first date for which there is a record of number of people tested.
**As evidenced by the continued very low death rates shown in my previous piece:
One of the classic problems used to evaluate how well lay people perform probabilistic updating is the "Blue/Green Cab accident problem" (or equivalently with buses). The problem is usually expressed as follows:
A cab was involved in a hit-and-run accident at night. Two cab companies, the Green and the Blue, operate in the city.
90% of the cabs in the city are Green and 10% are Blue.
A witness identified the cab as Blue. The court tested the reliability of the witness under the circumstances that existed on the night of the accident and concluded that the witness correctly identified each of the two colours 80% of the time and failed 20% of the time.
What is the probability that the cab involved in the accident was Blue rather than Green?
A major finding of the original study was that many participants neglected the population base rate data (i.e. that 90% of the cabs are Green) entirely in their final estimate, simply giving the witness’s accuracy (80%) as their answer.
However, what was not considered in this and similar studies was any uncertainty about the witness reliability (this is called second order uncertainty). If the 80% figure was based on 80 correct answers in 100 tests then the 80% estimate seems reasonable. But what if it was based on only 5 tests, in which the witness was correct in 4? In that case there is much more uncertainty about the "80%" figure; using a Bayesian network model to get the correct 'normative' solution, it can be shown that while the witness's report does increase the probability of the cab being Blue, it simultaneously decreases our estimate of their future accuracy (because Blue cabs are so uncommon).
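Here is a minimal sketch of that normative calculation. It assumes (my illustrative choice, not necessarily the model used in the paper) that the witness's accuracy has a Beta(5, 2) prior, i.e. 4 correct out of 5 test trials on top of a uniform prior, and integrates that uncertainty out on a grid:

```python
import numpy as np

theta = np.linspace(0.001, 0.999, 999)        # grid of possible witness accuracies
prior_theta = theta**4 * (1 - theta)          # Beta(5, 2) density (unnormalised)
prior_theta /= prior_theta.sum()

prior_blue = 0.10                             # base rate of Blue cabs

# Probability the witness reports 'Blue', for each possible accuracy value.
p_report_blue = prior_blue * theta + (1 - prior_blue) * (1 - theta)
evidence = (p_report_blue * prior_theta).sum()

# Posterior probability the cab was Blue, integrating out the accuracy.
post_blue = (prior_blue * theta * prior_theta).sum() / evidence

# Posterior mean of the witness's accuracy, given that they reported 'Blue'.
post_accuracy = (theta * p_report_blue * prior_theta).sum() / evidence

print(f"P(cab was Blue | report 'Blue') = {post_blue:.2f}")          # rises from 0.10 to about 0.22
print(f"E[witness accuracy | report 'Blue'] = {post_accuracy:.2f}")  # falls from about 0.71 to about 0.65
```

For comparison, treating the 80% accuracy as exact gives the standard first-order answer: 0.8 × 0.1 / (0.8 × 0.1 + 0.2 × 0.9) ≈ 0.31.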
A new paper by lead author Stephen Dewitt that addresses how well lay people reason about this second order uncertainty has been published in Frontiers in Psychology. It was based on a study of 131 participants, who were asked to update their estimates of both the probability that the cab involved was Blue and the witness's accuracy, after the witness claimed it was Blue. While some participants responded normatively, most wrongly assumed that one of the probabilities was a certainty; for example, a quarter assumed the cab was Green, and thus that the witness was wrong, and so decreased their estimate of the witness's accuracy. Half of the participants refused to make any change to the witness reliability estimate.
Dewitt, S., Fenton, N. E., Liefgreen, A., & Lagnado, D. A. (2020). "Propensities and second order uncertainty: a modified taxi cab problem". Frontiers in Psychology, https://doi.org/10.3389/fpsyg.2020.503233
The UK now enters a new period of 'semi-lockdown' with private gatherings restricted to groups of 6 people. This seems to be in response to an 'increase in confirmed cases'; but such an increase is inevitable as more people are being tested (especially young people, many of whom will not suffer any major symptoms), and because of the false positive problem. It seems to me that the most relevant COVID-19 trend plot should be the number of deaths per 100K people tested. I've not seen such a plot, so I downloaded the relevant raw data from https://ourworldindata.org/covid-deaths and produced the following plot for the UK from 1 April 2020:
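For anyone who wants to reproduce it, here is a minimal sketch. The CSV URL and column names ('new_deaths', 'new_tests') reflect how the Our World in Data dataset was published at the time, but should be checked against the current download before relying on them:

```python
import pandas as pd

# Full Our World in Data COVID dataset (one row per country per day).
URL = "https://covid.ourworldindata.org/data/owid-covid-data.csv"
df = pd.read_csv(URL, parse_dates=["date"])

uk = df[(df["location"] == "United Kingdom") & (df["date"] >= "2020-04-01")].copy()
uk["deaths_per_100k_tested"] = 100_000 * uk["new_deaths"] / uk["new_tests"]

ax = uk.plot(x="date", y="deaths_per_100k_tested", legend=False)
ax.set_ylabel("Daily deaths per 100,000 people tested")
ax.figure.savefig("uk_deaths_per_100k_tested.png")
```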
As the 'tail' is too small to see, here is the same data, but just from 1 May 2020:
For comparison, here is the plot for the USA. There is not the same kind of decrease, because the USA is made up of many large states which had increasing infection rates at different times.
And another comparison is Israel, which is the first country to implement a second complete shutdown (beginning this week). The issue here is that, because the numbers are relatively small and outbreaks occur mainly in very localised orthodox (Jewish and Muslim) communities, bigger variation than in the UK/USA is inevitable.
And here are the three countries plotted together on the same scale. The UK is now doing much better than the USA and Israel (which are both now similar, despite being very different in May).
It does seem that the introduction now of the new UK rules is a very strange move.....