Probability and Risk: Why we know so little about COVID-19 from testing data - and why some extra easy-to-get data would make a big difference

This blog post provides some context for a short article (with Martin Neil, Scott McLachlan and Magda Osman) that was published in LockdownSkeptics and which has received quite a bit of attention.

The daily monitoring of COVID-19 cases (such as the very crude analysis we have been doing) is intended ultimately to determine what the 'current' population infection rate really is and how it is changing.

However, in the absence of a gold-standard test for COVID-19, it is always uncertain whether a person has the virus (let alone whether a person can infect someone else). Obviously this means that the population infection rate (sometimes referred to as the community infection prevalence rate) on a given day is also unobservable. The best we can do is estimate it from data that are observable. To get a feel for how complex this really is to do properly - and why current estimates are unreliable - here is a schematic** showing the information we need to get these estimates. The schematic is massively simplified (yes, really - see 'Notes about simplified assumptions' below).

Note that all the variables we need to 'know' for accurate estimation (the rectangular boxes coloured light red and white) are unobservable. Hence, we are totally reliant on the other things (the variables represented by yellow and blue rectangles) which are observable.

But here is the BIG problem: the only accessible daily data we have (e.g. from https://coronavirus.data.gov.uk/) are the two blue rectangles: number of tests processed and number of people testing positive. This means that any estimates of the things we really want to know are poor and highly uncertain (including the regular updates we have been providing based on this data). Yet, in principle, we should easily be able to get daily data for all the yellow rectangles and, if we did, our estimates would be far more accurate. Given the critical need to know these things more accurately, it is a great shame that these data are not available.
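To see why the two blue rectangles alone tell us so little, here is a minimal sketch in Python. It is not the model in the schematic: it assumes (unrealistically) that the people tested are a random sample of the population and that a single sensitivity and false positive rate apply to every test. The function name and all the numbers are hypothetical, chosen only for illustration. Under those assumptions the positive fraction can be inverted to give an implied infection rate (this is the standard Rogan-Gladen correction).

```python
def implied_prevalence(positive_fraction, sensitivity, false_positive_rate):
    """Infection rate implied by an observed positive fraction.

    Assumes tested people are a random sample and that
    positive_fraction = rate * sensitivity + (1 - rate) * false_positive_rate.
    Requires sensitivity > false_positive_rate. The result is clipped to [0, 1].
    """
    raw = (positive_fraction - false_positive_rate) / (sensitivity - false_positive_rate)
    return min(1.0, max(0.0, raw))


# Same observed data (5% of tests positive), different assumed false positive rates.
# All values are illustrative assumptions, not reported figures.
for fpr in (0.001, 0.01, 0.04):
    rate = implied_prevalence(0.05, sensitivity=0.8, false_positive_rate=fpr)
    print(f"assumed false positive rate {fpr:.3f} -> implied infection rate {rate:.4f}")
```

The same pair of observed numbers is compatible with very different infection rates, depending entirely on quantities we cannot see in the published data.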

Notes about simplified assumptions

There are many such assumptions, but here I list just the most critical ones:

We make a crucial distinction between people who do and do not have COVID symptoms, for two important reasons: a) the former are more likely to be tested than the latter, and b) the testing accuracy rates will be different in each case. However, we don't (but really should) also distinguish between people who have and have not been in recent contact with a person who has tested positive, because again a) the former are more likely to be tested; and b) the testing accuracy rates will be different in each case. It could also be reasonably argued that we should distinguish between different age categories.
We are making the massively simplifying assumption that the testing process is somehow 'constant'. Not only are there many different types of tests, but for the most common - PCR testing - there are massive variations depending on what 'Ct value' is used (i.e. the number of cycles), and small changes can lead to radically different false positive rates. If there are government changes to the Ct value guidelines then this can cause apparent (but non-real) massive changes in the 'population infection rate' from one day to the next.
While we have allowed for the fact that some people are tested multiple times (hence the observable, but never reported, variable 'number of people tested more than once'), this actually massively over-simplifies a very complex problem. If a person tests positive where the Ct value was above 40, then (because it is known that Ct values even above 30 lead to many false positives) the recommendation is to retest, but we do not know if and when this happens and how many retests are performed. Similarly, some people may receive multiple negative tests before a single positive test, and such people would count only as one of the people testing positive.
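The point about a non-constant testing process can be illustrated with a small forward calculation. This is only a sketch under stated assumptions: the true infection rate is held fixed, a change in Ct guidelines is represented simply as a change in the false positive rate, and the function name and every number below are hypothetical rather than taken from any real data.

```python
def expected_positives(tests, infection_rate, sensitivity, false_positive_rate):
    """Expected number of positive results from a batch of tests.

    Assumes one test per person and a random sample of the population.
    """
    true_positives = tests * infection_rate * sensitivity
    false_positives = tests * (1 - infection_rate) * false_positive_rate
    return true_positives + false_positives


# Hypothetical scenario: 200,000 tests per day, true infection rate fixed at 1%.
before = expected_positives(200_000, 0.01, sensitivity=0.8, false_positive_rate=0.005)
after = expected_positives(200_000, 0.01, sensitivity=0.8, false_positive_rate=0.02)
print(f"positives before guideline change: {before:.0f}")
print(f"positives after guideline change:  {after:.0f}")
```

With these illustrative numbers the daily count of people testing positive roughly doubles (from about 2,590 to about 5,560) even though nothing about the actual infection rate has changed.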

**The schematic is actually a representation of what is called a Bayesian network; the direction of the arrows is important because every variable (box) that has arrows going into it is calculated as an arithmetic or statistical function of the variables that are its 'parents'.

As all unobserved variables like population infection rate are never known for certain, they will always be represented as a probability distribution (which could be summarised, for example, as "a 95% chance of being between 0.1% and 20%" or something like that). As we enter observed data (such as number of people testing positive) we can calculate the updated probability of each unobserved variable; so, for example, the population infection rate might change to "a 95% chance of being between 0.1% and 10%". The more data we enter for the observable variables, the more accurate the estimates for the unobserved variables will be. Unlike traditional statistical methods, Bayesian inference works 'backwards' (in the reverse direction of the arrows) as well as forwards.
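Here is a minimal sketch of that 'backwards' updating for a single unobserved variable. It is far simpler than the Bayesian network in the schematic and is not the model we actually use: it assumes a flat prior on the infection rate between 0% and 20%, a random sample of people tested, and a sensitivity and false positive rate that are known exactly (in reality they are themselves uncertain). The function names, the grid approximation and all numbers are assumptions made for illustration.

```python
import math


def posterior_over_rate(tests, positives, sensitivity, false_positive_rate,
                        grid_size=2001, max_rate=0.2):
    """Grid approximation to P(infection rate | tests, positives), flat prior.

    Assumes 0 < false_positive_rate < sensitivity < 1 so the logs are defined.
    """
    rates = [max_rate * i / (grid_size - 1) for i in range(grid_size)]
    log_weights = []
    for rate in rates:
        # Probability that a single test is positive at this infection rate.
        p_pos = rate * sensitivity + (1 - rate) * false_positive_rate
        log_weights.append(positives * math.log(p_pos)
                           + (tests - positives) * math.log(1 - p_pos))
    peak = max(log_weights)  # subtract the peak to avoid numerical underflow
    weights = [math.exp(lw - peak) for lw in log_weights]
    total = sum(weights)
    return rates, [w / total for w in weights]


def credible_interval(rates, probs, mass=0.95):
    """Equal-tailed interval read off the cumulative grid probabilities."""
    tail = (1 - mass) / 2
    cumulative, low, high = 0.0, None, None
    for rate, prob in zip(rates, probs):
        cumulative += prob
        if low is None and cumulative >= tail:
            low = rate
        if cumulative >= 1 - tail:
            high = rate
            break
    return low, high


# Same positive fraction (3%), but 100 times more data in the second case.
small = credible_interval(*posterior_over_rate(1_000, 30, 0.8, 0.01))
large = credible_interval(*posterior_over_rate(100_000, 3_000, 0.8, 0.01))
print("95% interval with 1,000 tests:  ", small)
print("95% interval with 100,000 tests:", large)
```

The interval narrows as more observed data are entered, which is the behaviour described above. In the full network the same idea applies to every unobserved variable at once, and the uncertainty about test accuracy would widen these intervals considerably.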

We have published many papers and reports applying Bayesian network analysis to COVID data. For this and related work see, for example:

Sunday, 11 October 2020