Thursday, 4 February 2021

What can we learn from very few data points (with implications for excess death numbers)?

Let's suppose that a museum decides to spend money in Sept 2020 advertising for new members. To see whether the advert has worked, you manage to find data on the number of new members (adjusted for changing population size) in October of each of the 5 previous years. The numbers are:

  • Oct 2015: 176
  • Oct 2016: 195
  • Oct 2017: 169
  • Oct 2018: 178
  • Oct 2019: 162

Suppose that, in Oct 2020, we see 178 new members. This is above the preceding 5-year average of 176, but we actually saw numbers at least as high in two of the five previous years. So nobody would seriously suggest that the 'above average' number of new members was due to the advertising. But what if we saw 197 new members? Or 200, 210, 220, 250? At what point could we reasonably conclude that the number is sufficiently high that there must be some causal explanation, such as the advertising or some other factor?

The classical statistical approach to answering this question is to 'fit' the data to a statistical distribution, such as a Poisson or Normal distribution. This enables us to determine the range within which we would 'expect' a new number to fall if there had been no intervention. The Poisson distribution 'fit' for the 5 years is as follows:

Note: the Poisson distribution has just one parameter, namely the mean, which is 176 in this case; the variance is equal to the mean.
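
To make this concrete, here is a minimal sketch (in Python with scipy, not the AgenaRisk models discussed later in the post) of the upper bounds implied by this fitted Poisson:

    # Minimal sketch: upper 95% and 99% bounds implied by a Poisson fit
    # whose mean is 176 (the 5-year average of the October counts).
    from scipy.stats import poisson

    mean_count = 176

    # Smallest count k such that P(X <= k) >= 0.95 (resp. 0.99) under Poisson(176)
    upper_95 = poisson.ppf(0.95, mean_count)
    upper_99 = poisson.ppf(0.99, mean_count)
    print(upper_95, upper_99)  # approximately 198 and 207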

So, if we set the threshold at 95% and observe, say, 200 new members, we might conclude - as evidence to support the impact of the advertising - that:

"The number of new members significantly exceeds the 5-year average (95% confidence bound)."

(The best-fit Normal distribution has mean 176 and variance 152.5, so it is 'narrower' than the Poisson above, with slightly lower percentiles, namely 196 and 204 for the 95% and 99% respectively. So, if we felt that was a more reasonable model, we would conclude that a value of 197 was above the 95% confidence bound for the 5-year average.)
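
The same kind of calculation for the best-fit Normal (again just a sketch, not the post's own models) looks like this; the 152.5 is the sample variance of the five counts:

    # Minimal sketch: upper bounds implied by the best-fit Normal
    # (mean 176, sample variance 152.5) for the five October counts.
    import numpy as np
    from scipy.stats import norm

    counts = np.array([176, 195, 169, 178, 162])
    mu = counts.mean()          # 176.0
    sd = counts.std(ddof=1)     # sqrt(152.5), the sample standard deviation

    upper_95 = norm.ppf(0.95, loc=mu, scale=sd)
    upper_99 = norm.ppf(0.99, loc=mu, scale=sd)
    print(upper_95, upper_99)   # roughly 196.3 and 204.7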

But even with this tiny sample of 5 data points, we have one data point of 195 (in Oct 2016), which is very close to the 5-year upper 95% confidence bound. So why should we consider 200 especially high?

Indeed, if we had data for more than 5 previous years of October new members, we might discover that every 10 years or so there is an October surge due to things that have nothing to do with advertising: perhaps there are major school initiatives every so often, or a local TV station runs a story about the museum, etc. Perhaps in Oct 1998 there were 2,000 new members, assumed to have been due to a Hollywood movie having a set filmed there at the time. So, assuming the data were available and adjusted for population size, how far back should we go?

If we really must rely on such tiny datasets for making the kind of inferences here, then simply 'fitting' the tiny dataset to a particular distribution does not capture the full uncertainty we have about new member numbers. Fortunately, the Bayesian approach to learning from data enables us to accommodate this type of uncertainty along with any prior knowledge we have (although in this case we do not include any explicit prior knowledge). The Bayesian model* (see below for details) produces quite different results from the standard distribution-fitting models. The 95% and 99% upper confidence bounds turn out to be 205 and 227 respectively. In other words, if there were, say, 204 new members in October 2020 then we would not be able to reasonably claim that this exceeded the 5-year average upper 95% confidence bound.
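
The post's Bayesian models were built in AgenaRisk and are not reproduced here, but a standard textbook analogue - a Normal likelihood with a vague (noninformative) prior on the mean and variance, whose posterior predictive distribution for the next observation is a Student-t - gives very similar bounds. A minimal sketch under that assumption:

    # Minimal sketch (not the AgenaRisk model from the post): with a Normal
    # likelihood and the standard noninformative prior, the posterior
    # predictive for a new observation is a Student-t with n-1 degrees of
    # freedom, centred on the sample mean.
    import numpy as np
    from scipy.stats import t

    def predictive_upper_bounds(data, quantiles=(0.95, 0.99)):
        data = np.asarray(data, dtype=float)
        n = data.size
        xbar = data.mean()
        s = data.std(ddof=1)               # sample standard deviation
        scale = s * np.sqrt(1 + 1 / n)     # predictive scale for a new value
        return [t.ppf(q, df=n - 1, loc=xbar, scale=scale) for q in quantiles]

    print(predictive_upper_bounds([176, 195, 169, 178, 162]))
    # roughly [205, 227]: the fatter t tails push the bounds well above
    # those of the plain Normal fit

The fatter tails of the t-distribution (only 4 degrees of freedom here) are what capture the extra uncertainty that comes from having to estimate both the mean and the variance from just 5 points.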

It is also important to note that using just 5 data points makes the results extremely sensitive to small changes. Suppose, for example, that the 2019 value was not 162 but 120 (with all the other numbers exactly the same). Then, although this makes the 5-year average much lower (it drops to around 166), the (Bayesian learnt) distribution becomes 'wider' (i.e. the variance increases), so that the 95% and 99% upper confidence bounds turn out to be much higher, at 236 and 294 respectively.
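
Re-running the predictive_upper_bounds sketch above with the modified data illustrates this widening; the exact figures differ a little from the AgenaRisk results (the priors and numerical method are not the same), but the qualitative effect is identical:

    # Usage of the sketch above, with the 2019 value changed from 162 to 120
    print(predictive_upper_bounds([176, 195, 169, 178, 120]))
    # roughly [234, 284] under the vague-prior model (the post's AgenaRisk
    # model reports 236 and 294) - either way, far wider than before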

You may be wondering why these differences are important. It is because the number of 'excess deaths' is now widely used as the most important indicator of the impact of COVID and/or lockdowns. And one of the standard approaches for determining whether increased death counts are likely explained by COVID and/or lockdowns is to use the previous 5-year datasets of death numbers and the model 'fitting' approach described above.

Indeed, this issue - and the limitations of using such 5-year averages - is the subject of this very interesting analysis, "Home Depot, Hogwarts, and Excess Deaths at the CDC" by Kurt Schulzke.

In fact, the 2015-2019 numbers used in the hypothetical museum example above are exactly the week 15 numbers for fatalities per million of the population for the state of Nebraska.

Based on the CDC approach (which uses the Poisson distribution), if the week 15 number had been above 198 it would have been classified as beyond the 5-year average upper 95% confidence bound. But a number above 205 would have been required under the more realistic Bayesian approach. The actual number was 178 - which could still be reported as being 'above the 5-year average' but is, of course, not at all unusual.

So what this all means is that you need to be very wary if you see conclusions about the current week's, month's or year's death numbers being 'significantly above the 5-year average'.

Here are the details for those interested in the Bayesian learning models. You can run these models, including with different values, using the free trial version of AgenaRisk (www.agenarisk.com); the model you need is here (right click and 'save as' to save it as a file, which you then open in AgenaRisk).

*As you can see from the above, we considered two different Bayesian models, one based on the Normal distribution and the other on the Poisson. The problem with the Poisson distribution is that it is best suited to situations where we are counting occurrences of fairly rare events in a given period, so that the numbers are typically very low (like the number of buses arriving at a stop every 10 minutes). Also, its assumption of a constant mean rate equal to the variance is intrinsically contradicted by death data. Even before COVID19, different types and severities of flu at different times of the year (and from one year to the next) cause significant fluctuations which, over a long period, cannot be 'fitted' well to the 'narrow' Poisson distribution. Hence the Normal distribution - whose variance is independent of the mean and which can be 'learnt' from the data - is more suitable. However, even the Normal will generally be too 'thin-tailed' to properly model the unusual and rare deviations which might be expected with death data.
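
A quick way to gauge whether the Poisson's mean-equals-variance assumption is even plausible for a given series is to compare the sample variance with the sample mean; a ratio well above 1 (overdispersion) means a Poisson fit will be too narrow. A minimal sketch:

    # Minimal sketch: check the Poisson assumption that the variance equals the mean.
    import numpy as np

    def dispersion_ratio(counts):
        counts = np.asarray(counts, dtype=float)
        return counts.var(ddof=1) / counts.mean()   # > 1 suggests overdispersion

    # For the five October counts the ratio is about 0.87 (152.5 / 176), which
    # is why the best-fit Normal above is narrower than the Poisson; long runs
    # of weekly death counts, with flu seasons of varying severity, will
    # typically give ratios well above 1.
    print(dispersion_ratio([176, 195, 169, 178, 162]))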

See: "Home Depot, Hogwarts, and Excess Deaths at the CDC" by Kurt Schulzke.


1 comment:

  1. A statistics PhD, for now anonymous, responds:

    My guess is that a Poisson GLM allowing over-dispersion could produce comparable inference to a normal when the counts are routinely above, say, 30-100. Another approach could be a negative binomial but I don't know if Farringtonflexible contains such an option ... Because of the CLT the Poisson starts to mimic the normal and the overdispersion parameter gets flexibility between the mean and variance.

    Extreme quantiles take larger means to get similarity between the decay in the tails, so using 99th percentiles instead of 95th would not be anticipated to lead to approximations that are as accurate.

    Thoughts?
