News about testing for viruses has reminded me of a couple problems that I linked to some time ago, but never dealt with directly. The question is, given data such as the result of a (fallible) blood or swab test, how sure can we be of the results? The answer is sometimes surprising.
False positives and negatives
We’ll start with this question from 1999:
Probability in Virus Testing

In a country, 1 person in 10,000 (.01%) has the EB virus. A test that identifies the disease gives a positive indication 99% of the time when an individual has the disease. However, 2% of the time it gives a positive indication when a person doesn't have the disease.

a) If a randomly selected individual is tested and the test turns out positive, what is the chance that this individual has the virus?

b) Construct a 2 by 2 table.

I am having trouble constructing the table. I am also not sure what I am supposed to do with the numbers to arrive at the answer for (a).
Here we know how common the virus is, and how often the test gives a true positive, or a false positive, result. We want to know the probability that it is correct when the result is positive. At first glance, you might think that’s already been answered! But in probability, a careful reading of any problem is essential. There is a big difference between the test being positive 99% of the time when you are sick, and you being sick 99% of the time when it is positive! (The latter, we’ll see, is far from true.)
Doctor Anthony replied, first making the table:
                       Initial Prob.        Initial Prob.
                       Has the virus        Does not have virus
                       Prob = 0.0001        Prob = 0.9999
   -------------------------------------------------------------
                       0.0001 x 0.99        0.9999 x 0.02
   Test is positive      = 0.000099           = 0.019998

                       0.0001 x 0.01        0.9999 x 0.98
   Test is negative      = 0.000001           = 0.979902
This is the requested table; but we need to think carefully about what it means.
Each entry gives the proportion of the entire population that fits each description: in reading order, those who have the virus and test positive, those who don’t have the virus but test positive, those who have the virus but test negative, and those who don’t have the virus and test negative. So the first column contains the 0.01% of all people with the virus; 99% of these test positive, and 1% test negative. The second column contains that 99.99% of the population who don’t have the virus; 2% of these test positive, while 98% test negative.
Two cells deserve special attention: the false negatives (the lower-left cell, people whom the test wrongly identifies as clean, also called a Type II error), and the false positives (the upper-right cell, people in whom the test wrongly identifies the virus, also called a Type I error). There are many more of the latter, simply because there are so many healthy people. Both are dangerous; false negatives can keep people from being treated, which is very bad; but false positives scare people, and also reduce faith in the test.
Now apply this to a person who tests positive:
Since we are told that the test is positive, the sample space is confined to the top row.

                        0.000099
   Prob has virus = -------------------- = 0.004926
                    0.000099 + 0.019998

With such a low probability of a person actually having the disease given a positive result, the test itself is practically worthless.
What happened here is that with so many false positives, the few true positives are swamped. A positive result is far from indicating that you are in fact sick.
What if you test negative? Then the probability that you are actually healthy is, by the same reasoning, $$P(\text{no virus}\mid\text{negative}) = \frac{0.979902}{0.000001 + 0.979902} = 0.999999$$ So you can be assured (99.9999%) that you are virus-free – but you were already 99.99% sure anyway.
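If you want to verify these numbers yourself, here is a minimal Python sketch (my addition, not part of the original answer) that builds the same table from the prevalence, the true positive rate, and the false positive rate:

```python
# A minimal sketch of Doctor Anthony's table, using the numbers from the problem.
prevalence = 0.0001          # 1 person in 10,000 has the virus
sensitivity = 0.99           # P(test positive | has virus)
false_positive_rate = 0.02   # P(test positive | no virus)

# The four cells of the table: joint probabilities over the whole population
tp = prevalence * sensitivity                      # has virus, tests positive
fn = prevalence * (1 - sensitivity)                # has virus, tests negative
fp = (1 - prevalence) * false_positive_rate        # no virus, tests positive
tn = (1 - prevalence) * (1 - false_positive_rate)  # no virus, tests negative

p_virus_given_positive = tp / (tp + fp)
p_healthy_given_negative = tn / (tn + fn)

print(f"P(virus | positive)    = {p_virus_given_positive:.6f}")    # about 0.0049
print(f"P(no virus | negative) = {p_healthy_given_negative:.6f}")  # about 0.999999
```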
The 99% in this problem, the true positive rate, is called the sensitivity of the test, because it tells us how sensitive the test is to what it is looking for. The true negative rate, 98%, is called the specificity, because it tells how specific a positive result is to this particular condition. As we see here, the specificity can be extremely important when we are testing for something rare!
As a current example, in evaluating a test for SARS-CoV-2, it is important on one hand that it should be sensitive (find most cases of the virus, so they can be isolated or treated); but if it is not also specific (e.g. not showing positive results for someone who has had a cold caused by a different coronavirus), it may be useless. On the other hand, the significance of any result is strongly affected by the prevalence of the virus in the population — which we can’t be sure of without testing.
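To make the role of prevalence explicit (this formula is my addition, not part of either answer, but it is just Bayes’ theorem applied to the table above), the probability that a positive result is correct can be written in terms of the sensitivity, the specificity, and the prevalence \(p\): $$P(\text{virus}\mid\text{positive}) = \frac{\text{sensitivity}\times p}{\text{sensitivity}\times p + (1-\text{specificity})\times(1 - p)}$$ With the 99% sensitivity and 98% specificity above, this gives about 0.5% when \(p\) is 1 in 10,000, but about 72% when \(p\) is 5%; the same test can be nearly useless or quite informative depending on how common the condition is.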
More on measures of accuracy
I want to include here a question from Stephen in 2004 that was not put in the archive, and which covers a little more:
If a particular test for AIDS was accurate in detecting the disease 95% of the time, and the test had a false positive rate of only 1%, if 1% of the population being tested had the disease:

(a) what proportion of the population would be diagnosed as having AIDS? (Hint: this answer requires you to make two calculations, not just one.)

The hint the teacher gives us in question (a) is that there are two calculations needed to come up with the answer. I don't see why this is so. To me it seems like the proportion of the population that would be diagnosed would be 95% (accuracy rate) of the 1% of the population that has the disease.
Doctor Pete answered:
Is this the exact wording of the question? I will make a few assumptions about what you mean by "95% accuracy" and "1% false positives." I take these to mean that 95 out of 100 people with disease will be correctly identified, and the remaining 5 will be incorrectly identified as healthy (Type I error); and out of 100 people, 1 will be identified as having the disease when in fact he is healthy (Type II error). The null hypothesis, of course, is that the tested individual has the disease.
The assumption is that “accuracy” means sensitivity. It can instead be taken to mean the proportion of results that are correct, that is, (TP + TN)/n as defined below. Notice also that Doctor Pete labels the error types the other way around from the first answer: because he takes the null hypothesis to be that the subject has the disease, a false negative is his Type I error and a false positive his Type II.
In this situation, there are four categories:

   TP = True Positive  = Subject tests positive and is indeed HIV+
   FP = False Positive = Subject tests positive but is in fact HIV- (Type II)
   TN = True Negative  = Subject tests negative and is indeed HIV-
   FN = False Negative = Subject tests negative but is in fact HIV+ (Type I)

Then we have the following facts:

   TP + FP = Total positive test results
   FN + TN = Total negative test results
   TP + FN = Total number of HIV+ subjects
   FP + TN = Total number of HIV- subjects

Furthermore, n = TP + FP + FN + TN = Total number of test subjects.
These, of course, are the four cells of our table, and the sums of the rows and columns. Now we have to fill them in:
Now suppose n = 10000. We are told that the prevalence rate of HIV is 1%. In other words, TP + FN = 100, FP + TN = 9900. Since the probability of a Type II error (false positive) is 1%, we then also have FP = 100. Hence, TN = 9900 - FP = 9800. Since the test accurately identifies 95% of all HIV+ test subjects, that means that TP = 0.95(100) = 95, and FN = 5. Therefore, out of 10000 trials, the test will give TP + FP = 95 + 100 = 195 positive results, which is 1.95%.
Stephen had taken into account only TP, not FP.
I hope this is what you had in mind; I can't be sure because the term "accuracy" is rather vague. Incidentally, we have

   Sensitivity = TP/(TP+FN) = 95/100 = 95%, which is what you called "accuracy"
   Specificity = TN/(FP+TN) = 9800/9900 = 98.99%
   Positive predictive value (PPV) = TP/(TP+FP) = 95/195 = 48.72%
   Negative predictive value (NPV) = TN/(TN+FN) = 9800/9805 = 99.95%
   Efficiency = (TP+TN)/n = 98.95%

Sensitivity and specificity are independent of the prevalence rate, and are indicators of a test's "clinical value." The low PPV tells you that slightly more than half of all positive tests are false. The high NPV tells you that you can be very confident that a negative test result is in fact correct. Thus, the test is not suitable for testing populations with a low prevalence, because it tends to cause unnecessary alarm (too many false positives).

I strongly recommend you repeat this exercise with a high prevalence rate, say 20%, and calculate sensitivity, specificity, PPV, NPV of the test. Then interpret these results and compare them to the 1% prevalence results.
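Here is one way to carry out the exercise Doctor Pete suggests. The sketch below is mine, not his; it treats the sensitivity and specificity he derived as fixed properties of the test and recomputes the prevalence-dependent measures for any population:

```python
# A sketch (not from the original answer): sensitivity and specificity are taken
# as fixed properties of the test, and the other measures are recomputed as the
# prevalence changes. Counts are expected values, so they need not be integers.
def test_measures(sensitivity, specificity, prevalence, n=10000):
    sick = n * prevalence
    healthy = n - sick
    tp = sick * sensitivity            # true positives
    fn = sick - tp                     # false negatives
    tn = healthy * specificity         # true negatives
    fp = healthy - tn                  # false positives
    return {
        "PPV": tp / (tp + fp),         # positive predictive value
        "NPV": tn / (tn + fn),         # negative predictive value
        "efficiency": (tp + tn) / n,   # proportion of all results that are correct
    }

# Doctor Pete's numbers: sensitivity 95%, specificity 9800/9900, prevalence 1%
print(test_measures(0.95, 9800/9900, 0.01))  # PPV ~ 0.487, NPV ~ 0.9995
# The suggested exercise: the same test at 20% prevalence
print(test_measures(0.95, 9800/9900, 0.20))  # PPV ~ 0.96
```

At 20% prevalence the PPV rises to about 96%, which is why the same test can be valuable in a high-prevalence population yet mostly produce false alarms in a low-prevalence one.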
For a slightly more complicated question, where the test can be inconclusive, see:
Test for Tuberculosis
Identifying a two-headed coin from test data
Here is a question from 2003 that is superficially quite different, but has the same underlying idea:
Two-Headed Coin and Bayesian Probability

In a box there are nine fair coins and one two-headed coin. One coin is chosen at random and tossed twice. Given that heads show both times, what is the probability that the coin is the two-headed one? What if it comes up heads for three tosses in a row?

I understand that there are 10 coins in total. My teammates tried it out also and they got 4/(9 + 4) for the first part and 8/(9 + 8) for the second part. I don't understand how they got this.
Rather than a disease, we have a defective coin, and rather than a blood test, we have a toss test. But basically it is the same thing: Deciding an underlying cause (the kind of coin) from an observation (the results of some tosses). Notice that this “test” will not produce false negatives: If the test comes out negative (either toss results in tails), then we know for sure that it is a normal coin! But it can have false positives.
Doctor Mitteldorf took this one, using a tree rather than a table:
Dear Maggie,

Here's a way to think about it. Make a tree:

                                       flip two heads (1/4)
                                      /
             choose fair coin (9/10)
            /                         \
           /                           flip anything else (3/4)
   10 coins
           \
            \
             choose two-headed coin (1/10) -> flip 2 heads (1/1)
This shows the probabilities of choosing a fair or defective coin, and then of each outcome for the coin we flip:
- P(fair) = 9/10;
- P(two heads | fair) = 1/4;
- P(any tails | fair) = 3/4;
- P(defective) = 1/10;
- P(two heads | defective) = 1.
Study this tree, and it becomes clear that there are 3 possibilities:

   1 - the top one has probability (9/10)*(1/4) = 9/40
   2 - the next one has probability (9/10)*(3/4) = 27/40
   3 - the last one has probability (1/10)*1 = 1/10
These are the probabilities
- P(fair and 2 heads) = 9/40,
- P(fair and any tails) = 27/40, and
- P(defective and 2 heads) = 4/40,
which add up to 1. In total, P(2 heads) = 13/40 and P(any tails) = 27/40.
Before you did the experiment, these were all the possibilities there were. Then you did the experiment. What did it tell you? It told you that the middle option is out. The coin did NOT show a tail, so we know it wasn't the second outcome. This narrows our universe to the 9/40 and the 1/10.

The trick now is to re-normalize these probabilities so that they show a total probability of 1, but stay in the same ratio. Within that universe (all the possibilities that are left) lines (1) and (3) remain in the ratio 9:4. So the probability of the top one is 9/13 and the bottom is 4/13, where 13 is just the sum of 9 and 4.
That is, $$P(\text{defective}\mid\text{2 heads}) = \frac{P(\text{defective and 2 heads})}{P(\text{2 heads})} = \frac{4/40}{9/40 + 4/40} = \frac{4}{13}$$
Can you extend this reasoning to come up with the corresponding result for three flips? (This kind of reasoning is called Bayesian probability, and it is one of the most confusing topics in probability at any level of study.)
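Here is a small sketch of the same re-normalization, written by me rather than by Doctor Mitteldorf, that handles any number of heads in a row; it confirms the teammates’ 4/13 and lets you check the three-flip case:

```python
from fractions import Fraction

def p_two_headed_given_all_heads(n_flips, n_fair=9, n_double=1):
    """P(coin is two-headed | it showed heads on every one of n_flips tosses)."""
    p_fair = Fraction(n_fair, n_fair + n_double)
    p_double = Fraction(n_double, n_fair + n_double)
    # Joint probabilities of each coin type AND seeing all heads
    fair_and_heads = p_fair * Fraction(1, 2) ** n_flips
    double_and_heads = p_double   # the two-headed coin always shows heads
    # Re-normalize over the outcomes that remain possible
    return double_and_heads / (fair_and_heads + double_and_heads)

print(p_two_headed_given_all_heads(2))  # 4/13
print(p_two_headed_given_all_heads(3))  # 8/17
```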
Do you see the similarity of this to the medical test problems? In all these cases, in order to determine the actual probabilities, we had to know the population: what percent had the disease, and (here) what percent of the available coins were defective. In real life, we may not know that …
Identifying a fair coin from test data (without knowing the population)
What if we knew a little less about the environment? Here is a broader question from 2004:
Probability Philosophy and Applying Inference

If I flip a coin 4 times and they all turn out to be heads, what is the probability that the coin is fair? Because I am not sure if there is a proper comparison distribution, would I have to use the T-distribution and is this problem even answerable? Would the question be answerable if I flipped the coin 50 times and it turned out heads each time? Any help would be great.
Nothing is said here about where the coin comes from; otherwise, this is essentially the same question.
Doctor Schwa answered:
Hi John, If you're a "frequentist" probability philosopher, the question has no answer: either the coin is fair, or it isn't--what's the repeated event from which we can abstract a probability? That is, you can say "the probability of this die showing 6 is about 1/6" based on rolling it a lot of times. But you can't do the same for "this coin is fair" because it is either always fair, or not--there's no variation!
If we knew anything about the coin, such as that it was known to be fair, or that it is from a population in which 1% of coins are made with double heads, or that it has produced heads 1000 times, we could answer the question; but as it is, the probability that it is a fair coin is not really defined from this perspective, in which probability is defined in terms of repeated experiments.
On the other hand, a "Bayesian" probability philosopher would be perfectly happy with your question. They would ask for one more piece of information first, though: what do you know about the person giving you the coin? How much did you trust them? The Bayesian would then use P(coin is fair) in the abstract, followed by P(4 heads in a row | coin is fair) compared with P(4 heads in a row | coin is unfair), to eventually determine P(coin is fair | 4 heads in a row).
Bayesian probability is about subjective judgments of one-time situations. In order to make this judgment, we need to start with an assessment of the a priori probability that any given coin is fair. In effect, that would be an estimate of the population the coin comes from.
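For instance, purely as an illustration (these numbers are not from the original exchange), suppose we were 90% sure beforehand that the coin was fair, and believed that any unfair coin would be two-headed. Then $$P(\text{fair}\mid\text{4 heads}) = \frac{0.9\times\frac{1}{16}}{0.9\times\frac{1}{16} + 0.1\times 1} = \frac{0.05625}{0.15625} = 0.36,$$ so four heads would drop our confidence from 90% to 36%. A different prior, or a different model of what an “unfair” coin does, would give a different answer, which is exactly why the result is a subjective judgment.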
But there’s another perspective, which is a misreading of hypothesis testing in statistics; the phrasing of the problem suggests that this is the context:
More likely than any of the above nonsense, though, what your teacher really wants you to calculate is a P-value. A P-value is NOT the probability that the coin is fair, though many people (and many statistics textbooks, even!) often misstate it as such. A P-value is just one of the probabilities that a Bayesian would use: P(4 heads in a row | coin is fair).
This goes right back to our initial discussion, of the difference between “the probability that I have the virus given that the test is positive” and “the probability that the test is positive given that I have the virus”. The p-value is the probability that we would get the results we see, if we assume that the coin is fair, rather than the probability that the coin is fair, given the results.
That is, you can answer the question "how likely is a fair coin to produce 4 heads in a row", and if that P-value is small enough, decide to reject the hypothesis that the coin is fair--but it's certainly not the probability that the coin is fair! You're simply casting doubt on the ASSUMPTION that it's a fair coin by noticing that the DATA you got would be really unlikely if the coin was fair. It's the DATA (the 4 heads in a row) that have a really low probability. I hope that response helps!
In this case, the probability that a fair coin will show heads four times in a row is \(\left(\frac{1}{2}\right)^4 = \frac{1}{16} = 0.0625 = 6.25\%\). We commonly pick a threshold of 5% or 1% for the p-value, figuring that anything less likely than that to have produced the results we see is hard to believe. In this case, a fair coin showing heads four times is within the realm of believability.
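For comparison (my own calculation, following the same reasoning), fifty heads in a row from a fair coin has probability $$\left(\frac{1}{2}\right)^{50} \approx 8.9\times 10^{-16},$$ far below any usual threshold, so in John’s second scenario we would confidently reject the hypothesis that the coin is fair – though, as Doctor Schwa emphasized, even that number is not the probability that the coin is fair.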