Welcome back to our series on ROC curves and their statistics. This is our last installment on ROCs; next we move on to odds ratios, risk ratios, and related statistics.
Now we look at how two methods can be compared from a clinical point of view using ROC curves. Recall that a perfect diagnostic test will yield a curve on the 1-specificity vs. sensitivity chart that has an area under the curve (AUC) of 1.0 (Figure 1, red).
A test with an AUC of 0.5 is a useless test, much like flipping a coin: it is right 50% of the time (the blue dashed line in Fig. 1).
Next, we look at the AUC for two or more tests (Figs 2 and 3) and compare the AUCs. The idea is that the curve with the higher area represents the better test. As you can see, the dashed line (Method A) has a higher area than the solid line (Method B). In other words, Method A is accurate more often than Method B (assuming that the diagnosis and the laboratory value are correct).
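As a minimal sketch of how an AUC is obtained from the points of an ROC curve, the trapezoidal rule can be applied to (1-specificity, sensitivity) pairs. The function name and the ROC points below are hypothetical and chosen only to reproduce the two reference cases described above (the chance line and the perfect test); they are not data from the figures.

```python
def auc_trapezoid(fpr, tpr):
    """Area under an ROC curve by the trapezoidal rule.

    fpr: 1 - specificity values, sorted ascending from 0.0 to 1.0
    tpr: the matching sensitivity values
    """
    area = 0.0
    for i in range(1, len(fpr)):
        width = fpr[i] - fpr[i - 1]
        mean_height = (tpr[i] + tpr[i - 1]) / 2
        area += width * mean_height
    return area


# Hypothetical ROC points, for illustration only
chance_auc = auc_trapezoid([0.0, 1.0], [0.0, 1.0])             # diagonal line
perfect_auc = auc_trapezoid([0.0, 0.0, 1.0], [0.0, 1.0, 1.0])  # perfect test

print(chance_auc, perfect_auc)
```

Real curves have many more points, one per candidate cut off, but the calculation is the same.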
Since these curves use a number of points and usually a diagnostic test has a single cut off (at most two are used), we consider another approach to evaluating two or more tests. In this approach, a value for specificity (or sensitivity) is chosen for all the methods being evaluated, say 90% or 0.9. With that in mind, the sensitivity at a specificity of 0.9 is found for the tests. Below is a table using this and comparing 3 tests (e.g. troponin, CK-MB and myoglobin).
From the table, it is clear that Method A is the best of the three tests, keeping in mind that this applies only at a specificity of 0.9. It is possible, although not likely, that another method would be better if the specificity cut off were changed.
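One way to read the sensitivity at a chosen specificity from a set of ROC points is linear interpolation between the two neighboring points. The sketch below assumes the curve is given as sorted (1-specificity, sensitivity) pairs; the function name and the example points are hypothetical and are not the values from the table.

```python
def sensitivity_at_specificity(fpr, tpr, specificity):
    """Interpolate the sensitivity at a fixed specificity.

    fpr: 1 - specificity values, sorted ascending
    tpr: the matching sensitivity values
    """
    target = 1.0 - specificity
    for i in range(1, len(fpr)):
        x0, x1 = fpr[i - 1], fpr[i]
        if x0 <= target <= x1:
            if x1 == x0:
                # Vertical segment: take the higher sensitivity
                return max(tpr[i - 1], tpr[i])
            fraction = (target - x0) / (x1 - x0)
            return tpr[i - 1] + fraction * (tpr[i] - tpr[i - 1])
    raise ValueError("specificity is outside the range of the curve")


# Hypothetical ROC points, for illustration only
example_fpr = [0.0, 0.2, 1.0]
example_tpr = [0.0, 0.8, 1.0]
sens_at_90 = sensitivity_at_specificity(example_fpr, example_tpr, 0.90)
print(round(sens_at_90, 3))
```

Running the same function on each method's curve with the same specificity gives directly comparable sensitivities.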
Another way to compare tests is to use the efficiency, (TP + TN)/(total number of patients studied). For example, Test X has an efficiency of 89% at the cut off used while Test Y has an efficiency of 93%. If the clinicians are concerned with the number of right answers, they would choose Test Y.
Another way to compare tests is to select a sensitivity, say 95%, and then see how the specificities compare when the sensitivity is the same. For example, at a sensitivity of 95%, Test A has a specificity of 84% while Test B has a specificity of 78%. Test A appears better, at least at this cut off.
One of our concerns is that it is not uncommon to read a method comparison article and see something like this: "Test M had a sensitivity of 0.96 with a specificity of 0.88 while Test N had a sensitivity of 0.89 with a specificity of 0.94." Is it not possible that with a different cut off Test N would have nearly the same sensitivity and specificity as Test M? Or vice versa?
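To make the efficiency calculation concrete, here is a small sketch. The counts are hypothetical: they were picked only so that the result matches the 89% quoted for Test X, and they are not taken from any study.

```python
def efficiency(tp, tn, fp, fn):
    """Fraction of all patients classified correctly: (TP + TN) / total."""
    total = tp + tn + fp + fn
    return (tp + tn) / total


# Hypothetical counts for 100 patients at one cut off
test_x_efficiency = efficiency(tp=27, tn=62, fp=8, fn=3)
print(test_x_efficiency)
```

Because the four counts change whenever the cut off changes, an efficiency, like a sensitivity and specificity pair, describes a test only at the cut off used.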
The last topic we want to discuss is the combination of two tests. The two tests could be run at the same time, or in sequence: Test A first and then Test B only if A is positive (or negative), or Test B first and then Test A. Let's consider an example.
Here are the 4 possible scenarios, with the sensitivity and specificity as well as the number of tests run for each. There are 100 patients to be tested, 30 with the disease and 70 without (the sensitivity and specificity are rounded).
Both A and B at the same time.
- Either A or B must be positive to be positive.
Sensitivity 98%, specificity 88%, total tests 200.
- Both A and B must be positive to be positive.
Sensitivity 94%, specificity 84%, total tests 200.
A and B in sequence.
- First A. If positive, measure B. Both must be positive to be positive.
Sensitivity 94%, specificity 86%, total tests 128.
- First B. If positive, measure A. Both must be positive to be positive.
Sensitivity 91%, specificity 85%, total tests 127.
From this, we can decide which of the 4 to use depending on whether we want the highest sensitivity, the highest specificity, the least cost, or a compromise among them.
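When only the performance of each test alone is known, the combined sensitivity and specificity are sometimes estimated by assuming that the two tests make their errors independently. The sketch below uses that assumption, which is ours and is not stated in the example above. The single-test values are hypothetical, and observed study data, like the rounded figures listed above, need not follow these formulas.

```python
def combine_either_positive(sens_a, spec_a, sens_b, spec_b):
    """Either test positive counts as positive (assumes independent errors)."""
    sens = 1 - (1 - sens_a) * (1 - sens_b)
    spec = spec_a * spec_b
    return sens, spec


def combine_both_positive(sens_a, spec_a, sens_b, spec_b):
    """Both tests must be positive (assumes independent errors)."""
    sens = sens_a * sens_b
    spec = 1 - (1 - spec_a) * (1 - spec_b)
    return sens, spec


def expected_tests_sequential(n_diseased, n_healthy, sens_first, spec_first):
    """Expected total tests when the second test is run only on first-test positives."""
    first_positives = n_diseased * sens_first + n_healthy * (1 - spec_first)
    return (n_diseased + n_healthy) + first_positives


# Hypothetical single-test performance, for illustration only
SENS_A, SPEC_A = 0.95, 0.90
SENS_B, SPEC_B = 0.90, 0.85

or_sens, or_spec = combine_either_positive(SENS_A, SPEC_A, SENS_B, SPEC_B)
and_sens, and_spec = combine_both_positive(SENS_A, SPEC_A, SENS_B, SPEC_B)
# 30 diseased and 70 non-diseased patients, as in the example above
tests_a_first = expected_tests_sequential(30, 70, SENS_A, SPEC_A)

print(round(or_sens, 3), round(or_spec, 3))
print(round(and_sens, 3), round(and_spec, 3))
print(round(tests_a_first, 1))
```

The sequential strategy saves tests because the second test is run only on the patients who were positive on the first.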
Before we leave our discussion on ROCs, we want to do a thought experiment to illustrate a situation you will encounter in reading articles that discuss sensitivity and specificity. In our experiment, we look at test T used to detect disease D. In this case, let us say that T will be positive 99% of the time if the patient has D, and it will be negative 99% of the time if the patient does not have D.
One more statistic: this is an uncommon disease. It is found in only 0.1% of the population, that is, only 1 person per thousand has D. In our experiment, we test 100,000 people for D using test T. The results are in the table below.
Note that, even with a sensitivity of 0.99 and a specificity of 0.99, there are still 999 false positives (FP) against only 99 true positives! This is why it is usually helpful to include the positive predictive value (PPV) and the negative predictive value (NPV) in such studies.
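The counts in this thought experiment follow directly from the prevalence, sensitivity and specificity given above. The short sketch below reproduces them and adds the PPV and NPV; the variable names are ours.

```python
# Values from the thought experiment
n_tested = 100_000
prevalence = 0.001    # 0.1%, or 1 per thousand
sensitivity = 0.99
specificity = 0.99

diseased = round(n_tested * prevalence)
healthy = n_tested - diseased

tp = round(diseased * sensitivity)        # true positives
fn = diseased - tp                        # false negatives
fp = round(healthy * (1 - specificity))   # false positives
tn = healthy - fp                         # true negatives

ppv = tp / (tp + fp)   # probability of disease given a positive result
npv = tn / (tn + fn)   # probability of no disease given a negative result

print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
print(f"PPV={ppv:.1%} NPV={npv:.3%}")
```

With these inputs the PPV is 99/1098, or about 9%: roughly 10 of every 11 positive results are false positives, even though the test is right 99% of the time in each group.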
Case study 1
Diagnosis of neonatal sepsis may be difficult because clinical presentations are often nonspecific, bacterial cultures are time-consuming and other laboratory tests lack sensitivity and specificity. This study investigated the role of procalcitonin (PCT), C-reactive protein (CRP), interleukin (IL)-6, IL-8 and tumor necrosis factor-alpha (TNF-alpha) in establishing the diagnosis and evaluating the prognosis of neonatal sepsis. The study found that the AUCs for PCT, TNF-alpha, IL-6, CRP and IL-8 were 1.00, 1.00, 0.97, 0.90 and 0.68, respectively. For the cut-off value of PCT ≥ 0.34 ng/ml, the test was found to have a sensitivity of 100%, specificity of 96.5%, positive predictive value of 96.2%, negative predictive value of 100% and diagnostic efficacy of 98.3% for bacterial sepsis in neonates. For the cut-off value of TNF-alpha ≥ 7.5 pg/ml, sensitivity, specificity, positive predictive value, negative predictive value and diagnostic efficacy were found to be 100%, 96.6%, 96.2%, 96.5% and 98.3%, respectively. Sensitivity, specificity and diagnostic efficacy values were lower for IL-6, CRP and IL-8.
Role of procalcitonin, C-reactive protein, interleukin-6, interleukin-8 and tumor necrosis factor-alpha in the diagnosis of neonatal sepsis (1).
Based on these data, which marker would you use if you were allowed only one? Why?
Case study 2
The consequence of the imbalance between the erythroid marrow iron requirements and the actual supply is a reduction in red cell hemoglobin content, which causes hypochromic mature red cells and reticulocytes. One instrument reports reticulocyte hemoglobin equivalent (Ret-He) and the percentages of erythrocyte subsets, including the hypochromic fraction (%Hypo-He). Ninety healthy subjects, 85 patients with chronic kidney disease (CKD), 65 patients on dialysis (HD) receiving therapy, and 91 patients with iron deficiency anemia (IDA) were analyzed.
The results of the ROC curve analysis for the diagnosis of iron deficiency (gold standard sTfR > 21 nm) were as follows: Ret-He: area under the curve (AUC) 0.935, cutoff 29.8 pg, sensitivity 90.7%, specificity 83.1%. %Hypo-He: AUC 0.925, cutoff 3.5%, sensitivity 87.3%, specificity 88.0%.
What conclusions can you draw from these data?
Erythrocyte and reticulocyte indices in the assessment of erythropoiesis activity and iron availability (2).
1. Kocabas E. Turk J Pediatr. 2007 Jan-Mar;49(1):7-20.
2. Urrechaga E. Int J Lab Hematol. 2013 Apr;35(2):144-9.