The Certainty of Uncertainty in Medical Diagnostic Testing: Considerations for Referring Physicians
By H.L. Magill, M.D., Mohammed Moinuddin, M.D. and J. Buchignani, M.D.
Despite our best efforts, uncertainty frequently persists in the practice of medicine. Medical diagnostic tests (imaging procedures or laboratory tests) are often perceived by physicians as providing absolute answers or as clarifying uncertainty to a greater degree than is warranted. When a diagnosis turns out to be at variance with the results of a diagnostic test, the clinician's assumption may be either that the test was misinterpreted or that the test is no good. Such a binary approach to the interpretation of diagnostic testing, i.e., assuming a clearly positive or negative result (like an on-off switch), is too simplistic and may be counterproductive in the workup of a patient. Instead, the results of a diagnostic test should be viewed on a continuum from negative to positive and as giving the likelihood or probability of a certain diagnosis. For example, when radiographs reveal cortical disruption with malalignment of the distal humerus in a patient with elbow trauma, the diagnosis of fracture is certain. If an elbow joint effusion is the only radiographic finding, the likelihood of fracture is relatively high but not definite. A fracture is unlikely if radiographs are negative; however, if the patient has pain or tenderness on physical exam, a bone scan might reveal an occult fracture, or symptoms could result from soft tissue injury alone.
Diagnostic testing, whether by radiography, ultrasound, computed tomography (CT), magnetic resonance imaging (MRI), nuclear medicine, or laboratory techniques, is inherently imperfect. Each procedure has a threshold of intrinsic resolution below which detection of disease is not possible, such as the inability of CT to detect tumor in a normal-sized lymph node. Tissues and organs have a limited morphologic response to disease, and tissue repair is most often nonspecific as to the inciting injury or disease. Patient body habitus, difficulties in positioning, and motion may adversely affect the accuracy of diagnostic imaging procedures by degrading image quality or by causing artifacts that mimic the changes of disease. Also, the accuracy of any diagnostic test depends on the expertise and experience of the individual who interprets the test, as well as on the amount and quality of pre-test clinical information available about the patient. An excellent example is the interpretation of a low TSH when thyroid chemistries are otherwise normal. Subclinical hyperthyroidism is an obvious diagnosis, particularly given the additional history of atrial fibrillation or osteoporosis. An experienced and knowledgeable physician will also consider other causes of low TSH, such as suppression by medication that the patient has received (e.g., thyroxine or steroids) or, rarely, by the ectopic (neoplastic) production of thyroid hormone.
Probability theory has been used in the medical literature in an attempt to quantify the accuracy of diagnostic testing (imaging or clinical laboratory procedures). The contingency table, perhaps the most widely applied method of analysis, provides some information about the relative accuracy of a certain diagnostic test in a specific setting (usually a preselected or defined patient population) against a gold standard method of evaluation. This gold standard, often more invasive or expensive, could be histologic diagnosis, another imaging procedure, a clinical laboratory test, or clinical outcome (each with its own inherent limitations that should be kept in mind). Comparison of the results of a diagnostic test against those of the gold standard in a selected population or study group of patients defines the sensitivity and specificity of that diagnostic test. The number of true positive (TP) results divided by the total number of patients with disease (TP + false negative [FN] results), as determined by the gold standard, gives the sensitivity of the test [TP / (TP+FN)]. The number of true negative (TN) results divided by the number of patients without disease (TN + false positive [FP] results), as determined by the gold standard, gives the specificity of the test [TN / (TN+FP)].
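The two ratios defined above can be checked with a few lines of Python. This is only a minimal sketch: the function names and the four counts are hypothetical and were chosen for easy arithmetic, not taken from any study.

```python
def sensitivity(tp: int, fn: int) -> float:
    """TP / (TP + FN): share of gold-standard-positive patients the test detects."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """TN / (TN + FP): share of gold-standard-negative patients the test clears."""
    return tn / (tn + fp)


# Hypothetical contingency table: 100 patients with disease, 200 without.
tp, fn = 90, 10   # diseased patients: test positive / test negative
tn, fp = 160, 40  # disease-free patients: test negative / test positive

print(f"sensitivity = {sensitivity(tp, fn):.2f}")  # 0.90
print(f"specificity = {specificity(tn, fp):.2f}")  # 0.80
```

Both values depend entirely on how the gold standard classified each patient, which is why the limitations of the gold standard carry over to the reported figures.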
When applying the sensitivity or specificity of a diagnostic test reported from the medical literature to clinical practice, one must keep in mind the possibility of pre- or post-test bias. Pre-test bias may optimize the reported sensitivity or specificity of a diagnostic test if the population selected for validation has a very high incidence of the disease in question or, conversely, is atypically healthy. The determination of a test's specificity can be especially problematic. If the test is invasive or expensive, the acquisition of a sufficient number of normal subjects to accurately establish specificity could be difficult. Post-test selection bias occurs when the results of a screening test primarily determine the subsequent work-up of patients. A positive test usually prompts further evaluation (often by the gold standard), while work-up may stop after a negative test. Selection of only positive tests for confirmation will spuriously elevate the false positive rate, thereby decreasing the specificity of the screening test. A good example of post-test selection bias has occurred with thallium myocardial perfusion imaging, initially reported to have specificity approximating 85%, but with subsequent reports of much lower specificity (45-50%) as the procedure gained widespread acceptance and mostly only positive studies underwent further evaluation by coronary arteriography. Because of these potential difficulties in accurately determining test specificity, the concept of normalcy, i.e., selection of normal patients by a low clinical pre-test likelihood of disease (<5%) rather than by results of gold standard testing (often more invasive and more expensive), has been proposed as a proxy for specificity.
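The arithmetic behind post-test selection bias can be illustrated with a small sketch. All numbers below are hypothetical (a test with 85% sensitivity and specificity applied to 1,000 patients, 200 of whom have disease), and the function name and the assumption that only 10% of negative tests reach the gold standard are illustrative choices, not data from the thallium literature.

```python
def apparent_rates(tp, fn, tn, fp, verify_pos=1.0, verify_neg=1.0):
    """Sensitivity and specificity computed only from patients who were
    actually referred on to the gold standard.

    verify_pos / verify_neg are the fractions of positive and negative
    screening tests that undergo confirmation (hypothetical parameters).
    """
    v_tp, v_fp = tp * verify_pos, fp * verify_pos  # verified positive tests
    v_tn, v_fn = tn * verify_neg, fn * verify_neg  # verified negative tests
    sens = v_tp / (v_tp + v_fn)
    spec = v_tn / (v_tn + v_fp)
    return sens, spec


# Hypothetical underlying counts for a test that is truly 85% / 85%.
counts = dict(tp=170, fn=30, tn=680, fp=120)

full = apparent_rates(**counts)                     # everyone verified
biased = apparent_rates(**counts, verify_neg=0.10)  # 10% of negatives verified

print(f"all verified:     sens {full[0]:.2f}, spec {full[1]:.2f}")      # 0.85, 0.85
print(f"mostly positives: sens {biased[0]:.2f}, spec {biased[1]:.2f}")  # 0.98, 0.36
```

Under these assumptions the measured specificity falls from 0.85 to about 0.36 even though the test itself has not changed, because most true negatives never reach the gold standard while every false positive does.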
Test specificity can also be lowered by abnormality that persists from a previous episode of disease. A patient with Hodgkin's disease in remission can still show CT evidence of residual mediastinal mass from granulation tissue or fibrosis. Previous pulmonary embolism or granulomatous infection can cause chronic perfusion defects with normal or nearly normal ventilation. The diagnosis of new or recurrent disease in these instances is often difficult and usually requires additional evaluation, including comparison with prior studies when available. Patient age is also a consideration in validation of diagnostic tests. Elderly patients are more likely to have experienced previous episodes of a disease or to have additional medical problems that could affect the accuracy of a diagnostic test. Determination of sensitivity and specificity of diagnostic procedures in pediatric patients may be difficult, since parents are often reluctant to let their children participate in research requiring additional studies with ionizing radiation or more invasive tests beyond those absolutely necessary for diagnosis.
While mindful of the above concerns, physicians should have some concept of the sensitivity and specificity of diagnostic tests performed on their patients. A patient with questionable or absent clinical symptoms and findings, but with clearly elevated serum T3 and T4 (not a laboratory error or the wrong patient), has hyperthyroidism given the high sensitivity of the T3 and T4 tests. These thyroid chemistries are more reliable than clinical evaluation when disagreement occurs. Conversely, a history strongly suggestive of gastroesophageal reflux disease (GERD) warrants treatment, even when a radiographic contrast study, nuclear medicine reflux study, or endoscopy is negative. These procedures, although helpful in many cases, do not have very high sensitivity for GERD.
As a general rule, diagnostic imaging procedures that are very sensitive are less specific (higher false positive rate), as illustrated by the following case. A diabetic patient with peripheral neuropathy developed redness and swelling of his foot. The possible clinical diagnoses included osteomyelitis, and MRI was done for further evaluation. The MRI showed both bone marrow and adjacent soft tissue edema that were felt to be compatible with osteomyelitis. The patient received a prolonged course of antibiotics, but clinically did not improve. A subsequent 99m-Technetium-HMPAO labeled white blood cell (WBC) scan was negative. The question arose as to whether the WBC scan was falsely negative or the MRI was falsely positive for infection. The patient underwent amputation of the foot because of continued symptoms and failure to respond to therapy. Histological evaluation revealed neuropathic changes with fracture but no evidence of osteomyelitis. Neuropathic joint disease, fracture, and osteomyelitis all produce soft tissue and bone marrow edema that is detected with high sensitivity by MRI, irrespective of the cause, thus compromising the specificity of MRI.
Bayes' theorem, applied to diagnostic testing, predicts the likelihood of disease (post-test probability) as a function of a test's sensitivity and specificity and of the disease likelihood prior to performing the test (pretest probability). The pretest probability of a disease is based on the prevalence of the disease in the population available for testing, a specific patient's symptoms and findings from physical exam, and the results of previous diagnostic testing done on the patient. The ventilation-perfusion lung (VQ) scan, often used in the diagnostic workup of pulmonary thromboembolic disease, illustrates a clinically useful application of Bayes' theorem. The results of a VQ scan are expressed as an index of post-test probability: low, intermediate, or high probability for pulmonary thromboembolism (PE). Understanding the clinical significance of these probabilities is extremely important for optimal patient care. High probability usually means an 85% or greater chance of PE, intermediate (moderate) probability a 20-84% chance, and low probability a less than 20% chance of PE (PIOPED study; some other studies set low probability at 15% or less). Since the mortality of PE is 30-32% and since the VQ scan probabilities alone do not impart sufficient confidence for therapeutic decisions about anticoagulation, additional data are required. Clinical information such as risk factors (use of birth control pills, prolonged immobilization, coagulopathies, cardiac disease, etc.) or another test such as Doppler ultrasound, showing deep vein thrombosis in an extremity, often provides these data. The PIOPED study indicates a 96% likelihood of PE given the combination of a high clinical index of suspicion and a high probability VQ scan. A low probability VQ scan with a low clinical index of suspicion of PE imparts a 98% confidence level for the absence of PE according to the PIOPED data.
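For readers who want to see the calculation, Bayes' theorem for a single dichotomous test result can be sketched as follows. The function name and the example values (30% pretest probability, 90% sensitivity, 80% specificity) are hypothetical; they are not PIOPED figures and are not meant to reproduce the VQ scan categories.

```python
def post_test_probability(pretest: float, sens: float, spec: float,
                          positive: bool = True) -> float:
    """Probability of disease after a test result, by Bayes' theorem.

    pretest: probability of disease before the test (0 to 1)
    sens, spec: sensitivity and specificity of the test (0 to 1)
    positive: True for a positive result, False for a negative result
    """
    if positive:
        true_pos = sens * pretest               # diseased and test positive
        false_pos = (1 - spec) * (1 - pretest)  # healthy but test positive
        return true_pos / (true_pos + false_pos)
    false_neg = (1 - sens) * pretest            # diseased but test negative
    true_neg = spec * (1 - pretest)             # healthy and test negative
    return false_neg / (false_neg + true_neg)


# Hypothetical example values, chosen only for illustration.
after_pos = post_test_probability(0.30, sens=0.90, spec=0.80, positive=True)
after_neg = post_test_probability(0.30, sens=0.90, spec=0.80, positive=False)
print(f"after a positive test: {after_pos:.2f}")  # 0.66
print(f"after a negative test: {after_neg:.2f}")  # 0.05
```

With these assumed values, a positive result raises the likelihood of disease from 30% to about 66%, and a negative result lowers it to about 5%; neither result yields certainty.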
The combination of scintigraphic findings and clinical information is essential to making a proper treatment decision. A decision not to anticoagulate a patient, based only on a low probability lung scan without consideration of clinical information or findings, is dangerous, with death reported in rare cases. If there is disparity between the clinical and the scintigraphic impression (high versus low or vice versa) or if the VQ scan is indeterminate (intermediate/moderate probability for PE), further evaluation is needed, usually Doppler ultrasound, helical CT, or pulmonary angiogram. Doppler ultrasound is helpful when positive for deep venous thrombosis (DVT), but has moderate sensitivity and specificity for detecting relevant sites of DVT because it is operator dependent, provides limited evaluation of pelvic veins, and does not always differentiate acute from chronic disease. Detection of acute DVT by nuclear medicine imaging (AcuTect; recently approved by FDA) may prove beneficial in patients who are difficult to evaluate by ultrasound, e.g., those with a lower extremity cast or recent surgery, obesity, or chronic deep venous disease; however, more investigation is needed to refine the accuracy and efficacy of this technique. Helical CT detects PE in the central and lobar pulmonary arteries with sensitivity similar to that of pulmonary angiography, but with decreased sensitivity for detection of PE in segmental pulmonary arteries. Pulmonary angiography remains the gold standard for definitive diagnosis of PE.
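The way clinical suspicion and the VQ scan result are combined in the discussion above can be written down as a short lookup. This sketch encodes only what the text states (the two concordant PIOPED figures, and further evaluation for discordant or intermediate results); the function name and category labels are hypothetical, an intermediate clinical suspicion is not covered because the text does not address it, and the code is an illustration of the reasoning rather than a management protocol.

```python
def summarize_pe_workup(clinical_suspicion: str, vq_scan: str) -> str:
    """Combine clinical suspicion ('low' or 'high') with the VQ scan
    category ('low', 'intermediate', or 'high') as described in the text."""
    clinical = clinical_suspicion.lower()
    scan = vq_scan.lower()
    if clinical not in {"low", "high"}:
        raise ValueError("clinical suspicion must be 'low' or 'high'")
    if scan not in {"low", "intermediate", "high"}:
        raise ValueError("VQ scan must be 'low', 'intermediate', or 'high'")

    # Indeterminate scan, or clinical and scintigraphic impressions disagree.
    if scan == "intermediate" or clinical != scan:
        return ("further evaluation needed (usually Doppler ultrasound, "
                "helical CT, or pulmonary angiogram)")
    if clinical == "high":
        return "concordant high: about 96% likelihood of PE (PIOPED)"
    return "concordant low: about 98% confidence that PE is absent (PIOPED)"


for clinical, scan in [("high", "high"), ("low", "low"),
                       ("high", "low"), ("low", "intermediate")]:
    print(f"{clinical:>4} suspicion, {scan:<12} scan -> "
          f"{summarize_pe_workup(clinical, scan)}")
```

Only two of the six combinations give a result that the text treats as sufficient by itself; every other combination leads to additional testing.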
An algorithm is a convenient way to summarize known statistical information and can provide guidelines for the most efficient use of diagnostic testing and/or the most appropriate patient management. An example of an algorithm useful for the evaluation and management of patients with possible PE is as follows

Even if the sensitivity and specificity of a test are not readily available, some concept of the information provided by the test and the relevance of this information to a patient's diagnosis is essential. Palpable thyroid nodules occur frequently, estimated to arise in 5-50% of the population in the USA. Clinical concerns are whether the nodule has autonomous function producing hyperthyroidism or whether the nodule is malignant. Clinical evaluation and thyroid function chemistries (T4, T3, TSH, etc.) readily determine if hyperthyroidism is present, while clinical data give an estimate of the relative risk of malignancy (increased for children, older patients, particularly males, and those with a history of radiation therapy to the head/neck or chest). Ultrasound and thyroid scans are often requested for the initial evaluation of patients with thyroid nodules. Ultrasound is quite sensitive in detecting additional thyroid nodules below the threshold of palpation and in distinguishing cystic from solid nodules. However, ultrasound cannot differentiate benign from malignant thyroid disease; thyroid cancer can present as a complex cystic or solid lesion; and thyroid cancer can arise in a multinodular thyroid gland. Thyroid scans determine if a nodule is functioning (hot) or non-functioning (cold). Many functioning nodules do not cause hyperthyroidism, most thyroid nodules (90-95%) are non-functioning, and most cold nodules are benign (80-88%) rather than malignant. Therefore, ultrasound and thyroid scans should play a secondary role in the initial evaluation of thyroid nodules. Fine needle aspiration thyroid biopsy (FNA) is the only reliable way to differentiate benign from malignant thyroid nodules, although in a small minority of patients, even FNA can be non-diagnostic because of sampling error, acellular specimen, or non-specific cytology.
Sensitivity and specificity partially define the efficacy of a diagnostic test, but do not answer the clinical concern of whether a patient with a positive or negative test result does or does not have a disease. These questions are addressed by the positive predictive value (PPV) and negative predictive value (NPV) of the test. The PPV is the percent of positive tests that are truly positive [TP / (TP+FP)] and the NPV is the percent of negative tests that are truly negative [TN / (TN+FN)]. The accuracy of the test can then be defined as the percent of all tests that are truly positive or truly negative [(TP+TN) / total number of tests performed in a given population or study]. The PPV of a test can be viewed as providing information about whether or not a positive result is likely to be falsely positive and, therefore, is influenced by the test's specificity. In turn, the NPV reflects how good a test is in separating abnormal (disease) from normal, and is influenced by the test's sensitivity. A low PPV means a greater chance for a false positive result and a low NPV means a greater chance that the test will not detect a disease that is present.
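As a brief illustration, the three quantities just defined can be computed directly from the four cells of a contingency table. The function name and the counts are hypothetical and serve only to make the formulas concrete.

```python
def predictive_summary(tp: int, fp: int, tn: int, fn: int) -> dict:
    """PPV, NPV, and accuracy from the four cells of a contingency table."""
    total = tp + fp + tn + fn
    return {
        "ppv": tp / (tp + fp),           # positive tests that are truly positive
        "npv": tn / (tn + fn),           # negative tests that are truly negative
        "accuracy": (tp + tn) / total,   # all tests that are correct
    }


# Hypothetical study: 300 patients, 100 of whom have disease.
result = predictive_summary(tp=90, fp=40, tn=160, fn=10)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
# ppv: 0.69
# npv: 0.94
# accuracy: 0.83
```

In this made-up table roughly three of every ten positive results are false positives, even though the test detects 90 of the 100 patients with disease.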
The positive and negative predictive values and the accuracy of a test are all influenced by the incidence (frequency) of a disease in the population studied, with this effect most significant at the extremes. When the disease incidence is very low (≤5%), a positive test result is more likely to be a false positive than a true positive, thus lowering the PPV of the test. A classic example would be to use thallium exercise myocardial perfusion imaging to screen patients under the age of 30 with chest pain for coronary artery disease. Symptomatic coronary artery disease is very uncommon at this age, particularly in premenopausal women or patients without multiple risk factors. Therefore, in these patients a positive thallium scan has a good chance of being a false positive. Conversely, if the disease incidence is high (≥95%), false negative results increase in frequency relative to true negative results, lowering a test's NPV. A negative thallium scan should not delay coronary arteriography in a 60-year-old man with typical angina and multiple risk factors for coronary artery disease, since a false negative result from balanced coronary artery disease is more likely than a true negative result. The accuracy of thallium scanning is sufficiently lowered in either instance to question the appropriateness of the study. Diagnostic imaging is most useful and accurate when the pretest probability of the disease in question is in the intermediate range.
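The effect of disease frequency on predictive values can be seen by holding the test fixed and varying only the frequency. The sketch below assumes a hypothetical test with 85% sensitivity and 85% specificity; the function name and these values are illustrative and are not measured characteristics of thallium imaging.

```python
def predictive_values(prevalence: float, sens: float, spec: float):
    """Return (PPV, NPV) for a test applied to a population in which the
    given fraction of patients has the disease."""
    tp = sens * prevalence               # diseased, test positive
    fn = (1 - sens) * prevalence         # diseased, test negative
    tn = spec * (1 - prevalence)         # disease-free, test negative
    fp = (1 - spec) * (1 - prevalence)   # disease-free, test positive
    return tp / (tp + fp), tn / (tn + fn)


# Same hypothetical test at low, intermediate, and high disease frequency.
for prevalence in (0.05, 0.50, 0.95):
    ppv, npv = predictive_values(prevalence, sens=0.85, spec=0.85)
    print(f"disease frequency {prevalence:.0%}: PPV {ppv:.0%}, NPV {npv:.0%}")
# disease frequency 5%: PPV 23%, NPV 99%
# disease frequency 50%: PPV 85%, NPV 85%
# disease frequency 95%: PPV 99%, NPV 23%
```

Under these assumptions, at a 5% disease frequency more than three of every four positive results are false positives, and at 95% more than three of every four negative results are false negatives; only in the intermediate range do both results carry useful weight.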
The risk-benefit-cost of a test for a patient is also important to consider. If the risk of complication is ≥5% and/or the test is expensive, then use of the test in order to gain an additional 3-5% confidence level from a positive result would be questionable. However, an invasive or expensive test may be necessary to establish a diagnosis with the greatest certainty possible, when the disease has serious complications or treatment of the disease may have potentially serious side effects. Biopsy for confirmation of suspected temporal arteritis is usually indicated, because treatment with steroids has serious potential side effects of osteonecrosis, osteoporosis with insufficiency fracture, or altered response to infection, and untreated temporal arteritis could result in blindness.
In conclusion, diagnostic testing provides the referring physician with information about the likelihood of a certain disease(s), based on anatomic changes and/or physiologic data. All imaging procedures and laboratory studies possess some diagnostic uncertainty because of inherent technical limitations, patient limitations, or disease prevalence in the patient population. These reasons explain why a diagnostic test result can differ from a clinical or actual diagnosis and why diagnosis should not be exclusively based on test results. When requesting a diagnostic test on a patient, the physician should consider what level of diagnostic certainty is appropriate for the patient, estimate the pre-study likelihood that the patient has a disease(s) in question, and understand the inherent limitations of the test. If a previous test result already provides 90% likelihood of a certain diagnosis, is additional evaluation necessary (risk-benefit-cost considerations) to increase the likelihood to 95% or greater? If the likelihood of a disease is low, consider the consequences of a positive test result that might well be a false positive. Will this ambiguous result necessitate additional procedures or cause unnecessary, dangerous, or expensive treatment? It is not always easy to answer these questions for patients with complex medical problems or to know whether a certain diagnostic test is likely to be of benefit. Discussion of the case with a colleague in diagnostic imaging, pathology, or the clinical laboratory will often provide guidance as to the most appropriate studies to perform. While this approach may take more effort, it is usually beneficial and often prevents a non-indicated test or one that could provide confusing/misleading results, and at the very least prevents a needless increase in medical costs.
Additional Reading:
1. Goldfarb LR, Goldfarb RC, Seldin DW. Clinical Decision Making in Nuclear Medicine. Nuclear Medicine Annual 1989 (Raven Press Ltd., NY): 225-264.
2. Utiger RD. Subclinical Hyperthyroidism Just a Low Serum TSH Concentration or Something More? (Editorial). New Engl J Med 1994; 331: 1302-1303.
3. Boudreau RJ, Remley KB. Advances in Imaging the Thyroid and Parathyroid Glands: Nuclear Scintigraphy, Computed Tomography, and Magnetic Resonance. Thyroid Today 1996; 19 (4): 1-11.
4. The PIOPED Investigators. Value of the Ventilation/perfusion Scan in Acute Pulmonary Embolism: Results of the Prospective Investigation of Pulmonary Embolism Diagnosis (PIOPED). JAMA 1990; 263: 2753-2759.
5. Gottschalk A, Sostman HD, Coleman RE, et al. Ventilation-Perfusion Scintigraphy in the PIOPED Study. Part II. Evaluation of the Scintigraphic Criteria and Interpretations. J Nucl Med 1993; 34:1119-1126.
6. ACCP Consensus Committee on Pulmonary Embolism. Special Report. Opinions Regarding the Diagnosis and Management of Venous Thromboembolic Disease. Chest 1996; 109: 233-237.