Short Article - (2019) Volume 1, Issue 3
Siamak Sabour
Department of Clinical Epidemiology, School of Health, Safety Promotion and Injury Prevention Research Center, Shahid Beheshti University of Medical Sciences, Tehran, IR, Chamran Highway, Velenjak, Daneshjoo Blvd, Iran
Diagnostic research is among the most interesting fields of clinical research. However, methodological and statistical issues in such studies are often not considered appropriately. Diagnostic value should be assessed in terms of both diagnostic accuracy (validity) and diagnostic precision (reliability or agreement). For a binary variable, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), positive likelihood ratio (LR+), negative likelihood ratio (LR-), and the diagnostic odds ratio (the ratio of true to false results) are the most appropriate estimates for evaluating the validity of a test against a gold standard. It is therefore better to report all of these validity estimates together; otherwise, the final interpretation will be confusing. Moreover, for clinical purposes, the diagnostic added value should be reported using the receiver operating characteristic (ROC) curve, because all of the above validity estimates can be acceptable while the diagnostic added value remains clinically negligible. For quantitative variables, the interclass correlation coefficient (Pearson r or Spearman rho) can be considered an appropriate statistical test for assessing validity.
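The validity estimates listed above all follow directly from the four cells of a 2x2 table of test result versus gold standard. A minimal sketch, using hypothetical counts that are not from the article:

```python
# Validity estimates for a binary test against a gold standard.
# tp/fp/fn/tn are illustrative counts (hypothetical data).
tp, fp, fn, tn = 90, 10, 20, 80

sensitivity = tp / (tp + fn)               # P(T+ | D+)
specificity = tn / (tn + fp)               # P(T- | D-)
ppv = tp / (tp + fp)                       # P(D+ | T+)
npv = tn / (tn + fn)                       # P(D- | T-)
lr_pos = sensitivity / (1 - specificity)   # LR+
lr_neg = (1 - sensitivity) / specificity   # LR-
dor = (tp * tn) / (fp * fn)                # odds ratio of true to false results

print(f"Se={sensitivity:.2f} Sp={specificity:.2f} PPV={ppv:.2f} "
      f"NPV={npv:.2f} LR+={lr_pos:.2f} LR-={lr_neg:.2f} DOR={dor:.1f}")
```

Reporting all seven estimates together, as the text recommends, avoids the misleading picture that any single one (e.g. sensitivity alone) can give.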
Reliability (precision or agreement), as a distinct methodological aspect of diagnostic value, should also be assessed with an appropriate estimate. For qualitative variables, weighted kappa should be used with caution. Cohen's kappa has two important weaknesses as a measure of agreement for a qualitative variable. First, it depends on the prevalence in each category, so two tables can have identical percentages of concordant and discordant cells yet yield different kappa values. Table 1 shows that in both situations (a) and (b) the concordant cells make up 90% and the discordant cells 10%; nevertheless, the kappa values differ (0.44, moderate, versus 0.80, very good, respectively). Second, the kappa value depends on the number of categories. In such situations, weighted kappa is the preferable test, giving an unbiased result. Finally, the P value or 95% CI is not reported for a weighted kappa in reliability analysis, because statistical significance does not necessarily mean clinical importance. For quantitative variables, the intraclass correlation coefficient (ICC, agreement, single measure) and the Bland-Altman plot can be considered appropriate tests for assessing reliability.
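The prevalence dependence of Cohen's kappa can be verified directly. The counts below are illustrative (Table 1 itself is not reproduced here), chosen so that both tables have 90% concordant and 10% discordant cells while reproducing the quoted kappa values:

```python
def cohen_kappa(table):
    """Cohen's kappa for a 2x2 agreement table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    n = a + b + c + d
    po = (a + d) / n                                      # observed agreement
    pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2   # chance agreement
    return (po - pe) / (1 - pe)

# Both tables: 90 concordant, 10 discordant observations out of 100.
table_a = [[85, 5], [5, 5]]    # skewed prevalence across categories
table_b = [[45, 5], [5, 45]]   # balanced prevalence across categories

print(round(cohen_kappa(table_a), 2))  # 0.44 (moderate)
print(round(cohen_kappa(table_b), 2))  # 0.80 (very good)
```

The observed agreement is 0.90 in both cases; only the expected chance agreement differs (0.82 versus 0.50), which is what drives the gap between the two kappa values.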
Eleven different ETDRS charts were created, each with a different number of characters in each row. A computer simulation was programmed to run 10,000 virtual patients, each with unique visual acuity, false-positive, and false-negative error values.
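A simulation of this kind can be sketched as follows. The chart layout, error rates, and scoring rule here are assumptions for illustration, not the article's actual protocol: each virtual patient resolves letters down to an acuity row, misreads resolvable letters at a false-negative rate, and lucky-guesses unresolvable letters at a false-positive rate.

```python
import random

def read_chart(rows, acuity_row, fp_rate, fn_rate, rng):
    """Simulate one virtual patient reading a letter chart.

    rows: letters per row, largest to smallest
    acuity_row: last row (0-based) the patient can truly resolve
    fp_rate: chance of correctly guessing an unresolvable letter
    fn_rate: chance of misreading a resolvable letter
    Returns the total number of letters scored correct.
    """
    correct = 0
    for i, n_letters in enumerate(rows):
        for _ in range(n_letters):
            if i <= acuity_row:
                correct += rng.random() >= fn_rate   # resolvable letter
            else:
                correct += rng.random() < fp_rate    # lucky guess
    return correct

rng = random.Random(0)
chart = [5] * 14                       # standard ETDRS layout: 14 rows of 5
scores = [read_chart(chart, acuity_row=7, fp_rate=0.05, fn_rate=0.10, rng=rng)
          for _ in range(10_000)]      # 10,000 virtual patients
print(sum(scores) / len(scores))       # mean letter score across the cohort
```

In the real simulation each virtual patient would draw its own acuity and error values rather than share fixed ones; the fixed values here just keep the sketch short.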
The existing comprehensive reviews on this topic were published about 11 years ago [14, 34]; knowledge, ideas, and research in this field have evolved significantly since then. Several new methods have been proposed and some existing methods have been modified. It is also possible that some previously identified methods are now obsolete. Therefore, one aim of this systematic review is to survey new and existing methods for evaluating the performance of medical test(s) in the absence of a gold standard for all or some of the study participants. It also aims to provide easy-to-use tools (flow diagrams) for selecting methods when a sub-sample of the participants does not undergo the gold standard.
The review builds upon the earlier reviews by Rutjes et al. and Reitsma et al. It also sought to identify methods developed to evaluate a medical test with continuous results in the presence of verification bias, and when the diagnostic outcome (disease status) is classified into three or more groups (e.g., diseased, intermediate, and non-diseased); this is a gap identified in the review conducted by Alonzo in 2014.
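One classic example of the methods such reviews cover is the Begg-Greenes correction for partial verification bias, which re-weights the verified subjects by the test-result margins under the assumption that verification depends only on the test result. A sketch with hypothetical study counts (this is one illustrative method, not a summary of the review's findings):

```python
def begg_greenes(n_pos, n_neg, ver_pos_dis, ver_pos, ver_neg_dis, ver_neg):
    """Begg-Greenes corrected sensitivity and specificity.

    Assumes verification depends only on the test result (missing at random).
    n_pos, n_neg: all tested subjects with T+ / T-
    ver_pos, ver_neg: verified (gold-standard) subjects among T+ / T-
    ver_pos_dis, ver_neg_dis: diseased among the verified T+ / T-
    """
    p_d_tpos = ver_pos_dis / ver_pos   # P(D+ | T+), estimated in verified T+
    p_d_tneg = ver_neg_dis / ver_neg   # P(D+ | T-), estimated in verified T-
    se = (n_pos * p_d_tpos) / (n_pos * p_d_tpos + n_neg * p_d_tneg)
    sp = (n_neg * (1 - p_d_tneg)) / (
        n_neg * (1 - p_d_tneg) + n_pos * (1 - p_d_tpos))
    return se, sp

# Hypothetical study: 300 T+ (200 verified), 700 T- (100 verified).
se, sp = begg_greenes(300, 700, ver_pos_dis=160, ver_pos=200,
                      ver_neg_dis=10, ver_neg=100)
print(f"corrected Se={se:.2f}, Sp={sp:.2f}")
```

Computing sensitivity from the verified subjects alone would overstate it here, because test-positive subjects were verified far more often than test-negative ones; the correction restores the margins of the full tested cohort.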
*Corresponding author: Siamak Sabour, Department of Clinical Epidemiology, School of Health, Safety Promotion and Injury Prevention Research Center, Shahid Beheshti University of Medical Sciences, Chamran Highway, Velenjak, Daneshjoo Blvd, Tehran, Iran
Email: s.sabour@sbmu.ac.ir