Journal of clinical epidemiology
Any attempt to generalize the performance of a subjective diagnostic method should take into account the sampling variation in both cases and readers. Most current measures of test performance, especially indices of reliability, address only the variation among cases and are therefore unsuitable for generalizing results across the population of readers. We examined the effect of reader variation on two measures of multireader reliability: pairwise agreement and Fleiss' kappa. ⋯ Most current agreement studies are likely limited by the number of readers and are unlikely to produce a reliable estimate of reader agreement.
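
For readers unfamiliar with the two measures named above, the following is a minimal sketch of how they are commonly computed. It uses the standard textbook definitions, not necessarily the exact procedure of the study. The function names, the `ratings` layout (one row per case, one label per reader), and the toy data are illustrative assumptions, and every case is assumed to be rated by the same number of readers.

```python
from collections import Counter
from itertools import combinations


def pairwise_agreement(ratings):
    """Mean proportion of agreeing reader pairs, averaged over all cases."""
    n_readers = len(ratings[0])
    pairs = list(combinations(range(n_readers), 2))
    agreeing = sum(case[a] == case[b] for case in ratings for a, b in pairs)
    return agreeing / (len(pairs) * len(ratings))


def fleiss_kappa(ratings):
    """Fleiss' kappa for a fixed number of readers per case."""
    n_cases = len(ratings)
    n_readers = len(ratings[0])
    counts = [Counter(case) for case in ratings]  # category counts per case

    # Observed agreement: mean over cases of the per-case agreement P_i.
    p_bar = sum(
        (sum(c * c for c in cnt.values()) - n_readers)
        / (n_readers * (n_readers - 1))
        for cnt in counts
    ) / n_cases

    # Chance agreement: sum of squared overall category proportions.
    totals = Counter()
    for cnt in counts:
        totals.update(cnt)
    p_e = sum((t / (n_cases * n_readers)) ** 2 for t in totals.values())

    # Undefined (division by zero) when all ratings fall in one category.
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical toy data: 4 cases, each rated by the same 3 readers.
ratings = [
    ["pos", "pos", "pos"],
    ["pos", "pos", "neg"],
    ["neg", "neg", "neg"],
    ["neg", "pos", "neg"],
]

print(pairwise_agreement(ratings))
print(fleiss_kappa(ratings))
```

With a fixed set of readers rating every case, the mean pairwise agreement coincides with the observed-agreement term of Fleiss' kappa. Both values depend on which readers happen to be sampled, which is the source of variation the abstract is concerned with.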