Categorization as a potential explanation for low inter-rater reliability

Gingerich A. — University of Northern British Columbia

Eva K, Regehr G. — University of British Columbia

Kogan J. — University of Pennsylvania

Holmboe ES, Conforti L. — American Board of Internal Medicine

Presented: Association of American Medical Colleges Conference, November 2012

Purpose: Rater-based assessments are widely used in medical education despite concerns about low inter-rater reliability. In research on impression formation, some inter-rater variability has been explained by the cognitive process of categorization. Researchers have conceptualized the underlying categorization process in three ways: assignment of the rated individual to (1) a pre-formed stereotype or “Label,” (2) one of four “Clusters” formed by crossing warm/cold and competent/incompetent judgments, or (3) an ad hoc “Person Model” created on the fly. This study was conducted to determine whether any of these three conceptualizations of categorization helps to explain variance in Mini-CEX ratings.

Methods: A total of 176 transcripts, collected in an earlier study from 44 raters who scored and then commented on four videotaped clinical performances, were serially categorized using each of the three conceptualized categorization processes. For each video performance, the amount of variance in rater scores accounted for by category assignment under each conceptualization was calculated using ANOVA and expressed as partial eta squared.
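As an illustration of this calculation, the sketch below computes partial eta squared for one hypothetical video using a one-way ANOVA in Python. The data, the category labels, and the use of statsmodels are assumptions for illustration only, not the study's actual analysis pipeline; note that with a single factor, partial eta squared coincides with ordinary eta squared.

    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.formula.api import ols

    # Hypothetical ratings for one video: each row is one rater's score and
    # the category that rater's transcript was assigned to under one
    # conceptualization (e.g., three prominent "Labels").
    df = pd.DataFrame({
        "score":    [4, 5, 4, 6, 7, 6, 3, 4, 3, 5, 6, 4],
        "category": ["A", "A", "A", "B", "B", "B",
                     "C", "C", "C", "A", "B", "C"],
    })

    # One-way ANOVA of score on category assignment.
    model = ols("score ~ C(category)", data=df).fit()
    anova = sm.stats.anova_lm(model, typ=2)

    # Partial eta squared for a single-factor design:
    # SS_effect / (SS_effect + SS_error).
    ss_effect = anova.loc["C(category)", "sum_sq"]
    ss_error = anova.loc["Residual", "sum_sq"]
    partial_eta_sq = ss_effect / (ss_effect + ss_error)
    print(f"Variance explained (partial eta squared): {partial_eta_sq:.0%}")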

Results: The variance explained when transcripts were categorized according to prominent “Labels” ranged from 8% to 47% across the four videos; the “Clusters” conceptualization explained 0% to 52% of the variance; and the “Person Model” conceptualization yielded categorizations that explained 25% to 43% of score variance.

Conclusions: While these data are preliminary, efforts to apply the three conceptualizations of categorization indicate that a meaningful amount of variance in rater-based assessments can be explained in most cases. More investigation is required to identify alternative variables that might complement the explanatory power of the categorizations, to determine the extent to which the three conceptualizations offer independent information as opposed to being three ways of describing the same underlying divisions, and to explore whether the variance within and across the categorizations is meaningful.

For more information about this presentation, please contact Research@abim.org