Comparison of all-or-none and weighted average scoring methods in practice performance measurement.

Weng W, Hess BJ, Lipner RS. — American Board of Internal Medicine

Presented: AcademyHealth Annual Research Meeting, June 2011

Research Objective: There is a chasm between ideal patient care and the reality of current medical care. Researchers often use the all-or-none method to measure ideal or excellent care. By contrast, the weighted average method, a compensatory model that gives partial credit for care that may not be perfect, allows researchers to measure care on a continuum and to identify outliers with poor performance. This study compares the two methods on reliability, concordance of physician rankings, and ability to identify bottom and top performers.

Study Design: The ABIM Diabetes Practice Improvement Module (PIM) was used to collect medical record data for 10 clinical measures (intermediate outcomes and processes). Two physician performance composites were created: one using the all-or-none method and one using a weighted average method developed by an expert panel. The all-or-none composite reports the percentage of a physician's patients meeting all performance measures. The weighted average composite calculates the percentage of times each measure was met, and then takes a weighted average across all measures for each physician. A bootstrap procedure was used to estimate the reliability of each composite: patient data were resampled using 1,000 bootstrap samples per physician. Physicians were grouped into deciles by their performance rankings on each composite, and the concordance of the two composite rankings was examined.
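The two composite definitions can be illustrated with a minimal Python sketch. The function names, the toy patient data, and the weights are all hypothetical; the expert panel's actual weights are not given in this abstract.

```python
from typing import Sequence


def all_or_none(patients: Sequence[Sequence[bool]]) -> float:
    """Fraction of a physician's patients who met every measure."""
    return sum(all(p) for p in patients) / len(patients)


def weighted_average(patients: Sequence[Sequence[bool]],
                     weights: Sequence[float]) -> float:
    """Per-measure pass rates, combined as a weighted mean across measures."""
    n = len(patients)
    rates = [sum(p[j] for p in patients) / n for j in range(len(weights))]
    return sum(w * r for w, r in zip(weights, rates)) / sum(weights)


# Toy data for one physician: 4 patients x 3 measures (illustrative only,
# not study data). Each row is one patient; True means the measure was met.
patients = [
    [True, True, True],
    [True, False, True],
    [False, True, True],
    [True, True, False],
]
weights = [2.0, 1.0, 1.0]  # hypothetical weights, assumed for illustration

aon = all_or_none(patients)
wavg = weighted_average(patients, weights)
print(f"all-or-none = {aon:.2f}, weighted average = {wavg:.2f}")
```

In this toy case only one of four patients meets every measure, so the all-or-none composite is low even though each individual measure is met for most patients, which is the partial-credit difference the weighted average is meant to capture.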

Population Studied: From October 2005 to March 2010, 2,822 physicians completed the Diabetes PIM with at least 10 patients, yielding 63,869 patient chart reviews.

Principal Findings: A total of 982 physicians (35%) had no patient meeting all 10 measures, resulting in a value of 0 for their all-or-none composite. The means of the all-or-none and weighted average composites were 0.10 (SD=0.12) and 0.71 (SD=0.13), respectively, and their reliabilities were 0.78 and 0.91, respectively. Of the 282 physicians in the top (10th) decile of the all-or-none composite, 200 were in the 10th decile of the weighted average composite, 63 were in the 9th decile, and 19 were in the next three deciles. Therefore, the concordance between the two composites in identifying top performers was high. On the other hand, the all-or-none composite left 982 physicians tied at the bottom of its ranking with a score of 0. Those physicians were spread across all 10 deciles of the weighted average composite, with 247, 202 and 170 physicians in the three bottom deciles and 132 physicians in the top five deciles. Therefore, there was large discordance when identifying bottom performers.
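The decile comparison can be sketched in Python as follows. This is an illustration under stated assumptions, not the study's procedure: the scores are made-up toy values for 20 physicians, and ties are broken by input order because the abstract does not say how tied scores were handled.

```python
from collections import Counter


def deciles(scores):
    """Assign each score a decile from 1 (lowest) to 10 (highest) by rank.

    Ties are broken by input order, an arbitrary choice made for this sketch.
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])  # stable sort
    out = [0] * n
    for rank, i in enumerate(order):
        out[i] = rank * 10 // n + 1
    return out


# Toy composite scores for 20 physicians (illustrative only, not study data).
aon = [0.0] * 8 + [0.05, 0.08, 0.10, 0.12, 0.15, 0.18,
                   0.20, 0.25, 0.30, 0.35, 0.40, 0.50]
wavg = [0.40, 0.75, 0.55, 0.82, 0.60, 0.45, 0.70, 0.50, 0.62, 0.65,
        0.68, 0.72, 0.74, 0.76, 0.78, 0.80, 0.84, 0.86, 0.90, 0.95]

aon_dec = deciles(aon)
wavg_dec = deciles(wavg)

# Cross-tabulation: how many physicians fall in each (aon, wavg) decile pair.
tab = Counter(zip(aon_dec, wavg_dec))

# Where do physicians with an all-or-none score of 0 land on the other scale?
zero_spread = Counter(d for d, s in zip(wavg_dec, aon) if s == 0.0)

print("in top decile on both composites:", tab[(10, 10)])
print("weighted-average deciles of zero scorers:", dict(sorted(zero_spread.items())))
```

Looking up the physicians tied at zero separately, rather than relying on their all-or-none decile, avoids the problem that a large tied group cannot be split meaningfully into deciles.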

Conclusions: The weighted average composite had higher reliability than the all-or-none composite (0.91 vs. 0.78); hence it could more accurately replicate the measurement with a different set of patients. Both methods identified largely the same physicians as top performers, but they differed considerably for the bottom performers. The all-or-none method was not able to distinguish very poor performers from physicians providing minimally competent care.
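One way to see what a bootstrap reliability estimate involves is the Python sketch below. The abstract does not give the reliability formula, so this assumes a common signal-to-total variance definition: 1 minus the ratio of the average within-physician bootstrap variance to the observed between-physician variance. The function names, the equal weights, and the simulated data are all hypothetical.

```python
import random
from statistics import mean, pvariance


def all_or_none(patients):
    """Fraction of patients who met every measure."""
    return sum(all(p) for p in patients) / len(patients)


def equal_weight_average(patients):
    """Mean per-measure pass rate; equal weights stand in for the panel's weights."""
    n_measures = len(patients[0])
    rates = [sum(p[j] for p in patients) / len(patients) for j in range(n_measures)]
    return mean(rates)


def bootstrap_reliability(physicians, score, n_boot=1000, seed=0):
    """Assumed definition: 1 - (mean bootstrap error variance / observed variance)."""
    rng = random.Random(seed)
    observed = [score(patients) for patients in physicians]
    total_var = pvariance(observed)
    if total_var == 0:
        return 0.0  # no between-physician spread to measure reliably
    error_vars = []
    for patients in physicians:
        # Resample this physician's patients with replacement, n_boot times.
        reps = [score(rng.choices(patients, k=len(patients))) for _ in range(n_boot)]
        error_vars.append(pvariance(reps))
    return max(0.0, 1.0 - mean(error_vars) / total_var)


# Simulated data: 30 physicians x 20 patients x 10 measures (not study data).
data_rng = random.Random(1)
physicians = []
for k in range(30):
    p_meet = 0.3 + 0.02 * k  # hypothetical per-measure success probability
    physicians.append([[data_rng.random() < p_meet for _ in range(10)]
                       for _ in range(20)])

rel_aon = bootstrap_reliability(physicians, all_or_none, n_boot=200)
rel_wavg = bootstrap_reliability(physicians, equal_weight_average, n_boot=200)
print(f"all-or-none: {rel_aon:.2f}, weighted average: {rel_wavg:.2f}")
```

The sketch uses 200 resamples to keep the toy run short; the study used 1,000 bootstrap samples per physician.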

Implications for Policy, Delivery or Practice: When trying to determine whether competent care has been provided for the set of diabetes care quality measures, the weighted average method (or a similar compensatory method) is preferable to the all-or-none method because of its higher reliability and its ability to identify both bottom and top performers.

For more information about this presentation, please contact Research@abim.org.