Estimating reliability and decision consistency of physician practice performance assessment.

Weng W, Arnold GK, Lynn LA, Lipner RS. — American Board of Internal Medicine

Presented: AcademyHealth Annual Research Meeting, June 2009

Research Objective: Physician-level pay-for-performance or reward programs often require patient samples of 25. However, little is known about whether this sample size provides sufficient reliability and decision consistency (i.e., consistency in whether a reward is granted). This study evaluates the reliability of individual performance measures and full profiles, the consistency of reward decisions, and the appropriateness of the sample size requirement.

Study Design: Data for 10 clinical and two patient experience measures were obtained from medical record audits and patient surveys completed as part of the Diabetes Practice Improvement Module (PIM) developed by the American Board of Internal Medicine (ABIM). First, we created physician performance profiles equivalent to Bridges to Excellence’s Diabetes Care Link program from the clinical measures, awarding physicians points if their average on a measure reached a predetermined criterion (e.g., >40% of patients). Points for all measures were summed to determine the recognition decision. Second, we applied a bootstrap procedure to estimate reliability and decision consistency for this complex assessment (e.g., measure-specific criteria, skewed measure distributions, interrelated measures with different importance weights). Patient experience measures were then added to the assessment to examine their impact on reliability and decision consistency. Finally, the analyses were replicated for a second chronic condition using data from ABIM's Hypertension PIM.
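The scoring and bootstrap steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state which estimators were used, so the reliability estimator (one minus the ratio of mean bootstrap error variance to observed between-physician variance) and the decision consistency estimator (the chance that two independent patient samples lead to the same decision) are assumptions, as are the function names, the data layout (one tuple of 0/1 measure results per patient), and the criteria, points, and cut score passed in.

```python
import random
from statistics import mean, pvariance


def profile_score(patients, criteria, points):
    """Sum the points of every measure whose patient-level rate exceeds its criterion."""
    total = 0.0
    for m, (criterion, pts) in enumerate(zip(criteria, points)):
        rate = mean(p[m] for p in patients)  # share of patients meeting measure m
        if rate > criterion:
            total += pts
    return total


def bootstrap_scores(patients, criteria, points, n_boot, rng):
    """Profile scores from n_boot resamples (with replacement) of one physician's patients."""
    return [
        profile_score(rng.choices(patients, k=len(patients)), criteria, points)
        for _ in range(n_boot)
    ]


def reliability_and_consistency(data, criteria, points, cut, n_boot=1000, seed=0):
    """Return (reliability, decision consistency) under the assumed estimators.

    data maps a physician id to a list of patient tuples (hypothetical layout).
    """
    rng = random.Random(seed)
    observed = [profile_score(p, criteria, points) for p in data.values()]
    error_vars, agreements = [], []
    for patients in data.values():
        scores = bootstrap_scores(patients, criteria, points, n_boot, rng)
        # Bootstrap variance approximates this physician's sampling error variance.
        error_vars.append(pvariance(scores))
        # Probability that two independent samples give the same reward decision.
        p_pass = mean(1.0 if s >= cut else 0.0 for s in scores)
        agreements.append(p_pass ** 2 + (1.0 - p_pass) ** 2)
    obs_var = pvariance(observed)
    # Assumed estimator: reliability = 1 - error variance / observed variance.
    reliability = 1.0 - mean(error_vars) / obs_var if obs_var > 0 else float("nan")
    return reliability, mean(agreements)
```

With real audit data, each physician's patient list would hold one tuple per chart, and the cut score could be varied to see how decision consistency changes across cut scores.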

Population Studied: Between October 2005 and October 2007, 957 physicians completed the Diabetes PIM with at least 10 patients aged 18 to 75, providing 20,131 chart audits and 18,706 patient surveys; 657 physicians completed the Hypertension PIM with at least 10 patients aged 18 to 75, providing 13,073 chart audits and 14,897 patient surveys.

Principal Findings: Chart and survey data were resampled using 1,000 bootstrap samples per physician. For the average audit sample of 21 patients per physician, reliabilities of intermediate outcome measures ranged from .51 to .58 and reliabilities of process measures ranged from .38 to .80. The full profile assessment reliability was .79, which translates to a reliability of .82 for a sample of 25 patients. The decision consistency index refers to the consistency of decisions over many measurements or patient samples; index values close to 1.0 indicate fewer false classifications. Higher decision consistencies were achieved for very low or very high cut scores. Even in the worst case, the decision consistency index was .84. Reliabilities for both patient experience measures were .56. When added to the full profile, the two experience measures increased the reliability to .81. The findings for hypertension were similar, with a full profile reliability of .81.
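The abstract does not say how the reliability of .79 at 21 patients was projected to 25 patients. One standard way to make such a projection is the Spearman-Brown prophecy formula, which is assumed here; it reproduces the reported value. The function name below is illustrative.

```python
def spearman_brown(reliability, length_factor):
    """Projected reliability when the sample is length_factor times as large."""
    return (length_factor * reliability) / (1.0 + (length_factor - 1.0) * reliability)


# Full profile reliability of .79 at an average of 21 patients, projected to 25 patients.
projected = spearman_brown(0.79, 25 / 21)
print(f"{projected:.2f}")  # prints 0.82
```

The same function can be used to explore other sample sizes by changing the ratio of the target sample size to 21.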

Conclusion: Although individual measures do not yield sufficient reliability by themselves, a full profile of about 10 measures with a sample of 25 patients per physician provides satisfactory reliability and decision consistency. Adding patient experience measures increases the reliability slightly. A sample of 25 patients achieves reasonable reliability for both conditions studied.

Implications for Policy, Practice or Delivery: The findings help clarify the reliability and decision consistency of physician reward programs and performance assessments, and the appropriateness of the sample size requirement. Bootstrap estimation is a practical method for assessing reliability and decision consistency.

For more information about this presentation, please contact Research@abim.org.