Research Article - (2021) Volume 18, Issue 9
Julie J Lanz1, Paul Gregory2* and Larry Harmon2
1Department of Psychology, University of Nebraska at Kearney, Kearney, NE 68849, United States
2Physicians Development Program-Inc / PULSE 360 Program, 2000 South Dixie Highway, Suite 103, Miami, FL 33133, United States
Received Date: September 06, 2021; Accepted Date: September 20, 2021; Published Date: September 27, 2021
Objective: With a growing number of foreign-trained physicians joining the United States workforce, there is a need to assess their job performance. The purpose of this study was to explore the potential for rater biases in a 360-degree, multisource competency assessment for physicians related to demographic protected class status (gender, country of training, native language, and age).
Methods: We conducted a non-experimental retrospective analysis on physicians working in the United States (n=258) who participated in a physician assessment and education program.
Results: There were no significant differences in raters’ perceptions based on demographic differences, either at the scale level (overall teamwork, motivating or discouraging behaviors, technical practice, and patient interactions) or at the item level.
Conclusion: The PULSE 360 can be used to evaluate physician performance without bias concerns due to demographic differences including gender, country of physician medical training, physician native language, and age.
Multisource performance appraisal; 360-degree survey; Multisource survey; Multisource rater feedback; Interpersonal and communication skills; Non-United States-trained physician; United States-trained physician
In 2017, foreign-trained physicians made up over 25% of practicing physicians in the United States [1]. Given their substantial contributions to the U.S. healthcare system [2,3], assessing the performance of diverse physicians without bias is critical. Recent policy recommendations aim to promote the benefits of employing international health care workers, for example by treating them transparently and fairly [4]. However, how the performance of U.S.-trained physicians (USTPs) compares with that of non-U.S.-trained physicians (NUSTPs) is not well understood [5]. Exploring soft skills like professionalism, interpersonal and communication skills, teamwork, and patient interactions is critical, and physician competency assessments should be unbiased regardless of demographic differences like age, gender, and nationality. The purpose of this paper is to explore potential bias in a multisource competency assessment program when evaluating U.S. and non-U.S.-trained physicians working in the United States. While previous studies on physician diversity have explored where physicians were educated or trained (international medical graduates), in this study, NUSTP refers to physicians who completed their residencies outside the United States.
Maintaining quality patient care is important because 6-12% of physicians are referred to remediation for poor clinical skills [6]. The Institute of Medicine estimates that physician dyscompetence is one contributor to preventable medical errors at an estimated cost of $17 billion [6,7]. To determine which physicians should be referred for dyscompetence, one model for performance remediation starts with an assessment of the physician’s competence [8]. Reasons for evaluating physician performance range from appraisal to recertification, identifying high-risk physicians, and remediating those with a previous history of poor performance [9]. A common framework for maintaining physician competency is the Maintenance of Certification (MOC) program developed by the American Board of Medical Specialties (ABMS). This four-part framework includes maintaining licensure, lifelong learning, cognitive expertise, and quality improvement [10]. The value of an assessment to support lifelong learning and quality improvement is underscored by Hawkins et al. [10], who call for more research into the validity of multisource feedback. For example, assessment methods must be evaluated for their potential to discriminate based on protected classes. Assessment of physician competencies regardless of gender, country of training, native language, and age is a key first step towards improving job performance.
Multisource, also known as 360-degree, feedback is the use of physicians’ team members (other providers, nurses, and staff) to evaluate job performance [11,12]. Multisource feedback is an important component of professional development for physicians, and it can lead to performance improvement up to five years later [13]. The scope and depth of multisource feedback is valuable given the argument that patient evaluations of physician performance are subjective at best [14]. For example, patient evaluations have been found to be influenced by the race and gender of the physician such that only physicians who were white and male benefited from a customer satisfaction judgment, even after controlling for objective measures of performance [15]. Beyond clinical skills, physician performance is based on a combination of individual differences including specialty area, gender, and age [16,17].
Evidence suggests that biases against international medical graduates may lead to more complaints against physicians and disciplinary outcomes [18], but findings on biased physician performance evaluations are mixed [19]. Although the evidence is inconclusive, having two examiners appears to mitigate potential gender or ethnic biases against physicians who are being evaluated based on their clinical performance [20]. Some research has explored the use of multi-rater assessments on international medical graduates and found them reliable [21], but little research has examined bias in physician assessment as a function of training country (i.e., USTPs versus NUSTPs). Of the research on assessing physician performance, one experiment found that after holding education, experience, and personality constant, international medical graduates were rated more poorly than those who had been born in the prospective patients’ home country. However, physicians who had been trained in an industrialized and high-income country benefited on their evaluations [22]. There are no significant differences in mortality rates for international versus national practitioners, but differences may exist in regard to the soft skills of communication, teamwork, and ethical issues [23,24]. Part of this bias may be a function of the examiners themselves [25]. In one study, international medical graduates had lower mortality rates than U.S. medical graduates [26]. Further, there is evidence that in Canada, international medical graduates are disciplined for misconduct more frequently than North American medical graduates [27]. In Australia, international medical graduates receive more complaints and disciplinary adverse findings [18]. Thus, there is a critical need for unbiased tools to evaluate physicians on their job performance. Given the mixed findings on the effects of gender, country of training, language, and age on performance, the following is predicted:
Hypothesis 1: There will be significant differences in PULSE 360 physician performance (as rated by colleagues) based on gender such that women will have higher scores.
Hypothesis 2: There will be significant differences in PULSE 360 physician performance based on country where training occurred such that USTPs will have higher scores.
Hypothesis 3: There will be significant differences in PULSE 360 physician performance based on first language spoken such that native English speakers will have higher scores.
Hypothesis 4: There will be significant differences in PULSE 360 physician performance based on age such that younger physicians will have higher scores.
Design
A non-experimental retrospective analysis of data was conducted on 258 physicians who participated in a physician competency evaluation program.
Statistical Analyses
For hypotheses 1-4, we conducted a post-hoc power analysis using G*Power v. 3.0.10 to determine the sample size needed to detect significant effects [28]. Given a two-tailed independent samples t-test with a medium effect size by Cohen's conventions (d = 0.5, α = .05), the recommended sample size is 210 participants. Independent samples t-tests were conducted in order to evaluate potential biases in PULSE 360 scale scores due to demographic differences in gender, country in which residency occurred, first language spoken, and age (Tables 1-4).
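The sample-size reasoning above can be sketched with a normal approximation. This is a minimal illustration, not the G*Power computation itself: the function name `n_per_group` is hypothetical, and the power values are assumptions, since the text reports only d = 0.5 and α = .05. The approximation slightly underestimates the exact noncentral-t result that G*Power would give.

```python
import math
from statistics import NormalDist


def n_per_group(d: float, alpha: float, power: float) -> int:
    """Approximate per-group n for a two-tailed, two-sample t-test.

    Normal approximation: n = 2 * ((z_(1-alpha/2) + z_power) / d) ** 2.
    """
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-tailed test
    z_power = z.inv_cdf(power)          # quantile matching the target power
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)


# d and alpha come from the text; the power values are ASSUMED for illustration.
for assumed_power in (0.80, 0.95):
    n = n_per_group(d=0.5, alpha=0.05, power=assumed_power)
    print(f"power={assumed_power:.2f}: about {n} per group, {2 * n} in total")
```

Under an assumed power of .95, the approximation lands within a few participants of the 210 reported, which suggests (but does not confirm) that this was the power level used.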
PULSE 360 Scale Score | Male (n=206) M | Male (n=206) SD | Female (n=52) M | Female (n=52) SD | t (df=256) | p |
---|---|---|---|---|---|---|
Teamwork Index Score | 66.5 | 24.0 | 65.0 | 23.4 | .401 | .69 ns |
Motivating Behavior Score | 81.4 | 10.2 | 81.2 | 10.2 | .153 | .88 ns |
Discouraging Behavior Score | 28.2 | 9.9 | 29.2 | 9.4 | -.625 | .53 ns |
Technical Practice Score | 86.2 | 10.3 | 87.9 | 8.3 | -1.14 | .25 ns |
Patient Interaction Score | 87.1 | 9.7 | 87.5 | 9.1 | -.225 | .82 ns |
ns = not significant. *All post hoc comparisons for between group differences were non-significant for all scale scores.
Table 1: t-Test comparison of mean PULSE 360 scale scores by gender.
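As a check on how the tabled statistics relate to one another, a t value can be recomputed from the reported means, standard deviations, and group sizes. The sketch below assumes an equal-variance (Student's) t-test, which is consistent with the reported df of n1 + n2 - 2 but is not stated explicitly in the text; the helper name `pooled_t` is hypothetical.

```python
import math


def pooled_t(m1, sd1, n1, m2, sd2, n2):
    """Student's t (equal variances ASSUMED) from group means, SDs, and sizes."""
    df = n1 + n2 - 2
    # Pooled variance weights each group's variance by its degrees of freedom.
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))  # SE of the mean difference
    return (m1 - m2) / se, df


# Teamwork Index Score row of Table 1 (male vs. female physicians).
t, df = pooled_t(66.5, 24.0, 206, 65.0, 23.4, 52)
print(f"t({df}) = {t:.2f}")
```

Because the tabled means and standard deviations are rounded to one decimal place, a recomputed t will match the reported value only approximately.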
PULSE 360 Scale Score | Trained in US (n=157) M | Trained in US (n=157) SD | Not trained in US (n=97) M | Not trained in US (n=97) SD | t (df=252) | p |
---|---|---|---|---|---|---|
Teamwork Index Score | 65.3 | 22.9 | 67.5 | 25.6 | -.712 | .48 ns |
Motivating Behavior Score | 81.1 | 9.6 | 81.7 | 11.1 | -.433 | .67 ns |
Discouraging Behavior Score | 28.9 | 9.6 | 27.7 | 10.2 | .946 | .35 ns |
Technical Practice Score | 86.0 | 9.7 | 87.3 | 10.5 | -.952 | .34 ns |
Patient Interaction Score | 87.0 | 9.1 | 87.4 | 10.3 | -.324 | .75 ns |
Missing data: n=4. ns = not significant. *All post hoc comparisons for between group differences were non-significant for all scale scores.
Table 2: t-Test comparison of mean PULSE 360 scale scores by trained in US status.
PULSE 360 Scale Score | Native English speaker (n=166) M | Native English speaker (n=166) SD | Non-native English speaker (n=84) M | Non-native English speaker (n=84) SD | t (df=248) | p |
---|---|---|---|---|---|---|
Teamwork Index Score | 66.4 | 23.3 | 66.8 | 24.8 | -.122 | .90 ns |
Motivating Behavior Score | 81.4 | 9.9 | 81.5 | 10.7 | -.032 | .98 ns |
Discouraging Behavior Score | 28.3 | 9.6 | 28.1 | 9.8 | .207 | .84 ns |
Technical Practice Score | 86.6 | 9.5 | 86.1 | 10.9 | .359 | .72 ns |
Patient Interaction Score | 87.3 | 9.3 | 87.4 | 9.9 | -.031 | .98 ns |
Missing data: n=8. ns = not significant. *All post hoc comparisons for between group differences were non-significant for all scale scores.
Table 3: t-Test comparison of mean PULSE 360 scale scores by native English speaker status.
PULSE 360 Scale Score | 62 or younger (n=126) M | 62 or younger (n=126) SD | 63 or older (n=125) M | 63 or older (n=125) SD | t (df=249) | p |
---|---|---|---|---|---|---|
Teamwork Index Score | 67.7 | 22.6 | 64.5 | 24.5 | 1.06 | .290 ns |
Motivating Behavior Score | 82.0 | 9.9 | 80.6 | 10.2 | 1.10 | .272 ns |
Discouraging Behavior Score | 27.8 | 9.0 | 28.9 | 10.3 | -0.92 | .357 ns |
Technical Practice Score | 86.4 | 8.9 | 87.8 | 9.1 | .275 | .783 ns |
Patient Interaction Score | 87.8 | 9.1 | 86.5 | 9.8 | 1.13 | .257 ns |
Missing data: n=7. ns = not significant. *All post hoc comparisons for between group differences were non-significant for all scale scores.
Table 4: t-Test comparison of mean PULSE 360 scale scores by age range.
Participants
Eighty percent of the physicians were male (n=206), 61% were trained in the United States (n=157), 64% had English as their first language (n=166), the average physician age was 61 (range 41-84), and 78% were board certified (n=202). Age was median split into two groups: 1) 62 or younger (n=126, 49%) and 2) 63 or older (n=125, 48%); age data were missing for n=7 (3%). Thirty different specialties were represented within the sample of physicians, including internal medicine, obstetrics and gynecology, surgery, and anesthesiology. All physicians in the sample were referred to an evaluation conducted through a physician competency assessment program in the United States.
PULSE 360 Surveys
This study investigates the use of the PULSE 360 Survey to evaluate physician performance across protected classes. The PULSE 360 program has provided multisource feedback assessments for physicians since 2001, with over 15,000 unique healthcare professionals participating in the program and receiving feedback from over a million completed surveys. The original PULSE 360 Survey was developed with the help of subject matter experts (SMEs) including senior physician leaders, quality experts, nurses, and other healthcare leaders to determine the behaviors that they believed were most associated with physicians providing a high quality of patient care. This led to the creation of over 100 behavioral rating items, which have been revised through years of item analyses and outcome studies to the most commonly used survey today, which consists of about 25 behavioral items. PULSE conducts assessments for academic medical centers, community hospitals, and other healthcare organizations throughout the US and Canada.
The PULSE 360 Survey is an assessment of leadership, teamwork, professionalism, interpersonal and communication skills, and other physician behaviors based on multisource feedback from other physicians, advanced practice providers, clinical and administrative staff, and trainees who work with a physician. The survey used with the current sample included 25 items and is made up of 5 performance domains, including a total composite performance score known as the Teamwork Index (TI) Score. All items are scored using a 5-point Likert-type scale regarding the extent to which raters perceive that a physician engages in a target behavior. The internal consistency reliability estimates for the performance domains are as follows: TI Score (25 items) α = .92, Motivating Behavior Score (9 items) α = .85, Discouraging Behavior Score (7 items) α = .84, Technical Practice Score (5 items) α = .82, and Patient Interaction Score (4 items) α = .79. TI scores typically range from 0 to 100 with a national mean score of 68.9 for physicians, while the other scale scores (Motivating, Discouraging, Technical Practice, and Patient Interaction) range from 20 to 100 based on a proprietary scale calculation we use to standardize the data. Prior research has demonstrated both the internal and external validity of PULSE 360 item and scale scores in relation to important physician outcomes such as malpractice risk and patient satisfaction [12,29-34].
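The internal consistency estimates reported above are Cronbach's alpha coefficients. For readers unfamiliar with the statistic, the following is a minimal sketch of the standard formula, α = k/(k-1) × (1 - Σ item variances / variance of total scores). The function name `cronbach_alpha` and the ratings are hypothetical and are not drawn from the study data or from the proprietary PULSE 360 scoring.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(rows[0])
    items = list(zip(*rows))  # one tuple of scores per item
    item_var_sum = sum(variance(col) for col in items)
    total_var = variance([sum(r) for r in rows])  # variance of summed scores
    return k / (k - 1) * (1 - item_var_sum / total_var)


# HYPOTHETICAL ratings: 5 raters x 4 items on a 5-point scale (not study data).
ratings = [
    [4, 5, 4, 4],
    [3, 3, 2, 3],
    [5, 5, 5, 4],
    [2, 3, 2, 2],
    [4, 4, 3, 4],
]
print(round(cronbach_alpha(ratings), 2))
```

Alpha approaches 1 as items covary more strongly, which is why longer scales with related items, such as the 25-item TI Score, tend to show higher values than short ones.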
Data Analyses
The PULSE 360 survey item data are collected at the ordinal level of measurement, while the scale scores created from these data are interval-level data. We opted to perform parametric analyses (independent samples t-tests) because the observed data demonstrated an approximately normal distribution. However, given the ordinal nature of the item-level data, we also conducted non-parametric chi-square comparisons of the expected distribution of scores for each hypothesis. We report the results of the parametric analyses only because the non-parametric analyses produced the same results and conclusions at both the scale and item levels.
Hypothesis 1 was not supported; there were no significant differences in the mean PULSE 360 scores for male vs. female physicians (Table 1; see Tables 1-4 for all independent samples t-test results). Hypothesis 2 was not supported; there were no significant differences in the mean PULSE 360 scores for US-trained vs. foreign-trained physicians (Table 2). Hypothesis 3 was not supported; there were no significant differences in the mean PULSE 360 scores for native English speakers vs. non-native English speakers (Table 3). Hypothesis 4 was not supported; there were no significant differences in the mean PULSE 360 scores between age groups (Table 4). Additionally, all post hoc comparisons at the item level yielded no significant differences in mean scores for any of the comparison groups mentioned in Hypotheses 1-4.
Physicians are referred for performance problems for a variety of reasons (e.g., behavioral concerns, professionalism, physical illness, substance abuse, and missing clinical skills). Because physician dyscompetence affects patient safety, addressing these issues effectively is important. Overall, programs aimed at improving physician performance have shown mixed results because there is a lack of standardization, and more outcome-based research is needed [35]. Furthermore, it is crucial to establish that whatever tool is used to assess behavioral performance is not subject to bias from raters due to the demographic characteristics of the physicians receiving feedback.
This study compared the performance of USTPs and NUSTPs in a competency evaluation program using a multisource assessment tool. Given the growing demand for NUSTPs in the United States and potential bias in how their performance is assessed, there is a need to fairly select, train, and support diverse groups of physicians [34]. Physicians in this study were evaluated on their performance using the PULSE 360 Survey, and they were compared across gender, country of training, native language, and age. These demographic characteristics in theory should not be related to others’ perceptions regarding a physician’s competence. There were no significant differences in physicians’ reported performance on professionalism, teamwork, motivating behaviors, discouraging behaviors, technical practice style, or patient interactions.
Understanding actual and measured differences in physician performance is complex. Evaluators do not seem to prefer same-sex or same-ethnicity candidates [19]. In UK samples, women appear to outperform men [36]. In particular, women have higher scores than men on clinical skills assessments in some areas like women’s health [37]. However, U.S. findings remain mixed. Some U.S. studies indicate no differences between men and women [38] while others suggest that women outperform men [39]. Beyond gender differences, there is scarce research on whether and why USTPs and NUSTPs show differences in performance. Physicians who trained or were certified in a country other than the U.S. or Canada seem to perform worse than physicians who were trained, certified, and work in the United States [40]. Given these mixed findings, programs that provide individualized feedback and training for NUSTPs based on their competencies can provide the support needed for physicians to be successful on the job [41].
Ultimately, maintaining physician competency as a function of skills rather than demographic characteristics is critical for healthcare organizations. Bourgeois-Law, Teunissen and Regehr [42] recommend considering each physician in need of remediation as a unique individual and not categorizing them based on demographic or sociocultural constructs. Thus, the use of valid, reliable, and non-discriminatory multisource feedback tools is critical.
The physicians in our current sample were recruited to a physician assessment program for a variety of reasons that may not be representative of all practicing physicians in the United States [43]. Further research will be needed to explore these findings more thoroughly, but at least within our sample, there were no significant variations in feedback scoring patterns attributable to protected class membership.
Providing non-United States-trained physicians with the tools needed to be successful is critical. The use of 360-degree, multisource feedback may provide a more comprehensive and unbiased assessment of others’ perceptions of physician behavior and performance within the healthcare team than traditional single-source methods of feedback. Future research will need to address the impact of protected class status more directly on raters’ perceptions of physician behavior. There are no theoretical rationales to explain potential differences in how physicians’ performance of technical or non-technical skills is perceived by others as it relates to protected class status. Therefore, any assessment of behavioral performance should be able to provide reliable assessment scores regardless of protected class status for physicians who otherwise are expected to show similar performance.
1. Two of the authors (PG, LH) are employees of the Physicians Development Program Inc / PULSE 360 Program, Miami, FL, USA.