Assessing Physician Performance Using 360-degree Multisource Surveys: Do Biases Exist Due to Gender, Country of Training, Native Language, and Age?

Julie J Lanz; Paul Gregory; Larry Harmon

doi:10.36648/2049-5471.21.18.259

Research Article - (2021) Volume 18, Issue 9

Julie J Lanz¹, Paul Gregory²* and Larry Harmon²

¹Department of Psychology, University of Nebraska at Kearney, Kearney, NE 68849, United States

²Physicians Development Program Inc / PULSE 360 Program, 2000 South Dixie Highway, Suite 103, Miami, FL 33133, United States

*Corresponding Author: Paul Gregory
Physicians Development Program Inc / PULSE 360 Program, 2000 South Dixie Highway, Suite 103, Miami, FL 33133, United States
Tel: +305-285-8900 x202
E-mail: paul@pdpflorida.com

Received Date: September 06, 2021; Accepted Date: September 20, 2021; Published Date: September 27, 2021

Visit for more related articles at Diversity & Equality in Health and Care

Abstract

Objective: With a growing number of foreign-trained physicians joining the United States workforce, there is a need to assess their job performance. The purpose of this study was to explore the potential for rater biases in a 360-degree, multisource competency assessment for physicians related to demographic protected class status (gender, country of training, native language, and age).

Methods: We conducted a non-experimental retrospective analysis on physicians working in the United States (n=258) who participated in a physician assessment and education program.

Results: There were no significant differences in raters’ perceptions at the scale level (overall teamwork, motivating or discouraging behaviors, technical practice, and patient interactions) or at the item level based on demographic differences.

Conclusion: The PULSE 360 can be used to evaluate physician performance without bias concerns due to demographic differences including gender, country of physician medical training, physician native language, and age.

Keywords

Multisource performance appraisal; 360-degree survey; Multisource survey; Multisource rater feedback; Interpersonal and communication skills; Non-United States-trained physician; United States-trained physician

Introduction

In 2017, foreign-trained physicians made up over 25% of practicing physicians in the United States [1]. Given their substantial contributions to the U.S. healthcare system [2,3], assessing the performance of diverse physicians without bias is critical. Policy suggestions have been developed recently to promote the benefits of employing international health care workers, such as treating them transparently and fairly [4]. However, how the performance of U.S.-trained physicians (USTPs) compares with that of non-U.S.-trained physicians (NUSTPs) is not well understood [5]. Exploring soft skills like professionalism, interpersonal and communication skills, teamwork, and patient interactions is critical, and physician competency assessments should be unbiased regardless of demographic differences like age, gender, and nationality. The purpose of this paper is to explore potential bias in a multisource competency assessment program when evaluating U.S.-trained and non-U.S.-trained physicians working in the United States. While previous studies on physician diversity have explored where physicians were educated or trained (international medical graduates), in this study, NUSTP refers to physicians who completed their residencies outside the United States.

Assessing Physician Competence

Maintaining quality patient care is important because 6-12% of physicians are referred to remediation for poor clinical skills [6]. The Institute of Medicine estimates that physician dyscompetence is one contributor to preventable medical errors at an estimated cost of $17 billion [6,7]. To determine which physicians should be referred for dyscompetence, one model for performance remediation starts with an assessment of the physician’s competence [8]. Reasons for evaluating physician performance range from appraisal to recertification, identifying high-risk physicians, and remediating those with a previous history of poor performance [9]. A common framework for maintaining physician competency is the Maintenance of Certification program developed by the American Board of Medical Specialties (ABMS MOC). This four-part framework includes maintaining licensure, lifelong learning, cognitive expertise, and quality improvement [10]. The value of an assessment to support lifelong learning and quality improvement is underscored by Hawkins et al. [10], who call for more research into the validity of multisource feedback. For example, assessment methods must be evaluated for the potential to discriminate based on protected classes. Assessing physician competencies without regard to gender, country of training, native language, or age is a key first step towards improving job performance.

Multisource Feedback

Multisource feedback, also known as 360-degree feedback, is the use of physicians’ team members (other providers, nurses, and staff) to evaluate job performance [11,12]. Multisource feedback is an important component of professional development for physicians, and it can lead to performance improvement up to five years later [13]. The scope and depth of multisource feedback are valuable given the argument that patient evaluations of physician performance are subjective at best [14]. For example, patient evaluations have been found to be influenced by the race and gender of the physician such that only physicians who were white and male benefited from a customer satisfaction judgment, even after controlling for objective measures of performance [15]. Beyond clinical skills, physician performance is based on a combination of individual differences including specialty area, gender, and age [16,17].

Evidence suggests that biases against international medical graduates may lead to more complaints against physicians and disciplinary outcomes [18], but findings on biased physician performance evaluations are mixed [19]. Given the inconclusive evidence, having two examiners appears to mitigate potential gender or ethnic biases against physicians who are being evaluated based on their clinical performance [20]. Some research has explored the use of multi-rater assessments on international medical graduates and found them reliable [21], but little research has examined bias in physician assessment as a function of training country (i.e., USTPs versus NUSTPs). Of the research on assessing physician performance, one experiment found that after holding education, experience, and personality constant, international medical graduates were rated more poorly than those who had been born in the prospective patients’ home country. However, physicians who had been trained in an industrialized and high-income country benefited on their evaluations [22]. There are no significant differences in mortality rates for international versus national practitioners, but differences may exist in regard to the soft skills of communication, teamwork, and ethical issues [23,24]. Part of this bias may be a function of the examiners themselves [25]. In one study, international medical graduates had lower mortality rates than U.S. medical graduates [26]. Further, there is evidence that in Canada, international medical graduates are disciplined for misconduct more frequently than North American medical graduates [27]. In Australia, international medical graduates receive more complaints and adverse disciplinary findings [18]. Thus, there is a critical need for unbiased tools to evaluate physicians on their job performance. Given the mixed findings on the effects of gender, country of training, language, and age on performance, the following is predicted:

Hypothesis 1: There will be significant differences in PULSE 360 physician performance (as rated by colleagues) based on gender such that women will have higher scores.

Hypothesis 2: There will be significant differences in PULSE 360 physician performance based on country where training occurred such that USTPs will have higher scores.

Hypothesis 3: There will be significant differences in PULSE 360 physician performance based on first language spoken such that native English speakers will have higher scores.

Hypothesis 4: There will be significant differences in PULSE 360 physician performance based on age such that younger physicians will have higher scores.

Methods

Design

A non-experimental retrospective analysis of data was conducted on 258 physicians who participated in a physician competency evaluation program.

Statistical Analyses

For hypotheses 1-4, we conducted a post-hoc power analysis using G*Power v. 3.0.10 to determine the sample size needed to detect significant effects [28]. Given a two-tailed independent samples t-test with a medium effect size (d = 0.5, α = .05), the preferred sample size is 210 participants. Independent samples t-tests were conducted to evaluate potential biases in PULSE 360 scale scores due to demographic differences including gender, country in which residency occurred, first language spoken, and age (Tables 1-4).
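As a rough illustration of the sample-size reasoning above, the following sketch uses the normal approximation for a two-tailed, two-group comparison. The target power of 0.95 is an assumption (the text does not state it), and `n_per_group` is a hypothetical helper name. An exact calculation based on the noncentral t distribution, as G*Power performs, typically gives a slightly larger figure than this approximation.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.95):
    """Approximate per-group n for a two-tailed independent samples t-test.

    Uses the normal approximation n = 2 * ((z_alpha/2 + z_power) / d)^2.
    The default power of 0.95 is an assumption, not a value from the article.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for two-tailed alpha
    z_power = z.inv_cdf(power)          # quantile matching the target power
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)


if __name__ == "__main__":
    n = n_per_group(0.5)
    print(f"per group: {n}, total: {2 * n}")
```

Under these assumptions the approximation gives 104 physicians per group (208 in total), which is close to the 210 reported in the text.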

PULSE 360 Scale Score	Male (n=206)		Female (n=52)		Mean Comparison (df=256)
	M	SD	M	SD	t	p
Teamwork Index Score	66.5	24.0	65.0	23.4	.401	.69 (ns)
Motivating Behavior Score	81.4	10.2	81.2	10.2	.153	.88 (ns)
Discouraging Behavior Score	28.2	9.9	29.2	9.4	-.625	.53 (ns)
Technical Practice Score	86.2	10.3	87.9	8.3	-1.14	.25 (ns)
Patient Interaction Score	87.1	9.7	87.5	9.1	-.225	.82 (ns)

Note: ns = not significant. All post hoc comparisons for between-group differences were non-significant for all scale scores.

Table 1: t-test comparison of mean PULSE 360 scale scores by gender.

PULSE 360 Scale Score	Trained in the US (n=157)		Not trained in the US (n=97)		Mean Comparison (df=252)
	M	SD	M	SD	t	p
Teamwork Index Score	65.3	22.9	67.5	25.6	-.712	.48 (ns)
Motivating Behavior Score	81.1	9.6	81.7	11.1	-.433	.67 (ns)
Discouraging Behavior Score	28.9	9.6	27.7	10.2	.946	.35 (ns)
Technical Practice Score	86.0	9.7	87.3	10.5	-.952	.34 (ns)
Patient Interaction Score	87.0	9.1	87.4	10.3	-.324	.75 (ns)

Note: ns = not significant; missing data n=4. All post hoc comparisons for between-group differences were non-significant for all scale scores.

Table 2: t-test comparison of mean PULSE 360 scale scores by U.S. training status.

PULSE 360 Scale Score	Native English speaker (n=166)		Non-native English speaker (n=84)		Mean Comparison (df=248)
	M	SD	M	SD	t	p
Teamwork Index Score	66.4	23.3	66.8	24.8	-.122	.90 (ns)
Motivating Behavior Score	81.4	9.9	81.5	10.7	-.032	.98 (ns)
Discouraging Behavior Score	28.3	9.6	28.1	9.8	.207	.84 (ns)
Technical Practice Score	86.6	9.5	86.1	10.9	.359	.72 (ns)
Patient Interaction Score	87.3	9.3	87.4	9.9	-.031	.98 (ns)

Note: ns = not significant; missing data n=8. All post hoc comparisons for between-group differences were non-significant for all scale scores.

Table 3: t-test comparison of mean PULSE 360 scale scores by native English speaker status.

PULSE 360 Scale Score	62 or younger (n=126)		63 or older (n=125)		Mean Comparison (df=249)
	M	SD	M	SD	t	p
Teamwork Index Score	67.7	22.6	64.5	24.5	1.06	.290 (ns)
Motivating Behavior Score	82.0	9.9	80.6	10.2	1.10	.272 (ns)
Discouraging Behavior Score	27.8	9.0	28.9	10.3	-.92	.357 (ns)
Technical Practice Score	86.4	8.9	87.8	9.1	.275	.783 (ns)
Patient Interaction Score	87.8	9.1	86.5	9.8	1.13	.257 (ns)

Note: ns = not significant; missing data n=7. All post hoc comparisons for between-group differences were non-significant for all scale scores.

Table 4: t-test comparison of mean PULSE 360 scale scores by age range.

Participants

Eighty percent of the physicians were male (n=206), 61% were trained in the United States (n=157), 64% had English as their first language (n=166), and 78% were board certified (n=202). The average physician age was 61 (range 41-84). Age was median split into two groups: 1) 62 or younger (n=126, 49%) and 2) 63 or older (n=125, 48%); age data were missing for 7 physicians (3%). Thirty different specialties were represented within the sample, including internal medicine, obstetrics and gynecology, surgery, and anesthesiology. All physicians in the sample were referred to an evaluation conducted through a physician competency assessment program in the United States.

PULSE 360 Surveys

This study investigates the use of the PULSE 360 Survey to evaluate physician performance across protected classes. The PULSE 360 program has provided multisource feedback assessments for physicians since 2001, with over 15,000 unique healthcare professionals participating and over a million completed feedback surveys received. The original PULSE 360 Survey was developed with the help of subject matter experts (SMEs) including senior physician leaders, quality experts, nurses, and other healthcare leaders to determine the behaviors that they believed were most associated with physicians providing a high quality of patient care. This led to the creation of over 100 behavioral rating items, which have been revised through years of item analyses and outcome studies into the most commonly used survey today, which consists of about 25 behavioral items. PULSE conducts assessments for academic medical centers, community hospitals, and other healthcare organizations throughout the US and Canada.

The PULSE 360 Survey is an assessment of leadership, teamwork, professionalism, interpersonal and communication skills, and other physician behaviors based on multisource feedback from other physicians, advanced practice providers, clinical and administrative staff, and trainees who work with a physician. The survey used with the current sample included 25 items and is made up of 5 performance domains, including a total composite performance score known as the Teamwork Index (TI) Score. All items are scored using a 5-point Likert-type scale regarding the extent to which raters perceive that a physician engages in a target behavior. The internal consistency reliability estimates for the performance domains are as follows: TI Score (25 items) α = .92, Motivating Behavior Score (9 items) α = .85, Discouraging Behavior Score (7 items) α = .84, Technical Practice Score (5 items) α = .82, and Patient Interaction Score (4 items) α = .79. TI scores typically range from 0 to 100 with a national mean score of 68.9 for physicians, while the other scale scores (Motivating, Discouraging, Technical Practice, and Patient Interaction) range from 20 to 100 based on a proprietary scale calculation we use to standardize the data. Prior research has demonstrated both the internal and external validity of PULSE 360 item and scale scores in relation to important physician outcomes such as malpractice risk and patient satisfaction [12,29-34].
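For readers unfamiliar with internal consistency estimates, the sketch below shows how a coefficient of this kind is commonly computed, assuming the reported α values are Cronbach's alpha. The ratings are hypothetical 5-point responses invented for illustration; they are not PULSE 360 data, and `cronbach_alpha` is a hypothetical helper name.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of item scores.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total scores))
    """
    k = len(rows[0])                      # number of items on the scale
    items = list(zip(*rows))              # transpose: one tuple per item
    item_var_sum = sum(variance(item) for item in items)
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - item_var_sum / total_var)


# Hypothetical ratings: 5 raters x 4 items on a 5-point Likert-type scale.
ratings = [
    [4, 5, 4, 5],
    [3, 4, 3, 4],
    [5, 5, 4, 5],
    [2, 3, 2, 3],
    [4, 4, 4, 4],
]

if __name__ == "__main__":
    print(round(cronbach_alpha(ratings), 3))
```

Values closer to 1 indicate that the items on a scale vary together across raters, which is the property the reported estimates of .79 to .92 describe.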

Data Analyses

The PULSE 360 survey item data are collected at the ordinal level of measurement, while scale scores created from these data are interval-level data. We opted to perform parametric analyses (independent samples t-tests) because the observed data demonstrated an approximately normal distribution. However, given the ordinal nature of the item-level data, we also conducted non-parametric chi-square comparisons of the expected distribution of scores for each hypothesis. We report the results of the parametric analyses only because the non-parametric analyses produced the same results and conclusions at both the scale and item levels.

Results

Hypothesis 1 was not supported; there were no significant differences in the mean PULSE 360 scores for male vs. female physicians (Table 1; see Tables 1-4 for all independent samples t-test results). Hypothesis 2 was not supported; there were no significant differences in the mean PULSE 360 scores for US-trained vs. foreign-trained physicians (Table 2). Hypothesis 3 was not supported; there were no significant differences in the mean PULSE 360 scores for native English speakers vs. non-native English speakers (Table 3). Hypothesis 4 was not supported; there were no significant differences in the mean PULSE 360 scores between age groups (Table 4). Additionally, all post hoc item-level comparisons yielded no significant differences in mean scores for any of the comparison groups in Hypotheses 1-4.
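The t statistics in Tables 1-4 can be approximately recomputed from the published means, standard deviations, and group sizes. The sketch below does this for the Teamwork Index row of Table 1, assuming a pooled-variance (equal variances) independent samples t-test, which is consistent with the reported df of 256. The p-value uses a normal approximation to the t distribution, an assumption made here to keep the example dependency-free; it is reasonable only because the degrees of freedom are large. Function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def pooled_t(m1, sd1, n1, m2, sd2, n2):
    """Pooled-variance independent samples t statistic from summary data."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    se = sqrt(pooled_var * (1 / n1 + 1 / n2))  # standard error of the difference
    return (m1 - m2) / se, df


def approx_two_tailed_p(t):
    """Two-tailed p-value via the normal approximation (large df only)."""
    return 2 * (1 - NormalDist().cdf(abs(t)))


if __name__ == "__main__":
    # Teamwork Index Score, male vs. female physicians (Table 1).
    t, df = pooled_t(66.5, 24.0, 206, 65.0, 23.4, 52)
    print(f"t({df}) = {t:.3f}, p = {approx_two_tailed_p(t):.2f}")
```

Small discrepancies from the tabled values are expected because the published means and standard deviations are rounded to one decimal place.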

Discussion

Physicians are referred for performance problems for a variety of reasons (e.g., behavioral concerns, professionalism, physical illness, substance abuse, and missing clinical skills). Because physician dyscompetence affects patient safety, addressing these issues effectively is important. Overall, programs aimed at improving physician performance have shown mixed results because there is a lack of standardization, and more outcome-based research is needed [35]. Furthermore, it is crucial to establish that whatever tool is used to assess behavioral performance is not subject to bias from raters due to the demographic characteristics of the physicians receiving feedback.

This study compared the performance of USTPs and NUSTPs in a competency evaluation program using a multisource assessment tool. Given the growing demand for NUSTPs in the United States and potential bias in how their performance is assessed, there is a need to fairly select, train, and support diverse groups of physicians [34]. Physicians in this study were evaluated on their performance using the PULSE 360 Survey, and they were compared across gender, country of training, native language, and age. These demographic characteristics in theory should not be related to others’ perceptions regarding a physician’s competence. There were no significant differences in physicians’ reported performance on professionalism, teamwork, motivating behaviors, discouraging behaviors, technical practice style, or patient interactions.

Understanding actual and measured differences in physician performance is complex. Evaluators do not seem to prefer same-sex or same-ethnicity candidates [19]. In UK samples, women appear to outperform men [36]. In particular, women have higher scores than men on clinical skills assessments in some areas like women’s health [37]. However, U.S. findings remain mixed. Some U.S. studies indicate no differences between men and women [38] while others suggest that women outperform men [39]. Beyond gender differences, there is scarce research on whether and why USTPs and NUSTPs show differences in performance. Physicians who trained or were certified in a country other than the U.S. or Canada seem to perform worse than physicians who were trained, certified, and work in the United States [40]. Given these mixed findings, programs that provide individualized feedback and training for NUSTPs based on their competencies can provide the support needed for physicians to be successful on the job [41].

Ultimately, maintaining physician competency as a function of skills rather than demographic characteristics is critical for healthcare organizations. Bourgeois-Law, Teunissen and Regehr [42] recommend considering each physician in need of remediation as a unique individual and not categorizing them based on demographic or sociocultural constructs. Thus, the use of valid, reliable, and non-discriminatory multisource feedback tools is critical.

Limitations

The physicians in our current sample were recruited to a physician assessment program for a variety of reasons that may not be representative of all practicing physicians in the United States [43]. Further research will be needed to explore these findings more thoroughly, but at least within our sample, there were no significant variations in feedback scoring patterns attributable to protected class membership.

Conclusions

Providing non-United States-trained physicians with the tools needed to be successful is critical. The use of 360-degree, multisource feedback may provide a more comprehensive and unbiased assessment of others’ perceptions of physician behavior and performance within the healthcare team than traditional single-source methods of feedback. Future research will need to address the impact of protected class status on raters’ perceptions of physician behavior more directly. There is no theoretical rationale to explain potential differences in how physicians’ performance of technical or non-technical skills is perceived by others as it relates to protected class status. Therefore, any assessment of behavioral performance should be able to provide reliable assessment scores regardless of protected class status for physicians who otherwise are expected to show similar performance.

Conflict of Interest

Two of the authors (PG, LH) are employees of the Physicians Development Program Inc / PULSE 360 Program, Miami, FL, USA.

References

American Immigration Council (2018) Foreign-trained doctors are critical to serving many U.S. communities. Washington, USA.
Jolly P, Boulet J, Garrison G, Signer MM (2011) Participation in US graduate medical education by graduates of international medical schools. Acad Med 86:559-64.
Pinsky WW (2017) The importance of international medical graduates in the United States. Annals Int Med 166:840-41.
Chen PG, Auerbach DI, Muench U, Curry LA, Bradley EH (2013) Policy solutions to address the foreign-educated and foreign-born health care workforce in the United States. Health Aff 32:1906-13.
Rao NR, Yager J (2012) Acculturation, education, training, and workforce issues of IMGs: current status and future directions. Acad Psychiat 36:268-70.
William BW (2006) The prevalence and special educational requirements of dyscompetent physicians. J Contin Educ Health Prof 26:173-91.
Kohn LT, Corrigan J, Donaldson MS (1999) Committee on Quality of Health Care in America, Institute of Medicine: To err is human: Building a safer health system. National Academy Press, Washington, DC, USA.
Hauer KE, Ciccone A, Henzel TR, Katsufrakis P, Miller SH, et al. (2009) Remediation of the deficiencies of physicians across the continuum from medical school to practice: a thematic review of the literature. Acad Med 84:1822-32.
Humphrey C (2010) Assessment and remediation for physicians with suspected performance problems: An international survey. J Contin Educ Health Prof 30:26-36.
Hawkins RE, Lipner RS, Ham HP, Wagner R, Holmboe ES (2013) American Board of Medical Specialties Maintenance of Certification: Theory and evidence regarding the current framework. J Contin Educ Health Prof 33:S7-19.
Castanelli D, Kitto S (2011) Perceptions, attitudes, and beliefs of staff anesthetists related to multi-source feedback used for their performance appraisal. Br J Anaesth 107:372-7.
Lanz JJ, Gregory PJ, Menendez ME, Harmon L (2018) Dr. Congeniality: Understanding the importance of surgeons’ nontechnical skills through 360° feedback. J Surg Educ 75:984-92.
Violato C, Lockyer JM, Fidler H (2008) Changes in performance: A 5‐year longitudinal study of participants in a multi‐source feedback Programme. Med Educ 42:1007-13.
Glickman SW, Schulman KA (2013) The mis-measure of physician performance. Am J Manag Care 19:782-5.
Hekman DR, Aquino K, Owens BP, Mitchell TR, Schilpzand P, et al. (2010) An examination of whether and how racial and gender biases influence customer satisfaction. Acad Manag Ann 53:238-64.
Grace ES, Wenghofer EF, Korinek EJ (2014) Predictors of physician performance on competence assessment: Findings from CPEP, the Center for Personalized Education for Physicians. Acad Med 89:912-9.
Wenghofer EF, Williams AP, Klass DJ (2009) Factors affecting physician performance: Implications for performance improvement and governance. Healthcare Policy 5:e141-60.
Elkin K, Spittal MJ, Studdert DM (2012) Risks of complaints and adverse disciplinary findings against international medical graduates in Victoria and Western Australia. Med J Aus 197:448-52.
Denney ML, Freeman A, Wakeford R (2013) MRCGP CSA: Are the examiners biased, favouring their own by sex, ethnicity, and degree source? Br J Gen Prac 63:e718-25.
McManus IC, Elder AT, Dacre J (2013) Investigating possible ethnicity and sex bias in clinical examiners: an analysis of data from the MRCP (UK) PACES and nPACES examinations. BMC Med Educ 13:1-11.
Nair BR, Alexander HG, McGrath BP, Parvathy MS, Kilsby EC, et al. (2008) The mini clinical evaluation exercise (mini‐CEX) for assessing clinical performance of international medical graduates. Med J Aus 189:159-61.
Louis WR, Lalonde RN, Esses VM (2010) Bias against foreign‐born or foreign‐trained doctors: Experimental evidence. Med Educ 44:1241-7.
Norcini JJ, Boulet JR, Dauphinee WD, Opalek A, Krantz ID, et al. (2010) Evaluating the quality of care provided by graduates of international medical schools. Hea Aff 29:1461-8.
Slowther A, Lewando Hundt GA, Purkis J, Taylor R (2012) Experiences of non-UK-qualified doctors working within the UK regulatory framework: A qualitative study. J Royal Soc Med 105:156-67.
Yeates P, Woolf K, Benbow E, Davies B, Boohan M, et al. (2017) A randomised trial of the influence of racial stereotype bias on examiners’ scores, feedback and recollections in undergraduate clinical exams. BMC Med 15:1-11.
Tsugawa Y, Jena AB, Orav EJ, Jha AK (2017) Quality of care delivered by general internists in US hospitals who graduated from foreign versus US medical schools: Observational study. Bio Med J 356:1-8.
Alam A, Matelski JJ, Goldberg HR, Liu JJ, Klemensberg J, et al. (2017) The characteristics of international medical graduates who have been disciplined by professional regulatory colleges in Canada: A retrospective cohort study. Acad Med 92:244-9.
Faul F, Erdfelder E, Lang AG, Buchner A (2007) G*Power 3: A flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behav Res Metd 39:175-91.
Gregory PJ, Robbins B, Schwaitzberg SD, Harmon L (2017) Leadership development in a professional medical society using 360-degree survey feedback to assess emotional intelligence. Surg End Other Intervent Tech 31:3565-73.
Gregory PJ, Ring D, Rubash H, Harmon L (2018) Use of 360° feedback to develop physician leaders in orthopaedic surgery. J Surg Ortho Adv 27:5-91.
Hageman MG, Ring DC, Gregory PJ, Rubash HE, Harmon L (2015) Do 360-degree feedback survey results relate to patient satisfaction measures? Clin Orthopead Res 473:1590-7.
Hammerly ME, Harmon L, Schwaitzberg SD (2014) Good to great: Using 360-degree feedback to improve physician emotional intelligence. J Hea car Mang 59:354-66.
Lagoo J, Berry WR, Miller K, Neal BJ, Sato L, et al. (2019) Multisource evaluation of surgeon behavior is associated with malpractice claims. Annals Surg 270:84-90.
Peile E (2014) Selecting an internationally diverse medical workforce. Bio Med J 348:1-2.
Weenink JW, Kool RB, Bartels RH, Westert GP (2017) Getting back on track: A systematic review of the outcomes of remediation and rehabilitation programmes for healthcare professionals with performance concerns. Bio Med J Qual Safety 26:1004-14.
Unwin E, Potts HW, Dacre J, Elder A, Woolf K (2018) Passing MRCP (UK) PACES: a cross-sectional study examining the performance of doctors by sex and country. Bio Med Cen Med Educ 18:1-9.
Pope L, Hawkridge A, Simpson R (2014) Performance in the MRCGP CSA by candidates’ gender: Differences according to curriculum area. Educ Primary Car 25:186-93.
Violato C, Shen E, Gao H (2016) Does Step 3 of the United States Medical Licensing Exam measure clinical competence? A predictive validity study. Med Sci Educ 26:317-22.
Veloski JJ, Callahan CA, Xu G, Hojat M, Nash DB (2000) Prediction of students' performances on licensing examinations using age, race, sex, undergraduate GPAs, and MCAT scores. Acad Med 75:S28-30.
Falcone JL, Middleton DB (2013) Performance on the American Board of Family Medicine Certification examination by country of medical training. J Am Board Fam Med 26:78-81.
Wearne SM, Brown JB, Kirby C, Snadden D (2019) International medical graduates and general practice training: How do educational leaders facilitate the transition from new migrant to local family doctor? Med Teach 41:1065-72.
Bourgeois-Law G, Teunissen PW, Regehr G (2018) Remediation in practicing physicians: Current and alternative conceptualizations. Acad Med 93:1638-44.
Glickman S, Schulman K (2013) The mis-measure of physician performance. Am J Manag Car 10:782-5.