Getting started with research on AI fairness in medical imaging
FAIMI has put together this resource page to help researchers get started with research on the fairness of AI for medical imaging. We would like to see this as an organically evolving resource, so please get in touch with us if you have any suggestions for additions or modifications.
Literature
The literature on AI fairness in medical imaging is growing rapidly, and we do not attempt to provide an exhaustive list of related publications here. Rather, we list a few key references for specific areas of fairness research that can act as starting points for your own literature searches.
Review papers on AI fairness
- Du et al (2020), Fairness in Deep Learning: A Computational Perspective, IEEE Intelligent Systems. (arXiv)
- Mehrabi et al (2021), A Survey on Bias and Fairness in Machine Learning, ACM Computing Surveys. (arXiv)
- Chen et al (2023), Algorithmic Fairness in Artificial Intelligence for Medicine and Healthcare, Nature Biomedical Engineering.
- Xu et al (2024), Addressing Fairness Issues in Deep Learning-based Medical Image Analysis: A Systematic Review, npj Digital Medicine.
Seminal works on AI fairness
- Angwin et al (2016), Machine Bias, ProPublica. The COMPAS investigation, which exposed racial bias in a machine learning algorithm used to predict recidivism.
- Hardt et al (2016), Equality of Opportunity in Supervised Learning, NeurIPS. (arXiv) Introduced the concepts of equality of opportunity and equalized odds in algorithmic fairness. Characterizes trade-offs and provides optimal post-processing methods.
- Chouldechova (2017), Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments, Big Data. A re-analysis and re-discussion of the COMPAS case. The key result is that equal PPV and equal error rates are incompatible when base rates differ between groups (a short derivation follows this list).
- Buolamwini and Gebru (2018), Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification, FAccT. One of the earliest papers to uncover AI bias in image classification.
- Obermeyer et al (2019), Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations, Science. Demonstration that an algorithm actively used to distribute healthcare system resources was severely biased against Black patients.
- Barocas et al (2019), Fairness and Machine Learning: Limitations and Opportunities. A full-length book covering many aspects of algorithmic fairness.
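To make the Chouldechova (2017) result above concrete: for a binary classifier evaluated on a group with disease prevalence (base rate) $p$, the error rates and positive predictive value are tied together by the identity

$$\mathrm{FPR} \;=\; \frac{p}{1-p}\cdot\frac{1-\mathrm{PPV}}{\mathrm{PPV}}\cdot(1-\mathrm{FNR}),$$

which follows from $\mathrm{PPV} = p\,\mathrm{TPR}\,/\,\bigl(p\,\mathrm{TPR} + (1-p)\,\mathrm{FPR}\bigr)$ with $\mathrm{TPR} = 1-\mathrm{FNR}$. Hence, if two groups differ in prevalence $p$, no classifier (short of a perfect one) can equalise PPV, FPR, and FNR across them simultaneously.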
Perspectives on what constitutes AI fairness
- Corbett-Davies and Goel (2018), The Measure and Mismeasure of Fairness: A Critical Review of Fair Machine Learning, arXiv. (As of 2023, a version of this is now also in JMLR.)
- Rajkomar et al (2018), Ensuring Fairness in Machine Learning to Advance Health Equity, Annals of Internal Medicine.
- McCradden et al (2020), Ethical Limitations of Algorithmic Fairness Solutions in Health Care Machine Learning, The Lancet Digital Health.
- Sambasivan et al (2021), Re-imagining Algorithmic Fairness in India and Beyond, FAccT.
- Ricci Lara et al (2022), Addressing Fairness in Artificial Intelligence for Medical Imaging, Nature Communications.
- Petersen et al (2023), The Path Toward Equal Performance in Medical Machine Learning, Patterns.
Shortcut learning, models recognizing sensitive patient attributes, and fairness in medical AI
- Geirhos et al (2020), Shortcut Learning in Deep Neural Networks, Nature Machine Intelligence. General overview of shortcut learning in deep neural networks, not medicine-specific.
- Yi et al (2021), Radiology “Forensics”: Determination of Age and Sex from Chest Radiographs Using Deep Learning, Emergency Radiology, and Gichoya et al (2022), AI Recognition of Patient Race in Medical Imaging: A Modelling Study, Lancet Digital Health. AI recognition of patient age, sex, and race from chest x-rays, which raises the possibility of bias in AI models trained using such images.
- Glocker et al (2023), Algorithmic Encoding of Protected Characteristics in Chest X-ray Disease Detection Models, eBioMedicine. Does encoding of protected characteristics in an AI model necessarily lead to bias?
- Brown et al (2023), Detecting Shortcut Learning for Fair Medical AI Using Shortcut Testing, Nature Communications. A method for detecting when a model relies on shortcut learning (see the probe sketch after this list).
- Zou et al (2023), Implications of Predicting Race Variables from Medical Images, Science.
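A common first test of whether a trained disease model encodes a protected attribute (the question studied by Glocker et al above) is to fit a simple probe on its learned features. Below is a minimal, hypothetical sketch, not the exact protocol of any of the papers listed: it assumes you have exported penultimate-layer features and binary attribute labels to the (made-up) files `features.npy` and `attribute.npy`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical inputs: (n, d) penultimate-layer features of a trained
# disease model, and a binary protected attribute for the same patients.
features = np.load("features.npy")
attribute = np.load("attribute.npy")

# Linear probe: if a simple classifier can predict the attribute from the
# features, the attribute is (at least linearly) encoded in the representation.
probe = LogisticRegression(max_iter=1000)
aucs = cross_val_score(probe, features, attribute, cv=5, scoring="roc_auc")
print(f"probe AUC: {aucs.mean():.3f} +/- {aucs.std():.3f}")  # ~0.5 = no linear encoding
```

Note that a high probe AUC shows the attribute is recoverable from the representation, not that downstream predictions are biased; the shortcut testing of Brown et al goes further by relating encoding strength to performance gaps.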
Quantitative comparisons
- Zhang et al (2022), Improving the Fairness of Chest X-ray Classifiers, CHIL. Comparison of multiple approaches for addressing bias in chest X-ray classification and evaluation using different definitions of fairness.
- Lee et al (2023), An Investigation Into the Impact of Deep Learning Model Choice on Sex and Race Bias in Cardiac MR Segmentation, MICCAI FAIMI workshop. Comparison of bias characteristics of different deep learning models including CNNs and a vision transformer.
- Zong et al (2023), MEDFAIR: Benchmarking Fairness for Medical Imaging, ICLR. Comparison of many standard fairness methods on ten medical image datasets, spanning chest x-rays, brain MRIs, retinal fundus images, dermatoscopic images, heart CT, lung CT, and SD-OCT.
Applied AI fairness research in medical imaging
AI fairness for chest x-rays
- Larrazabal et al (2020), Gender Imbalance in Medical Imaging Datasets Produces Biased Classifiers for Computer-aided Diagnosis, Proceedings of the National Academy of Sciences (USA).
- Seyyed-Kalantari et al (2021), Underdiagnosis Bias of Artificial Intelligence Algorithms Applied to Chest Radiographs in Under-served Patient Populations, Nature Medicine.
- Zhang et al (2022), Improving the Fairness of Chest X-ray Classifiers, CHIL.
- Lin et al (2023), Improving Model Fairness in Image-based Computer-aided Diagnosis, Nature Communications.
- Glocker et al (2023), Risk of Bias in Chest Radiography Deep Learning Foundation Models, Radiology: Artificial Intelligence.
AI fairness in image reconstruction
- Du et al (2023), Unveiling Fairness Biases in Deep Learning-Based Brain MRI Reconstruction, MICCAI FAIMI workshop.
AI fairness for dermatology images
- Abbasi-Sureshjani et al (2020), Risk of Training Diagnostic Algorithms on Data with Demographic Bias, MICCAI.
- Kinyanjui et al (2020), Fairness of Classifiers Across Skin Tones in Dermatology, MICCAI.
- Daneshjou et al (2022), Disparities in Dermatology AI Performance on a Diverse, Curated Clinical Image Set, Science Advances.
- Pakzad et al (2022), CIRCLe: Color Invariant Representation Learning for Unbiased Classification of Skin Lesions, ECCV Workshop on Skin Image Analysis.
- Xu et al (2023), FairAdaBN: Mitigating Unfairness with Adaptive Batch Normalization and Its Application to Dermatological Disease Classification, MICCAI.
- Bencevic et al (2024), Understanding Skin Color Bias in Deep Learning-Based Skin Lesion Segmentation, Computer Methods and Programs in Biomedicine.
AI fairness for brain MRI
- Petersen et al (2022), Feature Robustness and Sex Differences in Medical Imaging: A Case Study in MRI-Based Alzheimer’s Disease Detection, MICCAI.
- Ioannou et al (2022), A Study of Demographic Bias in CNN-Based Brain MR Segmentation, MICCAI workshop on Machine Learning in Clinical Neuroimaging.
- Wang et al (2023), Bias in Machine Learning Models can be Significantly Mitigated by Careful Training: Evidence from Neuroimaging Studies, Proceedings of the National Academy of Sciences (USA).
- Klingenberg et al (2023), Higher Performance for Women than Men in MRI-based Alzheimer’s Disease Detection, Alzheimer’s Research & Therapy.
AI fairness for cardiac MRI
- Puyol-Antón et al (2021), Fairness in Cardiac MR Image Analysis: An Investigation of Bias Due to Data Imbalance in Deep Learning Based Segmentation, MICCAI.
- Puyol-Antón et al (2022), Fairness in Cardiac Magnetic Resonance Imaging: Assessing Sex and Racial Bias in Deep Learning-Based Segmentation, Frontiers in Cardiovascular Medicine.
- Lee et al (2022), A Systematic Study of Race and Sex Bias in CNN-Based Cardiac MR Segmentation, MICCAI Workshop on Statistical Atlases and Computational Models of the Heart.
- Lee et al (2023), An Investigation Into the Impact of Deep Learning Model Choice on Sex and Race Bias in Cardiac MR Segmentation, MICCAI FAIMI workshop.
AI fairness for ophthalmology
- Burlina et al (2021), Addressing Artificial Intelligence Bias in Retinal Diagnostics, Translational Vision Science & Technology.
- Lin et al (2023), Improving Model Fairness in Image-based Computer-aided Diagnosis, Nature Communications.
AI fairness for histology
- Vaidya et al (2024), Demographic Bias in Misdiagnosis by Computational Pathology Models, Nature Medicine.
AI fairness for breast DCE-MRI
- Huti et al (2023), An Investigation Into Race Bias in Random Forest Models Based on Breast DCE-MRI Derived Radiomics Features, MICCAI FAIMI workshop.
AI fairness for medical image segmentation
- Puyol-Antón et al (2021), Fairness in Cardiac MR Image Analysis: An Investigation of Bias Due to Data Imbalance in Deep Learning Based Segmentation, MICCAI.
- Puyol-Antón et al (2022), Fairness in Cardiac Magnetic Resonance Imaging: Assessing Sex and Racial Bias in Deep Learning-Based Segmentation, Frontiers in Cardiovascular Medicine.
- Lee et al (2022), A Systematic Study of Race and Sex Bias in CNN-Based Cardiac MR Segmentation, MICCAI Workshop on Statistical Atlases and Computational Models of the Heart.
- Lee et al (2023), An Investigation Into the Impact of Deep Learning Model Choice on Sex and Race Bias in Cardiac MR Segmentation, MICCAI FAIMI workshop.
- Ioannou et al (2022), A Study of Demographic Bias in CNN-Based Brain MR Segmentation, MICCAI workshop on Machine Learning in Clinical Neuroimaging.
- Gaggion et al (2023), Unsupervised Bias Discovery in Medical Image Segmentation, MICCAI FAIMI workshop.
- Tian et al (2023), Harvard FairSeg: A Large-Scale Medical Image Segmentation Dataset for Fairness Learning Using Segment Anything Model with Fair Error-Bound Scaling, arXiv.
- Bencevic et al (2024), Understanding Skin Color Bias in Deep Learning-Based Skin Lesion Segmentation, Computer Methods and Programs in Biomedicine.
- Siddiqui et al (2024), Fair AI-Powered Orthopedic Image Segmentation: Addressing Bias and Promoting Equitable Healthcare, Scientific Reports.
Miscellaneous
- Simoiu et al (2017), The Problem of Infra-Marginality in Outcome Tests for Discrimination, The Annals of Applied Statistics, and Kearns et al (2018), Preventing Fairness Gerrymandering: Auditing and Learning for Subgroup Fairness, ICML: Fairness with respect to one protected attribute can hide aggravated unfairness with respect to another (a.k.a. inframarginality / fairness gerrymandering / subgroup fairness).
- Wick et al (2019), Unlocking Fairness: a Trade-off Revisited, NeurIPS, and Dutta et al (2020), Is There a Trade-Off Between Fairness and Accuracy? A Perspective Using Mismatched Hypothesis Testing, ICML: Observed fairness-accuracy trade-offs may be illusory and purely a result of label bias. Optimizing for fairness may also yield performance-optimal models, even if evaluations on (equally biased) test data suggest otherwise. Also see Sharma et al (2023), On Testing and Comparing Fair Classifiers under Data Bias, arXiv, on this subject.
- Lazar Reich and Vijaykumar (2020), A Possibility in Algorithmic Fairness: Can Calibration and Equal Error Rates be Reconciled?, FORC: Contrary to popular belief, equalized odds (i.e., equal TPR and FPR) and calibration by groups are compatible.
- Wachter et al (2021), Bias Preservation in Machine Learning: The Legality of Fairness Metrics Under EU Non-Discrimination Law, West Virginia Law Review, and Wachter et al (2023), The Unfairness of Fair Machine Learning: Levelling Down and Strict Egalitarianism by Default, Michigan Technology Law Review: A legal perspective on AI fairness and the “levelling down” phenomenon in “fair” machine learning.
- Mukherjee et al (2022), Confounding Factors Need to be Accounted for in Assessing Bias by Machine Learning Algorithms, Nature Medicine.
- Schrouff et al (2022), Diagnosing Failures of Fairness Transfer Across Distribution Shift in Real-world Medical Settings, NeurIPS. Are bias mitigation strategies robust to real-world domain shifts?
- Zhao and Gordon (2022), Inherent Tradeoffs in Learning Fair Representations, JMLR. Theoretical analysis of how group-invariant representations and statistical/demographic parity hurt accuracy in the presence of base rate differences between groups.
- Ricci Lara et al (2023), Towards Unraveling Calibration Biases in Medical Image Analysis, MICCAI FAIMI workshop, and Petersen et al (2023), On (Assessing) the Fairness of Risk Score Models, FAccT: Standard calibration error metrics (such as ECE) are biased with respect to the evaluation sample size, which must be taken into account when comparing calibration between (protected) groups of different sizes (a numerical illustration follows this list).
- Jones et al (2023), No Fair Lunch: A Causal Perspective on Dataset Bias in Machine Learning for Medical Imaging, arXiv.
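The sample-size sensitivity of ECE noted by Ricci Lara et al and Petersen et al above is easy to reproduce. The sketch below is a self-contained toy example (not code from either paper): it scores a perfectly calibrated predictor on a small and a large group, and the smaller group receives a worse ECE purely from finite-sample noise.

```python
import numpy as np

def ece(probs, labels, n_bins=10):
    """Expected calibration error with equal-width probability bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.clip(np.digitize(probs, edges) - 1, 0, n_bins - 1)
    n, total = len(probs), 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            # |mean outcome - mean confidence| weighted by bin occupancy
            total += mask.sum() / n * abs(labels[mask].mean() - probs[mask].mean())
    return total

rng = np.random.default_rng(0)
for group_size in (200, 20_000):            # small vs. large (protected) group
    p = rng.uniform(size=group_size)        # predicted risks
    y = rng.uniform(size=group_size) < p    # outcomes drawn at the predicted risk,
                                            # i.e. perfectly calibrated by construction
    print(group_size, round(ece(p, y.astype(float)), 4))
```

Comparing raw ECE values between demographic groups of very different sizes can therefore manufacture apparent calibration "bias"; subsampling groups to equal size, or using debiased calibration estimators, avoids this artefact.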
Software toolkits
Although one can investigate fairness issues with standard software environments and packages, a number of researchers have developed specialised toolkits that facilitate fairness and bias assessments, and you may find it more efficient to use one of these. A minimal usage sketch follows the list.
- AI Fairness 360, Bellamy et al (2018). Initially created by IBM, now independent.
- FairLearn, Bird et al (2020). Initially created by Microsoft, now independent.
- Aequitas, Saleiro et al (2018). Open source bias audit toolkit for machine learning developers (not imaging).
- MEDFAIR, Zong et al (2023), ICLR. Fairness benchmarking suite for medical imaging.
- FairMedFM, Jin et al (2024), NeurIPS. Fairness benchmarking suite for medical imaging foundation models.
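As a flavour of what these toolkits provide, here is a minimal sketch of a subgroup audit using FairLearn's `MetricFrame`. The data below is synthetic; in practice you would substitute your model's predictions and a real sensitive attribute column.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score
from fairlearn.metrics import MetricFrame

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)            # ground-truth labels
y_pred = np.where(rng.uniform(size=1000) < 0.8,   # synthetic, ~80%-accurate predictions
                  y_true, 1 - y_true)
sex = rng.choice(["F", "M"], size=1000)           # sensitive attribute

mf = MetricFrame(
    metrics={"accuracy": accuracy_score, "TPR": recall_score},
    y_true=y_true,
    y_pred=y_pred,
    sensitive_features=sex,
)
print(mf.by_group)       # per-group metric table
print(mf.difference())   # largest between-group gap for each metric
```

The same pattern (per-group metrics plus a between-group disparity summary) underlies most of the toolkits listed above; MEDFAIR and FairMedFM additionally bundle medical imaging datasets, models, and mitigation baselines.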
Initiatives, guidelines and legislation
Below are resources related to data collection and research initiatives, guidelines on fairness in AI, and government efforts to legislate the use of AI, many of which make reference to fairness and bias.
Initiatives:
- STANDING Together, Ganapathi et al (2022), Nature Medicine. Promotes the formation of inclusive, diverse and transparent medical datasets.
- “All of Us” research programme, All of Us Research Program Investigators (2019), NEJM. US initiative to acquire diverse medical data.
- Fairness of AI in Medical Imaging (FAIMI): that’s us, an independent academic initiative aimed at exploring and promoting fair AI in medical imaging.
Guidelines:
- FUTURE-AI aims to establish guidelines for AI in healthcare, including fairness as a key principle.
- Algorithmic Bias Playbook (Obermeyer et al, 2021): high-level discussion addressing aspects such as label choice bias.
Legislation/white papers on regulation of AI:
- Global AI Legislation Tracker
- European Commission, “Proposal for a Regulation of the European Parliament and of the Council Laying Down Harmonised Rules on Artificial Intelligence (Artificial Intelligence Act) and Amending Certain Union Legislative Acts,” 2021
- UK Government Department for Science, Innovation and Technology and Office for Artificial Intelligence, “A Pro-innovation Approach to AI Regulation,” 2023
- US Government, “Blueprint for an AI Bill of Rights,” 2022
- A state-by-state summary of US AI legislation can be found here
- ISO has published a technical standard on AI bias (ISO/IEC TR 24027:2021), IEEE has one under development (IEEE P7003, Algorithmic Bias Considerations), and NIST has published a proposal document (NIST Special Publication 1270)
Datasets
Unfortunately, most currently available medical imaging datasets do not include associated demographic information such as sex and race, which is essential for much fairness research. Below is a summary of the most commonly used datasets that do include such information; a short sketch for auditing a dataset's demographic composition follows the list.
- UK Biobank. Database of half a million participants from the UK population, including approximately 100,000 with imaging (heart, brain and abdominal MRI) together with a wide range of demographic information. Available worldwide to approved projects upon payment of an administration fee.
- ADNI, Mueller et al (2005), Neuroimaging Clinics of North America: Multisite study of brain imaging (MRI), biochemical, and genetic data with Alzheimer’s diagnosis status and demographic information including race and sex.
- ISIC challenge datasets. The International Skin Imaging Collaboration (ISIC) runs yearly challenges for AI processing of dermatology images. Some of these have associated skin tone information.
- PAD-UFES, Pacheco et al (2020), Data in Brief: skin lesion dataset, dermatology images with 22 clinical parameters including age and Fitzpatrick skin type.
- Diverse Dermatology Images, Daneshjou et al (2022), Science Advances. Database of 656 dermatology images from 570 unique patients, approximately balanced by skin tone. Expert annotations of lesion diagnosis and Fitzpatrick skin tone. Freely available.
- Fitzpatrick 17k dataset, Groh et al (2021), CVPR workshops. Database of 17k dermatology images annotated with Fitzpatrick skin type labels.
- NIH chest X-ray dataset, Wang et al (2017), CVPR. 112,120 chest X-ray images from 30,805 patients, labelled with 14 common thorax diseases and demographic information including age and sex.
- CheXpert chest X-ray dataset, Irvin et al (2019), AAAI. 224,316 chest X-rays of 65,240 patients with age and sex information.
- PadChest, Bustos et al (2020), Medical Image Analysis. Chest X-ray dataset of 160,000 images from 67,000 patients; includes age.
- Duke-Breast-Cancer-MRI, Saha et al (2018), British Journal of Cancer. Dataset of dynamic contrast-enhanced MRI images of women with breast cancer, includes images, derived radiomics, tumour segmentations and patient demographics including race. Freely available upon registration.
- OASIS. Series of brain MR datasets with age and gender information.
- FairSeg, Tian et al (2023). Public dataset of scanning laser ophthalmoscopy fundus images for assessing bias in segmentation of the optic disc and cup.
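Whichever dataset you choose, it is worth auditing the demographic composition of your splits before running any fairness analysis, since small subgroups make fairness estimates noisy. A minimal sketch, assuming a metadata table with `Sex` and `Age` columns (as in, e.g., CheXpert's `train.csv`; column names vary by dataset):

```python
import pandas as pd

# Hypothetical path; substitute the metadata file of your chosen dataset.
meta = pd.read_csv("train.csv")

# Subgroup counts and proportions: small subgroups drive noisy fairness estimates.
counts = meta["Sex"].value_counts()
print(counts, (counts / len(meta)).round(3), sep="\n")

# Age distribution per sex, to spot confounding between attributes.
print(meta.groupby("Sex")["Age"].describe()[["count", "mean", "std"]])
```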
Talks
2021
- Marzyeh Ghassemi, “The Fairest of Them All: Privacy, Data and Machine Learning for Health”
- Eran Tal, “Accuracy and Fairness in Machine Learning: Lessons From Measurement”
- Natalia Martinez, “Blind Pareto Fairness and Subgroup Robustness”
- Nithya Sambasivan, “Reimagining ML Fairness in India and Beyond”
- Abeba Birhane, “The Limits of Fairness”
- Sara Gerke, “Legal Issues of Artificial Intelligence in Healthcare in the US”
2022
- FAIMI 2022 Online Symposium talks
- Judy Wawira Gichoya, “Hidden in Plain Sight: An Update to the Reading Race Project”
- Sanmi Koyejo, “Algorithmic Fairness: Why it’s Hard, and Why it’s Interesting”, CVPR Tutorial: Part 1, Part 2
- Jessica Schrouff, “Maintaining Fairness Under Distribution Shift”
2023
- FAIMI 2023 Online Symposium talks
- MICCAI FAIMI 2023 Workshop talks
- Can AI Be Harmful? A Conversation with MIT’s Dr. Marzyeh Ghassemi on the NEJM AI Grand Rounds Podcast
- The Double-Edged Sword of AI, with Dr. Ziad Obermeyer on the NEJM AI Grand Rounds Podcast
- Karim Lekadir, “In AI we trust? Towards ethical AI in medical imaging.”
- Bias in AI: Origins and Solutions feat. Dr. Joy Buolamwini on ASK MORE OF AI with Clara Shih