Estimation of bias in a deep learning-based chest X-ray classification algorithm


To evaluate bias in the diagnostic performance of a deep learning-based chest X-ray classification algorithm on previously unseen external data.


632 chest X-rays were randomly collected from an academic centre hospital and selectively anonymised, retaining the fields required for bias estimation (manufacturer name, age, and gender). They came from six different vendors: AGFA (388), Carestream (45), DIPS (21), GE (31), Philips (127), and Siemens (20); 376 were male and 256 female. The X-rays were read on the CARING analytics platform (CARPL) to establish ground truth for consolidation, then run through an open-source chest X-ray classification model. Inference results were analysed using Aequitas, an open-source Python package for detecting bias and auditing the fairness of algorithms. The algorithm's performance was evaluated across three metadata attributes: gender, age group, and equipment manufacturer. False omission rate (FOR) and false negative rate (FNR) were used to quantify bias between classes.
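The group-wise audit described above can be sketched in plain Python. This is a minimal, hypothetical illustration of how FNR and FOR are computed per metadata class and compared against a reference group as a disparity ratio (Aequitas performs an equivalent computation internally); the function names and toy data are assumptions, not the study's actual pipeline.

```python
from collections import defaultdict

def group_metrics(records):
    """Compute false negative rate (FNR) and false omission rate (FOR)
    per group from (group, y_true, y_pred) triples with binary labels."""
    counts = defaultdict(lambda: {"fn": 0, "tp": 0, "tn": 0, "fp": 0})
    for group, y_true, y_pred in records:
        c = counts[group]
        if y_true == 1 and y_pred == 0:
            c["fn"] += 1          # missed finding
        elif y_true == 1 and y_pred == 1:
            c["tp"] += 1
        elif y_true == 0 and y_pred == 0:
            c["tn"] += 1
        else:
            c["fp"] += 1
    metrics = {}
    for group, c in counts.items():
        # FNR = FN / (FN + TP); FOR = FN / (FN + TN)
        pos = c["fn"] + c["tp"]
        pred_neg = c["fn"] + c["tn"]
        metrics[group] = {
            "FNR": c["fn"] / pos if pos else 0.0,
            "FOR": c["fn"] / pred_neg if pred_neg else 0.0,
        }
    return metrics

def disparity(metrics, reference):
    """Ratio of each group's metric to the reference group's metric;
    a ratio far from 1.0 indicates disparity against that group."""
    ref = metrics[reference]
    return {
        g: {m: (v[m] / ref[m]) if ref[m] else float("inf") for m in v}
        for g, v in metrics.items()
    }

# Toy example (hypothetical data): group "A" is the reference.
records = (
    [("A", 1, 1)] * 3 + [("A", 1, 0)] + [("A", 0, 0)] * 4
    + [("B", 1, 1)] * 2 + [("B", 1, 0)] * 2 + [("B", 0, 0)] * 2
)
m = group_metrics(records)
d = disparity(m, "A")
```

In this toy example group A has FNR 0.25 and FOR 0.2, while group B has FNR 0.5 and FOR 0.5, giving disparity ratios of 2.0 and 2.5 against the reference, the kind of inter-class score reported in the results below.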


AGFA, the 60 to 80 age group, and males were the dominant classes and were therefore taken as the reference groups for bias evaluation. Significant false omission rate (FOR) and false negative rate (FNR) disparities were observed for all vendor classes except Siemens when compared with AGFA. No gender disparity was seen. For age, all groups showed FNR parity, whereas all classes showed FOR disparity.


We demonstrate that AI algorithms may acquire biases that reflect the composition of their training data. We recommend that bias evaluation be an integral part of every AI project; even so, algorithms may still develop certain biases, some of which are difficult to evaluate.


Only a limited number of pathological classes were evaluated.