Estimating AI-generated Bias in Radiology Reporting by Measuring the Change in the Kellgren-Lawrence Grades of Knee Arthritis Before and After Knowledge of AI Results—A Multi-reader Retrospective Study


To estimate the extent of AI-induced bias in radiologists’ grading of osteoarthritis on knee X-rays by measuring how their grades change once they know the predictions of a deep learning algorithm.


Anteroposterior views of 271 knee X-rays (542 joints) were randomly extracted from PACS and anonymized. The X-rays were analyzed using DeepKnee, an open-source algorithm based on a deep Siamese CNN architecture that automatically grades osteoarthritis on knee X-rays on the 5-point Kellgren-Lawrence (KL) scale and produces an attention map. The X-rays were independently read by three subspecialist MSK radiologists on the CARPL AI research platform (CARING Research, India). Each radiologist recorded a KL grade for each X-ray, after which the AI grade was shown and the radiologist was given the option to change the result. Both the pre-AI and post-AI results were recorded. The change in score was calculated for all three readers, and the modulus of the change was summarized using the incongruence rate. The shift in consensus among readers before and after knowledge of the AI results was also estimated.
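The per-reader measures described above can be sketched as follows. This is an illustrative sketch only, not the study's analysis code: it assumes the incongruence rate is the proportion of a reader's grades that changed after the AI result was shown, and that consensus is counted per joint across the three readers. The function names and the example grades are hypothetical.

```python
from collections import Counter


def incongruence_rate(pre, post):
    """Fraction of readings whose KL grade changed after the AI result was shown.

    Assumed definition: changed readings / all readings for one reader.
    """
    if len(pre) != len(post):
        raise ValueError("pre and post must have the same length")
    changed = sum(1 for a, b in zip(pre, post) if a != b)
    return changed / len(pre)


def shift_magnitudes(pre, post):
    """Count changed readings by the modulus of the grade change (1, 2, 3, ...)."""
    return Counter(abs(b - a) for a, b in zip(pre, post) if a != b)


def consensus_level(grades):
    """Return 3 if all three readers agree, 2 if exactly two agree, else 0."""
    top_count = Counter(grades).most_common(1)[0][1]
    return top_count if top_count >= 2 else 0


# Hypothetical grades for one reader on five joints (not study data).
example_pre = [0, 1, 2, 3, 4]
example_post = [0, 2, 2, 1, 4]

example_rate = incongruence_rate(example_pre, example_post)
example_shifts = shift_magnitudes(example_pre, example_post)
```

Applying `consensus_level` to each joint's three grades, once with the pre-AI grades and once with the post-AI grades, would give the consensus counts whose difference is the consensus shift.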


A total of 542 knee joints were analyzed by the algorithm and read by the three radiologists, giving 1,626 “instances”. Readers changed their results in 139 instances (8.5%). The number of shifts was 13, 44, 31, 32, and 19 for grades 0 to 4, respectively. Reader 1, reader 2, and reader 3 changed their grades in 52 (all single-grade shifts), 34 (all single-grade shifts), and 53 (50 single-grade, 2 two-grade, and 1 three-grade shift) instances, respectively. The intra-reader incongruence rates were 9.6%, 6.3%, and 9.8%, respectively. Krippendorff’s alpha among the readers was 0.84 before and 0.87 after knowledge of the AI results, implying minimal convergence towards the AI results. Three-reader consensus, two-reader consensus, and no consensus were found in 219, 296, and 27 cases before, and in 248, 279, and 15 cases after, knowledge of the AI results (see Figure 1).
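The reported percentages follow directly from the counts given above. The short check below recomputes them, assuming each reader's incongruence rate uses the 542 joints as its denominator; the variable names are illustrative.

```python
JOINTS = 542
READERS = 3

# Counts as reported in the results.
changes_per_reader = {"reader 1": 52, "reader 2": 34, "reader 3": 53}
shifts_by_grade = {0: 13, 1: 44, 2: 31, 3: 32, 4: 19}
consensus_before = {"three-reader": 219, "two-reader": 296, "none": 27}
consensus_after = {"three-reader": 248, "two-reader": 279, "none": 15}

instances = JOINTS * READERS
total_changes = sum(changes_per_reader.values())

# Share of all instances in which a reader changed the grade, in percent.
overall_rate = round(100 * total_changes / instances, 1)

# Intra-reader incongruence rate per reader, in percent (assumed denominator: 542).
per_reader_rate = {
    reader: round(100 * n / JOINTS, 1) for reader, n in changes_per_reader.items()
}

# Net change in each consensus category after the AI results were shown.
consensus_shift = {
    level: consensus_after[level] - consensus_before[level]
    for level in consensus_before
}
```

Under these assumptions, the per-grade shifts and the per-reader changes both sum to 139, and each set of consensus counts sums to 542 joints.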

Figure 1


We demonstrate that readers tend to converge towards AI results and that, as expected, this occurs more often in the ‘middle’ or ‘median’ grades than at the extremes.


With an increase in the number and variety of AI applications in radiology, it is important to consider the extent and relevance of the behavior-modifying effect of AI algorithms on radiologists.