DICE Score vs Radiologist – Visual quantification of Virtual Diffusion Sequences – pitfalls of lesion segmentation-based approach as compared to clinical relevance-based qualitative assessment
The performance of image segmentation and image translation algorithms is typically evaluated with image similarity metrics such as the Dice score or the structural similarity index (SSIM). In some instances, this approach may be counterproductive. In this study, we compare such an approach with a qualitative assessment method focused on clinical relevance for estimating the accuracy of virtually generated diffusion-weighted (DW) sequences produced using Generative Adversarial Networks (GAN).
METHODS AND MATERIALS:
We used a previously described Virtual Imaging Using Generative Adversarial Networks for Image Translation (VIGANIT) network, which comprises a 15-layer deep convolutional neural network (CNN) used in conjunction with a GAN to improve the clarity of the output image. VIGANIT was used to predict B1000 diffusion-weighted images from input T2W images in 24 cases (12 cases each of acute and chronic infarcts). The ground truth B1000 DW images and the predicted B1000 images were blinded and randomized. A radiologist with 9 years' experience in MRI performed pixel-level annotations of the bright and dark areas in ITK-SNAP. Dice score coefficients (DSC) for the annotated areas were calculated. Another radiologist with 16 years' experience studied the scans to determine the scan-level presence or absence of restriction-like signal. In positive cases, the number and location of discretely visible ischemic foci larger than 2 mm were also noted on slice-level analysis.
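The Dice score coefficient mentioned above measures the overlap between two binary masks A and B as 2|A∩B| / (|A| + |B|). The following is a minimal Python sketch of that formula, assuming the annotations are available as binary arrays of equal shape. The function name `dice_coefficient`, the toy masks, and the convention of returning 1.0 when both masks are empty are illustrative assumptions, not the study's actual pipeline.

```python
import numpy as np


def dice_coefficient(mask_a, mask_b):
    """Return the Dice score coefficient of two binary masks of equal shape."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    if a.shape != b.shape:
        raise ValueError("masks must have the same shape")

    total = int(a.sum()) + int(b.sum())
    if total == 0:
        # Assumed convention: two empty masks count as perfect agreement.
        return 1.0

    intersection = int(np.logical_and(a, b).sum())
    return 2.0 * intersection / total


# Toy 2x3 masks (hypothetical data, not taken from the study).
ground_truth = [[0, 1, 1],
                [0, 1, 0]]
predicted = [[0, 1, 0],
             [0, 1, 1]]

# Each mask has 3 foreground pixels and they share 2, so DSC = 4 / 6.
print(round(dice_coefficient(ground_truth, predicted), 3))
```

Because the score depends only on pixel overlap, a prediction that places a lesion a few pixels away from the annotated area can score near 0 even when a reader would still call the scan positive.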
The Dice score for the cases with acute infarcts ranged from 0 to 0.85 with an average of 0.43, and that for the dark areas ranged from 0.27 to 0.81 with an average of 0.46. On qualitative assessment, eight of the 12 cases with acute infarcts had positive scan-level predictions of restricted diffusion; the remaining four had no comparable predictions. Two of these four patients had some degree of movement artifact in their T2W images. None of the 12 chronic infarct cases had false predictions of restricted diffusion. The overall accuracy of the predictions was 72%.
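The scan-level counts above can also be expressed as sensitivity and specificity. The sketch below is a hedged illustration in Python: it assumes acute infarct cases are the positive class and chronic infarct cases the negative class, and the function name `sensitivity_specificity` is hypothetical. The resulting values are derived from the stated counts and are not figures reported by the study. The reported overall accuracy may rest on a different unit of analysis, so the sketch does not attempt to reproduce it.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # fraction of positive cases detected
    specificity = tn / (tn + fp)  # fraction of negative cases correctly cleared
    return sensitivity, specificity


# Scan-level counts taken from the results text:
# 8 of 12 acute cases predicted positive, 4 missed;
# 0 of 12 chronic cases falsely predicted positive.
sens, spec = sensitivity_specificity(tp=8, fn=4, tn=12, fp=0)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```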
Despite the low Dice score coefficient for image translation, the scan-level accuracy for the clinical classification of the presence or absence of acute infarct was reasonably good. This study makes the case for additionally using the clinical significance of lesions as an indicator of model performance. We demonstrate a significant change in the acceptability score of an image translation network when a more clinically relevant assessment method is applied instead of in-silico mathematical methods.
The EPOS can be viewed here: http://dx.doi.org/10.26044/ecr2020/C-06645