Articles of Interest: Assessing AI Research
Healthcare systems worldwide face mounting challenges as treatment decisions grow increasingly complex due to the prevalence of comorbidities. Concurrently, research applying artificial intelligence models to healthcare has surged, producing intriguing results that many researchers claim will pave the way to personalized care. Amidst this rapid proliferation of research, clinicians must navigate these developments prudently, critically assessing the potential implications and utility of these models in their own clinical environments. The following two articles provide an excellent framework for clinicians reviewing artificial intelligence-based research or considering implementing a model in their clinical workflow. Both focus on appraising models in the fields of cardiovascular disease and radiology, but their lessons apply to other clinical specialties as well.
Critical appraisal of artificial intelligence-based prediction models for cardiovascular disease
The authors of this article stress the importance of distinguishing models that can add value to patient care from models that cannot. Their evaluation framework provides 12 critical questions to ask when confronted with AI-based prediction models, each accompanied by a justification. The authors also distill their experience into key considerations (found in Table 1 of the article) that can be quickly referenced. For example, is AI really needed to solve the problem under consideration, or are there other valid ways to improve patient care? More than 360 prediction models for cardiovascular disease already exist, most of them not using AI, yet only a few are actually used in practice. In that case, finding ways to integrate existing models into current clinical workflows may be the more important problem to solve.
12 Critical questions for AI-based prediction models:
- Is artificial intelligence needed to solve the targeted medical problem?
- How does the artificial intelligence prediction model fit in the existing clinical workflow?
- Are the data for prediction model development and testing representative for the targeted patient population and intended use?
- Is the (time) point of prediction clear and aligned with the feature measurements?
- Is the outcome variable labelling procedure reliable, replicable, and independent?
- Was the sample size sufficient for artificial intelligence prediction model development and testing?
- Is optimism of predictive performance of the artificial intelligence prediction model avoided? (See the sketch after this list.)
- Was the artificial intelligence model’s performance evaluated beyond simple classification statistics?
- Were the relevant reporting guidelines for artificial intelligence prediction model studies followed?
- Is algorithmic (un)fairness considered and appropriately addressed?
- Is the developed artificial intelligence prediction model open for further testing, critical appraisal, updating, and use in daily practice?
- Are presented relations between individual features and the outcome not overinterpreted?
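The question about optimism deserves particular attention: performance measured on a model’s own training data routinely overstates how the model will behave on new patients. The following is a minimal sketch, our own illustration rather than an example from the article, using scikit-learn on purely synthetic noise data; the apparent (resubstitution) AUC looks impressive while the cross-validated AUC correctly hovers near chance.

```python
# Minimal sketch of optimism in predictive performance (illustrative only).
# The data are pure noise, so any honest estimate of discrimination should
# sit near AUC = 0.5; evaluating on the training data suggests otherwise.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))   # 50 noise features, no real signal
y = rng.integers(0, 2, 100)      # random binary outcome

clf = LogisticRegression(max_iter=1000).fit(X, y)
apparent = roc_auc_score(y, clf.predict_proba(X)[:, 1])              # training data
honest = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()  # held-out folds
print(f"apparent AUC: {apparent:.3f}   cross-validated AUC: {honest:.3f}")
```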
Assessing Radiology Research on Artificial Intelligence: A Brief Guide for Authors, Reviewers, and Readers
This second article, from the Radiology Editorial Board, provides a guide for assessing AI models in diagnostic imaging. As in the first article, the authors provide a set of questions to ask when reviewing or conducting research in the field. They begin by acknowledging the widespread impact that AI will have on “any medical application that uses any image of any sort”. They also reference a statement from the RSNA and the ACR suggesting that AI technology “will be so universal that all radiologists using these tools need to be involved in self-education on the topic”.
The questions they outline are specific to the research and review of AI algorithms applied to medical imaging. One of the most important issues the authors address is that traditional performance benchmarks from computer science do not translate well to clinical settings. For example, new algorithms are often benchmarked against existing ones on a standard set of test images, with performance expressed as a single area under the curve (AUC) value. On its own, however, that AUC value may say little about how the algorithm would perform in clinical practice. The authors encourage presenting performance metrics better suited to clinical use, such as overlaid probability maps. This and other challenges are addressed in the full article.
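To make the AUC point concrete, here is a minimal sketch, our own construction on synthetic data rather than anything from the article: two models with nearly identical AUC values can differ sharply in sensitivity at a clinically chosen operating point, such as 95% specificity in a screening workflow.

```python
# Two synthetic score distributions with similar AUCs but very different
# sensitivity where high specificity is required (numbers illustrative only).
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)
n = 10_000
y = rng.integers(0, 2, n)  # ground truth (1 = disease present)

# Model A: moderate separation that holds up across the ROC curve.
scores_a = y * rng.normal(1.0, 1.0, n) + (1 - y) * rng.normal(0.0, 1.0, n)
# Model B: similar overall separation, but noisier negatives, which hurts
# it precisely where a high-specificity threshold is needed.
scores_b = y * rng.normal(1.3, 1.0, n) + (1 - y) * rng.normal(0.0, 1.6, n)

def sensitivity_at_specificity(y_true, scores, target_spec=0.95):
    """Best sensitivity among thresholds meeting the target specificity."""
    fpr, tpr, _ = roc_curve(y_true, scores)
    return tpr[fpr <= (1 - target_spec)].max()

for name, s in [("Model A", scores_a), ("Model B", scores_b)]:
    print(name,
          f"AUC = {roc_auc_score(y, s):.3f},",
          f"sensitivity at 95% specificity = {sensitivity_at_specificity(y, s):.3f}")
```

A reader comparing these models on AUC alone would call them interchangeable, even though one misses far more true cases at the threshold a clinical workflow would actually use.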
9 Considerations for AI/ML radiology research:
- Carefully define all three image sets of the AI experiment (training, validation, and test sets); see the patient-level splitting sketch after this list.
- Use an external test set for final statistical reporting.
- Use multivendor images, preferably for each phase of the AI evaluation (training, validation, and test sets).
- Justify the size of the training, validation, and test sets.
- Train the AI algorithm using a standard of reference that is widely accepted in the field.
- Describe any preparation of images for the AI algorithm.
- Benchmark the AI performance against radiology experts.
- Demonstrate how the AI algorithm makes decisions.
- Make the AI algorithm publicly available so that claims of performance can be verified.
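As a concrete illustration of the first two considerations, here is a minimal sketch, assuming a simple record schema with patient_id and path fields that is our own invention: splitting by patient rather than by image keeps images from the same patient from leaking across sets, and the set used for final reporting should additionally come from an external institution or vendor.

```python
# Patient-level split into training, validation, and test image sets
# (illustrative sketch; the record schema is assumed, not prescribed).
import random
from collections import defaultdict

def split_by_patient(image_records, seed=42, val_frac=0.15, test_frac=0.15):
    """image_records: iterable of dicts with 'patient_id' and 'path' keys."""
    by_patient = defaultdict(list)
    for rec in image_records:
        by_patient[rec["patient_id"]].append(rec)

    patients = sorted(by_patient)          # deterministic order, then shuffle
    random.Random(seed).shuffle(patients)

    n_test = int(len(patients) * test_frac)
    n_val = int(len(patients) * val_frac)
    test_ids = set(patients[:n_test])
    val_ids = set(patients[n_test:n_test + n_val])

    splits = {"train": [], "validation": [], "test": []}
    for pid, recs in by_patient.items():
        key = "test" if pid in test_ids else "validation" if pid in val_ids else "train"
        splits[key].extend(recs)
    return splits  # every image of a given patient lands in exactly one set
```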