Article 1
Validation of the Artificial Intelligence-Based Predictive Optimal Trees in Emergency Surgery Risk (POTTER) Calculator in Emergency General Surgery and Emergency Laparotomy Patients. El Hechi MW, Maurer LR, Levine J, et al. J Am Coll Surg. 2021 Jun;232(6):912-919.
Emergency general surgery (EGS) represents a heterogeneous population with many different diagnoses and often complex comorbidities, which makes predicting clinical outcomes challenging. Prediction is further complicated by the concept of “nonlinearity,” where the impact of each comorbidity varies based on whether or not other factors are present. To overcome this limitation, machine learning was leveraged to develop the Predictive Optimal Trees in Emergency Surgery Risk (POTTER) calculator for patients undergoing emergency surgical procedures. Because the tool was derived using all emergency surgical operations (including vascular, cardiac, and thoracic surgeries), the authors applied the score to EGS and emergency laparotomy cases to assess its accuracy in predicting outcomes in this specific population.
The authors identified 59,955 EGS patients, who most commonly underwent laparoscopic appendectomy (41.3%), laparoscopic cholecystectomy (7.7%), and small bowel resection (4.4%). Thirty-day mortality was 4.4% and morbidity was 24.6%. The POTTER tool predicted mortality well in emergency surgery patients, with a c-statistic of 0.93, and also predicted 30-day morbidity well, with a c-statistic of 0.83.
Of this cohort, 18,952 patients underwent emergency laparotomy, most commonly small bowel resection (13.5%), partial colectomy with diversion (12.6%), and partial colectomy with primary anastomosis (9.7%). The POTTER tool performed well when predicting mortality of emergency laparotomy patients, with a c-statistic of 0.86, and predicted 30-day morbidity with a c-statistic of 0.77.
This study demonstrates the use of machine learning and artificial intelligence tools to predict outcomes in a diverse patient population using preoperative factors. The tool's predictive accuracy is strong, and it can be considered for use in counseling patients and families confronted with emergency general surgical problems. As additional databases are created for the EGS population, especially with the inclusion of non-operative management, this model could be expanded to cover a greater proportion of our patient population.
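The “nonlinearity” that motivates POTTER, and the c-statistic used to grade it, can both be illustrated with a short sketch. The code below is purely hypothetical (synthetic data, an ordinary scikit-learn decision tree rather than POTTER's optimal classification trees): it simulates an outcome in which one risk factor matters far more when a second is present, fits a tree that can learn that interaction, and reports discrimination as a c-statistic (AUROC).

```python
# Hypothetical sketch: a tree-based risk model on synthetic preoperative
# data, showing how trees capture "nonlinearity" (interaction effects)
# and how the c-statistic (AUROC) summarizes discrimination.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000
age = rng.uniform(20, 90, n)          # years
creatinine = rng.uniform(0.5, 4.0, n)  # mg/dL
sepsis = rng.integers(0, 2, n)         # 0/1 indicator

# Simulated outcome: elevated creatinine raises risk far more when sepsis
# is also present -- an interaction a single additive coefficient misses.
logit = -5 + 0.03 * age + 0.4 * creatinine + 1.5 * sepsis * (creatinine > 2)
death = rng.random(n) < 1 / (1 + np.exp(-logit))

X = np.column_stack([age, creatinine, sepsis])
X_tr, X_te, y_tr, y_te = train_test_split(X, death, random_state=0)

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, tree.predict_proba(X_te)[:, 1])
print(f"c-statistic: {auc:.2f}")
```

A shallow tree splits the data into subgroups (e.g., septic patients with creatinine above a threshold), so the estimated risk of one comorbidity is conditioned on the others, which is the property the POTTER authors exploit.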
Article 2
Disorder-Free Data Are All You Need — Inverse Supervised Learning for Broad-Spectrum Head Disorder Detection. He Y, Guo Y, Lyu J, Ma L, et al. NEJM AI. 2024 Mar 28;1(4).
Artificial intelligence (AI) and machine learning (ML) tools are rapidly gaining popularity and expanding in their use. However, they come with well-known limitations; most notably, they lack generalizability and are ‘brittle,’ meaning that when applied in a population different from the training dataset, efficacy in the proposed task decreases dramatically. The standard AI paradigm of training models to identify specific, narrow disease states also does not reflect the reality of clinical practice. To combat this problem, the authors propose a novel solution termed inverse supervised learning (ISL). Rather than pursue models that identify specific disease states, they trained a single model to recognize disorder-free studies, thereby flagging all disorders by their deviation from the training samples.
After training on 21,429 healthy brain CT scans, the model was tested on a retrospective dataset with 127 disorder types and a prospective dataset with 116 disorder types. AUC in the retrospective dataset was 0.883 (sensitivity 0.810, specificity 0.835) and in the prospective dataset was 0.868 (sensitivity 0.809, specificity 0.795). When lesions were classified by size (large, medium, small), AUCs were 0.941, 0.943, and 0.887, respectively. When classified by treatment urgency (high, medium, low), AUCs were 0.942, 0.853, and 0.859, respectively. Incorporation of the model into radiological workflow improved sensitivity and specificity while also reducing workload. Finally, the investigators demonstrated generalizability of ISL by applying the same principles to detect disorders in pulmonary CT scans and optical coherence tomography images, where they demonstrated average AUCs of 0.893 and 0.895, respectively.
By using disorder-free data and inverse supervised learning, this study demonstrates that more generalizable models for identifying pathology on imaging studies are possible. The investigators further showed that their methodology transfers to other imaging modalities and maintained its performance on a prospective dataset. Future research will need to integrate this methodology with other techniques that identify specific disease processes to enhance its use for positive identification of disease states.
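The core inversion, train only on normal examples and flag whatever deviates, can be sketched with a classical one-class method. This is a hypothetical illustration on synthetic feature vectors using scikit-learn's OneClassSVM, not the paper's actual ISL architecture; it shows only the training paradigm (no abnormal examples seen during fitting).

```python
# Hypothetical sketch of the "inverse" paradigm: fit only on normal
# (disorder-free) samples, then flag anything off-distribution.
# Synthetic vectors stand in for imaging features; the paper's ISL
# method uses a different model, but the training setup is analogous.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(42)

# "Disorder-free" training data only: features clustered near the origin.
normal_train = rng.normal(0, 1, size=(500, 8))
model = OneClassSVM(nu=0.05, gamma="scale").fit(normal_train)

# Test set: unseen normal samples plus abnormal ones shifted off-distribution.
normal_test = rng.normal(0, 1, size=(100, 8))
abnormal_test = rng.normal(4, 1, size=(100, 8))

# predict() returns +1 for inliers (normal) and -1 for outliers (disorder).
specificity = (model.predict(normal_test) == 1).mean()
sensitivity = (model.predict(abnormal_test) == -1).mean()
print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")
```

Because nothing about any particular disorder is encoded in training, the same recipe applies unchanged to any modality for which disorder-free examples are plentiful, which is the generalizability claim the authors test on pulmonary CT and OCT.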
Article 3
Predicting blood transfusion following traumatic injury using machine learning models: A systematic review and narrative synthesis. Oakley W, Tandle S, Perkins Z, Marsden M. J Trauma Acute Care Surg. 2024 Oct 1;97(4):651-659.
The need for and extent of blood transfusion in the setting of trauma is often decided by integrating multiple objective data points with physician gestalt. Machine learning (ML) models, capable of modeling the interplay among clinical patterns, data points, and outcomes, have the potential to serve as powerful decision-making adjuncts. In this systematic review, the authors sought to identify ML models developed to predict blood transfusion in trauma and describe their application to clinical practice.
Using MEDLINE, Embase, and the Cochrane Central Register of Controlled Trials, the authors identified twenty-five ML models, the majority published in the past five years. To analyze ML performance, models were graded in terms of discrimination and calibration. Discrimination was plotted as model sensitivity versus 1-specificity and reported as AUROC, with values <0.7 considered poor and >0.9 considered excellent. Data were presented in a narrative fashion, and risk of bias for each ML model was determined using the Prediction model Risk Of Bias ASsessment Tool (PROBAST).
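The discrimination metric described above, sensitivity plotted against 1-specificity and summarized as AUROC, can be computed directly. A minimal sketch with synthetic transfusion labels and risk scores (hypothetical data, not drawn from any reviewed model):

```python
# Hypothetical sketch: computing discrimination (ROC curve and AUROC)
# for a transfusion-risk score on synthetic data.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(7)

# 1 = transfused, 0 = not; scores where transfused patients trend higher.
y_true = rng.integers(0, 2, 1000)
scores = rng.normal(0, 1, 1000) + 1.5 * y_true

# roc_curve returns fpr (1-specificity) and tpr (sensitivity) across
# all score thresholds; AUROC is the area under that curve.
fpr, tpr, thresholds = roc_curve(y_true, scores)
auroc = roc_auc_score(y_true, scores)
print(f"AUROC: {auroc:.2f}")
```

On the review's scale, a score like this would land in the "good" range; a model whose score distributions for transfused and non-transfused patients overlapped completely would fall to AUROC ≈ 0.5.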
Across the twenty-five ML models, multiple algorithmic methods were implemented, including neural networks, decision trees, and Bayesian networks. Models used 3 to 24 variables, ranging from vital signs to diagnostic imaging findings. Overall, 23 models used retrospective data in their construction, and 17 were built from single-center data. Seventeen models achieved good to excellent discrimination during internal validation (AUROC >0.8); however, only four models demonstrated this same performance during external validation (the Bleeding Risk Index, the Compensatory Reserve Index, the Marsden et al. model, and the Mina et al. model). Only two studies included measures of calibration, a notable limitation given calibration's pertinence in detecting underestimation or overestimation of transfusion need. All studies were found to have a high risk of bias by PROBAST assessment, largely due to small sample sizes and reliance on retrospective data.
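Calibration, the under-reported metric the review flags, asks a different question than discrimination: do predicted probabilities match observed event rates? A minimal sketch using scikit-learn's `calibration_curve` on synthetic, deliberately well-calibrated predictions (hypothetical data):

```python
# Hypothetical sketch: assessing calibration by binning predicted
# probabilities and comparing each bin's mean prediction with the
# observed event fraction. A calibrated model hugs the diagonal.
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(1)

# Well-calibrated synthetic predictions: each outcome occurs with
# exactly its predicted probability.
pred_prob = rng.uniform(0, 1, 5000)
outcome = rng.random(5000) < pred_prob

frac_observed, mean_predicted = calibration_curve(outcome, pred_prob, n_bins=10)
max_gap = np.abs(frac_observed - mean_predicted).max()
print(f"largest bin gap: {max_gap:.3f}")
```

A model that systematically overestimates transfusion need would show observed fractions below predicted probabilities across bins, which is exactly the failure mode a high AUROC alone cannot reveal.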
This robust review of ML models illustrates an innovative avenue for personalized transfusion prediction while identifying limitations in model development and actionable areas for improvement. The authors echo common recommendations for datasets, such as leveraging multicenter data for greater generalizability, but go on to offer useful ML-specific feedback. They underscore the importance of calibration, further emphasizing the clinical implications of model reliability. Importantly, the authors describe a need for applicability and the seamless integration of ML models into physician workflow; this depends on minimizing clinician burden, utilizing readily available data points, and furthering physician understanding of how ML models function. As the field of AI in healthcare matures, addressing issues of database optimization, model validation, and convenient integration will be crucial to fully realizing the potential of ML models to improve patient outcomes.