A systematic review by Choy et al. of 64 deep learning models reveals their high diagnostic accuracy for common skin diseases such as acne, psoriasis, eczema, and rosacea. Some models not only diagnose but also assess disease severity. These increasingly accurate models, though mostly still in research and development, offer an opportunity for AI-assisted diagnosis to improve access in the face of dermatologist shortages and long wait times; the ability to assess disease severity can build on diagnosis-only outputs to inform treatment decisions and patient self-management. Primary care, given reassuring sentiments from providers, presents a particularly apt setting for application of dermatologist-trained models. However, the review also highlights significant challenges, including the need for further refinement in complex diseases, concerns about model bias, and the lack of standardization and diversity in training data. Regulators and providers should implement evaluation criteria for approval and adoption that prioritize applicability and inform decisions on where best to implement these novel technologies.

Main text

Artificial intelligence (AI) diagnosis in dermatology has moved beyond skin cancer alone to a wide range of common skin diseases, offering exciting new horizons for dermatology care. To date, however, the FDA has not approved an AI device for dermatology diagnosis or treatment1. With teledermatology burgeoning during the COVID-19 pandemic, new databanks of skin images have become broadly available to train models2.

AI first entered dermatology in 2017 with Stanford’s landmark deep learning model for skin cancer detection, published in Nature3. Since then, new models have evolved beyond skin cancer alone, promising significant growth potential for the highly prevalent chronic inflammatory skin diseases, which affect 20–25% of the population worldwide4,5,6. With an array of promising diagnostic models inching closer to the bedside, questions arise for providers and regulators as to how these models should be evaluated and adopted, particularly as they relate to bias and equity. Moreover, while new AI models have shown proficiency in diagnosing common skin conditions, their ability to navigate the nuances of more complex cases and recommend therapeutic interventions remains a critical area for exploration.

Promise in diagnosis

Choy et al. conducted a systematic review of 64 non-cancer-related deep learning models for diagnosis and monitoring of 144 different skin diseases7. Of these, the most common skin diseases were acne (30), psoriasis (27), eczema (22), rosacea (12), vitiligo (12), and urticaria (8). Most models predicted diagnosis (81%) and the rest, disease severity. Most image datasets (88%) used macroscopic images of skin, hair and nails, with the remainder using dermoscopic images. These image datasets were separated into three types, with varying uses and sizes: training (used to create models; median n = 2555), validation (used to evaluate model performance; median n = 1032), and testing (final evaluation; median n = 331).
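The three-way division of image datasets described above can be illustrated with a minimal sketch. It assumes a simple random split; the function name `split_dataset`, the 70/15/15 fractions, and the image identifiers are hypothetical choices for illustration and are not taken from the reviewed studies, whose splitting procedures varied.

```python
import random


def split_dataset(items, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle items and partition them into disjoint train/validation/test lists.

    The fractions are illustrative assumptions, not values from the review.
    """
    if train_frac <= 0 or val_frac < 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave a non-empty share for testing")
    shuffled = list(items)
    # A seeded generator makes the split reproducible.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]                 # used to fit the model
    val = shuffled[n_train:n_train + n_val]    # used to evaluate during development
    test = shuffled[n_train + n_val:]          # held out for the final evaluation
    return train, val, test


# Hypothetical image identifiers standing in for a skin image dataset.
image_ids = [f"img_{i:04d}" for i in range(1000)]
train, val, test = split_dataset(image_ids)
print(len(train), len(val), len(test))
```

The point of the sketch is only that the three sets are disjoint, so that the final test images play no part in building or tuning the model.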

Overall, the accuracy of these models was impressive in diagnosing acne (94%), rosacea (94%), eczema (93%) and psoriasis (89%). Accuracy for grading severity was more variable, but still high: psoriasis (93–100%), eczema (88%), and acne (67–86%). These findings align with those of prior systematic reviews8, demonstrating growing evidence of accurate AI diagnostic tools across dermatology, at least for common skin conditions.

Diagnostic assistance from AI models has significant value in increasing access given the dermatologist shortage and long wait times (averaging 36 days in the US)9. Severity ratings of disease make model outputs more relevant for treatment decisions and patient self-management of chronic disease, a significant advance from diagnosis-only models10. However, there remains significant room for improving the nuance of these models to accurately recommend therapeutic changes. Moreover, many of the high-accuracy diagnoses (acne, psoriasis, vitiligo) are readily recognizable by most providers; other conditions such as eczema and urticaria may be more difficult, and models that diagnose these and similarly complex conditions offer more promise.

The next step in implementation involves identifying the most opportune use cases for such technology. One particularly important application lies in primary care; in one study, 92% of primary care providers considered the tested AI dermatology diagnosis model a useful support tool in creating a differential diagnosis, and 60% even considered it useful for determining the final diagnosis11. Beyond primary care, other provider groups serving patients with skin conditions should critically analyze whether and when such technology would augment care.

Pitfalls in applicability

Choy et al. also found nearly ubiquitous bias and applicability concerns; quality assessment with the CLEAR Derm and QUADAS-2 frameworks12 found that 59 studies (92%) had a high risk of bias and 62 (97%) had a high level of applicability concerns. Bias in AI has been a long-standing concern across healthcare. In dermatology, the QUADAS-2 framework and CLEAR Derm guidelines could be useful for future evaluation. Further development of quality assessment tools, with validation in dermatology specifically, is also necessary to ensure that these AI tools do not perpetuate biases.

Moreover, models in the study used varying reference standards (the “correct” diagnoses used to train models): some used dermatologist-produced diagnoses, while others used diagnoses from primary care providers (PCPs) or a combination of both sources. Dermatologists have significantly higher diagnostic accuracy than non-dermatologists given their specialized training13, suggesting that dermatologist reference standards should be used in all relevant datasets for the highest quality of care. The choice of providers involved in producing training datasets has ramifications for the care setting in which these technologies will be used. For example, dermatologists may be wary of adopting models trained with PCP-generated data, while PCPs may be more amenable to such a model.

AI has immense potential to increase access to care, and new data on autonomous AI show promise in increasing productivity14. However, Choy et al. found that only 19% of models reported ethnicity or Fitzpatrick skin type (skin color gradation). Even among those reporting, darker skin types were underrepresented, leaving significant concerns regarding whether these findings are applicable to marginalized populations, who often face the most challenges accessing dermatology care. Skin diversity metrics of training datasets should be mandatory in the academic literature and for product approval. Moreover, regulators and industry should consider requiring validation and testing with diversity-certified datasets, particularly for models trained on private and undisclosed datasets.
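One way to make skin diversity reporting concrete is to break test-set results down by Fitzpatrick skin type, so that both representation (sample counts) and performance are visible for each group. The sketch below is illustrative only: the function name `accuracy_by_group`, the grouping into I-II, III-IV, and V-VI, and the toy records are all hypothetical assumptions and do not come from Choy et al. or from any real dataset.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return {group: (sample_count, accuracy)} for (group, truth, prediction) records."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for group, truth, prediction in records:
        total[group] += 1
        correct[group] += int(truth == prediction)
    return {group: (total[group], correct[group] / total[group]) for group in total}


# Hypothetical toy records: (Fitzpatrick group, reference diagnosis, model output).
# These are made-up values for illustration, not study data.
records = [
    ("I-II", "acne", "acne"),
    ("I-II", "psoriasis", "psoriasis"),
    ("I-II", "eczema", "eczema"),
    ("I-II", "rosacea", "acne"),
    ("III-IV", "acne", "acne"),
    ("III-IV", "eczema", "psoriasis"),
    ("V-VI", "eczema", "eczema"),
]

report = accuracy_by_group(records)
for group, (n, acc) in sorted(report.items()):
    print(f"Fitzpatrick {group}: n={n}, accuracy={acc:.2f}")
```

Reporting the count alongside the accuracy matters here: a group with very few images can show a high accuracy that says little about how the model would perform in practice.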

Looking forward

Overall, deep learning models in dermatology have promising accuracy in diagnosis and severity classification among numerous common skin diseases, though they still present limitations in recommending therapy with nuance. Models are further challenged by significant risk of bias, applicability concerns, varying reference standards, and poor diversity representation. As the scope of AI utilization continues to expand, evaluation frameworks are necessary to assess bias, standardize training datasets around dermatologist-produced reference standards, and ensure representation of diverse skin phenotypes. As we usher in this new era of digital dermatology, it is imperative for researchers, clinicians, and policymakers to collaboratively navigate these uncharted waters, ensuring that AI tools are developed and implemented thoughtfully, with an eye towards their ultimate goal: enhancing patient care and outcomes for all.