Publication: Exploring the Limitations of AI in Medical Code Extraction

Large language models (LLMs) are gaining traction in various specialized domains, but their capabilities are still being tested in highly technical fields such as medical coding. Our recent study investigates how effectively these models extract ICD-10-CM codes from clinical documentation compared to a human coder. We analyzed six LLMs, including widely known models like GPT-3.5, GPT-4, and Llama 2-70b, using actual patient data. The findings show that while LLMs extract a greater number of unique codes than the human coder, their agreement with the human coder remains poor.

The study revealed that while Llama 2-70b extracted the highest number of unique codes, GPT-4 showed the most agreement with the human coder, albeit at a modest 15.2%. For identifying the primary diagnosis, Claude 3 performed slightly better, with 26% agreement. However, the Cohen’s kappa values, which measure agreement beyond what chance alone would produce, suggested minimal to no agreement between the AI and human results. This indicates that while the models extract many codes, they still struggle with precision and relevance when it comes to critical medical data.
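
To make the difference between raw percent agreement and Cohen's kappa concrete, here is a minimal Python sketch. It is an illustration only, not the study's analysis code: the function name `cohen_kappa`, the yes/no framing (1 = the code was assigned, 0 = it was not), and the toy data are all assumptions made for this example.

```python
def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each rater's label rates.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    # Observed agreement: fraction of items where both raters gave the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the two raters' rates, summed over labels.
    p_e = sum((rater_a.count(lab) / n) * (rater_b.count(lab) / n) for lab in labels)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy data: 1 = code assigned, 0 = code not assigned,
# for eight candidate codes. These values are invented for illustration.
human = [1, 1, 0, 0, 1, 0, 0, 0]
model = [1, 0, 1, 0, 1, 1, 1, 0]

percent_agreement = sum(h == m for h, m in zip(human, model)) / len(human)
kappa = cohen_kappa(human, model)
print(f"percent agreement = {percent_agreement:.2f}, kappa = {kappa:.3f}")
```

In this toy case the two raters agree on half of the items, yet kappa is close to zero, because a model that assigns codes liberally will match a human coder on some items by chance alone. Kappa values near zero are read as minimal to no agreement.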

Our exploration also uncovered various reasons for the discrepancies in code extraction. For instance, some models like GPT-4 were prone to extracting codes for diagnoses not confirmed by medical providers, while others, such as GPT-3.5, frequently selected non-specific codes. Another area of concern was the tendency of models like Claude-2.1 to include signs and symptoms rather than definitive diagnoses. These findings underline the need for refinement and improved training data to bring the models' coding closer to human expertise. Only then can these models be integrated into critical healthcare operations to improve efficiency without compromising accuracy.

Source: Simmons A, Takkavatakarn K, McDougal M, Dilcher B, Pincavitch J, Meadows L, Kauffman J, Klang E, Wig R, Smith GS, Soroush A, Freeman R, Apakama D, Charney A, Kohli-Seth R, Nadkarni G, Sakhuja A. Extracting International Classification of Diseases Codes from Clinical Documentation using Large Language Models. Appl Clin Inform. 2024 Nov 28. doi: 10.1055/a-2491-3872. Epub ahead of print. PMID: 39608761.
