Large Language Models for Chatbot Health Advice Studies
Bright Huo, Amy Boyle, N. Marfo, W. Tangamornsuksan, Jeremy P Steen, T. McKechnie, Yung Lee, Julio Mayol, et al.
Key Points

Question: What do studies report when evaluating the performance of large language models (LLMs) providing health advice?

Findings: In this systematic review of 137 articles, 99.3% of the studies assessed closed-source models and did not provide enough information to identify the LLM. Most (64.5%) studies used subjective means as the ground truth to define the successful performance of the LLM, while fewer than one-third addressed the ethical, regulatory, and patient safety implications of clinically integrating LLMs.

Meaning: The findings of this study suggest that the extent of reporting varies considerably among studies evaluating the clinical accuracy of LLMs providing health advice.