Chatbot Underperformance in Biology and Image-Based Questions in Medical Education
Publisher: Taylor & Francis
Type: Article
Access rights: Open access

Abstract
AI chatbots have demonstrated variable performance across biological disciplines in medical education, particularly in multiple-choice and image-based assessments, yet their performance on discipline-specific and image-based questions in biology remains unexamined. This study evaluated the accuracy and reliability of chatbots in answering biological questions from the Progress Test, a medical assessment administered across ten universities. We conducted an observational cross-sectional study by inputting 180 questions into the chatbots and categorising them as morphology, function, or biological aggression. Each question was assessed for correctness across multiple chatbot attempts, and logistic regression and hierarchical clustering were applied to identify performance patterns. Although the chatbots answered functional and morphological questions accurately (from 85% (Gemini) to 91.7% (ChatGPT-4)), their accuracy decreased significantly for questions involving biological aggression and visual content. Agreement between chatbot responses remained weak, with Copilot displaying the lowest concordance. Logistic regression confirmed that the presence of images reduced the odds of a correct answer by up to 17.6% (ChatGPT-4), and hierarchical clustering distinguished two distinct response patterns, further validating these findings. These results highlight the potential of chatbots in medical education while emphasising their limitations in handling image-based and aggression-related content.
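To make the reported analysis concrete, below is a minimal Python sketch of how a logistic regression on image presence and a hierarchical clustering of chatbot response patterns might be set up. The file name and column names (question_id, chatbot, correct, has_image, category) are hypothetical assumptions for illustration, not details taken from the article.

```python
# Minimal sketch of the abstract's analysis; data layout is assumed, not from the paper.
import pandas as pd
import statsmodels.formula.api as smf
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical layout: one row per question per chatbot attempt.
# Columns assumed: question_id, chatbot, correct (0/1), has_image (0/1), category.
df = pd.read_csv("progress_test_results.csv")

# Logistic regression: odds of a correct answer given image presence and
# discipline category (morphology, function, biological aggression).
model = smf.logit("correct ~ has_image + C(category)", data=df).fit()
print(model.summary())

# Hierarchical clustering over per-question accuracy profiles across chatbots,
# looking for the two distinct response patterns the abstract reports.
profiles = df.pivot_table(index="question_id", columns="chatbot",
                          values="correct", aggfunc="mean")
tree = linkage(profiles.fillna(0), method="ward")
clusters = fcluster(tree, t=2, criterion="maxclust")
```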
