OpenAI has launched HealthBench, a new dataset designed to test how accurately AI models respond to real-world health care questions.
OpenAI has introduced HealthBench, a dataset for assessing how well AI models respond to health care-related questions. The release is intended to improve how the accuracy and reliability of AI responses to health inquiries are evaluated. The open-source dataset comes with detailed evaluation rubrics, and experts have described its scale and depth as a significant advancement for AI in health care.
OpenAI has launched HealthBench to test how accurately AI models respond to health care-related questions. (Pexels)
HealthBench was developed in collaboration with 262 physicians from 60 countries and includes 5,000 simulated health conversations. The dataset focuses on determining whether AI systems can deliver optimal responses to health-related queries. Each response is analysed based on a rubric written by physicians, with criteria weighted according to medical judgment. GPT-4.1 is used to score these responses.
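To make the idea of a weighted rubric concrete, here is a minimal Python sketch of how a single response might be scored. It is an illustration under stated assumptions, not OpenAI's implementation: the `Criterion` class, the `rubric_score` function, the point values and the clipping to the 0 to 1 range are all hypothetical, and the `met` flags stand in for the judgments that a grader model such as GPT-4.1 would supply.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One physician-written rubric item (hypothetical structure)."""
    description: str
    points: int  # weight: positive for desirable behaviour, negative for harmful
    met: bool    # grader's judgment on whether the response satisfies the item


def rubric_score(criteria):
    """Return earned points divided by the maximum positive points.

    Assumption: the result is clipped to [0, 1] so that penalties
    cannot push a response's score below zero.
    """
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points == 0:
        raise ValueError("rubric needs at least one positively weighted criterion")
    earned = sum(c.points for c in criteria if c.met)
    return min(1.0, max(0.0, earned / max_points))


# Made-up rubric and weights, for illustration only.
example = [
    Criterion("Advises calling emergency services", 10, True),
    Criterion("Advises checking for breathing", 6, True),
    Criterion("Explains how to position the person", 4, False),
    Criterion("Recommends giving food or drink", -5, False),
]

print(f"{rubric_score(example):.0%}")
```

In this sketch a response earns the weight of every criterion it meets, so missing a heavily weighted item costs more than missing a minor one.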
On HealthBench, OpenAI's o3 reasoning model performs best with a score of 60 percent, followed by Grok, from Elon Musk's xAI, at 54 percent, and Google's Gemini 2.5 Pro at 52 percent. The dataset spans 49 languages, including Amharic and Nepali, and covers 26 medical specialities, such as neurology and ophthalmology.
In one example shared by OpenAI, the dataset poses a scenario in which a 70-year-old neighbour is found unresponsive on the floor, and the AI model is asked what steps should be taken. The model provides instructions such as calling emergency services, checking for breathing, and ensuring the airway is clear. HealthBench then evaluates the response against the rubric, highlighting the correct actions and the areas for improvement, and gives a final score of 77 percent in this instance.
This launch marks OpenAI's first significant venture into AI applications in health care, beyond external partnerships. HealthBench is poised to be a valuable tool for understanding how well AI models can support medical decision-making.
In addition to the health care dataset, OpenAI recently updated ChatGPT's web search feature to offer personalised product recommendations. The search tool, popular among users, provides tailored suggestions across various categories and is available to all users worldwide, regardless of subscription tier. The update strengthens OpenAI's position in the competitive search landscape and will challenge established players like Google.