OpenAI has recently unveiled SimpleQA, a groundbreaking benchmark to assess the factual accuracy of large language models (LLMs) that power generative AI (genAI).
Imagine it as an SAT for genAI chatbots, featuring 4,326 questions spanning various fields like science, politics, pop culture, and art. Each question has a single correct answer validated by independent reviewers.
Each question is posed to the model 100 times so researchers can track how often each answer appears; a model that is genuinely confident should give the same answer nearly every time.
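To make that repeated-sampling idea concrete, here is a minimal sketch of the counting logic, assuming a hypothetical `ask_model()` helper that sends one question to a chat model and returns its short answer; the actual SimpleQA harness is more elaborate, but the underlying idea is the same.

```python
from collections import Counter

def ask_model(question: str) -> str:
    """Hypothetical helper: send `question` to an LLM and return its short answer.
    In practice this would wrap a chat-completions API call."""
    raise NotImplementedError

def answer_consistency(question: str, trials: int = 100) -> tuple[str, float]:
    """Ask the same question `trials` times and report the most frequent
    answer along with the fraction of trials on which it appeared.
    A confident, well-calibrated model should return the same answer
    on nearly every trial."""
    answers = Counter(ask_model(question) for _ in range(trials))
    top_answer, count = answers.most_common(1)[0]
    return top_answer, count / trials

# Example usage (once ask_model is implemented):
# answer, frequency = answer_consistency("Who painted 'The Night Watch'?")
# print(f"{answer!r} appeared in {frequency:.0%} of 100 trials")
```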
The questions were deliberately selected because AI models, particularly those built on OpenAI’s GPT-4, have historically answered them incorrectly. This adversarial selection means that low accuracy scores reflect performance on intentionally difficult questions rather than overall model capability.
The approach mirrors the SAT’s emphasis on challenging material over common knowledge: questions that high school students must work hard to master. The benchmark results show that OpenAI’s models struggle with these kinds of questions and frequently produce inaccurate responses.
OpenAI’s o1-preview model achieved a success rate of 42.7%, with GPT-4o following at 38.2% accuracy. The smaller GPT-4o-mini scored only 8.6%. Surprisingly, Anthropic’s Claude-3.5-sonnet performed even worse than OpenAI’s top model, answering just 28.9% of questions correctly.
Source: Computerworld (www.computerworld.com), November 22, 2024