OpenAI has recently unveiled SimpleQA, a new benchmark to assess how accurately AI models answer short, fact-seeking questions.
Imagine it as an SAT for genAI chatbots, featuring thousands of short factual questions, each with a single, verifiable answer.
Each question is asked 100 times to track the frequency of each response; a confident model should give the same answer every time.
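A minimal sketch of this repeated-query consistency check, assuming a hypothetical `ask_model` function standing in for a real chat-completion call (it is not part of SimpleQA or any OpenAI API):

```python
from collections import Counter
import random


def ask_model(question: str) -> str:
    # Placeholder for a real model call; here it just simulates a model
    # that usually, but not always, gives the same short answer.
    return random.choice(["Paris", "Paris", "Paris", "Lyon"])


def consistency_score(question: str, trials: int = 100) -> float:
    """Ask the same question `trials` times and return the share of
    responses matching the most frequent answer (1.0 = fully consistent)."""
    answers = Counter(ask_model(question) for _ in range(trials))
    most_common_count = answers.most_common(1)[0][1]
    return most_common_count / trials


if __name__ == "__main__":
    q = "What is the capital of France?"
    print(f"Consistency: {consistency_score(q):.2f}")
```

A score near 1.0 suggests the model is confident in its answer; a flat spread across many answers signals guessing, which is the behavior the 100-repetition protocol is designed to expose.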
The questions were carefully chosen because they have historically been difficult for AI models, especially those relying solely on knowledge encoded in their training data.
This mirrors the SAT's focus on challenging questions rather than common knowledge, the kind of material high school students must work hard to master. The benchmark results reveal that OpenAI's models struggle with these types of questions and tend to produce inaccurate responses.
OpenAI's o1-preview model achieved the top success rate at 42.7%, with GPT-4o close behind at 38.2% accuracy. The smaller GPT-4o-mini scored only 8.6%. Notably, Anthropic's Claude-3.5-sonnet answered just 28.9% of questions correctly, worse than either of OpenAI's flagship models.
Source: www.computerworld.com