Humanity’s Last Exam: Global Push for Tougher Tests to Challenge AI

A global initiative called “Humanity’s Last Exam” is seeking complex, expert-level questions to challenge artificial intelligence (AI) systems, as existing benchmarks have become too easy for advanced models. The project, spearheaded by the non-profit Center for AI Safety (CAIS) and the startup Scale AI, aims to determine when AI reaches true expert-level capability and to remain a meaningful measure as models continue to improve.

The call for tougher questions follows the preview of OpenAI’s new model, known as OpenAI o1, which has surpassed popular reasoning benchmarks. Dan Hendrycks, executive director of CAIS and an advisor to Elon Musk’s xAI, noted that benchmarks built on undergraduate-level knowledge have become too easy: models such as ChatGPT and Claude now score far higher on them than earlier versions did.

While AI models have improved on many reasoning tasks, they still perform poorly on lesser-used tests of planning and abstract reasoning, such as the ARC-AGI visual pattern-recognition puzzles, on which OpenAI o1 scored only 21%.

“Humanity’s Last Exam” will focus on abstract reasoning and will include over 1,000 crowd-sourced questions, with submissions due by November 1. The best questions will undergo peer review, and selected contributors could win up to $5,000 in prizes or co-authorship credits. One restriction is that the project will not accept questions related to weapons, to avoid potential risks of AI learning dangerous knowledge.

Alexandr Wang, CEO of Scale AI, emphasized the importance of developing more challenging tests to measure AI’s rapid advancements accurately.
