New ClockBench Benchmark Exposes AI's Struggle with Analog Clocks

AI models struggle to read analog clocks. Google's Gemini 2.5 Pro leads the field, yet human performance remains far ahead.

An analog clock.

A new benchmark, ClockBench, has been introduced to assess AI models' ability to read analog clocks. The top-performing model, Gemini 2.5 Pro, scores just 13.3% accuracy, far below the human baseline of 89.1%.

ClockBench, created to evaluate AI models' visual reasoning skills, consists of 180 unique analog clocks and 720 questions, four per clock. It is designed to be 'easy for humans, hard for AI'.

The test revealed that while AI models show basic visual reasoning, they struggle with the initial step of extracting visual information from the clock face. Features such as Roman numerals, circular numbers, and colorful backgrounds posed particular challenges. Even Gemini 2.5 Pro, the top performer, trailed human accuracy by 75.8 percentage points.

Grok 4 performed poorly with 0.7% accuracy, marking 63.3% of clocks as invalid. GPT-5 came in third with 8.4% accuracy, and varying its reasoning budget had little impact. When models answered incorrectly, their median error was far larger than that of humans.
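The article does not say how ClockBench measures error size, but a natural metric for clock reading is the circular distance between the predicted and the actual time, so that misreading 11:58 as 12:02 counts as a 4-minute error rather than an 11-hour one. Below is a minimal sketch of such a metric; the function name and sample data are hypothetical illustrations, not ClockBench's actual scoring code.

```python
from statistics import median

def clock_error_minutes(pred_h, pred_m, true_h, true_m):
    """Circular error between two 12-hour clock readings, in minutes.

    Times on an analog face wrap around, so the error is measured the
    shorter way around the dial (at most 6 hours = 360 minutes).
    """
    cycle = 12 * 60
    pred = (pred_h % 12) * 60 + pred_m
    true = (true_h % 12) * 60 + true_m
    diff = abs(pred - true) % cycle
    return min(diff, cycle - diff)

# Reading 11:58 as 12:02 is a 4-minute error, not an 11-hour one.
print(clock_error_minutes(11, 58, 12, 2))   # 4

# Median error over a batch of hypothetical wrong answers.
wrong = [(11, 58, 12, 2), (3, 15, 9, 15), (6, 0, 6, 30)]
print(median(clock_error_minutes(*w) for w in wrong))  # 30
```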

ClockBench is an ongoing benchmark, and a public version is available. It underscores how far AI models lag behind humans at reading analog clocks, and further research will be needed to close the gap.
