Wednesday, January 22, 2025

Killed by LLM

From a Reddit post at https://www.reddit.com/r/LocalLLaMA/comments/1hs6ftc/killed_by_llm_i_collected_data_on_ai_benchmarks/

Killed by LLM – Collected Data on AI Benchmarks

For my year-end project, I collected data on how quickly AI benchmarks are becoming obsolete.

It's interesting to look back:

2023: GPT-4 was truly something new

  • It didn't just beat SOTA scores, it completely saturated benchmarks

  • It was the first time humanity created something that could pass the Turing test

  • It created a clear "before/after" divide

2024: Others caught up, progress in fits and spurts

  • O1/O3 used test-time compute to saturate math and reasoning benchmarks

  • Sonnet 3.5 / 4o incrementally pushed several benchmarks into saturation, and saturated new visual evals as well

  • Llama 3 / Qwen 2.5 made open-weight models competitive across the board

Today: We need better benchmarks

  • I'm amazed to see benchmarks for tasks I didn't think we'd solve until 2030 become obsolete, and yet we still can't trust a model to do the same tasks as a junior

  • It's clear our benchmarks aren't yet measuring real-world reliability. I hope we see as much progress in benchmarks as we do in models in 2025.

Let me know what you think!

Code + data (if you'd like to contribute): https://github.com/R0bk/killedbyllm
Interactive view: https://r0bk.github.io/killedbyllm/
