From a Reddit post @ https://www.reddit.com/r/LocalLLaMA/comments/1hs6ftc/killed_by_llm_i_collected_data_on_ai_benchmarks/
Killed by LLM – Collected Data on AI Benchmarks
For my year-end recap, I collected data on how quickly AI benchmarks are becoming obsolete.
It's interesting to look back:
2023: GPT-4 was truly something new
It didn't just beat SOTA scores; it completely saturated benchmarks
It was the first time humanity created something that could beat the Turing test
It created a clear "before/after" divide
2024: Others caught up, and progress came in fits and starts
o1/o3 used test-time compute to saturate math and reasoning benchmarks
Sonnet 3.5/4o incrementally pushed existing benchmarks into saturation, and drove new visual evals to saturation too
Llama 3/Qwen 2.5 made open-weight models competitive across the board
Today: We need better benchmarks
I'm amazed to see tasks I didn't think we'd solve until 2030 become obsolete, and yet we still can't trust a model to do the same tasks as a junior.
It's clear our benchmarks aren't yet measuring real-world reliability. I hope we see as much progress in benchmarks in 2025 as we do in models.
Let me know what you think!
Code + data (if you'd like to contribute): https://github.com/R0bk/killedbyllm
Interactive view: https://r0bk.github.io/killedbyllm/