Wednesday, January 22, 2025

Killed by LLM

From a Reddit post at https://www.reddit.com/r/LocalLLaMA/comments/1hs6ftc/killed_by_llm_i_collected_data_on_ai_benchmarks/

Killed by LLM – Collected Data on AI Benchmarks

For my year-end project, I collected data on how quickly AI benchmarks are becoming obsolete.

It's interesting to look back:

2023: GPT-4 was truly something new

  • It didn't just beat SOTA scores, it completely saturated benchmarks

  • It was the first time humanity created something that could pass the Turing test

  • It created a clear "before/after" divide

2024: Others caught up, progress in fits and spurts

  • O1/O3 used test-time compute to saturate math and reasoning benchmarks

  • Sonnet 3.5 / 4o incrementally pushed several benchmarks into saturation, and saturated new visual evals as well

  • Llama 3 / Qwen 2.5 made open-weight models competitive across the board

Today: We need better benchmarks

  • I'm amazed to see benchmarks for tasks I didn't think we'd solve until 2030 become obsolete, and yet we still can't trust a model to do the same tasks as a junior

  • It's clear our benchmarks aren't yet measuring real-world reliability. I hope we see as much progress in benchmarks as we do in models in 2025.

Let me know what you think!

Code + data (if you'd like to contribute): https://github.com/R0bk/killedbyllm
Interactive view: https://r0bk.github.io/killedbyllm/
