A site devoted mostly to everything related to Information Technology under the sun - among other things.

Wednesday, January 22, 2025

Killed by LLM

From a Reddit post at https://www.reddit.com/r/LocalLLaMA/comments/1hs6ftc/killed_by_llm_i_collected_data_on_ai_benchmarks/

Killed by LLM – Collected Data on AI Benchmarks

At year-end, I collected data on how quickly AI benchmarks are becoming obsolete.

It's interesting to look back:

2023: GPT-4 was truly something new

  • It didn't just beat SOTA scores, it completely saturated benchmarks

  • It was the first time humanity created something that could beat the Turing test

  • It created a clear "before/after" divide

2024: Others caught up, progress in fits and spurts

  • o1/o3 used test-time compute to saturate math and reasoning benchmarks

  • Sonnet 3.5/4o nudged some existing benchmarks into saturation and did the same for new visual evals

  • Llama 3/Qwen 2.5 made open-weight models competitive across the board

Today: We need better benchmarks

  • I'm amazed to see tasks I didn't think we'd solve until 2030 become obsolete, and yet we still can't trust a model to do the same tasks as a junior

  • It's clear our benchmarks aren't yet measuring real-world reliability. I hope we see as much progress in benchmarks as in models in 2025.

Let me know what you think!

Code + data (if you'd like to contribute): https://github.com/R0bk/killedbyllm
Interactive view: https://r0bk.github.io/killedbyllm/


About Me

I had been a senior software developer working for HP and GM. I am interested in intelligent and scientific computing. I am passionate about computers as enablers for human imagination. The contents of this site are not in any way, shape, or form endorsed, approved, or otherwise authorized by HP, its subsidiaries, or its officers and shareholders.
