Model Performance

Evaluating large language models (LLMs) is becoming increasingly difficult. One major challenge is test set contamination, where benchmark questions unintentionally end up in a model’s training data—skewing results and making once-reliable benchmarks quickly outdated. While newer benchmarks try to avoid this by using crowdsourced questions or LLM-based evaluations, these methods come with their own problems, like bias and difficulty in judging complex tasks.

That’s where LiveBench comes in.

LiveBench is a benchmark designed to address these issues head-on. It features regularly updated questions sourced from fresh content—like math competitions, academic papers, and news articles—and scores answers automatically using objective ground-truth values. It covers a wide range of tough tasks, including math, coding, reasoning, and instruction following, pushing LLMs to their limits.

With questions refreshed monthly and difficulty scaling over time, LiveBench is built not just for today’s models but for the next wave of AI breakthroughs. Top models currently score below 70%, showing just how challenging—and necessary—this benchmark is.
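To make the idea of objective, ground-truth scoring concrete, here is a minimal sketch of how such automatic grading could look, assuming a simple exact-match task. LiveBench's actual scoring functions vary by category and are more involved than this; the example only illustrates grading against fixed ground-truth values rather than a human or LLM judge.

```python
# Minimal sketch of ground-truth scoring (assumption: simple exact-match task).
# LiveBench's real scorers differ per task; this only shows the basic idea of
# automatic, objective grading against known correct answers.

def score_answer(model_output: str, ground_truth: str) -> float:
    """Return 1.0 if the normalized model answer matches the ground truth, else 0.0."""
    normalize = lambda s: s.strip().lower()
    return 1.0 if normalize(model_output) == normalize(ground_truth) else 0.0

def benchmark_score(outputs: list[str], truths: list[str]) -> float:
    """Average score across all questions, reported as a percentage."""
    scores = [score_answer(o, t) for o, t in zip(outputs, truths)]
    return 100.0 * sum(scores) / len(scores)

# Example: two of three answers match the ground truth -> ~66.7%
print(benchmark_score(["42", "Paris", "blue"], ["42", "paris", "red"]))
```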

I put together a few charts from the benchmark results.

Model Performance by Average Score

Model Reasoning Performance

Model Coding Performance

Model Language Performance

Recap

I created all diagrams using GPT-4o, based on data obtained from https://livebench.ai/#/?Coding=a

If some models are missing from the diagrams, please forgive me (they were omitted during GPT diagram generation :)).
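For anyone who prefers to rebuild these charts locally instead of via GPT-4o, a small plotting sketch is shown below. It assumes a CSV export of the leaderboard with "model" and "average" columns; those column names are placeholders, so adapt them to whatever format the LiveBench site actually provides.

```python
# Hypothetical sketch of rebuilding one of the diagrams locally.
# Assumes a CSV file "livebench_scores.csv" with columns "model" and "average"
# (placeholder names; adjust to the real leaderboard export).
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("livebench_scores.csv")
df = df.sort_values("average", ascending=True)  # lowest score at the bottom of the chart

plt.figure(figsize=(8, 0.4 * len(df)))
plt.barh(df["model"], df["average"])
plt.xlabel("Average score (%)")
plt.title("Model Performance by Average Score")
plt.tight_layout()
plt.savefig("average_score.png")
```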