DeepSWE rates your LLM

May 30, 2026 matt Comments 0 Comment

Rating exactly how well an AI performs on real tasks is still an open problem. There are benchmarks, but there have been plenty of arguments that these early benchmarks are too limited or biased. A new player is on the field, and it seems to have discovered that some benchmarks grade results incorrectly a shocking amount of the time.

Today we’re releasing DeepSWE, a new standard for agentic coding benchmarks.

On public leaderboards, top models often look relatively close in capability. DeepSWE shows where they actually diverge, reflecting the realistic experience of developers in their day-to-day work. pic.twitter.com/HCDcjNuTFK
— Serena Ge (Datacurve) (@serenaa_ge) May 26, 2026

A startup called Datacurve has released a benchmark it says does a much better job. DeepSWE, a 113-task evaluation spanning 91 open-source repositories and five programming languages, produces a dramatically wider spread among the same frontier models that look close on public leaderboards.

Its biggest shock is the claim that many benchmarks aren’t even grading correctly. DeepSWE's tests cover an impressively wide range of characteristics, with bigger, more representative tasks. They avoid what they call ‘contamination’, which results from benchmarks that rely on simple GitHub coding samples that some models simply regurgitate rather than truly generate. And most damning: they found that some benchmarks' verifiers (the part of the code that checks whether what the AI built is correct) give false positives or false negatives 8-24% of the time.
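To make the verifier numbers a little more concrete, here is a minimal sketch of how false positive and false negative rates could be computed. This is not Datacurve's method: the `Verdict` class, the `verifier_error_rates` function, the rate definitions, and the sample data are all made up for illustration, and it assumes a human reviewer has already decided which solutions are really correct.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    verifier_passed: bool   # what the benchmark's automated verifier reported
    actually_correct: bool  # what a careful human review concluded (assumed ground truth)


def verifier_error_rates(verdicts):
    """Return (false_positive_rate, false_negative_rate) for a list of Verdicts.

    Assumed definitions (hypothetical, not taken from DeepSWE):
      false positive rate = wrong solutions the verifier passed / all wrong solutions
      false negative rate = correct solutions the verifier failed / all correct solutions
    """
    wrong = [v for v in verdicts if not v.actually_correct]
    right = [v for v in verdicts if v.actually_correct]
    false_pos = sum(v.verifier_passed for v in wrong)
    false_neg = sum(not v.verifier_passed for v in right)
    fpr = false_pos / len(wrong) if wrong else 0.0
    fnr = false_neg / len(right) if right else 0.0
    return fpr, fnr


# Invented sample: 5 correct and 5 incorrect solutions.
sample = (
    [Verdict(True, True)] * 4      # correct, verifier agrees
    + [Verdict(False, True)] * 1   # correct, but verifier rejects (false negative)
    + [Verdict(False, False)] * 4  # incorrect, verifier agrees
    + [Verdict(True, False)] * 1   # incorrect, but verifier accepts (false positive)
)

fpr, fnr = verifier_error_rates(sample)
print(f"false positive rate: {fpr:.0%}, false negative rate: {fnr:.0%}")
```

A verifier that is wrong this often can reorder a leaderboard on its own, which is why the error rate matters as much as the model scores.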

More benchmarks and more testing are valuable for evaluating models, so hopefully these guys will help push the industry toward more scrutiny and reproducible real-world results.

You can even go download it and try it on your own models.

