DeepSWE rates your LLM

Measuring exactly how well an AI performs on real tasks remains an open problem. Benchmarks exist, but critics have long argued that the current early ones are too limited or biased. A new player is on the field, and it claims that some benchmarks score results incorrectly a shocking share of the time.

A startup called Datacurve has released a benchmark it says does a much better job. DeepSWE, a 113-task evaluation spanning 91 open-source repositories and five programming languages, produces a dramatically wider spread of results among the same frontier models.

Its biggest shock is the claim that many benchmarks aren’t even correct. DeepSWE’s tasks cover an impressively wide range, including bigger, more representative tasks. The benchmark avoids what Datacurve calls ‘contamination’, which results when benchmarks rely on simple GitHub coding samples that some models simply regurgitate rather than truly generate. And most damning: Datacurve found that some benchmarks’ verifiers (the part of the code that checks whether what the AI built is correct) give false positives or false negatives 8-24% of the time.
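To make the verifier claim concrete, here is a minimal Python sketch of what false positive and false negative rates mean for a benchmark verifier. Everything in it is hypothetical: the `VerifierResult` type, the `error_rates` function, and the sample data are made up for illustration and are not taken from DeepSWE or from Datacurve's methodology. The post does not say how the 8-24% figures were computed, so the sketch assumes the conventional definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VerifierResult:
    """One reviewed task outcome (hypothetical structure, illustration only)."""
    verifier_passed: bool    # what the benchmark's verifier reported
    actually_correct: bool   # what a careful manual review concluded


def error_rates(results):
    """Return (false_positive_rate, false_negative_rate) as fractions.

    False positive: the verifier passed a solution that is actually wrong.
    False negative: the verifier failed a solution that is actually right.
    """
    wrong = [r for r in results if not r.actually_correct]
    right = [r for r in results if r.actually_correct]
    false_positives = sum(r.verifier_passed for r in wrong)
    false_negatives = sum(not r.verifier_passed for r in right)
    fpr = false_positives / len(wrong) if wrong else 0.0
    fnr = false_negatives / len(right) if right else 0.0
    return fpr, fnr


# Made-up sample of 10 reviewed solutions; NOT real benchmark data.
sample = (
    [VerifierResult(True, True)] * 4      # correctly passed
    + [VerifierResult(False, True)] * 1   # good solution rejected
    + [VerifierResult(True, False)] * 1   # bad solution accepted
    + [VerifierResult(False, False)] * 4  # correctly failed
)

fpr, fnr = error_rates(sample)
print(f"false positive rate: {fpr:.0%}, false negative rate: {fnr:.0%}")
# prints "false positive rate: 20%, false negative rate: 20%"
```

Even in this toy sample, two misjudged solutions out of ten are enough to move a model's score noticeably, which is why verifier errors in the range the post describes would matter when comparing models.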

More benchmarks and more testing are valuable for evaluating models, so hopefully these guys will help push the industry toward more scrutiny and reproducible real-world results.

You can even go download it and try it on your own models.
