{"id":16437,"date":"2026-05-30T09:45:16","date_gmt":"2026-05-30T16:45:16","guid":{"rendered":"https:\/\/mattfife.com\/?p=16437"},"modified":"2026-05-30T09:45:16","modified_gmt":"2026-05-30T16:45:16","slug":"deepswe-rates-your-llm","status":"publish","type":"post","link":"https:\/\/mattfife.com\/?p=16437","title":{"rendered":"DeepSWE rates your LLM"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">Measuring exactly how well an AI does on real tasks is still an open problem. There are benchmarks, but there has been plenty of argument that these early benchmarks are too limited or biased. A new player is on the field, and they seem to have discovered that some benchmarks grade results incorrectly a shocking amount of the time.<\/p>\n\n\n\n<figure class=\"wp-block-embed is-type-rich is-provider-x wp-block-embed-x\"><div class=\"wp-block-embed__wrapper\">\n<div class=\"embed-x\"><blockquote class=\"twitter-tweet\" data-width=\"550\" data-dnt=\"true\"><p lang=\"en\" dir=\"ltr\">Today we\u2019re releasing DeepSWE, a new standard for agentic coding benchmarks.<br><br>On public leaderboards, top models often look relatively close in capability. DeepSWE shows where they actually diverge, reflecting the realistic experience of developers in their day-to-day work. <a href=\"https:\/\/t.co\/HCDcjNuTFK\">pic.twitter.com\/HCDcjNuTFK<\/a><\/p>&mdash; Serena Ge (Datacurve) (@serenaa_ge) <a href=\"https:\/\/x.com\/serenaa_ge\/status\/2059308218564890875?ref_src=twsrc%5Etfw\">May 26, 2026<\/a><\/blockquote><script async src=\"https:\/\/platform.x.com\/widgets.js\" charset=\"utf-8\"><\/script><\/div>\n<\/div><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">A startup called Datacurve released a benchmark it says does a much better job. <a href=\"https:\/\/deepswe.datacurve.ai\/blog\">DeepSWE<\/a>, a 113-task evaluation spanning 91 open-source repositories and five programming languages, produces a dramatically wider spread among the same frontier models that look close on public leaderboards. 
<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" width=\"640\" height=\"480\" data-attachment-id=\"16438\" data-permalink=\"https:\/\/mattfife.com\/?attachment_id=16438\" data-orig-file=\"https:\/\/i0.wp.com\/mattfife.com\/wp-content\/themes\/mattTheme\/headerimgs\/2026\/05\/image-5.png?fit=734%2C551&amp;ssl=1\" data-orig-size=\"734,551\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;,&quot;alt&quot;:&quot;&quot;}\" data-image-title=\"image\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/i0.wp.com\/mattfife.com\/wp-content\/themes\/mattTheme\/headerimgs\/2026\/05\/image-5.png?fit=640%2C480&amp;ssl=1\" src=\"https:\/\/i0.wp.com\/mattfife.com\/wp-content\/themes\/mattTheme\/headerimgs\/2026\/05\/image-5.png?resize=640%2C480&#038;ssl=1\" alt=\"\" class=\"wp-image-16438\" srcset=\"https:\/\/i0.wp.com\/mattfife.com\/wp-content\/themes\/mattTheme\/headerimgs\/2026\/05\/image-5.png?w=734&amp;ssl=1 734w, https:\/\/i0.wp.com\/mattfife.com\/wp-content\/themes\/mattTheme\/headerimgs\/2026\/05\/image-5.png?resize=360%2C270&amp;ssl=1 360w, https:\/\/i0.wp.com\/mattfife.com\/wp-content\/themes\/mattTheme\/headerimgs\/2026\/05\/image-5.png?resize=300%2C225&amp;ssl=1 300w\" sizes=\"auto, (max-width: 640px) 100vw, 640px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Its biggest shock is the claim that many benchmarks aren&#8217;t even correct. Their tests cover an impressively wide range of characteristics, including bigger, more representative tasks. 
They avoid what they call &#8216;contamination&#8217;, which happens when benchmarks rely on simple GitHub coding samples that some models simply regurgitate rather than truly generate. And most damning &#8211; they found that some benchmarks&#8217; verifiers (the part of the code that checks whether what the AI built is correct) give false positive\/negative results 8-24% of the time. <\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/venturebeat.com\/_next\/image?url=https%3A%2F%2Fimages.ctfassets.net%2Fjdtwqhzvc2n1%2F5OFgNKCFyANJu5nyAr7sL%2Fe8ff388e60faba69c0e28371f87588f5%2FScreenshot_2026-05-26_at_3.22.11%25C3%25A2__PM.png%3Fw%3D1000%26q%3D100&amp;w=3840&amp;q=75\" alt=\"Screenshot 2026-05-26 at 3.22.11\u202fPM\"\/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">More benchmarks and more testing are valuable for evaluating models &#8211; so hopefully these guys will help push the industry toward more scrutiny and reproducible real-world results.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">You can even go <a href=\"https:\/\/deepswe.datacurve.ai\/run\" data-type=\"link\" data-id=\"https:\/\/deepswe.datacurve.ai\/run\">download it and try it on your own models<\/a>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Links:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><a href=\"https:\/\/venturebeat.com\/technology\/deepswe-blows-up-the-ai-coding-leaderboard-crowns-gpt-5-5-and-finds-claude-opus-exploiting-a-benchmark-loophole\">https:\/\/venturebeat.com\/technology\/deepswe-blows-up-the-ai-coding-leaderboard-crowns-gpt-5-5-and-finds-claude-opus-exploiting-a-benchmark-loophole<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/deepswe.datacurve.ai\/blog\">https:\/\/deepswe.datacurve.ai\/blog<\/a><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Measuring exactly how well an AI does on real tasks is still an open problem. There are benchmarks, but there has been plenty of argument that these early benchmarks are too limited or biased. 
A new player is on the field, and they seem to have discovered that some benchmarks grade results incorrectly a shocking amount of the time. A startup called Datacurve released a benchmark it says does a much better job. DeepSWE, a 113-task evaluation spanning 91 open-source repositories&#8230;<\/p>\n<p class=\"read-more\"><a class=\"btn btn-default\" href=\"https:\/\/mattfife.com\/?p=16437\"> Read More<span class=\"screen-reader-text\">  Read More<\/span><\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2},"jetpack_post_was_ever_published":false},"categories":[28,9],"tags":[],"class_list":["post-16437","post","type-post","status-publish","format-standard","hentry","category-ai","category-cool"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p4WECr-4h7","jetpack-related-posts":[],"_links":{"self":[{"href":"https:\/\/mattfife.com\/index.php?rest_route=\/wp\/v2\/posts\/16437","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mattfife.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mattfife.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mattfife.com\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mattfife.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=16437"
}],"version-history":[{"count":1,"href":"https:\/\/mattfife.com\/index.php?rest_route=\/wp\/v2\/posts\/16437\/revisions"}],"predecessor-version":[{"id":16439,"href":"https:\/\/mattfife.com\/index.php?rest_route=\/wp\/v2\/posts\/16437\/revisions\/16439"}],"wp:attachment":[{"href":"https:\/\/mattfife.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=16437"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mattfife.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=16437"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mattfife.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=16437"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}