Cryptocnews-Crypto News, Cryptocurrency News, Blockchain News, NFT News

    OpenAI Says Benchmark Used to Measure AI Coding Skill Is ‘Contaminated’—Here’s Why

    By admin | 02/24/2026 | 4 Mins Read

    In brief

    • OpenAI argues that SWE-bench Verified no longer reflects real coding ability because the benchmark is allegedly contaminated.
    • It is now pushing SWE-bench Pro as a tougher replacement.
    • Scores plunged from ~70% to ~23% on the newer benchmark.

    The number that every major AI lab has been using to claim coding supremacy was just declared meaningless.

    OpenAI published a post this week announcing that SWE-bench Verified, the go-to benchmark for measuring AI coding capabilities, is so riddled with flawed tests and training data leakage that it no longer tells you anything useful about whether a model can actually write software.

    The benchmark works like this: Give an AI a real GitHub issue from a popular open-source Python project, ask it to fix the bug without seeing the tests, and check if its patch makes the failing tests pass without breaking anything else.
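
The pass/fail rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the benchmark's actual harness: the function name, the test names, and the `results` dictionary (test name mapped to whether it passed after the patch was applied) are all hypothetical.

```python
from typing import Dict, Iterable


def is_resolved(
    fail_to_pass: Iterable[str],
    pass_to_pass: Iterable[str],
    results: Dict[str, bool],
) -> bool:
    """Return True only if the patch fixes every target test and breaks none.

    fail_to_pass: tests that failed before the patch and must now pass.
    pass_to_pass: tests that passed before the patch and must still pass.
    results:      hypothetical mapping of test name -> passed after patch.
    A test missing from `results` is treated as a failure.
    """
    fixed = all(results.get(test, False) for test in fail_to_pass)
    intact = all(results.get(test, False) for test in pass_to_pass)
    return fixed and intact


# Hypothetical run: the bug is fixed, but one existing test now fails.
results = {"test_bug": True, "test_existing_a": True, "test_existing_b": False}
print(is_resolved(["test_bug"], ["test_existing_a", "test_existing_b"], results))
```

Under this rule a patch that fixes the reported bug but breaks an unrelated test still counts as a failure, which is why overly narrow or off-topic tests can sink an otherwise reasonable fix.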

    OpenAI created SWE-bench Verified in August 2024 as a cleaner version of the original 2023 benchmark, recruiting 93 software engineers to filter out tasks that were impossible or poorly designed.

    The cleanup worked well enough that every major lab started citing scores on it as proof of progress. When Anthropic launched Claude Opus 4 in May 2025, Decrypt reported that the model scored 72.5% on SWE-bench Verified, beating GPT-4.1’s 54.6% and Gemini 2.5 Pro’s 63.2%. It was the coding benchmark that mattered.

    Since then, AI labs from America to China have cited their SWE-bench Verified scores to claim the throne as the best model for coding.

    Image: Minimax

    Now OpenAI says that race was partly a mirage. According to the report, the team audited 138 tasks that GPT-5.2 consistently failed across 64 independent runs, and had six engineers review each one. It ultimately concluded that 59.4% of those tasks are broken.

    About 35.5% have tests so narrowly written that they require a specific function name never mentioned in the problem description. Another 18.8% check for features that weren’t part of the original problem at all, gathered from unrelated pull requests.
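
To put those percentages in terms of tasks, here is a quick back-of-the-envelope calculation. It assumes that all three percentages are shares of the same 138 audited tasks, which the article implies but does not state outright, and that the two named failure categories do not overlap. The variable names are illustrative.

```python
AUDITED_TASKS = 138  # tasks GPT-5.2 consistently failed, per the article

# Percentages quoted in the article (assumed to share the 138-task base).
shares = {
    "broken overall": 59.4,
    "overly narrow tests": 35.5,
    "tests for unrequested features": 18.8,
}


def approx_count(percent: float, total: int = AUDITED_TASKS) -> int:
    """Convert a percentage of the audited set into a rounded task count."""
    return round(total * percent / 100)


counts = {label: approx_count(percent) for label, percent in shares.items()}

# Broken tasks not explained by the two named categories
# (only meaningful if those categories do not overlap).
other_broken = (
    counts["broken overall"]
    - counts["overly narrow tests"]
    - counts["tests for unrequested features"]
)

for label, count in counts.items():
    print(f"{label}: ~{count} tasks")
print(f"broken for other reasons: ~{other_broken} tasks")
```

On those assumptions, roughly 82 of the 138 audited tasks are broken: about 49 because of overly narrow tests, about 26 because the tests check unrequested features, and about 7 for reasons the article does not break out.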

    The contamination problem roughly works like this: SWE-bench pulls its problems from open-source repositories that most AI companies crawl when building training sets. OpenAI tested whether GPT-5.2, Claude Opus 4.5, and Gemini 3 Flash Preview had seen the benchmark’s solutions during training. All three had.

    Given only a task ID and a brief hint, each model could reproduce the exact code fix from memory, including variable names and inline comments that appear nowhere in the problem description. In one case, GPT-5.2’s chain-of-thought logs showed it reasoning that a specific parameter must have been “added around Django 4.1,” a detail found only in Django’s release notes, not the task description. It was answering a question it had already seen the answer to.
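
One simple way to flag this kind of verbatim recall is to compare a model's output against the reference fix character by character. The sketch below is not OpenAI's procedure, which the article does not describe in detail. It uses Python's standard `difflib` module, and the function names, the sample patches, and the 0.9 threshold are arbitrary assumptions chosen for illustration.

```python
import difflib


def patch_similarity(generated: str, reference: str) -> float:
    """Return a 0.0-1.0 similarity ratio between two patch texts."""
    return difflib.SequenceMatcher(None, generated, reference).ratio()


def looks_memorized(generated: str, reference: str, threshold: float = 0.9) -> bool:
    """Flag outputs that are nearly identical to the reference fix.

    The threshold is an arbitrary illustrative value, not a published one.
    """
    return patch_similarity(generated, reference) >= threshold


# Hypothetical reference fix and two hypothetical model outputs.
reference_patch = "timeout = options.get('timeout', 30)  # default to 30 seconds\n"
verbatim_output = "timeout = options.get('timeout', 30)  # default to 30 seconds\n"
independent_output = "if 'timeout' not in options:\n    options['timeout'] = 30\n"

print(looks_memorized(verbatim_output, reference_patch))
print(round(patch_similarity(independent_output, reference_patch), 2))
```

A high score alone does not prove contamination, since a short fix may have only one natural form. Matching identifiers and comments that never appear in the task description, as in the cases OpenAI reported, is the stronger signal.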

    OpenAI now recommends SWE-bench Pro, a newer benchmark from Scale AI that uses more diverse codebases and licenses that reduce training data exposure. The performance drop is jarring: models that cleared 70% on the old Verified benchmark score around 23% on SWE-bench Pro’s public split, and even less on its private tasks.

    On the current public SWE-bench Verified leaderboard, OpenAI is far from the benchmark’s podium. Retiring a benchmark where you’re losing and endorsing one where everyone starts at 23% resets the scoreboard at a convenient moment and makes the competitors’ claims less impressive.

    This is especially important because the much-anticipated new version of DeepSeek is rumored to beat or come extremely close to American AI models, particularly on agentic and coding tasks, while remaining free and open source. That model could be days away from release, and SWE-bench Verified could be a key metric for measuring its quality.

    OpenAI said it’s building privately authored evaluations that won’t be released before testing, pointing to its GDPVal project where domain experts write original tasks graded by trained human reviewers.

    The benchmark problem is not new, and is not unique to coding. AI labs have cycled through multiple evaluations, each useful until models were trained on them or until the tasks proved too narrow.

    But what makes this case notable is that OpenAI hyped SWE-bench Verified, promoted it across model releases, and is now publicly documenting how thoroughly it has failed, including by showing its own model cheating on it.

    © 2025 CryptocNews.com