Cryptocnews-Crypto News, Cryptocurrency News, Blockchain News, NFT News

    AI Still Can’t Beat the On-Call Engineer: Here’s Why

    By admin | 05/18/2026 | 3 Mins Read

    In brief

    • ARFBench is the first AI benchmark built entirely from real production incidents.
    • GPT-5 leads all existing AI models at 62.7% accuracy but falls short of domain experts at 72.7%.
    • A theoretical model-expert oracle—combining AI and human judgment—hits 87.2% accuracy, setting the ceiling for what collaborative AI-human teams could achieve.

    AI companies keep pitching autonomous site reliability engineer agents—AI that investigates production incidents in place of humans. Datadog ran the actual benchmark on real outages, and the best AI models can’t yet beat the engineers they’re supposed to replace.

    The benchmark is ARFBench (Anomaly Reasoning Framework Benchmark), a joint project from Datadog and Carnegie Mellon. It is built from 63 real production incidents, extracted from engineers’ own Slack threads during live emergencies, and comprises 750 multiple-choice questions covering 142 monitoring metrics and 5.38 million data points, with every question verified by hand. No synthetic data. No textbook scenarios.

    “Trillions of dollars are lost each year due to system outages,” the researchers write. The benchmark tests whether AI can actually help change that.

    “Despite the central role of such question-driven analysis in incident response, it remains unclear whether modern foundation models can reliably answer the kinds of time series questions engineers ask in practice,” the paper reads.

    Questions come in three tiers. Tier I: Does an anomaly exist in this chart? Tier II: When did it start, how severe is it, what type?

    Tier III, the hardest, requires cross-metric reasoning: Is this chart causing the problem in that other chart? That’s where AI falls apart. On Tier III questions, GPT-5 scores just 47.5% in F1, a metric that penalizes models for gaming answers by picking the most common class.
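
    To see why F1 exposes that kind of gaming, here is a minimal sketch on made-up labels. It assumes a macro-averaged F1 (the article does not say which averaging ARFBench uses), and the labels and helper names are purely illustrative, not taken from the benchmark.

    ```python
    from collections import Counter


    def accuracy(y_true, y_pred):
        """Fraction of predictions that match the true label."""
        return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


    def macro_f1(y_true, y_pred):
        """Unweighted mean of per-class F1 scores (assumed averaging scheme)."""
        labels = sorted(set(y_true) | set(y_pred))
        scores = []
        for label in labels:
            tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
            fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
            fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
            denom = 2 * tp + fp + fn
            # A class that is never predicted correctly contributes 0.
            scores.append(2 * tp / denom if denom else 0.0)
        return sum(scores) / len(scores)


    # Hypothetical, imbalanced answer key: 8 "no" answers and 2 "yes" answers.
    y_true = ["no"] * 8 + ["yes"] * 2

    # A "model" that games the test by always picking the most common class.
    majority = Counter(y_true).most_common(1)[0][0]
    y_pred = [majority] * len(y_true)

    acc = accuracy(y_true, y_pred)
    f1 = macro_f1(y_true, y_pred)
    print(acc)           # 0.8
    print(round(f1, 3))  # 0.444
    ```

    On this toy data the majority-class guesser looks respectable on accuracy but scores far lower on macro F1, because the class it never predicts contributes a zero.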

    How every model stacked up

    GPT-5 led all existing models at 62.7% accuracy—on a test where random guessing gets 24.5%. Gemini 3 Pro scored 58.1%. Claude Opus 4.6: 54.8%. Claude Sonnet 4.5: 47.2%.

    Domain experts scored 72.7% accuracy. Non-domain experts—time series researchers at Datadog without extensive observability experience—still hit 69.7%.

    No AI model beat either human baseline.

    [Image: ARFBench leaderboard table. Image built by Decrypt based on the ARFBench leaderboard CSV.]

    The model that actually topped the full leaderboard was Datadog’s own hybrid: Toto—their internal time series forecasting model—combined with Qwen3-VL 32B. Toto-1.0-QA-Experimental scored 63.9% accuracy, edging past GPT-5 while using a fraction of its parameters. On anomaly identification specifically, it outperformed every other model by at least 8.8 percentage points in F1.

    A purpose-built domain model, trained on observability data, outperforming a frontier general-purpose system at this specific task is the expected outcome. That’s the point.

    The most valuable finding isn’t which model scored highest.

    “We observe substantially different error profiles between leading models and human experts, suggesting that their strengths are complementary,” the researchers write. Models hallucinate, miss metadata, and lose domain context. Humans misread precise timestamps and occasionally fail on complex instructions. The mistakes barely overlap.

    Model a theoretical “Model-Expert Oracle”—a perfect judge that always picks the right answer between the AI and the human—and you get 87.2% accuracy and 82.8% F1. Way above either alone.
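
    The oracle idea can be sketched in a few lines, assuming the score is simply the share of questions that at least one of the two answered correctly. The per-question results below are invented for illustration and are not ARFBench data; `oracle_accuracy` is a hypothetical helper name.

    ```python
    def oracle_accuracy(model_correct, expert_correct):
        """Accuracy of a perfect judge that picks whichever answer is right.

        Each argument is a list of booleans, one per question, saying whether
        the model (or the expert) answered that question correctly.
        """
        if len(model_correct) != len(expert_correct):
            raise ValueError("both lists must cover the same questions")
        hits = sum(m or e for m, e in zip(model_correct, expert_correct))
        return hits / len(model_correct)


    # Hypothetical results on 10 questions (not real benchmark data).
    model_correct = [True, True, True, True, True, True, False, False, False, False]
    expert_correct = [True, True, True, True, False, False, True, True, True, False]

    print(sum(model_correct) / len(model_correct))         # 0.6
    print(sum(expert_correct) / len(expert_correct))       # 0.7
    print(oracle_accuracy(model_correct, expert_correct))  # 0.9
    ```

    The less the two sets of mistakes overlap, the higher the oracle climbs above either score alone. If the model and the expert missed exactly the same questions, the oracle would gain nothing.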

    That’s not a product. It’s a documented target—built from real emergencies, not curated datasets—that quantifies exactly how much better human-AI collaboration could perform. The leaderboard is live on Hugging Face. GPT-5 sits at 62.7%. The ceiling is 87.2%.

    © 2025-2026 CryptocNews.com