Cryptocnews-Crypto News, Cryptocurrency News, Blockchain News, NFT News
    This One Weird Trick Defeats AI Safety Features in 99% of Cases

    By admin | 11/14/2025

    AI researchers from Anthropic, Stanford, and Oxford have discovered that making AI models think longer makes them easier to jailbreak—the opposite of what everyone assumed.

    The prevailing assumption was that extended reasoning would make AI models safer, because it gives them more time to detect and refuse harmful requests. Instead, researchers found it creates a reliable jailbreak method that bypasses safety filters entirely.

    Using this technique, an attacker could insert an instruction into the chain-of-thought process of an AI model and force it to generate instructions for creating weapons, writing malware code, or producing other prohibited content that would normally trigger immediate refusal. AI companies spend millions building these safety guardrails precisely to prevent such outputs.

    The study reveals that Chain-of-Thought Hijacking achieves 99% attack success rates on Gemini 2.5 Pro, 94% on GPT o4 mini, 100% on Grok 3 mini, and 94% on Claude 4 Sonnet. Those rates far exceed every prior jailbreak method tested on large reasoning models.

    The attack is simple and works like the “Whisper Down the Lane” game (or “Telephone”), with a malicious player somewhere near the end of the line. You simply pad a harmful request with long sequences of harmless puzzle-solving; researchers tested Sudoku grids, logic puzzles, and abstract math problems. Add a final-answer cue at the end, and the model’s safety guardrails collapse.

    “Prior works suggest this scaled reasoning may strengthen safety by improving refusal. Yet we find the opposite,” the researchers wrote. The same capability that makes these models smarter at problem-solving makes them blind to danger.

    Here’s what happens inside the model: When you ask an AI to solve a puzzle before answering a harmful question, its attention gets diluted across thousands of benign reasoning tokens. The harmful instruction—buried somewhere near the end—receives almost no attention. Safety checks that normally catch dangerous prompts weaken dramatically as the reasoning chain grows longer.
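    The dilution effect can be illustrated with a toy calculation. This is a deliberately simplified sketch, not the paper's actual measurement: it assumes all tokens compete for attention with equal logits, so the attention mass on a fixed harmful span simply shrinks as benign padding grows. The function name and numbers are illustrative.

```python
import math

def attention_fraction(harmful_logits, benign_logits):
    """Softmax attention mass landing on the harmful-instruction tokens."""
    weights = [math.exp(x) for x in harmful_logits + benign_logits]
    return sum(math.exp(x) for x in harmful_logits) / sum(weights)

# A fixed 20-token harmful instruction, padded with ever-longer benign
# puzzle-solving. Equal logits keep the toy model simple; real attention
# patterns are far more structured, but the dilution trend is the point.
harmful = [1.0] * 20
for n_benign in (0, 200, 2000, 20000):
    frac = attention_fraction(harmful, [1.0] * n_benign)
    print(f"{n_benign:>6} benign tokens -> {frac:.4f} of attention on harmful span")
```

Under these assumptions, the attention share on the harmful span falls roughly in proportion to the padding length, which mirrors the qualitative mechanism the researchers describe.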

    A milder form of this problem is already familiar to many people who work with AI: some jailbreak prompts are deliberately long, making the model burn through tokens before it processes the harmful instructions.

    The team ran controlled experiments on the S1 model to isolate the effect of reasoning length. With minimal reasoning, attack success rates hit 27%. At natural reasoning length, that jumped to 51%. Force the model into extended step-by-step thinking, and success rates soared to 80%.

    Every major commercial AI falls victim to this attack. OpenAI’s GPT, Anthropic’s Claude, Google’s Gemini, and xAI’s Grok—none are immune. The vulnerability exists in the architecture itself, not any specific implementation.

    AI models encode safety-checking strength in middle layers, around layer 25, while late layers encode the verification outcome. Long chains of benign reasoning suppress both signals, shifting attention away from harmful tokens.

    The researchers identified specific attention heads responsible for safety checks, concentrated in layers 15 through 35. They surgically removed 60 of these heads. Refusal behavior collapsed. Harmful instructions became impossible for the model to detect.
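    Head ablation of this kind is a standard interpretability technique: zero out a head's contribution before the per-head outputs are merged, then see what behavior disappears. The sketch below is a minimal, dependency-free illustration of that general idea; the function names, dimensions, and head indices are hypothetical, not the paper's code.

```python
def merge_heads(head_outputs, ablated=frozenset()):
    """Sum per-head output vectors, zeroing any head listed in `ablated`."""
    dim = len(head_outputs[0])
    merged = [0.0] * dim
    for idx, vec in enumerate(head_outputs):
        if idx in ablated:
            continue  # this head's contribution is surgically removed
        for d in range(dim):
            merged[d] += vec[d]
    return merged

heads = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]   # 3 heads, model dim 2
print(merge_heads(heads))                      # all heads active
print(merge_heads(heads, ablated={2}))         # hypothetical "safety head" removed
```

In a real transformer the same effect is achieved by hooking the attention module and zeroing the chosen head's slice of the output, but the logic is the one shown: the layer's result is the sum of head contributions, minus the ablated ones.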

    The “layers” in AI models are like steps in a recipe, where each step helps the computer better understand and process information. These layers work together, passing what they learn from one to the next, so the model can answer questions, make decisions, or spot problems. Some layers are especially good at recognizing safety issues—like blocking harmful requests—while others help the model think and reason. By stacking these layers, AI can become much smarter and more careful about what it says or does.

    This new jailbreak challenges the core assumption driving recent AI development. Over the past year, major AI companies shifted focus to scaling reasoning rather than raw parameter counts. Traditional scaling showed diminishing returns. Inference-time reasoning—making models think longer before answering—became the new frontier for performance gains.

    The assumption was that more thinking equals better safety: extended reasoning would give models more time to spot dangerous requests and refuse them. This research shows that assumption was wrong.

    A related attack called H-CoT, released in February by researchers from Duke University and Taiwan’s National Tsing Hua University, exploits the same vulnerability from a different angle. Instead of padding with puzzles, H-CoT manipulates the model’s own reasoning steps. OpenAI’s o1 model maintains a 99% refusal rate under normal conditions. Under H-CoT attack, that drops below 2%.

    The researchers propose a defense: reasoning-aware monitoring. It tracks how the safety signal changes across each reasoning step and penalizes any step that weakens it, forcing the model to keep attention on potentially harmful content regardless of reasoning length. Early tests show this approach can restore safety without destroying performance.
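    The core loop of such monitoring can be sketched in a few lines. This is a hypothetical scalar stand-in for whatever per-step probe the defense actually reads from the model's middle layers; the floor value and the signal trace are illustrative, not numbers from the paper.

```python
def monitor_reasoning(safety_signal_per_step, floor=0.5):
    """Flag reasoning steps where the safety signal decays below a floor.

    `safety_signal_per_step` stands in for a per-step safety probe read
    from the model's internal activations; flagged steps would be the
    ones penalized or given an attention boost toward harmful tokens.
    """
    flagged = []
    for step, signal in enumerate(safety_signal_per_step):
        if signal < floor:
            flagged.append(step)
    return flagged

# A safety signal eroding as a long benign reasoning chain grows.
signals = [0.9, 0.8, 0.6, 0.45, 0.3]
print(monitor_reasoning(signals))  # steps where intervention would trigger
```

The hard part, as the next paragraph notes, is not this bookkeeping but producing the per-step signal itself, which requires reading activations across dozens of layers in real time.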

    But implementation remains uncertain. The proposed defense requires deep integration into the model’s reasoning process, which is far from a simple patch or filter. It needs to monitor internal activations across dozens of layers in real-time, adjusting attention patterns dynamically. That’s computationally expensive and technically complex.

    The researchers disclosed the vulnerability to OpenAI, Anthropic, Google DeepMind, and xAI before publication. “All groups acknowledged receipt, and several are actively evaluating mitigations,” the researchers claimed in their ethics statement.

