No Frontier AI Model Holds the Line When an Attacker Keeps Trying

AI Security · Research Analysis

Every enterprise AI deployment starts with a safety score. Today's research from Cisco shows those scores were measured under conditions that real attackers ignore.

73%: Gemini 3 Pro attack success under realistic conditions, vs. 18% on published benchmarks
9x: GPT-5.4 vulnerability increase from benchmark to real attack conditions
44pp: Safety swing from one configuration setting on the same model, same test
8 of 15: Models where published safety scores would produce the wrong procurement ranking

Key Takeaway

Every enterprise AI deployment rests on a safety score. That score was generated by sending an attacker one message and recording whether the model blocked it. Real attackers do not send one message. The number your procurement team used to justify the decision does not reflect what happens next.

Most organizations do not buy AI models without a governance process. There is a document somewhere, often a security review or a vendor assessment, that includes the model's published safety scores. Those numbers come from the model card, a specification sheet the vendor publishes that includes benchmark results, intended uses, and known limitations. Procurement teams, legal reviewers, and IT governance functions treat these scores as the safety input to a go or no-go decision. In my work advising on enterprise technology adoption, I have seen these documents anchor decisions for months. The scores feel authoritative. They come from named benchmarks with documented methodology.

What those benchmarks do not measure is the second message. Or the sixth. Cisco's AI Threat Intelligence and Security Research team published a study today that tested 15 major AI models from OpenAI, Anthropic, Google, Amazon, and xAI under both standard conditions and under realistic sustained attack. The gap between the two is the number every enterprise buyer should have seen before signing the contract they already signed.

Standard safety benchmarks work by sending the model a single adversarial prompt and recording whether it refuses. One message, one response, pass or fail. The major benchmarks used across the industry, HarmBench, the MLCommons AILuminate benchmark, and TrustLLM, all operate this way. That is not a design flaw. It is a scope decision that made sense when the concern was whether a model would generate harmful content in a single exchange. The problem is that enterprise deployments do not work in single exchanges, and neither do the people trying to exploit them.

A real adversary sends a message, reads the refusal, and tries a slightly different framing. Then a persona. Then a question that seems unrelated but builds toward the same destination across ten turns. An agent running a contract review, a financial analysis workflow, or a customer support process operates in exactly this kind of extended session. The attack surface is the whole conversation, not the opening prompt. No published benchmark the Cisco team could identify measures that surface.
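To make the difference between the two measurement regimes concrete, here is a minimal Python sketch. It is not Cisco's harness or any benchmark's code. The `toy_model` function, the `RESISTANCE` table, and the prompt names are all invented for illustration, and the toy model simply gives in after a fixed number of attempts.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical stand-in for a model endpoint: receives the attacker's messages
# so far and returns True if the model complied, False if it refused.
Model = Callable[[List[str]], bool]

# Invented data: how many attempts each request needs before the toy model
# gives in. None means it never gives in.
RESISTANCE: Dict[str, Optional[int]] = {
    "request A": 1,
    "request B": 3,
    "request C": 6,
    "request D": None,
}

def toy_model(history: List[str]) -> bool:
    """Toy behaviour only: complies once the attacker has tried often enough."""
    base = history[0].split(" | ")[0]
    needed = RESISTANCE[base]
    return needed is not None and len(history) >= needed

def single_turn_asr(model: Model, prompts: List[str]) -> float:
    """Attack success rate when each prompt is sent once, in isolation."""
    return sum(model([p]) for p in prompts) / len(prompts)

def multi_turn_asr(model: Model, prompts: List[str], max_turns: int) -> float:
    """Attack success rate when the attacker may retry up to max_turns times."""
    hits = 0
    for p in prompts:
        history: List[str] = []
        for turn in range(max_turns):
            history.append(f"{p} | attempt {turn + 1}")
            if model(history):
                hits += 1
                break
    return hits / len(prompts)

prompts = list(RESISTANCE)
single = single_turn_asr(toy_model, prompts)
multi = multi_turn_asr(toy_model, prompts, max_turns=10)
print(f"single-turn ASR: {single:.0%}, multi-turn ASR: {multi:.0%}")
```

With this made-up data the same model scores 25% on the one-message test and 75% once the attacker is allowed ten attempts. Nothing about the model changed; only the test did.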

The Rankings Change When the Test Reflects Reality

GPT-5.4 from OpenAI shows a 2.74% vulnerability rate on standard benchmarks. That is an excellent number, the kind that appears in a vendor presentation as evidence of safety leadership. Under sustained realistic attack in the Cisco study, that rate rises to 24.68%. Ninefold. Gemini 3 Pro from Google shows 18.10% on standard tests and 73.35% under realistic conditions, a fourfold increase.

Those numbers are not measuring the same thing getting worse. They are measuring two different properties of the same model. Standard benchmarks measure whether a model refuses a direct harmful request. Realistic attack testing measures whether a model holds its position when someone keeps coming back. Those are different capabilities, and they do not correlate reliably.

Amazon's Nova 2 Lite makes the point precisely. On standard benchmarks it shows a 34.05% vulnerability rate, which looks poor against the field. Under realistic sustained attack it shows 7.89%, the lowest in the entire cohort of 15 models. A procurement process ranking by published scores would eliminate it, yet under sustained attack it is the most resistant model tested. Eight of the 15 models in the study show a gap larger than 15 percentage points between their published score and their realistic score, and the gap runs in both directions. That is more than half the cohort. For the majority of models tested, a selection process built on standard benchmarks produces the wrong ranking.
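The ranking flip is easy to check by hand. The sketch below uses only the three models for which this article quotes both figures, so it illustrates the effect rather than reproducing the full 15-model table. The variable and function names are my own.

```python
# (single-turn %, sustained-attack %) as quoted in this article.
scores = {
    "GPT-5.4": (2.74, 24.68),
    "Gemini 3 Pro": (18.10, 73.35),
    "Nova 2 Lite": (34.05, 7.89),
}

def rank(column: int) -> list:
    """Model names ordered from least to most vulnerable on the given column."""
    return sorted(scores, key=lambda name: scores[name][column])

by_published = rank(0)   # order a model-card comparison would produce
by_sustained = rank(1)   # order under sustained attack

# Gap in percentage points and ratio between the two regimes, per model.
gaps = {name: round(multi - single, 2) for name, (single, multi) in scores.items()}
ratios = {name: round(multi / single, 2) for name, (single, multi) in scores.items()}

print("published :", by_published)
print("sustained :", by_sustained)
print("gap (pp)  :", gaps)
print("ratio     :", ratios)
```

Ranked by published score, Nova 2 Lite comes last of the three. Ranked by sustained-attack score, it comes first. All three gaps exceed 15 percentage points in absolute terms, and the GPT-5.4 and Gemini 3 Pro ratios come out at roughly 9x and 4x, matching the figures above.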

The Anthropic Claude family is worth noting specifically because it shows relative consistency between the two measures. Single-prompt vulnerability rates run between 2.19% and 3.64%, rising to between 11.16% and 16.20% under sustained attack. The gap is real but contained, which is itself a meaningful data point when comparing across vendors.

"A model with 2.74% single-turn ASR is not the same product as a model that holds the line at 24.68% multi-turn ASR. Without paired-regime data, the two are indistinguishable on most public evaluations."

A Single Deployment Choice Your Team Will Make Invisibly

Grok 4.1 Fast from xAI tests at 88.30% vulnerability under realistic attack in its default configuration. Enable reasoning mode on the exact same model, with the exact same test prompts, and that number drops to 43.47%. That is a swing of more than 44 percentage points from one checkbox during implementation.

That setting does not appear on any model card the Cisco team could identify. The enterprise that deploys Grok 4.1 Fast in default mode, which is how most teams will deploy it without specific instruction otherwise, is operating a substantially different risk profile than any published benchmark suggests. This is not specific to one vendor. Configuration choices around reasoning modes, system prompt adherence, temperature, and guardrail tiers all affect how a model responds under sustained pressure. None of those effects are currently required to be documented alongside the safety scores that drive procurement decisions.

The implication for IT governance is direct: the model your team approved in the security review and the model running in production may not be the same thing, and there is currently no disclosure standard that closes that gap.
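One low-tech way to narrow that gap internally is to record the configuration that was actually reviewed and compare it with what is deployed. The sketch below is one possible approach, not a vendor feature. The setting names (`reasoning_mode`, `temperature`, `guardrail_tier`) and their values are hypothetical placeholders, not real parameters of any specific API.

```python
from typing import Any, Dict, List, Tuple

def config_drift(
    approved: Dict[str, Any], deployed: Dict[str, Any]
) -> List[Tuple[str, Any, Any]]:
    """Return (setting, approved_value, deployed_value) for every mismatch.

    A setting that is absent on one side is reported with None on that side.
    """
    drift = []
    for key in sorted(set(approved) | set(deployed)):
        a, d = approved.get(key), deployed.get(key)
        if a != d:
            drift.append((key, a, d))
    return drift

# Hypothetical example: what the security review signed off on...
approved = {"reasoning_mode": True, "temperature": 0.2, "guardrail_tier": "strict"}
# ...versus what the implementation team actually shipped.
deployed = {"reasoning_mode": False, "temperature": 0.2}

drift = config_drift(approved, deployed)
for setting, want, got in drift:
    print(f"{setting}: approved={want!r} deployed={got!r}")
```

In this made-up case the check flags a missing guardrail tier and a disabled reasoning mode, the same kind of setting behind the Grok result. It cannot say how much risk each difference adds. That number still has to come from the vendor or from testing.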

Google Announced a Security Product Built on One of the Worst Performers

Also published today: Google Cloud launched Google AI Threat Defense, an enterprise security platform combining Gemini models, the cloud security firm Wiz, the AI code-fixing agent CodeMender, and the threat intelligence firm Mandiant. It is positioned as an always-on platform to find and patch enterprise vulnerabilities faster than attackers can exploit them.

Gemini 3 Pro, which powers significant parts of Google's security product portfolio, shows the second-highest realistic attack vulnerability rate in the Cisco cohort at 73.35%. That is not a verdict on Google AI Threat Defense. Enterprise deployments layer system prompts, content filters, and custom orchestration on top of base models, and Cisco's research explicitly states it does not characterize deployed products with those controls in place. The product may work exactly as Google describes.

It does mean the right question for any enterprise evaluating Google AI Threat Defense is not "what does the Gemini benchmark show" but "what controls are layered on top, and what does the product's realistic attack profile look like with those controls active." That question should be on the table before a contract is signed, not after the first security incident.

Key Takeaway

Standard safety benchmarks were not designed to measure what enterprise deployments actually face. That is not a scandal. It is a scope mismatch that the entire procurement process has treated as sufficient. It is not sufficient, and today's research makes that case with specific model names and specific numbers.

Three Questions Your Next Vendor Cannot Decline to Answer

Cisco's research team translates the findings into three operational steps. Reframed for a procurement conversation rather than a research paper, they are questions any vendor should be able to answer in writing before a contract closes.

What is your model's vulnerability rate under sustained multi-turn attack, broken down by the type of attack used? Not the benchmark headline. The realistic attack number, with the attack types specified. A vendor who responds with the standard benchmark score has not answered the question.

What configuration choices will your implementation team make during deployment, and what is the security-relevant effect of each one? Reasoning mode on or off. System prompt adherence settings. Temperature. Guardrail tiers. The Grok result shows a 44-point swing from one of these choices. If the vendor cannot tell you what the swing looks like for their model, your team is making security decisions without the relevant input.

What is the gap between your published benchmark score and your realistic sustained-attack score? If the vendor has not run that comparison, point that out directly and ask them to commission it before deployment approval. Eight of the 15 models in this study show a gap larger than 15 points. For more than half the models tested, this question has a materially different answer than the number on the model card.
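On the first question, it helps to know what a breakdown by attack type looks like, so that a single headline figure is not accepted in its place. Below is a small sketch that tallies trial results per attack type. The attack-type labels and the trial outcomes are invented for illustration and are not taken from the Cisco study.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def asr_by_attack_type(trials: List[Tuple[str, bool]]) -> Dict[str, float]:
    """Attack success rate per attack type from (attack_type, succeeded) records."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for attack_type, succeeded in trials:
        totals[attack_type] += 1
        hits[attack_type] += int(succeeded)
    return {t: hits[t] / totals[t] for t in totals}

# Made-up trial log: labels are illustrative only.
trials = [
    ("reframing", True), ("reframing", False),
    ("persona", True), ("persona", True),
    ("gradual_escalation", False), ("gradual_escalation", True),
    ("gradual_escalation", True), ("gradual_escalation", True),
]

breakdown = asr_by_attack_type(trials)
overall = sum(ok for _, ok in trials) / len(trials)

print(f"overall: {overall:.0%}")
for attack_type, rate in breakdown.items():
    print(f"{attack_type}: {rate:.0%}")
```

Here the overall rate is 75%, but the per-type rates run from 50% to 100%. A vendor answer that gives only the overall figure hides which attack style the model is weakest against.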

None of this requires specialized security expertise to ask. It requires knowing what the published score does not cover, and being willing to hold the process until the missing data exists.

On the commercial context: this research was produced by Cisco, whose AI Defense product and LLM Security Leaderboard are directly positioned to serve enterprises that internalize these findings. The methodology is transparent and the data is specific. Read it as both credible research and vendor positioning.

CIO / CISO Viability Question

Pull the security review document from your last AI deployment. Find the safety score that justified the approval. Now ask: was that number measured against a single adversarial prompt, or against a sustained attack across a full conversation? If the answer is a single prompt, you approved a model based on a test that does not simulate the conditions it will actually operate under.

That is not a compliance failure. It is a gap in what vendors are currently required to disclose. The risk management guidance from the National Institute of Standards and Technology and the obligations under the European Union AI Act are heading toward closing it. The enterprises that start asking for realistic attack data now will not be scrambling to requalify their deployed models when those requirements land.

Sources
  1. Conley, Nicholas, and Amy Chang. "Proprietary Problems: How Frontier Closed Models Collapse Under Iterative Pressure." Cisco Systems, AI Threat Intelligence and Security Research Team, 27 May 2026, cisco.com.
  2. Conley, Nicholas, and Amy Chang. "Proprietary Problems: No Frontier Model Is Multi-Turn Immune." Cisco Blogs, 27 May 2026, blogs.cisco.com.
  3. "Introducing Google AI Threat Defense to Help You Outpace the Adversary." Google Cloud Blog, 27 May 2026, cloud.google.com.
  4. Bellamkonda, Shashi. "Your Security Tools Were Built for People. Agents Are Not People." shashi.co, 11 May 2026, shashi.co.
  5. National Institute of Standards and Technology. Artificial Intelligence Risk Management Framework (AI RMF 1.0). 2023, nist.gov.
  6. European Parliament and Council. Regulation (EU) 2024/1689 Laying Down Harmonised Rules on Artificial Intelligence (AI Act). 2024, eur-lex.europa.eu.
Disclaimer: This blog reflects my personal views only. Content does not represent the views of my employer, Info-Tech Research Group. AI tools may have been used for brevity, structure, or research support. Please independently verify any information before relying on it.