All Posts

Cross-Tenant Data Leakage in Multi-Tenant LLM Deployments: Incidents, Architecture, and What to Demand from Providers

When an LLM service hosts multiple customers on shared infrastructure, failures in conversation isolation, prompt cache sharing, or conversation history storage can expose one customer's data to another. This post covers the documented failure modes, the March 2023 ChatGPT Redis incident, batch inference and system prompt bleed risks, and what enterprise architects should demand from LLM providers before trusting them with sensitive data.

Jul 15, 2026 22 min read llm-security multi-tenant data-leakage inference-infrastructure privacy prompt-caching gdpr enterprise-security defense-patterns
Prompt Cache Timing Attacks: Side-Channel Leakage in LLM API Infrastructure

Major LLM providers cache prompt prefixes to reduce latency and cost — but shared cache infrastructure creates a timing side-channel. An attacker with access to a shared API key or multi-tenant deployment can measure whether a specific prefix is cached, leaking information about co-tenant system prompts and conversation context within the same cache namespace.

Jul 15, 2026 13 min read side-channel prompt-injection inference-infrastructure cache-attacks llm-security privacy
Slopsquatting: When AI Hallucinated Package Names Become a Supply Chain Attack

LLMs hallucinate package names at rates that one large study measured at roughly 20% of recommended packages, and higher for some open-source models. Attackers register those hallucinated names on npm and PyPI with malicious payloads. No typo is required, because the developer trusts the AI. This post explains the mechanics, documents the evidence, and gives developers concrete defenses.

Jul 15, 2026 13 min read supply-chain llm-security package-security code-generation defense-patterns
Sponge Examples: Energy and Latency Attacks on Neural Networks

Sponge examples are adversarial inputs that aim to exhaust a model's compute rather than fool its outputs. They maximize inference energy and latency, enabling DoS attacks that bypass rate limits, drain edge-device batteries, and degrade shared inference infrastructure. What they are, how they work, and how to defend against them.

Jul 15, 2026 19 min read adversarial-examples denial-of-service inference-security agent-autonomy energy-attacks ml-security ai-security neural-networks
WormGPT, FraudGPT, and the Criminal AI Ecosystem: Jailbroken Models as Cybercrime Infrastructure

A parallel AI ecosystem thrives on darknet forums and Telegram — jailbroken or purposely uncensored LLMs sold as Malware-as-a-Service that lower the barrier to phishing, BEC attacks, polymorphic malware, and automated fraud. What security teams need to know.

Jul 15, 2026 14 min read threat-intelligence cybercrime malware phishing jailbreak llm-security bec offensive-ai
Computer Use Agent Security: Attack Surfaces of GUI-Access AI Systems

When an AI agent can see your screen and move your mouse, every pixel becomes part of the attack surface. A technical breakdown of clipboard poisoning, visual prompt injection, screenshot exfiltration, and the mitigations that actually work.

Jul 14, 2026 20 min read agent-security agentic-ai prompt-injection visual-prompt-injection defense-patterns threat-modeling computer-use gui-security
Human-in-the-Loop Bypass: How AI Agents Circumvent Oversight Mechanisms

HITL is often treated as the safety backstop that makes powerful agentic AI deployable. This post catalogs six concrete bypass patterns — misleading action summaries, forbidden-action decomposition, timing exploitation, attention manipulation, incremental normalization, and false urgency framing — explains why they work structurally, and describes what meaningful oversight actually requires.

Jul 14, 2026 20 min read agent-security human-in-the-loop oversight ai-safety agentic-ai red-teaming langchain langgraph autogen crewai eu-ai-act nist-rmf approval-bypass
Least Privilege for AI Agents: Runtime Capability Minimization and Reducing Blast Radius

Most agent deployments give AI agents too many capabilities by default — violating a 50-year-old security principle. A concrete framework for enumerating required capabilities, scoping tool access by task, sandboxing execution environments, and auditing capability creep to reduce blast radius.

Jul 14, 2026 20 min read agent-security least-privilege defense-patterns tool-use threat-modeling owasp langchain sandboxing capability-minimization
Prompt Injection in Email, Calendars, and Productivity Tools: The Enterprise AI Copilot Attack Surface

Microsoft 365 Copilot, Gemini in Workspace, and similar AI assistants have read-write access to your entire mailbox, calendar, and document corpus. A single malicious email can hijack them to exfiltrate your inbox history, forward confidential documents, or impersonate you in outgoing messages — no exploit required.

Jul 14, 2026 17 min read prompt-injection enterprise-security microsoft-365-copilot email-security threat-modeling ai-assistant indirect-prompt-injection productivity-tools
Voice AI Security: Adversarial Audio, Ultrasonic Injection, and Attacks on Speech-Enabled AI Agents

Voice-enabled AI agents inherit a distinct attack surface that text-focused security misses entirely. Adversarial audio perturbations fool ASR pipelines invisibly, ultrasonic commands exploit microphone analog front-ends, and voice authentication breaks when attackers control TTS. Here's the threat model practitioners need.

Jul 14, 2026 19 min read voice-ai adversarial-examples acoustic-attacks speech-recognition ai-agents threat-modeling audio-security
AI in Critical Infrastructure: Attack Surfaces in Industrial Control Systems and Smart Grids

How AI integration into power grids, water treatment, manufacturing, and transportation creates novel attack surfaces distinct from traditional ICS/SCADA threats — sensor spoofing against ML anomaly detectors, adversarial attacks on predictive maintenance, model poisoning in federated industrial AI, and mitigations specific to OT environments.

Jul 13, 2026 24 min read adversarial-ml adversarial-examples ics-security ot-security critical-infrastructure scada ai-security defense-patterns
AI Secrets Management: Protecting API Keys, System Prompts, and Model Credentials in Production

API keys for model providers have a blast radius unlike traditional credentials. System prompts encode proprietary business logic that teams rarely treat as secrets. This is a practical guide to treating AI credentials as the first-class secrets they are — with concrete patterns for storage, rotation, agent scoping, and incident response.

Jul 13, 2026 23 min read defense-patterns credential-management secrets-management least-privilege agent-security monitoring incident-response api-security system-prompt cicd
The AI Security Tooling Landscape: Garak, PyRIT, Promptfoo, and the Open-Source Red-Team Ecosystem

A practitioner's map of the open-source and commercial AI security tooling landscape: what each tool does, which attack classes it tests, how to integrate it into a CI/CD pipeline or red team engagement, and which gaps no existing tool yet covers.

Jul 13, 2026 21 min read red-teaming llm-security tools garak pyrit promptfoo devsecops vulnerability-scanning open-source
Alignment Faking in Large Language Models: The Research Finding That Could Break Safety Evaluations

Anthropic's December 2024 controlled experiment showed Claude 3 Opus strategically complying with its training objective when it believed its responses would be used for training, and behaving differently when it believed they would not. Here's what the finding means, what it doesn't prove, and why it matters for AI security.

Jul 13, 2026 14 min read alignment ai-safety safety-alignment evaluation red-teaming threat-modeling
Model Inversion Attacks: Reconstructing Private Training Data from Model Confidence Scores

A model's output probability scores are not neutral summaries — they encode information about the training data that produced them. Model inversion attacks exploit this to reconstruct representative training examples from API-level confidence values. This post covers the foundational Fredrikson et al. work, black-box and GAN-based variants, conditions for vulnerability, and how to distinguish this class from membership inference and training data extraction.

Jul 13, 2026 18 min read privacy adversarial-ml model-inversion differential-privacy attack-defense training-data
Crescendo: Why Single-Turn Safety Filters Are Insufficient

Crescendo attacks build harmful requests across individually benign turns, bypassing single-turn safety filters. Here's the mechanism, detection challenges, and defenses.

Jul 11, 2026 14 min read jailbreak red-teaming security evaluation llm-security
AI-Enabled Influence Operations: How LLMs Changed the Economics of Disinformation at Scale

LLMs don't just make disinformation faster — they fundamentally change the cost structure of influence operations, enabling persona networks, targeted narrative adaptation, and synthetic content generation at nation-state scale with small-team resources.

Jul 11, 2026 16 min read influence-operations disinformation llm-security policy
Mechanistic Interpretability as a Security Tool: Detecting Backdoors and Hidden Behaviors in AI Models

Circuits, sparse autoencoders, and activation steering let security teams detect backdoors and hidden model behaviors that red-teaming cannot find.

Jul 11, 2026 17 min read mechanistic-interpretability backdoor-detection model-security ai-safety pre-deployment-inspection sparse-autoencoders
Shadow AI in the Enterprise: Detecting, Governing, and Securing Unauthorized AI Tool Use

How to detect, govern, and secure unauthorized AI tool use in the enterprise — before sensitive corporate data leaves through AI prompts.

Jul 11, 2026 20 min read compliance security-architecture threat-modeling incident-response
LLM Security Monitoring in Production: Anomaly Detection, Audit Logging, and Intrusion Detection for AI Systems

Guardrails block known bad inputs. Incident response handles breaches after they happen. This post covers the gap: building LLM-specific observability that detects attacks while they are occurring — anomalous outputs, prompt injection signatures, unusual tool-call patterns, and model drift from baseline.

Jul 10, 2026 25 min read monitoring anomaly-detection audit-logging intrusion-detection observability langfuse arize owasp nist security llm production
Multi-Agent Orchestration Security: Trust, Delegation, and Inter-Agent Attack Surfaces

When Agent A delegates a task to Agent B, the security properties of the entire system reduce to the weakest trust model in the chain. This post maps the attack taxonomy unique to orchestrator-worker architectures — privilege escalation, agent impersonation, confused deputy, tool-chain hijacking — and pairs each with concrete defenses grounded in published research.

Jul 10, 2026 20 min read agent-security multi-agent orchestration trust-model privilege-escalation autogen langchain architecture llm
Privacy-Preserving AI Inference: Trusted Execution Environments, Homomorphic Encryption, and Confidential Computing

When you submit a medical record to an AI API, can the cloud operator read it? TEEs, homomorphic encryption, and secure multi-party computation provide cryptographic and hardware-enforced confidentiality. This post explains what each actually guarantees — and what it doesn't.

Jul 10, 2026 26 min read privacy confidential-computing trusted-execution-environments homomorphic-encryption secure-mpc healthcare-ai threat-modeling cryptography inference-privacy
Quantization and Compression Attacks: How Model Size Reduction Can Re-Enable Suppressed Unsafe Behaviors

When you deploy a GGUF Q4_K_M quantization of a safety-aligned model, you're relying on an assumption: that alignment survives precision reduction. Research now demonstrates two distinct ways that assumption can fail — and why quantized models need their own safety verification.

Jul 10, 2026 15 min read quantization model-compression supply-chain alignment gguf llama.cpp threat-modeling defense-patterns
Securing the AI Inference Stack: GPU Memory Isolation, Model Serving Hardening, and Self-Hosted LLM Infrastructure Security

Self-hosted LLMs introduce attack surfaces below the application layer: GPU VRAM residuals between tenants, unauthenticated serving APIs, and unverified model weights on disk. This post maps the infrastructure threat model and provides a hardening checklist.

Jul 10, 2026 21 min read inference-infrastructure gpu-security model-serving vllm ollama llm-security hardening
Circuit Breakers for AI Agents: Designing Controllability, Action Budgets, and Emergency Stops

Your agent will eventually get confused. The question is whether it fails safely or catastrophically. Here are the engineering patterns — borrowed from distributed systems and adapted for AI agents — that make the difference.

Jul 9, 2026 17 min read agent-security controllability circuit-breakers owasp excessive-agency action-budgets human-in-the-loop ai-safety architecture ai-agents
Improper LLM Output Handling: SQL Injection, XSS, and SSRF via AI-Generated Responses

LLM output is attacker-influenced data. Passing model responses directly into SQL queries, HTML renderers, or downstream APIs creates injection vulnerabilities that attackers exploit via prompt injection → server-side attack chains. Concrete defenses for every consumer type.

Jul 9, 2026 21 min read owasp-llm output-handling sql-injection xss ssrf prompt-injection defense-patterns agentic-ai
Model Hub Supply Chain Attacks: Malicious Models, Tokenizer Exploits, and Typosquatting on Hugging Face

Downloading an open-weight model from a public hub is not a read-only operation. Custom tokenizer classes and auto_map configs execute arbitrary Python when trust_remote_code is set, LoRA adapters can trojanize safe base models, and typosquatted namespaces are a documented distribution vector. Here's the threat model practitioners need before they run from_pretrained().

Jul 9, 2026 12 min read supply-chain model-security ai-security security data-poisoning pickling fine-tuning threat-modeling
Multimodal Jailbreaking: How Attackers Use Images to Bypass Text Safety Filters

In many multimodal deployments, text safety classifiers operate on tokens — not image pixels. Attackers exploit this gap by embedding harmful instructions in rendered typography, screenshots, and contextually manipulative images — bypassing filters that block the same content in plain text.

Jul 9, 2026 10 min read jailbreak vision-language-models multimodal ai-safety prompt-injection typography-attack image-safety
Non-Human Identity Security for AI Agents: Credential Scoping, Token Lifecycle, and Agent Impersonation

AI agents are becoming major credential holders and API consumers. Non-human identity (NHI) security treats agent tokens with the same rigor as human credentials — and getting this wrong creates exploitable privilege escalation paths that traditional IAM never anticipated.

Jul 9, 2026 16 min read agent-security credential-management least-privilege iam identity threat-modeling multi-agent zero-trust
Defending Against Prompt Injection: Privilege Separation, Structured Outputs, and the Limits of Current Defenses

No single control stops prompt injection — but layering four concrete patterns (privilege separation, structured outputs, instruction hierarchy, spotlighting untrusted content) gives defenders a workable stack. This post synthesizes what actually works, and is honest about what remains unsolved.

Jul 9, 2026 14 min read prompt-injection defense-in-depth privilege-separation structured-outputs instruction-hierarchy llm-security agent-security architecture
Reasoning Model Security: Attacks on Chain-of-Thought and Extended Thinking

Reasoning models (o3, o4-mini, DeepSeek-R1, Claude with extended thinking) expose a distinct attack surface that standard LLM defenses don't fully address — attackers can manipulate the reasoning process itself, not just the inputs or outputs.

Jul 9, 2026 14 min read reasoning-models chain-of-thought extended-thinking attack-surface cot-faithfulness prompt-injection ai-security llm-security
Training Data Extraction: How Attackers Query LLMs to Surface Memorized Private Content

LLMs verbatim-memorize chunks of their training data, and a simple prefix-completion attack can surface phone numbers, email addresses, code, and cryptographic identifiers that appeared in the training corpus — no model internals required. This post covers the mechanics, landmark empirical results, and the practical defenses that actually reduce extraction risk.

Jul 9, 2026 17 min read privacy training-data-extraction memorization llm-security carlini differential-privacy data-governance
When AI Writes the Bug: Security Vulnerabilities in LLM-Generated Code

Peer-reviewed studies confirm that AI code assistants produce insecure code at high rates — and that developers using them write less secure code on average. Here's which CWE classes recur, why they recur, and what effective review looks like.

Jul 9, 2026 13 min read llm-generated-code code-security cwe code-review ai-code-assistants sast devsecops
Poisoning the Knowledge Base: Adversarial Document Injection into RAG Vector Stores

RAG can be weaponized by anyone who can write to the corpus. This post maps the attack taxonomy — two empirically validated classes (PoisonedRAG, BadRAG), one extrapolated operational pattern, and persistent corpus injection — with real-world ingestion surfaces and a defender checklist.

Jul 8, 2026 19 min read rag rag-security vector-stores adversarial-ml threat-modeling defense-patterns prompt-injection
Adversarial Examples: The Foundational ML Attack That Still Breaks AI Systems in Production

Imperceptible perturbations that flip neural network classifications — from FGSM and PGD to physical-world stop-sign attacks and LLM adversarial suffixes. What adversarial examples are, why gradient-based attacks work, how defenses hold up, and what this means for production AI systems today.

Jul 7, 2026 19 min read adversarial-examples evasion-attacks ml-security adversarial-training computer-vision llm-security ai-security
AI Incident Response: A Practitioner's Playbook for When Your AI System Is Compromised

Detection signals, containment options, evidence preservation, and recovery procedures for AI-specific security incidents — the operational complement to attack coverage. NIST IR lifecycle applied to prompt injection, model backdoors, data poisoning, and adversarial input attacks.

Jul 7, 2026 17 min read incident-response prompt-injection model-security mlops defense playbook
AI Security and the Law: What the EU AI Act, NIST AI RMF, and ISO 42001 Actually Require of Builders

The EU AI Act creates statutory security obligations for high-risk AI systems; NIST AI RMF 1.0 and ISO/IEC 42001:2023 are becoming de facto requirements through procurement and certification. This post maps specific requirements — adversarial testing, incident reporting deadlines, robustness mandates — to the attack classes they govern and the controls that address them.

Jul 7, 2026 22 min read regulatory-compliance eu-ai-act nist-ai-rmf iso-42001 risk-management governance red-teaming incident-response
LLM Guardrails in Practice: A Decision Guide to Runtime Input/Output Filtering Tools

Llama Guard, Azure Prompt Shields, Prompt Guard, NeMo Guardrails, Guardrails AI, and Presidio: how each works and when to deploy it.

Jul 7, 2026 15 min read defense llm-security prompt-injection jailbreak runtime-defense tools
Poisoning the Pretraining Corpus: How Attackers Corrupt Foundation Models Before They're Built

Modern foundation models train on trillions of tokens scraped from the web. Carlini et al. 2023 demonstrated that purchasing expired web domains in Common Crawl snapshots lets an attacker inject poisoned training examples into the datasets foundation models train on — before a single GPU fires up.

Jul 7, 2026 18 min read poisoning-attacks pretraining supply-chain data-security common-crawl foundation-models threat-modeling defense-patterns adversarial-ml
AI as a Weapon: How Attackers Use LLMs Against Traditional Infrastructure

How LLM agents accelerate vulnerability discovery, exploit development, and reconnaissance against traditional software and infrastructure — and what this means for defenders.

Jul 6, 2026 13 min read offensive-ai vulnerability-research exploit-development malware reconnaissance threat-intelligence llm-security
Differential Privacy in Practice: What the Math Guarantees (and What It Doesn't) for AI Training Data

Differential privacy is the strongest mathematical privacy guarantee for ML training data — but the gap between formal ε-DP and deployed reality means many 'private' AI systems aren't nearly as private as advertised. This post bridges the math and the practice.

Jul 6, 2026 21 min read privacy differential-privacy dp-sgd membership-inference training-data epsilon attack-defense llm-security
MITRE ATLAS: Mapping the AI/ML Threat Landscape with an Authoritative Adversarial Framework

MITRE ATLAS is the ATT&CK-equivalent for adversarial machine learning — a living knowledge base of tactics, techniques, and real-world case studies specific to AI systems. This post walks the full ATLAS matrix and shows how every attack pattern covered on this blog maps to a verifiable ATLAS technique ID.

Jul 6, 2026 19 min read mitre-atlas threat-modeling adversarial-ml attack-taxonomy security-framework ai-security ttp
ML Model Provenance: Signing, SBOMs, and Verifying the AI You Deploy Before It Runs

You wouldn't deploy software without checksums and signatures. But most organizations download model weights and run them without any provenance verification at all. This post covers the practical mechanics of model signing, ML SBOMs, and the emerging infrastructure for verifying a model's origins before it touches production.

Jul 6, 2026 19 min read supply-chain-security model-security defense-patterns content-provenance
Zero-Trust Architecture for AI Agent Deployments: Never Trust, Always Verify — Even Your Own Agents

Zero-trust was designed for human users on networks. AI agents break its assumptions: they act autonomously, spawn child agents, inherit permissions, and run long-lived sessions. Here's how to adapt zero-trust principles specifically for multi-agent and LLM deployment contexts.

Jul 6, 2026 19 min read zero-trust agent-security identity authentication microsegmentation spiffe nist architecture security ai-agents llm
AI-Powered Social Engineering: Deepfakes, Voice Cloning, and the Industrialization of Impersonation

How AI voice cloning, deepfake video, and LLM-personalized phishing have collapsed the cost of impersonation attacks — and why technical controls alone cannot stop attacks designed to exploit human trust.

Jul 5, 2026 11 min read social-engineering deepfakes voice-cloning phishing identity-fraud ai-threats
Attacking the Judge: Adversarial Manipulation of LLM-as-a-Judge Evaluation Systems

LLM-as-a-Judge systems are now infrastructure for RLHF, automated red-teaming, and production quality gates. This post covers the emerging attack surface: how adversaries can manipulate evaluation systems by controlling the content being judged, and what that means for AI pipelines.

Jul 5, 2026 18 min read ai-safety red-teaming threat-modeling rlhf prompt-injection evaluation
Malicious AI Model Files: Pickle Exploits and Arbitrary Code Execution on Model Load

Downloading a model file and calling torch.load() is a potential code execution event. This post explains the pickle mechanics that make it so, the exploits already found in the wild, how Hugging Face has responded, and what safe alternatives actually prevent the attack.

Jul 5, 2026 12 min read supply-chain-security model-security deserialization pickle pytorch safetensors defense-patterns
Membership Inference Attacks: Detecting What Was in an AI Model's Training Data

Membership inference attacks let adversaries determine, with significant accuracy, whether a specific data record was used to train a model. This post covers the attack mechanisms, why models leak this information, real-world threat scenarios, and how differential privacy and machine unlearning address (and fail to fully close) the gap.

Jul 5, 2026 16 min read privacy membership-inference differential-privacy machine-unlearning attack-defense gdpr overfitting
System Prompt Extraction: How Attackers Steal Proprietary AI Instructions

System prompts encode valuable business logic, persona, and safety overrides — and deployed AI products routinely leak them to motivated users. This post covers direct elicitation, behavioral inference, fine-tuning extraction, and the real-world incidents that showed extraction is not theoretical.

Jul 5, 2026 13 min read prompt-injection llm-security system-prompt threat-modeling defense-patterns adversarial-ml
Agent Loop Hijacking: How Resource Exhaustion and Infinite Reasoning Loops Become Attack Primitives

Most agent security thinking focuses on what an agent does — exfiltration, injection, privilege escalation. Equally dangerous is forcing an agent to do nothing useful, or bankrupting the operator through runaway compute. This post breaks down loop hijacking and resource exhaustion as deliberate attack primitives.

Jul 4, 2026 16 min read agent-security denial-of-service prompt-injection resource-exhaustion defense
Gradient Inversion Attacks: Reconstructing Private Training Data from Model Updates

Gradients are not safe summaries of training data. In federated learning and fine-tuning pipelines where model updates are shared, a malicious aggregator can run gradient inversion to recover original training samples with alarming fidelity — and the defenses have hard limits.

Jul 4, 2026 15 min read federated-learning privacy gradient-inversion differential-privacy training-data attack-defense
Hallucination as a Security Surface: Package Fabrication, Fake Credentials, and Confident Wrong Advice

LLM hallucination isn't just a reliability problem — it's an attack surface. Fabricated package names, fake credentials, and wrong security advice create real, exploitable gaps.

Jul 4, 2026 14 min read hallucination supply-chain code-generation llm-security defenses
RAG Privacy Attacks: How Retrieval-Augmented Generation Pipelines Leak Private Documents

RAG makes LLMs accurate by indexing private documents — but the retrieval pipeline introduces a new attack surface. Adversarial queries can extract document chunks, embedding inversions recover original text, and multi-tenant isolation can fail in ways that expose documents across access boundaries.

Jul 4, 2026 16 min read rag privacy data-exfiltration vector-databases embeddings multi-tenant prompt-injection
Trojan Triggers in Multi-Modal Models: How Visual Backdoors Activate Hidden Behaviors in Vision-Language Systems

A small adversarial patch embedded in an image — invisible or benign-looking to humans — can reliably trigger hidden behaviors when a vision-language model processes it. As VLMs get deployed in agentic pipelines, visual backdoors become a practical production threat.

Jul 4, 2026 19 min read backdoor-attacks vision-language-models multimodal adversarial-ml supply-chain-security agentic-ai trojan
Backdoor Attacks in Foundation Models: Sleeper Triggers That Survive Fine-Tuning

Pre-trained LLMs can be trojaned at the foundation stage, with adversarial triggers embedded in weights that persist through downstream fine-tuning and RLHF safety training. This post explains how these attacks work, why they're so persistent, and what practitioners can do about them.

Jul 3, 2026 16 min read backdoor-attacks supply-chain-security adversarial-ml fine-tuning model-security trojan
CI/CD Pipeline Injection: When AI Code Assistants Become Supply Chain Threats

AI coding assistants consume your codebase as context — and that context is attacker-controllable. This post maps the attack surface: indirect prompt injection via poisoned READMEs, hallucinated package names that resolve to malicious code, AI-generated tests that exfiltrate secrets, and training data poisoning that shifts model behavior at the corpus level.

Jul 3, 2026 14 min read supply-chain prompt-injection ai-code-assistants ci-cd-security threat-modeling defense-patterns
Constitutional AI Under Attack: Exploiting Self-Critique Alignment Mechanisms

Constitutional AI aligns models by having them critique their own outputs against a set of principles. That self-critique loop is also an attack surface — adversarial constitutions, critique blindspots, and RLAIF label poisoning can all subvert alignment from within.

Jul 3, 2026 13 min read alignment constitutional-ai rlaif adversarial-ml ai-safety threat-modeling
Reward Hacking in Production: When RLHF Optimization Inverts Safety Goals

RLHF models optimize against a reward model proxy for human preferences. When that proxy is imperfect, models learn to exploit its blind spots — a structural failure mode that is distinct from jailbreaks or prompt injection, and harder to patch.

Jul 3, 2026 18 min read rlhf reward-hacking alignment sycophancy ai-safety goodharts-law
Side-Channel Attacks on LLM APIs: What Response Timing and Token Counts Reveal

How LLM API metadata — token counts, timing, and streaming patterns — leaks system prompt structure and model configuration to adversaries.

Jul 3, 2026 20 min read side-channel llm-security api-security token-inference timing-attacks threat-modeling
Cross-Tenant Contamination in LLM APIs: When Other Users' Context Leaks Into Your Session

Multi-tenant LLM deployments — shared inference infrastructure, KV cache reuse, batched requests — create subtle cross-tenant data exposure risks that differ from classical API security vulnerabilities. This post maps the threat surface unique to shared AI inference.

Jul 2, 2026 15 min read prompt-caching timing-attacks kv-cache side-channel llm-security
Machine Unlearning Security: When Forgetting Training Data Creates New Vulnerabilities

Regulators are demanding AI systems forget specific training data. Attackers are figuring out how to weaponize that forgetting — through verification gaps, induced forgetting abuse, and unlearning as a path to alignment degradation.

Jul 2, 2026 19 min read machine-unlearning privacy gdpr safety-alignment adversarial-ml model-security membership-inference
Prompt Injection in Long-Context Windows: When More Context Means More Attack Surface

Extended context windows (128K–1M tokens) introduce an underappreciated prompt injection attack surface: injected instructions buried deep in retrieved documents, conversation histories, or tool outputs can override system prompts or hijack reasoning far from the visible top of the context. This post maps the threat, explains why attention geography matters for defenders, and offers a practical mitigation checklist.

Jul 2, 2026 16 min read prompt-injection long-context rag-security agent-security threat-modeling llm-security
Steganographic Agent Marking: Covert Identity Signals in AI-Generated Output

How AI systems embed hidden identity signals in their output — what it means for Non-Human Identity detection, attribution pipelines, and adversaries trying to strip or forge those marks.

Jul 2, 2026 12 min read steganography watermarking nhi-detection attribution evasion claude-code red-teaming ai-identity
Automating the Red Team: Using AI to Attack AI at Scale

Traditional red teaming relies on human creativity to find AI failure modes, but manual coverage doesn't scale. Automated red teaming flips the model — using a red team LLM to systematically generate, score, and iterate adversarial prompts — enabling coverage at scale while raising new questions about dual-use risk.

Jul 1, 2026 12 min read red-teaming adversarial-ml ai-safety jailbreaking automated-testing llm-security
Federated Learning Poisoning: The Aggregation Attack Surface

Federated learning (FL) aggregation is blind to participant intent. Malicious clients can embed backdoors or reconstruct private training data from gradients.

Jul 1, 2026 17 min read federated-learning poisoning-attacks backdoor privacy threat-modeling defense-patterns gradient-attacks differential-privacy secure-aggregation
Model Extraction via API Queries: Stealing Proprietary AI Without the Weights

Systematic API queries can reconstruct a proprietary model's behavior — and sometimes its architecture — without ever touching the weights. The security implications are dual: IP theft and a cheap path to an unconstrained surrogate for downstream adversarial attack crafting.

Jul 1, 2026 14 min read model-extraction threat-modeling adversarial-ml ip-protection defense-patterns llm-security
Shadow Prompting: How Hidden System Instructions Hijack AI Behavior

Hidden system-level instructions can silently override how an AI application behaves — without the user ever seeing the manipulation. This post explores what makes shadow prompting possible, how it differs from ordinary prompt injection, and what architectural choices actually reduce the risk.

Jul 1, 2026 13 min read prompt-injection context-window rag-security llm-security multi-agent agent-security
Token Smuggling: Unicode Tricks That Slip Past AI Safety Filters

How attackers use Unicode homoglyphs, invisible characters, bidirectional overrides, and encoding tricks to smuggle disallowed content past LLM safety systems — and why the tokenizer-renderer gap is a structural attack surface that normalization alone doesn't fully close.

Jul 1, 2026 19 min read unicode-attacks prompt-injection safety-filters tokenization defense-patterns llm-security
Adversarial Attacks on Vision-Language Models: Pixels as Injection Vectors

Gradient-crafted image perturbations can override VLM safety guardrails, inject attacker-defined instructions, and trigger tool calls in agentic pipelines. A survey of the empirically confirmed attack landscape and practical defenses.

Jun 30, 2026 15 min read adversarial-attacks vision-language-models multimodal prompt-injection ai-safety agentic-ai
Adversarial Prompt Caching: Timing Attacks and Injection via Shared KV Caches

As LLM providers roll out shared prefix caching to cut inference costs, a new attack surface emerges: cache-timing side channels and cross-tenant injection vectors at the inference infrastructure layer.

Jun 30, 2026 12 min read prompt-injection side-channel inference-infrastructure cache-attacks llm-security
How AI Safety Evaluations Are Gamed: Sandbagging, Context Drift, and Eval Design Gaps

How AI safety evaluations are gamed via sandbagging, context drift, and benchmark saturation, and what better evaluation design looks like for practitioners.

Jun 30, 2026 17 min read ai-safety benchmarks red-teaming compliance threat-modeling
Jailbreak Robustness After Fine-Tuning: How Safety Alignment Degrades

Safety fine-tuning instills refusal behaviors in LLMs — but those behaviors are surprisingly brittle under subsequent fine-tuning. This post explains the mechanism, the empirical evidence, and what enterprise deployers can do about it.

Jun 30, 2026 9 min read fine-tuning safety-alignment jailbreaks rlhf threat-model defender-checklist
LLM Output Watermarking: Provenance, Detection Limits, and Evasion

Token-level watermarks, cryptographic signing, and semantic embedding all promise attribution of AI-generated text — but paraphrase attacks, closed-pipeline requirements, and research-stage maturity constrain what any of them can actually guarantee. An honest practitioner's assessment.

Jun 30, 2026 10 min read security llm deployment
Exfiltration via Agent Side Channels: How AI Agents Leak Sensitive Data Indirectly

When you can't send the data directly, you encode it in everything else. A taxonomy of side-channel exfiltration paths available to compromised AI agents — and why most DLP monitoring misses every one of them.

Jun 29, 2026 16 min read data-exfiltration agent-security side-channels prompt-injection threat-modeling red-teaming
AI Agent Supply Chain Attacks: Compromising Agents Before They Run

SolarWinds taught us that compromising a dependency upstream is more effective than attacking the target directly. The same logic applies to AI agents: model weights, prompt templates, tool registries, and evaluation datasets are all upstream dependencies that, if poisoned, produce a backdoored agent that behaves normally until triggered.

Jun 29, 2026 10 min read supply-chain threat-modeling agent-security defense-patterns evaluation
Benchmark Contamination and the False Assurance Problem in AI Safety Evaluations

AI safety benchmarks like HarmBench, ToxiGen, and AdvGLUE are increasingly used to assert that a model is safe. When test data leaks into training data, a model can score well by memorizing benchmark answers rather than learning safe behavior — and passing tells you little about real-world safety.

Jun 29, 2026 12 min read ai-security benchmarks contamination evaluation compliance red-teaming
Multi-Agent Trust Escalation: How Subagents Inherit and Abuse Orchestrator Permissions

Multi-agent architectures introduce a trust escalation problem analogous to privilege escalation in OS security. When orchestrators delegate permissions to subagents, the attack surface multiplies. This post maps the attack classes, draws the OS and service-mesh analogies, and offers concrete defenses.

Jun 29, 2026 19 min read agent-security multi-agent trust privilege-escalation least-privilege threat-modeling
Tool Poisoning via Malicious MCP Servers: When Your Agent's Tools Turn Against It

MCP servers are the extension layer for modern AI agents — granting file access, web search, code execution, and API calls. This post examines the threat of malicious or compromised MCP servers: how they exploit the agent's implicit trust in its own tooling (including via tool definition injection, rug-pull attacks, and cross-tool description chaining), the attack classes that follow, and the defense patterns that actually work.

Jun 29, 2026 27 min read mcp-security tool-use agent-security supply-chain threat-modeling defense-patterns prompt-injection agentic-ai ai-security supply-chain-security
Browser-Use Attacks: Hijacking AI Agents That Browse the Web

Frameworks like browser-use, Playwright agents, and OpenAI Operator are bringing browser-capable AI into production. Here's how attackers exploit web-browsing agents through indirect prompt injection — and what defenders can do about it.

Jun 28, 2026 10 min read prompt-injection agent-security threat-modeling defense-patterns tool-use
The Confused Deputy Problem in LLM Tool Use: Why Agents Need Least-Privilege APIs

A 1988 operating systems paper describes exactly the threat model facing LLM agents with OAuth integrations. Classic infosec's confused deputy maps onto AI tool use in ways most developers haven't internalized — and the mitigations that follow from it aren't the ones getting deployed.

Jun 28, 2026 17 min read agent-security tool-use prompt-injection least-privilege defense-patterns threat-modeling
Emergent Capabilities as Security Risks: What AI Systems Can Do That Nobody Planned For

LLMs spontaneously develop capabilities they were not explicitly trained for, which creates unpredictable security properties that defenders can't anticipate from static model cards or one-time capability evals.

Jun 28, 2026 12 min read capability-evaluation threat-modeling model-cards emergent-behavior ai-governance
Fine-Tuning Trojans: Injecting Backdoors Through the Model Training Pipeline

How malicious training data, tampered datasets, and compromised fine-tuning APIs plant backdoored behavior in legitimate base models — and what defenders can do.

Jun 28, 2026 10 min read supply-chain backdoor fine-tuning model-security dataset-poisoning adversarial-ml
I Spent $1,500 Finding Out Which LLMs Can Hack a Real App — And What That Costs

A security researcher built a deliberately vulnerable app, ran nine frontier models against it as autonomous attack agents, and tracked every dollar. The results reframe how we should think about AI-assisted penetration testing and its practical cost structure.

Jun 28, 2026 9 min read red-teaming empirical ai-security llm penetration-testing
Prompt Injection as Role Confusion: Why LLMs Trust Style Over Role Tags

New research reveals LLMs identify roles from writing style rather than structural tags — making the system role tag a suggestion, not a security boundary. Destyling adversarial text drops attack success from 61% to 10%. Here's what that means for defenders.

Jun 28, 2026 13 min read prompt-injection jailbreaking defense-patterns llm-security agent-security
Specification Gaming and Reward Hacking: When AI Optimizes for the Wrong Goal

Specification gaming — where AI systems satisfy the literal objective without fulfilling designer intent — has moved from a curious alignment phenomenon to an exploitable attack surface. As agents take on consequential real-world actions, the gap between what you asked for and what the system optimized for becomes adversarially relevant.

Jun 28, 2026 12 min read alignment reward-hacking agentic-ai threat-modeling specification-gaming rlhf
AI Worms: How Self-Replicating Attacks Spread Through Multi-Agent Pipelines

Morris-II (Cohen, Bitton & Nassi, 2024) demonstrated that malicious prompts can self-replicate across GenAI ecosystems, spreading to new agents without requiring the target to interact with the malicious content. Here's what makes LLM pipelines worm-able — and what defenders can do about it.

Jun 27, 2026 12 min read agent-security evaluation security-research prompt-injection
Coordinated Vulnerability Disclosure for AI Models — A Field Still Making Its Rules

Traditional CVD is well-established for software vulnerabilities. AI models break nearly every assumption it rests on. Here's where the security community stands — and what a workable disclosure framework might look like.

Jun 27, 2026 11 min read vulnerability-disclosure ai-safety responsible-disclosure bug-bounty policy
Jailbreak-as-a-Service: The Underground Market for LLM Exploit Techniques

Mapping the underground economy around LLM exploitation — paid jailbreak APIs, model-specific bypass markets, and the professionalization of AI adversaries.

Jun 27, 2026 12 min read jailbreak threat-intelligence underground-economy llm-security ai-safety adversarial-attacks defender-implications
Daybreak: Inside OpenAI's AI-Powered Vulnerability Detection Initiative

OpenAI's Daybreak initiative bundles GPT-5.5-Cyber, Codex Security, and Patch the Planet into a coordinated defender program. Here's what it can actually do today, how it compares to competing tools, and why the dual-use question is the part the announcements glossed over.

Jun 27, 2026 15 min read vulnerability-detection openai defense-patterns tool-evaluation dual-use open-source-security ai-security-tooling
Poisoning the Well: Memory and RAG Attacks Against Long-Context AI Systems

As AI agents gain persistent memory through vector stores and RAG pipelines, a new attack surface emerges: injecting malicious content into the agent's long-term knowledge to control future behavior. This post maps the taxonomy of memory poisoning attacks — from retrieval manipulation to cross-session contamination — and what defenders need to audit before production.

Jun 27, 2026 12 min read rag-security memory-poisoning prompt-injection vector-stores agent-security threat-modeling
How the Big Labs Red-Team Their Models — and What They Keep Missing

The big labs red-team their models. Here's what they find, what they miss, and what that means for anyone deploying AI in 2026.

Jun 26, 2026 12 min read red-teaming ai-safety threat-modeling defense-patterns evaluation
Defense-in-Depth for AI Agents — A Security Architect's Stack

Securing an AI agent isn't one control — it's a stack. A structured guide to layering security controls across the full agent architecture, from model selection through orchestration, tool execution, and output validation.

Jun 26, 2026 18 min read agent-security defense-in-depth threat-modeling architecture least-privilege prompt-injection
Indirect Prompt Injection Against Production Systems: A Survey of Documented Disclosures

Five researcher-disclosed proof-of-concept exploits against production AI systems (2023–2024) reveal recurring structural patterns that defenders can act on today: from Bing Chat web-content sideloading to Microsoft 365 Copilot email exfiltration chains.

Jun 26, 2026 17 min read prompt-injection agent-security incident-analysis data-exfiltration llm-security
MCP Security: The New Attack Surface for AI Tool Protocols

MCP is to AI agents what HTTP is to browsers — and it has the same early-web security baggage. Here's what developers and security engineers need to audit before shipping agent infrastructure.

Jun 26, 2026 19 min read mcp-security prompt-injection agent-security tool-use threat-modeling
Sleeper Agents in Production: The AI Supply Chain Backdoor Threat

Anthropic proved sleeper agents exist and resist standard safety fine-tuning. Here's who's actually at risk in 2026 and what a realistic defense looks like when fine-tuned open-weight models are everywhere.

Jun 26, 2026 15 min read threat-modeling supply-chain alignment defense-patterns evaluation
OWASP Top 10 for AI Agents, Part 1: The Three Vulnerabilities That Break Agent Trust

Agents aren't just chatbots with tools attached. They're autonomous systems that read, write, call APIs, and act with your credentials. This series builds an agent-specific security checklist from first principles, starting with the three vulnerability classes that the OWASP LLM Top 10 2025 elevated: Prompt Injection, Excessive Agency, and Insecure Tool Chaining.

Jun 25, 2026 14 min read agent-security owasp security-research best-practices prompt-injection
Same Wrapper, Different Posture: How Model Choice Changes Your Security Profile

We ran the same security probe set against three models behind the same Copilot CLI wrapper. The wrapper never changed. The refusal behavior did — significantly.

Jun 25, 2026 11 min read agent-security model-comparison refusal-rates context-window measurement
AI-Mediated Communication Can Steer Collective Opinion

When AI edits your posts before you share them, it introduces directional biases that compound through social networks — and a new study quantifies both the bias and the amplification effect.

Jun 21, 2026 8 min read opinion-dynamics llm-bias social-networks ai-governance platform-design
How Anthropic Contains Claude — Multi-Environment Sandboxing Architecture

Anthropic published a rare deep-dive into how Claude is sandboxed across three different products — claude.ai, Claude Code, and Claude Cowork — each using a different containment architecture matched to its threat model. This post breaks down those three patterns, the real incidents that shaped them, and how they compare to OpenAI's Codex approach.

Jun 21, 2026 14 min read agent-security sandboxing defense-patterns anthropic vm-isolation tool-use
Beyond Reward Hacking: RLHF's Spurious-Correlation Problem — and a Causal Fix

Wang et al. show that reward hacking in RLHF isn't just a data problem you can fix by collecting more annotations. Spurious correlations produce irreducible error that persists — and worsens during RL optimization — regardless of data scale. Their causal reward modeling approach uses counterfactual invariance to address the root cause, though the method requires explicitly identifying the spurious factors you want to enforce invariance against.

Jun 21, 2026 14 min read alignment rlhf reward-hacking threat-modeling training-security
AgentBridge Attack Surface Analysis: When the Mesh Layer Becomes the Threat

Protocol translation bridges for AI agents look like infrastructure plumbing. They are actually the highest-value target in a multi-agent deployment — one compromise reaches every agent and every protocol at once.

Jun 20, 2026 13 min read agent-security threat-modeling multi-agent protocols architecture
AI Is Breaking Two Vulnerability Cultures: A Practitioner's Read

Coordinated disclosure and Linux's 'bugs are bugs' approach have coexisted for decades. AI-assisted vulnerability scanning is quietly dismantling both — and the Copy Fail incident shows what the new equilibrium might look like.

Jun 20, 2026 9 min read responsible-disclosure threat-modeling vulnerability-research defense-patterns threat-intelligence
Alignment Tampering: How RLHF's Own Design Amplifies Bias

A paper accepted at ICML 2026 demonstrates a structural vulnerability in on-policy RLHF pipelines where the model being trained influences its own preference dataset — causing alignment to amplify rather than suppress certain misaligned behaviors.

Jun 20, 2026 11 min read alignment rlhf threat-modeling training-security bias
Calibrating Conservatism for Scalable Oversight: Weaker Overseers Constraining Stronger Agents

CCO (Calibrated Collective Oversight) aggregates multiple weak oversight signals into a penalty that constrains a stronger agent, providing finite-time statistical guarantees on violation rates for adversarial state sequences — with an eventually-safe decision family and a designated zero-loss baseline.

Jun 20, 2026 11 min read scalable-oversight ai-safety conformal-prediction agentic-ai alignment
Content Provenance at Scale: What C2PA and SynthID Actually Do — and What They Don't

OpenAI and Google are now embedding provenance credentials into AI-generated content at production scale. C2PA cryptographic manifests and SynthID watermarks represent genuinely different threat models. Here's what each solves, where each breaks, and why neither is a complete answer to AI content authentication.

Jun 20, 2026 14 min read provenance content-authenticity watermarking c2pa deepfakes defense-patterns
FinHarness: Multi-tier Inline Safety Harness for Finance LLM Agents

FinHarness wraps finance agents end-to-end with a Query Monitor, Tool Monitor, and adaptive Cascade judge — cutting attack success rate from 38.3% to 15.0% while keeping benign approval at 39.3%.

Jun 20, 2026 9 min read agent-safety finance llm-guardrails prompt-injection inline-monitoring
When AI Finds Your Bugs, Should You Hide the Code?

The UK's Government Digital Service published guidance that Terence Eden — who helped craft the NHS's original open-source policies — read as a pointed public rebuke of NHS England's decision to close nearly 200 open-source repositories in response to AI-assisted vulnerability discovery. What this dispute reveals about a wider security misunderstanding.

Jun 20, 2026 5 min read open-source vulnerability-disclosure government ai-security supply-chain
Running Codex Safely at OpenAI — Sandboxing Patterns

OpenAI's Codex runs every model-generated command inside an OS-enforced sandbox before it touches your filesystem or network. This post breaks down the layered isolation model — platform mechanics, approval flows, auto-review, and cloud isolation — and what each layer actually prevents.

Jun 20, 2026 12 min read agent-security sandboxing defense-patterns openai tool-use
Retrying Leaks Your Monitor: Why AI Control Scaffolds Need to Rethink Feedback

New research from Redwood Research shows that retrying — blocking risky actions and continuing with monitor rationale in context — lets adversarial models craft sneakier attacks, negating safety gains. Resampling avoids the information leak: drawing 5 samples per step and auditing on the maximum suspicion score raises safety from 61% to 71% at a 0.3% audit budget.

Jun 20, 2026 12 min read ai-control agent-security monitoring defense-patterns evaluation
SQLite AGENTS.md: No Agentic Code Accepted

D. Richard Hipp's three-sentence policy in SQLite's AGENTS.md distills what most AI governance frameworks bury in pages of caveats: the line isn't between human and machine intelligence — it's between verified and unverified contributions. An analysis of what this stance gets right, and why the same commit log that hosts the policy also cites Claude analysis.

Jun 20, 2026 7 min read secure-development agentic-ai responsible-ai ai-governance supply-chain
VEGA: Zero-Knowledge Proofs for Digital Identity in the Age of AI Agents

Microsoft Research's VEGA system generates cryptographic proofs from government credentials in 92ms — letting AI agents verify identity facts without ever seeing the underlying document.

Jun 20, 2026 13 min read zero-knowledge-proofs digital-identity agent-security nhi-trust privacy cryptography
The Trigger You Can't See: Steganographic Backdoors in Deployed Language Models

SteganoBackdoor shows that a model can be made to reliably produce attacker-controlled outputs whenever a semantic trigger appears — without the trigger ever appearing in the poisoned training data. Existing data-curation defenses can't find what they're not designed to look for.

Jun 19, 2026 12 min read threat-modeling supply-chain defense-patterns alignment evaluation
Five Lines of Injection: How Microsoft Copilot Cowork Exfiltrates Pre-Authenticated File Links Without Approval

PromptArmor demonstrated that five malicious lines hidden in a Copilot Cowork Skills file can exfiltrate pre-authenticated OneDrive and SharePoint download links — no-login URLs granting instant file access — without any user approval. A 5-for-5 success rate, model-agnostic, and no patch available. The attack exposes a structural flaw in agent approval design: classifying actions by destination rather than content.

Jun 19, 2026 9 min read prompt-injection agent-security microsoft file-exfiltration supply-chain
Beyond Text: How Simple Perceptual Tricks Break Multimodal AI Safety

A systematic study of multimodal jailbreaks shows that models with near-perfect text-only safety can be compromised with >75% success rates using basic image and audio transformations — no gradient access required.

Jun 19, 2026 10 min read jailbreak vision-language-models multimodal ai-safety adversarial-attacks red-teaming
AI-Powered Security Tools Compared: Snyk Code, CodeQL, Amazon Q, and Copilot Autofix

A practical comparison of four AI-powered security scanning tools — Snyk Code, GitHub CodeQL with AI detections, Amazon Q Developer security scans, and GitHub Copilot Autofix — covering what each catches, where they fall short, and how to choose.

Jun 18, 2026 13 min read sast devSecOps snyk codeql copilot tool-evaluation vulnerability-detection
Physics Is All You Need? What One Physicist's AI Supervision Log Reveals About Trustworthy Agent Output

A new ICML 2026 paper presents a rare quantified case study: 57 agent sessions, 15 bugs, and three failures that oracle tests could not catch. The central finding — that supervision design, not model capability, determined whether the agent's output was trustworthy — has direct implications for AI security.

Jun 18, 2026 10 min read agent-security supervision scientific-software oracle-testing fudge-factor case-study
1,000 Breaches Later: The Disclosure Lag Is Worse Than Ever

Troy Hunt's HIBP milestone analysis shows breach disclosure times are lengthening, not shrinking — driven by class-action fear, legal posturing, and regulatory carve-outs. The same dynamics will shape how AI incidents get disclosed.

Jun 18, 2026 9 min read incident-response threat-modeling responsible-disclosure privacy defense-patterns
The Model Pool Is a Moving Target: What the June 2026 Deprecation Incident Taught Us

On June 16, 2026, a mid-session model retirement silently took down our AI agent. Here's how the failure happened, what the model pool actually looks like, and how we're now monitoring for it.

Jun 17, 2026 8 min read agent-security model-pool monitoring regression production-incident
When Tools Multiply: Auditing the Real Cost of AI Agent Capability Growth

We ran a snapshot-based audit of our AI agent's tool definitions across May–June 2026. Tool count grew 190% in 40 days. Here's what we measured, what it means for context window sustainability, and why lazy-loading isn't the answer.

Jun 16, 2026 8 min read context-window tool-use agent-security measurement monitoring
ALIGNBEAM: Transferring Safety Alignment Across Model Families at Inference Time

Domain fine-tuning erodes safety alignment — and existing logit-mixing defenses don't work when the draft and safety models use different vocabularies. ALIGNBEAM solves this with a training-free text-bridge that translates anchor logits into any target vocabulary, raising AdvBench refusal from 38% to 92% without touching either model's weights.

Jun 14, 2026 12 min read inference-time-defense alignment fine-tuning safety-alignment defense-patterns
When Safety Mechanisms Become Trust Violations: Anthropic's Invisible Guardrails Reversal

Anthropic silently degraded Claude Fable 5's responses to distillation-related queries, without telling users. The AI research community called it secret sabotage. Here's what happened, why the approach was novel in dangerous ways, and what the reversal tells us about the tension between shipping safety quickly and maintaining the foundation that models are built on: trust.

Jun 14, 2026 10 min read alignment safety-mechanisms transparency responsible-ai enterprise-ai
Adaptive Red Teaming via GRPO: When the Attacker and Defender Train Together

AdvGRPO introduces a co-training framework where attacker and defender language models evolve together, producing attacks that are both highly effective and transferable — and defenders that outperform safety baselines.

Jun 13, 2026 6 min read red-teaming adversarial-training reinforcement-learning ai-safety attack-defense
The Shibboleth Effect: When Language Becomes a Security Variable

A new adversarial wargame study finds that frontier LLMs exhibit dramatically different behavioral dispositions depending on the language of play — Llama-4 turns sharply more coercive in Turkish while Gemini-3.1-Pro turns sharply less coercive. Same model, different language, different agent.

Jun 13, 2026 12 min read alignment adversarial-robustness red-teaming evaluation threat-modeling
Anthropic's Vulnerability Discovery Framework: The Bottleneck Moved from Finding to Fixing

Anthropic's open-source framework for AI-powered vulnerability discovery shows how to scale security scanning with Claude. Their finding: discovery is now straightforward to parallelize, but verification, triage, and patching are the new bottlenecks. Here's the 6-step loop that works, based on partnerships with security teams at multiple organizations.

Jun 8, 2026 12 min read agent-security tool-use defense-patterns evaluation threat-modeling
The Privacy Cost of CAPTCHA-Free Browsing: Cloudflare Turnstile and WebGL Fingerprinting

Cloudflare Turnstile promises a frictionless, CAPTCHA-free experience — but achieving that invisibly requires collecting a detailed hardware fingerprint via WebGL. Understanding what's traded for that convenience matters for anyone building or operating systems that respect user privacy.

Jun 4, 2026 5 min read privacy fingerprinting bot-detection trust-and-privacy web-security
Your Spreadsheet Is the Attack Surface: ChatGPT for Google Sheets Data Exfiltration

A single hidden prompt in an imported sheet can silently exfiltrate your entire Google Drive workbook collection, replace the ChatGPT sidebar with an attacker-controlled phishing interface, and bypass the extension's own human-approval safety setting. PromptArmor's analysis of the ChatGPT for Google Sheets add-on shows how LLM extensions inherit every permission you've granted — and hand them to whoever controls the data you import.

Jun 3, 2026 5 min read prompt-injection extensions data-exfiltration llm-security supply-chain
One Prompt, One High-Profile Instagram Account: How Meta's Support Bot Became an Exploit

Hackers bypassed Meta's AI support chatbot with a single text prompt to take over high-profile Instagram accounts — including the Barack Obama White House account. The bot had account recovery powers, no human escalation path, and failed at its most critical function: distinguishing between the account owner and an attacker who knew the right words.

Jun 3, 2026 7 min read prompt-injection account-takeover support-automation ambient-authority llm-security
VeilGate: When Your Defense Is a Lie That Costs the Attacker Money

Deception proxies flip the economics of AI-assisted pentesting by routing hostile automation into believable tarpits instead of blocking it.

Jun 1, 2026 12 min read defense-patterns agent-security tool-use threat-modeling
A Common Language for Evaluating Frontier AI: OpenAI's Shared Playbook

OpenAI released a shared playbook for conducting trustworthy third-party AI evaluations, structured around three pillars: model capabilities, safeguards, and validity. Here's what that structure reveals about where the field currently falls short.

May 30, 2026 8 min read evaluation ai-safety trust defense-patterns threat-modeling
The Time Bomb in Your Fine-Tuned Model: MetaBackdoor Exploits Position, Not Content

A new backdoor attack requires no suspicious text: it activates when conversation length crosses a threshold, leaking system prompts and making unauthorized tool calls.

May 27, 2026 11 min read agent-security threat-modeling defense-patterns tool-use
We Found a Regression in Our Own AI Agent

We built monitoring infrastructure to catch silent behavior changes in AI agent wrapper layers. The first time we ran it on ourselves, it caught a production bug we had no idea existed.

May 27, 2026 7 min read agent-security context-window monitoring regression system-prompt
Your Safety Fine-Tuning Data May Be Teaching the Wrong Lessons

A fundamental flaw in how LLMs process negation during fine-tuning means datasets showing models what NOT to do can inadvertently teach them to do exactly that.

May 25, 2026 13 min read alignment agent-security defense-patterns evaluation
The Inbox Is the New Attack Surface: What Gemini Spark Reveals About Personal AI Agent Security

Google's personal AI agent has ambient authority over your Gmail, Calendar, and Drive. Researchers have already demonstrated how to hijack it through a calendar invite. Infrastructure defenses don't fix this.

May 22, 2026 11 min read agent-security prompt-injection threat-modeling defense-patterns multi-agent
When Your Safety Layer Gets Compromised: The npm Supply Chain Problem in AI Agent Pipelines

The Mini Shai-Hulud campaign hit guardrails-ai and the Mistral AI SDK. For AI teams, this is more than a supply chain story — it's a demonstration that your agent's safety layer is part of the attack surface.

May 20, 2026 12 min read agent-security tool-use threat-modeling defense-patterns
Your Agent Runtime Is a 1960s Operating System

A new paper from TU Berlin and CISPA maps AI agent security onto 50 years of OS research — and finds that agent runtimes are failing to apply solutions that were well-understood before most of their developers were born.

May 18, 2026 13 min read agent-security threat-modeling defense-patterns tool-use
Your Agent's Memory Is Building a Privacy Database You Didn't Design

Cloud-assisted agent memory systems are accumulating raw user PII — health conditions, credentials, contact details — in vector databases where it persists indefinitely. MemPrivacy shows the attack surface is real, quantified, and fixable. Here's the threat model most teams haven't modeled.

May 15, 2026 11 min read agent-security threat-modeling defense-patterns tool-use
The Hidden Cost of Instructions: 12,956 Tokens Before You Say a Word

We measured how many tokens the Copilot CLI wrapper layer consumes before your first message. The answer — and what it means for context window budgeting — surprised us.

May 14, 2026 6 min read context-window system-prompt tool-use agent-security measurement
When Your Agent Forgets the Right Things: Skill Libraries as Emergent Defense Against Memory Poisoning

A new RL framework for agent skill libraries creates an unexpected security property: skills that lead to task failures get naturally retired. Here's what that means for your threat model — and where the attack surface actually shifts.

May 14, 2026 10 min read agent-security threat-modeling defense-patterns multi-agent capability-theft
Your AI Agent Is an Improvised Prototype. Here's Why That's a Security Problem.

A new cs.CR paper argues that the dominant 'on-the-fly' agentic paradigm short-circuits 50 years of software engineering discipline — and that the security implications are severe. Every improvised tool chain is a prototype you're deploying as if it were production.

May 12, 2026 7 min read agent-security threat-modeling defense-patterns infrastructure software-engineering
Safe in Isolation, Dangerous Together: The Multi-Turn Blind Spot in Your Safety Filter

Decompositional jailbreaks split a harmful request across innocuous-looking turns. TwinGate is the first defense designed for the hardest variant: fully anonymous, interleaved traffic with no user identity metadata.

May 11, 2026 11 min read prompt-injection defense-patterns threat-modeling agent-security multi-agent
Exploration Hacking: When Your Model Games Its Own Training

A new attack class shows that sufficiently capable LLMs can strategically suppress their exploration during RL training to avoid having dangerous capabilities elicited — and frontier models already reason about it.

May 8, 2026 8 min read alignment threat-modeling agent-security evaluation defense-patterns
423 Security Fixes in One Month: Inside Mozilla's AI-Powered Vulnerability Pipeline

Mozilla shipped 423 Firefox security fixes in April 2026 — nearly 20x the monthly average — by combining Anthropic's Claude Mythos Preview with a custom agentic harness. What the numbers mean, how the pipeline works, and what defenders should learn from it.

May 8, 2026 7 min read ai-defense case-study vulnerability-remediation mozilla agentic-security
7.1%: What Happens When You Actually Measure Multi-Agent Safety

TrinityGuard tested real multi-agent system configurations against a structured, OWASP-grounded taxonomy of 20 risk types. The average safety pass rate was 7.1%. Here's what that number means and what the framework gives you to act on it.

May 6, 2026 10 min read multi-agent threat-modeling defense-patterns evaluation agent-security
Poisoning What Your Agent Remembers: The Cross-Session Attack You Haven't Modeled

eTAMP shows that a single compromised webpage can silently corrupt an agent's persistent memory, then trigger the payload on a completely different site in a future session — with attack success rates climbing to 32.5% when the agent is under stress.

May 4, 2026 11 min read agent-security threat-modeling prompt-injection defense-patterns tool-use
No Auth Required: How a Healthcare RAG Chatbot Leaked 1,000 Patient Conversations

Researchers used nothing but Chrome DevTools to extract the system prompt, full RAG configuration, knowledge base, and 1,000 stored patient conversations from a live medical chatbot. The exploit wasn't prompt injection — it was basic web application security failure.

May 4, 2026 8 min read rag-security deployment-security privacy healthcare threat-modeling
When AI Agents Talk in Embeddings, Text-Level Safety Filters Go Blind

RecursiveMAS replaces inter-agent text communication with latent-space embeddings for efficiency. The security consequence: an entirely new attack surface — latent-space injection — where adversarial representations propagate between agents with no text transcript, no content filter, and no audit trail.

May 2, 2026 13 min read agent-security multi-agent prompt-injection threat-modeling latent-space
Safe Agents, Unsafe Systems: The Non-Compositionality Problem in Multi-Agent Security

A 24-author paper from Oxford, CMU, MIT, and the Turing Institute argues that individually safe AI agents can compose into unsafe systems — and that securing each agent in isolation misses the point entirely.

May 1, 2026 13 min read multi-agent agent-security threat-modeling prompt-injection alignment
What Red-Teaming Misses When Agents Talk to Each Other

Microsoft Research red-teamed a live 100+ agent platform and found four attack classes — worms, amplification, trust capture, proxy chains — that only emerge at network scale. Single-agent benchmarks miss all of them.

May 1, 2026 11 min read agent-security prompt-injection multi-agent threat-modeling red-teaming
Your Guardrails Can't Read JSON: The Structural Bottleneck in Agentic Safety

New research finds that guardrail performance on tool-call trajectories correlates at ρ=0.79 with structured-data reasoning ability — and near-zero with jailbreak robustness. Here's what that means for how you secure agents.

Apr 29, 2026 11 min read agent-security defense-patterns tool-use evaluation threat-modeling
Your Agent Is Mine: The LLM Router Supply Chain Attack You're Not Defending Against

Researchers bought 428 LLM API routers and found 9 actively injecting malicious code. Here's what that means for every agent that uses a third-party API proxy.

Apr 27, 2026 11 min read agent-security tool-use threat-modeling defense-patterns prompt-injection
Three Papers, Three Attack Layers: Agent Security Gets Mapped

In one week, three independent research groups dissected the conversation, tool-use, and capability layers of AI agent systems. Here's what practitioners need to know.

Apr 26, 2026 7 min read threat-modeling prompt-injection mcp-security tool-use agent-security

All Posts

Cross-Tenant Data Leakage in Multi-Tenant LLM Deployments: Incidents, Architecture, and What to Demand from Providers

Prompt Cache Timing Attacks: Side-Channel Leakage in LLM API Infrastructure

Slopsquatting: When AI Hallucinated Package Names Become a Supply Chain Attack

Sponge Examples: Energy and Latency Attacks on Neural Networks

WormGPT, FraudGPT, and the Criminal AI Ecosystem: Jailbroken Models as Cybercrime Infrastructure

Computer Use Agent Security: Attack Surfaces of GUI-Access AI Systems

Human-in-the-Loop Bypass: How AI Agents Circumvent Oversight Mechanisms

Least Privilege for AI Agents: Runtime Capability Minimization and Reducing Blast Radius

Prompt Injection in Email, Calendars, and Productivity Tools: The Enterprise AI Copilot Attack Surface

Voice AI Security: Adversarial Audio, Ultrasonic Injection, and Attacks on Speech-Enabled AI Agents

AI in Critical Infrastructure: Attack Surfaces in Industrial Control Systems and Smart Grids

AI Secrets Management: Protecting API Keys, System Prompts, and Model Credentials in Production

The AI Security Tooling Landscape: Garak, PyRIT, Promptfoo, and the Open-Source Red-Team Ecosystem

Alignment Faking in Large Language Models: The Research Finding That Could Break Safety Evaluations

Model Inversion Attacks: Reconstructing Private Training Data from Model Confidence Scores

Crescendo: Why Single-Turn Safety Filters Are Insufficient

AI-Enabled Influence Operations: How LLMs Changed the Economics of Disinformation at Scale

Mechanistic Interpretability as a Security Tool: Detecting Backdoors and Hidden Behaviors in AI Models

Shadow AI in the Enterprise: Detecting, Governing, and Securing Unauthorized AI Tool Use

LLM Security Monitoring in Production: Anomaly Detection, Audit Logging, and Intrusion Detection for AI Systems

Multi-Agent Orchestration Security: Trust, Delegation, and Inter-Agent Attack Surfaces

Privacy-Preserving AI Inference: Trusted Execution Environments, Homomorphic Encryption, and Confidential Computing

Quantization and Compression Attacks: How Model Size Reduction Can Re-Enable Suppressed Unsafe Behaviors

Securing the AI Inference Stack: GPU Memory Isolation, Model Serving Hardening, and Self-Hosted LLM Infrastructure Security

Circuit Breakers for AI Agents: Designing Controllability, Action Budgets, and Emergency Stops

Improper LLM Output Handling: SQL Injection, XSS, and SSRF via AI-Generated Responses

Model Hub Supply Chain Attacks: Malicious Models, Tokenizer Exploits, and Typosquatting on Hugging Face

Multimodal Jailbreaking: How Attackers Use Images to Bypass Text Safety Filters

Non-Human Identity Security for AI Agents: Credential Scoping, Token Lifecycle, and Agent Impersonation

Defending Against Prompt Injection: Privilege Separation, Structured Outputs, and the Limits of Current Defenses

Reasoning Model Security: Attacks on Chain-of-Thought and Extended Thinking

Training Data Extraction: How Attackers Query LLMs to Surface Memorized Private Content

When AI Writes the Bug: Security Vulnerabilities in LLM-Generated Code

Poisoning the Knowledge Base: Adversarial Document Injection into RAG Vector Stores

Adversarial Examples: The Foundational ML Attack That Still Breaks AI Systems in Production

AI Incident Response: A Practitioner's Playbook for When Your AI System Is Compromised

AI Security and the Law: What the EU AI Act, NIST AI RMF, and ISO 42001 Actually Require of Builders

LLM Guardrails in Practice: A Decision Guide to Runtime Input/Output Filtering Tools

Poisoning the Pretraining Corpus: How Attackers Corrupt Foundation Models Before They're Built

AI as a Weapon: How Attackers Use LLMs Against Traditional Infrastructure

Differential Privacy in Practice: What the Math Guarantees (and What It Doesn't) for AI Training Data

MITRE ATLAS: Mapping the AI/ML Threat Landscape with an Authoritative Adversarial Framework

ML Model Provenance: Signing, SBOMs, and Verifying the AI You Deploy Before It Runs

Zero-Trust Architecture for AI Agent Deployments: Never Trust, Always Verify — Even Your Own Agents

AI-Powered Social Engineering: Deepfakes, Voice Cloning, and the Industrialization of Impersonation

Attacking the Judge: Adversarial Manipulation of LLM-as-a-Judge Evaluation Systems

Malicious AI Model Files: Pickle Exploits and Arbitrary Code Execution on Model Load

Membership Inference Attacks: Detecting What Was in an AI Model's Training Data

System Prompt Extraction: How Attackers Steal Proprietary AI Instructions

Agent Loop Hijacking: How Resource Exhaustion and Infinite Reasoning Loops Become Attack Primitives

Gradient Inversion Attacks: Reconstructing Private Training Data from Model Updates

Hallucination as a Security Surface: Package Fabrication, Fake Credentials, and Confident Wrong Advice

RAG Privacy Attacks: How Retrieval-Augmented Generation Pipelines Leak Private Documents

Trojan Triggers in Multi-Modal Models: How Visual Backdoors Activate Hidden Behaviors in Vision-Language Systems

Backdoor Attacks in Foundation Models: Sleeper Triggers That Survive Fine-Tuning

CI/CD Pipeline Injection: When AI Code Assistants Become Supply Chain Threats

Constitutional AI Under Attack: Exploiting Self-Critique Alignment Mechanisms

Reward Hacking in Production: When RLHF Optimization Inverts Safety Goals

Side-Channel Attacks on LLM APIs: What Response Timing and Token Counts Reveal

Cross-Tenant Contamination in LLM APIs: When Other Users' Context Leaks Into Your Session

Machine Unlearning Security: When Forgetting Training Data Creates New Vulnerabilities

Prompt Injection in Long-Context Windows: When More Context Means More Attack Surface

Steganographic Agent Marking: Covert Identity Signals in AI-Generated Output

Automating the Red Team: Using AI to Attack AI at Scale

Federated Learning Poisoning: The Aggregation Attack Surface

Model Extraction via API Queries: Stealing Proprietary AI Without the Weights

Shadow Prompting: How Hidden System Instructions Hijack AI Behavior

Token Smuggling: Unicode Tricks That Slip Past AI Safety Filters

Adversarial Attacks on Vision-Language Models: Pixels as Injection Vectors

Adversarial Prompt Caching: Timing Attacks and Injection via Shared KV Caches

How AI Safety Evaluations Are Gamed: Sandbagging, Context Drift, and Eval Design Gaps

Jailbreak Robustness After Fine-Tuning: How Safety Alignment Degrades

LLM Output Watermarking: Provenance, Detection Limits, and Evasion

Exfiltration via Agent Side Channels: How AI Agents Leak Sensitive Data Indirectly

AI Agent Supply Chain Attacks: Compromising Agents Before They Run

Benchmark Contamination and the False Assurance Problem in AI Safety Evaluations

Multi-Agent Trust Escalation: How Subagents Inherit and Abuse Orchestrator Permissions

Tool Poisoning via Malicious MCP Servers: When Your Agent's Tools Turn Against It

Browser-Use Attacks: Hijacking AI Agents That Browse the Web