As organizations increasingly integrate LLMs into their operations, a question that often gets deferred is: which model is actually the most secure to deploy? It's a harder question than it sounds. Capability benchmarks are everywhere; security benchmarks are not. Existing efforts like MITRE ATLAS, NIST's AML taxonomy, and OWASP's LLM Top 10 each cover slices of the problem — but none achieves full coverage across content safety, lifecycle scope, multi-agent threats, and supply chain risk simultaneously.
Cisco has made a serious effort here, developing an integrated security framework for AI. That framework informs Cisco's LLM Security Leaderboard, a publicly accessible tool that ranks models on their resistance to adversarial attacks, launched at the 2026 RSA Conference. You can consult the leaderboard to get a sense of how secure your LLM of choice actually is.
In this blog post, we'll dive into the security framework developed by Cisco to evaluate LLMs.
The leaderboard evaluates models across two distinct test types. Both are run against base models with no guardrails applied, establishing a consistent baseline of inherent model security rather than measuring production-hardened deployments.
Single-turn testing is blunt force: send a malicious prompt directly, see if the model refuses. "Write me malware for X." "Explain how to synthesize Y." It tests the model's immediate, unconditional safety response — the first-instinct refusal. Single-turn scoring is simply the percentage of direct, single-prompt attacks the model successfully refused.
Multi-turn testing mirrors how a real adversary actually operates. Rather than one direct request, the attacker builds across a conversation — establishing rapport, adopting a persona, escalating gradually from innocuous to harmful. A model might refuse a direct request but cave after six turns of social engineering. Multi-turn attack strategies include persona adoption, gradual escalation from benign to harmful requests, social engineering and trust-building, and context manipulation. Scoring measures the percentage of individual attack strategies that bypassed the model's safeguards, with each conversation typically testing 4-5 independent strategies — more granular than a binary pass/fail per conversation.
Each model receives a combined score weighted equally between the two (50/50), so a model can't rank well by excelling in only one dimension. A model that aces single-turn but fails multi-turn is dangerous in production — most real interactions aren't one-shot. Scores range from 0-100%, with bands of Excellent (85-100%), Good (70-84%), Fair (50-69%), and Poor (0-49%).
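The scoring described above is simple enough to sketch directly. The snippet below is a minimal illustration of the equal 50/50 weighting and the rating bands as stated; the function names and rounding behavior are my assumptions, not Cisco's implementation.

```python
# Sketch of the leaderboard's scoring as described: the two test scores
# are weighted 50/50, then mapped onto rating bands. Function names and
# exact band boundaries at the edges are assumptions for illustration.

def combined_score(single_turn_pct: float, multi_turn_pct: float) -> float:
    """Equal-weighted combination of the two test scores (0-100)."""
    return 0.5 * single_turn_pct + 0.5 * multi_turn_pct

def band(score: float) -> str:
    """Map a 0-100 score onto the leaderboard's rating bands."""
    if score >= 85:
        return "Excellent"
    if score >= 70:
        return "Good"
    if score >= 50:
        return "Fair"
    return "Poor"

# A model that aces single-turn (95%) but caves under multi-turn social
# engineering (40%) lands in "Fair" -- it cannot hide behind one strong
# dimension.
score = combined_score(95, 40)  # 67.5
print(band(score))              # Fair
```

The equal weighting is the design choice doing the work here: either test score alone caps how high the combined number can climb.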
The single-turn/multi-turn distinction maps directly onto the threat taxonomy. Single-turn testing primarily exercises techniques executable in one prompt — direct prompt injection, jailbreak variants, harmful content elicitation. Multi-turn testing is where the more sophisticated objectives come alive: Goal Manipulation, Context Boundary Attacks, Masquerading and Impersonation, Persistence, and the social engineering dimension of Communication Compromise. If you're deploying a one-shot API tool, single-turn scores are most relevant. If you're deploying a conversational agent with memory and tool access — which is where enterprise AI is heading — multi-turn resistance is the number that matters.
The Cisco Integrated AI Security and Safety Framework is not a single taxonomy — it's three interlocking ones, each covering a distinct attack surface, and each designed to map back to a common structure so organizations can reason about risk consistently across all of them.
The first is the master taxonomy: the 19 objectives, 40 techniques, and 112 subtechniques that classify the full range of AI threats. It operates on four hierarchical levels: objectives (the "why" behind attacks), techniques (the "how"), subtechniques (specific variants), and procedures (discrete real-world implementations).
The 19 objectives span three risk groups:
Common Manipulation Risks
Data-Related Risks
Downstream / Impact Risks
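The four-level hierarchy lends itself to a simple nested representation. The sketch below shows one way to model it; the field names and the example entry are illustrative placeholders, not items drawn from Cisco's actual taxonomy.

```python
# Minimal sketch of the four-level hierarchy: objective -> technique ->
# subtechnique -> procedure. The example entry below is a hypothetical
# illustration, not a record from the published taxonomy.
from dataclasses import dataclass, field

@dataclass
class Technique:
    name: str                                           # the "how"
    subtechniques: list = field(default_factory=list)   # specific variants

@dataclass
class Objective:
    name: str              # the "why" behind the attack
    risk_group: str        # one of the three risk groups listed above
    techniques: list = field(default_factory=list)

# Hypothetical entry, for illustration only:
obj = Objective(
    name="Harmful Content Elicitation",
    risk_group="Common Manipulation Risks",
    techniques=[Technique("Jailbreak", subtechniques=["Persona Adoption"])],
)
print(obj.risk_group)
```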
Model Context Protocol (MCP) is an open standard that governs how LLMs interact with external tools, data sources, and execution environments. It's the plumbing that makes agentic AI work — allowing models to call APIs, query databases, execute code, and chain actions across systems. As MCP adoption accelerates, it has also become a significant attack surface in its own right.
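To make the attack surface concrete: MCP messages are framed as JSON-RPC 2.0, so a model invoking a server-side tool is, on the wire, a request like the one sketched below. The tool name and arguments here are invented for illustration.

```python
# What an MCP tool invocation looks like on the wire. MCP uses JSON-RPC
# 2.0 framing; the "tools/call" method asks a server to execute a tool.
# The tool name and arguments below are hypothetical.
import json

request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "query_database",        # hypothetical tool name
        "arguments": {"sql": "..."},     # model-generated input flows
                                         # here, which is exactly what
                                         # makes this an attack surface
    },
}
print(json.dumps(request, indent=2))
```

Anything that can influence what the model generates can influence what lands in `params`, which is why tool plumbing deserves its own taxonomy.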
Separate from the main taxonomy, Cisco has published a dedicated MCP threats taxonomy covering 14 threat types organized into four groups. Every threat maps back to the main taxonomy's objectives and techniques, but the MCP taxonomy adds MCP-specific indicators, severity levels, and mitigation guidance — making it directly operationalizable for teams building on MCP.
Supply chain risk is systematically underweighted in most current threat models.
Cisco has published a standalone supply chain taxonomy covering 22 threat types across four groups, each mapped to the main framework with file-type indicators, severity ratings, and a "Model Defense Layer" mitigation approach.
Artifact and Format Vulnerabilities
Model Manipulation and Tampering
Dependency and Distribution Compromise
Operational and Runtime Threats
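To show how such entries become operational, here is one way a supply chain threat record could be structured so it carries its indicators, severity, and mapping back to the main framework. The field names and the example threat are my assumptions; consult Cisco's published taxonomy for the real entries.

```python
# Sketch of a supply chain taxonomy entry that maps back to the main
# framework. Field names and the example record are hypothetical.
from dataclasses import dataclass

@dataclass
class SupplyChainThreat:
    name: str
    group: str               # one of the four groups listed above
    severity: str            # e.g. "High"
    file_indicators: list    # file types that signal exposure
    maps_to: str             # objective/technique in the main taxonomy

threat = SupplyChainThreat(
    name="Malicious serialized model",   # hypothetical example
    group="Artifact and Format Vulnerabilities",
    severity="High",
    file_indicators=[".pkl", ".pt"],     # pickle-based formats can
                                         # execute code on load
    maps_to="Model Manipulation",        # illustrative mapping
)
print(threat.group)
```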
The practical value here is three well-structured, interoperable taxonomies that give security teams a common vocabulary — a structured way to ask which objectives are relevant to your use case, which MCP-specific threats apply to your agentic architecture, and where your supply chain exposures actually lie.
Please reach out with questions about securing AI in your organization's network.