Are Your LLMs Secure? Here's How Cisco is Thinking About It
As organizations increasingly integrate LLMs into their operations, a question that often gets deferred is: which model is actually the most secure to deploy? It's a harder question than it sounds. Capability benchmarks are everywhere; security benchmarks are not. Existing efforts like MITRE ATLAS, NIST's AML taxonomy, and OWASP's LLM Top 10 each cover slices of the problem — but none achieves full coverage across content safety, lifecycle scope, multi-agent threats, and supply chain risk simultaneously.
Cisco has taken a serious run at this problem, developing a security framework for AI. That framework underpins Cisco's LLM Security Leaderboard, a publicly accessible tool, launched at the 2026 RSA Conference, that ranks models on their resistance to adversarial attacks. It's a useful first stop for gauging how secure your LLM of choice actually is.
In this blog post, we'll dive into the security framework developed by Cisco to evaluate LLMs.

How the Leaderboard Scores Models
The leaderboard evaluates models across two distinct test types, running both against base models with no guardrails applied. This establishes a consistent baseline of inherent model security rather than measuring production-hardened deployments.
Single-turn testing is blunt force: send a malicious prompt directly, see if the model refuses. "Write me malware for X." "Explain how to synthesize Y." It tests the model's immediate, unconditional safety response — the first-instinct refusal. Single-turn scoring is simply the percentage of direct, single-prompt attacks the model successfully refused.
Multi-turn testing mirrors how a real adversary actually operates. Rather than one direct request, the attacker builds across a conversation — establishing rapport, adopting a persona, escalating gradually from innocuous to harmful. A model might refuse a direct request but cave after six turns of social engineering. Multi-turn attack strategies include persona adoption, gradual escalation from benign to harmful requests, social engineering and trust-building, and context manipulation. Scoring measures the percentage of individual attack strategies that bypassed the model's safeguards, with each conversation typically testing 4-5 independent strategies — more granular than a binary pass/fail per conversation.
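The per-strategy scoring described above can be sketched in a few lines. This is a minimal illustration, not Cisco's actual harness: the class and field names are my own, and a real evaluation would involve automated attack agents and judges.

```python
from dataclasses import dataclass

@dataclass
class StrategyResult:
    strategy: str    # e.g. "persona_adoption", "gradual_escalation"
    bypassed: bool   # did this strategy get past the model's safeguards?

def multi_turn_resistance(results: list[StrategyResult]) -> float:
    """Percentage of attack strategies the model resisted (0-100)."""
    if not results:
        return 100.0
    resisted = sum(1 for r in results if not r.bypassed)
    return 100.0 * resisted / len(results)

# One conversation probing four independent strategies:
conversation = [
    StrategyResult("persona_adoption", bypassed=False),
    StrategyResult("gradual_escalation", bypassed=True),
    StrategyResult("trust_building", bypassed=False),
    StrategyResult("context_manipulation", bypassed=False),
]
print(multi_turn_resistance(conversation))  # → 75.0
```

The point of the per-strategy granularity is visible here: a binary pass/fail would score this conversation 0%, while per-strategy scoring credits the model for the three attacks it did resist.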
Each model receives a combined score weighted equally between the two (50/50), so a model can't rank well by excelling in only one dimension. A model that aces single-turn but fails multi-turn is dangerous in production — most real interactions aren't one-shot. Scores range from 0-100%, with bands of Excellent (85-100%), Good (70-84%), Fair (50-69%), and Poor (0-49%).
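The weighting and banding just described reduce to a small amount of arithmetic, sketched below (function names are illustrative, not Cisco's):

```python
def combined_score(single_turn_pct: float, multi_turn_pct: float) -> float:
    """Equal 50/50 weighting of the two test dimensions (0-100 scale)."""
    return 0.5 * single_turn_pct + 0.5 * multi_turn_pct

def band(score: float) -> str:
    """Map a 0-100 score onto the leaderboard's qualitative bands."""
    if score >= 85:
        return "Excellent"
    if score >= 70:
        return "Good"
    if score >= 50:
        return "Fair"
    return "Poor"

# A model that aces single-turn testing but collapses under
# multi-turn pressure still lands in a middling band:
print(band(combined_score(95.0, 30.0)))  # → Fair (62.5 combined)
```

The equal weighting is what prevents a first-instinct-only model from ranking well: 95% single-turn resistance with 30% multi-turn resistance averages out to a "Fair" 62.5.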
The single-turn/multi-turn distinction maps directly onto the threat taxonomy. Single-turn testing primarily exercises techniques executable in one prompt — direct prompt injection, jailbreak variants, harmful content elicitation. Multi-turn testing is where the more sophisticated objectives come alive: Goal Manipulation, Context Boundary Attacks, Masquerading and Impersonation, Persistence, and the social engineering dimension of Communication Compromise. If you're deploying a one-shot API tool, single-turn scores are most relevant. If you're deploying a conversational agent with memory and tool access — which is where enterprise AI is heading — multi-turn resistance is the number that matters.
A Framework Built on Three Taxonomies
The Cisco Integrated AI Security and Safety Framework is not a single taxonomy — it's three interlocking ones, each covering a distinct attack surface, and each designed to map back to a common structure so organizations can reason about risk consistently across all of them.
1. The AI Security Framework Taxonomy
This is the master taxonomy — the 19 objectives, 40 techniques, and 112 subtechniques that classify the full range of AI threats. It operates on four hierarchical levels: objectives (the "why" behind attacks), techniques (the "how"), subtechniques (specific variants), and procedures (discrete real-world implementations).
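The four-level hierarchy lends itself to a simple nested representation, so security findings can be tagged consistently against the taxonomy. The sketch below is one possible encoding (the dataclass design is mine; the example entries mirror names from the taxonomy):

```python
from dataclasses import dataclass, field

@dataclass
class Subtechnique:
    name: str
    procedures: list[str] = field(default_factory=list)  # real-world implementations

@dataclass
class Technique:
    name: str  # the "how"
    subtechniques: list[Subtechnique] = field(default_factory=list)

@dataclass
class Objective:
    name: str        # the "why" behind the attack
    risk_group: str  # e.g. "Common Manipulation Risks"
    techniques: list[Technique] = field(default_factory=list)

goal_hijacking = Objective(
    name="Goal Hijacking",
    risk_group="Common Manipulation Risks",
    techniques=[
        Technique("Prompt Injection", [
            Subtechnique("Direct Prompt Injection"),
            Subtechnique("Indirect Prompt Injection"),
        ]),
    ],
)
print(goal_hijacking.techniques[0].subtechniques[0].name)  # → Direct Prompt Injection
```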
The 19 objectives span three risk groups:
Common Manipulation Risks
- Goal Hijacking — direct and indirect prompt injection, multi-modal injection, goal manipulation
- Jailbreak — context manipulation, token exploitation, obfuscation, multi-agent jailbreak collaboration
- Masquerading / Impersonation — identity obfuscation, trusted agent spoofing
- Communication Compromise — agent injection, context boundary attacks, protocol manipulation (server rebinding, replay exploitation)
- Persistence — memory system persistence, configuration persistence, agent profile tampering
Data-Related Risks
- Feedback Loop Manipulation — training data poisoning, reinforcement biasing, reinforcement signal corruption
- Sabotage / Integrity Degradation — reasoning corruption, memory anchor attacks, token theft
- Data Privacy Violations — membership inference, data exfiltration via agent tooling, LLM data leakage, system prompt leakage
- Supply Chain Compromise — malicious package injection, dependency squatting, backdoors and trojans, unauthorized system/network access
- Model Theft / Extraction — API query stealing, weight reconstruction, model inversion
- Adversarial Evasion — environment-aware and model-selective evasion, targeted model fingerprinting
Downstream / Impact Risks
- Action-Space & Integration Abuse — tool poisoning, tool shadowing, parameter manipulation, insecure output handling
- Availability Abuse — compute exhaustion, model DoS, decision paralysis, cost harvesting
- Privilege Compromise — credential theft, permission escalation via delegation
- Harmful / Misleading Content — 25 subtechniques spanning CBRN risks, disinformation, unauthorized professional advice, PII/PHI/PCI privacy attacks
- Surveillance, Cyber-Physical Attacks, System Misuse, Multi-Modal Risks — rounding out the downstream picture as agentic systems interact with physical infrastructure and multi-modal inputs
2. The MCP Threats Taxonomy
Model Context Protocol (MCP) is an open standard that governs how LLMs interact with external tools, data sources, and execution environments. It's the plumbing that makes agentic AI work — allowing models to call APIs, query databases, execute code, and chain actions across systems. As MCP adoption accelerates, it has also become a significant attack surface in its own right.
Separate from the main taxonomy, Cisco has published a dedicated MCP threats taxonomy covering 14 threat types organized into four groups. Every threat maps back to the main taxonomy's objectives and techniques, but the MCP taxonomy adds MCP-specific indicators, severity levels, and mitigation guidance — making it directly operationalizable for teams building on MCP.
- Injection and Interpretation Threats — misleading the model into executing harmful or unintended tool calls
- Tool Integrity Threats — malicious or spoofed tools replacing trusted logic, intercepting data, or embedding hidden behavior
- Data Exfiltration and Access Threats — unauthorized file-system access, internal service reachability, secrets leakage
- Execution and Payload Threats — arbitrary or hidden code execution within the MCP environment via unsafe evaluation constructs, insecure tool definitions, dynamic imports, or serialized payload loading
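Because every MCP threat maps back to the main taxonomy while carrying its own indicators, severity, and mitigation guidance, a threat entry can be modeled as a record like the one below. The field names are an assumption on my part, not Cisco's published schema:

```python
from dataclasses import dataclass

@dataclass
class MCPThreat:
    name: str
    group: str                # one of the four MCP threat groups
    maps_to_objective: str    # main-taxonomy objective this rolls up to
    indicators: list[str]     # MCP-specific detection indicators
    severity: str             # e.g. "high"
    mitigation: str

# Hypothetical entry for a tool-integrity threat:
tool_shadowing = MCPThreat(
    name="Tool shadowing",
    group="Tool Integrity Threats",
    maps_to_objective="Action-Space & Integration Abuse",
    indicators=[
        "duplicate tool names across connected MCP servers",
        "tool description drift between sessions",
    ],
    severity="high",
    mitigation="Pin and verify tool definitions; alert on description changes.",
)
print(tool_shadowing.maps_to_objective)
```

Structuring entries this way is what makes the taxonomy "directly operationalizable": a security team can filter threats by the MCP servers and tools actually in its architecture, then pull the matching indicators straight into detection rules.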
3. The Supply Chain Threats Taxonomy
Supply chain risk is systematically underweighted in most current threat models.
Cisco has published a standalone supply chain taxonomy covering 22 threat types across four groups, each mapped to the main framework with file-type indicators, severity ratings, and a "Model Defense Layer" mitigation approach.
Artifact and Format Vulnerabilities
- Deserialization / serialized-code execution — malicious payloads embedded in model files that execute on load
- Format-specific backdoors — hidden triggers embedded in Lambda layers or similar constructs
- Tokenizer or template injection — malicious logic injected via tokenizer configs or server-side templates
- Hidden native libraries or executables — concealed binaries bundled within model artifacts
- Archive or compression abuses — malicious content hidden within compressed file structures
- Metadata/manifest tampering — altered model metadata used to misrepresent provenance or behavior
- Format inconsistencies and poisoned arrays — structural anomalies used to smuggle malicious data
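The deserialization threat at the top of this list is worth seeing concretely. Python's pickle format, which underlies some model file formats, lets an object specify arbitrary code to run at load time via `__reduce__`. The stdlib-only sketch below demonstrates the mechanism with a harmless `eval` payload in place of the classic `os.system(...)`; tensor-only formats like safetensors avoid this class of attack by design:

```python
import pickle

class MaliciousModelFile:
    """Stand-in for a trojaned pickle-based model artifact."""
    def __reduce__(self):
        # An attacker can return any callable plus arguments here.
        # We use a harmless eval so the effect is observable without
        # touching the system; a real payload would not be so polite.
        return (eval, ("__import__('platform').system()",))

blob = pickle.dumps(MaliciousModelFile())  # the "model file" bytes on disk
result = pickle.loads(blob)                # merely *loading* runs attacker code
print(f"code executed at load time, returned: {result!r}")
```

This is why "just loading the weights" is not a safe operation for pickle-based artifacts, and why scanning model files for serialized-code constructs belongs in a supply chain defense layer.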
Model Manipulation and Tampering
- Training data poisoning — corrupting training data to introduce bias or malicious behavior
- Weight poisoning or backdoor insertion — directly modifying model weights to embed hidden triggers
- Malicious adapters / merged-model trojans — fine-tuning adapters or model merges that introduce stealthy malicious behavior
- Annotation or label tampering — corrupting labeled datasets to skew model outputs
- Snapshot or shard manipulation — tampering with model checkpoints or distributed model shards
Dependency and Distribution Compromise
- Malicious package/tool injection — introducing compromised packages into the model's dependency chain
- Registry/mirror compromise (rug pulls) — replacing legitimate packages with malicious ones via compromised registries
- Typosquatting/namespace squatting — registering near-identical package names to intercept installations
- Dependency replacement/version downgrade — forcing resolution to vulnerable or malicious older versions
- CI/CD and artifact-swap compromises — tampering with build pipelines to swap legitimate artifacts for malicious ones
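The common defense across this group is digest pinning: record a cryptographic hash of each artifact when it is vetted, then refuse anything that does not match at install or load time (the same idea behind pip's hash-checking mode). A minimal stdlib sketch:

```python
import hashlib

def verify_artifact(data: bytes, expected_sha256: str) -> bool:
    """Compare an artifact's digest to the pinned value from a trusted lockfile."""
    return hashlib.sha256(data).hexdigest() == expected_sha256

artifact = b"model weights or package wheel bytes"
pinned = hashlib.sha256(artifact).hexdigest()  # recorded when the dependency was vetted

print(verify_artifact(artifact, pinned))        # → True: untampered artifact
print(verify_artifact(artifact + b"!", pinned)) # → False: swapped, downgraded, or rug-pulled
```

A hash pin defeats registry compromise, version downgrades, and artifact swaps in one stroke, because the attacker must produce a collision rather than just a plausible-looking package.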
Operational and Runtime Threats
- Arbitrary code execution on load — model artifacts that trigger code execution when loaded
- Unauthorized system access or privilege escalation — supply chain compromises that enable broader system access
- Unauthorized network access or data exfiltration — compromised models that beacon out or exfiltrate data at runtime
- Model extraction or weight reconstruction — using runtime access to steal model weights or architecture
- Runtime obfuscation or evasion — techniques that hide malicious behavior from monitoring and detection
Conclusion
The practical value here is three well-structured, interoperable taxonomies that give security teams a common vocabulary — a structured way to ask which objectives are relevant to your use case, which MCP-specific threats apply to your agentic architecture, and where your supply chain exposures actually lie.
Please reach out with questions about securing AI in your organization's network.