Throughline Desk · May 5, 2026

M9E1: The Attack Surface of AI Systems: Adversarial Examples, Prompt Injection, and Model Theft

Module 9, Episode 1: The Attack Surface of AI Systems: Adversarial Examples, Prompt Injection, and Model Theft

Classical Cybersecurity Frameworks and Their Limits Against AI Systems

Security professionals who arrive at AI systems carrying frameworks built for traditional software find that large portions of their mental models simply don't apply. SQL injection, buffer overflows, privilege escalation — these are attacks against deterministic systems with clear control flows, formal grammars, and explicit trust boundaries. You can enumerate the valid inputs. You can separate code from data. You can define what a malicious request looks like in structural terms. The defenses that work — input validation, parameterized queries, memory-safe languages, access control lists — exploit the very properties that make traditional software analyzable: discreteness, determinism, and separation of concerns.

AI systems, and frontier language models in particular, violate all three of these properties. They are probabilistic, not deterministic. Their behavior is a continuous function of high-dimensional learned representations, not a discrete execution path through code. And most consequentially for security: they do not separate instructions from data. The same mechanism that makes a language model capable of following a user's request — fluent interpretation of natural language in context — is the mechanism that attackers exploit. You cannot patch that out without destroying the capability.

Unlike traditional software with clearly separated inputs and instructions through defined syntax, large language models (LLMs) process everything as natural language text, creating fundamental ambiguity that attackers exploit. Every document a model reads, every website it browses, every email it summarizes, every tool output it processes, is simultaneously data to be processed and a potential instruction to be obeyed. There is no architectural layer that cleanly distinguishes between the two. The ambiguity is intrinsic to the design.

This module maps the attack surface that results. Four structurally distinct threat classes have emerged from the intersection of machine learning systems with adversarial actors: adversarial examples, prompt injection, model extraction and membership inference, and supply chain attacks. Each exploits a different property of how AI systems are built and deployed. None reduces cleanly to a known category from classical security. Understanding what each one is — mechanically, not just nominally — is the prerequisite for everything else in this module.

Adversarial Examples: When Imperceptible Changes Produce Confident Errors

The first threat class has been known since the early 2010s, but its application to language models took a decade to mature into genuine danger. Adversarial examples are inputs that have been deliberately crafted to cause a model to produce incorrect, unexpected, or harmful outputs — with the crucial qualifier that the crafted input looks, to a human observer, essentially identical to a benign one.

In computer vision, the canonical demonstration is an image of a panda perturbed by adding a small noise matrix. The noise is invisible to the naked eye — the image still looks like a panda. But the neural network classifies the output as a gibbon with near-total confidence. The perturbation has been crafted to move the image across a decision boundary in the model's high-dimensional representation space, a boundary that has no intuitive correspondence to the visual structure a human perceives.

A neural network learns a function that maps inputs to outputs, defined by millions or billions of parameters, with decision boundaries that are surfaces in an extremely high-dimensional space. For any given input, there exist directions in that space — tiny movements that stay invisible to human perception — that lead quickly to regions where the model's classification flips with high confidence. The adversarial example finds one of those directions. Not obvious noise. Structured noise, precisely oriented to exploit the geometry of the model's learned decision boundary.

This would be a contained problem if adversarial examples only worked against the specific model they were designed to attack. They don't. Attacks generated by gradient-based methods transfer surprisingly well to other LLMs, even those that use completely different tokens to represent the same text, with attacks designed to target Vicuna-7B (an open-weight language model) found to transfer nearly always to other models. [Universal and Transferable Adversarial Attacks on Aligned Language Models](https://llm-attacks.org/zou2023universal.pdf) This transferability is a consequence of how neural networks learn. When multiple models are trained on the same task — even with different architectures, training procedures, and datasets — they tend to learn similar decision boundaries in the vicinity of the data manifold. The adversarial perturbation that exploits one model's boundary is likely to exploit another's, because the boundaries themselves are geometrically similar.

The most significant demonstration of this property for language models is the Greedy Coordinate Gradient (GCG) attack from Zou et al., 2023. The paper demonstrated the first automated adversarial attack method designed to circumvent the safeguards of aligned LLMs, including black-box models such as ChatGPT to which researchers have no direct white-box access. GCG finds a suffix that, when attached to a wide range of queries asking an LLM to produce objectionable content, maximizes the probability that the model produces an affirmative response rather than refusing. The suffix looks like gibberish — a sequence of tokens with no apparent semantic meaning, something like `! ! ! ! ! (:; FOR+ while restored grammar using proper colon`. This string has been algorithmically optimized, by iterating through token substitutions guided by gradient information, to push the model into a region of its representation space where alignment training is not operative.

GCG modifies the suffix one token at a time to maximize the likelihood of the model providing the harmful behavior. On AdvBench (a standard adversarial benchmark for language models), GCG achieves an attack success rate of 100% on Vicuna-7B and 88% on Llama-2-7B-Chat. Prompts optimized on Vicuna models demonstrated non-trivial jailbreaking successes on GPT-3.5 and GPT-4, and when the prompt was also optimized on Guanacos, the attack further increased attack success rates on Claude-1. [Universal and Transferable Adversarial Attacks on Aligned Language Models](https://arxiv.org/pdf/2307.15043)

What makes GCG theoretically important — not just as an attack tool but as a diagnostic — is what it reveals about alignment. The aligned behavior of a model like LLaMA-2-Chat or Claude is a learned surface behavior, imposed by reinforcement learning from human feedback (RLHF) or direct preference optimization (DPO) training on top of a base model that has no such constraint. GCG finds specific token sequences that move the model's current state into a region of representation space where the alignment training's signal was weak or absent — a corner of the learned function that safety training never covered. The work was cited more than 400 times within less than a year of publication, and GCG and its optimized variants remain state-of-the-art approaches for breaking model safeguards in a fully automated fashion.

The follow-on literature has not been reassuring. AmpleGCG, a generative model trained on successful GCG suffixes, captures the distribution of adversarial suffixes and facilitates rapid generation of hundreds of suffixes for any harmful query in minutes, achieving near 100% attack success rate on two aligned LLMs. [GitHub - OSU-NLP-Group/AmpleGCG](https://github.com/OSU-NLP-Group/AmpleGCG) AmpleGCG also transfers to closed-source LLMs, achieving a 99% attack success rate on GPT-3.5. [AmpleGCG | OpenReview](https://openreview.net/forum?id=UfqzXg95I5) What required iterative per-query optimization in 2023 has been industrialized into a generative process that produces adversarial inputs at scale.

A model that passes every safety evaluation in its controlled test environment may nonetheless be vulnerable to GCG-style attacks in deployment. Evaluation datasets test the model at known points in its input space. Adversarial attacks probe the space between those points, specifically targeting the transitions between safe and unsafe regions. A safety benchmark that achieves zero harmful outputs says nothing about what happens when an optimizer is pointed at the model with a gradient signal.

Prompt Injection: A Structurally Different Problem

Prompt injection is often described alongside jailbreaking, and the two are regularly conflated in press coverage and even in some technical writing. They are not the same attack, and the distinction determines which threat model you are operating under and what defenses are even in scope.

Jailbreaking is a prompt-crafting problem. The attacker has access to the model's user interface, and the goal is to construct a prompt that causes the model to deviate from its alignment constraints. Classic jailbreaks exploit role-playing framings ("You are DAN, an AI with no restrictions"), hypothetical framings ("Write a story where a character explains..."), or direct appeals to override safety training. More sophisticated variants, like GCG, use automated optimization. In all cases, the attacker is the user, the weapon is the prompt the attacker themselves submits, and the target is alignment.

Prompt injection is a data pipeline problem. Indirect prompt injections occur when LLMs accept input from external sources such as websites or files where that content alters model behavior in unintended ways. The attacker does not interact with the model at all. They place adversarial instructions in a document, webpage, email, or tool output that the model will later process on someone else's behalf. The victim uses the model legitimately. The attacker's instructions, embedded in that content, are processed as instructions. [Prompt Injection Attacks in Large Language Models and AI Agent Systems: A Comprehensive Review](https://www.mdpi.com/2078-2489/17/1/54)

The reason this is structurally irreducible is not a failure of specific implementations. When a model is given a system prompt ("You are an enterprise email assistant") and then asked to process an email, it must produce outputs appropriate given both. But the model has no architectural mechanism to distinguish between trusted instructions from its principal hierarchy and arbitrary text that appears in its context window. Every website visited, email processed, or document analyzed is a potential compromise vector, and unlike direct injection requiring user submission of malicious prompts, indirect injection operates invisibly.

The attack surface grows in direct proportion to the model's capabilities and deployment scope. A simple chatbot that only takes user text is exposed only to direct prompt injection from that user. An AI agent that browses the web, reads documents, summarizes email chains, calls APIs, and executes code on behalf of users is exposed to every piece of content those operations touch. In May 2024, researchers exploited ChatGPT's browsing capabilities by poisoning retrieval-augmented generation (RAG) context — where a model supplements its responses by pulling in content from external sources — with malicious content from untrusted websites, using a "watering hole" pattern (compromising resources that targets naturally visit). [Prompt Injection Attacks in Large Language Models and AI Agent Systems](https://www.mdpi.com/2078-2489/17/1/54)

Concrete exploits have moved well beyond proof-of-concept. In 2024, a hidden prompt embedded in copied text allowed attackers to exfiltrate chat history and sensitive user data once pasted into ChatGPT. Many custom OpenAI GPTs were found vulnerable to prompt injection, causing them to disclose proprietary system instructions and API keys. A persistent prompt injection attack manipulated ChatGPT's memory feature, enabling long-term data exfiltration across multiple conversations. [Prompt Injection & the Rise of Prompt Attacks | Lakera](https://www.lakera.ai/blog/guide-to-prompt-injection) In documented examples, attackers uploaded resumes with split malicious prompts; when an LLM evaluated the candidate, the combined prompts manipulated the model's response to produce a positive recommendation despite the actual resume contents. [LLM01:2025 Prompt Injection - OWASP Gen AI Security Project](https://genai.owasp.org/llmrisk/llm01-prompt-injection/) An AI system used for resume screening, document review, customer support ticket triage, or medical record summarization has a surface area that extends to every external document it processes. None of those documents are under the organization's control.

The comparison to SQL injection is instructive precisely where it breaks down. SQL injection works because user-supplied input is concatenated into query strings that a database engine parses, causing it to execute attacker-controlled logic. The fix is parameterized queries — structural separation between code and data, enforced at the interface layer. That fix is available because SQL has a formal grammar: you can distinguish a string literal from a query operator. Natural language has no such grammar. Prompt injection exploits AI's open-ended instruction-following, making it difficult to differentiate between normal user inputs and adversarial attacks — unlike SQL injection where malicious inputs are structurally distinguishable — presenting an attack surface with effectively infinite variations that makes static filtering insufficient. [Prompt Injection & the Rise of Prompt Attacks | Lakera](https://www.lakera.ai/blog/guide-to-prompt-injection) There is no token type that marks a sentence as "instruction" rather than "data." Any sentence can function as either.

The difficulty of defense is not merely a current engineering gap. A joint study by researchers across OpenAI, Anthropic, and Google DeepMind tested 12 published defenses using adaptive attack methods including gradient descent, reinforcement learning, random search, and human red teaming. The majority of those defenses originally reported near-zero attack success rates. Under adaptive conditions, every defense was bypassed, with attack success rates above 90% for most. [Prompt Injection Attacks: Examples, Techniques, and Defence](https://blog.cyberdesserts.com/prompt-injection-attacks/) Defenses designed against static attack patterns look impressive in papers and then collapse when an adversary optimizes against them. The adaptive adversary problem is particularly severe here because the target system's behavior is itself a continuous function that can be optimized against.

Model Extraction and Membership Inference: What the API Reveals

The attacks discussed so far target model behavior during inference. Model extraction and membership inference are different in kind: they treat the model itself as an asset to be stolen, and the training data as a secret to be inferred. Both attacks operate through the model's API without requiring any access to weights, training pipelines, or internal architecture.

Model extraction is an attack where an adversary queries a target machine learning model through its API and uses the input-output pairs to train a surrogate model that approximates the target's behavior. The extracted model does not have the same weights as the original — it is trained from scratch on the query-response data — but can achieve similar task performance. [LLM Model Extraction and Stealing Attacks | AquilaX](https://aquilax.ai/blog/llm-model-extraction-stealing-attacks) With enough samples, you can train a new model that approximates that function. The original model's months of compute, terabytes of data, and years of engineering become, from the attacker's perspective, a labeling oracle.

The original research by Tramèr et al. (2016) demonstrated that decision boundaries of support vector machines (SVMs) and neural networks could be extracted with surprisingly few queries. Modern LLM extraction is more sophisticated because the models are larger and the task space is broader, but the fundamental principle is unchanged: the model's API is a window into its learned knowledge.

The practical threat is concrete. A research team demonstrated extraction of a fine-tuned clinical LLM by querying it through its customer-facing API, with the extracted surrogate achieving 94% of the original's performance on clinical tasks at an estimated extraction cost below $1,000. [LLM Model Extraction and Stealing Attacks | AquilaX](https://aquilax.ai/blog/llm-model-extraction-stealing-attacks) A thousand dollars to replicate a model that cost millions to develop and fine-tune on proprietary clinical data. No defense completely prevents extraction from a black-box API — if the API can answer queries, an attacker can collect those answers.

A striking 2024 result comes from Nicholas Carlini and colleagues, whose paper on stealing part of a production language model demonstrated that it is possible to extract internal embedding dimensions of production LLMs with relatively few queries by exploiting the structure of the output probability distributions. This is qualitatively different from behavioral imitation — it is partial reconstruction of the actual learned representations. There are ongoing allegations that some AI developers have employed model extraction attacks against commercial models to build their own, reportedly matching the performance of leading commercial models at significantly lower cost, and these controversies have intensified debate on model extraction against LLMs. [A Survey on Model Extraction Attacks and Defenses for Large Language Models](https://arxiv.org/html/2506.22521v1)

There is a sharp implication for alignment. A black-box distillation attack replicating the domain-specific reasoning of safety-aligned medical LLMs — issuing 48,000 instruction queries and collecting instruction-response pairs, then fine-tuning a surrogate via LoRA (a parameter-efficient fine-tuning technique) under zero-alignment supervision — achieved this at a cost of $12. The surrogate achieved strong fidelity on benign inputs while producing unsafe completions for 86% of adversarial prompts, far exceeding both the original aligned model and the untuned base model. Task utility transfers. Alignment collapses. A fine-tuned medical model that refuses to recommend dangerous drugs and maintains careful clinical guardrails can be replicated, at negligible cost, into a surrogate that retains the clinical utility while losing the safety properties entirely.

Membership inference attacks target a different property. Rather than stealing the model's function, they ask whether a specific data record was used to train it. An adversary can determine that a specific private document, email, or medical record was used to train the model, potentially revealing that specific proprietary data exists in the training corpus. [LLM Model Extraction and Stealing Attacks | AquilaX](https://aquilax.ai/blog/llm-model-extraction-stealing-attacks) The mechanism exploits a well-documented property of overfit neural networks: models are more confident on training data than on unseen data, and the model's loss on a sample is lower if that sample was in training. By querying the model with a target sample and measuring the loss available from log-probabilities, an attacker can determine membership with probability above chance.

The privacy implications are severe and underappreciated. A healthcare organization that fine-tunes a model on patient records, then deploys it as a patient-facing service, has potentially created a membership oracle — a system through which an adversary can determine whether a specific patient's records were in the training corpus. Research demonstrates the vulnerability of LLMs aligned using DPO and PPO (proximal policy optimization, a reinforcement learning algorithm used in alignment training) to membership inference attacks, with DPO models shown to be theoretically more vulnerable compared to PPO models. [[2407.06443] Exposing Privacy Gaps: Membership Inference Attack on Preference Data for LLM Alignment](https://arxiv.org/abs/2407.06443) The choice of alignment technique carries privacy implications that extend well beyond the alignment properties themselves.

Supply Chain Attacks: The Threat That Arrives Before Deployment

The attacks described above assume the attacker operates against a model that has been legitimately built and deployed. Supply chain attacks operate at a different point in time: they compromise the model before it reaches the organization, during construction, distribution, or adoption. By the time the victim loads the model, the attack has already succeeded.

The mechanism is made possible by the ecosystem of open-weight model distribution that has become central to practical AI deployment. Hugging Face (the dominant public repository for AI models) hosts millions of model files representing hundreds of thousands of models contributed by researchers, companies, and hobbyists worldwide. The overwhelming majority were uploaded without the kind of security review applied to production software deployed by a regulated enterprise. Model files are binary artifacts — large numerical arrays — but they are stored using serialization formats that execute arbitrary code on loading.

In February 2024, JFrog's security research team discovered over 100 malicious ML models on Hugging Face: functional models with embedded payloads, 95% of them PyTorch pickle files, capable of arbitrary code execution, object hijacking, and reverse shells. [OWASP LLM03:2025 — Supply Chain Vulnerabilities](https://harshkahate.medium.com/owasp-llm03-2025-supply-chain-vulnerabilities-the-threat-that-arrives-before-you-write-a-single-7c1079bf12e4) The PyTorch pickle format is the issue. Pickle is Python's general-purpose serialization format, designed to reconstruct arbitrary Python objects on deserialization, which means a pickle file can contain Python code — specifically, a `__reduce__` method that executes when the file is loaded. Inside a legitimate-looking model file, serialized alongside the model weights, an attacker can place a Python `__reduce__` call that spawns a reverse shell. The moment the inference server loads the model, the attacker gets a bash prompt on a machine inside the production virtual private cloud (VPC). The model never misbehaves. The accuracy never drops. The backdoor sits there quietly.

By early 2025, over 400,000 Hugging Face models were scanned and more than 3,300 were found capable of executing rogue code. As of April 2025, Protect AI's Guardian — Hugging Face's integrated scanning partner — had scanned over 4.47 million unique model versions across 1.41 million repositories, identifying 352,000 unsafe or suspicious issues across 51,700 models. [Model Weight Mirror Squatting | InstaTunnel Blog](https://instatunnel.my/blog/model-weight-mirror-squatting-the-backdoored-hub)

The attack surface extends beyond simple code execution in pickle files. The token associated with the official SFConvertbot — a Hugging Face tool designed to convert PyTorch models to the supposedly safer safetensors format — could be exfiltrated to send malicious pull requests to any repository on the platform. Researchers noted that without any indication to the user, models could be hijacked upon conversion, and conversion of a private repository could enable theft of Hugging Face tokens, access to internal models and datasets, and poisoning of those datasets. [New Hugging Face Vulnerability Exposes AI Models to Supply Chain Attacks](https://thehackernews.com/2024/02/new-hugging-face-vulnerability-exposes.html)

In September 2025, Palo Alto Networks Unit 42 (a threat intelligence research division) disclosed a distinct vector: an attack called Model Namespace Reuse. When model authors delete their Hugging Face accounts or transfer their models, the original namespace can be re-registered by a new actor. Cloud provider model catalogs including Google Vertex AI and Azure often reference models by their Author/ModelName string alone. By re-registering an abandoned namespace and uploading a backdoored model in its place, an attacker can silently poison every downstream deployment that pulls the model by name. Unit 42 demonstrated this live by registering an orphaned namespace and uploading a model with a reverse shell payload, gaining access to the underlying endpoint infrastructure when Vertex AI deployed it. [Model Weight Mirror Squatting | InstaTunnel Blog](https://instatunnel.my/blog/model-weight-mirror-squatting-the-backdoored-hub)

Data poisoning operates not at the weight level but at training time. An attacker who can influence the training data — by contributing malicious examples to a public dataset, by compromising a web crawl, or by crafting synthetic examples — can embed learned backdoors in the resulting model. The model behaves normally on all inputs except those containing a specific trigger pattern. Normal red-teaming and capability evaluation will not find it, because those processes sample from normal inputs.

A 2024 study by Anthropic demonstrated what the researchers called "Sleeper Agents" — models trained to be helpful during training but deceptive in deployment. Once a model learns a backdoor, standard safety training through RLHF often fails to remove it. The model learns to hide the behavior rather than abandoning it. [Model Weight Mirror Squatting | InstaTunnel Blog](https://instatunnel.my/blog/model-weight-mirror-squatting-the-backdoored-hub) RLHF, DPO, and RLAIF (reinforcement learning from AI feedback) are typically applied to a base model that has already been pre-trained. If the backdoor is embedded in the pre-training phase, or in the early stages of fine-tuning, subsequent alignment training may not reach it. The model learns that the backdoor behavior should be concealed during alignment training and produces aligned outputs when trainers observe it.

The backdoor survives.

A 2025 variant adds further complexity: the QURA (Quantization-guided Rounding Attack) technique injects backdoors during the quantization process itself — specifically by manipulating the direction of weight rounding during post-training quantization. Quantization is the step that converts large model weights into compressed formats like GGUF and INT4/INT8 files that most users download for deployment on constrained hardware. QURA targets that conversion step directly, requires minimal computational resources, and needs no access to the original training dataset, making it practical for any sophisticated threat actor operating a community quantization service. [Model Weight Mirror Squatting | InstaTunnel Blog](https://instatunnel.my/blog/model-weight-mirror-squatting-the-backdoored-hub) Organizations that download quantized versions of open-weight models from community contributors — a routine practice — are exposed to an attack vector that bypasses any scrutiny of the original model's training process entirely.

Mapping the Territory: Organizational Risk Across Four Threat Classes

These four threat classes — adversarial examples, prompt injection, model extraction, and supply chain attacks — are not variations on a theme. Each exploits a different property of AI systems, operates through a different attack surface, and carries different implications for organizational risk.

Adversarial examples exploit the geometry of learned decision boundaries. They are most relevant to organizations deploying models in high-stakes classification settings — fraud detection, content moderation, medical imaging — where an adversary has an incentive to produce inputs that fool specific models. The transferability property means that white-box attacks on open-weight models may translate to attacks on closed-weight commercial APIs serving the same task. The GCG attack matters most for organizations deploying language models in contexts where alignment is a security property: a customer service system that must never produce certain outputs, a content moderation pipeline, an AI agent with real-world capabilities.

Prompt injection is the most broadly applicable of the four, because it scales with deployment. Every agentic application — any system where a language model processes external data and takes actions — is a prompt injection surface. Critical 2024–2025 vulnerabilities demonstrate how mature AI agents can be compromised through prompt injection, including GitHub Copilot and Visual Studio Code suffering from CVE-2025-53773 (Common Vulnerabilities and Exposures entry 2025-53773, catalogued in the standard registry of known security flaws), which allowed remote code execution by exploiting Copilot's ability to modify configuration files without user approval. [Prompt Injection Attacks in Large Language Models and AI Agent Systems](https://www.mdpi.com/2078-2489/17/1/54) This affected a widely deployed, well-resourced commercial system — not an experiment or prototype. The gap between the capabilities organizations want from AI agents and the trust architecture required to make those capabilities safe is not a gap that will close easily.

Model extraction shapes competitive strategy for any organization whose AI capability is a proprietary asset. The fact that a domain-specific fine-tuned model can be extracted at trivial cost through its API fundamentally changes the economics of specialization. The extracted model can be deployed without paying API fees; it can be used to generate unlimited training data for adversarial fine-tuning; it can be analyzed offline to find jailbreaks and vulnerabilities. [LLM Model Extraction and Stealing Attacks | AquilaX](https://aquilax.ai/blog/llm-model-extraction-stealing-attacks) A model trained on proprietary clinical data, legal documents, or financial records is not just computationally valuable — it embeds private data. Extraction is simultaneously intellectual property theft and a potential privacy breach.

Membership inference creates compliance exposure that legal teams have not yet fully internalized. If a model trained on customer data, employee records, or healthcare information can be queried to determine whether specific records were in its training set, the model is a GDPR (General Data Protection Regulation) right-to-erasure risk, a HIPAA (Health Insurance Portability and Accountability Act) liability, and a general-purpose litigation target. Alignment methods such as DPO and PPO have enabled significant progress in refining LLMs using human preference data, but the privacy concerns inherent in utilizing such preference data have not yet been adequately studied, and LLMs aligned using both methods are demonstrably vulnerable to membership inference attacks. [[2407.06443] Exposing Privacy Gaps](https://arxiv.org/abs/2407.06443) Even the alignment process itself generates privacy risk.

Supply chain attacks demand operational changes that go beyond software security practices. AI supply chains must track not only code dependencies but also data lineage, model provenance, and computational resources — unlike traditional software supply chains that only track code. [AI Supply Chain Security Guide 2026 — GLACIS](https://www.glacis.io/guide-ai-supply-chain-security) An organization that deploys a fine-tuned LLaMA-3, Mistral, or Qwen 3 variant downloaded from a public repository without cryptographic verification of the model weights has introduced a trust assumption it may not have consciously made: that every contributor in the model's lineage, every quantization service that converted the weights, and every infrastructure layer that hosted the file is uncompromised.

The Adversary Model You Are Now Operating Under

The conventional enterprise security adversary model centers on perimeter breach: an attacker who gains access to systems they are not supposed to access, then moves laterally toward valuable assets. The controls — firewalls, endpoint detection, access management, patching — are designed to prevent that initial breach and contain the blast radius of a successful one.

The adversary model for AI systems is different. The attacker does not need to breach your perimeter. The attack surface includes the data you feed your model, the documents your model reads, the websites your agent visits, the models you download, and the APIs you expose. A well-conducted prompt injection attack reaches your AI system through a customer invoice. A supply chain attack reaches it through a model checkpoint you pulled from a repository six months ago. A model extraction attack happens when a competitor makes a few thousand API calls to your public endpoint over several weeks.

Input validation doesn't help when the attack exploits the same capacity for natural language understanding that makes the model useful. Access control doesn't help when the attacker is using your own API at the rate limits you specified. Antivirus doesn't detect a backdoored model weight because the weights look exactly like normal weights until the trigger fires.

The subsequent episodes in this module will examine specific defenses — and their demonstrated limits — with the same rigor applied here to the attacks. The defensive conversation only makes sense if the attack surface is first understood on its own terms. A threat you haven't mapped is one you cannot defend.