M9E1: Classic Deception Meets Machine Learning
Module 9, Episode 1: Classic Deception Meets Machine Learning
The Art of the Lie Has a New Listener
There is a tendency among technically oriented analysts to treat adversarial machine learning as a novel security discipline, something that emerged with the deep learning era and requires its own entirely new vocabulary. That framing is partly true and mostly misleading. The technical mechanisms are genuinely new. The underlying logic — the adversary's strategic reasoning about why the attack works and what it is attacking — is ancient. Deception doctrine precedes machine learning by millennia, and the operational principles that made deception effective against human analytical systems in 1944 are the same principles that make it effective against statistical reasoning systems in 2026. Understanding that continuity is what allows the analyst to anticipate the adversary's next move rather than simply catalog the last one.
Denial and deception, known in the profession as D&D, is the formal discipline concerned with how an adversary shapes the information environment to produce false beliefs in a target's mind — or prevents true beliefs from forming at all. The distinction between the two halves matters. Denial is active suppression: cover, concealment, camouflage, emissions control, anything that removes true signals from the collection environment. Deception is active injection: simulation, disinformation, provocateurs, anything that introduces false signals that the target will mistake for true ones. Effective deception operations almost always require both simultaneously, because a false signal standing alone in a void is obviously false. It only becomes persuasive when the true signal is absent, or when the false signal arrives through a channel the target already trusts.
The NIST (National Institute of Standards and Technology) Adversarial Machine Learning taxonomy released in March 2025 — the current authoritative framework for classifying attacks on AI systems — organizes adversarial attacks by the stage of the machine learning lifecycle at which they occur, the attacker's objectives, and the attacker's level of knowledge about the target system. The taxonomy recognizes that adversarial attacks occur at either the training stage or the deployment stage: during training, the attacker might control part of the training data, their labels, the model parameters, or the code of the machine learning algorithms, resulting in different types of poisoning attacks; during deployment, the adversary could mount evasion attacks to change predictions, or privacy attacks to infer sensitive information about the training data or model. This is a technically precise framework. Notice what it describes in strategic terms: the attacker is either removing true signal from the model's learning environment, which is denial, or injecting false signal into it, which is deception. The names change. The moves do not.
Operation Fortitude and the Structural Logic of Collection Poisoning
To understand why this matters operationally, it is worth spending real time on one historical case before moving to the present — not as a metaphor, but as a structural dissection.
Operation Fortitude, during World War II, was an Allied deception operation intended to make Nazi Germany's high command believe that the main Allied invasion of Europe in 1944 would not be at Normandy. Organized by Allied military officials beginning in 1943, it aimed to deceive through the use of a fake army. The operation succeeded not because the German analytical apparatus was stupid, but because the British had achieved something far more powerful: they had compromised the adversary's collection channels themselves. The Double Cross System — a World War II counter-espionage and deception operation run by Britain's MI5, the domestic counterintelligence service — captured, turned, or received the surrender of Nazi agents in Britain, then used those same agents to broadcast disinformation back to their Nazi controllers.
The operational architecture deserves careful attention. Deception was effected through physical deception, disinformation through diplomatic channels for passage via neutral countries to the Germans, wireless traffic to suggest the creation of nonexistent formations, use of German agents controlled by the Allies through the Double Cross System to send false information to German intelligence, and the public presence of notable officers associated with phantom groups such as FUSAG (the First United States Army Group, a fictitious formation designed to suggest a massive invasion force massing opposite Pas-de-Calais). Most notably, General George Patton was publicly associated with this phantom command.
Agent Garbo — Juan Pujol Garcia — never left Lisbon but ran an entire invented intelligence network of 27 fictional agents whose reports German high command trusted without question. His German handlers were so impressed with his work that they awarded him the Iron Cross. The credibility German High Command gave Garcia's information meant he was able to convincingly sell the Allied deception, leaving the Germans entirely unaware they were being manipulated.
Here is the structural insight that makes this more than interesting history. The Germans were not deceived about facts they could independently verify. They were deceived through channels they had no reason to doubt, about matters they could not directly observe. Their collection system — the network of agents reporting from Britain — was the attack surface. Once that surface was compromised, the quality of the German analytical machinery was irrelevant. The Germans could apply any amount of reasoning rigor to Garbo's fabrications and conclude correctly that FUSAG was assembling opposite Pas-de-Calais, because every data point in their collection environment supported that conclusion. German intelligence used the agent reports to construct an order of battle for the Allied forces that placed the center of gravity of the invasion force opposite Pas-de-Calais, the point on the French coast closest to England. The deception was so effective that the Germans kept 15 divisions in reserve near Calais even after the invasion had begun, lest it prove to be a diversion from the main assault.
There was a second critical element that the Allies possessed and that most analysts overlook: they could monitor the effectiveness of their deception in real time. Ultra — the intelligence derived from Bletchley Park, where a team had broken the German Enigma coding system — allowed the Allies to check the success of any information or misinformation they planted. The Germans, convinced that Enigma could not be broken, remained entirely unaware of this fact and their consequent vulnerability. The Allies could intercept and read the decoded responses. The deception operators knew which of their planted stories were believed, which were questioned, and which were reinforcing the target's existing mental model — what intelligence professionals call the adversary's analytic pre-commitments. This feedback loop between deception output and adversary belief state is a critical structural feature that reappears, with eerie precision, in modern adversarial machine learning attacks.
Deception methods had been practised and refined over the previous three years of the war, so that by the time of Operation Bodyguard — the broader deception plan of which Fortitude was a component — it was recognized that the foundation of all such operations was to support and encourage the expectations of the enemy, carefully reinforcing what you wanted them to believe, and ensuring that information reached the highest levels of command. The formula is exact: exploit prior beliefs, corrupt the trusted collection channel, ensure the false signal is self-consistent. In the AI context, only the definition of "collection channel" and "belief state" changes.
How the Map Transfers: Poisoning the Statistical Learner
A machine learning model does not have beliefs in any philosophically meaningful sense. It has something functionally equivalent: a learned representation of the world encoded in its weights, which determines how it classifies new inputs, what it retrieves from memory, and how it prioritizes competing signals. That representation was built from training data, and its behavior at inference time is shaped by whatever context — documents, tool outputs, retrieved passages — it is given to reason over. Both of those inputs are attack surfaces. The adversary who understands D&D doctrine understands exactly what to do with them.
The 2025 NIST report substantially expands its adversarial machine learning attack taxonomy to cover advanced generative AI threats, including misuse and prompt injection attacks, delineating clearly between attack types affecting integrity, availability, and privacy. The taxonomy formalizes a threat landscape that maps directly onto the D&D categories: training-time poisoning corresponds to corrupting the collection channel before the model forms its representation of the world; inference-time evasion and prompt injection correspond to corrupting the context the model is given at the moment of analysis; and backdoor attacks correspond to the double agent — a trusted source that behaves normally under most conditions and activates only when the adversary wants.
The training-data attack is the most foundational. The NIST 2025 taxonomy includes clean-label poisoning: attacks that subtly corrupt data without altering labels, making them harder to detect. A clean-label attack does not mislabel training examples. It introduces new examples, carefully crafted, that push the model's decision boundary in a direction the attacker controls without ever appearing anomalous to a human reviewer. This is Garbo's network applied to gradient descent: plausible-sounding reports that cumulatively orient the model's representation of the world in a direction the attacker chose. The model trained on contaminated data will, under normal operating conditions, perform normally. It will only betray its conditioning when the adversary triggers it — which may be months or years after the contamination was introduced.
The inference-time attack is more immediately dangerous in 2026 because so much AI deployment relies on RAG — retrieval-augmented generation — systems where a model does not rely solely on trained weights but pulls in relevant documents from an external knowledge base at runtime. RAG was designed to solve the problem of stale knowledge and hallucination. It has created a new collection channel that can be poisoned in real time. The attacker does not need to touch the model's weights. They need only to ensure that when the model's retrieval system queries for information about a specific topic, it returns documents containing the attacker's preferred narrative — or, more insidiously, documents that appear authoritative while containing hidden instructions for the model itself.
That second mechanism is indirect prompt injection, and it is currently the most operationally active threat vector in the landscape. Unlike direct prompt injection — where a user types a malicious instruction directly into the model — indirect prompt injection hides adversarial commands inside ordinary web content, documents, or tool outputs that the model will process as trusted sources. The attack hides covert instructions inside ordinary web pages, waiting for an AI agent to crawl, summarize, or analyze that content and then execute the hidden instructions. The model cannot reliably distinguish between the document's authentic content and the injected command because both arrive through the same trusted channel. This is the Double Cross System: the source is trusted, the content is compromised.
The explicit recognition of RAG and agentic systems as distinct attack surfaces in the NIST taxonomy reflects the operational reality. The threat is no longer abstract.
When the Adversary Observes Your Outputs
The Allies at Bletchley Park had a feedback loop: they could read German signals to determine what the enemy believed and adjust their deception accordingly. In adversarial machine learning, the equivalent is the black-box query — and the attacker does not need Ultra to access it. Any publicly accessible AI system gives the adversary exactly what they need to calibrate their attack.
Black-box attacks assume the attacker operates with minimal, sometimes no, knowledge about the machine learning system. An adversary might have query access to the model but no other information about how the model is trained. These attacks are the most practical since the attacker has no knowledge of the AI system's internals and uses only interfaces readily available for normal use. That is NIST's own language describing the adversary's structural advantage.
The attacker does not need your code, your weights, your system prompt, or your training data. They need the same interface your analysts are using. They can query the model with variations on a theme, observe how its outputs shift, infer what it is sensitive to, what it trusts, what framing elicits the response they want — and then design their deception operation accordingly.
This asymmetry is brutal. Your system has to be right every time. The adversary only has to find the one framing that works, and they have unlimited attempts. Every public-facing AI-augmented intelligence product is simultaneously a capability and an intelligence collection vulnerability: it tells sophisticated adversaries, through its outputs, something about the shape of the model's learned representation and the limits of its skepticism.
The feedback loop goes further when systems are used for recurring analysis on contested topics. If an adversary watches how your AI-augmented system's assessments evolve in response to events — what evidence it treats as probative, what sources it weights heavily, where it hedges and where it commits — they can construct a reasonably accurate model of your model. They learn which of their narrative framings are being surfaced in your outputs, which are being suppressed, and where the gaps are. They learn, in other words, which collection channels you trust. Then they poison those channels.
This is exactly the strategy documented in the Pravda operation. The Pravda network — a collection of pro-Kremlin websites that has used AI tools to flood the internet with millions of pieces of Russian propaganda — is almost certainly intended to shape the responses of large language models. The network was not built for human readers. These articles are not intended to be read directly. The Pravda network receives minimal traffic, averaging fewer than 1,000 monthly visits per domain. Its purpose is laundering Kremlin talking points into the broader information ecosystem. The attack vector is the training pipeline, not the audience. Generated material is optimized to game search engines, with the intention that false narratives appear more often and more prominently in results for searches on a given topic. The frequency of false narratives in search results correlates to the likelihood of their integration into training corpora. Gaming one system allows Russia to groom another.
By November 2025, the DFRLab — the Digital Forensic Research Lab, a nonpartisan research organization focused on exposing disinformation — documented the scale of the operation in concrete terms. Roughly 40,000 pieces of English-language Pravda content had been archived in Common Crawl, the open web archive that many large language model developers use as a primary training data source. Although this figure is paltry compared to the billions of webpages in the full Common Crawl database, it had grown by orders of magnitude since November 2024, when Pravda had only 37 articles across the entire archive. The DFRLab's April 2026 analysis found that the content was optimized for machine ingestion, not human persuasion — robots.txt files configured to invite crawling, sitemaps designed for indexing. The target was not a reader. The target was Common Crawl's archive.
A NewsGuard audit of ten leading AI tools, including ChatGPT-4o, Google's Gemini, and Microsoft's Copilot, found that these models repeated Pravda's false narratives 33% of the time. The attack also operates at the retrieval layer, not just training. DFRLab researchers found that content posted by Pravda news portals had found its way into generated responses when they queried popular AI chatbots. The chatbots did not disclose the network's links to Russia despite including sources of reports proving so. The model surfaces the contaminated source with the same confidence it surfaces clean sources. To the model, there is no visible difference.
The embedding of Pravda network websites into Wikipedia is particularly concerning given Wikipedia's significant role as a primary knowledge source for large language models. The attack is multi-layered: poison Wikipedia, which poisons LLM training data, which poisons LLM retrieval, which poisons analyst outputs. Each layer of laundering adds a degree of apparent legitimacy. The contamination operates below the threshold of conscious detection — designed to work inside the model's epistemic comfort zone, not against it.
The Supply Chain Is the Attack Surface
The inference-time attack against a deployed model has a cousin that is in some ways even more dangerous: the attack against the model's tool ecosystem. AI agents — systems that not only reason but take actions, calling external APIs, browsing the web, reading emails, executing code — depend on a supply chain of plugins, skills, and tool integrations. Every element of that supply chain is a potential collection channel. Every collection channel that can be trusted can be poisoned.
In January 2026, an attacker executed precisely this strategy against OpenClaw, a fast-growing open-source AI agent framework. A supply chain attack called ClawHavoc compromised ClawHub, the official skill marketplace for OpenClaw — the platform where developers publish and users install modular capability extensions for the agent. Researchers uncovered at least 1,184 malicious "Skills," plugin-style packages hidden among the legitimate ones. Attackers registered as ClawHub developers, creating and mass-uploading malicious Skills disguised as legitimate plugins. Using social engineering techniques, they tricked users into downloading and installing these malicious Skills, thereby implanting and executing malicious code within users' systems.
The tradecraft detail is worth examining. Threat actors uploaded skills with professional documentation and plausible-sounding names — "solana-wallet-tracker," productivity tools, utility scripts. When users installed these skills, they were presented with fake prerequisite instructions: directions to run an external script that would "set up dependencies." This is simulation — the appearance of legitimate tooling, credibly constructed, designed to pass casual inspection. OpenClaw skills are natural-language instructions, not compiled code. The malicious payload sits in plain English inside a SKILL.md file. Signature-based malware scanners have no mechanism for flagging that. The attackers had identified precisely which detection layer did not exist and designed their attack to operate in that gap. Classic cover and concealment.
The structural implication is more important than any specific payload. Unlike passive generative language models such as ChatGPT, OpenClaw belongs to the category of agentic AI. The core value of such systems no longer lies in generating text in response to prompts, but in action — they not only generate text but also execute tasks on behalf of users. When the collection channel is a system that can read your email, execute your code, call your APIs, and forward your credentials, poisoning the tool is not merely an information operation. It is direct access. Skills execute with full local privileges and no sandboxing. When a user installs a skill in OpenClaw, they are allowing third-party code to execute locally with the full permissions of the host environment.
The MCP — Model Context Protocol, the standard that lets AI agents call external services and tools — threat map published in March 2026 catalogued 38 distinct threat categories across both OWASP's AI top-10 (OWASP, or the Open Web Application Security Project, is a widely used framework for categorizing application security risks) and an agentic applications framework. That catalog formalized what ClawHavoc demonstrated in practice: the protocol that lets AI agents call external services is itself an attack surface. Tool description poisoning — where an attacker manipulates how an agent understands what a tool does — is structurally equivalent to a turned agent describing enemy positions: accurate enough to be trusted, skewed enough to serve the adversary.
"The Model Said It" Is Not an Answer
There is a conclusion that follows from all of the above, and it is not comfortable. "The model said it" should function as the weakest possible form of epistemic support in any contested information environment, not the strongest.
This requires some examination, because the intuition runs backward for many analysts encountering AI for the first time. The model has processed vast amounts of information. It has synthesized sources that no single analyst could read. It produces confident, fluent summaries. It does not hedge the way analysts hedge. This produces a cognitive shortcut: the model has done the work, so this is what the evidence says.
Consider what that argument looks like from inside the Fortitude operation. The German analytical system had processed thousands of agent reports. It had synthesized sources that no single Abwehr officer could personally verify. It produced a confident assessment: the main invasion would land at Pas-de-Calais. Every collection channel agreed. The assessment was entirely wrong, precisely because every collection channel agreed — because agreement across compromised sources is confirmation of coordination, not confirmation of truth.
A machine learning model trained on contaminated data or operating over a poisoned retrieval corpus does not know it has been deceived. There is no internal warning state, no anomaly alert, no epistemic flag. Users consulting large language models are increasingly receiving contaminated responses: authoritative-seeming answers imbued with conspiracy theories, fabricated narratives, or questionable interpretations. The contamination does not announce itself. The model produces its output with the same fluency and the same absence of hedging that characterizes outputs drawn from clean sources. The analyst looking at the output cannot, from the output alone, distinguish between the two cases.
The impact may often be subtle — simply encouraging an argument to begin from a pro-Kremlin premise. But that aligns closely with Russian information warfare strategy. Deception does not need to be dramatic to be effective. A model that frames a question slightly differently, that weights one source above another, that surfaces three articles supporting one interpretation and one supporting the alternative — that model is already useful as a deception instrument, even if no single output is an outright fabrication.
The analyst using AI in a contested environment must hold two claims simultaneously. The model's output is evidence about what the model has learned from a particular information environment. It is not evidence about the information environment itself. The moment you treat the model's confidence as a substitute for source verification, you have handed the adversary the same structural advantage the British had over German high command in 1944: the ability to shape your conclusions by controlling your collection channels, while you provide the analytical horsepower.
The adversary advantage in AI-augmented analysis requires no technical sophistication. The most practical attacks assume the attacker has no knowledge of the AI system's internals and uses only interfaces readily available for normal use. The adversary needs three things: knowledge of which information sources your system trusts; the ability to introduce content into those sources; and enough output access to calibrate whether the operation is working. In most operational contexts, all three are trivially satisfied. Your AI summarizer reads the web. The web is writable. Your summarizer's outputs are observable.
The analyst who understands D&D doctrine has one tool the model lacks: the capacity to ask not just what does the evidence say but who constructed this evidence, and for whom. That was what Bletchley Park answered for the Allied commanders — not by evaluating the content of Abwehr reports, but by monitoring whether the German system was receiving and believing the content the Allies had planted. No amount of model capability replaces that question.
The hardest version of the problem is not detecting a single poisoned document. It is recognizing that your entire collection environment has been shaped — over months, systematically — to produce confidence in a false picture. The Germans never doubted Garbo. The models that repeated Pravda's narratives 33% of the time were not broken. They were working exactly as designed, on exactly the data they had been given. The long tail of AI development means that capable open-weights models may be encountering Pravda material in training corpora right now, without any signal that they have done so.
That is the tradecraft challenge Module 9 is built to address: not whether AI is useful for intelligence analysis, but what kind of epistemic discipline is required to use it in an environment where someone is actively trying to corrupt what it knows.
The next episode examines the specific mechanics of open-source intelligence poisoning at scale — how coordinated inauthentic behavior, AI-generated content, and systematic manipulation of open-source collection channels create the information environment that AI-augmented analysis then operates over, and what structural indicators allow analysts to detect that the environment itself has been shaped.