Module 2, Episode 2: Structured Analytic Techniques in the Real World
The Accuracy Illusion
Structured Analytic Techniques (SATs) do not reliably make analysts more accurate. This is closer to a consensus finding among researchers who have tested the proposition than a controversial claim. The community's continued investment in SATs, mandated in Intelligence Community Directive 203 and embedded in training pipelines from the Sherman Kent School — the CIA's primary analytic training institution — to every major contractor's analytic standards document, is therefore worth examining carefully. The investment may be justified, but it is being justified by the wrong argument — and the wrong argument produces the wrong kind of failure.
The evidentiary anchor is a 2019 randomized study by Mandeep Dhami and colleagues published in Applied Cognitive Psychology. The design was straightforward: fifty intelligence analysts were randomly assigned to use Analysis of Competing Hypotheses (ACH) or not when completing a hypothesis-testing task with probabilistic ground truth. Randomization is the methodological gold standard for isolating a technique's effect from analyst selection effects — you cannot credit ACH for good outcomes if you only observe it being used by careful, structured thinkers. The results were discouraging for the technique. ACH-trained analysts did not follow all of the steps of ACH. Evidence for ACH's ability to reduce confirmation bias was mixed, and the researchers found that ACH may increase judgment inconsistency and error.
That last finding — increased inconsistency — requires careful interpretation, because it is frequently misread as a definitive case against ACH. The study's fine-grained results show that ACH disrupted analysts' natural heuristics without reliably replacing them with the prescribed procedure. Only 12% of ACH-trained analysts used base rate information, compared to 52% of untrained analysts. That does not make untrained analysts the better reasoners. It means that ACH, applied without complete uptake of its own logic, can strip away base-rate sensitivity — a genuine cognitive resource — without compensating through the falsificationist discipline the technique requires. The technique works as advertised only when analysts execute all its steps. Most analysts trained and instructed to use ACH deviated from one or more of the prescribed steps, and they departed in particular from Step 5, which covers evidence integration.
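What that deviation costs is easier to see with the mechanics laid out. The sketch below is a minimal rendering of the evidence-integration logic at the heart of Step 5: rate each item of evidence against each hypothesis, then rank hypotheses by how much evidence is inconsistent with them, not by how much appears to support them. The hypotheses, evidence items, and ratings are invented for illustration; this is the bare logic of the step, not any fielded ACH tool.

```python
# Minimal sketch of ACH evidence integration (roughly Heuer's Step 5): hypotheses
# are ranked by how much evidence is INCONSISTENT with them. Only inconsistency
# counts toward refutation; consistency is cheap and largely non-diagnostic.
RATING_WEIGHTS = {"CC": 0, "C": 0, "N": 0, "I": 1, "II": 2}

def inconsistency_scores(matrix):
    """matrix: {evidence: {hypothesis: rating}} -> {hypothesis: weighted inconsistency}."""
    scores = {}
    for ratings in matrix.values():
        for hypothesis, rating in ratings.items():
            scores[hypothesis] = scores.get(hypothesis, 0) + RATING_WEIGHTS[rating]
    return scores

# Invented example for illustration only.
matrix = {
    "E1: procurement of dual-use equipment": {"H1: active program": "C", "H2: no program": "N"},
    "E2: obstruction of inspectors":         {"H1: active program": "C", "H2: no program": "C"},
    "E3: no verified production sites":      {"H1: active program": "I", "H2: no program": "C"},
}

scores = inconsistency_scores(matrix)
least_refuted = min(scores, key=scores.get)   # the tentative conclusion is the LEAST-refuted hypothesis
print(scores, "least refuted:", least_refuted)
```

Evidence consistent with every hypothesis, like E2 in the sketch, carries no diagnostic weight; noticing that is exactly the engagement Step 5 exists to force.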
This pattern has a long history in applied settings. Army intelligence officers resisted using ACH after being trained and repeatedly instructed to do so. The technique requires a discipline that feels unnatural — not because analysts are lazy, but because the cognitive architecture it demands runs against the grain of how expert pattern recognition operates. Experts develop fast, efficient heuristics precisely because analytical processing under load is expensive. Telling an experienced analyst to treat every piece of evidence as equally suspect, weigh inconsistencies mechanically across a matrix, and deliberately set aside domain intuition is not just procedurally difficult. It asks them to abandon what makes them good.
None of this is a reason to discard ACH. It is a reason to be precise about what ACH can and cannot do — and the intelligence community, for institutional reasons that will become apparent, has often been imprecise on this question.
The Audit Trail Thesis
"Strong analytic tradecraft increases the likelihood that assessments are transparent, relevant, and rigorous." That sentence, from a careful review of tradecraft history published in Intelligence and National Security, is doing a great deal of work. Transparency and rigor are not synonyms for accuracy. They are the properties that make a judgment reviewable — a different and arguably more institutionally important quality.
Consider what the ACH matrix produces from an organizational standpoint. An analyst who completes a well-formed ACH matrix has created a document that records which hypotheses were considered, what evidence was brought to bear, how that evidence was rated against each hypothesis, and which alternatives were eliminated and why. This record exists independently of whether the final judgment was correct. A supervisor, a quality reviewer, a congressional staffer, or a post-mortem investigator can examine that matrix and determine whether the analytical process was sound — whether the hypothesis set was comprehensive, whether critical evidence was appropriately weighted, whether elimination was defensible. Organizations that produce intelligence analysis have applied three methods to evaluate whether analysis is good: Did it meet analytic tradecraft standards? Were the assessments accurate? Did the product make a difference with a decision-maker? None of those evaluation methods is perfect, and all three leave questions unanswered.
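Treated purely as an audit artifact, the matrix reduces to a small, reviewable record. The sketch below shows one way to think about that record and the questions a reviewer can ask of it without knowing whether the judgment was right; the field names are illustrative, not any agency's actual schema.

```python
# The ACH matrix treated purely as an audit record: what a reviewer can examine
# later, independent of whether the judgment turned out to be correct. Field
# names are illustrative, not any agency's actual schema.
from dataclasses import dataclass, field

@dataclass
class EvidenceRating:
    evidence_id: str    # e.g. a report serial number
    source_type: str    # HUMINT, SIGINT, imagery, open source, ...
    ratings: dict       # hypothesis -> "CC" / "C" / "N" / "I" / "II"

@dataclass
class ACHRecord:
    question: str
    hypotheses: list                                    # the full set considered, not just the winner
    evidence: list                                      # list of EvidenceRating
    eliminated: dict = field(default_factory=dict)      # hypothesis -> stated rationale
    key_assumptions: list = field(default_factory=list)
    analyst: str = ""
    date: str = ""

def review_questions(record: ACHRecord) -> list:
    """The questions a supervisor or post-mortem can ask of the artifact alone."""
    return [
        f"Was the hypothesis set comprehensive? ({len(record.hypotheses)} considered)",
        f"Was each elimination justified? ({len(record.eliminated)} eliminated)",
        f"Were the load-bearing assumptions named? ({len(record.key_assumptions)} listed)",
        "Did the ratings drive the conclusion, or merely reflect it?",
    ]
```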
The institutions that mandated SAT use after 9/11 and Iraq were not responding to evidence that SATs improve accuracy. They were responding to evidence that analysis had been unaccountable — that judgments had been made on the basis of unexamined assumptions and single-hypothesis reasoning, and that when those judgments failed, there was no paper trail sufficient to enable meaningful learning. President George W. Bush signed the Intelligence Reform and Terrorism Prevention Act in 2004, which mandated procedural and methodological changes in how the U.S. Intelligence Community fulfilled its analytic requirements. That led the Director of National Intelligence to promulgate Intelligence Community Directive 200, requiring the IC to implement policies and procedures that encourage sound analytic methods and tradecraft across all its elements. The mandate was explicit about process, not outcomes.
This is the audit trail thesis: SATs are accountability infrastructure. They create a record that reasoning occurred, that alternatives were considered, that assumptions were named. Whether that reasoning arrived at a correct answer is a separate question — one the technique cannot guarantee and was never designed to ensure, whatever the marketing language around cognitive bias reduction might suggest.
The Iraq case demonstrates this reading in a way that is under-appreciated. Målfrid Braut-Hegghammer, a leading expert on weapons proliferation at the University of Oslo, produced perhaps the most in-depth study of Saddam Hussein's incentives. As she states: "The Iraqi leadership did not, as is widely believed, try to create a deterrent effect through calculated ambiguity as to whether Iraq no longer possessed WMD." Braut-Hegghammer's central finding, published in International Security in 2020, was that what American intelligence read as deliberate strategic deception was something structurally distinct: a communications failure inside an authoritarian regime trapped in what she called a "cheater's dilemma." Between the 1991 Gulf War and the U.S.-led invasion in 2003, the Iraqi regime faced a genuine dilemma: how much of its weapons capabilities to reveal, when each additional revelation made reward less likely, while continued denial also prevented the lifting of sanctions. The Iraqi leadership struggled to resolve this dilemma as elites pursued competing policies and subordinates failed to consistently obey Saddam Hussein's orders. Principal-agent problems, aggravated by the leadership's initial attempts to deny and cover up Iraq's weapons capabilities, explain a range of puzzling Iraqi behaviors that registered as calculated ambiguity to outside observers.
Now run a key assumptions check backward against the 2002 National Intelligence Estimate (NIE) — the authoritative pre-war document in which the IC stated its collective judgment on Iraq's weapons programs. Among the load-bearing assumptions undergirding the assessment that Iraq retained WMD was the premise that Baghdad's ambiguous behavior toward inspectors constituted evidence of concealment — active deception designed to hide ongoing programs. As the CIA concluded in a 2006 retrospective, when Saddam Hussein refused to come clean on his government's deception program, intelligence analysts in Washington assumed he had something to hide. In fact, he was hoping to avoid a coup. A key assumptions check that surfaced this premise — explicitly articulating the assumption that ambiguous behavior implied hidden capability — would have required asking whether that assumption had to be true, and whether an alternative explanation could account for the same observable behavior.
The technique did not save the judgment. Had it been rigorously applied and documented, however, the assumption would have been named in a way that enabled challenge. The failure of the judgment does not nullify the value of making the assumption auditable. A process that makes wrong assumptions visible is worth something even when it cannot prevent them — because visible assumptions can be contested, and contested assumptions sometimes lose.
Indicators and the Falsification Trap
The concept of indicators is where SATs make their clearest epistemological claim. Operationalizing "what would I observe if hypothesis X were true?" is sound scientific practice — it follows from Karl Popper's falsificationism and the basic logic of hypothesis testing. The CIA's own Tradecraft Primer describes using indicator matrices to track preconditions for regime instability, and the logic is correct: define observable conditions, check for their presence or absence, update your assessment accordingly. Analysts have tracked the potential for regime change by identifying a list of indicators, posing the question "is this occurring or not?" for each, and developing trigger mechanisms that might bring about a political shift.
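One way to make "check for presence or absence, update your assessment accordingly" concrete is a simple likelihood-ratio update over the indicator list. The Tradecraft Primer prescribes the checklist, not this arithmetic, and every indicator, probability, and observation below is invented for illustration; the sketch only shows what disciplined updating would look like, including the case of an indicator that was never checked at all.

```python
# One way to operationalize "is this occurring or not?" plus "update accordingly":
# a likelihood-ratio (Bayesian) update over an indicator list. All indicators,
# probabilities, and observations are invented for illustration.

def update_probability(prior, indicators, observations):
    """indicators: {name: (p_seen_if_true, p_seen_if_false)};
    observations: {name: True / False / None (never checked)}."""
    odds = prior / (1 - prior)
    for name, (p_true, p_false) in indicators.items():
        seen = observations.get(name)
        if seen is None:
            continue                                  # never collected against: no update either way
        if seen:
            odds *= p_true / p_false                  # indicator observed
        else:
            odds *= (1 - p_true) / (1 - p_false)      # indicator checked and confirmed absent
    return odds / (1 + odds)

indicators = {
    "security forces redeployed to the capital": (0.7, 0.10),
    "elite defections reported":                 (0.5, 0.05),
    "state media messaging shifts":              (0.6, 0.20),
}
observations = {
    "security forces redeployed to the capital": True,
    "elite defections reported":                 None,   # never tasked, never checked
    "state media messaging shifts":              True,
}

# A prior belief of 10% in regime instability rises to roughly 70% on two observed
# indicators -- and is silently unchanged by the one nobody was tasked to check.
print(round(update_probability(0.10, indicators, observations), 2))
```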
The problem is not the theory. The problem is what happens to indicator lists when they enter the collection system.
When an analyst defines a set of indicators and those indicators are formalized into a collection requirement, collection managers task sensors against them. Satellites are cued to watch for specific military movements. Signals intelligence platforms are directed toward certain communication patterns. Human intelligence sources are asked targeted questions. The collection system optimizes to detect the indicators it has been given. This is exactly what should happen — collection should serve analysis. But a side effect follows immediately: the collection system is now structurally better positioned to find evidence that the indicators are occurring than evidence that they are not. Absence of evidence is inherently harder to collect for than presence. You can direct a satellite to watch for troops massing at a border. You cannot easily direct it to confirm that troops are definitively not massing.
The result is that indicator lists, once they become collection requirements, can function as confirmation machines. The analyst receives reporting on indicators precisely because collection has been tasked against them. Ambiguous signals consistent with the indicators get reported and logged; ambiguous signals consistent with the null hypothesis are less systematically sought. Over time, the indicator set dominates the analytical picture not because the world has confirmed it, but because the collection architecture was built around it. The analyst mistakes the footprint of their own collection design for independent evidence.
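The asymmetry is easy to see in a toy simulation. Assume, purely for illustration, that an indicator actually occurs in a fifth of collection cycles, that tasked collection usually reports it when it does occur, and that a confident report of its absence is rarely produced. The reporting stream that results over-represents confirmation even though the indicator is usually not occurring.

```python
# A toy simulation of that asymmetry. All rates are invented for illustration.
import random

random.seed(1)
P_INDICATOR_TRUE = 0.20      # the indicator actually occurs in 20% of collection cycles
P_REPORT_IF_PRESENT = 0.90   # tasked sensors usually catch it when it happens
P_REPORT_IF_ABSENT = 0.10    # a confident "confirmed absent" report is rarely filed

inbox = []
for _ in range(1000):
    present = random.random() < P_INDICATOR_TRUE
    if present and random.random() < P_REPORT_IF_PRESENT:
        inbox.append("observed")
    elif not present and random.random() < P_REPORT_IF_ABSENT:
        inbox.append("confirmed absent")
    # otherwise: silence, which never lands in the analyst's queue as evidence

print("presence reports:", inbox.count("observed"),
      "absence reports:", inbox.count("confirmed absent"))
# Ground truth: the indicator occurs about 20% of the time. The reporting stream
# the analyst actually reads runs roughly two to one in favor of presence --
# the footprint of the tasking, not of the world.
```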
This problem was visible in the run-up to Iraq. The fabrications of "Curveball" — the Iraqi defector whose claims about mobile biological weapons laboratories were later proven entirely false — not only became the basis for the NIE's sweeping biological weapons assessments but also made their way into President Bush's 2003 State of the Union address and Secretary of State Colin Powell's presentation to the United Nations Security Council. Part of why Curveball's reporting was so difficult to dislodge is that it was consistent with a set of hypotheses collection had already been optimized to confirm. The mobile biological weapons laboratory claim fit neatly into the existing indicator framework. Sources that contradicted it faced a much higher burden of proof, because the collection architecture had not been designed to pursue their line of argument.
The falsification discipline that ACH and indicator methodology require is genuinely hard to maintain in a real collection environment, for structural reasons that good technique alone cannot solve. Indicators do not just describe the world. They direct attention toward it. Directed attention, compounding over a collection cycle, bends the evidentiary record toward whatever hypothesis generated the indicators in the first place.
Process Theater
There is a failure mode that sits downstream of everything discussed so far, and it is the one most likely to accelerate in an AI-assisted environment. Call it process theater: the use of structured analytic technique as a display of rigor rather than an exercise of it. The analyst fills in the ACH matrix correctly, completes the key assumptions check worksheet, names the assumptions, ticks the boxes — and remains wrong for all the same reasons they would have been wrong without any of it.
ACH is a thinking tool, not an answer machine. Analysts and their managers must engage with it rather than comply with it. The distinction sounds obvious. In practice, under workload, under organizational pressure, under the cognitive exhaustion of a fast-moving target set, compliance and engagement tend to converge into indistinguishable behaviors. The matrix gets filled in. The assumptions get listed. The devil's advocate statement gets written. Whether any of this changed the underlying judgment is not observable from the artifacts.
The signs of process theater are recognizable to anyone who has worked near it. The ACH matrix in which every hypothesis except the favored one has been populated with inconsistencies that, on inspection, are not particularly diagnostic — where the evidence listed is mostly consistent with multiple hypotheses and the ratings reflect the analyst's conclusion rather than driving it. The key assumptions check that lists assumptions at the level of banality ("we assume the source is reliable") without surfacing the ones that are load-bearing. The red team report that reaches the same conclusion as the primary assessment, differing mainly in tone.
Heuer understood this risk. If the analyst is already generally knowledgeable on a topic, the usual procedure is to develop a favored hypothesis and then search for evidence to confirm it. This is a "satisficing" approach — going with the first answer that seems supported by the evidence. It is efficient: it saves time and works much of the time. But the analyst has made no investment in protection against surprise. The satisficing tendency does not disappear when an analyst is trained in ACH. It gets redirected: the analyst satisfices through the matrix rather than through free-form reasoning, arriving at the same conclusion they would have reached anyway while generating documentation that suggests a more rigorous path.
The historical record shows that endemic problems analytic tradecraft was designed to mitigate — cognitive biases, mindsets, poor logic, hazy exposition — still confront the community. SATs did not eliminate these problems after two decades of institutionalization. They created new surfaces on which the problems could hide.
The Iraq failure shows what process theater looks like at the NIE level. Saddam Hussein's mindset rested on three fundamental assumptions: that the most significant threats his regime faced were opponents within Iraq; that the United States was weak and irresolute; and that his most dangerous external threats came from his neighbors to the east and north. These assumptions shaped his interpretation of events as well as his operational and strategic decisions. American analysts, similarly, built their assessments on assumptions about Saddam's strategic calculus — assumptions never surfaced in a way that permitted challenge. The 2002 NIE had sourcing qualifications, caveats, and dissenting footnotes from the Department of Energy on the aluminum tubes. It carried the structural markers of rigorous analysis. A genuine engagement with the hypothesis that Saddam had no WMD and was behaving ambiguously for entirely different reasons was absent. That hypothesis was available. It was not seriously tested.
The devil's advocacy function in that analytic cycle failed not because no one understood the technique, but because the organizational environment had already resolved the question in favor of the dominant hypothesis. When the organization is not seeking challenge, the transparency function can be satisfied pro forma while the rigor function atrophies.
What AI Does to This
The question is not whether AI will be used in intelligence analysis. It already is. In March 2026, the Department of Defense designated Palantir's Maven Smart System — an AI-enabled intelligence platform — as an official program of record, a bureaucratic designation meaning that Congressional funding is guaranteed through September 2026 and beyond, transforming what began as an experimental project into permanent military infrastructure. Based on public reporting, the AI handling natural-language intelligence queries inside Maven is Claude, built by Anthropic. Anthropic was among the first to deploy a large language model in a classified military setting. The Defense Intelligence Agency (DIA), meanwhile, has reorganized its AI work into a Digital Modernization Accelerator and deployed a classified chatbot called "ChatDIA" on the top-secret intelligence network. This is the operating environment now.
Against that backdrop, the question of what AI does to SATs is urgent and the answer is precise. It splits along the two functions this episode has distinguished.
For the audit trail function, AI assistance is largely neutral and may even be helpful. If an analyst asks Claude to generate an ACH matrix for a given question — populating hypotheses, listing relevant evidence, structuring the consistency ratings as a first draft — and then reviews, modifies, and signs off on that matrix, the institutional record still exists. The analyst's judgment is still on paper. Supervisors can still review the structure of the reasoning, challenge the hypothesis set, interrogate the evidence ratings. Human sign-off on a machine-generated matrix is still a judgment on record.
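It is worth being precise about what that record captures. Below is a minimal sketch of the provenance metadata such a sign-off might put on file; the field names are hypothetical. The record can show that a review happened, how long the analyst had, and what was changed. It cannot show whether the reviewer genuinely re-reasoned the hypothesis set.

```python
# A sketch of the provenance a human sign-off on a machine-generated matrix could
# put on record. Field names are hypothetical and illustrative only.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class MatrixProvenance:
    generated_by: str                 # e.g. model name and version
    generated_at: datetime
    reviewed_by: str                  # analyst of record
    signed_off_at: datetime
    modifications: list = field(default_factory=list)   # hypotheses added, ratings changed

    def review_minutes(self) -> float:
        """Elapsed time between machine generation and analyst sign-off."""
        return (self.signed_off_at - self.generated_at).total_seconds() / 60
```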
For the cognitive discipline function, AI assistance is actively destructive — not because AI reasons poorly about structure, but because the cognitive discipline was never located in the artifact. It was located in the friction. The value of being required to manually populate an ACH matrix is precisely that it forces the analyst to sit with each piece of evidence against each hypothesis long enough to notice when something is off — when a piece of evidence that was supposed to be inconsistent with a hypothesis is genuinely ambiguous, when a hypothesis that seemed eliminable survives the evidence more strongly than the leading candidate, when the whole exercise suggests the collection base is too thin to support any confident judgment. That noticing is what ACH is for. It happens in the act of populating the matrix, not in reading the populated matrix.
Claude, deployed within Maven, served as an interface and synthesis layer — helping analysts query massive datasets, summarize multi-source intelligence reporting, and translate raw data into assessable language for commanders. Claude also ranked targets by strategic importance and assessed the expected impact of strikes. At 1,000 strikes in 24 hours during Operation Epic Fury, the average time available per targeting decision was approximately 86 seconds. Whether 86-second human authorization constitutes meaningful engagement with an AI-generated analytical product — rather than a rubber stamp — is not answered by whether a human was technically in the loop. It is answered by whether the human had enough time, information, and cognitive space to actually reason about what they were approving.
When Claude generates an ACH matrix in twelve seconds, the analyst who reviews it for thirty seconds and approves it has performed the audit trail function — there is a record, there was nominally human oversight. The cognitive discipline function has been entirely offloaded to the model. If the model's hypothesis set is incomplete, the analyst is unlikely to notice. If the model's evidence ratings reflect biases in its training data, the analyst has no mechanism to catch them that is independent of redoing the analysis manually. If the model has framed the question in a way that systematically excludes a class of alternatives, the analyst reviewing a finished matrix is working from within that frame.
The popularity of ACH is surprising given the scarcity of empirical research testing its utility. That scarcity was a problem before AI entered the picture. It becomes a far more serious problem when the technique is applied at machine speed across thousands of targeting decisions, collection requirements, and analytical products per day. The failure mode is not that AI gets ACH wrong in an obvious way. It is that AI gets ACH right — correct procedure, coherent matrix, defensible ratings — while the cognitive engagement that justified the procedure has been quietly hollowed out.
The intelligence community's answer to this problem so far has been to emphasize "human in the loop" requirements. This is the right instinct but the wrong operationalization. A human who reviews an AI-generated analytical product is in the loop in a procedural sense. Whether they are in the loop in a substantive sense — whether their review constitutes genuine independent reasoning about the quality of the underlying analysis — depends entirely on whether they have the time, the expertise, the adversarial disposition, and the institutional license to push back. Intelligence organizations deviate from prescribed ACH steps even when analysts have been trained to apply them. Adding an AI layer that generates a plausible-looking artifact does not fix that. It gives the deviation a better cover story.
The Decision You Can Now Make
The thesis of this episode was falsifiable: SATs do not reliably improve analytic accuracy, and the institutional value of SATs lies in their audit trail function, not their cognitive improvement function. The Dhami study provides direct empirical support for the first claim. The institutional history — ICD 203 mandated after intelligence failures, not after evidence that techniques improved accuracy — supports the second. The Iraq case shows the audit function working in principle, with assumptions that were nameable on paper, while the cognitive function failed in practice.
The practitioner now has a clean decision criterion.
Before incorporating AI into an SAT workflow, ask which function you are relying on in that context. If you are using ACH for accountability — to create a record that alternatives were considered, to enable supervisor review, to document the analytical path for a subsequent post-mortem — then AI-assisted ACH serves that function, provided there is genuine human review and sign-off on the machine-generated product. The record exists. The judgment is attributed.
If you are using ACH for cognitive discipline — to force genuine engagement with alternative hypotheses, to surface assumptions you haven't yet named, to catch the moment when an evidence rating should be "ambiguous" rather than "inconsistent" — then AI-assisted ACH does not serve that function. The friction is the mechanism. An AI that eliminates the friction eliminates the mechanism. You would be better served by doing the matrix slowly, by hand, fighting your own first instincts about the consistency ratings, than by reviewing a polished AI product that ratifies the conclusion you already held.
Most real analytic work needs both functions simultaneously. The answer is not to refuse AI assistance but to be deliberate about where you apply it. Use AI to handle the mechanical scaffolding — collecting evidence, formatting the matrix, drafting the initial hypothesis set — and then treat that draft as an adversarial challenge, not a starting point for minor edits. The discipline you are protecting is not the ACH process itself. It is your own capacity to notice when the structure of the analysis is wrong. That capacity cannot be delegated to a model and recovered at review time. It has to be exercised during the analysis, by a human, under conditions that permit genuine uncertainty about the answer.
A War on the Rocks review of the Iraq intelligence literature summarized Braut-Hegghammer's finding this way: the puzzling behaviors that "came across as calculated ambiguity to the outside world" were the product of institutional dysfunction, a regime unable to communicate its own compliance. The key assumption that deception implied capability — never rigorously surfaced, never genuinely tested against the alternative — was the load-bearing premise of the entire failure. That assumption could have been named. The technique for naming it existed. The institutional environment in which it would have been seriously challenged did not.
AI can build the scaffold faster than any analyst. It cannot build the environment in which challenge is real. That remains a human problem, which means it remains yours.