M11E1: The AI-Enabled Analytic Team and How to Build It
Module 11, Episode 1: The AI-Enabled Analytic Team and How to Build It
The Right Structural Question
Most organizations building AI-enabled analysis capabilities start with the same question: how much AI should we use? The framing produces organizations designed around the wrong axis entirely — teams structured by how much automation they've adopted rather than by how clearly they've defined where human judgment must stay in the loop and why.
The right question is structural: given what AI can do and what it cannot do, how do you build a team so that the people who hold context and carry accountability are also the people making the consequential calls? This is an engineering problem with engineering answers — specific role definitions, specific handoff protocols, specific training requirements, and specific evaluation methods that tell you whether the design is working.
The stakes are concrete. As U.S. and Israeli forces have unleashed more than 11,000 strikes on Iranian targets since late February 2026, Palantir Technologies' AI systems have emerged as a central tool in Pentagon targeting operations, compressing what military planners call the "kill chain" from days to minutes and enabling a tempo of attacks unprecedented in modern warfare. Project Maven's reported accuracy hovers around 60%, compared with 84% for human analysts in some assessments. A U.S. strike on a girls' elementary school in Minab killed more than 165 civilians. The school was reportedly on a target list generated with AI assistance, though officials say outdated intelligence contributed and a full investigation continues. This is what happens when AI is integrated for its speed without equivalent investment in the organizational structures that keep judgment with people who understand context the model has no access to: the operational history of a specific neighborhood, the reliability of a specific source, or the gap between what the training data covered and what the current situation is.
The lesson concerns what organizational design must accompany AI deployment, not whether AI should have been used at all.
Who Does What: Roles, Responsibilities, and Where Handoffs Must Happen
An AI-enabled analytic team is not an existing analysis team with a chatbot added. It is a team with distinct functional roles and different knowledge requirements, with handoff points between those roles where most integration failures occur.
The foundational taxonomy deserves precision. A data analyst surfaces business insights, a data scientist builds predictive models, a machine learning engineer ships those models to production, an AI engineer builds applications on large language models and foundation models, and a data engineer builds and maintains the pipelines that make everything else possible. In an analytic context, the data analyst role maps roughly to the substantive analyst — the person with domain knowledge who can evaluate whether the output is meaningful. The data scientist builds the models or customizes existing ones. The machine learning engineer handles production stability, monitoring, and drift detection. The AI engineer is the newer role, responsible for applications that sit on top of foundation models like Claude, GPT-5, or Gemini — the layer where prompt architecture, retrieval-augmented generation pipelines (systems that pull relevant documents at query time to ground a model's responses), and agentic workflows live. The data engineer is the unglamorous foundation: the pipelines that move data from its sources to the systems that process it.
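A concrete, if simplified, picture of what lives in that AI engineering layer helps. Below is a minimal sketch of the retrieval step of a retrieval-augmented generation pipeline in plain Python; the toy lexical scoring and the prompt format are illustrative assumptions, not any particular platform's implementation, and production systems would use embedding-based retrieval instead.

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    shared = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in shared)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, documents: list[str], k: int = 3) -> list[str]:
    """Return the k documents most similar to the query (toy lexical retrieval)."""
    q = Counter(query.lower().split())
    ranked = sorted(documents, key=lambda d: cosine(q, Counter(d.lower().split())), reverse=True)
    return ranked[:k]

def build_prompt(query: str, documents: list[str]) -> str:
    """Assemble a grounded prompt: retrieved passages first, then the question."""
    context = "\n\n".join(retrieve(query, documents))
    return f"Use only the sources below to answer.\n\nSOURCES:\n{context}\n\nQUESTION: {query}"
```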
In a mature analytic organization, these five roles are not interchangeable and should not be expected to be. The failure mode of under-resourced teams is to collapse them — to expect the substantive analyst to also understand how the retrieval pipeline works, or to expect the machine learning engineer to evaluate whether an output reflects accurate geopolitical reasoning. Both errors are common. Both are dangerous.
There is a sixth role that most organizations don't name cleanly but that the best organizations have learned to cultivate deliberately: the toolsmith. The toolsmith is not a software engineer in the full sense. The toolsmith sits at the intersection of analytic tradecraft and technical capability: fluent enough in Python or SQL to build custom workflows, familiar enough with APIs to connect tools together, but oriented primarily toward analytic problems rather than software architecture. Palantir deploys Forward Deployed Engineers (FDEs) who work on-site with customers to build the data ontology (the structured model of how entities, relationships, and events are defined within the platform), integrate data sources, and train users. The FDE is Palantir's commercial implementation of the toolsmith concept: a person embedded in the mission context whose job is to translate between what analysts need and what the platform can do. In government and corporate analytic teams, this role is often filled informally by the most technically curious analyst, which means it is understaffed, unrecognized, and fragile.
The product owner role is equally undervalued. Someone must be responsible for what gets built, why, and in what order, and that person needs to hold both analytic requirements and technical constraints simultaneously. Analytic teams that skip this role end up with a proliferation of tools built to specification requests that don't reflect the actual workflow, or with data engineers building pipelines toward no defined analytical destination. The product owner is not just a project manager. They are the person who can refuse a technically feasible feature because it doesn't serve an analytic purpose, and refuse an analytic requirement because it isn't technically achievable at acceptable fidelity.
The handoff points where integration most commonly fails are predictable. The handoff between data engineer and analyst is where data quality problems get buried — the pipeline runs cleanly, producing output that looks authoritative but reflects gaps, biases, or coverage limitations in the source data that only the data engineer fully understands. The handoff between model and analyst is where automation bias accumulates — the analyst sees a confident AI output and stops applying the structured skepticism they would apply to any other source. The handoff between toolsmith and production machine learning engineer is where prototype-to-production failures happen, when something that works in a notebook doesn't survive contact with real operational data. Each of these handoffs needs explicit governance: documented assumptions, explicit confidence levels, clear escalation paths when the system encounters edge cases.
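One way to make that governance concrete is to treat every handoff as a structured record rather than an email thread. The sketch below is a hypothetical structure; the field names, roles, and completeness rule are assumptions, not a published standard.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class HandoffRecord:
    """Minimal artifact passed at each handoff point (illustrative only)."""
    producer_role: str                # e.g. "data_engineer", "toolsmith", "model"
    consumer_role: str                # e.g. "analyst", "ml_engineer"
    artifact: str                     # what is being handed off
    documented_assumptions: list[str] = field(default_factory=list)
    known_coverage_gaps: list[str] = field(default_factory=list)
    confidence: str = "unspecified"   # e.g. "low", "moderate", "high"
    edge_case_escalation: str = ""    # who gets called when the system hits something unexpected
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def is_complete(self) -> bool:
        """A handoff with no documented assumptions or escalation path is a governance gap."""
        return bool(self.documented_assumptions) and bool(self.edge_case_escalation)
```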
Keeping the people who understand what intelligence consumers need in continuous conversation with the people building the tools they'll use is the core organizational design challenge, not a secondary function.
Embedded vs. Centralized AI Support: Why the Lab Model Fails
The most common structural mistake large organizations make when building AI capability is to create a centralized AI center of excellence — a dedicated team of data scientists, engineers, and researchers who sit apart from operational units and deliver products to them. The logic seems sound: concentrate expertise, avoid duplication, maintain standards. Centralized AI labs in analytic organizations consistently underdeliver, and the reason is architectural rather than accidental.
A centralized lab produces outputs without context. The lab team learns the data; they do not learn the mission. They optimize for what is measurable in their environment — model accuracy on a held-out test set, system latency, throughput — rather than for what matters in the operational environment: whether the output is useful to an analyst making a judgment under time pressure with incomplete information. The feedback loop is long and lossy. An analyst receives a dashboard, finds it unhelpful, stops using it, and the lab never learns why because there is no mechanism for that signal to flow back. AI model drift silently degrades production model accuracy over time, and in a centralized model, neither data drift nor concept drift is detected until the gap between model behavior and analyst expectations has grown wide enough to produce visible failures.
"Throwing dashboards over the wall" describes the workflow of most centralized AI labs. The lab builds something, documents it, trains a liaison to explain it, and hands it off. The analyst receives a tool built to requirements defined months earlier, against data that has since changed, by people who have moved on to the next project. If the tool doesn't fit the workflow, the analyst stops using it. The lab doesn't know. The cycle repeats.
Embedded AI support breaks this pattern. Palantir's commercial expansion of AIP (Artificial Intelligence Platform, the company's suite for deploying AI models on enterprise data) has been built explicitly on an embedded model: all components are designed to facilitate AI teaming patterns to unlock the full potential of operators, analysts, and subject-matter experts. The structural bet is that AI tools only deliver value when they are built into the workflow rather than offered alongside it.
Embedded means the AI engineer and the toolsmith sit with the analytic team, attend the production meetings, understand what intelligence questions are currently being worked, and know which data pipelines are unreliable this week and why. The feedback loop between analyst and tool is measured in hours, not quarters. The product owner is accountable to the mission, not to a technology roadmap maintained elsewhere.
The tension is real. Pure embedding creates fragmentation — each embedded team develops its own tooling, its own standards, its own data models, and organizational knowledge doesn't accumulate. The hybrid solution that the best-run teams have converged on combines embedded engineers with a centralized function that maintains shared infrastructure, a community of practice for peer review, and escalation paths to specialized roles when workflows involve high-risk data or decisions. Think of the design practice concept of the "elevator team" — a small group of specialists who circulate across embedded teams to maintain consistency, share patterns, and raise standards without removing autonomy. The centralized function maintains the pipelines, manages vendor relationships, sets data governance standards, and runs the evaluation infrastructure. The embedded function builds and iterates in close contact with the people who have to use the output.
The DoD's challenge in deploying Maven illustrates this precisely. Success depends on maintaining a feedback loop between the warfighters on the ground — who currently number in the "tens of thousands" of users — and the engineers at Palantir refining the AI's capabilities. That feedback loop doesn't happen automatically. It requires deliberate structure.
The IC has been grappling with this structural question for years. The organization responsible for the bulk of open-source intelligence collection, the Open Source Enterprise, has moved within the CIA and the Office of the Director of National Intelligence (ODNI) repeatedly. That organizational instability — moving the function without resolving the structural question of how it integrates with substantive analysts — is precisely what the embedded versus centralized debate is about. The answer is not where to locate the Open Source Enterprise. It is how to connect open-source practitioners with the analysts who hold mission context, and how to ensure that tools built in support of open-source collection are built to those analysts' requirements.
The CIA will embed generative AI co-workers inside every analytic platform the agency uses within the next two years. Deputy Director Michael Ellis announced the plan on April 9, 2026, at a Special Competitive Studies Project event in Washington. Not a new AI lab. Not a center of excellence. Co-workers embedded inside the platforms analysts already use. As Ellis described it, the tool won't do the thinking for analysts, but it will help draft key judgments, edit for clarity, and compare drafts against tradecraft standards. That sentence describes the correct role architecture: AI as a first-draft tool, a consistency check, a format enforcer, with analysts owning the judgment, the sourcing, the uncertainty calibration, and the final product.
Training Analysts for AI, and AI People for Analysis
The knowledge gap runs in both directions, and most organizations address only one side. They train analysts in how to use AI tools — prompt writing, output evaluation, workflow integration — while leaving engineers and data scientists in ignorance of analytic tradecraft. The result is a team that can use AI but cannot hold AI accountable to the standards that make analysis worth consuming.
Most working intelligence analysts, open-source intelligence practitioners, and corporate risk professionals lack a functional model of how large language models work — specifically, the properties that determine when their outputs are reliable and when they aren't. They don't understand the relationship between training data cutoffs and knowledge currency. They don't have a precise vocabulary for the failure modes: hallucination (confident assertion of false detail), confabulation (plausible-sounding synthesis that fills gaps with invented material), sycophancy (adjusting output to match what the user seems to want). Without that vocabulary, they can't document failures in terms that engineers can act on. They also often lack foundational data literacy: the ability to evaluate a dataset's coverage, identify sampling bias, or recognize when an AI's response is suspiciously consistent because the training data was suspiciously uniform.
The second gap is more subtle. Analysts who have spent careers building judgment about source reliability — interrogating the provenance, access, and motivation of human sources — often apply none of that discipline to AI outputs. The output arrives formatted like an authoritative summary and gets treated like one. The principle that humans will remain in the decision loop for analytic judgments only holds if analysts are trained to treat AI outputs as drafts requiring evaluation, not as conclusions requiring formatting.
On the engineering side: what AI engineers and data scientists typically don't know is how intelligence analysis works, and why the standards exist. They haven't read ICD 203 (Intelligence Community Directive 203, the governing standard for analytic tradecraft across the U.S. intelligence community). They don't understand the distinction between information and intelligence. They have no framework for source qualification, alternative hypothesis generation, or confidence calibration. They don't understand why an analyst would refuse to report something that is probably true but can't be sourced, or why a product that is technically accurate but irrelevant to the consumer's decision calculus is a failure. Without that understanding, engineers build tools optimized for things they can measure — accuracy on labeled data, latency, throughput — rather than for things that make analysis useful: appropriate uncertainty quantification, traceable sourcing, and relevance to the decision at hand.
Gerald McMahon's 2024 Belfer Center paper is one of the few serious treatments of this problem from an IC practitioner's perspective. Its central argument is that existing analytic standards — sourcing requirements, uncertainty calibration, alternative analysis — do not automatically extend to AI-assisted workflows, and that the IC needs deliberate amendments to ICD 203 that specify how those standards apply when an AI tool is in the production chain.
For analysts, the core curriculum needs three things. First: a mechanical understanding of how large language models generate text — not deep enough to implement one, but deep enough to understand why a model that has read every document on a topic will still confabulate a citation that doesn't exist, and why asking the same question with different framing can produce different answers. Second: hands-on adversarial practice — structured exercises in which analysts are given AI-generated intelligence products containing specific errors and asked to identify them using standard source evaluation tradecraft. This is exactly the tradecraft workflow analysts already apply to human sources, repurposed for a new input type. Third: threshold training — practice at setting and documenting confidence thresholds for AI-assisted outputs, so that the analyst's decision about when to use AI versus when to work the problem manually is explicit and auditable rather than intuitive and invisible.
For data scientists and engineers, the curriculum runs the other direction. They need exposure to the intelligence production cycle and the decision support context in which their tools will be used. They need to understand why a 60% accurate AI tool (Maven's reported accuracy, compared with 84% for human analysts) is operationally dangerous even if technically impressive by benchmark standards. They need to understand a pattern familiar from past intelligence failures: a system performing at 60% accuracy in a low-stakes environment, once integrated into a high-stakes workflow, will continue to perform at 60%, but the consequences of the remaining 40% are no longer low stakes. They need to know what ICD 203 requires, because they are producing outputs that will be evaluated against those requirements whether they intended to produce intelligence or not.
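The arithmetic behind that point is worth making explicit. Under the simplifying assumption that the reported accuracy rates apply uniformly to every nomination, the sketch below shows how a 24-point gap compounds at volume; the 10,000-nomination figure is illustrative, not a reported number.

```python
def expected_errors(nominations: int, accuracy: float) -> int:
    """Expected number of erroneous nominations, assuming uniform per-item accuracy."""
    return round(nominations * (1.0 - accuracy))

nominations = 10_000  # illustrative volume, not a reported figure
ai_errors = expected_errors(nominations, 0.60)      # reported Maven-level accuracy
human_errors = expected_errors(nominations, 0.84)   # reported human-analyst accuracy

print(f"AI-assisted: ~{ai_errors} erroneous nominations")    # ~4,000
print(f"Human-only:  ~{human_errors} erroneous nominations")  # ~1,600
```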
The training gap between analysts and engineers doesn't close through documentation. It closes through deliberate contact — engineers attending analytic production meetings, analysts participating in model validation exercises, both communities working the same problem from different entry points. The training agenda is mutual fluency, built in both directions simultaneously, with specific content on each side.
Joint Exercises and Cross-Pollination: Building a Shared Mental Model
For AI-enabled analytic teams, the shared mental model has a precise referent: the team's collective understanding of what the AI tools do, what their failure modes are, how their outputs should be evaluated, and who holds authority over what kind of decision. Without it, each member of the team operates from a different set of assumptions about what the AI is doing — and those divergent assumptions are invisible until something goes wrong.
No document produces a shared mental model. Training programs produce individual literacy. Joint exercises produce shared models, because exercises force the team to surface and reconcile disagreements in real time, under controlled conditions where the cost of error is educational rather than operational.
The most effective exercise format for AI-enabled analytic teams is the analytic red cell combined with a production review. The red cell presents the team with an AI-generated product — a summary, a link analysis, a translated document, a collection of social media signals synthesized into a narrative — and asks the team to break it: find the errors, surface the unsupported inferences, identify the gaps the tool couldn't know about. Engineers see how analysts evaluate outputs. Analysts see where the model's architecture produces systematic errors versus random ones. The toolsmith learns what kinds of errors are fixable at the prompt layer versus the data layer versus the model layer.
The second exercise type is the pipeline audit walk-through. An engineer walks the analytic team through the full data pipeline — from source through processing to the output the analyst sees — in plain language, stopping at each transformation to explain what was done, what was assumed, and what was discarded. This exercise is humbling for engineers, because it forces them to articulate in plain language decisions that are usually buried in code and treated as implementation details. It is revealing for analysts, because they see for the first time that the authoritative-looking output they've been relying on was preceded by a dozen judgment calls made by people they've never met, under constraints they didn't know existed.
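A lightweight way to make that walk-through repeatable is to require each transformation to carry its own plain-language record of what was done, what was assumed, and what was discarded. The example below is hypothetical; the step names and contents are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class PipelineStep:
    """One transformation in the data pipeline, documented in plain language."""
    name: str
    what_was_done: str
    what_was_assumed: str
    what_was_discarded: str

# Hypothetical record an engineer might walk an analytic team through.
steps = [
    PipelineStep(
        name="language filter",
        what_was_done="Kept posts auto-detected as Farsi or Arabic.",
        what_was_assumed="Language detection is reliable on short, informal text.",
        what_was_discarded="Mixed-language posts and posts under five words.",
    ),
    PipelineStep(
        name="deduplication",
        what_was_done="Dropped posts with near-identical text hashes.",
        what_was_assumed="Reposts carry no independent evidentiary value.",
        what_was_discarded="Coordinated amplification signals, which some analysts care about.",
    ),
]

for step in steps:
    print(f"{step.name}: discarded -> {step.what_was_discarded}")
```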
The IC's Open Source Intelligence Strategy 2024–2026 explicitly names workforce development as one of its four strategic pillars. Kevin Carlson, head of the Open Source Enterprise for the CIA's Directorate of Digital Innovation, has found — as have practitioners across the community — that the workforce question is inseparable from the organizational design question. You cannot train analysts to use AI tools they don't trust. You cannot build trust without exposure. Exposure requires structure.
Cross-pollination, rotating people across roles, is the most underused structural tool for building shared mental models. A data engineer who spends two weeks sitting with an analytic team learns more about what intelligence analysts need than any requirements document could convey. An analyst who spends a week watching a model validation exercise learns more about AI failure modes than any training curriculum teaches. A team of Leidos open-source intelligence analysts was recognized with the Defense Intelligence Agency Team Award for applying innovative supplemental collection methods to support Joint Task Force – Southern Border. The award is for operational impact, achievable only when the people building tools and the people using them are close enough to course-correct in real time.
Rotating people across roles for two-week stints every six months won't produce the shared model. Continuous, repeated contact at the working level produces it: engineers present in production meetings, analysts present in technical design reviews, both communities sharing a communication channel where questions flow in both directions without formal routing. The product owner's job is to maintain that contact as a structural requirement rather than a courtesy.
The CIA tested more than 300 AI projects during 2025 and recently used AI to generate an intelligence report for the first time in its history — with the significant caveat that the date, topic, model, and distribution status were not disclosed. That ratio — 300 experiments to one cautiously reported production outcome — is the shape of organizational learning at an institution treating AI integration as a serious epistemological problem rather than a procurement decision. The organizations that produce durable AI-enabled analytic capability invest in the learning process rather than trying to shortcut it through tool deployment.
Evaluating Whether Your AI Integration Is Working
The hardest question in organizational design for AI-enabled analysis is: how do you know if it's working? Not whether the tools are running, or whether analysts are using them — but whether the analysis is getting better. Are the judgments more accurate, the turnaround faster, the sourcing more rigorous, the uncertainty better calibrated? Or has AI been added to the workflow in ways that create the appearance of enhanced capability while degrading it?
Usage metrics — logins, queries, documents processed — tell you nothing about whether AI is supporting good analysis. They tell you whether the tool is being used, not whether using it is better than not using it. An analyst who uses an AI summarization tool to process 100 documents per day and doesn't catch that the tool systematically misrepresents sources in a specific language is using the tool actively and producing degraded output.
The evaluation signals that matter are traceable to analytic quality. The first is output traceability: can every claim in an AI-assisted product be traced to its source, including identification of which parts were AI-generated and which were analyst-verified? The IC has applied three methods to determine whether analysis is good: did it meet analytic tradecraft standards? Were the assessments accurate? Did the product make a difference with a decision maker? None of those evaluation methods is perfect, and all three leave questions. Adding AI to the production chain compounds the evaluation problem rather than simplifying it. The AI's contribution to the product must be evaluated as a distinct layer, not just the final product.
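Traceability is easiest to enforce at the level of individual claims rather than whole documents. A minimal sketch, assuming each claim is tagged with its sources, whether the model drafted it, and whether a named analyst verified it; the field names and release rule are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Claim:
    """One claim in a finished product, tagged with provenance."""
    text: str
    sources: list[str]        # citable source identifiers
    ai_generated: bool        # did the first draft of this claim come from the model?
    analyst_verified: bool    # has a named analyst checked it against the sources?

def traceability_gaps(claims: list[Claim]) -> list[Claim]:
    """Return claims that fail a claim-level traceability rule: no source, or AI-drafted and unverified."""
    return [c for c in claims if not c.sources or (c.ai_generated and not c.analyst_verified)]
```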
The second signal is override rate: how often do analysts override, modify, or reject AI outputs, and what are the patterns in those overrides? A team where analysts never override AI outputs has automation bias, not a perfect AI. A team where analysts override everything doesn't trust the tool and probably shouldn't be using it. The healthy signal is a stable, documented override rate with clear patterns — the tool is reliable in these contexts and unreliable in these others — which provides feedback to the engineering team about where to improve the model and gives analysts a calibrated map of where to apply additional scrutiny.
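Override patterns only become a usable signal if they are logged with context. A minimal sketch, assuming each analyst action on an AI output is recorded with the context it occurred in and whether the output was accepted, modified, or rejected; the flagging thresholds are assumptions to be calibrated, not standards.

```python
from collections import defaultdict

def override_rates(log: list[dict]) -> dict[str, float]:
    """Share of AI outputs modified or rejected, broken out by context.

    Each log entry is assumed to look like:
        {"context": "farsi_social_media", "action": "accepted" | "modified" | "rejected"}
    """
    totals, overrides = defaultdict(int), defaultdict(int)
    for entry in log:
        totals[entry["context"]] += 1
        if entry["action"] in ("modified", "rejected"):
            overrides[entry["context"]] += 1
    return {ctx: overrides[ctx] / totals[ctx] for ctx in totals}

def flag_extremes(rates: dict[str, float], low: float = 0.02, high: float = 0.90) -> dict[str, str]:
    """Near-zero override rates suggest automation bias; near-total rates suggest a tool nobody trusts."""
    return {
        ctx: ("possible automation bias" if r < low else "tool likely not trusted")
        for ctx, r in rates.items()
        if r < low or r > high
    }
```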
The third signal is drift detection: is the AI's performance on the team's actual work degrading over time, and does the team know it? Data drift occurs when the statistical properties of inputs change; concept drift occurs when the relationship between inputs and correct outputs changes. In analytic contexts, concept drift is the more dangerous variety. A model trained on pre-2024 open-source data about Iranian nuclear facilities will produce systematically different outputs about those facilities after Operation Epic Fury than before it. Whether those outputs are accurate depends on whether the model's training has caught up to the operational reality. The team needs mechanisms to detect that gap, which requires analysts to document cases where AI output diverged from their ground truth judgment — not just cases where the AI was wrong.
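Data drift is the easier of the two to instrument. The sketch below compares a current window of a numeric input feature against a reference window using a two-sample Kolmogorov–Smirnov test; the significance threshold is an assumption to be calibrated, and concept drift still depends on analysts documenting where outputs diverged from their ground truth.

```python
from scipy.stats import ks_2samp

def data_drift_alert(reference: list[float], current: list[float], p_threshold: float = 0.01) -> bool:
    """Flag drift when the current window's distribution differs significantly from the reference window."""
    result = ks_2samp(reference, current)
    return result.pvalue < p_threshold
```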
The fourth signal is the automation bias audit: a structured review, conducted periodically, in which a sample of AI-assisted products is compared against the original AI outputs to determine how much the analyst changed. If the final products look essentially identical to what the AI generated — if the analyst's edits were cosmetic rather than substantive — that is an organizational red flag. It may mean the tool is genuinely excellent. More often, it means analysts have lost the habit of substantive evaluation because the AI output always looks confident and well-formatted.
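The audit can start with something as blunt as measuring how much text survives unchanged from AI draft to finished product. A minimal sketch using the standard library's difflib; treating less than roughly 5% change as cosmetic is an assumption, not a tradecraft standard.

```python
import difflib

def edit_fraction(ai_draft: str, final_product: str) -> float:
    """Fraction of the text that changed between the AI draft and the finished product."""
    similarity = difflib.SequenceMatcher(None, ai_draft, final_product).ratio()
    return 1.0 - similarity

def audit_sample(pairs: list[tuple[str, str]], cosmetic_threshold: float = 0.05) -> float:
    """Share of products whose edits were cosmetic (less than ~5% of the text changed)."""
    cosmetic = sum(1 for draft, final in pairs if edit_fraction(draft, final) < cosmetic_threshold)
    return cosmetic / len(pairs) if pairs else 0.0
```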
Red flags that warrant immediate intervention: analysts who cannot explain how the AI tool they use daily makes its decisions; a complete absence of documented overrides or corrections; AI outputs included in finished intelligence products without analyst-authored sourcing statements; any workflow where an AI's output is the only input to a consequential judgment rather than one of several inputs that a human is synthesizing.
The Maven example clarifies what happens when evaluation infrastructure doesn't keep pace with operational deployment. As the U.S. continued strikes on Iran as part of Operation Epic Fury, speakers at Palantir's AIPCON event said Maven had shortened the time to select and hit targets. "So we've gone from identifying the target to now coming up with a course of action, to now actioning that target, all from one system. This is revolutionary," said Cameron Stanley, chief digital and artificial intelligence officer for the DoD. "We were having this done in about eight or nine systems where humans were literally moving detections left and right in order to get to our desired end state." That consolidation of nine systems into one is a genuine organizational achievement. But it concentrates risk. As Palantir architect Chad Wahlquist noted: "I saw stats where normally we would have 2,000 intelligence officers trying to do targeting and look at stuff. Now that's 20 and they're doing it in rapid succession as well."
Two thousand analysts to twenty. A 99% reduction in human review. The evaluation question — what are those twenty people able to catch that the two thousand would have caught, and what are they missing? — cannot be answered after the strikes have already happened.
The evaluation infrastructure needs to be built before the system is deployed at scale, not after.
The Decision You Are Now Equipped to Make
The organizational design question for AI-enabled analysis is a question about who holds what kind of knowledge, where authority over consequential judgments must sit, and how to build the structures that ensure those two things stay aligned as AI capability increases.
The specific decisions this analysis equips you to make: whether your current team structure places the people with mission context in genuine supervisory authority over AI outputs, or has them functionally subordinate to automated pipelines they don't understand; whether your centralized AI function is building toward shared knowledge or building toward a moat; whether your training investment addresses both directions of the knowledge gap or only teaches analysts to use tools without teaching engineers what analysis is for; whether you have the evaluation signals in place to detect drift, automation bias, and production quality degradation before they become organizational failures.
The CIA's Deputy Director told the Special Competitive Studies Project audience in April 2026 that the agency's AI tools will help draft key judgments, edit for clarity, and compare drafts against tradecraft standards — but will not decide. That sentence is the design specification. Build everything around it. Every handoff protocol, every training program, every evaluation metric should be asking: does this preserve the condition in which the people with context are the people making the call?
The organizations that get this right will look, from the outside, like they've adopted AI more slowly than their peers. Their analysts will be slower to automate and faster to question. Their engineers will know more about intelligence tradecraft than seems strictly necessary. Their evaluation processes will impose friction that the most AI-enthusiastic voices on their teams will want to remove.
That friction is the point.
It is the organizational expression of the principle that judgment — the kind that can account for what a model cannot know — belongs with people who have earned it. The organizations that get this wrong will be faster, briefly. Impressive at scale, measurably. And they will eventually produce something like a school in Minab on a targeting list — a failure that no one in the organization can fully explain, because the architecture distributed responsibility so effectively that no human being was in a position to catch what the machine missed.