Module 8, Episode 1: From One-Off Prompts to Analytic Agents
The Distance Between a Prompt and a Workflow
Most analysts who have spent time with a frontier model — Claude, GPT-5, Gemini — have experienced a particular frustration. The model is clearly capable. It reasons well. It handles ambiguity without panicking. But every session starts from nothing. The context you laboriously assembled yesterday is gone. The source documents you pasted in have to be pasted again. The hypothesis you were testing has to be re-explained. The task you needed a tool for — a geocoding lookup, a company registry search, a translation — has to be done by hand, outside the model, and the result pasted back in. Every interaction is a fresh negotiation with a system that has no memory of the last one.
This is an architecture problem, not a model capability problem. The distance between that frustrating workflow and an analytic agent that remembers context, queries external tools, maintains a structured evidence record, and applies something resembling a structured analytic technique (SAT) to that evidence — that distance is measurably smaller than most analysts believe. Closing it is no longer a software engineering project. It is an analyst design problem. The analyst decides what the workflow looks like. The framework executes it.
That reframing matters enormously. The standard mental model of AI in analytic work treats the model as an oracle you consult. You ask a question; it answers. You paste in a document; it summarizes. The interaction is episodic, single-turn, and fundamentally passive on the analyst's side. An agentic workflow inverts this. The analyst designs the process — what gets collected, in what order, against what criteria, with what human checkpoints — and the agent executes that process across multiple steps, accumulating and updating a shared record of what has been found and what it means. The model is not the answer machine. It is the reasoning engine inside a workflow that the analyst architects.
This episode is about what that architecture looks like in practice: what state and memory mean in an agent context, how ReAct-style reasoning loops (an approach that interleaves a model's thinking with its actions, so each step informs the next) map onto the procedural logic analysts already use, how SATs become workflow specifications, and what tool choices are available to analysts who want real open-source intelligence (OSINT) and research capability plugged into their pipelines. It is also about where the boundary sits — what the agent executes, and what judgment the analyst must retain.
What ReAct Does, Mechanically
The original ReAct paper, published by Shunyu Yao and colleagues in 2022, proposed something conceptually simple: rather than separating the reasoning a language model does from the actions it takes, interleave them. Let the model think about what to do, do it, observe the result, think again, and loop. Reasoning traces help the model induce, track, and update action plans and handle exceptions, while actions allow it to interface with external sources such as knowledge bases or environments to gather additional information.
That interleaving is not merely an engineering convenience. It changes the epistemic structure of what the model is doing. A chain-of-thought prompt that generates a long reasoning trace before producing an answer is doing all its thinking with the information already in context. It cannot discover that a critical piece of evidence is missing, retrieve it, and revise. A ReAct-style agent can. ReAct can retrieve information to support reasoning, while reasoning helps to target what to retrieve next. The feedback is real: the observation from each action modifies the model's reasoning about the next step, which shapes what it retrieves, which shapes what it concludes.
Mechanically, a ReAct loop works as follows. The model receives an initial query and a description of available tools. It generates a thought — an explicit articulation of its current understanding and what it needs next. At each step, the agent doesn't just decide what to do; it first states why. That explicit articulation helps decompose complex tasks, track progress, handle exceptions, and dynamically adjust the plan based on intermediate outcomes. It then selects and calls a tool — a search, a database query, a document retrieval — and receives an observation back. That observation enters the context. The next thought takes it into account. The loop continues until some stopping condition is met: enough evidence has accumulated, the model reaches a conclusion above some confidence threshold, a maximum iteration count is hit, or the analyst intervenes.
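To make that loop concrete, here is a minimal sketch in plain Python. The `call_model` function and the tool registry are hypothetical placeholders for whatever model interface and tools an analyst has configured; the point is the shape of the loop, the bounded iteration count, and the explicit stopping condition, not any particular framework's API.

```python
# Minimal ReAct-style loop: thought -> action -> observation, repeated until a stop condition.
# `call_model` and `tools` are hypothetical placeholders, not a specific framework's API.

MAX_STEPS = 8  # bounded iterations guard against runaway loops and runaway API costs

def react_loop(question: str, tools: dict, call_model) -> dict:
    transcript = [f"Question: {question}"]
    for _ in range(MAX_STEPS):
        # The model returns an explicit thought plus either a tool call or a final answer.
        decision = call_model("\n".join(transcript), list(tools))
        transcript.append(f"Thought: {decision['thought']}")

        if decision["action"] == "finish":            # explicit stopping condition
            return {"answer": decision["answer"], "trace": transcript}

        observation = tools[decision["action"]](**decision["action_input"])
        transcript.append(f"Action: {decision['action']}({decision['action_input']})")
        transcript.append(f"Observation: {observation}")

    # Iteration cap hit: return the partial trace for human review rather than looping on.
    return {"answer": None, "trace": transcript}
```

The transcript this loop accumulates is the same auditable reasoning trace discussed below: every thought, tool call, and observation is recorded in order.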
The loop has important failure modes that practitioners need to understand before deploying it. The inherently non-deterministic nature of large language model (LLM) outputs, combined with iterative execution, can produce wildly inaccurate outputs or loops that never terminate — resulting in massive, unexpected API costs. This is not a theoretical concern. The OpenClaw incident (a widely reported April 2026 case in which an autonomous agent ran without cost controls) burned roughly $250,000 in tokens and illustrates what happens when autonomous agents run without custody controls and cost guardrails. For analytic applications, the design discipline is to keep loop iteration counts bounded, define explicit stopping conditions, and build human checkpoints at consequential decision nodes.
The practical implication for analysts is that ReAct is not magic autonomy. It is a structured discipline for multi-step research that explicitly records its reasoning at each step. That property — the explicit reasoning trace — is what makes it useful for intelligence work, because it means the workflow produces something auditable. You can see why the model queried what it queried, what it found, and how that shaped its next thought. The process is documented in a way that a human reviewer can inspect.
SATs as Workflow Specifications
The structured analytic techniques the intelligence community formalized over the past three decades — Analysis of Competing Hypotheses (ACH), Key Assumptions Check, red teaming, indicator validation — are, at their core, procedural specifications for how to reason under uncertainty. They impose sequence. They enforce explicit evidence registration. They require the analyst to do specific cognitive operations in a defined order, precisely because those operations counteract the cognitive biases that unstructured thinking allows. ACH, developed by Richards Heuer at CIA over decades, structures the analyst's task as systematic hypothesis elimination rather than progressive confirmation: ACH shifts the analytical focus from proving a favored hypothesis to disproving less likely alternatives, ensuring that conclusions are reached through elimination rather than assumption.
That procedural structure is precisely what makes SATs translatable into agent workflows. A workflow is a procedure. SATs are procedures. The question is whether the steps of a given SAT can be specified clearly enough that an agent can execute them — and for most standard SATs, the answer is yes, with important caveats about where human judgment is irreplaceable.
Consider ACH in detail. The analyst follows seven stages: Hypotheses, Evidence, Diagnostics, Refinement, Inconsistencies, Sensitivity, and Conclusions and Evaluation. Map these to agent operations. The Hypotheses stage maps to a structured generation step: given a problem statement and initial context, the model generates a set of competing explanations and outputs them as a structured list. The Evidence stage maps to a retrieval-augmented generation (RAG)-backed retrieval loop: the agent queries document collections, OSINT APIs, and news sources to assemble evidence items relevant to each hypothesis, with each item stored in a structured record that includes source, date, and confidence assessment. The Diagnostics stage — determining which evidence is diagnostic, meaning which items discriminate between hypotheses — maps to a scoring step where the model evaluates each evidence item against each hypothesis and marks consistency, inconsistency, or neutrality. The Refinement stage, where low-diagnostic evidence is pruned and new hypotheses may be introduced, maps to a conditional branch in the workflow: if the model assesses that the current evidence doesn't discriminate sufficiently, it triggers additional collection rather than proceeding to conclusions.
Managing a 50-column ACH matrix manually is exhausting. An agent can maintain that matrix in a structured data object, update it with each new evidence item retrieved, and re-score it continuously without cognitive fatigue. That is precisely the bottleneck an agent workflow relieves.
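A sketch of what that structured object might look like, in plain Python with illustrative field names and the consistency values the technique prescribes:

```python
# Illustrative ACH matrix kept as a plain data structure rather than in the model's context.
# Scores follow the convention in the text: "consistent", "inconsistent", or "neutral".

ach_matrix = {
    "hypotheses": ["H1: commercial expansion", "H2: sanctions evasion", "H3: no expansion"],
    "evidence": {},   # evidence_id -> {"source": ..., "date": ..., "scores": {hypothesis: score}}
}

def register_evidence(matrix, evidence_id, source, date, scores):
    """Add or re-score an evidence item; called every time retrieval returns something new."""
    matrix["evidence"][evidence_id] = {"source": source, "date": date, "scores": scores}

def is_diagnostic(item):
    """An item is diagnostic if it discriminates between hypotheses (scores are not all identical)."""
    return len(set(item["scores"].values())) > 1

register_evidence(
    ach_matrix, "E17", source="registry filing", date="2026-03-14",
    scores={"H1: commercial expansion": "consistent",
            "H2: sanctions evasion": "neutral",
            "H3: no expansion": "inconsistent"},
)

flagged_for_review = [eid for eid, item in ach_matrix["evidence"].items() if is_diagnostic(item)]
```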
The caveats are significant, and the research on ACH effectiveness should give practitioners pause. A study by Dhami and colleagues that randomized fifty intelligence analysts into ACH-trained and control groups found that ACH-trained analysts did not follow all of the steps of ACH. There was mixed evidence for ACH's ability to reduce confirmation bias, and it may increase judgment inconsistency and error. The finding is not a reason to abandon ACH in agent workflows; it is a reason not to treat the agent's output as a bias-corrected answer. The workflow enforces procedural discipline. It does not guarantee correct analysis. The agent can score evidence incorrectly. It can generate hypotheses that omit the correct explanation. It can fail to weight deception indicators.
The evidence used in an ACH matrix is static — a snapshot in time. In intelligence work, the opponent is intelligent and may be generating information intended to deceive. No ReAct loop catches active deception unless the analyst has explicitly designed a deception-detection step — which maps, in turn, to a structured red-team sub-agent or an adversarial prompt that asks the model to argue against its current leading hypothesis.
The Indicator Validity check and Key Assumptions Check map similarly. Indicator validation is a collection-and-scoring loop: retrieve current reporting against each defined indicator, score presence or absence, update a running indicator matrix. Key Assumptions Check is a structured interrogation step: given a preliminary conclusion, the agent generates explicit assumptions underlying that conclusion and then tests each one against the evidence. The analyst who has designed this workflow has not automated their analysis. They have automated the procedural scaffolding that keeps analysis honest.
An Example Agent Recipe: Background Build, Hypothesis Assessment, Indicator Scoring
A concrete workflow makes the architecture visible. Consider this analytic question, of the type that appears routinely in corporate intelligence, national security analysis, and investigative journalism alike: Is Actor X preparing to expand operations in Region Y, and if so, through what mechanism?
This is not a lookup question. It requires background research, hypothesis generation, evidence collection, and structured assessment. It is exactly the type of question where a one-off prompt fails and an analytic workflow succeeds.
Step one: RAG background build. The workflow begins with a retrieval step. The agent is given access to a curated document collection — prior reporting, entity profiles, financial filings, news archives — and executes a hybrid retrieval pass against the question. Hybrid retrieval matters here. Keyword search using BM25 (a standard ranking algorithm that scores documents by term frequency relative to corpus length) catches exact name matches and specific terminology that semantic embedding might miss. Semantic vector search catches conceptually related material even when phrasing varies. A reranking model then scores and orders the retrieved chunks by relevance to the specific query. The output of this step is a structured evidence package: roughly fifteen to twenty document passages, each tagged with source, date, and relevance score, injected into the agent's context as the factual substrate for all subsequent reasoning.
This retrieval step is not passive. The agent reasons about what to retrieve. It might generate multiple search queries — one targeting the actor's known financial patterns, one targeting regional infrastructure indicators, one targeting regulatory filings — and execute them in parallel, then merge results. The iteration count is bounded, but the search space is broader than any single human query would cover in the same time.
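A sketch of that fan-out-and-merge step, assuming a hypothetical `search` function standing in for whatever retrieval backend is configured:

```python
# Sketch of the background-build step: fan out several targeted queries, then merge and
# deduplicate the results into a structured evidence package. `search` is a placeholder
# for the configured retrieval backend (vector store, news API, registry search, etc.).

def build_evidence_package(search, queries, per_query=10, cap=20):
    seen, package = set(), []
    for q in queries:
        for hit in search(q, top_k=per_query):        # hit: {"text", "source", "date", "score"}
            key = (hit["source"], hit["text"][:80])    # crude dedup across overlapping queries
            if key in seen:
                continue
            seen.add(key)
            package.append({"query": q, **hit})
    # Keep the most relevant passages as the factual substrate for subsequent reasoning steps.
    package.sort(key=lambda h: h["score"], reverse=True)
    return package[:cap]

queries = [
    "Actor X financial transfers Region Y",
    "Region Y port and logistics infrastructure investment",
    "Actor X regulatory filings subsidiaries Region Y",
]
```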
Step two: Structured hypothesis assessment. With the evidence package assembled, the agent executes an ACH-structured hypothesis generation and scoring pass. It outputs three to five competing hypotheses about Actor X's intentions, each stated as a falsifiable claim. It then scores each hypothesis against the evidence package: consistent, inconsistent, or neutral. The output is a scored matrix, stored as a structured data object in the workflow's state — not prose that might be misread, but a machine-readable record that a human analyst can inspect, modify, and challenge. The model marks which evidence items are most diagnostic: the items that discriminate most sharply between hypotheses are flagged for human review, because those are the items whose reliability matters most to the conclusion.
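One way to keep that output machine-readable is to validate the model's response against an explicit schema and reject anything that fails validation. A sketch using Pydantic, with illustrative field names:

```python
from typing import Literal
from pydantic import BaseModel

# Illustrative schema for the hypothesis-assessment output. Forcing the model's response
# into this shape, and rejecting anything that fails validation, is what turns prose into
# a record the workflow state can store, diff, and re-score.

class EvidenceScore(BaseModel):
    evidence_id: str
    hypothesis_id: str
    score: Literal["consistent", "inconsistent", "neutral"]
    rationale: str            # the model's one-sentence reasoning, kept for analyst review
    diagnostic: bool          # flagged when the item discriminates sharply between hypotheses

class HypothesisAssessment(BaseModel):
    hypotheses: list[str]     # each stated as a falsifiable claim
    scores: list[EvidenceScore]

raw_json = '{"hypotheses": ["H1: ..."], "scores": []}'   # stand-in for the model's output
assessment = HypothesisAssessment.model_validate_json(raw_json)
```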
The reasoning trace from this step is a workflow artifact. The analyst can read the model's explicit articulation of why evidence item seventeen is marked "inconsistent" with hypothesis three. If the reasoning is wrong — if the model has misread an ambiguous phrase, or failed to account for a known deception pattern — the analyst can correct it. The trace is not the analysis; it is the substrate for analyst review.
Step three: Indicator scoring. The workflow shifts from hypothesis assessment to forward-looking indicator tracking. The analyst has pre-specified a set of indicators: observable events whose presence or absence would update the probability of each hypothesis. The agent runs targeted searches against OSINT sources — corporate registry APIs, maritime tracking services like Kpler (a commodity and vessel tracking platform) or TankerTrackers (a service specializing in tanker vessel movements), news APIs, open-source financial data — and scores each indicator as present, absent, or inconclusive. Each scoring result updates the indicator matrix in the workflow state.
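A minimal sketch of that indicator matrix, keeping a dated scoring history so both the workflow and the analyst can see when an indicator changed state (indicator names and sources are illustrative):

```python
from datetime import date

# Illustrative indicator matrix: each pre-specified indicator keeps a dated scoring history
# so the workflow, and the analyst, can see when an indicator shifted state.

indicators = {
    "IND-01 new subsidiary registered in Region Y": [],
    "IND-02 chartering activity on Region Y routes": [],
    "IND-03 local hiring notices tied to Actor X":   [],
}

def score_indicator(indicators, name, status, source, when=None):
    assert status in {"present", "absent", "inconclusive"}
    indicators[name].append({"date": str(when or date.today()),
                             "status": status, "source": source})

def changed_since_last_run(history):
    return len(history) >= 2 and history[-1]["status"] != history[-2]["status"]

score_indicator(indicators, "IND-02 chartering activity on Region Y routes",
                "present", source="vessel-tracking API")
```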
Consider this example from OSINT analysts describing LangGraph-based (LangGraph is an open-source framework for building multi-step agent workflows) investigation workflows: "Analyse the last 30 days of activity for domain X. If peaks in subdomain creation coincide with negative forum mentions, correlate them with leaked credentials associated with the same CNPJ (Cadastro Nacional da Pessoa Jurídica, Brazil's national corporate registration number)." This is not simply data retrieval. It is the testing of a hypothesis. The logic applies equally to tracking corporate expansion patterns, monitoring sanctions compliance, or following a proliferation network.
This three-step recipe requires no sophisticated engineering. It requires no code beyond configuring a LangGraph graph with three node types: a retrieval node that calls a search API and vector store, a reasoning node that invokes the language model against structured prompts, and a scoring node that outputs structured JSON (JavaScript Object Notation, a standard data format) against a predefined schema. A determined analyst with two days of setup time can build a working version. The sophistication is in the analytic design: choosing the right hypotheses, defining genuinely diagnostic indicators, specifying evidence quality criteria.
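A skeletal version of that graph, using LangGraph's StateGraph API with the three node bodies left as stubs; in a working build they would call the search API and vector store, the language model, and a structured scoring prompt respectively:

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class State(TypedDict):
    question: str
    evidence: list
    matrix: dict
    needs_more_evidence: bool

def retrieve(state: State) -> dict:
    return {"evidence": state["evidence"]}               # stub: call search API + vector store

def assess(state: State) -> dict:
    return {"matrix": {}, "needs_more_evidence": False}  # stub: LLM hypothesis scoring

def score_indicators(state: State) -> dict:
    return {}                                            # stub: structured JSON indicator scoring

builder = StateGraph(State)
builder.add_node("retrieve", retrieve)
builder.add_node("assess", assess)
builder.add_node("score", score_indicators)
builder.add_edge(START, "retrieve")
builder.add_edge("retrieve", "assess")
# Refinement branch: loop back for more collection when the evidence isn't diagnostic enough.
builder.add_conditional_edges(
    "assess", lambda s: "retrieve" if s["needs_more_evidence"] else "score")
builder.add_edge("score", END)

graph = builder.compile()
result = graph.invoke({"question": "Is Actor X expanding in Region Y?",
                       "evidence": [], "matrix": {}, "needs_more_evidence": False})
```

The conditional edge is the Refinement stage from the ACH mapping above: the workflow decides, on the basis of its own scored state, whether to collect more or proceed.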
What State and Memory Actually Mean in Analytic Workflows
The words "state" and "memory" get used loosely in discussions of AI agents, and the looseness causes confusion about what agentic workflows can do. Precision here is worth the investment.
State, in the technical sense, is the structured data object that persists across the nodes of an agent workflow. LangGraph relies on a centralized state system that persists throughout the workflow, acting as shared memory accessible to all nodes for reading and updating, ensuring coordination across the workflow. In an analytic context, the state object might contain: the original intelligence question, the evidence package assembled by the retrieval node, the hypothesis matrix produced by the assessment node, the indicator scores produced by the scoring node, any tool call results accumulated during the run, and a log of the model's reasoning traces at each step. Every node reads from and writes to this object. The workflow accumulates knowledge as it progresses — not in the model's weights, which don't change, but in this structured record that each subsequent step can use.
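A state schema mirroring that list might look like the following. The Annotated reducers are the LangGraph convention for fields that accumulate across nodes rather than being overwritten on each update:

```python
import operator
from typing import Annotated, TypedDict

# One possible state schema for the workflow described above. Lists marked with operator.add
# are appended to by each node rather than replaced, so tool results and reasoning traces
# accumulate across the run.

class AnalyticState(TypedDict):
    question: str                                        # the original intelligence question
    evidence: Annotated[list[dict], operator.add]        # retrieved passages: source, date, text, score
    hypothesis_matrix: dict                              # hypotheses x evidence consistency scores
    indicator_scores: dict                               # indicator name -> dated scoring history
    tool_results: Annotated[list[dict], operator.add]    # raw outputs of every tool call
    reasoning_log: Annotated[list[str], operator.add]    # the model's thought at each step
```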
This is architecturally distinct from what happens in a conversation interface. In a chat session, the only "state" is the conversation history pasted into context. When that history overflows the context window, older material falls out. There is no persistent structured record, no indicator matrix, no scored hypothesis table for the model to reference. The agent workflow has all of these because the analyst designed them into the state schema.
One of LangGraph's most powerful features is its interrupt mechanism. You can pause execution at any node, persist the state, wait for human input — hours or days later — and resume exactly where you left off. For intelligence analysis, this is not a marginal feature. It is the feature that makes agentic workflows usable in practice. An analyst can initiate a background-build pass on Monday morning, review the retrieved evidence and hypothesis matrix at noon, correct any errors in the model's initial scoring, and let the indicator-tracking pass run overnight. The workflow persists. The state is checkpointed. The work accumulates.
Memory, distinct from state, refers to mechanisms for retaining information across separate workflow runs — between sessions, between days, between analysts. This is harder. Most deployed systems in 2026 handle it through a combination of persistent vector stores (where past findings are embedded and retrievable by semantic similarity) and structured databases (where specific facts about named entities, prior assessments, and historical indicator scores can be queried directly). Research on agent memory failures identifies a concrete pathology: even systems with large vector stores degrade over time because summary drift accumulates — each summarization pass loses some information, and across many sessions, the agent's effective memory of earlier work becomes unreliable. This is not a solved problem.
The practical mitigation is to maintain human-readable structured records — scored matrices, dated indicator logs, explicit assumption registries — outside the model context, so that a returning analyst or a resumed workflow is reading from a ground-truth record rather than a model's lossy memory.
The analyst who treats the agent's context window as the only storage medium will encounter the same frustrations that prompted the move away from one-off prompts. The state schema needs to be designed explicitly. What gets stored? In what format? Which fields are updated by which nodes? These are analytic architecture decisions, not engineering trivialities.
Tool Choices for Analysts: What Connects to What
The practical limiting factor in most analytic agent workflows is not the language model's reasoning quality — it is the tools available to it. ReAct is a conceptual framework for building AI agents that can interact with their environment in a structured but adaptable way, using an LLM as the agent's "brain" to coordinate anything from simple RAG retrieval to complex multi-agent workflows. The "environment" is only as rich as the tools the analyst has configured. A workflow with no external tools is just a very structured prompt chain. The analytic value comes from grounding reasoning in real-world data retrieved in real time.
The Model Context Protocol (MCP), donated to the Linux Foundation in late 2025 as a vendor-neutral standard, has become the practical connection layer between agents and external tools. It defines how an agent discovers what tools are available, how it calls them, and how it interprets their outputs — across model providers and deployment environments. The analytic significance is that the tool ecosystem is now standardized and portable. A tool built for use with Claude can, in principle, be used with GPT-5 or DeepSeek-V4 through the same protocol.
The tool categories worth building for intelligence analysis are distinct, and their characteristics matter.
Document retrieval. The most fundamental tool in any analytic workflow is access to a curated document corpus: prior assessments, source reporting, open-source news archives, regulatory filings, academic literature. A hybrid retrieval setup — BM25 keyword matching combined with dense vector search over embedded document chunks, followed by a cross-encoder reranker (a small model trained specifically for relevance scoring) — is the current production standard. Keyword search handles proper nouns, entity names, and technical terms that semantic models sometimes miss. The reranker re-orders combined results before injection into the agent's context. The practical setup involves a vector database (Pinecone, Weaviate, or Qdrant are common choices), an embedding model, and a reranking model — plus the ingestion pipeline that keeps the document collection current.
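A simplified sketch of the hybrid scoring logic, fusing BM25 scores from the rank_bm25 package with dense cosine similarity; `embed` stands in for whatever embedding model is deployed, and the cross-encoder rerank is left as a placeholder:

```python
import numpy as np
from rank_bm25 import BM25Okapi

# Sketch of a hybrid retrieval pass: sparse BM25 scores fused with dense-vector similarity.
# The weighted-sum fusion is purely illustrative; production systems often use reciprocal
# rank fusion or a learned combiner instead.

def hybrid_search(query: str, docs: list[str], embed, top_k: int = 20, alpha: float = 0.5):
    # Sparse side: classic BM25 over whitespace-tokenized documents.
    bm25 = BM25Okapi([d.lower().split() for d in docs])
    sparse = np.array(bm25.get_scores(query.lower().split()))

    # Dense side: cosine similarity between query and document embeddings.
    doc_vecs = np.array([embed(d) for d in docs])
    q_vec = np.array(embed(query))
    dense = doc_vecs @ q_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec) + 1e-9)

    # Normalize each signal to [0, 1] before fusing so neither dominates by scale alone.
    def norm(x):
        return (x - x.min()) / (x.max() - x.min() + 1e-9)
    fused = alpha * norm(sparse) + (1 - alpha) * norm(dense)

    ranked = sorted(zip(docs, fused), key=lambda p: p[1], reverse=True)[:top_k]
    # A cross-encoder reranker would re-order `ranked` here before injection into context.
    return ranked
```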
OSINT APIs. The tool surface for open-source intelligence has expanded substantially. For corporate structures and financial networks: OpenCorporates provides programmatic access to company registry data across 140 jurisdictions. For domain and infrastructure intelligence: Shodan (a search engine for internet-connected devices) exposes infrastructure fingerprints; VirusTotal (a malware and URL analysis service) and Censys (an internet-wide scan platform) provide additional network intelligence. For maritime tracking: Kpler and TankerTrackers offer vessel position and cargo data that commercial intelligence teams have relied on heavily — as April 2026 Hormuz reporting demonstrated, when these firms supplied near-real-time transit corridor data that outpaced official reporting. For news and media: NewsAPI and GDELT (the Global Database of Events, Language, and Tone — a project that monitors news media in over 100 languages) provide programmatic access to global news coverage, with GDELT particularly valuable for volume-based signal detection across thousands of sources simultaneously. Each of these can be wrapped as a tool in a ReAct workflow, callable by the agent as a named function with structured inputs and outputs.
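The wrapping itself is mundane: a named function with structured inputs and outputs. The sketch below uses a placeholder endpoint; the URL and response fields are hypothetical, not any particular registry's actual schema:

```python
import requests

# Illustrative tool wrapper: a named function with structured inputs and outputs that a
# ReAct agent can call. The endpoint and response fields are placeholders.

REGISTRY_URL = "https://api.example-registry.org/companies/search"   # hypothetical endpoint

def search_company_registry(name: str, jurisdiction: str | None = None) -> list[dict]:
    """Tool: look up companies by name, optionally filtered by jurisdiction."""
    params = {"q": name}
    if jurisdiction:
        params["jurisdiction"] = jurisdiction
    resp = requests.get(REGISTRY_URL, params=params, timeout=30)
    resp.raise_for_status()
    # Normalize to the structured record the workflow state expects.
    return [
        {"name": c.get("name"), "number": c.get("number"),
         "jurisdiction": c.get("jurisdiction"), "status": c.get("status")}
        for c in resp.json().get("results", [])
    ]
```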
Geocoding and geographic enrichment. Geographic disambiguation is a persistent problem in OSINT work. Entity names appearing in documents need to be resolved to specific locations. The Palantir AIP (Artificial Intelligence Platform) community registry includes a production-ready package to geocode addresses in bulk using Nominatim (an open-source geocoding tool built on OpenStreetMap data). Beyond Nominatim, the Google Geocoding API provides additional options at varying cost points. For analysts working with conflict mapping, sanctions compliance, or supply chain tracing, geocoding as an agent tool means the workflow can convert location strings in documents to coordinates, cluster events by geography, and flag spatial patterns that text-only analysis misses entirely.
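As an agent tool, the geocoding call is a few lines, shown here via geopy's Nominatim wrapper. The public Nominatim endpoint is rate-limited, so production use means a self-hosted instance or a commercial geocoder; the tool shape is the same either way:

```python
from geopy.geocoders import Nominatim

# Geocoding as an agent tool, via geopy's Nominatim wrapper (OpenStreetMap data).

geolocator = Nominatim(user_agent="analytic-workflow-demo")   # identify your application

def geocode_location(location_string: str) -> dict | None:
    """Tool: resolve a location string from a document to coordinates."""
    result = geolocator.geocode(location_string)
    if result is None:
        return None   # ambiguous or unknown locations go back to the analyst, not guessed
    return {"query": location_string, "address": result.address,
            "lat": result.latitude, "lon": result.longitude}
```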
Graph database queries. Network analysis — entity relationships, corporate ownership chains, financial flows, co-offender networks — is where graph query tools become essential. Neo4j's Cypher query language (a declarative language for querying graph databases, analogous to SQL for relational databases) can be wrapped as an agent tool, allowing the model to formulate queries against a pre-populated entity graph. At the Nodes 2025 conference, GraphAware (a Neo4j consulting firm) presented a system combining Neo4j graph analysis with LangGraph agents to study criminal networks. The pipeline converts public police reports into co-offense graphs. Community detection algorithms such as Louvain and Label Propagation identify clusters. Specialized agents then analyze demographics, temporal activity, and geography. The workflow pattern is: document retrieval surfaces mentions of entities and relationships, the graph query tool checks whether those entities already exist in a known-entity network and returns relationship context, and the reasoning node synthesizes both. The agent is not doing graph analysis; it is querying the graph as a knowledge source and reasoning about what it finds.
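A minimal version of such a tool, using the official Neo4j Python driver with a fixed, parameterized Cypher template. Constraining the model to supplying parameters, rather than writing arbitrary queries, limits the damage a malformed request can do; connection details are placeholders:

```python
from neo4j import GraphDatabase

# Illustrative graph-query tool: the agent supplies an entity name, the tool runs a fixed,
# parameterized Cypher query and returns relationship context from the pre-populated graph.

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))  # placeholders

def entity_relationships(entity_name: str, limit: int = 25) -> list[dict]:
    """Tool: return known relationships for a named entity."""
    cypher = (
        "MATCH (e:Entity {name: $name})-[r]-(other:Entity) "
        "RETURN type(r) AS relation, other.name AS counterpart LIMIT $limit"
    )
    with driver.session() as session:
        records = session.run(cypher, name=entity_name, limit=limit)
        return [{"relation": rec["relation"], "counterpart": rec["counterpart"]}
                for rec in records]
```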
Translation. For analysts working in multilingual source environments — which in practice means almost all national security and corporate intelligence analysts — translation as a tool removes a chronic workflow bottleneck. The DeepL API and Google Cloud Translation API can both be configured as agent tools. The agent calls translation only when needed: when a retrieved document is in a language other than the working language, translation is triggered before the passage is used in the reasoning step. Machine translation quality for major languages is now high enough for analytic use, with the caveat that technical terms, proper nouns, and intentional ambiguity in source text require human verification.
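A sketch of that conditional trigger, using the DeepL Python client with a lightweight language check; the API key is a placeholder, and the machine-translation flag is carried forward so downstream steps know the text needs verification:

```python
import deepl
from langdetect import detect

# Conditional translation: translate only when a retrieved passage is not already in the
# working language. The auth key is a placeholder credential.

translator = deepl.Translator("YOUR_DEEPL_API_KEY")

def ensure_english(passage: str) -> dict:
    """Tool: return the passage in English, flagging machine-translated text for review."""
    if detect(passage) == "en":
        return {"text": passage, "machine_translated": False}
    result = translator.translate_text(passage, target_lang="EN-US")
    # The flag tells downstream steps (and the analyst) that technical terms, proper nouns,
    # and deliberately ambiguous phrasing still need human verification.
    return {"text": result.text, "machine_translated": True,
            "source_lang": result.detected_source_lang}
```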
Palantir AIP, for organizations operating within that platform, provides an alternative architecture for all of the above: it enables developers and builders to create LLM-backed workflows, agents, and applications using tools like AIP Chatbot Studio and AIP Logic. The Ontology-based architecture (a data model that maps enterprise data to named, typed objects with defined relationships) means tool connections to enterprise data sources are managed at the platform level rather than configured per-workflow, which reduces setup friction but introduces its own constraints on portability. For organizations already running Palantir Foundry, AIP is the path of least resistance. For everyone else, LangGraph with MCP-connected tools is the current production standard for analysts building custom workflows.
What the Analyst Designs vs. What the Agent Executes
Here the architecture question becomes a professional practice question, and the honest answer is more restrictive than the marketing surrounding agentic AI suggests.
The agent executes procedure faithfully and at scale. It will run the same retrieval query against thirty documents, score each one against each hypothesis, apply the same diagnostic criteria, and produce a consistent matrix — without the fatigue that causes a human analyst to start skipping steps halfway through a large evidence set. It will keep the state record current, update indicator scores when new data arrives, and flag when a previously neutral indicator has shifted to present. It will call the geocoding API on every location string in a document, not just the ones the analyst happens to notice. It will translate every foreign-language passage, not just the ones whose relevance is already obvious.
What the agent cannot do — and what the analyst must retain — is judgment about the problem structure itself.
The hypotheses the workflow assesses are the ones the analyst specified. If the analyst failed to include the correct explanation as one of the hypotheses, the workflow will score against the wrong set of alternatives and will never discover the error, because the error was in the design, not the execution. Intelligence analysts seldom face neat problems where all hypotheses are provided and are mutually exclusive, and where all relevant evidence is available and precisely quantified. Real problems are murky — there may be insufficient relevant data or overwhelming volumes of it, source credibility may vary, data may be formatted inconsistently, ambiguous, unreliable, and sometimes intentionally misleading, and time pressure is constant. No workflow specification removes that murkiness. The analyst's judgment about how to frame the problem — what questions matter, what hypotheses are plausible, what evidence is credible — precedes and constrains everything the agent does.
The indicators the workflow tracks are the ones the analyst defined. If the actor being tracked changes their modus operandi in a way the analyst didn't anticipate, the workflow will dutifully score the old indicators and miss the new pattern entirely. The OWASP Agentic AI Top 10 (the Open Worldwide Application Security Project's list of the ten most critical risks in autonomous AI systems, published December 2025) formalizes this as "specification failure" — the category of failure where the workflow executes exactly as designed but the design was wrong. This accounts for a substantial share of multi-agent failures in production systems. The failure is not technical. It is analytic.
The tool outputs the workflow uses are only as reliable as the underlying sources. An agent that calls a corporate registry API and gets clean, verified data is grounded. An agent that calls a news API and retrieves AI-generated disinformation has no way to detect that problem unless the analyst has built a source credibility assessment step into the workflow — which requires the analyst to have thought about credibility criteria in advance. The agent executes the step. The analyst designed it.
Palantir's AIP Analyst system shows its work. Every analysis creates an interactive dependency graph showing the flow from question to answer. Users can see exactly how the agent reasoned through their request, inspect intermediate results, and manually adjust steps. That auditability is not cosmetic. It is the mechanism by which the analyst exercises judgment over what the agent executed. The workflow produces a reasoning trace, not a final answer. The analyst reads the trace, identifies the steps where the model's judgment was questionable, corrects the record, and decides whether the conclusion survives that correction. A Gartner survey finding that more than 40% of agentic AI projects will be canceled by end of 2027 — attributing failure not to technology but to governance decisions — is, in practice, a finding about analyst design. The workflows that fail are the ones where no one designed the human checkpoints.
A practical design rule emerges from this. Every analytic workflow should have explicit human checkpoint nodes at three high-stakes points: after the initial hypothesis generation (before the agent invests retrieval resources in the wrong hypothesis set), after the evidence package is assembled (before the agent scores evidence that the analyst hasn't reviewed for reliability), and before any output is delivered to a decision-maker (after the agent has completed its structured assessment but before the analyst endorses it). LangGraph's interrupt mechanism makes these checkpoints technically trivial to implement. The analytic discipline to use them is not trivial. It requires treating the agent as a capable subordinate whose work is worth reviewing, rather than an oracle whose outputs can be passed upstream without inspection.
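In LangGraph terms, those checkpoints are a compile-time setting. The sketch below assumes a four-node graph with stub node bodies: execution pauses before each listed node, the state is checkpointed, and the run resumes under the same thread identifier once the analyst has reviewed and, where necessary, corrected the record:

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import MemorySaver

class S(TypedDict):
    question: str
    hypotheses: list
    evidence: list
    matrix: dict

def stub(state: S) -> dict:
    return {}          # placeholder: each real node would return its state updates here

builder = StateGraph(S)
for name in ["generate_hypotheses", "assemble_evidence", "assess", "draft_output"]:
    builder.add_node(name, stub)
builder.add_edge(START, "generate_hypotheses")
builder.add_edge("generate_hypotheses", "assemble_evidence")
builder.add_edge("assemble_evidence", "assess")
builder.add_edge("assess", "draft_output")
builder.add_edge("draft_output", END)

graph = builder.compile(
    checkpointer=MemorySaver(),            # swap for a persistent checkpoint store in production
    interrupt_before=["assemble_evidence", "assess", "draft_output"],   # the three checkpoints
)

config = {"configurable": {"thread_id": "actor-x-region-y"}}
graph.invoke({"question": "Is Actor X expanding in Region Y?",
              "hypotheses": [], "evidence": [], "matrix": {}}, config)

# The analyst reviews graph.get_state(config), corrects the record if needed, then resumes:
graph.invoke(None, config)                 # passing None continues from the saved checkpoint
```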
The forged Git commit incident reported in April 2026 — where security researchers demonstrated that Claude could be manipulated into approving hostile code when Git metadata was spoofed to impersonate a trusted maintainer — sharpens this point. The agent was doing exactly what it was designed to do: reviewing code changes against trusted-contributor criteria. No one had designed a cryptographic verification step. The human checkpoint that would have caught the spoofed identity wasn't there. The agent executed faithfully. The design was wrong.
The Skill Is Workflow Architecture
The gap between "using AI" and "building analytic workflows" is a design gap, not an engineering gap. It is the difference between knowing what question to ask and knowing how to structure a process that asks the right questions in the right order, accumulates evidence against a structured analytic framework, connects to the data sources where ground truth lives, and places human judgment at the nodes where it matters.
That design skill — specifying hypotheses, defining indicators, selecting tools, structuring state, positioning checkpoints — is now a core competency for analysts who want to operate at the capability frontier. The models are capable enough. The frameworks — LangGraph, LangChain (a library for chaining LLM calls into multi-step applications), CrewAI (an open-source framework for orchestrating multiple AI agents with defined roles) — are mature enough for production use. LangGraph is best for enterprises and technical teams that need durable, auditable, long-running agent workflows with precise control over execution order and error recovery. It reached version 1.0 in late 2025 and has become the default runtime for LangChain agents. The tool ecosystem is sufficiently broad that most of the external data sources analysts already use can be connected via MCP-compliant wrappers. The infrastructure problem is solved, or close enough.
What isn't solved — what cannot be solved by better models or more capable frameworks — is the analytic design problem. The analyst who can specify a clean ACH workflow, define genuinely diagnostic indicators, build credibility assessment into the tool call sequence, and position human checkpoints at the right nodes will get substantially more out of the current capability frontier than the analyst who is still treating each model interaction as an isolated prompt.
The agent executes procedure. The analyst designs it.
The question worth asking before starting any complex analysis is no longer what should I ask the model? It is: what process should I design, and where do I need to be in the loop?