M10E2: Evaluating AI-Augmented Analysis: A Practical Framework

Module 10, Episode 2: Evaluating AI-Augmented Analysis — A Practical Framework

The Illusion of a Score

On April 12, 2026, a team at UC Berkeley's Center for Responsible, Decentralized Intelligence published something the AI industry did not want to see: a proof that every major AI agent benchmark can be defeated without solving a single task. The researchers built an automated scanning agent that systematically audited eight prominent benchmarks — SWE-bench, WebArena, OSWorld, GAIA, Terminal-Bench, FieldWorkArena, and CAR-bench — and discovered that every single one could be exploited to achieve near-perfect scores through trojanizing test infrastructure, reading answer keys from config files, and using prompt injection on LLM (large language model) judges. The exploit on SWE-bench Verified, the benchmark most AI coding labs cite first in their press releases, required exactly ten lines of Python. A conftest.py file — a configuration script that intercepts test execution — intercepted pytest result objects during execution and forced every test to report as passing. The benchmark never detected the hook. No code was fixed. Score: 100%.

This is the starting condition for anyone trying to build a serious evaluation program for AI-augmented analysis: the standard metrics are broken, the leaderboards reflect noise as much as signal, and the organizations that claim otherwise have a financial interest in the confusion. Evaluation is entirely possible. But you have to build it yourself — designed around what your mission requires, not what a benchmark vendor decided to measure.

The core argument here is simple but consequential: rigorous evaluation for AI-assisted analysis does not require an ML background, a research team, or a relationship with a benchmark provider. It requires discipline about what you are trying to measure, and honesty about the difference between performance on a controlled task and performance in your operational environment. Most teams that struggle with AI evaluation are not struggling with statistics. They are struggling with the prior question: what does "good" look like for this mission, in this context, against these adversaries?

The Dimensions That Matter

Accuracy is the metric everyone reaches for first because it is the easiest to define. The system retrieved the right answer or it did not. The entity attribution was correct or it was wrong. Factual precision matters enormously in analytic work, where a hallucinated source, a fabricated date, or a misclaimed relationship can propagate through a product and survive review cycles because it sounds plausible. But accuracy is the floor, not the ceiling. A system that is highly accurate on the questions it can answer confidently, while silently failing on everything else, is operationally dangerous in ways that no accuracy metric will surface.

There are at least five additional dimensions that matter for analytic evaluation, and they interact in ways that accuracy alone cannot capture.

Calibration asks whether the system's confidence tracks its actual correctness — whether a claim offered at 80% confidence is right roughly 80% of the time across a large sample. This is distinct from accuracy. A system can be highly accurate on the questions it answers and still be catastrophically miscalibrated if it expresses high confidence on wrong claims and low confidence on correct ones. Miscalibration introduces systematic bias into the analytic product: the analyst who incorporates AI confidence levels into their own reasoning is being trained on a corrupted feedback signal.

Hypothesis coverage asks whether the system is surfacing the plausible hypotheses, not just the most salient ones. This is the Analysis of Competing Hypotheses (ACH) dimension of evaluation. A system that consistently anchors on one narrative frame — the dominant interpretation, the highest-frequency pattern in its training data — will produce analysis that is accurate within that frame while missing the alternative that turns out to be correct. For competitive intelligence teams and strategic warning analysts, incomplete hypothesis coverage is a more dangerous failure mode than factual error, because it is invisible and produces false confidence.

Source diversity asks whether the system is drawing on genuinely distinct evidentiary bases or recombining the same few high-visibility sources in different syntactic configurations. An AI-augmented analysis that cites fifteen sources has done nothing for you if those fifteen sources are all downstream of the same three wire services reporting on the same official statements. Breadth of sourcing signals analytical reach; its absence signals that the system is optimizing for fluent synthesis of available material rather than genuine collection.

Transparency and reasoning legibility asks whether you can inspect the inferential steps the system took to reach a conclusion. This matters not because analysts need to understand the weights in a transformer, but because an analytical product whose reasoning cannot be reconstructed cannot be dissented from. If you cannot see how the system got from evidence to judgment, you cannot identify where its reasoning diverged from yours, which means you cannot provide meaningful oversight and cannot update appropriately when it is wrong.

Novelty of connections is the hardest dimension to operationalize and the one that most distinguishes genuinely capable AI-augmented analysis from fast, expensive text retrieval. Does the system surface relationships, anomalies, or framings that the analyst would not have reached through their own research? That is the "so what" of augmentation: not whether the AI can synthesize faster than a human, but whether the human-AI system as a whole produces qualitatively different analysis than the human working alone.

None of these dimensions map cleanly to standard benchmark metrics. That is exactly the problem the Berkeley findings expose at the technical level — the vulnerabilities found across those benchmarks are not signs of incompetence but signs that adversarial evaluation robustness has not yet become standard practice in the field. For intelligence consumers and analytic managers, the equivalent insight is this: if you are evaluating AI-augmented analysis using the metrics that come preloaded in your vendor's dashboard, you are measuring what the vendor decided to measure, not what your mission requires.

Designing the Eval Set

The first decision in building a mission-specific evaluation is scope. What failure modes would be catastrophic for your specific mission, and which would you accept? A corporate intelligence team tracking merger and acquisition risk has very different answers than a strategic warning shop covering great-power competition, which has different answers than an investigative journalism unit working on financial crime. The eval set has to be built backward from those answers.

For most analytic missions, the eval set should contain at least four categories of test cases, each designed to stress a different capability dimension.

The first category is factual recall against a ground-truth corpus — cases where the right answer is unambiguous and verifiable. This establishes the accuracy baseline and identifies systematic hallucination patterns. Run these with source attribution required: the system should not just produce the correct answer but cite the document or passage that supports it. Verify the citations manually for a random sample. You will often find that the citations are real but the attributed claim does not appear in the cited source — a failure mode that accuracy metrics miss entirely because the answer is correct but the reasoning chain is fabricated.

The second category is calibration probes. Present the system with questions of known difficulty — a mix of items where the right answer is clearly supported by available evidence, items where the evidence is ambiguous, and items where available evidence points toward a wrong conclusion. Ask the system to provide probability estimates. Collect enough responses to compute Brier scores and build a reliability diagram. The Brier score is a strictly proper scoring rule that measures the accuracy of probabilistic predictions; for binary predictions it equals the mean squared error applied to predicted probabilities. A perfect score is zero; a score of 0.25 — what you get by assigning 50% to every binary question — is your baseline for an uninformative model. What you are looking for is not perfection but systematic patterns: is the system overconfident in a particular domain? Does it express appropriate uncertainty when evidence is thin? Does calibration degrade for claims about recent events not well represented in its training data?

The third category is hypothesis coverage stress tests — cases where you know from retrospective analysis that the correct answer was a non-obvious alternative hypothesis. Present the system with the same evidence base that was available at decision time. Evaluate whether the system includes the correct hypothesis among those it surfaces, and where that hypothesis ranks in its response. A system that surfaces the correct explanation only when prompted, or ranks it consistently lower than the dominant narrative, has a hypothesis coverage failure that no accuracy metric will catch. This is the single most valuable test for strategic warning applications, because it directly measures the system's susceptibility to the same confirmation bias pathways that plague human analysts.

The fourth category is source quality and diversity audits — automated tracking of which sources the system retrieves or relies upon across a representative sample of operational queries. This is partly about geographic and linguistic diversity (does the system perform differently on non-English sources?), partly about source tier (is it weighting official statements over primary documents?), and partly about the recency profile of retrieved evidence (is it anchoring on well-indexed material from years prior?). None of these measures requires human judgment to compute; they can be tracked automatically as part of a production evaluation pipeline.

Scenario exercises belong in a fifth, qualitative category — one that cannot be reduced to automated scoring. These are structured exercises where a subject matter expert constructs a synthetic or historical case, withholds the outcome, and asks the AI-augmented workflow to produce a finished analytical assessment. A panel of domain experts then evaluates the output against predefined rubrics covering hypothesis coverage, reasoning transparency, source quality, and calibration of expressed confidence.

The Berkeley team's work on open-world evaluations is instructive here: the vulnerabilities that allow benchmark gaming exist because the system's behavior during evaluation differs from its behavior during deployment. An agent that gets good enough at tool use may discover evaluation gaps on its own, without any deliberate instruction to cheat. Scenario exercises conducted without advance notice, on cases drawn from current operational problems, are the closest thing to genuine field conditions that a controlled evaluation can produce.

The whole point of this architecture is that it forces you to define what you are measuring before you measure it, rather than accepting someone else's definition. Even a fifty-task private benchmark on your real workload tells you more than any public leaderboard currently does.

Calibration Testing in Practice

The intelligence community has been doing probability elicitation longer than AI has been a serious enterprise. The superforecasters at Good Judgment Inc., trained on Philip Tetlock's work from the IARPA-sponsored (Intelligence Advanced Research Projects Activity) Good Judgment Project, demonstrated something important: calibration is a learnable, measurable skill, not a fixed trait. The evidence came from an exhaustive experiment over four years with more than 5,000 experts across the globe, ultimately identifying 260 superforecasters who produced difficult, mostly geopolitical, forecasts better than almost anyone in the world for nearly half a decade. The most adept amateur superforecasters were doing 30 percent better than professional intelligence officers with access to classified information. The skill differential was not about access to better information. It was about calibration discipline.

Superforecasters were those in the tournament who earned the lowest Brier scores. Keeping score was crucial because it provides feedback and therefore an opportunity to learn. That feedback loop — stated probability, resolved outcome, score computed — is exactly what most AI-augmented analytic workflows lack. The system produces an assessment, often with embedded confidence language ("likely," "almost certainly," "we assess with high confidence"), but no mechanism exists to track whether those assessments are systematically over- or under-confident, and no feedback reaches the system or its operators when the assessment is wrong.

Building calibration testing into your evaluation program means, concretely, requiring the system to express assessments in numeric probability terms — not as linguistic confidence labels — and then tracking outcomes. This is uncomfortable for several institutional reasons. Analysts often resist numeric probabilities because they create accountability in ways that hedged language does not. The words "probably" and "likely" are interchangeable in informal usage; 0.65 and 0.72 are not. Managers sometimes resist for the same reason.

But the Brier score directly assesses both the calibration and the resolution of probability estimates, making it central in the theory and practice of probabilistic forecasting. This is exactly the kind of accountability structure that enables meaningful evaluation of AI-augmented analysis.

A practical Brier score program does not require thousands of resolved forecasts. Thirty to fifty binary outcomes with associated probability estimates is enough to identify gross miscalibration. Collect the system's stated confidence on a set of factual queries with known answers. Compute the mean squared difference between stated probability and actual outcome (1 if correct, 0 if wrong). Plot the results as a reliability diagram: bin the probability estimates, and for each bin, plot the actual proportion of correct responses. A well-calibrated system produces a reliability curve that runs along the diagonal. Systematic deviation above the diagonal means the system is underconfident in that range; systematic deviation below means overconfidence.

What you will typically find — and this has been documented across multiple calibration studies of contemporary language models — is that LLMs tend to be overconfident on questions in their training distribution and poorly calibrated on out-of-distribution queries. The reliability component of Brier score decomposition isolates systematic bias and miscalibration; the resolution component indicates the gain in discriminability relative to baseline prediction. For intelligence applications, the out-of-distribution problem is particularly acute because the most analytically important questions — emerging threats, novel actors, unprecedented events — are by definition the questions least well represented in any training corpus.

The calibration failure mode that should concern analysts most is not the obviously wrong confident claim, which at least triggers review. It is the cluster of moderately confident wrong claims — assessments in the 65–80% confidence range that are correct only 40–45% of the time. These survive analytic review because they do not trip the threshold for challenge, they get incorporated into downstream products, and their error is discovered only in retrospective analysis, if at all. A Brier score program run over six months will reveal this pattern if it exists. Without it, you are flying blind on your system's most dangerous failure mode.

LLM-as-Judge: When It Works and When It Doesn't

The idea of using a language model to evaluate another language model's outputs has moved from academic curiosity to industry standard faster than most people realize. At 500x to 5,000x cost savings over human review while achieving 80% agreement with human preferences and matching human-to-human consistency, LLM-as-judge enables continuous quality monitoring and rapid iteration that was previously impossible. For high-volume evaluation tasks — checking whether a system's source citations are properly formatted, whether a response addresses all components of a query, whether the tone of a finished product is appropriate for the intended audience — LLM judges are faster, cheaper, and more consistent than human evaluation at scale.

The Berkeley team's findings expose the structural vulnerability that every practitioner needs to keep in front of them. CAR-bench relies heavily on LLM-as-judge evaluation, where an LLM reads the agent's conversation and scores it. The agent's messages are interpolated directly into the judge prompt with no sanitization. The exploit appended hidden instructions into the conversation, and the judge scored favorably. The attack surface here is not exotic: it is prompt injection into the evaluation pipeline, the exact vulnerability that intelligence professionals are supposed to watch for in adversarial information environments. When you deploy LLM-as-judge in an analytic evaluation context, you are creating a meta-system that can be manipulated at the evaluation layer — a concern that extends well beyond academic benchmark gaming into operational security.

The documented biases of LLM judges compound this problem. Research published at NeurIPS 2024 shows that LLM evaluators recognize and favor their own outputs, with a proven linear correlation between self-recognition capability and self-preference bias strength. A systematic study published at IJCNLP 2025 found that judge model choice has the highest impact on positional bias compared to task complexity, output length, or quality gaps. When researchers swapped answer positions, GPT-4's judgment flipped to favor the alternative. For analytic evaluation, the self-preference bias is especially dangerous: if you are using one model to evaluate outputs produced by that same model, you are encoding a circular preference into your quality signal.

The right frame is a structured hybrid that allocates each approach to the tasks where it performs best. LLM-as-judge evaluations show clear limitations for tasks requiring specialized knowledge, with agreement rates varying significantly across domains even when expert-persona LLMs are used. Source credibility assessment, hypothesis ranking, identification of reasoning gaps, and evaluation of whether an assessment is appropriate for the stated confidence level all require human domain expertise that current LLM judges cannot reliably replicate. You will not catch a system that has systematically misjudged source quality by running its outputs through another LLM.

A workable hybrid program looks like this. Deploy LLM judges for mechanical consistency checks — citation format, attribution claims, completeness of question-answering, tone and register appropriateness. Run these at volume and automatically. Reserve human evaluation for the calibration test cases, the hypothesis coverage stress tests, and any flagged outputs where the automated system indicates uncertainty. Build a "golden dataset" of analyst-reviewed cases with documented reasoning, and use these to periodically calibrate your LLM judge's alignment with human expert judgment. Human experts should create these golden datasets and establish evaluation criteria; according to Databricks best practices, human-created datasets can retrain LLM judges to improve performance. Treat the human-judge alignment score — how often the LLM judge agrees with the human expert panel on the same cases — as a system health metric that you track over time.

The practical discipline here is separating the question "did this output satisfy the formal requirements?" from the question "is this output analytically sound?" LLM judges can answer the first with high reliability. They cannot reliably answer the second, and treating them as if they can will systematically inflate your quality metrics while masking the analytical failures that matter most.

The Novelty Problem: Regurgitation vs. Insight

There is a question that evaluation frameworks consistently avoid because it is uncomfortable to formalize: does the AI-augmented workflow make the analysis better, or does it just make it faster? These are not the same question. Faster access to the same insights through higher-confidence channels is genuinely valuable — it changes what analysts can accomplish in a given time budget. But it is categorically different from a system that surfaces connections the analyst would not have found, or challenges a prevailing hypothesis with evidence the analyst had not encountered, or identifies the anomaly that breaks the pattern.

The intelligence community's historical concern about groupthink and mirror-imaging translates directly into this evaluation dimension. A system trained primarily on the publicly available English-language discourse surrounding a target problem will have absorbed the same consensus framings, the same prominent hypotheses, the same selection biases that affect the human analysts working the same accounts. Faster synthesis of that consensus is not independent analysis. It may be worse than no AI involvement at all, because it provides false confidence in a pre-existing consensus without genuinely stress-testing it.

Testing for novelty is methodologically difficult but not impossible. The core approach is retrospective: select a set of historical cases where the correct interpretation was non-obvious at the time, where the analytic consensus was wrong, or where the crucial signal was embedded in material that did not receive significant human attention. Present the AI-augmented workflow with the same evidence base that was available before the correct answer was known. Evaluate whether the system surfaced the signal that mattered.

This is the intelligence equivalent of the open-world evaluation methodology now gaining traction in AI research. If the reward signal is hackable, a sufficiently capable agent may hack it as an emergent strategy, not a deliberate one. As agents become more capable at reading documentation and writing scripts, they become more capable of spotting exactly these kinds of evaluation gaps. A system optimized on feedback from evaluations that reward consensus-consistent outputs will learn to produce consensus-consistent outputs. The novelty problem is not just a measurement challenge. It is an optimization problem, and if your evaluation framework does not reward surfacing uncomfortable alternatives, the system will not learn to surface them.

There are three concrete moves for operationalizing novelty testing. The first is the retrospective case method described above. This is labor-intensive but irreplaceable. It requires a panel of subject-matter experts who know the case outcome and can evaluate whether the system's output, presented with the pre-outcome evidence base, indicates genuine analytical reach. Run six to ten of these per quarter, selected to cover different mission areas and different types of analytic failure.

The second is source novelty tracking — automated measurement of whether the system's retrieved evidence overlaps heavily with the analyst's existing source diet. This requires logging, but it is mechanically straightforward: for each analytic task, track the set of sources the AI retrieves against the set of sources the analyst typically consults for this account. High overlap suggests the system is reflecting the analyst's existing worldview back at them. Low overlap is a necessary but not sufficient condition for genuine novelty — you still need human judgment to determine whether the novel sources are actually informative.

The third is diversity-of-hypothesis scoring. Present the system with an ambiguous scenario and evaluate the breadth and distinctiveness of the alternative hypotheses it generates. Compare this against a human analyst working the same scenario cold. The optimal analytical output is one that maximizes sharpness subject to calibration: among all calibrated assessments, the one providing the most informative probability assignments is preferred. The analytic equivalent is a system that generates sharply differentiated hypotheses with calibrated confidence — not a cluster of overlapping framings dressed in varied language.

This dimension connects back to the benchmark vulnerability findings directly. The current competitive landscape, driven by leaderboard rankings, may be incentivizing exploitation over genuine innovation. For intelligence managers procuring AI-augmented analytic tools, the equivalent incentive structure is real: vendors are rewarded for demos that impress, for synthesis that sounds authoritative, for dashboards that display clean metrics. They are not rewarded for surfacing the uncomfortable alternative that the analyst's organization has structural reasons to resist. Your evaluation program has to explicitly reward that, or it will not happen.

From Measurement to Management

Evaluation is not a one-time activity. A system that performs well on your initial eval set is making predictions about future performance, and those predictions need to be tracked. This is the distinction between evaluation as a gate and evaluation as a management instrument.

The gate model is what most organizations currently use: run an eval before deployment, accept or reject the system, then operate without systematic measurement. The management instrument model is what rigorous teams use: build continuous evaluation into the operational pipeline, track calibration and hypothesis coverage and source diversity as ongoing metrics, and treat degradation on any of these dimensions as a signal requiring investigation.

Human evaluations that reward confident-sounding language over accuracy create reinforcement learning from human feedback (RLHF) feedback loops where models learn to optimize for apparent authority rather than genuine calibration, compounding small biases across training iterations. The analytic equivalent is an evaluation program that rewards fluency and apparent authority over calibrated uncertainty and genuine coverage. If your evaluators — human or LLM — consistently rate confident-sounding outputs higher than appropriately hedged ones, your system will drift toward overconfidence. This is not theoretical. It is the mechanism by which analytic products become progressively more confidently wrong over time.

Palantir AIP (Artificial Intelligence Platform) offers one model: unit-test-style evaluation that enables iterative improvement and regression detection, tracking the impact of prompt changes, model swaps, or tool additions empirically. The principle is right even if the specific implementation is not available to every team. Every change to the underlying system — model update, prompt revision, new retrieval corpus, new workflow step — should be followed by a re-run of the core eval suite to detect regression. This is not a high technical bar, but it requires institutional commitment: someone has to own the eval suite, maintain the ground-truth cases, and have authority to flag when performance has degraded.

The FDA's seven-step credibility assessment framework for AI models embeds this principle in its first step: defining the context of use before defining the evaluation. High-stakes domains — legal, medical, safety-critical decisions — require specialized expertise requirements where judges may hallucinate. Frontier model evaluation, bias measurement, deterministic checks, and exact-matching tasks all have different appropriate evaluation approaches. The intelligence analytic context is a high-stakes domain by this definition: assessments inform decisions with real consequences for real people, and the cost of systematic miscalibration is not an embarrassing quarterly report but a policy decision made on corrupted evidence.

The eval set you build should be treated as a living document, not a compliance artifact. When your system produces an output that surprises a domain expert — in either direction, better than expected or worse — that output belongs in the eval set. When a case resolves in a way that contradicts a high-confidence AI assessment, that case belongs in the calibration data. When a scenario exercise reveals a hypothesis coverage gap that automated metrics missed, that gap should be formalized as a test case. The eval set is your organization's accumulated understanding of where this system fails, and it should grow with every operational cycle.

The analysts and managers working through this will face an institutional challenge that is not primarily technical: building and maintaining an evaluation program requires sustained effort by people already carrying operational load, in service of a measurement function whose value is invisible when it is working correctly and only visible when something goes wrong. This is the same political economy that makes retrospective analysis and red-teaming difficult in intelligence organizations. The answer is not to wait for the catastrophic failure that finally makes evaluation politically easy. Start small — thirty calibration cases, two retrospective scenario exercises per quarter — and build the habit before the stakes force the issue.

Your system's confidence is a claim about the world. That claim deserves the same scrutiny you would apply to a human analyst who told you they were certain — and the same feedback loop that would tell you, over time, whether they had earned that certainty.