M7E2: Simple Quantitative Methods Analysts Actually Use



What "Anomalous" Means

Every quantitative method in the analyst's toolkit begins with a deceptively simple question: compared to what? The word "anomalous" is meaningless in isolation. A transaction of $9,000 is unremarkable in most commercial contexts. It is an immediate red flag in a cash-intensive business whose average transaction is $200, and it becomes a critical signal when it occurs in the week before a counterparty expects an OFAC (Office of Foreign Assets Control) designation. The number has not changed. The baseline against which you evaluate it has. Most analytic failures that look like failures of detection are in fact failures of baseline construction — and most of those failures are never recognized as such, because no one formalized what "normal" looked like before the anomaly appeared.

A time series is a sequence of measurements taken at regular intervals over time. Transaction volumes, communication metadata counts, shipment frequencies, vessel position reports, social media posting rates, electricity consumption at a facility — any of these can be plotted as a time series, and each carries a characteristic signature when the underlying activity is normal. That signature is your baseline. It reflects seasonal variation, operational cycles, day-of-week patterns, and the ambient noise of whatever system you're watching. Traditional early warning models often rely on static indicators and linear assumptions, lacking the capacity to capture complex temporal patterns. That critique applies just as forcefully to human analysts as to software: if you're carrying a static threshold in your head rather than a dynamic sense of the system's behavior, you'll miss half of what matters and flag half of what doesn't.

The practical construction of a baseline requires several decisions that analysts routinely underspecify. First, time period: how long a window should define "normal"? A baseline built on six months of data before a sudden operational shift may reflect a reality that no longer exists. A baseline built on five years of data may average out the structural regime changes that make recent behavior meaningful. Second, granularity: are you measuring daily totals, weekly aggregates, hourly peaks? The right grain depends on what you're watching. A vessel that goes dark on AIS (Automatic Identification System, the transponder-based vessel-tracking network) for six hours matters — but only if you know that its normal operational profile includes no gaps longer than two hours. A financial account that receives 40 wire transfers on a Tuesday is suspicious — but only if you know that the account normally receives fewer than five per week, not if you know it routinely processes payroll for 300 employees. Third, what counts as the relevant comparison universe? A single entity's own historical behavior is one baseline. Peer group behavior — what other entities of similar type, size, and operating context do — is another, often more powerful baseline. A sanctions-evasion shell company that transferred $2 million in one month looks unremarkable if you're comparing it to large commercial banks. It looks extraordinary if you're comparing it to other newly formed holding companies in the same registry with similar nominal ownership structures.
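
As a concrete sketch of the comparison-universe decision, the snippet below scores the same $2 million monthly transfer total against two different baselines: a broad universe of commercial accounts and a peer group of newly formed holding companies. All figures are invented for illustration; in practice both baselines would be drawn from your own data holdings.

```python
from statistics import mean, stdev

def z_score(value, comparison):
    """Standard deviations separating `value` from the mean of `comparison`."""
    return (value - mean(comparison)) / stdev(comparison)

observed = 2_000_000  # the shell company's monthly transfer total (hypothetical)

# Baseline A: a broad universe of commercial accounts at the same institution.
broad_universe = [400_000, 2_500_000, 900_000, 5_000_000,
                  1_200_000, 300_000, 7_500_000, 600_000]

# Baseline B: peer group of newly formed holding companies in the same registry.
peer_group = [120_000, 90_000, 200_000, 160_000, 140_000, 175_000]

print(f"vs broad universe: z = {z_score(observed, broad_universe):5.1f}")  # ~ -0.1
print(f"vs peer group:     z = {z_score(observed, peer_group):5.1f}")      # ~ 46.9
```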

An analyst who cannot articulate the baseline underlying a judgment about anomaly has not made a quantitative judgment. They have made an intuitive one dressed up in numbers.

Characterizing trends within a time series — not just identifying anomalies against a static baseline — is where many analysts stop short. Direction, momentum, and rate of change each tell a different story. A consistent upward trend in shell company formations within a particular jurisdiction over eighteen months is a different signal from a sudden spike in the final three months of that same period. The former may reflect regulatory arbitrage responding to a new treaty; the latter may indicate anticipatory structuring ahead of a known enforcement window. Analysts can detect directional trends without any sophisticated software by applying a simple moving average — averaging the most recent N observations in a trailing window — which smooths day-to-day noise and reveals whether the underlying level is rising, falling, or plateauing. When the current observation consistently exceeds the moving average, you have an upward trend. When the gap between consecutive moving averages is itself widening, you have acceleration — a regime that is not just elevated but actively escalating. That distinction between level, trend, and acceleration is the three-layer read that separates analysts who understand time series from those who only look at the most recent data point.
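
A minimal sketch of that three-layer read, using only the Python standard library and an invented series of monthly shell-company formations: the moving average smooths the noise, the difference between consecutive averages gives the trend, and the change in that difference gives the acceleration.

```python
def moving_average(series, window):
    """Trailing moving average: the mean of the most recent `window` observations."""
    return [sum(series[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(series))]

# Hypothetical monthly shell-company formations in one jurisdiction.
formations = [14, 15, 13, 16, 15, 17, 16, 19, 22, 27, 35, 48]

ma = moving_average(formations, window=3)

level = formations[-1]                                 # where the series is now
trend = ma[-1] - ma[-2]                                # is the smoothed level rising?
acceleration = (ma[-1] - ma[-2]) - (ma[-2] - ma[-3])   # is the rise itself speeding up?

print(f"level={level}, trend={trend:+.1f}, acceleration={acceleration:+.1f}")
```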


Leading vs. Lagging Indicators, and Why the Distinction Saves Time

Once you have a baseline, the second critical distinction in practical quantitative analysis is between leading and lagging indicators — a distinction that shapes not just how you read data, but what data you go looking for in the first place.

A lagging indicator confirms that something has already happened. GDP contraction confirms recession. A cargo manifest listing sanctioned goods at a destination port confirms a violation that has already occurred. Body count confirms that a conflict has escalated. These indicators are often highly reliable — by the time you see them, the underlying event is not in doubt — but they arrive too late to drive prevention. They drive investigation and attribution. That's valuable, but it's categorically different from early warning.

A leading indicator changes before the underlying event does. It is noisier, less direct, and demands more interpretive judgment, but it gives you time. The classic example from financial intelligence: beneficial ownership changes in a network of shell companies often precede an OFAC designation by weeks or months, because actors inside the target network receive intelligence about upcoming designations and begin restructuring. In the Russian sanctions-evasion context specifically, the pattern of transferring nominal ownership of real estate from sanctioned oligarchs to family members and shell companies — exactly the structure documented in the January 2025 Miami case where a real estate company paid a $1 million settlement — tends to appear as a detectable network restructuring event before the broader enforcement action becomes visible in public records. The Miami case is instructive precisely because the leading signal was not the property transfer itself but the velocity of transfers: multiple ownership changes across a portfolio of properties within a compressed window, each individually defensible as routine estate planning, collectively anomalous against the peer-group baseline for similar holding structures. No single transaction crossed a threshold. The convergence of several sub-threshold signals, read against the baseline of what comparable companies do, was what made the pattern visible to an analyst reasoning quantitatively.

The leading-versus-lagging distinction applies across analytic domains beyond financial intelligence. In trade analysis, a surge in exports of dual-use precursor chemicals to a transshipment hub is a leading indicator of potential re-export to a sanctioned end-user; the actual delivery manifest is the lagging confirmation. In geopolitical analysis, elite travel restrictions and unusual central bank foreign exchange interventions are leading indicators of currency crisis; the formal devaluation announcement is the lagging confirmation. In operational security analysis, changes in a target organization's procurement patterns — purchasing communications equipment, vehicle parts, or fuel in volumes inconsistent with declared operations — are leading indicators of operational preparation; the activity itself is the lagging event. In each case, the leading indicator requires more interpretive judgment and carries more false positives, but it provides the temporal window for action. The lagging indicator closes that window in exchange for certainty. Analysts who understand this trade-off structure their collection requirements accordingly: they go looking for leading signals first, then use lagging indicators to confirm or disconfirm.

Rate of change is often more informative than absolute value. A communication network where one node goes from receiving 3 messages per day to receiving 47 messages per day over ten days is exhibiting a dramatic rate-of-change signature even if the absolute volume remains unremarkable. A vessel whose speed through a known transshipment zone diverges from its historical average may be responding to operational conditions — or it may be signaling a behavior change. The rate signal appears before the absolute value has crossed any threshold you might naively set. Analysts who have internalized quantitative reasoning focus on the derivative — the speed of change — rather than just the current level.
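
To make the derivative point concrete, the sketch below replays the three-to-47-messages example. Both thresholds are invented values for illustration; the point is only that the rate signal fires well before the absolute level crosses any static line.

```python
messages_per_day = [3, 4, 5, 7, 9, 12, 17, 23, 31, 47]  # ten observations, as in the example above
ABSOLUTE_THRESHOLD = 40   # naive static trigger (hypothetical value)
GROWTH_THRESHOLD = 0.30   # flag day-over-day growth above 30% (hypothetical value)

for day in range(1, len(messages_per_day)):
    prev, cur = messages_per_day[day - 1], messages_per_day[day]
    growth = (cur - prev) / prev
    print(f"day {day}: volume={cur:3d}  growth={growth:+.0%}  "
          f"rate_flag={growth > GROWTH_THRESHOLD}  level_flag={cur > ABSOLUTE_THRESHOLD}")
# The rate flag fires repeatedly from the first days of the ramp; the level flag
# fires only once the absolute volume finally crosses the static threshold.
```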

The practical problem is that leading indicators come with more false positives. A shell company that restructures ownership might be responding to tax law changes rather than anticipating designation. A communication spike might reflect a legitimate business event rather than operational planning. The discipline is to track multiple leading indicators simultaneously and look for convergence — not to demand that any single indicator carry the full evidentiary weight. Three weak signals pointing in the same direction require less certainty in each individual signal than one strong indicator requires on its own. Build for convergence, not for threshold-crossing.
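
One simple way to operationalize convergence is a k-of-n rule over deliberately modest per-indicator thresholds, as in the sketch below. The indicator names, z-scores, and cutoffs are hypothetical; the structure is what matters: no single signal has to clear a high bar, but several must point the same way.

```python
# Per-indicator z-scores for one entity against its peer baseline (all hypothetical).
signals = {
    "ownership_restructuring_velocity": 1.4,
    "new_counterparty_jurisdictions":   1.2,
    "sub_threshold_cash_deposits":      1.6,
    "director_turnover":                0.3,
}

WEAK_SIGNAL_Z = 1.0   # deliberately modest per-indicator bar
CONVERGENCE_K = 3     # how many weak signals must point the same direction

firing = [name for name, z in signals.items() if z >= WEAK_SIGNAL_Z]
if len(firing) >= CONVERGENCE_K:
    print(f"Escalate for review: {len(firing)} converging signals -> {firing}")
else:
    print("No convergence; continue monitoring.")
```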

Thresholds themselves deserve scrutiny. Setting a threshold at "flag any transaction over $10,000" made sense in 1970 when the Bank Secrecy Act was designed — it approximated the lower bound of serious criminal financial activity. By 2025, that threshold had become so embedded in adversarial actors' operational planning that structuring, or "smurfing," transactions just below that line is routine, increasingly automated, and in some cases handled by AI tools that programmatically distribute transactions across accounts to avoid triggers. The threshold became a roadmap. This is Goodhart's Law in its most operationally damaging form, and we'll return to it shortly.


Why Simplicity Often Wins, and Why Complex Models Lie to You

There is a persistent temptation among analysts who are newly quantitatively literate — and among vendors selling to them — to assume that more complexity produces more accuracy. More variables, more sophisticated models, more intricate weighting schemes, more precision in the output. The evidence runs in the opposite direction, and the reasons are structural.

Goodhart's Law states that "when a measure becomes a target, it ceases to be a good measure." Named after British economist Charles Goodhart, who articulated the principle in a 1975 article on monetary policy, the core idea is that any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes. This is the operating principle of adversarial intelligence environments. The moment your indicators become known — to policymakers, to compliance teams, to the targets of your analysis — they become things your adversaries optimize against. Vietnam's rat bounties, introduced to reduce the rodent population, instead spurred farmers to breed rats for the reward money. In intelligence terms: the rat farmers are your adversaries, and they read the same compliance guidance you do.

The implication for indicator design is uncomfortable. A sophisticated model with twenty carefully tuned variables is a more complete description of historical behavior, but it is also a more exploitable target. If an adversarial actor can reverse-engineer or simply infer the model's parameters — and in the era of regulatory transparency, compliance guidance, and leaked enforcement documentation, this is less difficult than it sounds — they can route behavior just outside the model's detection frontier. A simpler model with fewer, more directly interpretable features may be more durable under adversarial pressure precisely because its logic is harder to game at the margin. The simplest possible early warning indicator — one that your adversary cannot easily adjust to avoid — is often preferable to the most accurate indicator that your adversary can learn to evade.

One concrete detection method that embodies this simplicity principle is the control limit: a band drawn at a fixed distance above and below the historical mean of a series, typically set at two or three standard deviations. Any observation outside that band triggers review. The method requires no proprietary software, can be constructed in a spreadsheet, and is immediately explainable to a decision-maker. Its power is not statistical sophistication — it is the discipline of committing to what "normal" looks like before inspecting any individual observation. A compliance team monitoring wire transfer volumes for a portfolio of correspondent accounts can construct control limits for each account using ninety days of prior data, update them monthly, and flag outliers for human review. That workflow catches genuine behavioral shifts without requiring a data science team, and it remains interpretable enough that an analyst can explain to a regulator exactly why a specific transaction was flagged.
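
A minimal version of that workflow, assuming ninety days of prior daily wire volumes for a single account (simulated here) and a three-standard-deviation band:

```python
import random
from statistics import mean, stdev

random.seed(7)

# Ninety days of prior daily wire-transfer volume for one account (simulated).
history = [random.gauss(mu=52, sigma=6) for _ in range(90)]

center, spread = mean(history), stdev(history)
upper, lower = center + 3 * spread, center - 3 * spread

def review_needed(observation):
    """Flag any observation falling outside the three-sigma control band."""
    return observation > upper or observation < lower

for today in (49, 61, 118):   # hypothetical new daily totals
    print(f"volume={today:4d}  flag={review_needed(today)}")
```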

False precision is a closely related failure mode. A model that outputs "fraud probability: 0.873" implies a precision that the underlying data cannot support. Analysts and decision-makers, particularly under time pressure, tend to treat these outputs as more authoritative than they are. The more decimal places in the output, the less the consumer of that output is likely to interrogate the assumptions behind it. An indicator that says "elevated risk" retains appropriate epistemic humility. A score that says "91.7th percentile risk" invites precisely the kind of overconfidence that produces catastrophic missed signals at the tails.

The black-box nature of complex AI models has limited their ability to generate actionable policy insights. This is the honest state of machine learning applied to analytic problems: the most accurate models are the least explainable, and the least explainable models are the ones most likely to be misused by analysts who do not understand what drives their outputs. A random forest or a deep learning model may outperform a simple threshold rule on historical data — and there is solid evidence that random forest models outperform traditional time-series approaches, particularly in predicting tail risks over longer time horizons — but "outperforms on historical data" is not the same as "will outperform on future adversarial data." The evaluation environment and the deployment environment are different, and that difference matters most precisely in the high-stakes cases where you most need the model to work.

A complex organization is not passive toward its metrics. The metric attracts attention, budgets, promotions, and fear, which means it becomes part of the causal structure of the organization itself. This dynamic, which Donald Campbell identified independently from Goodhart in 1976 as Campbell's Law, operates at both the organizational and operational level. When a financial institution designs its AML (anti-money laundering) monitoring around a particular model, the model's outputs shape behavior: the compliance team's priorities, the training of new analysts, the thresholds that trigger investigations, and eventually, through regulatory guidance and enforcement patterns, the behavior of actors trying to stay beneath detection thresholds. The model ceases to be an objective measurement of underlying activity and becomes a force that shapes that activity.

The practical corrective is not to abandon quantitative models but to design analytic systems that make the measurement relationship explicit, preserve the ability to reason about the underlying phenomenon independently from the model, and rotate or update indicators faster than adversaries can adapt to them. Simple indicators, clearly grounded in the behavior you care about, that can be explained in plain language to a decision-maker, are more durable than elegant models that require a data science team to interpret.


Where AI Tools Help and Where They Don't

Given the argument above — that simplicity is often preferable, that complex models carry their own failure modes — it is worth being precise about where AI tools genuinely improve quantitative analytic work, because they do, in specific and bounded ways that differ sharply from vendor narratives.

The most immediate contribution of large language models to quantitative tradecraft is not analysis. It is translation and generation. An analyst working on an indicator set for a new collection problem — say, early warning of economic pressure on a mid-tier state actor — can use Claude, GPT-5, or a fine-tuned model to rapidly generate candidate indicator sets: what signals, drawn from open source, might precede capital controls? What observable proxies track elite asset flight? What behavioral signatures in trade data might indicate preparation for currency intervention? The model is not doing the analysis; it is doing the rapid enumeration of plausible hypotheses that the analyst would otherwise spend days generating through desk research. Evaluation and prioritization — where domain expertise lives — remain the analyst's work.

LLMs (large language models) offer real value for anomaly explanation and investigator-oriented decision support. By synthesizing structured signals — including transaction features and graph-based risk indicators — into natural language summaries, LLMs can translate complex model outputs into accessible explanations for analysts. The explanatory gap between sophisticated detection models and the humans who need to act on their outputs has been a persistent operational problem. An analyst flagging a transaction to a senior decision-maker has historically needed to either oversimplify the reasoning or overwhelm the consumer with technical detail. A well-designed LLM interface — with appropriate grounding in specific model outputs, not hallucinating connections from general knowledge — can bridge that gap.

The Bank for International Settlements' recent work on financial market monitoring makes a related point explicit: the value of LLMs in this context is contextualization, not prediction. A model that flags elevated triangular arbitrage deviations as a stress precursor becomes more operationally useful when paired with an LLM that can retrieve and summarize the news environment around those deviations — surfacing whether macro events explain the signal before it reaches a human analyst for investigation.

Stress-testing indicator thresholds is another area where AI tools accelerate legitimate analytic work. Given a proposed threshold — "flag any vessel that goes dark for more than four hours in a known transshipment zone" — a model can rapidly generate the historical performance of that threshold against known cases, identify false-positive rates under different seasonal and operational conditions, and surface historical anomalies the threshold would have missed. This kind of sensitivity analysis is tedious for human analysts to do manually and is therefore often skipped. AI tools can make it routine.
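
Whether the sweep is run by an analyst's own script or generated with an AI assistant, the underlying computation is simple. The sketch below sweeps candidate dark-gap thresholds across a labeled history of AIS gaps; the durations and labels are invented, and in a real workflow they would come from the adjudicated case record.

```python
# (gap_hours, was_genuine_transshipment) pairs -- hypothetical labeled history.
historical_gaps = [
    (1.5, False), (2.0, False), (3.5, False), (6.0, True), (2.5, False),
    (8.0, True),  (4.5, False), (5.0, True),  (3.0, False), (7.5, True),
    (4.0, True),  (1.0, False), (9.0, True),  (2.2, False), (5.5, False),
]
genuine_total = sum(1 for _, label in historical_gaps if label)

for threshold in (2, 4, 6, 8):
    flagged = [(gap, label) for gap, label in historical_gaps if gap > threshold]
    hits = sum(1 for _, label in flagged if label)
    recall = hits / genuine_total
    false_alarm_share = (len(flagged) - hits) / len(flagged) if flagged else 0.0
    print(f"threshold >{threshold}h: catches {recall:.0%} of genuine cases; "
          f"{false_alarm_share:.0%} of flags are false alarms")
```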

The limits, however, must be named. LLMs are prone to hallucination, which poses real risks in regulatory and compliance-sensitive environments. An LLM generating candidate indicators is generating plausible-sounding indicators, not validated ones. An LLM explaining why a transaction was flagged may be constructing a coherent post-hoc narrative rather than accurately describing what the underlying model detected. The analytic workflow must treat AI-generated outputs as hypotheses requiring verification, not as evidence. The tool is useful for speed, breadth, and communication. It is not a substitute for structured engagement with the underlying measurement problem.

Palantir's AIP (Artificial Intelligence Platform), deployed across intelligence and financial crime contexts, represents the current commercial frontier of this integration: by linking a unified knowledge graph with a large language model that can interpret analyst queries and translate them to retrieve relevant entities and relationships, compliance officers can access customer profiles and transaction relationships through natural language — without manual coding or complex navigation. That capability is real and operationally valuable. What it does not provide is the judgment about whether the indicator being queried was well-constructed in the first place.


The Red Flags: How Quantitative Reasoning Fails in Practice

The failure modes in quantitative intelligence analysis are not random. They cluster around three structural vulnerabilities that recur across contexts, disciplines, and analytical traditions — and they are made more dangerous, not less, by AI tools that can generate apparent quantitative rigor at scale.

The first is p-hacking, or data dredging: running many statistical tests on the data and reporting only those that return significant results. In intelligence analysis, p-hacking does not require deliberate fraud. It happens naturally in the process of indicator selection. An analyst reviews historical cases of a phenomenon — say, coup precursors — and retrospectively identifies twenty variables that appear correlated with the outcome in that dataset. Some of those correlations are real. Many are artifacts of the small sample size, the specific historical period, and the fact that the analyst ran twenty comparisons and reported the five that worked. When the indicator set is then applied prospectively, false positives accumulate and genuine signals are diluted.

The generative AI era makes this problem structurally worse. A large language model can propose thirty indicator candidates in thirty seconds. If the analyst evaluates all thirty against historical data and selects the seven that perform best, they have created a p-hacking machine. The speed and apparent systematicity of the process do not change the statistical math: undocumented data-dredging steps can easily push false-positive rates to 20 percent, 50 percent, or higher. The defense is not to avoid generating multiple candidates — it is to commit to the evaluation methodology before examining the data, to report performance on genuinely held-out cases, and to be explicit about how many candidates were evaluated when presenting the final indicator set.
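
The arithmetic behind that warning is easy to demonstrate. In the simulation below, every candidate indicator is pure noise with no relationship to the outcome, yet on a small historical sample several will typically still clear a plausible-looking correlation cutoff. The sample size, candidate count, and cutoff are arbitrary illustration values.

```python
import random
from statistics import correlation  # Python 3.10+

random.seed(42)

N_CASES = 25        # small historical sample, e.g. past country-years
N_CANDIDATES = 30   # indicator candidates proposed "in thirty seconds"
CUTOFF = 0.30       # arbitrary "looks predictive" correlation cutoff

outcomes = [random.random() for _ in range(N_CASES)]

# Every candidate is pure noise: no real relationship to the outcome exists.
candidates = [[random.random() for _ in range(N_CASES)] for _ in range(N_CANDIDATES)]

apparent_winners = [i for i, cand in enumerate(candidates)
                    if abs(correlation(cand, outcomes)) > CUTOFF]

print(f"{len(apparent_winners)} of {N_CANDIDATES} pure-noise candidates "
      f"look 'predictive' on this sample: {apparent_winners}")
```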

The second failure mode is base rate neglect, the most common quantitative error in high-stakes operational environments. Base rate neglect was identified by Kahneman and Tversky in 1973 as the systematic human tendency to ignore the prior probability of an event when evaluating specific evidence about it. In intelligence terms: an indicator that correctly identifies 90% of true cases of a phenomenon is nearly useless if the phenomenon itself is rare enough that false positives overwhelm true positives in the output queue. Even a test with a 5% false positive rate yields a flag queue in which only about 2% of alerts are genuine when the base rate is 1 in 1,000. That follows directly from Bayes' Theorem — the prior matters, and ignoring it collapses the math.
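
The arithmetic, using the figures above (a 90% hit rate, a 5% false positive rate, and a base rate of 1 in 1,000):

```python
sensitivity = 0.90          # the indicator fires on 90% of genuine cases
false_positive_rate = 0.05  # and on 5% of benign cases
base_rate = 1 / 1_000       # prevalence of the phenomenon in the monitored population

# Bayes' Theorem: P(genuine | flagged)
p_flagged = sensitivity * base_rate + false_positive_rate * (1 - base_rate)
p_genuine_given_flag = (sensitivity * base_rate) / p_flagged

print(f"Share of flags that are genuine cases: {p_genuine_given_flag:.1%}")  # ~1.8%
```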

The operational consequence is concrete. An anomaly detection system that flags unusual transaction patterns will generate enormous volumes of false positives if deployed against a population where genuinely suspicious transactions represent a tiny fraction of total activity. FinCEN's (the Financial Crimes Enforcement Network's) own data shows the scale: 4.8 million SARs (Suspicious Activity Reports) were filed in fiscal year 2025 — a number that reflects both the scale of financial monitoring and the degree to which SAR filing has become institutionalized, not necessarily a measure of genuine anomaly detection performance. If even a fraction of those SARs are filed mechanically against flagged activity rather than through genuine reasoning about base rates, the investigative burden is allocated to noise rather than signal.

The correction requires forcing the question before any indicator is presented or acted upon: given the population this indicator is applied to, how often should it fire on genuine cases versus false positives? That is a prior probability question. It requires an estimate of the base rate of the underlying phenomenon — an estimate analysts routinely skip because it feels speculative compared to the apparent precision of the indicator itself. A working practice: before deploying any new indicator, require the analyst to state in plain language what percentage of the flagged population they expect to be genuine cases. That estimate, however rough, forces engagement with the base rate and creates a recorded prediction against which the indicator's actual performance can later be evaluated.

The third failure mode is survivorship bias in retrospective analysis, and it corrupts the very process by which analysts learn from history. During World War II, statistician Abraham Wald was asked to advise on where to add armor to bombers returning from missions. The military's initial instinct was to reinforce the areas showing the most bullet damage. Wald reversed the logic. The aircraft that had been shot down weren't available for inspection. Heavy damage to fuselages and wings on returning planes was evidence that those areas could absorb punishment and survive — the fatal hits were landing elsewhere, on planes that never came back.

Wald's insight translates directly into intelligence analysis. When an analyst builds an indicator set by studying historical cases of detected sanctions evasion, they are studying the cases that were caught. The evasion methods that succeeded — that moved money without triggering a SAR, that transited goods without appearing on an export control list — are absent from the dataset. An indicator set tuned to detected evasion will be systematically blind to the patterns of successful evasion. This is not a minor calibration error. It is a structural bias that makes retrospective analysis of discovered cases a fundamentally incomplete basis for prospective indicator design.

The practical mitigation requires explicitly asking about the mechanisms that would produce undetected cases — constructing hypotheses about what patterns successful evasion would leave in observable data, even when those patterns never appear in the case record — and treating the absence of detections in certain channels as potentially informative rather than as evidence that those channels are clean. What does the data you don't have tell you?


Putting It Into Practice

These three failure modes — p-hacking in indicator selection, base rate neglect in threshold application, survivorship bias in retrospective learning — share a common structure. In each case, the analyst has access to real data and is performing what looks like quantitative reasoning, but the reasoning is fundamentally misconfigured because the relationship between the data and the underlying phenomenon is not what the analyst believes it to be. The data is conditioned on detection, or the threshold ignores the prior, or the indicators were selected after seeing the outcomes. The analytical product looks rigorous. It is systematically wrong.

None of these failures requires a statistics degree to avoid. They require a specific discipline of mind before the analysis begins: state the baseline before calling anything anomalous. Specify how many indicators you evaluated before reporting the ones that worked. Ask what the base rate of the underlying event is before deciding how many flags are too many. Ask what data is missing from your retrospective cases before generalizing from them.

AI tools can help — materially and specifically — with the labor-intensive parts of this discipline: generating broad indicator candidates rapidly, stress-testing proposed thresholds against historical data, translating model outputs into language decision-makers can use. Combining recurrent neural networks with LLMs creates two-stage frameworks that can forecast market stress and identify underlying drivers, with the recurrent neural network detecting periods of heightened stress up to 60 working days in advance. Those tools exist and they work in bounded domains.

They do not protect you from the failure modes described above. A language model will generate a plausible-sounding indicator with a convincing historical rationale whether or not that indicator was p-hacked into existence. It will not spontaneously ask about base rates. It will not flag that the cases you're studying are systematically conditioned on detection.

The analyst who understands what they are measuring — the baseline, the conditionality, the selectivity of the evidence, the relationship between the measured proxy and the underlying phenomenon — is the analyst who can use AI tools intelligently, and who can recognize when a quantitative product, however sophisticated in appearance, is measuring something other than what it claims. A measure is worth tracking only when the relationship between the measure and the underlying phenomenon is intact, understood, and not yet gamed. The moment you're uncertain about any of those three conditions, precision in the output cannot substitute for validity in the design.

The discipline is epistemic, not mathematical. It travels with the analyst regardless of what models or platforms sit between the raw data and the finished product. A useful self-check before finalizing any quantitative analytic product: can you state, in two sentences without jargon, what behavioral reality your indicator is measuring, what population it is applied to, and why you expect the relationship between the indicator and the underlying phenomenon to hold in the current environment rather than only in the historical data where you found it? If you cannot answer that question, the indicator is not ready — regardless of how precise the output appears or how sophisticated the model that produced it.