M6E1 (Module 6, Episode 1): AI-Enhanced OSINT (Open-Source Intelligence) Pipelines: From Scrape to Graph
The Distance Problem
Every intelligence failure has the same shape at its core. Not missing data — too much of it, moving faster than anyone could read. The raw material was there: signals, financial records, travel logs, forum posts, intercepts in languages no one had time to translate. The failure was in the distance between collection and comprehension. Raw material piled up on one side; insight stayed scarce on the other. The analyst, positioned somewhere in the middle, did the impossible work of bridging that gap with her judgment, her time, and her limited cognitive bandwidth.
AI's genuine contribution to OSINT is that it attacks this distance problem directly — compressing the interval between raw collected material and structured, queryable intelligence so the analyst can spend her judgment on the hard questions rather than the mechanical ones. The pipeline from scrape to graph, done well, hands the analyst a working hypothesis rather than a pile of documents. Done badly, it hands her confident-looking garbage at machine speed.
The difference between those outcomes lives in specific decisions made at specific stages of the workflow. This episode walks through those stages concretely, using a worked example — mapping a network of shell companies obscuring the ownership of dual-use components moving toward a sanctioned party — because the tradecraft decisions only become clear when you can see them operating on real material.
The Pipeline: A Concrete Walk-Through
The pipeline has six stages that compound on each other. Collection feeds tagging. Tagging enables clustering. Clustering surfaces candidates for relationship extraction. Relationship extraction populates a graph. The graph anchors a timeline. Each transition is where something can go quietly wrong without any alert being raised. Let's run the example through each stage.
Collection. The investigation begins with a tip: a European component manufacturer suspects its products are reaching a sanctioned defense program through intermediary buyers in three jurisdictions. The collection scope covers corporate registry filings in those jurisdictions, shipping manifests available through public trade databases like ImportYeti or Panjiva (two commercial platforms that aggregate import and export records from customs filings), Telegram channels used by freight brokers in the relevant trade corridors, LinkedIn profiles of logistics executives at named entities, and domain registration records for the web properties of the suspected intermediaries.
The passive-first discipline matters here: before running anything that could alert a target, you query Shodan (a search engine that scans and indexes internet-connected devices and exposed services) for infrastructure exposure, use Intelligence X (a data archive and search platform covering historical web, breach, and darknet records) for historical data and breach records tied to the target's identifiers, and run OSINT Industries (a real-time aggregation platform that queries hundreds of public sources from a single selector input) for digital footprint correlation — none of these queries broadcast investigative intent. Active scanning announces the investigation. Passive reads do not. The collection stage ends not when you have found everything but when you have established a controlled baseline without touching the target's systems.
DataWalk's AI-driven platform (an enterprise analytics environment that converts unstructured documents into queryable knowledge graphs) is representative of what enterprise collection pipelines are doing at the back end: transforming unstructured text into rich knowledge graphs and revealing hidden relationships to accelerate analysis. But even at the collection stage, the analyst faces a critical design choice: how broadly to define the collection perimeter. Too narrow, and you miss the lateral connections that reveal the actual network. Too broad, and you generate volume that defeats the purpose of automation.
The problem with AI-assisted collection is that it makes it trivially easy to be too broad. Automating the collection of everything related to every named entity produces a graph so dense it becomes unnavigable — a phenomenon analysts call "hairballing," where the visualization collapses into an undifferentiated mass of nodes and edges that communicates nothing. Excessive collection creates noise and slows down analysis. The discipline of scoped collection, which experienced analysts apply almost automatically because they know the pain of drowning in material, is the first human judgment that AI tools tend to erode. Automation makes collection cheap. Cheap collection creates excess. Excess degrades analysis.
The solution is to define collection scope before triggering the automated pipeline — and to enforce that scope with the same rigor you would apply to a collection management authority in any other intelligence context.
Automated Tagging. Once material is ingested, the first AI stage applies Named Entity Recognition — NER — to extract and classify the relevant entities: company names, individual names, addresses, registration numbers, shipping container identifiers, port names, dates, financial figures. NER is a core natural language processing technique used to automatically detect and classify entities within unstructured text into predefined categories such as persons, organizations, locations, and dates. In the shell company investigation, this means running NER across thousands of pages of corporate registry documents, shipping manifests in multiple languages, and scraped Telegram content, tagging every instance of a company name, individual name, address, or registration number.
The output of tagging is annotated raw material, not intelligence. The value is in what it enables next.
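To make the tagging stage concrete, here is a minimal rule-based layer for the structured identifiers that general-purpose NER models often miss. The container-code pattern follows the real ISO 6346 shape (four letters, then seven digits); the registry-number pattern and the sample manifest line are invented for illustration, and a production pipeline would run rules like these alongside a statistical NER model, not instead of one.

```python
import re

# Rule-based patterns for structured identifiers. The container-code
# pattern reflects the real ISO 6346 shape; the BVI registry-number
# pattern is an assumed, illustrative format.
PATTERNS = {
    "CONTAINER_ID": re.compile(r"\b[A-Z]{4}\d{7}\b"),
    "BVI_REG_NO": re.compile(r"\bBVI[- ]?\d{6,7}\b"),  # assumed format
}

def tag_entities(text: str) -> list[tuple[str, str]]:
    """Return (label, surface form) pairs found in raw text."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group()))
    return hits

manifest = "Container MSKU1234567 consigned to Redstone Trading LLC, reg. BVI-123456."
print(tag_entities(manifest))
# [('CONTAINER_ID', 'MSKU1234567'), ('BVI_REG_NO', 'BVI-123456')]
```

The output is still annotated raw material: a hit list to feed the clustering stage, not a finding.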
Two pathologies emerge at the tagging stage that analysts must anticipate. First, alias proliferation: the same entity appears under multiple names — "Redstone Trading LLC," "Redstone Trading Limited," "R. Trading LLC," "RST LLC" — and a naive NER system tags them as four different entities rather than one. The investigation's entire graph will be fractured across these variants until coreference resolution runs to link them. Cross-checking results against additional tools or sources is essential precisely because alias proliferation is invisible at the tagging stage and catastrophic at the graph stage. Second, language contamination: Telegram channels in Russian, shipping manifests in Chinese, corporate filings in Arabic all arrive tagged according to the performance profile of the NER model in each language, which is not uniform. Models trained primarily on English text perform significantly worse on transliterated names and non-Latin scripts. If the analyst assumes consistent tagging quality across languages, she will systematically undercount entities in those languages — which is exactly where the investigative leads are most valuable and most hidden.
Clustering. After tagging, the pipeline clusters entities by similarity and co-occurrence before attempting to extract relationships. This stage answers a different question than tagging: not "what is mentioned here?" but "which mentions refer to the same real-world entity, and which entities appear together repeatedly enough to suggest a connection worth examining?" Modern implementations use a combination of vector embeddings — treating names and addresses as points in semantic space and measuring distance between them — and fuzzy string matching to group variant references to the same entity.
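The fuzzy-matching half of that combination can be sketched with the standard library alone. The grouping below is a deliberately crude greedy pass (real entity-resolution systems add embeddings, blocking, and transitive closure), but it shows both the mechanism and its limit: an abbreviation like "RST LLC" falls below a pure string-similarity threshold, which is exactly why cross-checking against other attributes matters. The alias list is invented.

```python
from difflib import SequenceMatcher

ALIASES = ["Redstone Trading LLC", "Redstone Trading Limited",
           "R. Trading LLC", "RST LLC"]

def similarity(a: str, b: str) -> float:
    """Case-insensitive character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def group_aliases(names, threshold=0.6):
    """Greedy single-pass grouping: each name joins the first cluster
    whose representative it resembles closely enough. Crude on purpose."""
    clusters = []
    for name in names:
        for cluster in clusters:
            if similarity(name, cluster[0]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters

for cluster in group_aliases(ALIASES):
    print(cluster)
# ['Redstone Trading LLC', 'Redstone Trading Limited', 'R. Trading LLC']
# ['RST LLC']
```

Note that "RST LLC" ends up alone: pure string similarity cannot see that it is an initialism of the same entity, which is the alias-proliferation failure the previous section warned about.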
In the worked example, the clustering stage surfaces something the analyst had not initially looked for: three of the seven suspected intermediary companies share a registered address at a corporate services provider in the British Virgin Islands, and two of them share a phone number with a fourth company not yet in the investigation scope. Neither connection was visible in any single raw document. The clustering surfaced them by finding co-occurrence across a corpus no human analyst would have read end-to-end in the available time.
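A sketch of how that kind of co-occurrence surfaces mechanically: invert entity-to-attribute records into an attribute-to-entities index and keep anything shared by more than one entity. All company names, addresses, and phone numbers below are invented.

```python
from collections import defaultdict

# Invented sample records; in practice these come from registry filings.
records = {
    "Redstone Trading LLC": {"address": "Vistra House, Road Town, BVI",
                             "phone": "+1-284-555-0101"},
    "Harborlight Exports":  {"address": "Vistra House, Road Town, BVI",
                             "phone": "+1-284-555-0102"},
    "Crescent Logistics":   {"address": "Vistra House, Road Town, BVI",
                             "phone": "+1-284-555-0101"},
    "Meridian Freight":     {"address": "22 Harbour Rd, Limassol",
                             "phone": "+357-25-555-0199"},
}

def shared_attributes(records):
    """Invert entity->attribute records to find values shared by 2+ entities."""
    index = defaultdict(set)
    for company, attrs in records.items():
        for field, value in attrs.items():
            index[(field, value)].add(company)
    # Keep only co-occurring values: these are leads, not findings.
    return {k: sorted(v) for k, v in index.items() if len(v) > 1}

for (field, value), companies in shared_attributes(records).items():
    print(f"{field} {value!r} shared by: {companies}")
```

The shared phone number here is the kind of lateral connection that pulls a fourth company into scope; whether it means anything is the analyst's call, not the index's.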
This is exactly the kind of genuine acceleration AI offers — finding the cross-document pattern that no single document contains. But the analyst must resist the temptation to treat every cluster as a confirmed relationship. Clustering produces hypotheses about co-reference and co-occurrence, not verified facts. No single platform output constitutes verified intelligence. The shared BVI address could mean these companies are linked parts of a single network. It could also mean they all used the same incorporation mill that registers hundreds of entities at that address.
The cluster is a lead, not a finding.
Relationship Extraction. This is the stage where the pipeline moves from identifying entities to characterizing the connections between them. A relationship extraction model — typically a fine-tuned transformer architecture applied to sentence-level context — reads tagged text and identifies directional claims: "Company A is a subsidiary of Company B," "Individual X serves as director of Company C," "Container Y was transferred at Port Z from Vessel W to Vessel V." In corporate network investigations, pivot logic runs through Maltego's (a graph-based link analysis platform) Transform architecture: domain → IP → other domains on same IP → associated emails → social media accounts — a chain that traverses the graph one hop at a time, with each hop automated and each result returned as a visual node.
Relationship extraction is where enterprise platforms like Palantir AIP (Palantir's Artificial Intelligence Platform, which connects AI models directly to operational data and workflows) create genuine value for sophisticated operations. The platform connects AI with data and operations, designed to drive automation across operational processes, with builder tools enabling production-ready AI-powered workflows, agents, and functions on top of a structured data layer called the Ontology. In intelligence workflows, the Ontology layer — Palantir's representation of entities, their properties, and their typed relationships — is precisely the formal structure that relationship extraction needs to populate. The AIP pipeline does not simply extract relationships as free-form text; it extracts them into a defined schema: this node is of type "Legal Entity," this edge is of type "DirectorOf," this attribute is "JurisdictionOfIncorporation." That schema is what makes the graph queryable.
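A minimal sketch of what "extracting into a defined schema" means in code, assuming nothing about Palantir's actual API: typed nodes and edges that reject anything outside the ontology, with provenance carried on every edge. The type names and the sample filing reference are illustrative.

```python
from dataclasses import dataclass

# Illustrative ontology: allowed node and edge types for this investigation.
NODE_TYPES = {"LegalEntity", "Person", "Vessel", "Port"}
EDGE_TYPES = {"SubsidiaryOf", "DirectorOf", "TransferredAt", "SharesAddressWith"}

@dataclass(frozen=True)
class Node:
    id: str
    type: str
    def __post_init__(self):
        if self.type not in NODE_TYPES:
            raise ValueError(f"unknown node type: {self.type}")

@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    type: str
    provenance: str  # which document the claim came from
    def __post_init__(self):
        if self.type not in EDGE_TYPES:
            raise ValueError(f"unknown edge type: {self.type}")

# "Company A is a subsidiary of Company B", extracted into the schema:
a = Node("Company A", "LegalEntity")
b = Node("Company B", "LegalEntity")
e = Edge("Company A", "Company B", "SubsidiaryOf",
         provenance="registry_filing_2024_017.pdf")  # invented filename
print(e)
```

The point of the constraint is that a typed graph is queryable ("all DirectorOf edges into Company B") in a way a pile of free-text claims is not, and the provenance field keeps every edge traceable back to its source document.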
The quality of AI solutions depends heavily on extracting and preparing domain-specific data for large language models. The shell company investigation involves a particular extraction challenge: establishing beneficial ownership relationships that are deliberately obscured. The corporate filings say "Company A is wholly owned by Company B." They rarely say "Individual X beneficially controls Company B through a trust arrangement in Liechtenstein." Relationship extraction can find what is stated; it cannot manufacture what is hidden. When the analyst treats extracted relationships as complete, she is treating the absence of extracted evidence as evidence of absence — a classic analytic error that AI-assisted pipelines make dangerously easy.
From Graph to Timeline: What Gets Built and What It Means
Once relationship extraction has populated the graph schema, the visualization layer is where the investigation becomes navigable. The contemporary Maltego platform includes a browser-based version with built-in AI assistance, adding a layer of natural-language querying on top of the traditional graph interface — the analyst can ask "show me all entities that share a director with Entity X" rather than manually constructing that query through successive transform runs.
Maltego functions as a visualization and relationship-mapping layer, with Transforms operating as executable scripts that query external data sources and return results as visual nodes on an investigation graph, connecting entities across 120-plus data sources including IP addresses, domains, corporate registrations, and human profiles, surfacing hidden network structures invisible inside raw datasets. In the shell company case, the analyst can start with a single known entity — the European manufacturer's stated customer — and run Transforms to generate a map of that entity's corporate connections, infrastructure, associated individuals, and co-registration history. Each Transform run returns new nodes. Those nodes become the seeds for further Transform runs. The graph expands outward in waves.
This expansion dynamic is one of the most important things to understand about AI-augmented graph analysis. The graph can always be expanded. There is no natural stopping point in the data. Every node could be the subject of further Transform runs that would add new nodes, which could each be expanded further. The graph is not a completed picture of reality; it is an artifact of the analyst's collection decisions. When practitioners debate whether two investigated entities are "connected," they are often arguing about a graph topology that reflects their collection choices as much as the underlying reality. Two entities are always connectable through some chain of associations in a world rich enough in public data. The meaningful question is not "are these entities connected?" but "is this connection significant in a way that advances the analytic question?" That judgment cannot be automated.
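The discipline that paragraph argues for (an imposed stopping point, since the data offers none) can be made explicit in code. The sketch below replaces real Transform calls with an invented lookup table and enforces a hop budget on breadth-first expansion.

```python
from collections import deque

# Stand-in for a real Transform call (e.g., a registry or DNS lookup);
# here a static table of invented connections.
MOCK_TRANSFORM = {
    "Company A": ["Company B", "director-x@example.com"],
    "Company B": ["Company C"],
    "Company C": ["Company D"],
    "director-x@example.com": ["Company E"],
}

def expand(seed: str, max_hops: int) -> set[str]:
    """Breadth-first expansion with an explicit hop budget: the scope
    discipline the text argues must be set before the pipeline runs."""
    seen = {seed}
    frontier = deque([(seed, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # stopping rule: the graph has no natural edge, so impose one
        for neighbor in MOCK_TRANSFORM.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen

print(expand("Company A", max_hops=1))  # seed plus direct connections only
print(expand("Company A", max_hops=3))  # three waves out
```

Every increment of `max_hops` is a collection decision, and the resulting graph is an artifact of that decision as much as of the underlying network.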
The timeline layer adds temporal structure to the graph. Pattern-of-life analysis aggregates observed data and visualizes patterns and outliers, allowing analysts to identify routine and unusual behavior, model how people might respond to different situations, and predict future activity by following events as they unfold across time. For the shell company investigation, this means plotting corporate registration dates, directorship appointment and resignation dates, shipping transaction timestamps, domain registration dates, and Telegram posting timestamps onto a common timeline and looking for synchronization signals. If Company B was incorporated three weeks after sanctions were first imposed on Company A, that temporal proximity is potentially significant. If the same individual resigned from Company A's board and was appointed to Company B's board on the same date, that temporal co-occurrence is a strong structural signal worth examining.
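Detecting that kind of synchronization is mechanically simple once events share a common timeline. The sketch below flags event pairs within a configurable window; every event and date is invented, and temporal proximity is a lead to examine, not proof of coordination.

```python
from datetime import date
from itertools import combinations

# Invented timestamped events across the investigation's entities.
events = [
    ("sanctions imposed on Company A",     date(2024, 3, 1)),
    ("Company B incorporated",             date(2024, 3, 22)),
    ("X resigns from Company A board",     date(2024, 4, 10)),
    ("X appointed to Company B board",     date(2024, 4, 10)),
    ("shell-co.example domain registered", date(2023, 11, 2)),
]

def synchronized(events, window_days: int):
    """Flag event pairs whose timestamps fall within the window."""
    hits = []
    for (name_a, day_a), (name_b, day_b) in combinations(events, 2):
        gap = abs((day_a - day_b).days)
        if gap <= window_days:
            hits.append((name_a, name_b, gap))
    return hits

for a, b, gap in synchronized(events, window_days=30):
    print(f"{gap:3d} days apart: {a} / {b}")
```

The same-day resignation and appointment surfaces with a gap of zero; the distant domain registration pairs with nothing. The analyst still decides which of the flagged pairs are structural signals and which are coincidence.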
The timeline is also where the Ukrainian grain ship investigation becomes instructive. Using satellite imagery collected by Planet Labs and ship tracking data from Lloyd's List Intelligence (a maritime data service), Bellingcat (the open-source investigative outlet) reconstructed the journey of the Zafar from Crimea, where it was observed loading grain into silos with its AIS (Automatic Identification System, the mandatory vessel tracking transponder) turned off, before it later activated its AIS en route to Yemen and transited a UN inspection point in Djibouti without being flagged. That investigation succeeded not because any single data source was decisive but because the timeline revealed a gap — a period of AIS silence in the satellite record — that became the central evidentiary pivot. The graph told you what entities were involved. The timeline told you when the vessel disappeared from the tracking record and reappeared elsewhere, and that disappearance was the story.
An AI pipeline that outputs nodes and edges without the temporal dimension is constructing a photograph when what the analyst needs is a film.
A more recent case illustrates the same principle at a different scale. As reported in the newsletter informing this module, a ship allegedly carrying stolen Ukrainian grain was observed sailing away from Israel after an importer refused to unload cargo. The OSINT community tracked that vessel in near-real-time using the same combination of AIS data, satellite imagery, and port authority records that characterized the Zafar investigation. What an AI-augmented pipeline adds to this kind of tracking is the automated fusion of data streams: rather than one analyst manually correlating AIS positions with satellite imagery, the pipeline ingests both feeds, flags the discrepancies, and presents the analyst with the anomaly already surfaced. The analyst then judges whether the anomaly is significant. She does not spend her time finding it.
Where Human Judgment Must Be Inserted — And Where It Gets Pushed Out
The pipeline generates pressure toward full automation. Every manual checkpoint is a delay. Every analyst review is a bottleneck. The organizational incentive is always to let the machine run further before stopping for human review. This pressure is dangerous, and it is worth being precise about where it causes harm and where it is legitimate.
Legitimate automation: collecting from indexed databases, deduplicating records against existing corpora, applying NER to unstructured text, running coreference resolution, executing graph layout algorithms, generating timeline visualizations from timestamped data. These are mechanical tasks where the primary sources of error are computational rather than analytical. Automating them creates time and cognitive capacity for the analyst to do what she is trained for.
Where human judgment must be inserted — and defended against organizational pressure to remove it — is specific.
Node validation. Before any extracted entity becomes a confirmed node in the working graph, a human must answer the question: does this entity exist in the way the pipeline has represented it? No single platform output constitutes verified intelligence. If Shodan flags an exposed service, verify current status. If OSINT Industries links a phone number to an account, corroborate through a second selector query. If Maltego surfaces a corporate registration connection, validate against primary registrar records. The machine will confidently create nodes for entities that are misspellings, alias duplicates, or artifacts of the training data. Once those ghost nodes are in the graph and connected to real nodes through extracted relationships, they corrupt every subsequent analytic step.
Edge weighting and edge typing. A corporate ownership relationship is different from a financial transaction relationship, which is different from a shared address relationship, which is different from a co-occurrence in a news article. The pipeline will extract all of these and, in the absence of explicit human intervention, the visualization layer will represent them with equal visual weight. An analyst looking at the resulting graph may read evidential weight into connections that are purely associative — a confusion that undermines collection management, targeting, and production. The analyst must decide which edge types are substantive for the analytic question at hand, which are merely circumstantial, and which should be filtered out of the primary view before any analytic judgments are made.
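Encoding that judgment explicitly can be as simple as a typed filter applied before the primary analytic view. The edge types and entities below are invented; the point is that the substantive-versus-associative split is a human decision written down, not a model output.

```python
# Edges as (source, target, type) triples; types and entities are invented.
edges = [
    ("Company A", "Company B", "SubsidiaryOf"),
    ("Company B", "Individual X", "DirectorOf"),
    ("Company B", "Company C", "SharesAddressWith"),
    ("Company A", "Company D", "CoMentionedInArticle"),
]

# The analyst's judgment, encoded explicitly: which edge types carry
# evidential weight for THIS analytic question, and which are associative.
SUBSTANTIVE = {"SubsidiaryOf", "DirectorOf"}

def primary_view(edges):
    """Drop associative edges before any analytic read of the graph."""
    return [e for e in edges if e[2] in SUBSTANTIVE]

print(primary_view(edges))
# [('Company A', 'Company B', 'SubsidiaryOf'), ('Company B', 'Individual X', 'DirectorOf')]
```

Changing the `SUBSTANTIVE` set for a different analytic question changes the picture the analyst sees, which is exactly why the set must be a deliberate, documented choice rather than a default.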
Cluster rejection. The clustering stage will sometimes group entities that should not be grouped — an error called "merge error" in entity resolution terminology. Two different individuals named "Ahmed Al-Hassan" who both appear in shipping records from Gulf ports will be clustered as a single node unless a human stops the process and separates them. In counterproliferation investigations, merge errors of this kind can attribute activities to an innocent party or, worse, allow a sanctioned individual to hide behind a name-sibling in the public record. Modern AI-powered entity resolution unifies fragmented data into a knowledge graph, but the task of unmasking sophisticated networks also requires human review to avoid false positives.
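Fixing a merge error amounts to re-partitioning the cluster on whatever discriminating attribute the records offer. The sketch below uses date of birth as that attribute; all records are invented, and records lacking the attribute are routed to human review rather than guessed.

```python
from collections import defaultdict

# Two distinct individuals, wrongly merged on name alone. Records are
# invented; date of birth stands in for whatever discriminator exists.
cluster = [
    {"name": "Ahmed Al-Hassan", "dob": "1971-05-02", "doc": "manifest_0042"},
    {"name": "Ahmed Al-Hassan", "dob": "1984-09-17", "doc": "manifest_0107"},
    {"name": "Ahmed Al-Hassan", "dob": "1971-05-02", "doc": "registry_0009"},
]

def split_on(cluster, key):
    """Re-partition a merged cluster by a discriminating attribute.
    Records missing the attribute stay unresolved for human review."""
    resolved, unresolved = defaultdict(list), []
    for record in cluster:
        if key in record:
            resolved[record[key]].append(record)
        else:
            unresolved.append(record)
    return dict(resolved), unresolved

resolved, unresolved = split_on(cluster, "dob")
print(f"{len(resolved)} distinct individuals, {len(unresolved)} records need review")
```

The unresolved queue is the important design choice: when the discriminator is absent, the pipeline defers to a human instead of silently keeping the merge.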
Deciding what relationships matter. This is the highest-order judgment in the pipeline, and it is the one most susceptible to being quietly removed. AI relationship extraction will find every relationship the training data taught it to recognize. The question "which of these relationships is analytically significant for this specific investigation?" cannot be answered by the model, because the answer depends on the analytic question and the intelligence context — both of which live in the analyst's head, not in the pipeline's training corpus. In the shell company investigation, the model will extract ownership relationships, directorship relationships, address co-location relationships, and infrastructure-sharing relationships with approximately equal confidence weighting. The analyst who knows that address co-location in BVI corporate services providers is common and often uninformative, while overlapping directorship in companies registered in the same week in different jurisdictions is highly suspicious given the investigation context — that analyst is doing intelligence analysis. The model is doing information retrieval.
The over-automation risk is not theoretical. The World Economic Forum's Global Risks Report 2026 flags directly that as geoeconomic confrontation intensifies, AI will be weaponized to target supply chains and corporate networks through exactly this aggregation logic. An adversary who understands that your automated OSINT pipeline picks up everything that pattern-matches to "suspicious shell company" can structure its operations to flood your pipeline with decoys — entities that look like shell companies, populate your graph with noise, and require human review time to dismiss. Over-automation does not just degrade your analysis; it becomes an attack surface. The pipeline that runs without human checkpoints is a pipeline your adversary can manipulate by understanding its extraction logic.
Tooling: What You Can Do Today Without Writing Code
The honest answer about tooling is that there is a meaningful capability gap between what non-technical analysts can access directly and what requires engineering support — and acknowledging that gap is more useful than pretending it does not exist.
What works without engineering. Maltego's browser-based platform as of 2026 is genuinely accessible to non-engineers. The browser-based Maltego Graph includes built-in data, map, and histogram views, an AI Assistant for person-of-interest investigations, and new use cases being added throughout 2026. You start with a seed entity — a company name, an email address, a domain — drag it onto the canvas, and run Transforms against it. Related entities appear, forming a graph that visually maps connections and surfaces patterns that list-based tools miss. The Transform Hub currently integrates more than 500 third-party data sources including Shodan, VirusTotal (a malware and URL scanning service), HaveIBeenPwned (a breach notification database), Hunter.io (an email address lookup and verification tool), and Recorded Future (a threat intelligence platform). The free Community Edition is capped at twelve results per transform, which is useful for learning but insufficient for production investigations. Professional licensing runs around $999 per year; enterprise pricing adds team collaboration features and removes API rate limits.
ShadowDragon (a no-code collection platform for multi-source social media and identity correlation) is worth knowing for multi-source correlation. You drag an entity onto the graph, run transforms that execute as automated queries gathering related data, and results appear visually connected to the original entity. OSINT Industries provides real-time breadth — a single selector such as a phone number, email address, or username queries hundreds of live public sources simultaneously, useful for rapid identity resolution before moving into deeper graph analysis. Starting with OSINT Industries for a quick profile creates a baseline; as the investigation grows, you import entities into Maltego, which allows graph-based analysis across multiple connected subjects.
For timeline analysis without coding, Cambridge Intelligence's KronoGraph (a purpose-built timeline visualization tool for multi-entity temporal analysis) is the most capable purpose-built option, with native support for multi-entity timeline rendering, heatmap generation for pattern-of-life overview, and event clustering at scale. Flourish, originally a data visualization tool, has become widely used for interactive network diagrams that can be shared with non-technical audiences — useful for briefing products, though it lacks the analytical query capabilities of purpose-built OSINT platforms.
For knowledge graph construction without Python, Sintelix (a document intelligence platform with built-in NER and relationship extraction) deserves mention: it ingests documents directly and runs NER and relationship extraction through a graphical interface, producing network visualizations with multiple layout options including force-directed, hierarchical, and map overlay views. It handles 1,600-plus file formats, which matters when investigations draw on heterogeneous document sets across languages and formats. The learning curve is real, but it is accessible to a motivated non-engineer.
What requires engineering support. Custom NER models trained on domain-specific entity types — specialized arms components, dual-use technology part numbers, jurisdiction-specific corporate filing formats — require machine learning engineering to build and maintain. The standard NER models embedded in consumer OSINT platforms are trained on general-domain text and perform poorly on domain-specific entities. If your investigation requires identifying specific technical specifications embedded in procurement documents, or correlating military equipment part numbers across shipping manifests, you need a custom NER layer.
Agent-based automation pipelines — workflows where a large language model autonomously decides what to query next based on what it finds, rather than executing a predefined sequence of transforms — require engineering to build safely. LangGraph (the graph-based multi-agent orchestration framework built on top of LangChain, a popular library for building LLM-powered applications) is what most intelligence-adjacent AI teams are currently using to construct these workflows. It allows you to define nodes (tasks the agent can perform), edges (transitions between tasks), and state (what the agent remembers between steps). A LangGraph-based OSINT agent can, in principle, collect a seed entity, run NER, surface clusters, query external databases for relationship data, and expand the graph autonomously. In practice, deploying this in a production intelligence environment requires carefully defined scope constraints, output validation steps, and human-in-the-loop checkpoints at each major decision branch — otherwise the agent will confidently pursue false leads at machine speed.
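The control pattern LangGraph formalizes (state, nodes, and gated transitions) can be illustrated without the framework itself. The plain-Python sketch below injects the human checkpoint as a callable, so the loop halts when the analyst declines to expand past the hop budget; the transform table and all entity names are invented.

```python
# The human-in-the-loop control pattern described above, sketched without
# LangGraph: a state-machine loop in which expansion beyond a budget
# requires explicit analyst approval. All names are invented.
def mock_transform(entity):
    """Stand-in for an external query step an agent would choose to run."""
    return {"Company A": ["Company B"], "Company B": ["Company C"]}.get(entity, [])

def run_agent(seed, hop_budget, approve):
    """`approve` is the human checkpoint: a callable consulted before each
    expansion wave past the budget. Injected here so the loop is testable."""
    state = {"graph": {seed: []}, "frontier": [seed], "hops": 0}
    while state["frontier"]:
        if state["hops"] >= hop_budget and not approve(state):
            break  # analyst declined: stop at the checkpoint
        next_frontier = []
        for entity in state["frontier"]:
            found = mock_transform(entity)
            state["graph"][entity] = found
            next_frontier.extend(e for e in found if e not in state["graph"])
        state["frontier"] = next_frontier
        state["hops"] += 1
    return state["graph"]

# Deny expansion past the budget: the agent halts after one wave.
print(run_agent("Company A", hop_budget=1, approve=lambda s: False))
# {'Company A': ['Company B']}
```

A real LangGraph deployment would express the same gate as a conditional edge with an interrupt for human input, but the safety property is identical: the agent cannot pursue leads at machine speed past the point where a human has agreed to follow.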
For organizations with Palantir deployments, AIP is probably the most capable environment for building structured intelligence workflows on top of collected OSINT without custom code. AIP is an AI-native platform designed for production workflows: you can embed large language models and AI agents directly into the systems your teams use every day, enabling automation of decisions, analysis of enormous datasets, and task execution with fewer manual steps. The limitation is cost and access — Palantir AIP is not a tool you pick up for a single investigation. It is an enterprise infrastructure commitment.
The most effective approach emerging from advanced threat intelligence teams is localized knowledge graphs: scraping heterogeneous data from public sources in controlled, reproducible batches, normalizing it, converting it to RDF (Resource Description Framework, a standard for representing structured data as linked triples of subject-predicate-object) format, mapping it to common ontologies, and loading it into local triplestores for offline SPARQL (a query language for RDF databases, analogous to SQL for relational databases) queries — circumventing API dependency entirely, eliminating the risk of leaking investigative intent to external platforms, and preserving operational security over extended investigation timelines. This is the right architecture for sensitive investigations with long timelines, but it requires engineering to set up.
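The triple pattern at the heart of that architecture fits in a few lines. The sketch below is an in-memory stand-in, with a pattern-match function playing the role of a SPARQL query; a real deployment would use an RDF library and a local triplestore, and every entity and predicate here is invented.

```python
# A minimal in-memory triplestore: subject-predicate-object tuples with a
# pattern-match query standing in for SPARQL.
triples = {
    ("ex:CompanyA", "ex:subsidiaryOf", "ex:CompanyB"),
    ("ex:PersonX",  "ex:directorOf",   "ex:CompanyB"),
    ("ex:PersonX",  "ex:directorOf",   "ex:CompanyC"),
    ("ex:CompanyC", "ex:registeredAt", "ex:VistraHouseBVI"),
}

def query(s=None, p=None, o=None):
    """Match triples against a pattern; None is a wildcard,
    playing the role of a SPARQL variable."""
    return sorted(t for t in triples
                  if (s is None or t[0] == s)
                  and (p is None or t[1] == p)
                  and (o is None or t[2] == o))

# "Which entities does PersonX direct?" The SPARQL analogue would be
# SELECT ?o WHERE { ex:PersonX ex:directorOf ?o }
print([o for _, _, o in query(s="ex:PersonX", p="ex:directorOf")])
# ['ex:CompanyB', 'ex:CompanyC']
```

Because the store is local, every query stays on the analyst's machine: no API call, no external log entry, no leaked investigative intent, which is the operational-security point of the architecture.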
Teams that want AI-augmented OSINT capability but cannot access engineering support should invest in Maltego proficiency, understand the limitations of automated extraction, and build disciplined manual verification into every stage of the pipeline. The no-code tools are genuinely capable. The gap between no-code and engineering-supported capability is real but not unbridgeable with skill and discipline.
The Graph Is a Hypothesis
There is a seductive quality to a completed graph. Nodes connected by typed edges, laid out by a force-directed algorithm that clusters the densely connected entities near the center and pushes peripheral nodes to the margins — it looks like a picture of reality. It has the visual authority of a diagram rather than the interpretive authority of an argument. Analysts brief these graphs to seniors who read the topology as if it were a map.
That authority is unearned.
When official narratives were conflicting and opaque regarding the shoot-down of Malaysia Airlines flight MH17 over eastern Ukraine in 2014, Bellingcat analysts turned to the open-source record: they meticulously examined social media for photos and videos of a Buk missile launcher moving through eastern Ukraine, geolocated the images by cross-referencing road signs, buildings, and the position of the sun with satellite imagery, and traced the launcher's exact route to a field south of Snizhne. A contrail photo taken from Torez, when triangulated, pointed directly to the launch site. That investigation — still a gold standard for the field — worked because analysts treated every intermediate finding as a hypothesis to be tested against additional evidence, not a node to be confirmed by its position in the graph. The graph was an organizing tool for an argument, not the argument itself.
The graph you construct from an AI-augmented OSINT pipeline is a hypothesis about the structure of a network based on the sources you collected, the extraction models you ran, and the relationship types you chose to include. Change any of those parameters and the graph changes. An adversary who understands this can structure their activities to defeat specific collection methods — using virtual office addresses identical to hundreds of other legitimate businesses, routing communications through infrastructure shared with clean parties, timing corporate registrations to cluster with periods of high registration activity that generate noise. The OSINT market's fundamental problem is epistemological: what does it mean to know something about a network from public-source collection, and how do you distinguish confirmed nodes from artifacts of the collection process?
The CACI acquisition of ARKA Group, which closed in March 2026 for $2.6 billion, illustrates where this epistemological challenge meets the institutional frontier. CACI acquired electro-optical, infrared, and hyperspectral satellite sensors alongside the AI software that interprets them in real time — an end-to-end package that pipes finished intelligence into command systems. The vertical integration represents an organizational bet that closing the collection-to-analysis pipeline within a single architecture will produce better outcomes than assembling the pipeline from heterogeneous tools. But integration also compresses the distance at which human judgment can intervene. When collection, tagging, relationship extraction, and graph construction all run on the same platform under the same governance framework, the analyst's authority to stop the pipeline and say "this node is wrong" depends entirely on whether the platform's design preserves that authority or quietly routes around it in the name of workflow efficiency.
That preservation is not automatic. It must be designed for, insisted on, and defended against the institutional pressure of every senior official who looks at a dashboard and asks why the analysis is not moving faster.
What you can now do, having walked through this pipeline, is something specific: you can sit down with a collection requirement, a toolset, and a set of candidate sources and make deliberate decisions about where the automation runs unsupervised and where you insert a checkpoint. You can articulate to your team or your leadership why those checkpoints exist and what analytic risk their removal creates. You can brief a graph to a senior official and accurately characterize it as a working hypothesis that reflects specific collection decisions — not as a picture of ground truth.
The distance between scrape and insight has genuinely shrunk. What has not changed, and will not change regardless of model capability, is that the analyst who understands what the pipeline is doing at each stage will always produce better intelligence than the analyst who treats the output as authoritative. The tools have become powerful enough that you can reach a wrong answer very quickly and with great visual confidence. That is not analysis. The point is not getting there faster; it is getting there accurately.