The hidden economy of customs data

Essay № i · Filed by: Analyst desk · Date: April MMXXVI · Length: 2,900 words

The hidden economy of customs data: why AI is finally making it usable.

Customs declarations are the most underused goldmine in global trade. A wall of strings without entity resolution; a map of flows with it. The shift is operational, not aspirational.

§I

Why this data has been useless.

Until recently, the standard answer to "what is in customs data" was a shrug and a hand wave at a vendor whose dataset cost six figures and whose schema dated from the mid-2000s. The schema was not the problem. The schema was a symptom. The real problem was that there were one thousand two hundred and forty schemas — one per consigning territory, one per receiving territory, occasionally one per port within a territory — and that the strings inside them, on which any cross-border join depended, had no canonical form.

An entity called ACME LOGISTICS INTL LLC in a Felixstowe declaration is Acme Logistics International, Limited in a Hamburg one, ACME LOG INT'L in a New Jersey one, АКМЕ ЛОЖИСТИКС in a Novorossiysk one, and Acme Lojistik Uluslararası A.Ş. when the parent ships under its Turkish trading name. Without entity resolution, a customs database is a wall of these strings, and the wall does not join. With entity resolution, the same database becomes a map of flows — the same supplier, the same lane, the same counterparty bank, the same beneficial owner — and that map is what a compliance officer, an underwriter or a sourcing lead needs.
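A minimal sketch of the name-canonicalisation step for the Latin-script variants above, assuming a simple token pipeline. The lookup tables `LEGAL_SUFFIXES` and `ABBREVIATIONS` and the function `canonical_name` are hypothetical illustrations, not the production tables; the Cyrillic and Turkish variants would additionally need transliteration and trading-name lookups that this sketch does not attempt.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny lookup tables for illustration only.
LEGAL_SUFFIXES = {"llc", "ltd", "limited", "inc", "gmbh"}
ABBREVIATIONS = {"intl": "international", "log": "logistics"}


def canonical_name(raw: str) -> str:
    """Reduce a declared entity string to a comparable canonical form."""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", raw)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    lowered = stripped.casefold()
    # Remove periods and apostrophes inside tokens (INT'L -> intl),
    # then turn any remaining punctuation into spaces.
    lowered = re.sub(r"[.'’]", "", lowered)
    lowered = re.sub(r"[^\w\s]", " ", lowered)
    # Expand known abbreviations, then drop legal-form suffixes.
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in lowered.split()]
    tokens = [tok for tok in tokens if tok not in LEGAL_SUFFIXES]
    return " ".join(tokens)


variants = [
    "ACME LOGISTICS INTL LLC",
    "Acme Logistics International, Limited",
    "ACME LOG INT'L",
]
# All three Latin-script variants collapse to one form.
canonical_forms = {canonical_name(v) for v in variants}
```

The Cyrillic string АКМЕ ЛОЖИСТИКС passes through this function unchanged apart from case, which is why transliteration sits in the stack as its own concern rather than as a cleanup step.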

§II

The new ER stack.

The entity-resolution stack we operate is, in outline, four layers: schema harmonisation, transliteration and name canonicalisation, ensemble blocking, and a graph-aware match function that uses the surrounding hops as evidence. None of the layers is, in isolation, novel. The novel thing is the discipline of running them as a stack against a single canonical schema, calibrated continuously, with the calibration window and the test set published. The current production F1, against a stratified test set of 41,200 entity pairs across the 32 consigning territories, is 0·943. We publish the confusion matrix per territory in the Library, with the dates on which each territory's calibration was last refreshed.
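The ensemble-blocking layer can be illustrated with a small sketch. It assumes that "voted" means a pair of names becomes a match candidate only when enough blockers place both names in the same block; the three key functions below (`first_token`, `prefix4`, `consonant_skeleton`) and the `min_votes` threshold are hypothetical stand-ins for the three blocker families, which this essay does not specify.

```python
from collections import defaultdict
from itertools import combinations


# Three hypothetical blocking keys over already-canonicalised, non-empty names.
def first_token(name: str) -> str:
    return name.split()[0]


def prefix4(name: str) -> str:
    return name.replace(" ", "")[:4]


def consonant_skeleton(name: str) -> str:
    return "".join(c for c in name.replace(" ", "") if c not in "aeiou")[:6]


BLOCKERS = (first_token, prefix4, consonant_skeleton)


def candidate_pairs(names, min_votes=2):
    """Return pairs that at least `min_votes` blockers put in the same block."""
    votes = defaultdict(int)
    for blocker in BLOCKERS:
        blocks = defaultdict(list)
        for name in names:
            blocks[blocker(name)].append(name)
        for members in blocks.values():
            # Each blocker casts one vote per pair that shares its key.
            for a, b in combinations(sorted(set(members)), 2):
                votes[(a, b)] += 1
    return {pair for pair, n in votes.items() if n >= min_votes}


names = [
    "acme logistics international",
    "acme log intl",
    "acne logistic",
    "apex freight",
]
pairs = candidate_pairs(names)
```

Only the surviving pairs go on to the match function, which keeps the expensive comparison away from the full cross product of names.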

The "graph-aware" piece is worth dwelling on. The naive ER approach treats each entity pair as an independent decision: are these two strings the same entity, yes or no? The graph-aware approach treats them as decisions against the surrounding evidence: are these two strings the same entity, given that one of them ships through the same lane, against the same HS code, financed by the same bank, in the same week, as a third entity which is already canonically resolved? The third-entity evidence shifts the prior. In practice it is the difference between an F1 in the high 0·8s and an F1 above 0·94.
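One way to picture "the third-entity evidence shifts the prior" is as a log-odds update. The sketch below is a naive-Bayes style illustration, not the production match function: it assumes the evidence features are independent, and the weights in `EVIDENCE_WEIGHTS` are hypothetical numbers chosen only to show the mechanics.

```python
import math

# Hypothetical log-likelihood-ratio weights for context shared with an
# already-resolved third entity. Real weights would be calibrated.
EVIDENCE_WEIGHTS = {"lane": 0.9, "hs_code": 0.4, "bank": 1.2, "week": 0.3}


def logit(p: float) -> float:
    return math.log(p / (1.0 - p))


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def graph_aware_probability(string_only_prob: float, shared_features) -> float:
    """Shift a string-only match probability using surrounding graph evidence."""
    log_odds = logit(string_only_prob)
    for feature in shared_features:
        # Each shared feature adds its weight in log-odds space.
        log_odds += EVIDENCE_WEIGHTS[feature]
    return sigmoid(log_odds)


# A borderline string match, with and without graph evidence.
naive = graph_aware_probability(0.6, [])
with_graph = graph_aware_probability(0.6, ["lane", "bank"])
```

With no shared context the function returns the string-only probability unchanged; each shared feature moves a borderline pair further from the decision threshold.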

Layer	What it does	F1 contribution
Schema harmonisation	1,240 native schemas → 1 canonical	baseline · enabling
Transliteration · canonicalisation	Latin, Cyrillic, Arabic, CJK → canonical Latin form	+ 0·08 (vs. exact match)
Ensemble blocking	Three blocker families, voted	+ 0·04
Graph-aware match	Surrounding evidence as prior	+ 0·06
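The table's figures can be read as a ladder. The sketch below assumes the contributions are additive and stack in table order, which the table does not state explicitly; under that assumption, removing the graph-aware layer from the production figure gives 0·883, consistent with the "high 0·8s" quoted for the naive approach. The F1 helper uses the standard definition from confusion-matrix counts; the function names are illustrative.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """Standard F1: harmonic mean of precision and recall, from pair counts."""
    return 2 * tp / (2 * tp + fp + fn)


PRODUCTION_F1 = 0.943  # figure quoted in the essay

# Layer contributions from the table, listed from the top of the stack down.
CONTRIBUTIONS = [
    ("graph-aware match", 0.06),
    ("ensemble blocking", 0.04),
    ("transliteration and canonicalisation", 0.08),
]


def implied_f1_ladder(final_f1: float, contributions):
    """F1 implied before each layer is added, assuming additive contributions."""
    ladder = []
    current = final_f1
    for layer, gain in contributions:
        current = round(current - gain, 3)
        ladder.append((layer, current))
    return ladder


ladder = implied_f1_ladder(PRODUCTION_F1, CONTRIBUTIONS)
```

Read this way, the last rung (0·763) is the exact-match starting point implied by the table, not a separately published measurement.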

§III

What you can finally see.

With the stack running, what becomes visible is the set of objects that the customs string view kept invisible. The supplier that ships from Felixstowe under one name and from Hamburg under another. The bank chain behind a counterparty whose two ends sit, on the resolved graph, one hop from a designated entity. The lane that has been quietly absorbing three percent of its volume into a re-routing through a third territory whose schema does not match the destination's. None of these are exotic findings. They are present in the customs data of every quarter of every year of the last decade. They were just unjoinable.

The compliance officer who could not, last year, defend a screening miss because the data did not join, can this year defend the screen because it does. The trade-finance underwriter who could not, last year, price a lane risk because the counterparty had three different identities depending on which port the declaration came from, can this year price it because the three identities resolve to one canonical entity, and the canonical entity has a documented bank chain.

Made newly visible

Multi-name suppliers
Cross-territory bank chains
Re-routing through schema gaps
HS-code shifting at known borders
Counterparty appointments on the day of designation

§IV

Operational implications.

For the compliance function, the implication is twofold. First, the screening miss is now investigable: the missed entity, the surviving string variants, the canonical match and the date the canonical link was established are all present in the trail. Second, the screening miss is now preventable: the canonicalisation stack runs at ingestion, the screen runs against the canonical entity, and the version receipts on the screen are signed against the methodology release that produced them.

For trade finance, the implication is that the underwriter can now price against the lane and the counterparty at the same time, rather than pricing each separately and adding the two. Where the lane carries a Signal Routes score and the counterparty carries a canonical bank chain on the resolved graph, the combined exposure is a single number with named contributors, and the contributors are stable across releases. The pricing model does not need to be re-fit when the score moves.

For sourcing, the implication is that supplier discovery and supplier-risk monitoring move onto the same plane. The HS code and the country qualifier no longer return a list of strings; they return a list of canonical entities, each with a resolved trade graph, a counterparty bank chain and a lane history. The second-source mapping that used to take a sourcing analyst three weeks of email-chasing can be drafted from the graph in under an hour, and the email-chasing can be reserved for the questions that the graph cannot answer.

§V

Where this is going.

The next eighteen months are, on our roadmap, mostly about widening the canonicalisation stack to the territories whose schemas are still partial. Five such territories, each with a national customs authority that publishes a partial schema with a non-canonical entity field, are in active onboarding for May, June and July MMXXVI. Beyond that, the work shifts to time-resolved entity movement: not just which entities are canonically the same, but when each canonical link was established, with what evidence, and under which methodology release. The "as-of" question (what did the graph look like on a particular date in the past?) is, for an investigations workspace, the question that matters most.
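The "as-of" question can be sketched as a filter over dated link records. This is an illustration under stated assumptions: the `CanonicalLink` record, its field names and the `links_as_of` function are hypothetical, chosen to mirror the three things the paragraph says each link should carry (when it was established, with what evidence, under which methodology release).

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CanonicalLink:
    raw_name: str             # string as it appeared in a declaration
    canonical_id: str         # resolved canonical entity
    established_on: date      # when the link was established
    evidence: str             # what supported the link
    methodology_release: str  # release tag under which it was made


def links_as_of(links, as_of: date):
    """Map each raw name to its latest link established on or before `as_of`."""
    view = {}
    for link in sorted(links, key=lambda l: l.established_on):
        if link.established_on <= as_of:
            # Later links overwrite earlier ones for the same raw name.
            view[link.raw_name] = link
    return view
```

An investigator asking what the graph looked like on a past date would call `links_as_of` with that date and see only the links that existed then, each still carrying its own evidence and release tag.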

The methodology behind everything described in this piece is held in the Library against the corresponding release tag. The next briefing will, by default, be run against the methodology release that is current on the day of the session. If you wish it to be run against a historic release — for an investigations or audit purpose, for example — the analyst desk will accommodate that on request.

A wall of strings without entity resolution; a map of flows with it. — Customs Methodology v4.1 · §1

Read further.

Two adjacent essays cover the same ground from different desks. Sanctions screening in 2026 tracks the OFSI Q1 changes through the canonicalisation stack into the operational screen. Lane-risk forecasting walks through how the resolved customs graph feeds into the Signal Routes score and why the contributors are named explicitly.
