Thick wrapper vs thin wrapper — why regulatory AI needs a purpose-built system.
"Just use ChatGPT" is the most expensive sentence in compliance software right now. Closely followed by "we'll use Harvey for that." Both are good products. Neither does what a regulatory intelligence platform needs to do. Here's why — and what RegAI is built differently for.
This post is opinionated. It's also written for the buyer who has been pitched five different "AI for compliance" tools in the last quarter and is trying to figure out which ones can actually pass an internal audit. We'll name names, including ours.
The thin-wrapper trap
A "thin wrapper" is a product whose substance is: frontier LLM API + a system prompt + nice UI. There's nothing wrong with thin wrappers — many are excellent productivity tools. The trap is that compliance, risk, and legal buyers see a slick demo, ask "how do you handle [specific reg]?", and the answer is some variant of "the model knows it from training." Then audit asks for the citation graph and the demo unravels.
You can usually spot a thin wrapper in 60 seconds:
- Same model behind the scenes as the consumer chatbot. (You can ask it to print its system prompt — it often will.)
- Outputs change between runs on the same input. (A quick probe for this follows the list.)
- "Sources" are model citations, not links into a curated, version-pinned corpus.
- Performance on niche regulators (MAS Notice 626, OCC Heightened Standards, KFTC guidance) drops off a cliff.
- No mechanism for SMEs to correct outputs and have those corrections persist.
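The second check is easy to script. A minimal probe, with `ask_vendor_tool` standing in for whatever API or export the vendor actually exposes; it is an assumption, not a real client:

```python
# Quick probe for checklist item two: same input, several runs, check whether
# the answer is stable. `ask_vendor_tool` is a hypothetical stand-in for
# whatever interface the vendor gives you.
def outputs_stable(ask_vendor_tool, prompt: str, runs: int = 3) -> bool:
    """Return True only if every run produced the identical answer."""
    answers = {ask_vendor_tool(prompt) for _ in range(runs)}
    return len(answers) == 1
```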
What ChatGPT, Claude, and Perplexity get wrong on regulation
We've been running an internal benchmark for three years against the major frontier models. Same prompts, same source regulations, scored against an SME-annotated ground truth. Two patterns are consistent across model upgrades:
1. Frontier models hallucinate authority. Asked "which DORA article requires recovery time objectives?", the leading models cite article numbers that don't exist, or paraphrase requirements from regulatory technical standards (RTS) that haven't been published yet. They confidently invent a paragraph reference because the authority signal makes the answer feel more credible. For a compliance team, that's worse than no answer.
2. They miss multi-jurisdictional nuance. "What's the beneficial-ownership (BO) threshold under MAS Notice 626?" gets you 25% — confidently. "What's the BO threshold under HKMA's Authorization Manual?" gets you 25% — also confidently, even where the actual threshold logic differs in important ways. The model has memorized the common case; it doesn't know what it doesn't know.
Neither pattern is solved by the next model release. Both reflect a structural limit: the model isn't trained on a curated regulatory corpus with explicit ground truth. It's trained on the open web, where the most-quoted version usually wins. That's not where compliance lives.
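The first failure mode is also the cheapest to guard against once you have a pinned corpus: check every cited article against what actually exists before the answer is surfaced. A minimal sketch, where `PINNED_CORPUS` and the citation pattern are illustrative assumptions, not RegAI's internals:

```python
# Verify that every article a model cites actually exists in a version-pinned
# corpus before the answer ships. Illustrative sketch only.
import re

# regulation -> set of article numbers present in the pinned version
PINNED_CORPUS = {
    "DORA": {str(n) for n in range(1, 65)},  # DORA's final text has 64 articles
}

CITATION_RE = re.compile(r"\b(DORA)\s+Article\s+(\d+)\b")

def unverifiable_citations(answer: str) -> list[str]:
    """Return cited articles that are absent from the pinned corpus."""
    return [
        f"{reg} Article {art}"
        for reg, art in CITATION_RE.findall(answer)
        if art not in PINNED_CORPUS.get(reg, set())
    ]

answer = "Recovery time objectives are mandated by DORA Article 82."
print(unverifiable_citations(answer))  # ['DORA Article 82'] -> flag, don't ship
```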
Where horizontal "AI for X" tools land
Tools like Harvey (legal AI), Hebbia (knowledge work), Glean (enterprise search), and generic enterprise copilots are all stronger than a raw chatbot. They add proper enterprise grounding — RAG over your documents, role-based access, audit logs. But they're horizontal. Built for general legal work, general knowledge work, general enterprise search.
Regulation isn't general legal work. A few specific differences:
- Hierarchy matters. A regulation sits in a hierarchy: Level 1 text, RTS, implementing technical standards (ITS), guidance, national overlays. Treat those layers as flat documents and the answers are wrong.
- Effective dates and grandfathering. The rule that applies depends on when the obligation came into force and which transitional provisions apply. Most general legal AI ignores time entirely.
- Cross-references compound. DORA Article 28 cites RTS articles that cite ESA Q&As that cite the underlying legislation. Resolving that chain correctly requires graph reasoning, not document chunking.
- Quantitative thresholds are not text. A 25% beneficial-ownership threshold isn't best handled by an LLM — it's structured data with downstream rules (a sketch follows this list). Treating it as text is how you get wrong answers about whether a customer is in scope.
- The output is a control, not a memo. Legal AI optimizes for "draft me a memo on this." Regulatory AI has to produce a defensible compliance artifact that an internal auditor and a regulator can both walk through.
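On the thresholds point, here is a minimal sketch of what "structured data with downstream rules" can look like. The field names, the effective date, and the scoping logic are illustrative assumptions, not a restatement of any regulator's actual rule:

```python
# A threshold stored as structured data with a deterministic scoping check,
# rather than as text for an LLM to recall. Illustrative only.
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class OwnershipThreshold:
    jurisdiction: str
    instrument: str        # source instrument the threshold comes from
    threshold_pct: float   # ownership percentage that triggers BO status
    effective_from: date   # which rule applies depends on this date

def is_beneficial_owner(stake_pct: float, rule: OwnershipThreshold,
                        as_of: date) -> bool:
    """Deterministic in-scope check; runs as code, not as a model guess."""
    if as_of < rule.effective_from:
        raise ValueError("rule not in force; resolve transitional provisions")
    return stake_pct >= rule.threshold_pct

sg_rule = OwnershipThreshold("SG", "MAS Notice 626", 25.0, date(2020, 1, 1))
print(is_beneficial_owner(26.0, sg_rule, date(2024, 6, 1)))  # True
```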
What "thick wrapper" actually means
Calling something a thick wrapper is mostly a shorthand for: the system has invested in domain understanding beyond the LLM's pretraining. Specifically:
- A curated, version-pinned regulatory corpus. Every regulation, RTS, ITS, Q&A, and bulletin is ingested from the source, stored with a version stamp, and re-fetched when the source updates. Not RAG over a generic web crawl.
- A domain ontology. Obligations, controls, evidence, jurisdictions, license types, effective dates — all modeled as first-class entities with explicit relationships. The LLM's job is to populate the ontology, not to be the ontology. (A sketch of this split follows the list.)
- SME-annotated training data. Tens of thousands of obligation extractions, applicability decisions, gap classifications, and control mappings — labeled by senior consultants who do this work for a living and reviewed by independent SMEs. This is the part that takes years to build and can't be cloned with a prompt.
- An eval harness. Every model upgrade is tested against a frozen benchmark covering 20+ regulators across five industries before it ships to clients. New regulator? New eval set first.
- Deterministic post-processing. Where structured rules apply (thresholds, applicability logic, deduplication), classical code runs after the LLM, not the LLM alone.
- Workflow integration as a first-class feature. Outputs land in the right reviewer's queue, with the right permissions, with the right audit log. That's a compliance product, not a chat product.
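To make the "populate, don't be" distinction concrete: in a thick wrapper, an LLM extraction has to validate against a typed schema, grounded in pinned source paragraphs, before it enters the graph. The entities and fields below are illustrative assumptions, not RegAI's actual schema:

```python
# The LLM fills in a typed schema; ungrounded extractions are rejected.
from dataclasses import dataclass, field
from datetime import date

@dataclass
class SourceParagraph:
    instrument: str    # e.g. "DORA"
    paragraph_id: str  # e.g. "Art. 11(2)(c)"
    version: str       # pinned corpus version the text was taken from

@dataclass
class Obligation:
    obligation_id: str
    summary: str
    sources: list[SourceParagraph]  # typed link, not a free-text citation
    effective_from: date
    applies_to_license_types: list[str]
    related_obligations: list[str] = field(default_factory=list)

def ingest_llm_extraction(raw: dict) -> Obligation:
    """Reject extractions that don't ground into the pinned corpus."""
    if not raw.get("sources"):
        raise ValueError("extraction cites no source paragraphs; discard")
    return Obligation(
        obligation_id=raw["obligation_id"],
        summary=raw["summary"],
        sources=[SourceParagraph(**s) for s in raw["sources"]],
        effective_from=date.fromisoformat(raw["effective_from"]),
        applies_to_license_types=raw["applies_to_license_types"],
    )
```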
What we built — and why it took years
RegAI is a thick wrapper. The substance underneath the model:
- Tens of thousands of SME annotations. Senior Sia consultants — most with 10+ years in financial-services regulation, insurance regulation, pharma compliance, or AI governance — have labeled obligation extractions, applicability decisions, gap scores, and control drafts across DORA, MAS, HKMA, OCC, FCA, FDA, EMA, EU AI Act, NIST AI RMF, and more. Every annotation is reviewed by a second SME. The result is a training and evaluation set you can't replicate with a few weeks of prompt engineering.
- A regulatory ontology. Every obligation has a typed relationship to source paragraphs, effective dates, license types, and adjacent obligations. We don't ask the LLM to "remember" relationships — we model them and ask the LLM to fill them in.
- An eval harness covering 1,000+ regulations. When we upgrade an underlying model — and we do — we re-run the eval first. If precision drops on, say, MAS Notice 626 obligation classification, that gets flagged before the upgrade ships. (A sketch of the gate follows this list.)
- Deterministic logic where it belongs. Applicability rules, threshold checks, and dedupe logic run as code. The LLM does the parts where natural language genuinely matters; the parts that are rules stay as rules.
- Workflow primitives. Review queues, approval gates, decision logs, role-based permissions, audit trails. The compliance product around the AI.
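In the spirit of that gate, a minimal sketch. The benchmark names, the regression threshold, and the `run_eval` helper are hypothetical:

```python
# Model-upgrade regression gate: re-run frozen benchmarks, block on regression.
FROZEN_BENCHMARKS = ["mas_notice_626_obligation_cls", "dora_gap_scoring"]
MAX_PRECISION_DROP = 0.01  # block the upgrade if precision falls over 1 point

def gate_upgrade(run_eval, baseline: dict[str, float],
                 candidate_model: str) -> bool:
    """Re-run every frozen benchmark; ship only if nothing regresses."""
    for bench in FROZEN_BENCHMARKS:
        precision = run_eval(bench, model=candidate_model)
        if baseline[bench] - precision > MAX_PRECISION_DROP:
            print(f"BLOCKED: {bench} precision {precision:.3f} "
                  f"(baseline {baseline[bench]:.3f})")
            return False
    return True
```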
None of this shows up in a 60-second demo. All of it shows up the first time internal audit asks "show me the trail."
Why purpose-built wins for compliance
Three reasons the thin-wrapper / horizontal-AI approach loses to a system purpose-built for compliance:
1. The bar is defensibility, not productivity. A productivity tool is allowed to be wrong sometimes if the cost of correction is low. A compliance tool isn't. Every output has to defend itself in front of an auditor and (eventually) a regulator. That requires citation, reasoning, and a chain of human accountability that thin wrappers don't produce by default.
2. The corpus matters more than the model. A frontier model with bad source data outperforms a smaller model with good source data on raw benchmarks — and underperforms catastrophically on real compliance work. Source quality, recency, and version control are the moat. The model is the engine; the corpus is the road.
3. The workflow is the product. Compliance teams don't get value from a chat answer. They get value from "this obligation has been mapped to ICT-POL-07 with 62% coverage; here's a drafted addendum; reviewer assigned to Sarah." That's not an LLM feature. That's a compliance product that happens to use an LLM.
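For a sense of what that output looks like as data rather than chat, an illustrative record. Every field value below is invented for the example:

```python
# A reviewable artifact with an audit trail, not a chat transcript.
mapping_artifact = {
    "obligation_id": "DORA-ART11-REC-01",
    "mapped_control": "ICT-POL-07",
    "coverage_pct": 62,
    "draft_addendum_doc": "addenda/ict-pol-07-rev3.docx",
    "assigned_reviewer": "sarah",  # lands in her review queue
    "approval_state": "pending_review",
    "audit_trail": [
        {"actor": "regai", "action": "drafted", "ts": "2025-01-14T09:12:00Z"},
    ],
}
```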
When a thin wrapper is the right answer
To be honest about the tradeoff, a thin wrapper is the right answer when:
- The use case is internal-facing (a compliance team chatting about an interpretation, not producing audit artifacts).
- The corpus is small and bounded (a single regulator, well-trodden).
- The cost of being wrong is reputational only, not regulatory.
If you're a small firm that wants to ask questions about a single regulator's published rules, a thin wrapper is plenty. If you're a G-SIB running a DORA program across 14 entities, three jurisdictions, and 600 obligations — you need something else.
The pragmatic test
When you're evaluating any "AI for compliance" tool, ask these five questions:
- "For this output, show me the source paragraphs that produced it." (Citation graph?)
- "What changed between the AI's first draft and the published version, and who approved each change?" (Decision log?)
- "How does this perform on regulator X" — pick one obscure to the vendor — "and how do you know?" (Eval harness?)
- "What happens when the regulator updates the source text — how is my matrix kept current?" (Version pinning + diff?)
- "How many SMEs annotated the training data, and over what period?" (Or: is there any training data at all, or is this just a prompt?)
If any of those answers is hand-wavy, the tool is a thin wrapper. That's not necessarily disqualifying — but you should know what you're buying.
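On the version-pinning question in particular, the mechanism a real answer rests on is mundane: compare a content hash of the pinned version against a fresh fetch of the regulator's published text. A minimal sketch, with `fetch_source_text` as a hypothetical fetcher:

```python
# Detect source updates by hashing pinned vs. freshly fetched text.
import hashlib

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def source_changed(pinned_hash: str, fetch_source_text) -> bool:
    """True means the affected obligations should go back into review."""
    return content_hash(fetch_source_text()) != pinned_hash
```

The hash is trivial; the product question is whether a detected change routes the affected obligations back to a reviewer or silently rewrites the matrix.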
Closing
Thin wrappers will keep getting better as the underlying models improve. They'll keep being useful for general productivity. But the gap between "good enough for chat" and "good enough for audit" is a structural one — and closing it is the work, not the bonus.
The thick-wrapper case for regulatory AI is the same as it was for industrial machinery: you can use a general-purpose tool to do the job, or you can use a tool built specifically for the job. The first works when the stakes are low. The second works when the stakes are high.
Compliance is the second.
