
AI Data Governance is a Foundation Problem

Most AI governance programmes start at the wrong end — they govern the model when they should govern the data. The compliance question is not 'is the model fair'. It is 'where did the training data come from and can you prove it'.

Amit Kumar Singh - Technology Consulting Partner at MyData Insights


14+ years in industrial data · Former Accenture & EY · GCC, India, SEA

21 May 2026 · 9 min read

The bottom line

AI governance is a data foundation problem, not a model problem. The foundation has three parts: Microsoft Purview for catalogue and lineage, a Fabric Lakehouse with auditable bronze-silver-gold layers, and the semantic model as the policy enforcement layer. Compliance is what the architecture enforces by default.

Introduction

The compliance audit finds a problem. Your team scrambles to pull batch records from three systems, reconcile two conflicting product codes, and produce a lineage report that no one can agree on. The fine doesn't come from bad intent — it comes from no one owning the data in the first place.

Most compliance programmes in industrial businesses are a BI costume worn over a governance problem. The dashboard looks clean. The underlying data has no owner, no lineage, no master. That is the real audit risk.

The compliance problem is upstream of the report

A food manufacturer running HACCP controls needs to trace a batch from raw material intake to finished-goods despatch — ideally in under two hours. A pharma packaging site needs to satisfy GxP audit queries without manually hunting across SAP S/4HANA, a lab information system, and three spreadsheets. An EPC contractor needs project safety records that are traceable, timestamped, and unalterable for the life of the asset — which could be twenty-five years.

What these scenarios share is not a reporting gap. The report is the last 10 percent. The other 90 percent is whether the underlying data has a defined owner, a classification, a lineage trail, and a master record that every downstream system trusts.

When that foundation is missing, compliance becomes a forensic exercise every time. Audit-finding counts rise. Regulatory submission lead-times stretch. And the finance team starts asking uncomfortable questions about exposure.

What a governed data foundation actually looks like

The governance plane for a mid-market industrial business is not complicated, but it does require three layers working together.

**Microsoft Purview** sits at the centre. It scans the data estate — SAP S/4HANA, SAP ByDesign, MES systems, document stores — and classifies what it finds. Sensitive fields, personally identifiable information, batch identifiers, material master attributes. Purview builds the lineage map: this field came from this source, was transformed here, landed there. When an auditor asks "show me the chain of custody for batch 4471," that lineage map is the answer — not a pivot table assembled the night before.
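The chain-of-custody question is, at heart, a walk over a lineage graph. The sketch below is a minimal illustration of that idea in plain Python, not Purview's API: the asset names and the `UPSTREAM` mapping are hypothetical stand-ins for whatever a real lineage scan would produce.

```python
from collections import deque

# Hypothetical lineage edges: each asset maps to the assets it was derived from.
# The names are illustrative only, not real Purview identifiers.
UPSTREAM = {
    "powerbi.traceability_report": ["onelake.gold.batch_trace"],
    "onelake.gold.batch_trace": ["onelake.silver.batch", "onelake.silver.quality"],
    "onelake.silver.batch": ["sap_s4hana.batch_master"],
    "onelake.silver.quality": ["lims.inspection_results"],
    "sap_s4hana.batch_master": [],
    "lims.inspection_results": [],
}


def chain_of_custody(asset, upstream=UPSTREAM):
    """Return every asset the given asset depends on, breadth-first, nearest first."""
    seen = {asset}
    order = []
    queue = deque([asset])
    while queue:
        current = queue.popleft()
        for parent in upstream.get(current, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


def source_systems(asset, upstream=UPSTREAM):
    """Return the root sources: ancestors that have no upstream of their own."""
    return [a for a in chain_of_custody(asset, upstream) if not upstream.get(a)]
```

Calling `source_systems("powerbi.traceability_report")` on this toy graph returns the two originating systems. The point of the design is that the answer is a query over stored edges rather than a manual reconstruction.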

**Microsoft Fabric / OneLake** is the canonical store. One copy of the truth. Batch records, quality inspection results, supplier certificates, project safety logs: all ingested via Azure Data Factory, landed in OneLake as Delta Parquet, and governed by the same classification tags Purview applied upstream. You are no longer chasing data across seven systems, because every source lands in one place.

**Power BI** surfaces the audit-ready view. Direct Lake queries OneLake directly. No import, no stale cache. The compliance scorecard — batch traceability completion rate, open audit findings by category, days to regulatory submission — reflects what is actually in the store at the moment the auditor opens the report.

This is the Unify · Predict · Act sequence applied to governance: unify the data into one store with lineage attached, predict where the classification gaps are before the auditor finds them, act through automated policy enforcement rather than manual remediation sprints.

The three metrics this protects

Compliance programmes tend to measure themselves in lagging indicators — fines levied, audit findings closed, submissions filed. The leading indicators that tell you whether the governance foundation is holding are more useful and more honest.

**Batch traceability time** — how long it takes from a recall trigger to a complete upstream and downstream trace. Programmes with no data lineage routinely take 24–72 hours. With Purview lineage and OneLake as the canonical store, that number can fall to 2–4 hours in the initial deployment phase and tighten further as the model matures.

**Audit-finding count** — specifically, findings attributable to data completeness or data provenance rather than process failures. These are the findings that repeat. They repeat because the root cause — unowned data — was not fixed after the last audit. A governed data foundation stops the repeat.

**Regulatory submission lead-time** — the number of working days between a regulator request and a filed, evidenced response. This is a direct function of how quickly the team can assemble traceable, consistent records. When lineage is automated and the canonical store is current, that lead-time compresses.
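The three indicators above are simple enough to compute directly once the underlying events are recorded. The following is a rough sketch under stated assumptions: the function names, the `root_cause` field, and its category values are hypothetical, and the working-day count treats Monday to Friday as working days and ignores public holidays.

```python
from datetime import date, datetime, timedelta

# Assumed labels for findings that trace back to data rather than process.
DATA_ROOT_CAUSES = {"data_completeness", "data_provenance"}


def traceability_hours(recall_triggered: datetime, trace_completed: datetime) -> float:
    """Hours from recall trigger to a complete upstream and downstream trace."""
    return (trace_completed - recall_triggered).total_seconds() / 3600


def provenance_finding_count(findings: list[dict]) -> int:
    """Count audit findings attributed to data completeness or data provenance."""
    return sum(1 for f in findings if f.get("root_cause") in DATA_ROOT_CAUSES)


def submission_lead_time(requested: date, filed: date) -> int:
    """Working days after the request date, up to and including the filing date."""
    days = 0
    current = requested
    while current < filed:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            days += 1
    return days
```

Tracking these per audit cycle, rather than only counting fines or closed findings, is what makes them leading indicators.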

Where AI actually fits — and where it doesn't

AI in data governance does two things well. It accelerates classification at scale — scanning millions of records and proposing labels that a human steward then validates. And it surfaces anomalies — a field value that breaks pattern, a record that arrived without expected lineage, a dataset that stopped refreshing. Microsoft Purview's AI-assisted scanning does both.
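The three anomaly types named above can be illustrated with plain rules. This is a deliberately simple, rule-based stand-in and not how Purview's scanning works internally: the batch identifier format, the 24-hour staleness threshold, and the record field names are all assumptions made for the example.

```python
import re
from datetime import datetime, timedelta

BATCH_ID_PATTERN = re.compile(r"B\d{4}")  # assumed format, e.g. "B4471"
MAX_REFRESH_AGE = timedelta(hours=24)     # assumed staleness threshold


def find_anomalies(record: dict, now: datetime) -> list[str]:
    """Return the issues a steward should review for one dataset record."""
    issues = []
    # A field value that breaks the expected pattern.
    if not BATCH_ID_PATTERN.fullmatch(str(record.get("batch_id", ""))):
        issues.append("batch_id breaks expected pattern")
    # A record that arrived without lineage.
    if not record.get("lineage_source"):
        issues.append("missing lineage")
    # A dataset that stopped refreshing.
    last_refreshed = record.get("last_refreshed")
    if last_refreshed is None or now - last_refreshed > MAX_REFRESH_AGE:
        issues.append("dataset stale")
    return issues
```

The output is a list of proposals, not verdicts. A human steward still decides which flags are real exceptions.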

What AI does not do is write your data policy. It enforces a policy you already wrote. If you have not defined what a "batch record" is, what fields it must contain, who owns it, and what constitutes a complete trace — no amount of machine learning will fill that gap. The policy has to exist before the platform can enforce it.
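One way to make "the policy has to exist first" concrete is to write the definition down as data that a platform can check against. The sketch below assumes a hypothetical `RecordPolicy` structure, and the owner name and required fields are invented for illustration. A real policy would come from your own stewardship process.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecordPolicy:
    """A written-down definition of what a record type must contain."""
    name: str
    owner: str
    required_fields: tuple[str, ...]


# Hypothetical policy: the field list and owner are illustrative assumptions.
BATCH_RECORD_POLICY = RecordPolicy(
    name="batch_record",
    owner="quality-data-steward",
    required_fields=(
        "batch_id",
        "material_code",
        "supplier_certificate",
        "inspection_result",
        "despatch_ref",
    ),
)


def missing_fields(record: dict, policy: RecordPolicy) -> list[str]:
    """List the required fields that are absent or empty in the record."""
    return [f for f in policy.required_fields if record.get(f) in (None, "")]


def is_complete_trace(record: dict, policy: RecordPolicy) -> bool:
    """A trace is complete only when every required field is populated."""
    return not missing_fields(record, policy)
```

Nothing here is learned by a model. The enforcement is only as good as the definition a human wrote.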

This is the honest-limits caveat that vendor materials frequently omit. Purview is powerful. It is not a substitute for a data stewardship structure and a documented governance policy. The two have to exist together.

The question the EU AI Act actually asks

As AI moves into operational decisions, the compliance question shifts from "is the model fair" to "where did its data come from and can you prove it", and that is a data-foundation question, not a model one. The EU AI Act sets explicit data and data governance requirements for high-risk AI systems (Article 10), and it expects the training, validation and testing data behind such a system to be documented and traceable. An industrial business that cannot show the provenance of the data feeding a quality-prediction or risk model has a governance gap no amount of model documentation closes. The Microsoft Purview lineage graph you built for batch traceability is the same artefact that answers "where did this AI's input come from": a queryable chain of custody, not a reconstruction assembled the night before.

This reframes most AI-governance programmes as starting at the wrong end. Teams reach for model cards and bias testing while the data feeding the model has no owner, no classification, and no lineage — which is the exposure an auditor actually probes. Govern the data foundation first: classify it at ingestion, trace it source-to-report, enforce policy in the semantic model, and the model-governance work becomes a documentation exercise on top of a provenance trail that already exists.
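Enforcing policy from classification tags can be sketched in a few lines. The example below is an illustration of the idea only: the column names, tags, and roles are hypothetical, and in a Power BI semantic model the comparable mechanisms would be row-level and object-level security rather than Python code.

```python
# Hypothetical classification tags, assumed to be applied at ingestion.
COLUMN_CLASSIFICATION = {
    "batch_id": "internal",
    "inspection_result": "internal",
    "operator_name": "personal",
    "supplier_price": "confidential",
}

# Hypothetical mapping of roles to the classifications they may see.
ROLE_ALLOWED = {
    "auditor": {"internal", "personal"},
    "plant_analyst": {"internal"},
}


def visible_columns(role, classification=COLUMN_CLASSIFICATION, allowed=ROLE_ALLOWED):
    """Return the columns a role may see; unknown roles see nothing."""
    permitted = allowed.get(role, set())
    return [col for col, tag in classification.items() if tag in permitted]


def apply_policy(row: dict, role: str) -> dict:
    """Project a row down to the columns the role is allowed to see."""
    return {col: row[col] for col in visible_columns(role) if col in row}
```

Because access is derived from the tags, a column that is reclassified upstream changes who can see it without anyone editing a report. Denying unknown roles by default is the design choice that makes compliance the default behaviour.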

It also makes the foundation reusable across regimes. The same lineage and classification that satisfy the EU AI Act's data governance requirements also answer HACCP batch trace, GxP audit queries, CBAM data points, and DPDP provenance, because they are all asking the same underlying question about who owns the data and where it came from. You build the governed foundation once and it serves the auditor, the regulator, and the AI assurance review from one source.

The compliance question for operational AI is not "is the model fair" — it is "where did the data come from and can you prove it." That is a lineage problem, which is why AI governance starts at the data foundation, not the model.

What this looks like in practice

An FMCG packaging site running SAP ByDesign for procurement and a standalone MES for production had a batch traceability time of 36–48 hours. Every recall exercise turned into a cross-department war over which system's batch record was authoritative.

The practitioner approach: wire Purview to scan both SAP ByDesign and the MES, classify batch identifiers and quality attributes, and define the lineage path end-to-end. Land the canonical batch record in OneLake via Azure Data Factory. Build the traceability report in Power BI against Direct Lake. Assign a named data steward in each domain with a Power Apps interface for exception resolution.

Within the first deployment cycle, batch traceability time fell from 36–48 hours to under 6 hours. Audit-finding count in the data-provenance category dropped materially. The stewards now spend time resolving exceptions rather than hunting records.

Where this approach doesn't fit

If your compliance requirement is primarily contractual — standard ISO certifications, annual supplier questionnaires — a full Purview deployment is probably heavier than you need. Start with a simpler master data process and a Power BI compliance scorecard.

If your data estate is entirely within a single ERP and you have fewer than five data domains in scope, the governance overhead of Purview scanning may not be justified in the early stages. Sequence the investment to match the complexity.

Six weeks to first value

A Discover → Prototype engagement starts with mapping one compliance-critical data domain — typically batch master or product master — through Purview, into OneLake, and out to a Power BI audit report. In six weeks, you have a working lineage trace and a compliance scorecard with live data. That is the proof of concept that earns the investment for the broader rollout.

What this means for the compliance-accountable leader

The decision is where to spend the next governance pound — and the answer is upstream, on the data, not downstream on the report or the model. The report is the last 10%; the 90% that determines audit exposure is whether the data has an owner, a classification, a lineage trail, and a master every system trusts. Measured honestly, the leading indicators are batch traceability time, data-provenance audit findings, and regulatory submission lead-time — and all three move when the foundation is governed, not when another dashboard is built.

It starts on one compliance-critical domain, not the whole estate. A six-week build maps batch or product master through Microsoft Purview into OneLake and out to a Power BI audit report. The result is a working lineage trace and a live compliance scorecard, which is the proof that earns the broader rollout. Traceability time falling from 36–48 hours to under 6 in the first cycle is the kind of result that makes the next phase an easy approval.

And govern the data before you reach for the AI. The same foundation that cuts traceability time makes operational AI defensible and compliant, because the provenance the regulator and the AI assurance review both demand already exists. Unify the data with lineage attached, predict the classification gaps before the auditor finds them, act through enforced policy rather than remediation sprints — in that order. Compliance stops being a forensic exercise every audit and becomes what the architecture enforces by default.

Compliance officers do not care about your model architecture. They care about lineage, audit trail and reproducibility. Build those into the data foundation and AI governance becomes a documentation exercise — not a panicked retrofit when the regulator calls.


Data Governance · AI Governance · Microsoft Purview · Compliance · Microsoft Fabric · Industrial
