AI Data Governance is a Foundation Problem

Most AI governance programmes start at the wrong end — they govern the model when they should govern the data. The compliance question is not 'is the model fair'. It is 'where did the training data come from and can you prove it'.

Amit Kumar Singh

Technology Consulting Partner · MyData Insights

14+ years in industrial data · Former Accenture & EY · India, GCC, SEA

21 May 2026 · 9 min read

The bottom line
AI governance is a data foundation problem, not a model problem. The foundation has three parts: Microsoft Purview for catalogue and lineage, a Fabric Lakehouse with auditable bronze-silver-gold layers, and the semantic model as the policy enforcement layer. Compliance is what the architecture enforces by default.

In This Article

1. The compliance problem is upstream of the report
2. What a governed data foundation actually looks like
3. The three metrics this protects
4. Where AI actually fits, and where it doesn't
5. The question the EU AI Act actually asks
6. What this looks like in practice
7. Where this approach doesn't fit
8. Six weeks to first value
9. What this means for the compliance-accountable leader

Introduction

The compliance audit finds a problem. Your team scrambles to pull batch records from three systems, reconcile two conflicting product codes, and produce a lineage report that no one can agree on. The fine doesn't come from bad intent — it comes from no one owning the data in the first place.

Most compliance programmes in industrial businesses are a BI costume worn over a governance problem. The dashboard looks clean. The underlying data has no owner, no lineage, no master. That is the real audit risk.

The compliance problem is upstream of the report

A food manufacturer running HACCP controls needs to trace a batch from raw material intake to finished-goods despatch — ideally in under two hours. A pharma packaging site needs to satisfy GxP audit queries without manually hunting across SAP S/4HANA, a lab information system, and three spreadsheets. An EPC contractor needs project safety records that are traceable, timestamped, and unalterable for the life of the asset — which could be twenty-five years.

What these scenarios share is not a reporting gap. The report is the last 10 percent. The other 90 percent is whether the underlying data has a defined owner, a classification, a lineage trail, and a master record that every downstream system trusts.

When that foundation is missing, compliance becomes a forensic exercise every time. Audit-finding counts rise. Regulatory submission lead-times stretch. And the finance team starts asking uncomfortable questions about exposure.

What a governed data foundation actually looks like

The governance plane for a mid-market industrial is not complicated — but it does require three layers working together.

**Microsoft Purview** sits at the centre. It scans the data estate — SAP S/4HANA, SAP ByDesign, MES systems, document stores — and classifies what it finds. Sensitive fields, personally identifiable information, batch identifiers, material master attributes. Purview builds the lineage map: this field came from this source, was transformed here, landed there. When an auditor asks "show me the chain of custody for batch 4471," that lineage map is the answer — not a pivot table assembled the night before.
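To make the chain-of-custody idea concrete, here is a minimal sketch of walking a lineage map upstream from a report to its source systems. It is not the Purview API: the `UPSTREAM` dictionary, the asset names and both function names are hypothetical, chosen only to illustrate the shape of the question an auditor asks.

```python
from collections import deque

# Hypothetical lineage edges: each asset maps to the assets it was derived from.
# Asset names are illustrative placeholders, not real Purview identifiers.
UPSTREAM = {
    "powerbi.traceability_report": ["onelake.gold.batch_trace"],
    "onelake.gold.batch_trace": ["onelake.silver.batch", "onelake.silver.quality_result"],
    "onelake.silver.batch": ["sap_s4hana.batch_master"],
    "onelake.silver.quality_result": ["mes.inspection_log"],
}


def chain_of_custody(asset: str, upstream: dict[str, list[str]]) -> list[str]:
    """Return every asset that `asset` depends on, nearest first (breadth-first)."""
    seen: set[str] = set()
    order: list[str] = []
    queue = deque(upstream.get(asset, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # guard against cycles and shared ancestors
        seen.add(current)
        order.append(current)
        queue.extend(upstream.get(current, []))
    return order


def source_systems(asset: str, upstream: dict[str, list[str]]) -> list[str]:
    """Assets with no recorded upstream are treated as original sources."""
    return [a for a in chain_of_custody(asset, upstream) if a not in upstream]
```

Calling `source_systems("powerbi.traceability_report", UPSTREAM)` returns the two source tables in this toy map. The point of the sketch is that the answer comes from a traversal of recorded edges, not from anyone's memory of how the report was built.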

**Microsoft Fabric / OneLake** is the canonical store. One copy of the truth. Batch records, quality inspection results, supplier certificates and project safety logs are all ingested via Azure Data Factory, landed in OneLake as Delta Parquet, and governed by the same classification tags Purview applied upstream. Because every system lands its data in one place, you are no longer chasing records across seven systems.

**Power BI** surfaces the audit-ready view. Direct Lake queries OneLake directly. No import, no stale cache. The compliance scorecard — batch traceability completion rate, open audit findings by category, days to regulatory submission — reflects what is actually in the store at the moment the auditor opens the report.

This is the Unify · Predict · Act sequence applied to governance: unify the data into one store with lineage attached, predict where the classification gaps are before the auditor finds them, act through automated policy enforcement rather than manual remediation sprints.

The three metrics this protects

Compliance programmes tend to measure themselves in lagging indicators — fines levied, audit findings closed, submissions filed. The leading indicators that tell you whether the governance foundation is holding are more useful and more honest.

**Batch traceability time** — how long it takes from a recall trigger to a complete upstream and downstream trace. Programmes with no data lineage routinely take 24–72 hours. With Purview lineage and OneLake as the canonical store, that number can fall to 2–4 hours in the initial deployment phase and tighten further as the model matures.

**Audit-finding count** — specifically, findings attributable to data completeness or data provenance rather than process failures. These are the findings that repeat. They repeat because the root cause — unowned data — was not fixed after the last audit. A governed data foundation stops the repeat.

**Regulatory submission lead-time** — the number of working days between a regulator request and a filed, evidenced response. This is a direct function of how quickly the team can assemble traceable, consistent records. When lineage is automated and the canonical store is current, that lead-time compresses.
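The three indicators above are simple enough to compute directly from event timestamps and an audit log. The sketch below shows one way to do it, under stated assumptions: the function names, the `root_cause` field and its two category values are hypothetical, and working days are counted Monday to Friday with public holidays ignored.

```python
from datetime import date, datetime, timedelta


def traceability_hours(recall_triggered: datetime, trace_completed: datetime) -> float:
    """Hours from recall trigger to a complete upstream and downstream trace."""
    return (trace_completed - recall_triggered).total_seconds() / 3600


def provenance_finding_count(findings: list[dict]) -> int:
    """Count findings whose root cause is data completeness or data provenance."""
    # Hypothetical category values; use whatever your audit log records.
    data_causes = {"data_completeness", "data_provenance"}
    return sum(1 for f in findings if f["root_cause"] in data_causes)


def submission_lead_time(requested: date, filed: date) -> int:
    """Working days after the request date, up to and including the filing date.

    Counts Monday to Friday only; public holidays are ignored in this sketch.
    """
    days = 0
    current = requested
    while current < filed:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            days += 1
    return days
```

Separating process findings from data findings in the second function is the design choice that matters: it isolates the findings that repeat because the data has no owner.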

Where AI actually fits — and where it doesn't

AI in data governance does two things well. It accelerates classification at scale — scanning millions of records and proposing labels that a human steward then validates. And it surfaces anomalies — a field value that breaks pattern, a record that arrived without expected lineage, a dataset that stopped refreshing. Microsoft Purview's AI-assisted scanning does both.
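The three anomaly types named above can be illustrated with plain rules. This is a rule-based stand-in, not machine learning and not Purview's scanning: the record shape, the `BATCH_ID` format and the function names are all assumptions made for the example.

```python
import re
from datetime import datetime, timedelta

# Hypothetical batch identifier format: the letter B followed by four digits.
BATCH_ID = re.compile(r"B\d{4}")


def pattern_breaks(records: list[dict]) -> list[str]:
    """IDs of records whose batch_id does not match the expected format."""
    return [
        r["id"] for r in records
        if not BATCH_ID.fullmatch(str(r.get("batch_id") or ""))
    ]


def missing_lineage(records: list[dict]) -> list[str]:
    """IDs of records that arrived with no recorded source."""
    return [r["id"] for r in records if not r.get("source")]


def stale_datasets(last_refresh: dict[str, datetime], now: datetime,
                   max_age: timedelta) -> list[str]:
    """Names of datasets whose last refresh is older than `max_age`, sorted."""
    return sorted(name for name, ts in last_refresh.items() if now - ts > max_age)
```

Each function returns a list of exceptions for a human steward to resolve, which mirrors the propose-then-validate split described above.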

What AI does not do is write your data policy. It enforces a policy you already wrote. If you have not defined what a "batch record" is, what fields it must contain, who owns it, and what constitutes a complete trace — no amount of machine learning will fill that gap. The policy has to exist before the platform can enforce it.
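One way to see why the policy has to come first is to write it down as data. The sketch below assumes a hypothetical `RecordPolicy` structure; the owner name and required fields are placeholders, not a standard definition of a batch record.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecordPolicy:
    """A written policy for one record type: who owns it and what it must contain."""
    name: str
    owner: str
    required_fields: tuple[str, ...]


# Hypothetical policy; the owner and field names are placeholders.
BATCH_RECORD_POLICY = RecordPolicy(
    name="batch_record",
    owner="quality-data-steward",
    required_fields=("batch_id", "material_code", "supplier_cert", "inspection_result"),
)


def policy_gaps(record: dict, policy: RecordPolicy) -> list[str]:
    """Return required fields that are absent or empty in `record`."""
    return [f for f in policy.required_fields if record.get(f) in (None, "")]


def is_complete(record: dict, policy: RecordPolicy) -> bool:
    """A record is complete when it has no policy gaps."""
    return not policy_gaps(record, policy)
```

Nothing in `policy_gaps` can run until someone has decided what goes into `required_fields` and who the `owner` is. That decision is the human part; the check is the part a platform can automate.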

This is the honest-limits caveat that vendor materials frequently omit. Purview is powerful. It is not a substitute for a data stewardship structure and a documented governance policy. The two have to exist together.

The question the EU AI Act actually asks

As AI moves into operational decisions, the compliance question shifts from "is the model fair" to "where did its data come from and can you prove it", and that is a data-foundation question, not a model one. Read as three pillars (data, model, deployment), the EU AI Act gives data its own footing: Article 10 covers data and data governance for high-risk AI systems, and for those systems it expects documented, traceable training and input data. An industrial business that cannot show the provenance of the data feeding a quality-prediction or risk model has a governance gap no amount of model documentation closes. The Microsoft Purview lineage graph you built for batch traceability is the same artefact that answers "where did this AI's input come from": a queryable chain of custody, not a reconstruction assembled the night before.

This reframes most AI-governance programmes as starting at the wrong end. Teams reach for model cards and bias testing while the data feeding the model has no owner, no classification, and no lineage — which is the exposure an auditor actually probes. Govern the data foundation first: classify it at ingestion, trace it source-to-report, enforce policy in the semantic model, and the model-governance work becomes a documentation exercise on top of a provenance trail that already exists.

It also makes the foundation reusable across regimes. The same lineage and classification that satisfy the EU AI Act's data pillar also answer HACCP batch trace, GxP audit queries, CBAM data points, and DPDP provenance — because they are all asking the same underlying question about who owns the data and where it came from. You build the governed foundation once and it serves the auditor, the regulator, and the AI assurance review from one source.

The compliance question for operational AI is not "is the model fair" — it is "where did the data come from and can you prove it." That is a lineage problem, which is why AI governance starts at the data foundation, not the model.

What this looks like in practice

An FMCG packaging site running SAP ByDesign for procurement and a standalone MES for production had a batch traceability time of 36–48 hours. Every recall exercise turned into a cross-department war over which system's batch record was authoritative.

The practitioner approach: wire Purview to scan both SAP ByD and the MES, classify batch identifiers and quality attributes, define the lineage path end-to-end. Land the canonical batch record in OneLake via Azure Data Factory. Build the traceability report in Power BI against Direct Lake. Assign a named data steward in each domain with a Power Apps interface for exception resolution.

Within the first deployment cycle, batch traceability time fell from 36–48 hours to under 6. Audit-finding count in the data-provenance category dropped materially. The stewards now spend time resolving exceptions rather than hunting records.

Where this approach doesn't fit

If your compliance requirement is primarily contractual — standard ISO certifications, annual supplier questionnaires — a full Purview deployment is probably heavier than you need. Start with a simpler master data process and a Power BI compliance scorecard.

If your data estate is entirely within a single ERP and you have fewer than five data domains in scope, the governance overhead of Purview scanning may not be justified in the early stages. Sequence the investment to match the complexity.

Six weeks to first value

A Discover → Prototype engagement starts with mapping one compliance-critical data domain — typically batch master or product master — through Purview, into OneLake, and out to a Power BI audit report. In six weeks, you have a working lineage trace and a compliance scorecard with live data. That is the proof of concept that earns the investment for the broader rollout.

What this means for the compliance-accountable leader

The decision is where to spend the next governance pound — and the answer is upstream, on the data, not downstream on the report or the model. The report is the last 10%; the 90% that determines audit exposure is whether the data has an owner, a classification, a lineage trail, and a master every system trusts. Measured honestly, the leading indicators are batch traceability time, data-provenance audit findings, and regulatory submission lead-time — and all three move when the foundation is governed, not when another dashboard is built.

It starts on one compliance-critical domain, not the whole estate. A six-week build maps batch or product master through Microsoft Purview into OneLake and out to a Power BI audit report. The output is a working lineage trace and a live compliance scorecard, which is the proof that earns the broader rollout. Traceability time falling from 36–48 hours to under 6 in the first cycle is the kind of result that makes the next phase an easy approval.

And govern the data before you reach for the AI. The same foundation that cuts traceability time makes operational AI defensible and compliant, because the provenance the regulator and the AI assurance review both demand already exists. Unify the data with lineage attached, predict the classification gaps before the auditor finds them, act through enforced policy rather than remediation sprints — in that order. Compliance stops being a forensic exercise every audit and becomes what the architecture enforces by default.

Compliance officers do not care about your model architecture. They care about lineage, audit trail and reproducibility. Build those into the data foundation and AI governance becomes a documentation exercise — not a panicked retrofit when the regulator calls.


FAQ

Common questions

What does Microsoft Purview actually do for AI governance?

It catalogues every dataset, every column and every lineage edge, from source ERP table through Fabric Lakehouse to Power BI semantic model. When the compliance question arrives ('where does this AI's training data come from?'), the answer is a queryable graph, not a guess.

How does the semantic model help?

It enforces policy. Sensitivity labels propagate from source to report. RLS rules sit at the model layer. Measure definitions are governed. AI agents reading the semantic model inherit the governance — they cannot see what users cannot see.
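The two mechanisms in this answer, label inheritance and row-level filtering, can be sketched in a few lines. This is an illustration of the idea only, not how Power BI or Purview implement it: the label ordering in `LEVELS`, the `SITE_ACCESS` mapping, the user names and both function names are hypothetical.

```python
# Hypothetical label order, least to most restrictive.
LEVELS = ["Public", "General", "Confidential", "Highly Confidential"]

# Hypothetical row-level rule: which sites each user may see.
SITE_ACCESS = {
    "asha": {"Pune"},
    "omar": {"Pune", "Dubai"},
}


def inherited_label(upstream_labels: list[str]) -> str:
    """A derived asset takes the most restrictive label among its sources."""
    return max(upstream_labels, key=LEVELS.index, default=LEVELS[0])


def rows_for(user: str, rows: list[dict]) -> list[dict]:
    """Filter rows by the user's allowed sites; unknown users see nothing."""
    allowed = SITE_ACCESS.get(user, set())
    return [r for r in rows if r["site"] in allowed]
```

An agent acting on behalf of `asha` would call `rows_for("asha", rows)` like any other client, so it receives the same filtered rows she would. Placing the rule at the model layer is what makes that inheritance automatic.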

Is this enough for EU AI Act compliance?

Not on its own. The data foundation is one of three pillars (data, model, deployment). The Fabric + Purview foundation makes the data pillar audit-ready. Model and deployment governance need separate work, typically through Azure ML and Azure AI Studio governance.
