Building AI on Ungoverned Data

Published:

There’s a pattern emerging across organisations deploying AI: impressive capabilities built on ungoverned data. I keep seeing the same architectural decision repeated. AI agents with read access to production databases because they need to answer customer questions, and giving them everything is faster than designing appropriate access controls. From a development velocity perspective, this makes sense. From a governance perspective, it’s a compliance violation waiting to happen.

The statistics bear this out. Ninety-five percent of organisations use AI tools, but only 38% have AI governance in place. Of those working on governance, only 28% have formally defined oversight roles. And when breaches occur, 97% of affected organisations lacked proper access controls.

In August 2026, just 8 months from now, the European Union’s AI Act reaches full enforcement. Companies deploying high-risk AI systems without proper data governance will face fines of up to €20 million or 4% of global turnover, whichever is higher. For prohibited AI practices, penalties escalate to €35 million or 7% of turnover. The consequences are already materialising. Italy fined OpenAI €15 million in December 2024 for processing personal data without proper legal basis. McDonald’s exposed 64 million job applicants through an AI chatbot that left test credentials (password “123456”) active for 6 years. Samsung employees leaked intellectual property to ChatGPT in 3 separate incidents within 20 days, with that data now permanently incorporated into OpenAI’s training datasets.

These aren’t theoretical risks or edge cases. They’re documented failures at well-resourced organisations with sophisticated engineering teams. The question facing every organisation deploying AI is not whether inadequate governance creates risk. The question is whether you’ll address that risk before or after an incident.

The Gap Nobody Owns

The gap exists because organisations have treated AI and data as separate strategic initiatives. I see this organisational structure repeatedly:

  • AI teams focused on shipping features
  • Data teams focused on infrastructure and pipelines
  • Security teams trying to apply traditional application security controls
  • Compliance teams working from regulatory frameworks that predate AI-specific risks

Each group is doing their job competently, but nobody owns the intersection. And the intersection is where the risk lives.

The velocity problem

The AI adoption curve has been extraordinarily steep. In 2 years, generative AI went from research curiosity to production deployment in 78% of organisations. This velocity created strong incentives to prioritise speed over foundation.

Market pressure to ship AI-powered features meant teams integrated AI capabilities into existing products without redesigning data access patterns. Products launched without clear definitions of what data AI systems should (and crucially, should not) access. Leadership set aggressive AI roadmaps without establishing accountability for governance, security, and compliance implications.

The infrastructure assumption

The assumption underlying these decisions was that existing data infrastructure was sufficient. Databases were accessible. APIs were available. Data pipelines were running.

But “accessible” is not the same as “appropriately governed for AI access.”

The infrastructure built for traditional applications was not designed with AI-specific risks in mind: prompt injection attacks that can manipulate system behaviour, training data extraction techniques that can reveal sensitive information, algorithmic discrimination that can violate anti-bias regulations, or the compliance requirements emerging specifically for AI systems.

The compounding cost

What makes this particularly problematic is that the gap compounds over time. Systems built without governance become legacy systems that cannot be secured retroactively.

Retrofitting least-privilege access requires identifying every database connection, determining what access is actually necessary, creating scoped credentials, and updating all dependent systems. Without comprehensive logging, which many of these systems lack, this analysis is impossible without first instrumenting everything.

By that point, you’re looking at months of engineering work across multiple teams, all whilst trying to maintain service availability and not break existing functionality.

The cost increases as more data accumulates and more systems are built assuming unclassified, ungoverned data. It’s technical debt, but it’s technical debt with regulatory and security implications rather than just performance or maintainability concerns.

The Shadow AI Problem

The organisational gap created space for shadow AI to flourish:

What’s being exposed

The breakdown by data type is particularly concerning:

This isn’t malicious behaviour. It’s employees trying to be more productive in an environment where organisations haven’t provided governed alternatives or clear policies.

The financial impact

One in 5 organisations (20%) experienced a cyberattack because of security issues with shadow AI. Those attacks cost an average of $670,000 more than breaches at firms with little or no shadow AI.

Once data enters ungoverned external services, the exposure cannot be reversed. That intellectual property is now permanently in a dataset the company has no control over. This creates ongoing risk: the data can be surfaced in responses to other users, incorporated into model training, or exposed through future vulnerabilities in the service provider’s infrastructure.

This is systemic governance failure. Organisations have prioritised deployment speed over strategic foundation whilst treating AI and data strategies as independent concerns when they are fundamentally inseparable.

Why Inseparability Isn’t Optional

The inseparability of AI and data strategy is not a philosophical position. It’s a technical and regulatory reality.

Every AI system processes data. This is definitional. Language models require text. Computer vision systems require images. Recommendation engines require user behaviour data. The notion of an AI system that does not consume data is incoherent.

When the AI system is customer-facing, it necessarily processes customer data. A support chatbot accesses customer account information, support ticket history, and product usage patterns. A credit decision engine processes financial information, employment history, and transaction data. A healthcare diagnostic assistant accesses medical records, test results, and treatment histories.

There is no way to build useful customer-facing AI without giving those systems access to customer data.

Every AI decision is a data decision

This creates an immediate strategic implication that many organisations miss: every decision about AI architecture is simultaneously a decision about data architecture.

When you grant an AI agent unrestricted access to your production database because it’s technically expedient, you’ve made a data governance decision. Specifically, the decision that data governance does not constrain AI access.

When you allow AI systems to process data without comprehensive audit trails because logging has performance overhead, you’ve made a compliance decision. Specifically, the decision that you cannot demonstrate appropriate data handling to regulators when they ask.

And they will ask.

The third-party dimension

Many organisations use OpenAI’s GPT models, Anthropic’s Claude, Google’s Gemini, or numerous specialised AI APIs. From a technical perspective, these are API calls. From a regulatory perspective under GDPR, they are data processing relationships.

When you send personal data to a third-party service for processing, that service becomes a data processor and you become the data controller.

Article 28 of GDPR requires a written contract between controller and processor that specifies the subject matter, duration, nature, purpose, and type of personal data involved. If your AI application sends customer queries containing personal information to an external AI API without a proper Data Processing Agreement in place, you are in violation of GDPR regardless of how sophisticated your AI capabilities are.

In December 2024, the European Data Protection Board adopted an opinion emphasising that AI deployers remain responsible for verifying the lawfulness of data use even when sourced from third-party developers. The Italian privacy watchdog’s €15 million fine against OpenAI came down to exactly this issue: processing personal data without a legal basis and lacking transparency about how that data was being used.

The security dimension

According to IBM’s 2025 research, 97% of organisations experiencing AI breaches lacked proper access controls.

The challenge is that “appropriate” access controls cannot be defined without first classifying data. An AI agent with database credentials can access any table in that database. Without formal data classification, agents can inadvertently expose restricted data because they have no way to distinguish between information that’s safe to include in a response and data that’s prohibited from disclosure.

Consider a customer service agent that can query customer records. Without classification and enforcement, it might expose credit card numbers, internal risk scores, or compliance flags that should never appear in customer-facing responses. The agent doesn’t know these fields are sensitive unless that distinction is encoded into the system through classification schemes, access policies, and architectural controls.

The Architecture Question

Consider the architectural pattern I mentioned at the start: giving AI agents read access to production databases. From a development perspective, this is straightforward to implement. The agent can query whatever data it needs to answer questions. You write some SQL, wrap it in an API endpoint, and you’re done.

From a governance perspective, this decision creates multiple compounding problems.

First: least privilege violation

The agent receives access to the entire database when it may only need specific tables or even specific columns within those tables. If the agent is compromised through a prompt injection attack (OWASP ranks prompt injection as the number one AI security risk, with attack success rates reaching 50-88% across models), the attacker gains access to everything the agent can access, which is everything in the database.

Second: performance risk

AI-generated queries can be inefficient in ways that consume significant database resources. Without isolation between AI query workloads and production transactional workloads, a poorly constructed agent query can degrade application performance for all users.

I’ve seen this cause production incidents where a chatbot’s analytics query locked tables that the main application needed for customer transactions.

Third: no governance enforcement layer

If the agent queries tables directly, there’s no intermediary layer where you can enforce data retention policies, exclude certain categories of sensitive information, or apply row-level security based on the agent’s role and the end user’s permissions.

You could implement these controls in the application layer, but now you’re maintaining governance logic in multiple places, which inevitably leads to inconsistency and gaps.

Fourth: inadequate audit trails

Database query logs show SQL statements, but they may not capture which AI agent made the query, on behalf of which user, for what purpose, with what result. The audit trail necessary for compliance (a record that traces AI behaviour back to specific user requests and demonstrates appropriate data handling) requires more context than standard database logging provides.

The architectural choice between direct database access and mediated access through governed views, read replicas, and proper access controls determines whether governance requirements can be satisfied. This is not something you can retrofit easily. The difficulty I described earlier about retrofitting access controls comes directly from this architectural decision made early in development when “just get it working” took priority over “get it working in a way we can govern.”

What Actually Goes Wrong

Security exploits

The security research on prompt injection demonstrates why these architectural and governance gaps matter in practice. In 2025, researchers disclosed EchoLeak (CVE-2025-32711), a zero-click prompt injection exploit in Microsoft 365 Copilot that allowed remote, unauthenticated data exfiltration via a single crafted email. In December 2025, over 30 security vulnerabilities were disclosed in AI-powered integrated development environments including Cursor, Windsurf, and GitHub Copilot.

Training data extraction

Training data extraction is not theoretical. A 2023 research study developed “divergence attacks” that caused ChatGPT to emit training data at rates 150 times higher than normal behaviour. Over 5% of ChatGPT’s output was verbatim 50-token copies from its training dataset. The researchers estimated that approximately a gigabyte of training data could be extracted by querying the model systematically.

This has direct implications for any organisation that has inadvertently included sensitive data in training sets or that uses AI systems trained on data they don’t control.

Regulatory enforcement

The compliance landscape is becoming more stringent, not less. EU AI Act Article 10 establishes data governance requirements for high-risk AI systems, which include employment decision systems, essential services like credit and healthcare, law enforcement, and education. Full enforcement begins August 2, 2026, just 8 months from now. High-risk violations carry fines up to €20 million or 4% of global turnover. Prohibited practices carry fines up to €35 million or 7% of turnover.

Recent enforcement actions demonstrate that regulators are willing to impose these penalties:

  • OpenAI: €15 million (Italy, data processing violations)
  • Clearview AI: €30.5 million (Netherlands) + €25.2 million (France) for biometric data violations
  • SIDECU: €160,000 (Spain) for failing to conduct required Data Protection Impact Assessments

Industry-specific regulations are being updated to explicitly cover AI. The SEC’s 2025 examination priorities include a new focus on artificial intelligence. HHS proposed the first major update to the HIPAA Security Rule in 20 years in January 2025, with a key requirement: entities using AI tools must include those tools in risk analysis and risk management compliance activities.

Customer trust erosion

The customer trust dimension compounds these regulatory and security risks:

This trust deficit creates a paradox: organisations are deploying AI systems whilst the people using them actively distrust those systems. That tension cannot persist indefinitely. Either organisations build trustworthy AI through transparent governance, or customers, employees, and regulators will force changes through regulation, litigation, or market pressure.

The market is already pricing governance quality. Customers choose providers with clearer data protection practices. Talent chooses employers with stronger governance. Investors favour companies with lower regulatory risk. Organisations without demonstrable governance programmes increasingly find themselves at a competitive disadvantage.

What Good Actually Looks Like

The good news is that what good looks like is documented and implementable.

Data classification

Data classification is the foundation. Without it, you cannot define what “appropriate access” means, which makes enforcement impossible.

Modern classification uses automated discovery. Platforms like Collibra, Alation, and Atlan scan databases and data repositories, using machine learning and pattern matching to identify personally identifiable information, protected health information, payment card information, and other sensitive data categories. This eliminates the manual tagging burden that causes classification projects to stall.

Effective classification schemes are multidimensional. A single field might be classified as:

  • Sensitivity: Confidential
  • Regulatory: GDPR Article 9 (special category data)
  • Data type: Health
  • Retention: 7 years post-treatment
  • Geographic: EU processing only

These classifications stack. An AI system querying this field must satisfy all applicable constraints: proper consent under GDPR, appropriate security controls for confidential data, retention policy compliance, and geographic processing restrictions. The classification scheme makes these requirements explicit and enforceable.

Governance frameworks

The minimum viable approach for organisations without existing governance:

  • Identify the highest-risk systems
  • Establish clear ownership and accountability
  • Document what data they access and why
  • Implement basic security controls (audit logging, access restrictions)
  • Require human review for critical actions
  • Create a process for reviewing and approving new AI capabilities before deployment

This isn’t comprehensive governance, but it creates the foundation and organisational capability to expand governance more broadly.

For organisations ready for more structure:

Architectural patterns

The architectural patterns that enable governance are well-established. A three-layer data access architecture provides both governance and performance isolation.

Layer 1: Scoped data access interfaces

Create dedicated access interfaces that define exactly what AI agents can see. Whether you’re using relational databases, document stores, graph databases, or object storage, the principle remains: agents never query production data stores directly. Instead, they access purpose-built interfaces that enforce retention limits, exclude sensitive fields, and apply business logic ensuring agents see only appropriate data.

A customer support agent might receive access to customer profiles that include account status and usage information whilst excluding payment details, internal risk assessments, and records older than your retention policy allows. These access scopes are documented, version-controlled, and subject to the same review process as production code changes.

The enforcement happens at the data layer (whether through database views, API gateways with field filtering, or query middleware) so an AI agent cannot access excluded data even if compromised through a prompt injection attack. The data infrastructure itself enforces the restriction, not application logic that could be bypassed.

Layer 2: Read replicas

Agents query replicas, never production databases. This prevents AI-generated queries from impacting production application performance regardless of how inefficient those queries might be. It allows different indexing strategies optimised for analytical queries rather than transactional operations. It provides an additional security boundary where different access controls can be applied.

Layer 3: Data warehouse

Implemented as scale and complexity increase, this layer involves a data warehouse or analytics layer optimised for the types of queries AI agents actually perform. Pre-aggregated data, historical snapshots, and denormalised schemas reduce query complexity and improve performance. This layer physically separates operational data from analytical data, enabling different retention policies, different security controls, and different performance characteristics.

Row-level and column-level security

These provide fine-grained access control within databases. Row-level security automatically filters which rows users or agents can see based on their identity and permissions. Column-level security restricts which columns are visible.

The critical implementation requirement: user identity context must propagate through the AI agent stack. If the agent runs under a generic service account rather than the end user’s identity, row-level security policies cannot differentiate between different users’ permissions.

Zero-trust architecture

Every request to access data must be authenticated and authorised regardless of origin. AI agents receive time-limited credentials scoped to specific resources. A support chatbot agent might receive credentials valid for 300 seconds with read access to the support_agent_customer_view and nothing else. When the task completes, credentials expire.

AI firewalls

Platforms like Lakera Guard and Robust Intelligence analyse prompts before they reach language models, blocking attempts to manipulate model behaviour through injection. This is defence in depth: even if an attacker crafts a sophisticated prompt injection, the firewall provides an independent layer of protection.

The combination of these patterns creates a system where governance requirements can actually be satisfied. This is not theoretical. These are implemented patterns at organisations that built governance into their AI strategy from the start rather than treating it as something to bolt on later.

The Eight-Month Reality

The EU AI Act reaches full enforcement in 8 months. Comprehensive AI governance programmes are multi-year journeys. This creates a fundamental tension: the time required exceeds the time available.

The resolution is phased implementation focused on achieving minimum viable compliance for regulatory deadlines whilst building towards more comprehensive governance over a longer timeline.

Phase 1: Assessment and minimum viable compliance (0-6 months, through June 2026)

For organisations with high-risk AI systems (employment decisions, credit assessments, insurance pricing, healthcare diagnostics), the priority is demonstrating good-faith effort to comply with Article 10’s data governance requirements by August 2026:

  • Inventory all AI systems including pilots, departmental tools, and shadow AI
  • Classify which systems are high-risk under EU AI Act criteria
  • Establish a governance council with defined roles and decision-making authority
  • Document data flows for high-risk systems
  • Implement comprehensive audit trails
  • Implement basic access controls restricting what data high-risk systems can access
  • Conduct Data Protection Impact Assessments as required by GDPR Article 35

Phase 2: Architectural implementation (6-12 months, through December 2026)

  • Deploy sandboxed views and read replicas for AI data access
  • Implement row-level and column-level security where appropriate
  • Establish and document AI access policies for all production systems
  • Deploy AI firewalls for external-facing systems
  • Expand governance beyond high-risk systems to all AI deployments

Phase 3: Maturity and automation (12-24 months, through August 2027)

  • Automate data classification and policy enforcement
  • Implement advanced monitoring and anomaly detection
  • Integrate governance into development workflows
  • Establish metrics for governance effectiveness
  • Create continuous improvement processes

Why governance programmes fail

Gartner research indicates that 80% of data and analytics governance programmes will fail by 2027 if not tied to business goals. Treating governance as a compliance checkbox rather than strategic capability creates programmes that exist on paper but not in practice.

The governance work must be connected to business outcomes: faster deployment of new AI capabilities because governance is built in, reduced security incidents, lower compliance costs, and competitive advantage from customer trust.

Cross-functional collaboration

No single function can implement AI-data governance alone:

  • Engineering teams implement technical controls but cannot define governance policies
  • Legal teams interpret regulatory requirements but cannot implement technical controls
  • Compliance teams prepare evidence but need engineering to build the systems that generate it
  • Security teams define access control requirements but need product teams for trade-offs
  • Product teams balance requirements with velocity but need data teams for access architectures

The governance council provides the forum where these disciplines collaborate to make decisions that balance multiple concerns. This is why organisational structure matters as much as technical architecture.

The Business Case

The market signals are clear. The AI governance market is projected to grow from $227.6M (2024) to $1.4B (2030) at 35.7% CAGR, whilst the data governance market expands from $3.91B (2025) to $9.62B (2030). These projections reflect industry recognition that AI governance is transitioning from optional to mandatory, from competitive advantage to table stakes.

The regulatory trajectory reinforces this. The EU AI Act is the first comprehensive AI regulation, not the last. California’s CCPA includes automated decision-making technology regulations taking effect in 2026 and 2027. Industry-specific regulators like the SEC, OCC, and HHS are issuing AI-specific guidance and conducting AI-focused examinations. The direction is towards more regulation, more enforcement, and more specific requirements.

Customer expectations compound the pressure. Trust in AI companies has declined from 50% to 47% in the past year despite increased AI usage. This growing gap between usage and trust creates market opportunity for organisations that can demonstrate transparent, governed AI practices. AI governance is becoming baseline expectation rather than differentiator. In 2025, having formal AI governance might provide competitive advantage. By 2027, lacking it will create competitive disadvantage.

The economics favour governance:

Revenue and efficiency gains:

  • 30% higher ROI from explainable AI implementations versus black-box systems
  • 35% lower compliance costs from proactive governance versus reactive compliance
  • Faster time-to-market for new AI features when governance is built into development workflows
  • Ability to operate in regulated markets that competitors with weak governance cannot enter

Risk mitigation:

  • Avoiding regulatory penalties (€35M or 7% of global turnover for EU AI Act violations)
  • Preventing breach costs (global average: $4.44M, US average: $10.22M, plus $670k shadow AI premium)
  • Protecting customer relationships that take years to build but days to destroy
  • Maintaining talent retention in competitive markets where engineers choose employers with mature practices

The inverse is also true. Organisations deferring governance work accumulate technical debt that becomes exponentially more expensive to remediate. Systems built without governance become legacy systems that block new capabilities, creating a widening gap between what the organisation needs to do and what its infrastructure can support.

The expertise required

This work requires expertise across AI architecture, data governance, regulatory compliance, and security engineering. It requires senior practitioners who understand how these domains interact, what trade-offs are acceptable, and how to implement governance that achieves regulatory compliance without making systems unworkable. The market for this expertise is tight and getting tighter.

I mentioned earlier the pattern I’ve observed in The Expertise Discount: organisations investing billions in AI whilst simultaneously devaluing the senior engineering expertise needed to use it effectively. The “Junior + AI = Senior” assumption that’s driving hiring decisions across the industry runs directly into the governance wall. Junior engineers augmented with AI tools can ship features quickly. They cannot design governance frameworks that will survive regulatory scrutiny. They cannot architect systems that balance security, compliance, performance, and usability. They cannot recognise the subtle ways that architectural decisions made for development velocity create governance problems that take months to fix.

This connects to patterns I’ve documented in other contexts. In The Database Blind Spot, I examined how AI agents default to text-based approaches for problems that require structured data solutions. The governance gap has similar characteristics: AI systems being deployed without the data architecture foundation they require because the people building them don’t recognise what’s missing. The capability to build governance is there, but the instinct to reach for it is absent.

Organisations starting now capture available expertise, build institutional capability, and establish patterns before regulatory deadlines create pressure and market competition for talent intensifies. Those deferring governance work face scarce expertise, compressed timelines, and severe consequences when enforcement arrives.

The work is complex but it’s documented. Architectural patterns are known. Regulatory requirements are specified. Security controls are well-understood. What’s required is organisational commitment to prioritise foundation-building over short-term feature velocity, enabling sustainable AI deployment in the long term. And the expertise to implement that foundation correctly.

What This Means For You

Your organisation will build integrated AI-data governance. This is not optional. The question is timing and cost.

Proactive implementation means starting now with assessment and prioritisation. Identify your highest-risk systems. Establish governance ownership. Document actual data flows. Implement audit trails. Define and enforce access policies through technical controls. Build capability before regulatory deadlines create pressure.

Reactive implementation means waiting until regulators issue fines, customers discover breaches, or security incidents force emergency response. This costs more, takes longer under time pressure, and happens whilst competitors who built foundations capture market advantages.

Implementation guidance exists. The architectural patterns, regulatory requirements, and security controls described throughout this article are documented and proven. What’s required is organisational commitment to prioritise foundation-building over short-term feature velocity, and the expertise to implement that foundation correctly.

The compliance clock is ticking. The work should start now.


About The Author

Tim Huegdon is the founder of Wyrd Technology, a consultancy focused on helping engineering teams achieve operational excellence through strategic AI adoption. With more than 25 years of experience in software engineering and technical leadership, Tim specialises in AI governance, data strategy, and regulatory compliance for AI systems. His approach combines deep technical expertise with practical observation of how engineering practices evolve under AI assistance, helping organisations develop sustainable AI workflows whilst maintaining the quality standards that enable long-term velocity.

Tags:AI, AI Adoption, AI Infrastructure, Data Modelling, Engineering Leadership, Enterprise AI, Operational Excellence, Organisational Design, Regulatory Compliance, Resilience Engineering, Risk Management, Security Architecture, Software Architecture, System Reliability, Technical Strategy