Who Owns Data Quality, Anyway?
Data quality failures in production systems follow a predictable pattern. A source system changes how it calculates a field. Data pipelines ingest the change without validating its semantic meaning. Feature stores serve technically valid but semantically corrupted features to models. Models retrain on the poisoned data and produce confidently wrong predictions. By the time anyone notices, the damage has compounded across weeks of decisions.
The root cause isn’t technical. It’s organizational. No single team owns the entire data quality chain, so failures that cross team boundaries become coordination problems disguised as technical incidents.
Hey there! 👋🏿 I’m Hodman, and I help teams build reliable data infrastructure and ML pipelines.
The ownership problem exists because each team operates with a different view of what “data quality” means and where its responsibility ends.
Why No One Can Own It Alone
Data engineers validate schemas and monitor pipelines. They ensure data moves from point A to point B without breaking. They catch null values, schema mismatches, and failed jobs. What they can’t see: a field called user_status suddenly contains “active_v2” alongside “active.” The pipeline runs successfully because the schema hasn’t changed. The corruption becomes visible only when a model starts making nonsensical predictions.
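The user_status failure above can be caught with a value-set check layered on top of schema validation. This is a minimal sketch of the pattern, not any particular tool's API; the field name and expected values follow the example above and are illustrative.

```python
# Semantic value-set check: the schema says "string", so schema validation
# passes, but the agreed set of business states has silently grown.
EXPECTED_USER_STATUS = {"active", "inactive", "suspended"}

def check_value_set(rows, field, expected):
    """Return values present in the batch that fall outside the agreed set."""
    seen = {row[field] for row in rows}
    return seen - expected

batch = [
    {"user_id": 1, "user_status": "active"},
    {"user_id": 2, "user_status": "active_v2"},  # new value, schema still valid
]

unexpected = check_value_set(batch, "user_status", EXPECTED_USER_STATUS)
if unexpected:
    # Don't just log it: route the anomaly to the team that owns the
    # field's semantics, since data engineering can't judge the meaning.
    print(f"Unexpected user_status values: {sorted(unexpected)}")
```

The check is cheap to run per batch; the hard part, as the rest of this article argues, is deciding who maintains the expected set and who gets paged when it drifts.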
Data scientists validate model performance after training and before deployment. They monitor drift, retrain on schedule, and optimize for better metrics. The problem is timing: upstream data quality issues manifest as model degradation weeks after the corruption began. By then, they're debugging symptoms instead of root causes. A model trained on 99% accurate data doesn't work 99% of the time. It makes confidently wrong predictions 100% of the time until you retrain it.
Source system owners implement business logic at the point of generation. They know what the data means and why each field exists. What they lack is visibility into how downstream teams use that data. A field refactor or status code consolidation is a routine code change to them. To a data science team, it's a production incident that breaks assumptions the model baked into its weights over months of historical data.
This isn’t negligence. This is what happens when functional silos create ownership gaps. Each team is responsible for its domain, and the spaces between domains are where data quality dies.
Models Amplify Bad Data
A financial services company I worked with was building a credit risk model using five years of loan performance data. Three months before training began, a source system migration introduced a time zone bug: loan default dates were recorded in a different time zone than origination dates.
For 99% of loans, this didn’t matter, but for the 1% that defaulted within the first week, the corrupted timestamps made it appear that some defaults occurred before the loans were even originated. The model learned that “time until default” could be negative.
It went to production and started rejecting creditworthy applicants based on phantom risk signals learned from corrupted training data. It took three weeks and a 15% drop in loan originations before someone traced the problem back to the timezone bug.
The validation checks focused on what each team could see. Schema compliance looked fine: timestamps were present and properly formatted. Target distribution looked fine: the overall default rate matched historical patterns. The corruption affected only edge cases, so aggregate statistics stayed within normal bounds. The bug was invisible to both pipeline monitoring and model validation until production traffic started hitting the corrupted decision boundaries the model had learned.
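A cross-field invariant check would have caught this class of bug even when schema and aggregate checks pass. The sketch below is illustrative, with hypothetical field names; it encodes the one rule the corrupted data violated: a loan cannot default before it originates.

```python
from datetime import datetime, timezone

def invalid_default_dates(loans):
    """Flag loans whose default precedes origination — impossible by definition,
    so any hit is a data bug, not a business event."""
    return [
        loan["loan_id"]
        for loan in loans
        if loan.get("default_date") is not None
        and loan["default_date"] < loan["origination_date"]
    ]

loans = [
    {"loan_id": "A",
     "origination_date": datetime(2021, 3, 1, tzinfo=timezone.utc),
     "default_date": datetime(2021, 3, 5, tzinfo=timezone.utc)},
    # Time zone bug: default recorded a few hours "before" origination.
    {"loan_id": "B",
     "origination_date": datetime(2021, 3, 1, 12, tzinfo=timezone.utc),
     "default_date": datetime(2021, 3, 1, 4, tzinfo=timezone.utc)},
]

print(invalid_default_dates(loans))
```

Because the check asserts a business impossibility rather than a statistical property, it fires on the 1% of edge-case rows that aggregate distribution checks smooth over.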
Designing a Shared Responsibility Model
The solution is defining clear ownership boundaries across the data lifecycle.
Source System Owners: Correctness at the point of generation. When they change field definitions, they notify downstream consumers. Data doesn’t stop being their responsibility when it leaves their database.
Data Engineering Teams: Reliable delivery and structural integrity. They validate schemas, enforce data contracts, and monitor pipeline health. They catch technical failures but don’t validate business logic. They surface anomalies and route them to teams who own the context.
Data Science and ML Engineering Teams: Model input quality and training data integrity. They validate that features meet distributional assumptions and that training data is representative of production traffic. They own validation logic in feature stores and monitoring that detects when upstream changes break model assumptions. This includes input validation at inference time, not just training time.
Product and Analytics Teams: Semantic correctness and business logic validation. They define what metrics mean and what constitutes valid business states. When business logic changes, they own the communication to downstream teams.
This works because it aligns responsibility with capability: each team owns the quality guarantees it has the context and tools to enforce.
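The inference-time input validation owned by the ML teams above could look like the following sketch. The feature names and bounds are hypothetical; the point is that the same checks run at serving time, not only during training.

```python
# Hypothetical per-feature bounds agreed with upstream producers.
FEATURE_BOUNDS = {
    "days_since_signup": (0, 3650),   # ten-year cap on account age
    "order_count": (0, 10000),
}

def validate_features(features, bounds):
    """Return {feature: bad_value} for inputs missing or outside agreed bounds.
    Run this at inference time so corrupted inputs are rejected, not scored."""
    violations = {}
    for name, (lo, hi) in bounds.items():
        value = features.get(name)
        if value is None or not (lo <= value <= hi):
            violations[name] = value
    return violations

violations = validate_features(
    {"days_since_signup": -3, "order_count": 12}, FEATURE_BOUNDS
)
print(violations)
```

Rejecting or quarantining a request with violations is a product decision; the ML team's job is making sure the violation is detected at the boundary instead of absorbed into a prediction.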
Making It Work
Three mechanisms enforce the boundaries:
Data contracts define the interface between producers and consumers. Schema, freshness SLAs, semantic meaning of fields, and validation rules that express business logic. For ML systems, feature contracts specify semantic stability: what business logic a feature represents, what distributional properties it should maintain, and when it might legitimately change.
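A feature contract can be as small as a record carrying the semantic fields described above. This is a minimal sketch, assuming a plain-Python representation rather than any specific contract framework; the field names and example values are illustrative.

```python
from dataclasses import dataclass
from typing import Optional, Set, Tuple

@dataclass
class FeatureContract:
    name: str
    dtype: str
    meaning: str                                  # business logic, not just a type
    allowed_values: Optional[Set[str]] = None     # for categorical features
    value_range: Optional[Tuple[float, float]] = None
    freshness_sla_hours: int = 24

    def validate(self, value) -> bool:
        """Check one value against the contract's semantic constraints."""
        if self.allowed_values is not None and value not in self.allowed_values:
            return False
        if self.value_range is not None:
            lo, hi = self.value_range
            if not (lo <= value <= hi):
                return False
        return True

user_status = FeatureContract(
    name="user_status",
    dtype="string",
    meaning="Account lifecycle state set by the accounts service",
    allowed_values={"active", "inactive", "suspended"},
)
print(user_status.validate("active_v2"))
```

The contract's real value is that changing `allowed_values` now requires touching an artifact the consumer reviews, which is exactly the notification step the next paragraph describes.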
Communication channels coordinate changes that cross boundaries. One retailer implemented a simple rule: any change to a field consumed by a production model requires a notification ticket filed with data science at least one sprint before the change ships. Data science reviews the change, assesses its impact, and either approves it or works with data engineering to make models robust to it.
Incident ownership clarity defines who is on call when things break. Data engineering owns pipeline failures, data science owns model failures, and source system owners own data generation failures. Triage determines the category, and responsibility is routed to the appropriate team.
Drawing the Boundaries
Who owns data quality? Everyone who touches the data, within the boundaries of what they can actually control. This isn’t satisfying if you want a single throat to choke when things break, but it’s the only honest answer when data flows through complex systems maintained by specialized teams.
Data quality isn’t a technical problem that can be solved by hiring a data quality team or buying a monitoring tool. It’s an organizational design problem that requires explicit ownership boundaries, structured communication channels, and clear escalation paths.
Figure out who owns what in your organization. Draw the boundaries. Make them explicit. Write them down. The companies that get this right don’t have fewer data quality issues. They catch them fast, route them to the right teams immediately, and fix them before they cascade into expensive production failures.

