Data Contracts in 5 Minutes
Stop upstream changes from breaking your models
Machine learning pipelines fail in ways you don’t notice. A column gets renamed in your upstream database, and suddenly your fraud detection model starts producing effectively random predictions. By the time you realize it, you’ve processed thousands of transactions incorrectly.
Data contracts solve this problem by establishing formal agreements between data producers and consumers, specifically protecting ML pipelines from breaking changes.
What Are Data Contracts?
A data contract is a versioned agreement that defines the structure, quality, and semantics of data being exchanged between systems. Think of it as an API contract, but for data pipelines rather than service endpoints.
For machine learning systems, data contracts act as a protective barrier that catches incompatible changes before they reach your models. Instead of discovering that your input features have changed after your model has already made thousands of bad predictions, the contract fails the pipeline immediately.
🛠️ Need More Ready-to-Use Systems?
While this article focuses on Data Contracts, if you are tackling other major operational challenges, check out my ready-to-use frameworks:
➡️ End pipeline unpredictability with the Pipeline Reliability Framework
➡️ Cut cloud costs using the Cloud Cost Optimization Framework
➡️ Find hidden PII before regulators do with the PII Scanner + Compliance Guide
Or, get access to all three (and every future toolkit I release every week) immediately by upgrading your subscription for just $5/month.
Every data contract needs a technical definition to stand on. That foundation is the exact data structure the model expects, which we define through feature schema contracts.
Feature Schema Contracts
Feature schema contracts define the structure of data that your ML model expects. This includes column names, data types, allowed value ranges, and statistical properties.
A typical feature schema contract specifies that your customer churn model requires a tenure field as an integer representing account age in months, a monthly_charges field as a positive float, and a num_support_calls field as a non-negative integer. If the upstream team renames tenure to customer_lifetime_months, the contract breaks the pipeline before invalid data reaches your model.
Schema contracts also encode semantic meaning. A field called revenue might technically accept any float, but your contract should specify whether it’s in dollars or cents, whether it can be negative (for refunds), and what time period it represents.
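To make this concrete, here is a minimal sketch of a schema contract as plain code, using the field names from the churn example above. The (min, max) bounds and the documented semantics are illustrative assumptions, not values from the article; in practice you would likely express the same thing with a schema library or a contract tool.

```python
# A feature schema contract as plain data plus a checker.
# Bounds and semantic descriptions are illustrative assumptions.
FEATURE_CONTRACT = {
    # field:             (type,  min,  max,  semantics)
    "tenure":            (int,   0,    600,  "account age in months"),
    "monthly_charges":   (float, 0.01, None, "bill amount in dollars"),
    "num_support_calls": (int,   0,    None, "calls in billing period"),
}

def validate_features(record: dict) -> list[str]:
    """Return a list of contract violations (empty means the record passes)."""
    errors = []
    for field, (ftype, lo, hi, meaning) in FEATURE_CONTRACT.items():
        if field not in record:
            errors.append(f"missing field '{field}' ({meaning})")
            continue
        value = record[field]
        if not isinstance(value, ftype):
            errors.append(
                f"'{field}' must be {ftype.__name__}, got {type(value).__name__}"
            )
            continue
        if lo is not None and value < lo:
            errors.append(f"'{field}' below minimum {lo}")
        if hi is not None and value > hi:
            errors.append(f"'{field}' above maximum {hi}")
    return errors
```

An upstream rename of `tenure` to `customer_lifetime_months` surfaces as a "missing field" violation here, so the pipeline can fail before the model ever sees the data.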
Training Data SLAs
Service Level Agreements for training data establish guarantees about data freshness, completeness, and quality metrics. These SLAs protect model retraining pipelines from degraded performance.
A training data SLA might guarantee that customer transaction data will be available within 24 hours of occurrence, with at least 95% completeness for required fields, and that no more than 2% of records will have missing values in critical features. If data quality drops below these thresholds, automated retraining should be paused rather than producing a degraded model.
SLAs should also cover data distribution properties. If your fraud model was trained on data where 2-3% of transactions are fraudulent, the contract can flag situations where this ratio suddenly drops to 0.1%, indicating either a data pipeline issue or a significant business change that requires model evaluation.
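As a sketch, the SLA thresholds above can be turned into a single gate that decides whether automated retraining may proceed. The freshness, completeness, and missing-rate thresholds mirror the numbers in this section; the metric names and the fraud-rate band are assumptions for illustration.

```python
def sla_allows_retraining(metrics: dict) -> bool:
    """Gate automated retraining on the SLA: pause instead of training
    a degraded model when any guarantee is violated.

    Metric names are hypothetical; thresholds follow the SLA example.
    """
    return all([
        metrics["hours_since_latest_record"] <= 24,      # freshness: within 24h
        metrics["required_field_completeness"] >= 0.95,  # >= 95% complete
        metrics["critical_missing_rate"] <= 0.02,        # <= 2% missing criticals
        0.01 <= metrics["fraud_rate"] <= 0.05,           # 2-3% baseline band (assumed)
    ])
```

Returning a single boolean keeps the policy explicit: anything `False` here should pause the retraining job and page a human, rather than silently producing a degraded model.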
Model Input Validation
Input validation enforces data contracts at inference time, ensuring that production predictions are computed only on valid data. This is your last line of defense against bad predictions.
Validation happens in layers. Type checking ensures that numeric fields contain numbers and categorical fields contain expected categories. Range checking verifies that values fall within trained distributions. A customer tenure of 250 years should be rejected. Relationship validation checks that fields maintain expected correlations, such as ensuring that the total purchase amount matches the sum of the individual line items.
When validation fails, the system should have a clear escalation path. Critical failures block predictions entirely, returning an error instead of a potentially harmful prediction. Minor issues might allow prediction, but flag the record for human review. This prevents the common problem of models silently making predictions on garbage data.
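The layered checks and the escalation path can be sketched together, assuming three outcomes: block, flag, or ok. The field names and the 100-year tenure cap are illustrative assumptions based on the examples in this section.

```python
BLOCK, FLAG, OK = "block", "flag", "ok"

def validate_input(record: dict) -> tuple:
    """Return (decision, issues) with an explicit escalation path:
    critical failures block the prediction, minor ones flag for review."""
    # Layer 1: type checking (critical -> block the prediction).
    tenure = record.get("tenure")
    if not isinstance(tenure, int):
        return BLOCK, ["tenure missing or not an integer"]
    # Layer 2: range checking against trained distributions (critical -> block).
    if not 0 <= tenure <= 12 * 100:  # 100 years of tenure is implausible
        return BLOCK, ["tenure outside plausible range"]
    # Layer 3: relationship checks (minor -> flag for human review).
    issues = []
    line_items = record.get("line_items", [])
    if line_items and abs(sum(line_items) - record.get("total_amount", 0.0)) > 0.01:
        issues.append("total_amount does not match sum of line_items")
    return (FLAG if issues else OK), issues
```

The key design choice is that the function never returns a prediction-shaped answer on bad input: the caller must branch on the decision, which makes silent predictions on garbage data structurally impossible.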
Implementation Patterns
Modern data contract tools integrate with existing data platforms. Great Expectations, Soda, and dbt tests provide frameworks for defining contracts as code. These tools run validation checks at pipeline checkpoints: after data extraction, after transformation, and before model consumption.
Contracts should be versioned alongside model versions. When you retrain a model with different features, you deploy a new contract version alongside it. This allows safe parallel operation of old and new model versions during deployment.
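One way to sketch this pinning, with hypothetical model and contract names: each deployed model version points at exactly one contract version, so an old and a new model can serve traffic side by side, each validated against its own schema.

```python
# Hypothetical registry: contract versions and the model versions pinned to them.
CONTRACTS = {
    "v1": {"tenure", "monthly_charges"},
    "v2": {"tenure", "monthly_charges", "num_support_calls"},  # retrained model
}

MODEL_CONTRACT = {"churn-model-1.4": "v1", "churn-model-2.0": "v2"}

def check_features(model_name: str, features: dict) -> None:
    """Fail fast if a payload is missing fields from the model's pinned contract."""
    required = CONTRACTS[MODEL_CONTRACT[model_name]]
    missing = required - set(features)
    if missing:
        raise ValueError(f"{model_name}: missing contracted fields {sorted(missing)}")
```

During rollout, the same request can be checked against both contracts: payloads that satisfy only `v1` keep flowing to the old model while producers catch up to `v2`.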
Benefits for ML Operations
Data contracts transform ML operations from fixing problems after they happen to preventing them before they start. Breaking changes are caught in development or staging environments, not in production. When issues do occur, contracts provide clear error messages pointing to exactly what changed and why it’s incompatible.
Teams can move faster because data consumers trust that breaking changes won’t silently corrupt their pipelines. Data producers get clear specifications of what downstream systems require, reducing misunderstandings. This separation of concerns allows ML engineers to focus on model improvement rather than investigating data quality issues.
Getting Started
Start with your most critical ML pipeline and define its contract thoroughly. Document the exact schema your model expects, including data types and reasonable value ranges. Implement validation checks that fail the pipeline if these expectations aren’t met. Once that’s solid, expand to your next pipeline. The goal is to build a culture where changes to data are treated with the same care as changes to code. Data contracts make this possible by making implicit assumptions explicit and enforceable.
Want to go deeper? This article focused on the ML use case, but data contracts solve reliability problems across your entire data platform. I wrote a detailed framework on implementing production-ready data contracts, including a negotiation playbook for getting producer buy-in and a YAML template you can use immediately: A Proactive Framework for Reliable Data
🚀 Stop Building from Scratch
The complexity of Data Contracts, from defining schemas to establishing governance, underscores one truth: building a stable data platform requires ready-made systems, not more custom code.
If you enjoyed this quick dive into structured solutions, why spend dozens of hours recreating essential tools like SLA templates, PII scanners, and incident matrices?
Upgrade your subscription today for just $5/month and get instant access to the complete toolkit, currently valued at $127:
✅ The Pipeline Reliability Framework (Stops the late-night pager calls)
✅ The Cloud Cost Optimization Framework (Cuts your quarterly bill)
✅ The PII Scanner + Compliance Guide (Protects you from regulatory risk)
✅ Plus: Every single framework and toolkit I release every week
Don’t just read about systems. Own them.