Data Catalog Implementation
Why 60,000 Tables Break Your Metadata Strategy
Scale reveals flaws in traditional catalog architecture. Here’s a three-part solution to build metadata systems that actually work.
Most data catalogs collapse at enterprise scale because they rely on a flawed assumption: that automated discovery alone creates useful metadata. When organizations reach 60,000+ tables, this approach creates performance bottlenecks, stale documentation, and search systems that practitioners abandon in favor of Slack messages and tribal knowledge.
Welcome back to The Data Letter! 👋🏿👋🏿👋🏿
✨ For readers: a 20% discount on paid subscriptions is available until end of day January 5th. Upgrade here to access the full TDL premium archive, including articles like:
The Machine Learning Reality Gap
From Data Lineage to Data Observability
How to Detect Model Drift When You Can’t Measure Performance
Why Traditional Catalogs Fail at Scale
Most catalog problems stem from two architectural mistakes that compound at scale.
First, the crawl-only collection model creates a vicious cycle. Automated scrapers pull everything they can find (table schemas, column names, sample data) and dump it into searchable indexes. This works for 500 tables. At 5,000 tables, the search degrades. At 50,000 tables, the system becomes unusable because volume overwhelms relevance signals.
Second, these tools exist in a process vacuum. They operate outside your actual development workflows. Data engineers build pipelines in dbt or Airflow, deploy to production, and then someone might manually update your catalog documentation. The catalog becomes a shadow system that’s always out of sync with reality.
Engineers stop trusting the catalog and ask colleagues on Slack instead. Your expensive catalog investment becomes shelfware.
Treat Metadata as a Product
The solution requires reconceptualizing metadata: metadata is a product, not infrastructure.
Traditional catalogs treat metadata as an IT inventory problem. They collect everything, index it, and assume users will find what they need. This approach breaks down because it ignores how data practitioners actually work.
Product thinking starts with user needs. What does your analytics engineer need each week when building reports? They need to know: which customer table is authoritative, who owns it, what SLAs apply, and whether it’s safe to join with orders data. Instead of descriptions for 60,000 tables, they need reliable answers for the 50 tables that matter to their work.
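To make this concrete, here is a minimal sketch of what one of those reliable answers could look like as a record. The field names and the example table are illustrative only, not taken from any particular catalog tool.

```python
from dataclasses import dataclass, field

# Illustrative sketch: the handful of answers an analytics engineer actually
# needs about a table, modeled as a single metadata record.
@dataclass
class TableMetadata:
    name: str                   # fully qualified table name
    is_authoritative: bool      # is this the canonical source for the entity?
    owner: str                  # team or person accountable for the table
    sla: str                    # freshness/availability guarantee
    safe_joins: list[str] = field(default_factory=list)  # tables vetted for joining

# Hypothetical example record
customers = TableMetadata(
    name="analytics.dim_customer",
    is_authoritative=True,
    owner="data-platform@company.example",
    sla="refreshed daily by 06:00 UTC",
    safe_joins=["analytics.fct_orders"],
)
```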
This reframing drives different architectural decisions. Products have owners who make prioritization decisions. They have quality standards. They solve specific problems for defined users. Infrastructure just exists and accumulates.
Three-Part Solution Framework
Implementing metadata as a product requires three interconnected changes to how you architect, enforce, and prioritize metadata collection:
1. Decouple Your Architecture
Stop buying monolithic platforms. Modern metadata systems require three independent layers:
Ingestion layer: Collects metadata from multiple sources (Git repositories, warehouse query logs, orchestration tools, BI platforms) and normalizes disparate formats into consistent structures.
Storage layer: A queryable metadata store (graph database, relational database, or specialized metadata lake). This is where lineage, ownership, and documentation live independently from UI concerns.
Presentation layer: User-facing search, lineage visualization, and documentation interfaces. This layer queries storage but doesn’t control collection.
Why decouple? Because each layer scales differently. Your UI might handle 100 concurrent users while ingestion processes millions of metadata events daily. Monolithic tools couple these concerns, creating bottlenecks that compound at scale.
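As a rough illustration of the decoupling, here is a Python sketch of the three layers as independent interfaces. The names and method signatures are hypothetical; the point is that ingestion only writes to storage and presentation only reads from it.

```python
from typing import Iterable, Protocol

# Hypothetical interfaces for the three layers. Each can be swapped or
# scaled independently as long as it honors the storage contract.
class IngestionSource(Protocol):
    def extract(self) -> Iterable[dict]: ...      # e.g. dbt manifests, query logs, BI usage

class MetadataStore(Protocol):
    def upsert(self, records: Iterable[dict]) -> None: ...
    def search(self, query: str) -> list[dict]: ...

class CatalogUI(Protocol):
    def render_search(self, store: MetadataStore, query: str) -> None: ...

def sync(sources: list[IngestionSource], store: MetadataStore) -> None:
    # Ingestion writes to storage; the presentation layer only ever reads.
    for source in sources:
        store.upsert(source.extract())
```

Because every layer depends only on the storage contract, you can scale ingestion workers and UI servers separately instead of sizing one monolith for both.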
2. Enforce with Contracts in Development Workflows
Manual documentation fails. Instead, make metadata a deployment requirement through metadata contracts: schema definitions that specify required metadata before code ships.
Integrate validation into CI/CD: When engineers commit a new dbt model or Airflow DAG, automated checks verify that ownership, business purpose, and data tier are defined. Missing metadata blocks deployment.
This approach changes team behavior. Engineers document assets when they’re built, not months later when context is lost. Metadata stays current because it’s part of the development process, not a separate compliance exercise.
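Here is a hedged sketch of what such a CI gate might look like for a dbt-style schema.yml. The required keys (owner, business purpose, tier) follow this article’s convention rather than any dbt standard, and the script assumes PyYAML is available in the CI image.

```python
import sys
import yaml  # assumes PyYAML is installed in the CI environment

# Keys this article's convention requires on every model's `meta` block.
REQUIRED_META = {"owner", "business_purpose", "tier"}

def validate(schema_path: str) -> list[str]:
    """Return a list of contract violations found in one schema.yml file."""
    errors = []
    with open(schema_path) as f:
        schema = yaml.safe_load(f) or {}
    for model in schema.get("models", []):
        missing = REQUIRED_META - set(model.get("meta", {}))
        if missing:
            errors.append(f"{model.get('name', '<unnamed>')}: missing {sorted(missing)}")
    return errors

if __name__ == "__main__":
    problems = [e for path in sys.argv[1:] for e in validate(path)]
    if problems:
        print("Metadata contract violations:\n" + "\n".join(problems))
        sys.exit(1)  # non-zero exit fails the CI job and blocks deployment
```

Run it as an early pipeline step (for example, `python validate_metadata.py models/staging/schema.yml`) so missing metadata fails the build before anything ships.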
3. Curate Ruthlessly Through Tiering
Not all data assets deserve equal attention. Implement a tiering system that assigns every table an explicit priority:
Tier 1: Core business metrics. These 20-50 tables get comprehensive documentation, guaranteed SLAs, and priority support. They power executive dashboards and financial reporting.
Tier 2: Department-level analytics. 200-500 tables with defined owners, basic documentation, and standard support.
Tier 3: Everything else. Experimental models, archived tables, and one-off analyses. Minimal metadata, no SLA guarantees.
Curation means saying no. It means archiving unused tables. It means focusing documentation effort where it drives business value rather than treating all metadata equally.
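One lightweight way to encode the tiers is as configuration that the contract check above can read. The required fields per tier below are examples, not a standard.

```python
from enum import IntEnum

# Illustrative tier definitions matching the article's scheme.
class Tier(IntEnum):
    CORE = 1         # core business metrics: full docs, SLAs, priority support
    DEPARTMENT = 2   # department analytics: owner plus basic documentation
    LONG_TAIL = 3    # everything else: minimal metadata, no SLA

# Example required-field sets per tier (assumptions, not a standard).
REQUIRED_FIELDS = {
    Tier.CORE: {"owner", "description", "sla", "lineage", "quality_checks"},
    Tier.DEPARTMENT: {"owner", "description"},
    Tier.LONG_TAIL: set(),
}

def metadata_gaps(table: dict) -> set[str]:
    """Return which required fields are missing or empty for this table's tier."""
    tier = Tier(table.get("tier", Tier.LONG_TAIL))
    return REQUIRED_FIELDS[tier] - {k for k, v in table.items() if v}
```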
Actionable Next Steps
For greenfield implementations: Start small. Pick one core business domain (customer data, product analytics) and build metadata for 20 Tier 1 tables. Validate the product delivers value before scaling.
For broken catalogs: Audit your current state. Which tables are actually used? Implement tiering retroactively, focusing on the 5% of tables that drive 80% of queries (a rough usage-audit sketch follows these steps). Sunset unused crawlers.
For vendor evaluation: Ask pointed questions. Can we ingest metadata from our CI/CD pipeline? Does the system separate storage from presentation? Can we enforce metadata contracts before deployment? If vendors can’t answer clearly, they’re selling monolithic infrastructure, not product-focused tools.
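For the audit step above, here is a rough sketch that ranks tables by usage from a query-log export and returns the small set covering roughly 80% of queries. The CSV format with a `table_name` column is an assumption; adapt it to whatever access history your warehouse actually exposes.

```python
import csv
from collections import Counter

def top_tables(log_path: str, coverage: float = 0.80) -> list[str]:
    """Return the most-queried tables that together cover `coverage` of all queries."""
    counts = Counter()
    with open(log_path) as f:
        for row in csv.DictReader(f):      # assumed export with a `table_name` column
            counts[row["table_name"]] += 1
    if not counts:
        return []
    total, running, keepers = sum(counts.values()), 0, []
    for table, n in counts.most_common():
        keepers.append(table)
        running += n
        if running / total >= coverage:    # the small set driving ~80% of queries
            break
    return keepers
```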
Successful metadata programs treat catalogs as products that solve specific user problems, not infrastructure that collects everything possible. They integrate metadata creation into development workflows, enforce quality through contracts, and curate ruthlessly based on business value.
Moving from a comprehensive inventory to focused product thinking separates catalogs that practitioners love from catalogs that gather dust. Your metadata strategy should scale with intention, not just accumulate with volume.

