A Small Data Manifesto
How to choose simple solutions over complex data infrastructure
I’ve built data systems across startups and enterprises, and I keep seeing the same mistake: teams build for problems they don’t have. The worst example I’ve seen is a startup with 50,000 daily users running a Kafka cluster capable of handling 10 million events per second. They spent $15,000 per month on infrastructure and another 40 hours per week on operational maintenance. Their actual data volume was about 2GB per day. A managed Postgres database with a cron job would have cost $60 per month and required minimal maintenance.
This happens everywhere. Teams choose Spark when DuckDB would suffice. They deploy Airflow when cron would work. They architect for scale they’ll never reach, then spend years paying the operational cost. This pattern repeats so consistently that I’ve developed a checklist to separate genuine requirements from resume-driven development.
Two diseases that plague modern data teams
The first disease is the future-proofing fallacy. Engineers convince themselves they’re building for tomorrow’s scale, but tomorrow never arrives. I once worked with a team that implemented a complex real-time streaming pipeline because “we might need low-latency analytics eventually.” Two years later, they were still running batch reports overnight. The streaming infrastructure sat there, consuming cloud credits and engineering time, solving a problem that existed only in planning documents.
The second disease is tool selection by conference talk. An engineer attends a presentation about how Netflix uses Flink, then returns to their 50-person company convinced they need the same architecture. I call this conference-driven development. The engineer forgets that Netflix processes petabytes of data daily with hundreds of engineers. Their company processes gigabytes with three engineers. The scale mismatch renders the comparison meaningless.
Both diseases stem from the same root cause: engineers optimize for perceived sophistication rather than actual requirements. The cure is to ask better questions before writing any code.
Five diagnostic questions for architecture decisions
Before choosing any data tool, I run through these five questions. They’ve saved me from countless overengineering disasters.
Question 1: What problem am I solving right now, not hypothetically? Write down the specific pain point. If you can’t articulate it in one sentence without using “might” or “could,” you don’t have a clear requirement. Real problems are concrete: “The daily report takes 6 hours to run” or “Users wait 30 seconds for search results.” Hypothetical problems sound like: “We might need to scale to millions of users.”
Question 2: What’s the actual data volume today, and what will it realistically be in 12 months? Measure in concrete units. If you’re processing 50GB daily now, honest growth projections rarely exceed 200GB in a year unless you’re experiencing explosive user acquisition. I’ve never seen a team accurately predict they’d 10x their data volume. Most teams 2x or 3x over years, not months.
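If your data already lives in Postgres, that measurement takes one catalog query. A minimal sketch, assuming psycopg2 and a DATABASE_URL environment variable (both are illustrative choices, not a prescription):

```python
# Minimal sketch: measure what you actually store before arguing about scale.
# Assumes psycopg2 is installed and DATABASE_URL points at your Postgres instance.
import os
import psycopg2

TABLE_SIZES = """
    SELECT relname AS table_name,
           pg_size_pretty(pg_total_relation_size(relid)) AS pretty_size,
           pg_total_relation_size(relid) AS total_bytes
    FROM pg_catalog.pg_statio_user_tables
    ORDER BY pg_total_relation_size(relid) DESC
    LIMIT 20;
"""

with psycopg2.connect(os.environ["DATABASE_URL"]) as conn, conn.cursor() as cur:
    cur.execute(TABLE_SIZES)
    rows = cur.fetchall()

for table_name, pretty_size, _ in rows:
    print(f"{table_name:<40} {pretty_size}")
print(f"Top 20 tables combined: {sum(r[2] for r in rows) / 1024**3:.1f} GB")
```

Run it before the architecture meeting. The number it prints is usually smaller than anyone in the room expects.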
Question 3: Can I solve this with tools I already understand? Familiarity has enormous value. A tool you know well will always outperform a superior tool you barely understand. I’d rather maintain a slightly awkward Postgres solution than a theoretically elegant Kafka setup that requires constant Stack Overflow searches.
Question 4: What’s the operational burden of this choice? Count the hours weekly. Will someone need to monitor dashboards? Investigate failures? Tune performance? Upgrade versions? If the answer exceeds 5 hours per week, you need strong justification, as operational costs compound while delivering zero feature value to users.
Question 5: What happens if I’m wrong and need to migrate later? Most engineers fear this scenario excessively. Migration projects are common in data engineering. I’ve migrated from Postgres to distributed databases, from cron to Airflow, from batch to streaming. None took more than three months. Compare that to the years you’ll spend maintaining infrastructure you don’t need.
Choosing between dull and complex tools
Here’s how I evaluate tool pairs for common scenarios:
Postgres vs. Kafka: Use Postgres for anything under 100GB daily or where hour-level latency is acceptable. Use Kafka only when you need sub-second latency across multiple consumers, or when the daily data volume exceeds 500GB. The operational complexity difference between them is substantial. Postgres requires basic SQL knowledge. Kafka requires understanding partitions, consumer groups, retention policies, and cluster management.
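To make the Postgres side concrete: at modest volumes, a plain table plus SELECT ... FOR UPDATE SKIP LOCKED covers the common case of several workers pulling events without stepping on each other, no broker required. A minimal sketch, assuming psycopg2 and a hypothetical events table with id, payload, and processed_at columns:

```python
# Minimal sketch: a Postgres table standing in for a small event queue.
# Assumes a hypothetical table: events(id BIGSERIAL, payload JSONB, processed_at TIMESTAMPTZ).
import os
import psycopg2

CLAIM_BATCH = """
    UPDATE events
    SET processed_at = now()
    WHERE id IN (
        SELECT id FROM events
        WHERE processed_at IS NULL
        ORDER BY id
        LIMIT 100
        FOR UPDATE SKIP LOCKED  -- several workers can poll concurrently without clashing
    )
    RETURNING id, payload;
"""

def handle(event_id, payload):
    print(event_id, payload)  # business logic goes here

def process_batch(conn):
    with conn.cursor() as cur:
        cur.execute(CLAIM_BATCH)
        for event_id, payload in cur.fetchall():
            handle(event_id, payload)
    conn.commit()

if __name__ == "__main__":
    with psycopg2.connect(os.environ["DATABASE_URL"]) as conn:
        process_batch(conn)
```

It isn’t Kafka and doesn’t pretend to be, but it handles the 2GB-a-day case from the opening anecdote with tools the whole team already knows.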
DuckDB vs. Spark: Use DuckDB for anything that fits in memory on a single machine, roughly 100GB of source data. Use Spark only when your data genuinely spans terabytes and requires distributed processing. I’ve seen DuckDB outperform badly configured Spark clusters on datasets under 500GB. The performance gap narrows as data grows, but the operational burden stays disproportionately high.
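The single-machine path is also remarkably short in code. A minimal DuckDB sketch, assuming a directory of Parquet files; the paths and column names are illustrative:

```python
# Minimal sketch: aggregating a pile of Parquet files on one machine with DuckDB.
# Assumes `pip install duckdb pandas`; file paths and column names are illustrative.
import duckdb

con = duckdb.connect()  # in-memory database; pass a filename to persist instead

daily = con.sql("""
    SELECT date_trunc('day', event_time) AS day,
           count(*) AS events,
           count(DISTINCT user_id) AS users
    FROM read_parquet('events/*.parquet')
    GROUP BY 1
    ORDER BY 1
""").df()  # hands back a pandas DataFrame

print(daily.tail())
```

No cluster, no job submission, no tuning executors. That is the entire pipeline for a surprising number of analytics workloads.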
Cron + SQL vs. workflow orchestration engines: Use cron for dependency graphs with fewer than 20 steps or when failures can wait hours for manual intervention. Use Airflow or Prefect when you have complex dependencies, need sophisticated retry logic, or require detailed execution history. I’ve maintained cron-based pipelines for years with zero issues. I’ve also maintained Airflow deployments that required weekly troubleshooting.
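On the cron side, “sophisticated retry logic” is often just a loop and a non-zero exit code. A minimal sketch of a nightly rollup job; the crontab line, table names, and SQL are all placeholders:

```python
# Minimal sketch: a nightly cron job with a naive retry loop.
# Crontab entry (placeholder path):
#   0 2 * * * /usr/bin/python3 /opt/pipelines/nightly_rollup.py >> /var/log/nightly_rollup.log 2>&1
import os
import sys
import time
import psycopg2

ROLLUP_SQL = """
    INSERT INTO daily_order_totals (day, total)
    SELECT date_trunc('day', created_at), sum(amount)
    FROM orders
    WHERE created_at >= current_date - INTERVAL '1 day'
      AND created_at < current_date
    GROUP BY 1
    ON CONFLICT (day) DO UPDATE SET total = EXCLUDED.total;
"""

def run_with_retries(dsn, attempts=3, wait_seconds=300):
    for attempt in range(1, attempts + 1):
        try:
            with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
                cur.execute(ROLLUP_SQL)
            return
        except psycopg2.OperationalError as exc:
            print(f"attempt {attempt} failed: {exc}", file=sys.stderr)
            time.sleep(wait_seconds)
    sys.exit(1)  # non-zero exit so cron mail or your alerting notices the failure

if __name__ == "__main__":
    run_with_retries(os.environ["DATABASE_URL"])
```

When the dependency graph grows past what a script like this can express clearly, that is the signal to reach for an orchestrator, not before.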
The pattern is consistent: simple tools excel when you operate below certain thresholds. Complex tools become justified only at specific scale or complexity levels.
Mental model for tool selection
I visualize tool selection as a decision tree that starts with one question: Can this wait hours or days? If yes, proceed to batch processing territory. If no, advance to real-time considerations.
For batch processing, the next question becomes: Does this fit on one machine? If your data compresses to under 100GB, stay with single-machine tools like DuckDB or Postgres. If it exceeds that, advance to distributed processing tools.
For real-time requirements, ask: Do multiple systems need this data simultaneously? If no, consider whether a faster database query or materialized view solves the problem. If yes, you’ve arrived at legitimate streaming territory.
At each decision point, the answer “I’m not sure” defaults to the simpler option. You can always migrate up in complexity. Migrating down is harder but still possible.
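The whole tree fits in a few lines of code, which also makes it a handy artifact to argue over in design reviews. A sketch with my thresholds baked in; swap in your own numbers:

```python
# Minimal sketch: the decision tree above as a function, defaulting to the simpler option.
def recommend_tool(can_wait_hours: bool, compressed_gb: float,
                   multiple_realtime_consumers: bool = False) -> str:
    if can_wait_hours:
        # Batch territory: does it fit on one machine?
        if compressed_gb <= 100:
            return "single machine: Postgres or DuckDB"
        return "distributed batch processing"
    # Real-time territory: does more than one system need the data as it arrives?
    if not multiple_realtime_consumers:
        return "faster query or materialized view"
    return "legitimate streaming territory"

# "I'm not sure" defaults to the simpler branch:
print(recommend_tool(can_wait_hours=True, compressed_gb=40))
# -> single machine: Postgres or DuckDB
```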
When complex tools are actually justified
I’m not arguing against sophisticated data infrastructure entirely. I can think of three scenarios that genuinely require complex tools:
First, when the data volume exceeds what one server can handle. If you’re processing multiple terabytes of data daily and a single Postgres instance can’t handle the load, even with optimization, distributed systems become necessary. But verify you’ve actually hit this limit through measurement, not assumption; one quick check is sketched below.
Second, when latency requirements demand it. If users expect sub-second responses on queries that scan billions of rows, or if downstream systems require millisecond-level data freshness, streaming architectures and specialized databases earn their complexity. Ensure these requirements come from user needs, not engineering preferences.
Third, when regulatory or business requirements mandate specific capabilities. Some industries require audit trails, point-in-time recovery, or cross-region replication that simpler tools can’t provide. These are external constraints that override technical preferences.
Notice that all three scenarios involve measurable thresholds or external requirements, not hypothetical future needs or architectural aesthetics.
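For the first scenario, “measurement, not assumption” can start with asking Postgres which queries actually hurt. A sketch assuming the pg_stat_statements extension is enabled; the column names below are for PostgreSQL 13 and later (older versions call them total_time and mean_time):

```python
# Minimal sketch: check whether the database is actually the bottleneck before going distributed.
# Assumes the pg_stat_statements extension is enabled; column names are for PostgreSQL 13+.
import os
import psycopg2

SLOW_QUERIES = """
    SELECT left(query, 80) AS query_preview,
           calls,
           round(mean_exec_time::numeric, 1) AS mean_ms,
           round(total_exec_time::numeric / 1000, 1) AS total_s
    FROM pg_stat_statements
    ORDER BY total_exec_time DESC
    LIMIT 10;
"""

with psycopg2.connect(os.environ["DATABASE_URL"]) as conn, conn.cursor() as cur:
    cur.execute(SLOW_QUERIES)
    for preview, calls, mean_ms, total_s in cur.fetchall():
        print(f"{total_s:>10}s total  {mean_ms:>8}ms avg  {calls:>8} calls  {preview}")
```

If the top offenders are a handful of queries that better indexes or materialized views would fix, you haven’t hit the limit of one server; you’ve hit the limit of the current schema.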
Action plan for escaping overengineering
Start by auditing your current data infrastructure. For each component, answer the five diagnostic questions above. If you can’t justify a tool with concrete measurements, add it to your migration backlog.
Next, establish volume thresholds for tool selection. Write them down and share them with your team. Mine are: under 100GB daily stays on Postgres, under 1TB daily stays on single-machine processing, and anything that can tolerate an hour or more of latency stays in batch territory. Your thresholds might differ based on team expertise and existing infrastructure.
Then implement a complexity budget. Each new tool requires approval from the full engineering team. The proposer must demonstrate that simpler alternatives are inadequate using actual measurements, not projections. I’ve seen this single practice eliminate 80% of overengineering proposals.
Finally, celebrate boring solutions. When someone solves a problem with cron and SQL instead of a new framework, recognize it explicitly. Engineering culture often rewards complexity over simplicity. Reversing this incentive structure takes deliberate effort but pays enormous dividends.
Most data problems are small data problems wearing big data costumes. Strip away the costume, and you’ll find that simple tools, wielded competently, solve the vast majority of real requirements.
I’m currently documenting the build of Asaura AI, an AI personal assistant for people with ADHD and executive dysfunction. If you’re interested in watching a product get built from scratch (complete with user research, design decisions, and technical choices), I’m writing about it on my Asaura substack.
Three articles so far:
Executive Dysfunction ≠ Laziness
Support The Data Letter
If you found the small data manifesto useful, consider becoming a paid subscriber. You’ll get access to deeper technical breakdowns, case studies from real projects, and the occasional rant about data engineering decisions that keep me up at night.

