13 Comments
Ame_data analyst:

I remember reading about their A/B testing on their blog and was so mesmerized.

This is a good engineering deep dive, and their architecture is outstanding.

Hodman Murad:

Their experimentation program really is impressive. The fact that they can run thousands of concurrent tests with proper statistical rigor and causal inference shows serious engineering discipline.

Laura Ferraz Baick:

This is the kind of engineering discipline I want to see more of in the AI space (and among growth marketers). Not just building things that work once, but building things that keep working when reality gets messy.

Hodman Murad:

I was thinking about this. I want to see more AI tools where reliability isn't an afterthought; it's baked into the infrastructure from day one. The experimentation platform, observability tools, and failure isolation aren't glamorous, but they're what make ML systems actually trustworthy at scale.

Neural Foundry:

The Keystone architecture really stands out here; processing 3 petabytes of incoming data daily is staggering. What fascinates me most is how they treat failure as a first-class citizen, storing every goal state in RDS for reconstruction. More teams should adopt this mindset instead of building for the happy path. The independent Flink clusters for each stream job are a smart isolation strategy that prevents the cascading failures we've all seen.

Hodman Murad:

The RDS as a source-of-truth pattern is underrated. A lot of stream processing systems treat their current state as ephemeral, which works until it doesn't. Netflix's approach means recovery is deterministic, not a heroic effort.

The Flink cluster isolation is expensive but proven. They're explicitly trading infrastructure cost for operational reliability. When you're processing trillions of events daily, a single bad job can take down your entire streaming platform. Worth studying how they balance that cost-reliability tradeoff.
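
For anyone who hasn't seen the pattern concretely, here's a rough sketch of "goal state in a relational store, reconstruct on failure." SQLite stands in for RDS, and the table, columns, and job name are made up for illustration, not Netflix's actual schema:

```python
# Rough sketch only: persist each stream job's desired ("goal") state in a
# relational store so recovery after a failure is a deterministic replay of
# known state, not a best-effort reconstruction. SQLite stands in for RDS.
import sqlite3

def save_goal_state(conn, job_id, goal_state):
    """Upsert the goal state for a stream job; the DB stays the source of truth."""
    conn.execute(
        "INSERT INTO job_goal_state (job_id, goal_state) VALUES (?, ?) "
        "ON CONFLICT(job_id) DO UPDATE SET goal_state = excluded.goal_state",
        (job_id, goal_state),
    )
    conn.commit()

def reconstruct_goal_states(conn):
    """After a control-plane failure, rebuild what every job is supposed to be doing."""
    rows = conn.execute("SELECT job_id, goal_state FROM job_goal_state").fetchall()
    return dict(rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job_goal_state (job_id TEXT PRIMARY KEY, goal_state TEXT)")
save_goal_state(conn, "playback-events-router", "RUNNING")
save_goal_state(conn, "playback-events-router", "PAUSED")  # upsert, not a duplicate row
print(reconstruct_goal_states(conn))  # {'playback-events-router': 'PAUSED'}
```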

Melanie Goodman:

This is a proper geek’s goldmine!

Which of Netflix’s practices do you think most companies underestimate and why?

Hodman Murad:

The experimentation rigor. Most companies run A/B tests, but Netflix validates everything before deployment. Not just UI changes, but ML model updates, algorithm tweaks, and even infrastructure changes. They've built statistical frameworks and causal inference techniques directly into their deployment pipeline.

This takes discipline. I've worked with many teams that skip experimentation when they're confident or under pressure. Netflix treats it as non-negotiable infrastructure, which is why they can move fast without breaking things at 300M subscriber scale.
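
As a toy version of that kind of gate, a plain two-proportion z-test with made-up numbers; a real framework layers guardrail metrics, sequential testing, and causal-inference checks on top of this, so treat it only as a sketch:

```python
# Toy illustration only: gate a rollout on a two-proportion z-test between
# control and treatment. The metric name, sample sizes, and 0.05 threshold
# are illustrative, not Netflix's actual framework.
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Control vs. treatment, e.g. play-starts out of 100k sessions (illustrative numbers)
p = two_proportion_p_value(conv_a=4_800, n_a=100_000, conv_b=5_100, n_b=100_000)
ship = p < 0.05  # the non-negotiable gate: no significant win, no rollout
print(f"p-value={p:.4f}, ship={ship}")
```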

John Brewton:

Huge fan of engineering deep dives. This is gold.

Hodman Murad:

Thanks! Planning to do more of these. Any companies you'd want to see covered?

John Brewton:

I think JP Morgan Chase would be interesting, as would Amazon.

Hodman Murad:

JP Morgan Chase would be a really fun one to write about! Thank you!

John Brewton:

Definitely. There's so much interesting detail on the security front, and what they're doing with AI in their operational and technological infrastructure seems really interesting.
