<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[System Design Interview Roadmap]]></title><description><![CDATA[System Design Interview Roadmap - Step by step process that will make you comfortable, familiar and then expert at System Design.]]></description><link>https://systemdr.systemdrd.com</link><image><url>https://substackcdn.com/image/fetch/$s_!_3Z_!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9fd573e1-44ca-4a06-be42-264560574975_500x500.png</url><title>System Design Interview Roadmap</title><link>https://systemdr.systemdrd.com</link></image><generator>Substack</generator><lastBuildDate>Tue, 16 Jun 2026 19:17:06 GMT</lastBuildDate><atom:link href="https://systemdr.systemdrd.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[SystemDR Inc]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[systemdr@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[systemdr@substack.com]]></itunes:email><itunes:name><![CDATA[System Design Roadmap]]></itunes:name></itunes:owner><itunes:author><![CDATA[System Design Roadmap]]></itunes:author><googleplay:owner><![CDATA[systemdr@substack.com]]></googleplay:owner><googleplay:email><![CDATA[systemdr@substack.com]]></googleplay:email><googleplay:author><![CDATA[System Design Roadmap]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Design Uber Dispatch — The Senior+ Walkthrough]]></title><description><![CDATA[This is the question that defines the marketplace and matching 
archetype.]]></description><link>https://systemdr.systemdrd.com/p/design-uber-dispatch-the-senior-walkthrough</link><guid isPermaLink="false">https://systemdr.systemdrd.com/p/design-uber-dispatch-the-senior-walkthrough</guid><dc:creator><![CDATA[System Design Roadmap]]></dc:creator><pubDate>Tue, 16 Jun 2026 03:30:36 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!X5Su!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff94c092b-6e6d-42a4-906c-81bdd893ef09_2048x1341.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote><p>This is the question that defines the marketplace and matching archetype. Ten questions in the question bank are variants of it &#8212; DoorDash delivery, Airbnb booking, Tinder matching, parking reservation, surge pricing, ride-share fraud. They share a spine: a supply side, a demand side, a location component, and a real-time matching requirement. The architecture that works for Uber dispatch, with minimal modification, works for all ten.</p></blockquote><blockquote><p>The surface question &#8212; &#8220;design Uber&#8221; &#8212; is a 4-hour answer. Interviewers narrow it deliberately: &#8220;design the dispatch system,&#8221; &#8220;design how a driver gets matched to a rider,&#8221; &#8220;walk me through what happens between the rider hitting &#8216;Request&#8217; and the driver getting a notification.&#8221; All three collapse to the same probe: <strong>how do you find the nearest available supply in real time, at scale, without burning down your database on every request?</strong></p></blockquote><p>The answer involves a data structure most candidates haven&#8217;t touched since their algorithms course: geospatial indexing. Specifically, quadtrees or geohashing &#8212; and knowing <em>which one</em> and <em>why</em> is the senior signal.</p><p>For L4 / mid-level: describe the matching flow and a naive approach. 
For L5 / senior: articulate the geospatial indexing choice with its trade-offs. For L6 / staff: the consistency model across supply/demand state, the surge pricing data flow, multi-city scaling, and what happens when a driver&#8217;s GPS signal drops.</p><div><hr></div><h2>The Question</h2><blockquote><p>&#8220;Design the Uber dispatch system. A rider requests a ride. The system finds nearby available drivers and assigns one to the rider.&#8221;</p></blockquote><p>Common variants:</p><ul><li><p>&#8220;Design DoorDash / UberEats food delivery dispatch.&#8221;</p></li><li><p>&#8220;Design Lyft&#8217;s matching system.&#8221;</p></li><li><p>&#8220;Design a system to match drivers to delivery orders in real time.&#8221;</p></li><li><p>&#8220;Design the driver-to-rider matching at Uber.&#8221;</p></li></ul><p>All four are the same architecture. DoorDash has one additional dimension &#8212; the restaurant (a fixed supply point instead of a moving one) &#8212; but the matching logic is identical.</p><div><hr></div><h2>Step 1 &#8212; Clarify Before You Draw</h2><p>Six questions. Two of them are load-bearing for the whole design.</p><p><strong>1. What does &#8220;dispatch&#8221; include &#8212; match only, or also route and surge?</strong> Dispatch in the narrow sense is &#8220;find a driver and assign them to a rider.&#8221; Routing (which road to take) is a separate system (Google Maps API or in-house). Surge pricing (dynamic fare) is another separate system. Clarify scope explicitly. Design the matching layer; acknowledge the others exist.</p><p><strong>2. How many active drivers? What&#8217;s the geography &#8212; one city or global?</strong> Active drivers in a single city: tens of thousands. Globally: millions. This shapes whether your geo-index fits in memory on one machine or needs distribution. For a US metro at peak, 10,000&#8211;50,000 active drivers is a reasonable working number. Global scale: 5&#8211;10M.</p><p><strong>3. 
How frequently do drivers update their location?</strong> Every 4&#8211;5 seconds is typical. This is a high-frequency write stream &#8212; 50,000 drivers &#215; 1 write/4s = ~12,500 writes/sec in one busy metro. Globally: millions of writes/sec. State this number; it drives the design of the location ingestion layer.</p><p><strong>4. What&#8217;s the matching latency SLO?</strong> A rider hits &#8220;request&#8221; and expects a driver to accept within 30&#8211;60 seconds. The matching decision itself should happen in milliseconds &#8212; sub-100ms for the initial candidate selection. If it takes 5 seconds to find a candidate, the experience breaks.</p><p><strong>5. Can a driver be matched to multiple requests simultaneously?</strong> No &#8212; a driver has one state: available or not. This is the consistency problem at the heart of the question. Two riders cannot be dispatched to the same driver. The system that manages this state is load-bearing.</p><p><strong>6. What failure modes matter most?</strong> Driver goes offline mid-trip. GPS signal drops. Driver declines the match. Rider cancels after match. Each one requires a state machine transition. State the three most important ones out loud.</p><div><hr></div><h2>Step 2 &#8212; Estimate</h2><p>Working assumptions for a single large metro deployment:</p><ul><li><p>50,000 active drivers at peak</p></li><li><p>200,000 ride requests per hour at peak = ~55 ride requests/sec</p></li><li><p>Driver location update: every 4 seconds, so 50,000 / 4 = 12,500 location writes/sec</p></li><li><p>Matching query: for every ride request, find all drivers within ~2 km radius</p></li><li><p>Average matching candidate set: 20&#8211;50 drivers</p></li><li><p>End-to-end dispatch latency target: &lt; 100ms for candidate selection, &lt; 30s for driver acceptance</p></li></ul><p>Location data per driver per update: ~100 bytes (driver_id, lat/lng, timestamp, status).</p><p>Writes: 12,500/s &#215; 100 bytes &#8776; 1.25 MB/s. 
Trivial for a single machine.</p><p>If global (5M drivers): 5M / 4 = 1.25M writes/sec. Serious &#8212; needs distributed location ingestion. But for this walkthrough, start with the single-metro case and note the global extension.</p><p>The geo-index in memory for 50,000 drivers: 50,000 &#215; 100 bytes = 5 MB. <em>Fits entirely in memory on a single machine.</em> This insight &#8212; that the active driver set for one city is tiny &#8212; is what enables the in-memory geo-index approach.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://systemdr.systemdrd.com/subscribe&quot;,&quot;text&quot;:&quot;Get Access to GitHub Repo&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://systemdr.systemdrd.com/subscribe"><span>Get Access to GitHub Repo</span></a></p><div><hr></div><h2>Step 3 &#8212; API Design</h2><p>Three endpoints. The first is rider-facing, the second is driver-facing, the third is the internal matching call.</p><div class="callout-block" data-callout="true"><p>POST /v1/rides/request</p><p>Headers:</p><p>  Authorization: Bearer {rider_token}</p><p>Body:</p><p>  pickup_lat: float</p><p>  pickup_lng: float</p><p>  destination_lat: float</p><p>  destination_lng: float</p><p>  ride_type: &#8220;uberx&#8221; | &#8220;pool&#8221; | &#8220;xl&#8221; | &#8220;black&#8221;</p><p>Response:</p><p>  ride_id: &#8220;r_xxx&#8221;</p><p>  status: &#8220;searching&#8221;</p><p>  eta_to_pickup_seconds: integer (estimate, not a promise)</p><p>  price_estimate: { low_cents, high_cents, surge_multiplier }</p><p>PUT /v1/drivers/location</p><p>Headers:</p><p>  Authorization: Bearer {driver_token}</p><p>Body:</p><p>  lat: float</p><p>  lng: float</p><p>  heading: float (degrees)</p><p>  speed_mps: float (meters/sec)</p><p>  status: &#8220;available&#8221; | &#8220;on_trip&#8221; | &#8220;offline&#8221;</p><p>Response: 200 (ack)</p><p>POST /v1/dispatch/match   (internal, not 
public-facing)</p><p>Body:</p><p>  ride_id: &#8220;r_xxx&#8221;</p><p>  pickup_lat, pickup_lng</p><p>  ride_type</p><p>Response:</p><p>  candidate_drivers: [{ driver_id, distance_meters, eta_seconds, rating }]</p></div><p><strong>The senior move on location updates:</strong> note that the driver app batches and sends at 4-second intervals, not on every GPS tick (which might be 1-second intervals). In that case batching cuts write load by about 4x and extends driver battery life. Saying this out loud &#8212; that the <em>client</em> has a role in write amplification &#8212; is an insight most candidates miss.</p><p><strong>heading and speed_mps</strong> belong in the location update. Junior candidates send only lat/lng. Senior candidates send the velocity vector too. Why: ETA estimation to a pickup point requires knowing not just where the driver is but which way they&#8217;re going and how fast. A driver 500m away facing the wrong direction on a one-way street is effectively farther away than a driver 700m away pointed at you. The heading/speed fields feed the ETA model.</p><div><hr></div><h2>Step 4 &#8212; Data Model</h2><p>Four stores. The geo-index is the unusual one.</p><h3>drivers table &#8212; source of truth for driver state</h3><p>driver_id, name, vehicle_type, license_plate, rating_avg, total_trips,</p><p>current_status (ENUM: available/on_trip/offline/suspended),</p><p>current_lat, current_lng, last_location_update, created_at</p><p>Sharded by driver_id. The current_status and current_lat/lng columns are updated frequently &#8212; this is the consistency-critical state. 
More on that in Step 6.</p><h3>rides table &#8212; the ride state machine</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KsEo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0696dbad-e66d-47bd-ad8e-b5ed032c76da_1157x457.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KsEo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0696dbad-e66d-47bd-ad8e-b5ed032c76da_1157x457.png 424w, https://substackcdn.com/image/fetch/$s_!KsEo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0696dbad-e66d-47bd-ad8e-b5ed032c76da_1157x457.png 848w, https://substackcdn.com/image/fetch/$s_!KsEo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0696dbad-e66d-47bd-ad8e-b5ed032c76da_1157x457.png 1272w, https://substackcdn.com/image/fetch/$s_!KsEo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0696dbad-e66d-47bd-ad8e-b5ed032c76da_1157x457.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KsEo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0696dbad-e66d-47bd-ad8e-b5ed032c76da_1157x457.png" width="1157" height="457" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0696dbad-e66d-47bd-ad8e-b5ed032c76da_1157x457.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:457,&quot;width&quot;:1157,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:84112,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://systemdr.systemdrd.com/i/201697493?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0696dbad-e66d-47bd-ad8e-b5ed032c76da_1157x457.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KsEo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0696dbad-e66d-47bd-ad8e-b5ed032c76da_1157x457.png 424w, https://substackcdn.com/image/fetch/$s_!KsEo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0696dbad-e66d-47bd-ad8e-b5ed032c76da_1157x457.png 848w, https://substackcdn.com/image/fetch/$s_!KsEo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0696dbad-e66d-47bd-ad8e-b5ed032c76da_1157x457.png 1272w, https://substackcdn.com/image/fetch/$s_!KsEo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0696dbad-e66d-47bd-ad8e-b5ed032c76da_1157x457.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UHAO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F493fdbf1-f3d3-4f15-b030-a33635aad5a0_862x345.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UHAO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F493fdbf1-f3d3-4f15-b030-a33635aad5a0_862x345.png 424w, 
https://substackcdn.com/image/fetch/$s_!UHAO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F493fdbf1-f3d3-4f15-b030-a33635aad5a0_862x345.png 848w, https://substackcdn.com/image/fetch/$s_!UHAO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F493fdbf1-f3d3-4f15-b030-a33635aad5a0_862x345.png 1272w, https://substackcdn.com/image/fetch/$s_!UHAO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F493fdbf1-f3d3-4f15-b030-a33635aad5a0_862x345.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UHAO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F493fdbf1-f3d3-4f15-b030-a33635aad5a0_862x345.png" width="862" height="345" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/493fdbf1-f3d3-4f15-b030-a33635aad5a0_862x345.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:345,&quot;width&quot;:862,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:41337,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://systemdr.systemdrd.com/i/201697493?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F493fdbf1-f3d3-4f15-b030-a33635aad5a0_862x345.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UHAO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F493fdbf1-f3d3-4f15-b030-a33635aad5a0_862x345.png 
424w, https://substackcdn.com/image/fetch/$s_!UHAO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F493fdbf1-f3d3-4f15-b030-a33635aad5a0_862x345.png 848w, https://substackcdn.com/image/fetch/$s_!UHAO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F493fdbf1-f3d3-4f15-b030-a33635aad5a0_862x345.png 1272w, https://substackcdn.com/image/fetch/$s_!UHAO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F493fdbf1-f3d3-4f15-b030-a33635aad5a0_862x345.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Ride state machine:</strong> searching &#8594; matched (driver found) &#8594; accepted (driver acknowledged) &#8594; driver_en_route &#8594; arrived &#8594; in_progress &#8594; completed | cancelled</p><p>Any state before completed can transition to cancelled. The driver_id is set only at the matched transition. If a driver declines, status reverts to searching and the matching process re-runs.</p><h3>location_history &#8212; time-series store</h3><p>Not your OLTP database. Location events are time-series data: (driver_id, lat, lng, timestamp, heading, speed). At 12,500 events/sec, that&#8217;s roughly 1 billion events/day in one metro (12,500 &#215; 86,400 &#8776; 1.08 billion, if the peak rate held all day). You don&#8217;t store this in the transactional database. You use a time-series database (InfluxDB, TimescaleDB, or Cassandra with a time-ordered key) or push to a data lake for analytics.</p><p>The hot query (&#8220;where is this driver right now?&#8221;) hits the geo-index, not this table. This table is for analytics, replay, and dispute resolution (&#8220;I was at X at time T &#8212; prove it&#8221;).</p><h3>driver_geo_index &#8212; the in-memory geo-index</h3><p>This is the heart of the system. Not a SQL table. An in-memory spatial index that answers &#8220;which drivers are within radius R of point (lat, lng)?&#8221; in sub-millisecond time.</p><p>Two real options: <strong>quadtrees</strong> or <strong>geohashing</strong>. This is Step 6&#8217;s deep dive.</p><div class="callout-block" data-callout="true"><p>Preparing for a distributed systems interview?<br>&#8594; <a href="https://systemdrd.com/ebooks/sdcourse-distributed-systems-interview">Download the free Interview Pack</a><br>&#8594; <a href="https://systemdr.systemdrd.com/subscribe">Subscribe</a> now to access the source code repository: 200+ coding lessons</p></div>
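<p>To make the geo-index query concrete, here is a minimal sketch of a radius lookup. It assumes a uniform lat/lng grid as a simplified stand-in for a geohash or quadtree, which is the trade-off Step 6 covers. The names <code>GridGeoIndex</code>, <code>CELL_DEG</code>, and <code>haversine_m</code> are hypothetical and come from no library or from Uber; the sketch also ignores the poles and the antimeridian.</p>

```python
import math
from collections import defaultdict

EARTH_RADIUS_M = 6_371_000.0
CELL_DEG = 0.01  # hypothetical cell size: about 1.1 km of latitude


def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters between two lat/lng points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


class GridGeoIndex:
    """In-memory index: fixed-size grid cells, each holding a set of driver ids."""

    def __init__(self, cell_deg=CELL_DEG):
        self.cell_deg = cell_deg
        self.cells = defaultdict(set)  # (row, col) -> driver ids in that cell
        self.positions = {}            # driver_id -> (lat, lng)

    def _cell(self, lat, lng):
        return (math.floor(lat / self.cell_deg), math.floor(lng / self.cell_deg))

    def upsert(self, driver_id, lat, lng):
        """Apply one location update: drop the old cell entry, add the new one."""
        self.remove(driver_id)
        self.positions[driver_id] = (lat, lng)
        self.cells[self._cell(lat, lng)].add(driver_id)

    def remove(self, driver_id):
        """Take a driver out of the index (went offline, or started a trip)."""
        old = self.positions.pop(driver_id, None)
        if old is not None:
            self.cells[self._cell(*old)].discard(driver_id)

    def nearby(self, lat, lng, radius_m):
        """Return (driver_id, distance_m) pairs within radius_m, nearest first."""
        # Bounding box of the search circle, in degrees.
        delta = radius_m / EARTH_RADIUS_M
        lat_span = math.degrees(delta)
        cos_lat = max(math.cos(math.radians(lat)), 1e-9)
        lng_span = math.degrees(math.asin(min(1.0, math.sin(delta) / cos_lat)))
        row0, col0 = self._cell(lat - lat_span, lng - lng_span)
        row1, col1 = self._cell(lat + lat_span, lng + lng_span)

        hits = []
        # Scan only the cells that overlap the box, then filter by true distance.
        for row in range(row0, row1 + 1):
            for col in range(col0, col1 + 1):
                for driver_id in self.cells.get((row, col), ()):
                    d_lat, d_lng = self.positions[driver_id]
                    dist = haversine_m(lat, lng, d_lat, d_lng)
                    if radius_m >= dist:
                        hits.append((dist, driver_id))
        hits.sort()
        return [(driver_id, round(dist)) for dist, driver_id in hits]


if __name__ == "__main__":
    index = GridGeoIndex()
    index.upsert("d1", 37.7760, -122.4185)
    index.upsert("d2", 37.7900, -122.4190)
    index.upsert("d3", 37.8500, -122.4190)
    # Candidates within 2 km of the pickup point, nearest first.
    print(index.nearby(37.7750, -122.4190, 2000))
```

<p>The point of the sketch is the shape of the work: each location update touches at most two cells, and each query scans a handful of cells rather than all 50,000 drivers. A geohash or quadtree replaces the fixed grid with cells that can be addressed by prefix or that adapt to driver density.</p>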
      <p>
          <a href="https://systemdr.systemdrd.com/p/design-uber-dispatch-the-senior-walkthrough">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[The Question Every FAANG Loop Asks. You Probably Haven't Drilled It.]]></title><description><![CDATA[Learn System Design with System building, Subscribe Hands On coding course - LogStream]]></description><link>https://systemdr.systemdrd.com/p/the-same-10-questions-hiding-behind</link><guid isPermaLink="false">https://systemdr.systemdrd.com/p/the-same-10-questions-hiding-behind</guid><dc:creator><![CDATA[System Design Roadmap]]></dc:creator><pubDate>Sun, 14 Jun 2026 03:30:54 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!JQAU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad370b07-b22e-499c-ad0e-354c8c40aba2_4675x2750.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="callout-block" data-callout="true"><p>Learn <strong>System Design with System building, Subscribe Hands On coding course - <a href="https://sdcourse.substack.com/p/start-here-how-to-use-sdcourse">LogStream</a></strong></p></div><p>There are 52 questions in the FAANG system design question bank. Engineers preparing for FAANG-tier interviews often drill them one by one: design Twitter, then design Uber, then design WhatsApp, then design Stripe.</p><p>This is the wrong approach. Not because the questions don&#8217;t matter &#8212; they do. But because drilling 52 individual questions treats each one as unique when most of them are the same question with a different surface.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://systemdr.systemdrd.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">System Design Interview Roadmap is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>Here&#8217;s what I mean.</p><p><strong>The 10 underlying problems</strong></p><p>Every system design question is fundamentally an instance of one or more of these:</p><p><em>Fan-out under load</em> &#8212; how do you efficiently deliver one write to many readers? Twitter&#8217;s feed, Instagram&#8217;s feed, TikTok&#8217;s feed, Slack&#8217;s channel messages, notifications. All the same problem. The celebrity/large-group edge case appears in all of them.</p><p><em>Inventory reservation</em> &#8212; how do you prevent two users from claiming the same unit simultaneously? Airbnb&#8217;s booking, ticket booking, parking lot reservation. All the same atomic-claim-under-contention problem. 
The same <code>FOR UPDATE SKIP LOCKED</code> pattern solves all three.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JQAU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad370b07-b22e-499c-ad0e-354c8c40aba2_4675x2750.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JQAU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad370b07-b22e-499c-ad0e-354c8c40aba2_4675x2750.png 424w, https://substackcdn.com/image/fetch/$s_!JQAU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad370b07-b22e-499c-ad0e-354c8c40aba2_4675x2750.png 848w, https://substackcdn.com/image/fetch/$s_!JQAU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad370b07-b22e-499c-ad0e-354c8c40aba2_4675x2750.png 1272w, https://substackcdn.com/image/fetch/$s_!JQAU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad370b07-b22e-499c-ad0e-354c8c40aba2_4675x2750.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JQAU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad370b07-b22e-499c-ad0e-354c8c40aba2_4675x2750.png" width="1456" height="856" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ad370b07-b22e-499c-ad0e-354c8c40aba2_4675x2750.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:856,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:707122,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://systemdr.systemdrd.com/i/198936505?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad370b07-b22e-499c-ad0e-354c8c40aba2_4675x2750.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!JQAU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad370b07-b22e-499c-ad0e-354c8c40aba2_4675x2750.png 424w, https://substackcdn.com/image/fetch/$s_!JQAU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad370b07-b22e-499c-ad0e-354c8c40aba2_4675x2750.png 848w, https://substackcdn.com/image/fetch/$s_!JQAU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad370b07-b22e-499c-ad0e-354c8c40aba2_4675x2750.png 1272w, https://substackcdn.com/image/fetch/$s_!JQAU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad370b07-b22e-499c-ad0e-354c8c40aba2_4675x2750.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://systemdr.systemdrd.com/subscribe&quot;,&quot;text&quot;:&quot;Subscribe for Question Walkthrough&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://systemdr.systemdrd.com/subscribe"><span>Subscribe for Question Walkthrough</span></a></p><p><em>Delivery semantics under failure</em> &#8212; what happens when the network drops in the middle of a message delivery? WhatsApp, Slack, notification services, task queues. All wrestling with at-least-once vs exactly-once delivery and the Two Generals argument for why an acknowledgement over a lossy network can never be fully certain.</p><p><em>Geospatial proximity</em> &#8212; how do you efficiently find nearby things in real time? 
Uber&#8217;s driver matching, Yelp&#8217;s restaurant search, Tinder&#8217;s candidate discovery, DoorDash&#8217;s Dasher assignment. Geohashing or quadtrees. Same index. Different shape of data.</p><p><em>Time-series at scale</em> &#8212; how do you store and query data that&#8217;s ordered by time and written at massive volume? Metrics monitoring, log aggregation, ride history, payment history. Same write-optimized, rollup-tiered architecture.</p><p><em>Pre-computation vs on-demand</em> &#8212; when do you compute the answer at write time vs query time? Feed ranking, typeahead, recommendation engines. Same trade-off: store more, compute less per query.</p><p><em>Adaptive delivery</em> &#8212; how do you serve the right quality given uncertain network conditions? Netflix streaming, Spotify, live video, file transfer. Same ABR pattern. Different media format.</p><p><em>Content-addressed storage</em> &#8212; how do you store large binary objects efficiently across millions of users? Dropbox, iCloud Photos, any file storage product. Same chunking, hashing, deduplication pattern.</p><p><em>Global rate control</em> &#8212; how do you enforce a limit across a distributed fleet of servers? Rate limiters, budget pacing in ad systems, quota enforcement. Same sliding window counter in Redis.</p><p><em>Fraud via relationship patterns</em> &#8212; how do you detect bad actors whose individual actions look legitimate? Payment fraud, ride-share fraud, ad click fraud. Same two-layer architecture: streaming rules for velocity, graph analysis for network patterns.</p><p><strong>What this changes about preparation</strong></p><p>Once you see the underlying patterns, the preparation strategy changes.</p><p>Instead of drilling 52 individual questions, drill 10 problem types. 
When you understand fan-out deeply &#8212; the write amplification, the celebrity threshold, the hybrid approach &#8212; you can answer Twitter&#8217;s timeline, Instagram&#8217;s feed, TikTok&#8217;s FYF, and Slack&#8217;s channels as variants of the same problem. The surface details change. The architecture doesn&#8217;t.</p><p>This is why the question bank is organized by archetype, not by company. Twitter&#8217;s timeline is filed under &#8220;Social Feed&#8221; alongside Instagram, TikTok, and Reddit. Uber is filed under &#8220;Marketplace &amp; Matching&#8221; alongside Airbnb, Tinder, and DoorDash. The point is to make the pattern visible.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EJp_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9d71421-fa17-481f-a6f7-0c929d75c65b_5225x4510.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EJp_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9d71421-fa17-481f-a6f7-0c929d75c65b_5225x4510.png 424w, https://substackcdn.com/image/fetch/$s_!EJp_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9d71421-fa17-481f-a6f7-0c929d75c65b_5225x4510.png 848w, https://substackcdn.com/image/fetch/$s_!EJp_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9d71421-fa17-481f-a6f7-0c929d75c65b_5225x4510.png 1272w, https://substackcdn.com/image/fetch/$s_!EJp_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9d71421-fa17-481f-a6f7-0c929d75c65b_5225x4510.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EJp_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9d71421-fa17-481f-a6f7-0c929d75c65b_5225x4510.png" width="1456" height="1257" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d9d71421-fa17-481f-a6f7-0c929d75c65b_5225x4510.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1257,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1445437,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://systemdr.systemdrd.com/i/198936505?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9d71421-fa17-481f-a6f7-0c929d75c65b_5225x4510.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!EJp_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9d71421-fa17-481f-a6f7-0c929d75c65b_5225x4510.png 424w, https://substackcdn.com/image/fetch/$s_!EJp_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9d71421-fa17-481f-a6f7-0c929d75c65b_5225x4510.png 848w, https://substackcdn.com/image/fetch/$s_!EJp_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9d71421-fa17-481f-a6f7-0c929d75c65b_5225x4510.png 1272w, 
https://substackcdn.com/image/fetch/$s_!EJp_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9d71421-fa17-481f-a6f7-0c929d75c65b_5225x4510.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>The test you can run right now</strong></p><p>Take any two questions from your preparation list. Can you identify the one or two underlying patterns they share? If yes, drilling one deeply makes the other easier. 
If no &#8212; reply to this post and I&#8217;ll map it for you.</p><p>The engineers who consistently pass at L5 and L6 don&#8217;t have 52 memorized answers. They have 10 deeply understood patterns and the ability to apply them to a question they haven&#8217;t seen before.</p><p>That second skill is what the interview is actually testing.</p><div><hr></div><div class="callout-block" data-callout="true"><p>The Question Vault has all 52 walkthroughs organized by archetype &#8212; so you can see the pattern across questions, not just the surface answer. </p><p><strong>Access all 52 Questions <a href="https://systemdrd.com/ebooks/52-faang-questions-drillcards-cheatsheets/">here</a></strong> </p></div><h2><strong>Subscription link</strong></h2><p><a href="https://systemdr.systemdrd.com/subscribe">https://systemdr.systemdrd.com/subscribe</a></p><p>&#8212;Sumedh</p><p><strong>Want the complete learning path?</strong></p><p>Unlock advanced modules, case studies, and guided exercises.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://systemdr.systemdrd.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">System Design Interview Roadmap is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[CDN Cache Busting Strategies: Ensuring Users See New Content Immediately]]></title><description><![CDATA[Introduction]]></description><link>https://systemdr.systemdrd.com/p/cdn-cache-busting-strategies-ensuring</link><guid isPermaLink="false">https://systemdr.systemdrd.com/p/cdn-cache-busting-strategies-ensuring</guid><dc:creator><![CDATA[System Design Roadmap]]></dc:creator><pubDate>Fri, 12 Jun 2026 08:30:37 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ke-l!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed9ae05c-3634-4b66-abd2-798109575157_4500x2900.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>Introduction</h2><blockquote><p>Your team just shipped a critical CSS fix at 2 AM. The bug was embarrassing &#8212; a broken checkout button on mobile. You deploy, verify on staging, and push to production. But for the next six hours, your support queue fills with the same complaint: &#8220;The button is still broken.&#8221; Your CDN cached the old file. Every edge node in 40 countries is serving the broken version. 
The fix exists on your origin servers, but the world doesn&#8217;t know it yet.</p></blockquote><p>This is the cache busting problem &#8212; and it&#8217;s one of the most operationally painful gaps between &#8220;deployed&#8221; and &#8220;live.&#8221;</p><div><hr></div><h2>What Cache Busting Actually Is</h2><p>A CDN cache is a distributed key-value store where the key is a URL and the value is the HTTP response. When you update <code>app.js</code>, the CDN has no automatic mechanism to detect that the file changed. Its TTL (Time-To-Live) hasn&#8217;t expired, so it keeps serving the stale version &#8212; possibly for hours or days, depending on your <code>Cache-Control</code> headers.</p><p>Cache busting is the set of techniques that force CDN edges to treat a new version of a resource as a <em>different resource entirely</em>, bypassing the cached version.</p><h3>The Three Core Strategies</h3><p><strong>1. URL Fingerprinting (Content Hash in Filename)</strong></p><p>The most reliable strategy. Your build tool (Webpack, Vite, esbuild) computes a hash of the file&#8217;s content and embeds it in the filename:</p><pre><code><code>app.js &#8594; app.a3f92bc1.js
app.js &#8594; app.d71c4e02.js  &#8592; after a change</code></code></pre><p>Since the URL changed, the CDN treats it as a new resource &#8212; no conflict with old caches. The old URL continues to serve the old file (good for users mid-session), while the new URL immediately serves fresh content. This is zero-downtime cache busting.</p><p>The key non-obvious detail: this requires your HTML <code>index.html</code> to <em>not</em> be heavily cached, since it must always reference the latest fingerprinted filenames. HTML typically gets <code>Cache-Control: no-cache</code> (revalidate every request) or a very short TTL, while fingerprinted assets get long-lived TTLs (<code>Cache-Control: max-age=31536000, immutable</code>).</p><p><strong>2. Query String Versioning</strong></p><pre><code><code>app.js?v=1.4.2
app.js?v=1.4.3</code></code></pre><p>Simpler to implement but less reliable. Some CDN configurations strip or ignore query strings when determining cache keys. Cloudflare, by default, includes query strings in cache keys &#8212; but intermediate proxies and some ISP-level caches may not. This strategy is appropriate for API responses or endpoints where fingerprinted filenames aren&#8217;t feasible.</p><p><strong>3. CDN Cache Invalidation (Purge API)</strong></p><p>Every major CDN exposes an API to explicitly purge cached objects by URL pattern. Cloudflare&#8217;s Cache Purge API, AWS CloudFront&#8217;s <code>create-invalidation</code>, and Fastly&#8217;s purge endpoints let you push a list of URLs or wildcard patterns that should be evicted immediately.</p><p>The failure mode: purge propagation is <em>not instantaneous</em>. CloudFront invalidations can take 10&#8211;60 seconds to propagate across all edge locations. During that window, some users get old content, some get new. Wildcard invalidations (<code>/*</code>) are especially dangerous at scale because they cause a thundering-herd effect: every edge node simultaneously fetches fresh content from origin, potentially overwhelming your origin servers if you have thousands of edge nodes and millions of cached objects.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ke-l!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed9ae05c-3634-4b66-abd2-798109575157_4500x2900.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ke-l!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed9ae05c-3634-4b66-abd2-798109575157_4500x2900.png 424w, 
https://substackcdn.com/image/fetch/$s_!ke-l!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed9ae05c-3634-4b66-abd2-798109575157_4500x2900.png 848w, https://substackcdn.com/image/fetch/$s_!ke-l!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed9ae05c-3634-4b66-abd2-798109575157_4500x2900.png 1272w, https://substackcdn.com/image/fetch/$s_!ke-l!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed9ae05c-3634-4b66-abd2-798109575157_4500x2900.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ke-l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed9ae05c-3634-4b66-abd2-798109575157_4500x2900.png" width="1456" height="938" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ed9ae05c-3634-4b66-abd2-798109575157_4500x2900.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:938,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2101545,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://systemdr.substack.com/i/190496711?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed9ae05c-3634-4b66-abd2-798109575157_4500x2900.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!ke-l!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed9ae05c-3634-4b66-abd2-798109575157_4500x2900.png 424w, https://substackcdn.com/image/fetch/$s_!ke-l!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed9ae05c-3634-4b66-abd2-798109575157_4500x2900.png 848w, https://substackcdn.com/image/fetch/$s_!ke-l!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed9ae05c-3634-4b66-abd2-798109575157_4500x2900.png 1272w, https://substackcdn.com/image/fetch/$s_!ke-l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed9ae05c-3634-4b66-abd2-798109575157_4500x2900.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
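<p>The fingerprinting strategy described above can be sketched in a few lines. This is a minimal illustration, not how Webpack, Vite, or esbuild implement it: the helper names <code>fingerprint_name</code> and <code>cache_control_for</code>, the choice of SHA-256, and the 8-character digest are all assumptions made for the example. It shows the two halves of the pattern: the URL changes whenever the content changes, and only fingerprinted URLs receive the long-lived <code>Cache-Control</code> value quoted earlier.</p>

```python
import hashlib
from pathlib import PurePosixPath


def fingerprint_name(path: str, content: bytes, digest_len: int = 8) -> str:
    """Return the path with a content hash inserted before the extension.

    Hypothetical helper: real build tools choose their own hash function
    and digest length.
    """
    digest = hashlib.sha256(content).hexdigest()[:digest_len]
    p = PurePosixPath(path)
    # "static/app.js" becomes "static/app.HASH.js"
    return str(p.with_name(f"{p.stem}.{digest}{p.suffix}"))


def cache_control_for(fingerprinted: bool) -> str:
    """Pick a Cache-Control value following the policy described above."""
    if fingerprinted:
        # The URL changes with the content, so the response can be cached
        # for a year and marked immutable.
        return "max-age=31536000, immutable"
    # index.html and other stable URLs are revalidated on every request.
    return "no-cache"


if __name__ == "__main__":
    old = fingerprint_name("static/app.js", b"console.log('v1');")
    new = fingerprint_name("static/app.js", b"console.log('v2');")
    print(old, cache_control_for(True))
    print(new, cache_control_for(True))
    print("index.html", cache_control_for(False))
```

<p>Identical bytes always map to the same filename, so an unchanged asset keeps its URL (and its cached copies) across deploys, while any edit yields a new URL that no edge has cached yet.</p>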
      <p>
          <a href="https://systemdr.systemdrd.com/p/cdn-cache-busting-strategies-ensuring">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Design Twitter's Timeline — The Senior+ Walkthrough]]></title><description><![CDATA[This is the question that defines the social feed archetype.]]></description><link>https://systemdr.systemdrd.com/p/design-twitters-timeline-the-senior</link><guid isPermaLink="false">https://systemdr.systemdrd.com/p/design-twitters-timeline-the-senior</guid><dc:creator><![CDATA[System Design Roadmap]]></dc:creator><pubDate>Tue, 09 Jun 2026 03:30:58 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!2dyu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb3096da-0f8e-42d2-b5ff-dddbabf33e41_1950x2048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This is the question that defines the social feed archetype. Twelve questions in the FAANG question bank are variants of this one &#8212; Instagram feed, TikTok For You, Reddit front page, LinkedIn news feed, Pinterest home, YouTube next-up, Stories, trending topics. They share a backbone: many users, many posts, a feed that has to be both fresh and ranked, and a write/read ratio that decides the entire architecture. If you internalize Twitter timeline, you can answer eleven other questions you haven&#8217;t drilled.</p><blockquote><p>The version most interviewers ask is not &#8220;design Twitter&#8221; &#8212; that&#8217;s a 4-hour question. They ask &#8220;design the Twitter home timeline&#8221; or &#8220;design the news feed for [our product]&#8221; or &#8220;design the part of [X] that shows posts from people you follow.&#8221; All of these collapse to the same probe: <strong>how do you serve a feed for a user whose followed set has wildly variable size and post rate, while keeping the feed fresh?</strong></p></blockquote><blockquote><p>For L4 / mid-level the bar is articulating one approach (fanout-on-write) and walking it cleanly. 
For L5 / senior, the probe is the trade-off between fanout-on-write and fanout-on-read, and the realization that production systems do <em>both</em> depending on the user. For L6 / staff, the probe is hot-user handling, ranking, and the operational story for a system that ingests millions of writes per second and serves billions of feed reads.</p></blockquote><div><hr></div><h2>The Question</h2><blockquote><p>&#8220;Design the Twitter home timeline. A user opens the app and sees a feed of recent posts from the people they follow, sorted appropriately. Posts can be original tweets, retweets, or replies.&#8221;</p></blockquote><p>Common variants:</p><ul><li><p>&#8220;Design Instagram&#8217;s feed.&#8221;</p></li><li><p>&#8220;Design Facebook News Feed.&#8221;</p></li><li><p>&#8220;Design the LinkedIn home page.&#8221;</p></li><li><p>&#8220;Design Reddit&#8217;s front page.&#8221;</p></li><li><p>&#8220;Design the TikTok For You feed.&#8221;</p></li></ul><p>The first four are direct equivalents. TikTok For You is <em>almost</em> the same &#8212; but with one critical difference: it&#8217;s not a follow-graph feed, it&#8217;s a recommendation feed. The architecture is similar; the candidate generation is different. Note this if asked.</p><div><hr></div><h2>Step 1 &#8212; Clarify Before You Draw</h2><p>Six questions. The first three matter most.</p><p><strong>1. Are we ranking or just chronological?</strong> Old Twitter was reverse-chronological. Modern Twitter is ranked. This is the single biggest variable. Chronological is far simpler &#8212; sort by timestamp, done. Ranked introduces a scoring system, a feature pipeline, and probably an ML model. State which one you&#8217;re designing for explicitly. If unclear, design ranked &#8212; it&#8217;s the harder and more realistic version.</p><p><strong>2. What&#8217;s the active user count? What&#8217;s the average follow count? What&#8217;s the maximum?</strong> The numbers determine the architecture. 
100M daily active users with an average follow count of 200 looks different from 500M DAU with an average follow count of 500. The maximum is the more decisive number, and its mirror image, the maximum <em>follower</em> count, is the long-tail celebrity problem in disguise. State explicit numbers.</p><p><strong>3. What&#8217;s the read-to-write ratio?</strong> Always lopsided in social feeds, but how lopsided matters. At 100:1, writes are a significant share of traffic, so extra work per write is expensive; at 1000:1, you can afford to do work on the write path. State the ratio; it justifies fanout-on-write later.</p><p><strong>4. How fresh? Real-time, or is 5-minute staleness OK?</strong> Real-time means push. Stale-OK means pull or hybrid. Most production feeds tolerate seconds-to-minutes of staleness; saying &#8220;I&#8217;d target 5-second p99 freshness&#8221; is a senior signal.</p><p><strong>5. What goes in the feed besides original posts?</strong> Retweets count as posts, generally. Replies usually don&#8217;t appear in the home timeline by default (only on the conversation thread). Trending topics, ads, suggested follows, &#8220;in case you missed it&#8221; pulls &#8212; all of these are <em>injected</em> into the feed in production. Acknowledge this exists; don&#8217;t try to design it all.</p><p><strong>6. What&#8217;s the SLO for feed load?</strong> Typical: p50 &lt; 200ms, p99 &lt; 500ms for the API. The user opens the app and the timeline has to be there. 
State the number; it drives caching.</p><div><hr></div><h2>Step 2 &#8212; Estimate</h2><p>Working assumptions for the rest of this walkthrough:</p><ul><li><p>500M daily active users (DAU)</p></li><li><p>Each user reads their feed ~5 times/day on average &#8594; 2.5B feed reads/day &#8594; ~30K reads/sec average, peaking at ~200K</p></li><li><p>Each user posts ~0.2 times/day on average &#8594; 100M posts/day &#8594; ~1,200 posts/sec average, peaking at ~10K</p></li><li><p>Average follow count: 200 (median is much lower; mean is dragged up by power users)</p></li><li><p>Maximum follow count: 5,000 (for normal users; verified accounts can exceed this)</p></li><li><p>Maximum <em>follower</em> count: 100M+ (for celebrities &#8212; this is the asymmetry that breaks naive designs)</p></li><li><p>Posts retained in timelines: roughly 1,000 per user (older items can be paginated in on demand)</p></li></ul><p>Storage:</p><ul><li><p>Posts table: 100M posts/day &#215; 365 days &#215; 5 years &#215; 1 KB &#8776; 180 TB</p></li><li><p>Timeline cache (precomputed timelines): 500M users &#215; 1,000 posts &#215; 200 bytes/entry &#8776; 100 TB</p></li><li><p>Follow graph: roughly 100 billion edges (500M users &#215; 200 follows), but each edge is small (~30 bytes) &#8594; ~3-5 TB</p></li></ul><p>The timeline cache is the unusual one. It&#8217;s bigger than your hot OLTP data set and lives in a different system (typically Redis, Memcached, or a custom in-memory store). 
When a senior candidate names this layer separately from the posts store, you know they&#8217;ve built something like this.</p><p>Read bandwidth:</p><ul><li><p>200K reads/sec &#215; 1,000 timeline entries &#215; 200 bytes &#8776; 40 GB/s (an upper bound: it assumes every read pulls the full cached timeline)</p></li><li><p>This is what the timeline cache absorbs &#8212; the posts table never sees this traffic.</p></li></ul><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://systemdr.systemdrd.com/subscribe&quot;,&quot;text&quot;:&quot;Get Access to GitHub Repo&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://systemdr.systemdrd.com/subscribe"><span>Get Access to GitHub Repo</span></a></p><div><hr></div><h2>Step 3 &#8212; API Design</h2><p>Three endpoints. The first one is the headline.</p><div class="callout-block" data-callout="true"><p>GET /v1/timeline/home</p><p>Query:</p><p>  cursor: string (optional, for pagination)</p><p>  limit: integer (default 20, max 100)</p><p>Response:</p><p>  posts: [array of post objects]</p><p>  next_cursor: string or null</p><p>  ranked_at: timestamp (when this version was ranked)</p><p>POST /v1/posts</p><p>Headers:</p><p>  Idempotency-Key: string</p><p>Body:</p><p>  content: string (max 280 chars or whatever your limit is)</p><p>  media_ids: [array of pre-uploaded media references]</p><p>  reply_to_post_id: optional</p><p>Response:</p><p>  id, created_at, author_id</p><p>POST /v1/follow</p><p>Body:</p><p>  target_user_id: string</p><p>Response:</p><p>  follow_id, created_at</p></div><p><strong>Pagination is cursor-based, not offset-based.</strong> Offset pagination (page=3, per_page=20) breaks at scale because the data shifts beneath you &#8212; by the time you fetch page 3, new posts have arrived and pushed everything down. A cursor (an opaque token encoding a position in the feed) is immune to this. 
Saying <em>&#8220;cursor-based, never offset-based for feeds&#8221;</em> is a senior signal.</p><p><strong>ranked_at</strong> in the response is the time the timeline was scored. It is useful for client-side caching (&#8220;don&#8217;t re-fetch for 30 seconds&#8221;) and for debugging staleness reports. Most candidates omit this. Including it is a small but real senior tell.</p><div class="callout-block" data-callout="true"><p>Subscribe now to <br>&#8594; Unlock complete system design walkthroughs<br>&#8594; Get access to downloadable Drill Cards and the full walkthrough</p></div>
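<p>The cursor-based pagination from Step 3 can be sketched as follows. This is a simplified illustration under stated assumptions: the feed is reverse-chronological rather than ranked, posts live in an in-memory list instead of the timeline cache, and the cursor layout (base64-encoded JSON holding the last seen <code>created_at</code> and <code>id</code>) is hypothetical. The function names <code>encode_cursor</code>, <code>decode_cursor</code>, and <code>home_timeline</code> are invented for the example.</p>

```python
import base64
import json
from typing import Optional


def encode_cursor(created_at: int, post_id: int) -> str:
    """Pack the last seen position into an opaque, URL-safe token."""
    raw = json.dumps({"t": created_at, "id": post_id}).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_cursor(cursor: str) -> tuple:
    """Recover the (created_at, id) position from a cursor token."""
    data = json.loads(base64.urlsafe_b64decode(cursor.encode("ascii")))
    return (data["t"], data["id"])


def home_timeline(posts: list, cursor: Optional[str] = None, limit: int = 20) -> dict:
    """Return one page of posts, newest first, plus the cursor for the next page."""
    limit = max(1, min(limit, 100))  # default 20, max 100, as in the API above
    # Sort by (created_at, id) so that ties on timestamp have a stable order.
    ordered = sorted(posts, key=lambda p: (p["created_at"], p["id"]), reverse=True)
    if cursor is not None:
        boundary = decode_cursor(cursor)
        # Keep only posts strictly older than the last one already returned.
        ordered = [p for p in ordered if boundary > (p["created_at"], p["id"])]
    page = ordered[:limit]
    has_more = len(ordered) > limit
    next_cursor = None
    if has_more:
        next_cursor = encode_cursor(page[-1]["created_at"], page[-1]["id"])
    return {"posts": page, "next_cursor": next_cursor}


if __name__ == "__main__":
    feed = [{"id": i, "created_at": i * 10} for i in range(1, 6)]
    first = home_timeline(feed, limit=2)
    # A new post arrives between page fetches.
    feed.append({"id": 6, "created_at": 60})
    second = home_timeline(feed, cursor=first["next_cursor"], limit=2)
    print([p["id"] for p in first["posts"]], [p["id"] for p in second["posts"]])
```

<p>Because the cursor records a position rather than a page number, the post that arrives between the two fetches neither duplicates nor skips anything on the second page. A ranked feed needs a different cursor (for example, one tied to a ranked snapshot), which this sketch does not attempt.</p>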
      <p>
          <a href="https://systemdr.systemdrd.com/p/design-twitters-timeline-the-senior">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[L4, L5, L6, L7 — What Actually Changes Between Levels]]></title><description><![CDATA[Learn System Design with System building, Subscribe Hands On coding course - LogStream]]></description><link>https://systemdr.systemdrd.com/p/l4-l5-l6-l7-what-actually-changes</link><guid isPermaLink="false">https://systemdr.systemdrd.com/p/l4-l5-l6-l7-what-actually-changes</guid><dc:creator><![CDATA[System Design Roadmap]]></dc:creator><pubDate>Sun, 07 Jun 2026 02:48:48 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!5Yib!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fa0d1da-16f1-4680-9a26-52679241d836_5500x3190.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="callout-block" data-callout="true"><p>Learn <strong>system design by building systems: subscribe to the hands-on coding course, <a href="https://sdcourse.substack.com/p/start-here-how-to-use-sdcourse">LogStream</a></strong></p></div><p></p><p>Most engineers preparing for senior-level interviews have a vague sense that &#8220;L6 answers go deeper.&#8221; This is both true and useless as preparation advice. Deeper in what direction? Deeper by how much? What specifically changes?</p><p>For the basics, see <a href="https://systemdr.systemdrd.com/p/load-balancing-101-how-traffic-gets">Load Balancing 101</a>.</p><p>Here is what specifically changes.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://systemdr.systemdrd.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">System Design Interview Roadmap is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p><strong>L4 &#8594; L5: from &#8220;what&#8221; to &#8220;why&#8221;</strong></p><p>An L4 answer describes the architecture. An L5 answer defends it.</p><p>At L4, the bar is: can you run a framework, estimate sensibly, and produce a coherent design with read and write paths? That&#8217;s it. Interviewers are checking whether you know the building blocks.</p><p>At L5, the bar is: can you justify your decisions? When the interviewer asks &#8220;why did you partition by recipient_id instead of sender_id?&#8221;, the L4 candidate says &#8220;that&#8217;s the standard approach.&#8221; The L5 candidate says &#8220;because the hot query is &#8216;give me all pending messages for user X&#8217; &#8212; partitioning on the read pattern means that query hits one shard. If I partitioned by sender, delivering to user X requires querying every node.&#8221;</p><p>The transition from L4 to L5 is the transition from &#8220;stating decisions&#8221; to &#8220;reasoning about decisions.&#8221; Same architecture. 
Different explanation.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5Yib!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fa0d1da-16f1-4680-9a26-52679241d836_5500x3190.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5Yib!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fa0d1da-16f1-4680-9a26-52679241d836_5500x3190.png 424w, https://substackcdn.com/image/fetch/$s_!5Yib!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fa0d1da-16f1-4680-9a26-52679241d836_5500x3190.png 848w, https://substackcdn.com/image/fetch/$s_!5Yib!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fa0d1da-16f1-4680-9a26-52679241d836_5500x3190.png 1272w, https://substackcdn.com/image/fetch/$s_!5Yib!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fa0d1da-16f1-4680-9a26-52679241d836_5500x3190.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5Yib!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fa0d1da-16f1-4680-9a26-52679241d836_5500x3190.png" width="1456" height="844" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3fa0d1da-16f1-4680-9a26-52679241d836_5500x3190.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:844,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:945218,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://systemdr.systemdrd.com/i/198936263?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fa0d1da-16f1-4680-9a26-52679241d836_5500x3190.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5Yib!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fa0d1da-16f1-4680-9a26-52679241d836_5500x3190.png 424w, https://substackcdn.com/image/fetch/$s_!5Yib!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fa0d1da-16f1-4680-9a26-52679241d836_5500x3190.png 848w, https://substackcdn.com/image/fetch/$s_!5Yib!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fa0d1da-16f1-4680-9a26-52679241d836_5500x3190.png 1272w, https://substackcdn.com/image/fetch/$s_!5Yib!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fa0d1da-16f1-4680-9a26-52679241d836_5500x3190.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://systemdr.systemdrd.com/subscribe&quot;,&quot;text&quot;:&quot;Subscribe for Question Walkthrough&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://systemdr.systemdrd.com/subscribe"><span>Subscribe for Question Walkthrough</span></a></p><p><strong>L5 &#8594; L6: from design to operation</strong></p><p>An L5 answer designs a system that works. An L6 answer designs a system that works <em>and</em> describes how you&#8217;d know if it was failing.</p><p>L6 adds three things that L5 almost never has:</p><p><em>SLOs with monitoring signals.</em> Not &#8220;the API should be fast.&#8221; &#8220;I&#8217;m targeting p99 &lt; 100ms for the feed read. 
The metric I&#8217;d alert on is cache hit ratio dropping below 90% &#8212; that&#8217;s the leading indicator that Postgres load is about to spike.&#8221;</p><p><em>Failure modes named and partially addressed.</em> Not &#8220;we&#8217;d add monitoring.&#8221; &#8220;The failure mode here is the thundering herd when the Redis node recovers &#8212; if we don&#8217;t add jitter to the retry, every failed request retries simultaneously and we bring Redis down again the moment it comes back up.&#8221;</p><p><em>The operational reality of trade-offs.</em> L5 says &#8220;consistent hashing handles node failure gracefully.&#8221; L6 says &#8220;consistent hashing with virtual nodes handles it, but I&#8217;d still want an active health check every 5 seconds and a circuit breaker so that if a node is responding slowly rather than failing hard, we stop routing to it before the tail latency affects users.&#8221;</p><p>The transition from L5 to L6 is the transition from &#8220;designing for the happy path&#8221; to &#8220;designing for the failure path.&#8221;</p><p><strong>L6 &#8594; L7: from system to organization</strong></p><p>L7 is the level where the interview stops being about a single system and starts being about the relationship between systems &#8212; and the teams that build them.</p><p>L7 adds:</p><p><em>Build vs buy decisions with economic reasoning.</em> Not &#8220;we&#8217;d use Kafka.&#8221; &#8220;For this use case, Kafka is the right call &#8212; the producers and consumers have different scale characteristics and we want replay capability. But if we&#8217;re a 20-person company, the operational overhead of Kafka is probably not worth it; I&#8217;d use SQS and revisit when the team can support it.&#8221;</p><p><em>Organizational implications.</em> &#8220;This architecture creates a hard dependency between the payments team and the notifications team. 
I&#8217;d extract the notification contract into a shared schema owned by a platform team to prevent that coupling from slowing down both teams&#8217; deploy cadence.&#8221;</p><p><em>Platform thinking.</em> Not &#8220;how do I build this feature&#8221; but &#8220;how do I build this such that 10 teams can build features on top of it without coordinating with me.&#8221;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!q8Ju!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95231af2-836d-47da-b1b5-86fd1632c0a4_5500x3080.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!q8Ju!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95231af2-836d-47da-b1b5-86fd1632c0a4_5500x3080.png 424w, https://substackcdn.com/image/fetch/$s_!q8Ju!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95231af2-836d-47da-b1b5-86fd1632c0a4_5500x3080.png 848w, https://substackcdn.com/image/fetch/$s_!q8Ju!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95231af2-836d-47da-b1b5-86fd1632c0a4_5500x3080.png 1272w, https://substackcdn.com/image/fetch/$s_!q8Ju!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95231af2-836d-47da-b1b5-86fd1632c0a4_5500x3080.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!q8Ju!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95231af2-836d-47da-b1b5-86fd1632c0a4_5500x3080.png" width="1456" height="815" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/95231af2-836d-47da-b1b5-86fd1632c0a4_5500x3080.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:815,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:984924,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://systemdr.systemdrd.com/i/198936263?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95231af2-836d-47da-b1b5-86fd1632c0a4_5500x3080.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!q8Ju!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95231af2-836d-47da-b1b5-86fd1632c0a4_5500x3080.png 424w, https://substackcdn.com/image/fetch/$s_!q8Ju!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95231af2-836d-47da-b1b5-86fd1632c0a4_5500x3080.png 848w, https://substackcdn.com/image/fetch/$s_!q8Ju!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95231af2-836d-47da-b1b5-86fd1632c0a4_5500x3080.png 1272w, https://substackcdn.com/image/fetch/$s_!q8Ju!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95231af2-836d-47da-b1b5-86fd1632c0a4_5500x3080.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>The pattern</strong></p><p>Each level adds a <em>type</em> of thinking, not just more of the same thinking.</p><p>L4: knows the vocabulary. L5: can reason about trade-offs. L6: thinks about failure, operations, and monitoring. L7: thinks about organizations, economics, and platforms.</p><p>If you&#8217;re targeting L5, make sure you can answer &#8220;why did you make that choice?&#8221; for every major decision. 
That&#8217;s the bar.</p><p>If you&#8217;re targeting L6, make sure every design decision has three parts: what you chose, why you chose it, and what would change your mind.</p><p>If you&#8217;re targeting L7, make sure you can talk about one decision in terms of its organizational and economic implications &#8212; not just its technical ones.</p><p><strong>Paid subscribers get a full system design walkthrough every Tuesday. </strong></p><p><strong>This week: Stripe Payments &#8212; the idempotency key problem most engineers miss.</strong></p><div><hr></div><p>The paid posts go deeper on the Senior vs Staff distinction for each specific question &#8212; here&#8217;s what the L5 answer looks like, here&#8217;s what L6 adds, here&#8217;s the exact phrasing that signals each. </p><h2><strong>Subscription link</strong></h2><p><a href="https://systemdr.systemdrd.com/subscribe">https://systemdr.systemdrd.com/subscribe</a></p><p>&#8212;Sumedh</p><p><strong>Want the complete learning path?</strong> </p><p>Unlock advanced modules, case studies, and guided exercises.</p><div class="callout-block" data-callout="true"><p>The Question Vault has all 52 walkthroughs organized by archetype &#8212; so you can see the pattern across questions, not just the surface answer.</p><p><strong>Access all 52 Questions <a href="https://systemdrd.com/ebooks/52-faang-questions-drillcards-cheatsheets/">here</a></strong></p></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://systemdr.systemdrd.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">System Design Interview Roadmap is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Bot Detection and Mitigation: Identifying Non-Human Traffic in Real-Time]]></title><description><![CDATA[Section 8: Production Engineering & Optimization | Article 213]]></description><link>https://systemdr.systemdrd.com/p/bot-detection-and-mitigation-identifying</link><guid isPermaLink="false">https://systemdr.systemdrd.com/p/bot-detection-and-mitigation-identifying</guid><dc:creator><![CDATA[System Design Roadmap]]></dc:creator><pubDate>Fri, 05 Jun 2026 01:47:30 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!6E_K!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2da37543-c310-4c97-aeea-8f4b15cbeff1_4500x3100.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>Introduction</h2><blockquote><p>Your login endpoint just processed 4,000 requests in 60 seconds from a single IP. Your rate limiter fires, blocks the IP, and you declare victory. Thirty seconds later, the same credential-stuffing attack resumes&#8212;now spread across 800 IPs, using valid browser User-Agent strings, with randomized delays between requests. The attacker bypassed your layer-4 defense by simply reading your block response and adapting. 
This is the fundamental asymmetry of bot detection: defenders must evaluate every signal with precision; attackers only need to find one gap.</p></blockquote><div><hr></div><h2>What Bot Detection Actually Is</h2><p>Bot detection is a multi-signal classification problem running under strict latency budgets. Every request arriving at your edge must be evaluated&#8212;typically in under 10ms&#8212;against a fingerprint of behavioral, environmental, and network signals to produce a risk score. Requests above a threshold get challenged or blocked; those below pass through.</p><blockquote><p>The naive approach&#8212;blocklisting known bad IPs or User-Agent strings&#8212;fails because these signals are trivially spoofed. Sophisticated bots rotate through residential proxies (real IP addresses belonging to ISPs, not datacenters), mimic browser TLS fingerprints, and replay valid JavaScript challenge tokens captured from real browsers.</p></blockquote><p><strong>Signal categories</strong> that meaningful detection systems evaluate:</p><p><strong>Network-layer signals:</strong> IP reputation, ASN classification (datacenter vs. residential vs. mobile), IP velocity (how many sessions from this IP in the last N seconds), and geolocation consistency (a session from US &#8594; Brazil &#8594; Germany in 10 minutes is impossible for a human).</p><p><strong>TLS fingerprinting (JA3/JA3S):</strong> The TLS ClientHello message contains a deterministic fingerprint of the client&#8217;s cipher suite ordering, extensions, and elliptic curves. Browsers have distinctive, stable JA3 hashes. A <code>curl</code> binary, a Python <code>requests</code> client, and a headless Chromium each produce different JA3 hashes&#8212;even if they all send <code>User-Agent: Mozilla/5.0</code>. 
This is one of the most reliable passive signals because it is observed during the TLS handshake, before any HTTP request is sent, and it takes real effort to spoof.</p><p><strong>HTTP/2 fingerprinting:</strong> H2 frames have ordering, priority weights, and SETTINGS values that differ between real browsers and bot libraries. A Chrome browser sending HTTP/2 produces a different SETTINGS frame than Go&#8217;s <code>net/http</code> client&#8212;even if both claim to be Chrome.</p><p><strong>Behavioral signals (client-side):</strong> JavaScript executed in the browser collects mouse movement entropy, scroll patterns, keyboard timing, touch event presence, WebGL renderer strings, canvas fingerprints, and AudioContext oscillator outputs. Real humans produce noisy, irregular interaction patterns. Bots replicate clicks and keystrokes with machine precision or not at all.</p><p><strong>Session behavioral signals (server-side):</strong> Request rate, page visit sequences, time-on-page distributions, and form interaction timing. A real user typically takes 15&#8211;90 seconds to fill a checkout form. A bot fills it in 200ms or exactly 5,000ms (hardcoded delay).</p><p>These signals feed a scoring pipeline. 
Individual signals carry low confidence; their combination&#8212;especially cross-referencing network signals with client behavioral signals&#8212;drives accuracy above 99% for most attack categories.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6E_K!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2da37543-c310-4c97-aeea-8f4b15cbeff1_4500x3100.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6E_K!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2da37543-c310-4c97-aeea-8f4b15cbeff1_4500x3100.png 424w, https://substackcdn.com/image/fetch/$s_!6E_K!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2da37543-c310-4c97-aeea-8f4b15cbeff1_4500x3100.png 848w, https://substackcdn.com/image/fetch/$s_!6E_K!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2da37543-c310-4c97-aeea-8f4b15cbeff1_4500x3100.png 1272w, https://substackcdn.com/image/fetch/$s_!6E_K!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2da37543-c310-4c97-aeea-8f4b15cbeff1_4500x3100.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6E_K!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2da37543-c310-4c97-aeea-8f4b15cbeff1_4500x3100.png" width="1456" height="1003" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2da37543-c310-4c97-aeea-8f4b15cbeff1_4500x3100.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1003,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2256788,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://systemdr.substack.com/i/190478624?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2da37543-c310-4c97-aeea-8f4b15cbeff1_4500x3100.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6E_K!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2da37543-c310-4c97-aeea-8f4b15cbeff1_4500x3100.png 424w, https://substackcdn.com/image/fetch/$s_!6E_K!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2da37543-c310-4c97-aeea-8f4b15cbeff1_4500x3100.png 848w, https://substackcdn.com/image/fetch/$s_!6E_K!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2da37543-c310-4c97-aeea-8f4b15cbeff1_4500x3100.png 1272w, https://substackcdn.com/image/fetch/$s_!6E_K!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2da37543-c310-4c97-aeea-8f4b15cbeff1_4500x3100.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p>
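<p>The scoring idea above can be sketched in a few lines. This is a minimal illustration, not how any particular vendor does it: the signal names, the weights, the two thresholds, and the noisy-OR combination rule are all assumptions chosen for the example. It shows why one weak signal passes while two or more together trigger a challenge or a block.</p>

```python
# Hypothetical per-signal weights: how strongly each signal alone suggests a bot.
# Real systems would tune or learn these; the numbers here are illustrative only.
WEIGHTS = {
    "datacenter_asn": 0.30,        # IP belongs to a hosting provider
    "ja3_mismatch": 0.35,          # TLS fingerprint disagrees with the claimed User-Agent
    "h2_settings_mismatch": 0.20,  # HTTP/2 SETTINGS frame disagrees with the claimed browser
    "no_pointer_entropy": 0.25,    # no mouse or touch noise collected client-side
    "fixed_form_timing": 0.25,     # form filled in a suspiciously exact interval
}

CHALLENGE_AT = 0.40  # assumed threshold: serve a challenge at or above this score
BLOCK_AT = 0.75      # assumed threshold: block outright at or above this score


def risk_score(signals):
    """Combine fired signals with a noisy-OR so weak signals reinforce each other."""
    pass_probability = 1.0
    for name, fired in signals.items():
        if fired:
            # Unknown signal names contribute nothing.
            pass_probability *= 1.0 - WEIGHTS.get(name, 0.0)
    return 1.0 - pass_probability


def decide(signals):
    """Map a risk score to one of three actions: allow, challenge, block."""
    score = risk_score(signals)
    if score >= BLOCK_AT:
        return "block"
    if score >= CHALLENGE_AT:
        return "challenge"
    return "allow"


if __name__ == "__main__":
    print(decide({"ja3_mismatch": True}))
    print(decide({"ja3_mismatch": True, "datacenter_asn": True}))
    print(decide({name: True for name in WEIGHTS}))
```

<p>With these assumed weights, a lone TLS mismatch stays under the challenge threshold, a TLS mismatch from a datacenter ASN gets challenged, and all five signals together get blocked. In a real pipeline the combination step would more likely be a trained model, but the shape (many low-confidence inputs, one score, tiered actions) stays the same.</p>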
      <p>
          <a href="https://systemdr.systemdrd.com/p/bot-detection-and-mitigation-identifying">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Design Stripe Payments — The Senior+ Walkthrough]]></title><description><![CDATA[Subscribe Now..]]></description><link>https://systemdr.systemdrd.com/p/design-stripe-payments-the-senior</link><guid isPermaLink="false">https://systemdr.systemdrd.com/p/design-stripe-payments-the-senior</guid><dc:creator><![CDATA[System Design Roadmap]]></dc:creator><pubDate>Tue, 02 Jun 2026 03:31:11 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!bctk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a21a344-0661-43eb-991b-190f9d3d22ec_4800x2700.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="callout-block" data-callout="true"><p style="text-align: justify;"><a href="https://systemdr.systemdrd.com/subscribe">Subscribe Now</a>.. Here&#8217;s what&#8217;s included:</p><p>&#8594; Every Tuesday paid walkthrough &#8212; one full named-question answer every week, calibrated to L4&#8211;L7. 52 posts per year.</p><p>&#8594; All drill cards &#8212; single-page PDFs for each question, designed to be re-read the morning of your interview</p><p>&#8594; The full paid archive &#8212; every walkthrough ever published, available immediately</p><p>&#8594; Framework cheatsheets &#8212; estimation numbers, API design patterns, trade-off communication, scaling decisions</p><p>&#8594; Discord community &#8212; #ask-the-author, mock interview partners, #offers-landed</p></div><blockquote><p style="text-align: justify;">This is the question that filters senior from staff in money-movement loops. The surface of the question &#8212; &#8220;design a payments processor&#8221; &#8212; looks ordinary. The probe underneath is one of the few places in system design interviews where being wrong has audit-trail consequences in production. 
Interviewers know this, which means they don&#8217;t accept the kind of hand-waving that flies on a feed-design question. You either understand idempotency, ledgers, and double-entry, or you don&#8217;t. The answer reveals it within five minutes.</p></blockquote><p>If you are interviewing at Stripe, Block, Adyen, PayPal, Robinhood, Coinbase, or any fintech doing real money movement &#8212; this is the question. If you&#8217;re at FAANG-tier companies whose products handle payments (Amazon, Apple, Google), expect a variant. The frame is always the same: a customer initiates a payment, the system has to debit one party and credit another, and <em>every single step</em> has to be reconcilable, retryable, and auditable.</p><p>For L4 / mid-level, the bar is the surface design. For L5 / senior, the probe is idempotency keys and the retry semantics around them. For L6 / staff, the probe is the ledger, double-entry consistency, and what happens when the upstream bank says yes-then-no.</p><h3>The Question</h3><blockquote><p>&#8220;Design a payment processing service like Stripe. Customers (merchants) integrate your API to charge their end-users&#8217; credit cards. The service handles the full flow: take card details, authorize the charge with the card network, settle the funds, and notify the merchant of the outcome.&#8221;</p></blockquote><p>The question is rarely asked exactly this way. Common variants:</p><blockquote><p>- &#8220;Design the core payment flow at a payments processor.&#8221;</p><p>- &#8220;Design a system to process credit card charges.&#8221;</p><p>- &#8220;How would you build the part of Stripe that charges a card?&#8221;</p><p>- &#8220;Walk me through what happens between <code>stripe.charges.create()</code> and money landing in the merchant&#8217;s account.&#8221;</p></blockquote><p>All four are the same question. Don&#8217;t be thrown by the framing.</p><h3>Step 1 &#8212; Clarify Before You Draw</h3><p>Five questions before you draw anything. 
Each one is loaded.</p><p><strong>1. Synchronous or asynchronous API? </strong>The merchant calls your endpoint and waits &#8212; <em>or </em>&#8212; the merchant gets a 202 Accepted and a webhook later? This single question determines half the architecture. Synchronous means your API call blocks on the card network round-trip (200ms&#8211;2s of bank latency on the hot path). Asynchronous means you queue the work and notify via webhook. Most production processors offer both modes; the synchronous mode is what merchants reach for first because it&#8217;s easier to integrate.</p><p><strong>2. What payment instruments? Cards only, or also bank transfers, wallets, and BNPL? </strong>Pin this down. Cards only is a different design from &#8220;any payment method.&#8221; Cards have card networks (Visa/Mastercard/Amex). Bank transfers go through rails such as ACH (US), SEPA (EU), and UPI (India), and can have settlement windows of hours to days. BNPL (Klarna/Afterpay) involves a credit decision. <em>Scope to cards for the core walkthrough; mention that a real system extends to other rails.</em></p><p><strong>3. Are we issuing the cards, or just accepting them? </strong>Issuing cards (like Stripe Issuing or Cash App&#8217;s debit card) is a totally different system &#8212; you&#8217;re now responsible for card production, fraud on outbound spend, and chargebacks against your own balance. Accepting is what Stripe Payments does and what most candidates mean. Confirm scope explicitly.</p><p><strong>4. Single currency or multi-currency? </strong>Multi-currency adds an FX layer, which adds rate-locking semantics (&#8220;the customer saw $20, but we settle in EUR at what rate?&#8221;) and accounting complexity. Most candidates assume USD-only. State the assumption out loud.</p><p><strong>5. What&#8217;s the SLO? </strong>Payment APIs typically target p99 &lt; 500ms for the synchronous flow. 
<em>And </em>99.99% availability &#8212; payments are revenue, and downtime is a direct loss for every merchant on the platform. State both numbers explicitly. They drive every later decision.</p><p>If you ask all five, you&#8217;ve already separated yourself from 80% of candidates. Most jump to drawing.</p><h3>Step 2 &#8212; Estimate</h3><p>Working assumptions for the rest of this walkthrough:</p><ul><li><p>10,000 transactions per second at peak. (Stripe processes more than this; pick a number with a buffer.)</p></li><li><p>100 million transactions per day average; 500 million on peak shopping days.</p></li><li><p>Average ticket: $50. So daily payment volume is ~$5B at average, ~$25B at peak.</p></li><li><p>99.99% availability target = about 52 minutes of downtime allowed per year.</p></li><li><p>p99 latency target: 500ms for the synchronous charge endpoint.</p></li></ul><p>Storage:</p><p>- Each transaction record is roughly 2 KB after enrichment.</p><p>- 100M/day &#215; 365 = 36.5B transactions/year &#215; 2 KB = ~73 TB/year.</p><p>- Retention: 7 years for financial records (PCI/regulatory). 7 &#215; 73 TB = ~511 TB, so call it ~500 TB total.</p><p>This is <em>big </em>but not unprecedented. The implication: your transactional store is sharded relational (Postgres, Spanner, CockroachDB) and your archive is cold object storage with a hot/warm/cold tier. You won&#8217;t keep 7 years in your hot OLTP path.</p><p>Network: 10K tps &#215; 2 KB = 20 MB/s. Trivial.</p><p>External dependencies &#8212; this is the unusual one for this question. You&#8217;re round-tripping to:</p><p>- Card networks (Visa, Mastercard) for authorization. ~200ms typical, can spike to 2s.</p><p>- Acquiring banks for settlement. Async; settlement is a batch process.</p><p>- Issuer banks for fraud signals. Sometimes inline, sometimes async.</p><p>- Fraud / risk services (could be in-house or external like Sift). ~50ms inline.</p><p>- Webhook delivery to merchants. 
Async, retried on failure.</p><p>The interviewer will probe this list. Knowing the names of these systems is what separates a senior who has thought about payments from a candidate who&#8217;s read one blog post.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://systemdr.systemdrd.com/subscribe&quot;,&quot;text&quot;:&quot;Get Access to GitHub Repo&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://systemdr.systemdrd.com/subscribe"><span>Get Access to GitHub Repo</span></a></p><h3>Step 3 &#8212; API Design</h3><p>Three endpoints. The first one is where the round is decided.</p><div class="callout-block" data-callout="true"><p>POST /v1/charges</p><p>Headers:</p><p>Authorization: Bearer sk_live_xxx</p><p>Idempotency-Key: &lt;client-generated UUID&gt; &#8592; THIS LINE IS THE PROBE</p><p>Body:</p><p>amount: integer (cents &#8212; never floats for money)</p><p>currency: string (lowercase ISO 4217 code: &#8220;usd&#8221;, &#8220;eur&#8221;)</p><p>source: string (tokenized card reference, e.g., &#8220;tok_xxx&#8221;)</p><p>description: string (optional)</p><p>metadata: object (merchant-defined)</p><p>Response: 200</p><p>id: &#8220;ch_xxx&#8221;</p><p>amount, currency, status: &#8220;succeeded&#8221; | &#8220;failed&#8221; | &#8220;pending&#8221;</p><p>outcome: { network_status, risk_level, seller_message }</p><p>GET /v1/charges/:id</p><p>Returns the current state of the charge. Idempotent by definition (HTTP GET).</p><p>POST /v1/refunds</p><p>Headers:</p><p>Idempotency-Key: &lt;client-generated UUID&gt;</p><p>Body:</p><p>charge_id: &#8220;ch_xxx&#8221;</p><p>amount: integer (defaults to full charge amount)</p><p>Returns: { id: &#8220;re_xxx&#8221;, amount, status }</p></div><p><strong>The senior move on this step: </strong>name the Idempotency-Key header out loud and explain why it&#8217;s there. 
Say something like: <em>&#8220;Charging a card is the highest-stakes API call most companies ever make. Network failures, retries, double-clicks &#8212; all of these can cause the merchant&#8217;s code to call our endpoint twice with the same intent. We have to detect that and return the cached response instead of charging twice. This is what the Idempotency-Key header does.&#8221;</em></p><p>If you say that sentence in a Stripe interview, you&#8217;ve passed the first probe. If you don&#8217;t mention idempotency by minute 10 of this question, you fail at minute 30 when the interviewer asks: &#8220;What happens if the merchant&#8217;s code retries this call?&#8221;</p><p><strong>Use integer cents, not floats. </strong>I&#8217;m calling it out because every junior candidate reaches for float for currency. Floating-point math has rounding errors: in IEEE 754 double precision, 0.1 + 0.2 evaluates to 0.30000000000000004, not 0.3. In production this causes accounting drift that takes weeks to find. Integer cents (or smaller &#8212; Stripe internally uses millicents for some flows) is the only correct answer. Saying this out loud is another senior signal.</p><h3>Step 4 &#8212; Data Model</h3><p>The OLTP record for the API call. 
Optimized for point lookups by id and by idempotency key.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!i15y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79d34cb4-2907-4f6f-b938-fc6ad24055a2_782x612.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!i15y!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79d34cb4-2907-4f6f-b938-fc6ad24055a2_782x612.png 424w, https://substackcdn.com/image/fetch/$s_!i15y!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79d34cb4-2907-4f6f-b938-fc6ad24055a2_782x612.png 848w, https://substackcdn.com/image/fetch/$s_!i15y!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79d34cb4-2907-4f6f-b938-fc6ad24055a2_782x612.png 1272w, https://substackcdn.com/image/fetch/$s_!i15y!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79d34cb4-2907-4f6f-b938-fc6ad24055a2_782x612.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!i15y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79d34cb4-2907-4f6f-b938-fc6ad24055a2_782x612.png" width="782" height="612" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/79d34cb4-2907-4f6f-b938-fc6ad24055a2_782x612.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:612,&quot;width&quot;:782,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:71904,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://systemdr.systemdrd.com/i/199698022?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79d34cb4-2907-4f6f-b938-fc6ad24055a2_782x612.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!i15y!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79d34cb4-2907-4f6f-b938-fc6ad24055a2_782x612.png 424w, https://substackcdn.com/image/fetch/$s_!i15y!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79d34cb4-2907-4f6f-b938-fc6ad24055a2_782x612.png 848w, https://substackcdn.com/image/fetch/$s_!i15y!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79d34cb4-2907-4f6f-b938-fc6ad24055a2_782x612.png 1272w, https://substackcdn.com/image/fetch/$s_!i15y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79d34cb4-2907-4f6f-b938-fc6ad24055a2_782x612.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WbjB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F700fd374-9bba-4a30-81ef-e21bb469aeec_767x110.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WbjB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F700fd374-9bba-4a30-81ef-e21bb469aeec_767x110.png 424w, https://substackcdn.com/image/fetch/$s_!WbjB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F700fd374-9bba-4a30-81ef-e21bb469aeec_767x110.png 848w, 
https://substackcdn.com/image/fetch/$s_!WbjB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F700fd374-9bba-4a30-81ef-e21bb469aeec_767x110.png 1272w, https://substackcdn.com/image/fetch/$s_!WbjB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F700fd374-9bba-4a30-81ef-e21bb469aeec_767x110.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WbjB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F700fd374-9bba-4a30-81ef-e21bb469aeec_767x110.png" width="767" height="110" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/700fd374-9bba-4a30-81ef-e21bb469aeec_767x110.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:110,&quot;width&quot;:767,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:24177,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://systemdr.systemdrd.com/i/199698022?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F700fd374-9bba-4a30-81ef-e21bb469aeec_767x110.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!WbjB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F700fd374-9bba-4a30-81ef-e21bb469aeec_767x110.png 424w, https://substackcdn.com/image/fetch/$s_!WbjB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F700fd374-9bba-4a30-81ef-e21bb469aeec_767x110.png 
848w, https://substackcdn.com/image/fetch/$s_!WbjB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F700fd374-9bba-4a30-81ef-e21bb469aeec_767x110.png 1272w, https://substackcdn.com/image/fetch/$s_!WbjB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F700fd374-9bba-4a30-81ef-e21bb469aeec_767x110.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><div class="callout-block" data-callout="true"><p><strong>Preparing for a distributed systems interview?</strong><br>&#8594;<strong><a href="https://systemdrd.com/ebooks/sdcourse-distributed-systems-interview">Download the free Interview Pack</a></strong><br><strong>&#8594; <a href="https://systemdr.systemdrd.com/subscribe">Subscribe</a> now to access source code repository - 200 + coding lessons</strong></p></div>
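<p>The two points above, idempotency keys and integer cents, can be sketched in a few lines. This is a minimal illustration, not Stripe&#8217;s implementation: <code>PaymentService</code>, <code>ChargeResult</code>, and the in-memory dictionary are hypothetical stand-ins for a real service and a durable idempotency store.</p>

```python
from dataclasses import dataclass, field


@dataclass
class ChargeResult:
    charge_id: str
    amount_cents: int  # integer minor units, never a float
    currency: str


@dataclass
class PaymentService:
    # Hypothetical in-memory stand-in for a durable idempotency store.
    _responses: dict = field(default_factory=dict)
    _charges_made: int = 0

    def create_charge(self, idempotency_key: str, amount_cents: int, currency: str) -> ChargeResult:
        # Reject floats (and bools) so rounding error never enters the ledger.
        if not isinstance(amount_cents, int) or isinstance(amount_cents, bool):
            raise TypeError("amount_cents must be an integer number of minor units")
        cached = self._responses.get(idempotency_key)
        if cached is not None:
            # Retry or double-click: return the stored response, charge nothing.
            return cached
        self._charges_made += 1
        result = ChargeResult(f"ch_{self._charges_made}", amount_cents, currency)
        self._responses[idempotency_key] = result
        return result


service = PaymentService()
first = service.create_charge("key-123", 1999, "usd")
retry = service.create_charge("key-123", 1999, "usd")  # same key, same intent

float_sum_matches = (0.10 + 0.20 == 0.30)  # False: binary rounding error
cents_sum_matches = (10 + 20 == 30)        # True: integer arithmetic is exact
```

<p>The retry returns the same <code>ChargeResult</code> object and the charge counter stays at one. A production version would also have to handle a key reused with different parameters, concurrent requests carrying the same key, and expiry of stored responses, all of which this sketch leaves out.</p>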
      <p>
          <a href="https://systemdr.systemdrd.com/p/design-stripe-payments-the-senior">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Consistent Hashing: Why It Matters and How It Works]]></title><description><![CDATA[There are two types of engineers who fail the distributed systems section of a design interview.]]></description><link>https://systemdr.systemdrd.com/p/consistent-hashing-why-it-matters</link><guid isPermaLink="false">https://systemdr.systemdrd.com/p/consistent-hashing-why-it-matters</guid><dc:creator><![CDATA[System Design Roadmap]]></dc:creator><pubDate>Sun, 31 May 2026 03:30:54 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!NXk3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc36c2686-33e8-4ca0-97f8-4b19e7a92a36_5500x3080.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>There are two types of engineers who fail the distributed systems section of a design interview.</p><p>The first type doesn&#8217;t know what consistent hashing is. The second type knows the definition but can&#8217;t explain <em>why</em> modulo hashing fails or <em>what problem</em> consistent hashing actually solves.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://systemdr.systemdrd.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">System Design Interview Roadmap is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>The second type fails more often. Knowing the name without the reasoning is worse than knowing nothing, because it telegraphs that you&#8217;ve memorized vocabulary rather than understood the underlying problem.</p><p>Here&#8217;s the underlying problem.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NXk3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc36c2686-33e8-4ca0-97f8-4b19e7a92a36_5500x3080.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NXk3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc36c2686-33e8-4ca0-97f8-4b19e7a92a36_5500x3080.png 424w, https://substackcdn.com/image/fetch/$s_!NXk3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc36c2686-33e8-4ca0-97f8-4b19e7a92a36_5500x3080.png 848w, https://substackcdn.com/image/fetch/$s_!NXk3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc36c2686-33e8-4ca0-97f8-4b19e7a92a36_5500x3080.png 1272w, 
https://substackcdn.com/image/fetch/$s_!NXk3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc36c2686-33e8-4ca0-97f8-4b19e7a92a36_5500x3080.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NXk3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc36c2686-33e8-4ca0-97f8-4b19e7a92a36_5500x3080.png" width="1456" height="815" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c36c2686-33e8-4ca0-97f8-4b19e7a92a36_5500x3080.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:815,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:940155,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://systemdr.systemdrd.com/i/198935755?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc36c2686-33e8-4ca0-97f8-4b19e7a92a36_5500x3080.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NXk3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc36c2686-33e8-4ca0-97f8-4b19e7a92a36_5500x3080.png 424w, https://substackcdn.com/image/fetch/$s_!NXk3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc36c2686-33e8-4ca0-97f8-4b19e7a92a36_5500x3080.png 848w, 
https://substackcdn.com/image/fetch/$s_!NXk3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc36c2686-33e8-4ca0-97f8-4b19e7a92a36_5500x3080.png 1272w, https://substackcdn.com/image/fetch/$s_!NXk3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc36c2686-33e8-4ca0-97f8-4b19e7a92a36_5500x3080.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>The naive approach: modulo hashing</strong></p><p>You have 4 servers and a cache key. 
You run <code>hash(key) % 4</code> and get a server number. Simple. Fast. Evenly distributed.</p><p>Now one of your 4 servers dies. You have 3 servers. Roughly three out of four keys now map to a different server, because <code>hash(key) % 3</code> agrees with <code>hash(key) % 4</code> for only about a quarter of hash values.</p><p>Your cache hit rate collapses. Most lookups miss and hit the database. Your database falls over. This is a real production incident pattern.</p><p>Same problem in reverse: you add a fifth server to handle load. About 80% of keys remap. Most cache entries are now on the wrong server. The cache is effectively cold until it warms up. If you&#8217;re adding a server because you&#8217;re under load, this is the worst possible time to invalidate most of your cache.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://systemdr.systemdrd.com/subscribe&quot;,&quot;text&quot;:&quot;Subscribe for Question Walkthrough&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://systemdr.systemdrd.com/subscribe"><span>Subscribe for Question Walkthrough</span></a></p><p><strong>The consistent hashing solution</strong></p><p>Consistent hashing puts both servers and keys on an imaginary ring numbered 0 to 2&#179;&#178; &#8722; 1. To find which server owns a key: hash the key, find its position on the ring, walk clockwise until you hit a server.</p><p>When a server is added or removed, only the keys between the new server and its predecessor on the ring need to move. For N servers, adding or removing one server moves approximately 1/N of the keys. Not all of them. 1/N.</p><p>That&#8217;s the entire point. The damage from topology changes is contained.</p><p><strong>Virtual nodes: the production detail</strong></p><p>Naive consistent hashing has a problem: if you hash 4 server IDs onto a ring, they won&#8217;t land evenly spaced. Some servers end up owning much more of the ring than others.
When one server dies, all of its load dumps onto a single neighbor &#8212; the one clockwise from it.</p><p>Virtual nodes fix this. Instead of placing each server once on the ring, you place it 100&#8211;200 times (with different hash inputs like <code>server1-0</code>, <code>server1-1</code>, <code>server1-2</code>...). Each physical server now has 100&#8211;200 positions spread around the ring. Load is distributed evenly. When a server dies, its load spreads across every other server, not just one neighbor.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!43Nu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e9d4953-b38f-415f-b3db-2025e2f9edb4_5500x3080.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!43Nu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e9d4953-b38f-415f-b3db-2025e2f9edb4_5500x3080.png 424w, https://substackcdn.com/image/fetch/$s_!43Nu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e9d4953-b38f-415f-b3db-2025e2f9edb4_5500x3080.png 848w, https://substackcdn.com/image/fetch/$s_!43Nu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e9d4953-b38f-415f-b3db-2025e2f9edb4_5500x3080.png 1272w, https://substackcdn.com/image/fetch/$s_!43Nu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e9d4953-b38f-415f-b3db-2025e2f9edb4_5500x3080.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!43Nu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e9d4953-b38f-415f-b3db-2025e2f9edb4_5500x3080.png" width="1456" height="815" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3e9d4953-b38f-415f-b3db-2025e2f9edb4_5500x3080.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:815,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:933373,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://systemdr.systemdrd.com/i/198935755?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e9d4953-b38f-415f-b3db-2025e2f9edb4_5500x3080.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!43Nu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e9d4953-b38f-415f-b3db-2025e2f9edb4_5500x3080.png 424w, https://substackcdn.com/image/fetch/$s_!43Nu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e9d4953-b38f-415f-b3db-2025e2f9edb4_5500x3080.png 848w, https://substackcdn.com/image/fetch/$s_!43Nu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e9d4953-b38f-415f-b3db-2025e2f9edb4_5500x3080.png 1272w, https://substackcdn.com/image/fetch/$s_!43Nu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e9d4953-b38f-415f-b3db-2025e2f9edb4_5500x3080.png 1456w" 
sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Why this matters in an interview</strong></p><p>Consistent hashing shows up in at least 8 of the 52 most common system design questions: distributed caches, rate limiters, database sharding, CDN routing, load balancers, distributed file systems.</p><p>When you say &#8220;I&#8217;d shard by user ID&#8221; without specifying <em>how</em>, you&#8217;re leaving a gap the interviewer is likely to probe. 
When you say &#8220;I&#8217;d use consistent hashing with 150 virtual nodes per server to prevent the hot-neighbor problem on topology changes&#8221; &#8212; that&#8217;s the L5 signal for any infrastructure question.</p><p>The rule: anytime you&#8217;re distributing data across multiple nodes, consistent hashing is worth naming explicitly. Even if it&#8217;s not the exact implementation, mentioning it signals you understand that modulo hashing breaks under real operational conditions.</p><h2><strong>Subscription link</strong></h2><p><a href="https://systemdr.systemdrd.com/subscribe">https://systemdr.systemdrd.com/subscribe</a></p><p>&#8212;Sumedh</p><h4><strong>Want to explore this topic further?</strong></h4><p>The paid version includes detailed walkthroughs, bonus resources, and hands-on exercises.</p><div class="callout-block" data-callout="true"><p>The Question Vault has all 52 walkthroughs organized by archetype &#8212; so you can see the pattern across questions, not just the surface answer.</p><p><strong>Access all 52 Questions <a href="https://systemdrd.com/ebooks/52-faang-questions-drillcards-cheatsheets/">here</a></strong></p></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://systemdr.systemdrd.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">System Design Interview Roadmap is a reader-supported publication.
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Designing for Data Compliance — Automated PII Redaction in Logs and Backups]]></title><description><![CDATA[Introduction]]></description><link>https://systemdr.systemdrd.com/p/designing-for-data-compliance-automated</link><guid isPermaLink="false">https://systemdr.systemdrd.com/p/designing-for-data-compliance-automated</guid><dc:creator><![CDATA[System Design Roadmap]]></dc:creator><pubDate>Fri, 29 May 2026 08:30:52 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!So7h!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19979074-3728-486b-9995-976862736860_4500x3100.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3>Introduction</h3><blockquote><p>Your on-call alert fires at 2 AM. A junior engineer debugging a payment failure copies a log snippet into Slack to get help. That log contains a customer&#8217;s full credit card number, email address, and home address. It&#8217;s now sitting in a SaaS messaging platform&#8217;s servers, outside your security perimeter, accessible to dozens of people &#8212; and you&#8217;ve simultaneously violated GDPR Article 32, PCI-DSS Requirement 3, and your customer&#8217;s trust. The engineer did nothing malicious. They did exactly what engineers do. The system failed to protect them from the failure mode.</p></blockquote><p>Automated PII redaction in logs and backups is not a nice-to-have. 
It is the structural defense that makes compliance survivable at scale.</p><div><hr></div><h2>How PII Ends Up in Logs</h2><p>The mechanism is mundane: developers log objects. A <code>User</code> object gets serialized into a log line during an exception handler, and suddenly <code>{name: "Jane Doe", ssn: "123-45-6789", email: "jane@example.com"}</code> is sitting in your ELK stack, replicated to three availability zones, backed up nightly, and retained for 90 days.</p><p>Five primary ingestion vectors drive the majority of PII leakage:</p><p><strong>Serialized exception payloads</strong> &#8212; Stack traces that include request bodies. This is the most common source. An unhandled exception in a payment service dumps the entire deserialized request, which contains cardholder data.</p><p><strong>ORM query logging</strong> &#8212; Hibernate, ActiveRecord, and SQLAlchemy can log full SQL statements including bound parameters. <code>WHERE email = 'jane@example.com'</code> appears in your slow query log.</p><p><strong>Distributed trace spans</strong> &#8212; OpenTelemetry and Jaeger spans often carry HTTP headers (including <code>Authorization</code> tokens) and request attributes engineers attached for debugging.</p><p><strong>Backup streams</strong> &#8212; Database logical replication logs (WAL in Postgres, binlog in MySQL) capture every row mutation and stream to S3 or GCS, frequently exempted from the scrubbing pipeline that handles application logs.</p><p><strong>Third-party SDK payloads</strong> &#8212; Analytics, error-tracking (Sentry, Datadog), and A/B testing SDKs serialize event contexts that may include user identity fields.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!So7h!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19979074-3728-486b-9995-976862736860_4500x3100.png" 
data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!So7h!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19979074-3728-486b-9995-976862736860_4500x3100.png 424w, https://substackcdn.com/image/fetch/$s_!So7h!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19979074-3728-486b-9995-976862736860_4500x3100.png 848w, https://substackcdn.com/image/fetch/$s_!So7h!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19979074-3728-486b-9995-976862736860_4500x3100.png 1272w, https://substackcdn.com/image/fetch/$s_!So7h!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19979074-3728-486b-9995-976862736860_4500x3100.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!So7h!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19979074-3728-486b-9995-976862736860_4500x3100.png" width="1456" height="1003" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/19979074-3728-486b-9995-976862736860_4500x3100.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1003,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1221922,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://systemdr.substack.com/i/190367955?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19979074-3728-486b-9995-976862736860_4500x3100.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!So7h!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19979074-3728-486b-9995-976862736860_4500x3100.png 424w, https://substackcdn.com/image/fetch/$s_!So7h!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19979074-3728-486b-9995-976862736860_4500x3100.png 848w, https://substackcdn.com/image/fetch/$s_!So7h!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19979074-3728-486b-9995-976862736860_4500x3100.png 1272w, https://substackcdn.com/image/fetch/$s_!So7h!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19979074-3728-486b-9995-976862736860_4500x3100.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div>
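<p>To make the leakage path above concrete, here is a minimal sketch of a pattern-based scrubber applied to a log line before it leaves the process. The patterns and the <code>redact</code> helper are illustrative assumptions, not a complete detector: real pipelines need broader coverage, validation such as a Luhn check on card candidates, and tuning against false positives.</p>

```python
import re

# Illustrative patterns only; real detectors need broader coverage and tuning.
PII_PATTERNS = [
    ("EMAIL", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    # 13 to 19 digits, optionally separated by single spaces or hyphens.
    ("CARD", re.compile(r"\b\d(?:[ -]?\d){12,18}\b")),
]


def redact(line: str) -> str:
    """Replace each matched PII span with a typed placeholder."""
    for label, pattern in PII_PATTERNS:
        line = pattern.sub(f"[REDACTED_{label}]", line)
    return line


raw = (
    'payment failed user={name: "Jane Doe", ssn: "123-45-6789", '
    'email: "jane@example.com"} card=4111 1111 1111 1111'
)
clean = redact(raw)
```

<p>Note what this does not catch: the name <code>Jane Doe</code> passes through untouched, because free-text names have no reliable pattern. That gap is why pattern matching is usually paired with field-level rules (redact anything under keys like <code>name</code> or <code>address</code>) rather than used alone.</p>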
      <p>
          <a href="https://systemdr.systemdrd.com/p/designing-for-data-compliance-automated">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[ Design a URL Shortener]]></title><description><![CDATA[This is the question that gets butchered more than any other in system design prep.]]></description><link>https://systemdr.systemdrd.com/p/design-a-url-shortener</link><guid isPermaLink="false">https://systemdr.systemdrd.com/p/design-a-url-shortener</guid><dc:creator><![CDATA[System Design Roadmap]]></dc:creator><pubDate>Tue, 26 May 2026 03:31:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!CNP5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93ea80c7-679e-4898-a21c-397643f35885_5500x4125.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote><p>This is the question that gets butchered more than any other in system design prep. Most resources hand you a 5-minute toy answer suitable for a phone screen and call it done. In a real onsite, this question runs 50 to 60 minutes, and the interviewer is using it to probe at least four distinct skills: capacity reasoning, ID generation under distributed constraints, caching strategy, and your awareness of failure modes. Miss two of the four and you fail the round, regardless of how clean your boxes-and-arrows came out.</p></blockquote><p>This walkthrough is the senior version. If you are prepping for L4 / mid-level the bar is lower &#8212; you can stop after Step 5. For L5 / senior and L6 / staff, the deep dives are where the round is won or lost.</p><div><hr></div><h2>The Question</h2><blockquote><p>&#8220;Design a URL shortener service like bit.ly. The service takes a long URL and returns a shorter alias that, when accessed, redirects to the original.&#8221;</p></blockquote><p>This question turns up in essentially every system design loop &#8212; Amazon, Google, Twitter / X, Reddit, Stripe, Pinterest, and most Series B+ startups doing senior-and-up hires. 
It is popular because it is a bounded problem with multiple defensible architectures, which makes it a good lens on how you think rather than what you have memorized.</p><div><hr></div><h2>Step 1 &#8212; Clarify Before You Draw</h2><p>Three questions before you draw a single box. Ask these out loud, even if you think you can guess the answer. Spending ninety seconds here is what produces the senior signal.</p><p><strong>1. Read-to-write ratio?</strong> The almost-universal answer is somewhere between 100:1 and 1000:1. Most short URLs get clicked many more times than they get created. This number drives every caching decision later in the round, so you need it pinned down.</p><p><strong>2. Custom short codes allowed, or system-generated only?</strong> This question splits the design in half. System-generated only: you can use auto-incrementing IDs and have no collision problem at all. Custom codes allowed: you have a uniqueness check on every write, plus a hot-key problem because everyone wants short, memorable aliases like /sale or /promo.</p><p><strong>3. Analytics required? Click counts, geographic data, referrer tracking?</strong> If yes, this is no longer a key-value problem. It becomes a key-value problem plus an event pipeline plus an aggregation system. The interviewer is testing whether you spot that the analytics requirement reshapes the architecture.</p><p>A senior candidate writes the answers to these three questions on the whiteboard. 
A junior candidate skips this step, designs for the wrong assumptions, and has to restart twenty minutes in.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CNP5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93ea80c7-679e-4898-a21c-397643f35885_5500x4125.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CNP5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93ea80c7-679e-4898-a21c-397643f35885_5500x4125.png 424w, https://substackcdn.com/image/fetch/$s_!CNP5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93ea80c7-679e-4898-a21c-397643f35885_5500x4125.png 848w, https://substackcdn.com/image/fetch/$s_!CNP5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93ea80c7-679e-4898-a21c-397643f35885_5500x4125.png 1272w, https://substackcdn.com/image/fetch/$s_!CNP5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93ea80c7-679e-4898-a21c-397643f35885_5500x4125.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CNP5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93ea80c7-679e-4898-a21c-397643f35885_5500x4125.png" width="1456" height="1092" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/93ea80c7-679e-4898-a21c-397643f35885_5500x4125.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1092,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1578009,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://systemdr.systemdrd.com/i/199143482?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93ea80c7-679e-4898-a21c-397643f35885_5500x4125.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CNP5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93ea80c7-679e-4898-a21c-397643f35885_5500x4125.png 424w, https://substackcdn.com/image/fetch/$s_!CNP5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93ea80c7-679e-4898-a21c-397643f35885_5500x4125.png 848w, https://substackcdn.com/image/fetch/$s_!CNP5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93ea80c7-679e-4898-a21c-397643f35885_5500x4125.png 1272w, https://substackcdn.com/image/fetch/$s_!CNP5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93ea80c7-679e-4898-a21c-397643f35885_5500x4125.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h2>Step 2 &#8212; Estimate</h2><p>Pick numbers that justify your architecture. Imprecision is fine; sloppiness is not.</p><p>Working assumptions for the rest of this walkthrough:</p><ul><li><p>100M new URLs created per month, which gives roughly 40 writes per second on average, with peaks 5 to 10x higher</p></li><li><p>100:1 read-to-write ratio, giving roughly 4,000 reads per second on average, with peaks above 40K</p></li><li><p>5-year retention horizon</p></li><li><p>Average record size: 100 bytes for the long URL, 7 bytes for the short code, plus metadata &#8212; call it 500 bytes per record</p></li></ul><p>Storage: 100M &#215; 12 &#215; 5 = 6 billion URLs over five years. At 500 bytes per record that is 3 TB. This fits comfortably on a single sharded relational database. 
You do not need anything exotic.</p><p>Bandwidth: 4,000 reads per second &#215; 500 bytes is about 2 MB/s of read traffic. Trivial.</p><p>The estimation is not decoration. You will reference these numbers four more times before the round ends &#8212; when you justify caching, when you justify sharding, when the interviewer asks &#8220;what if we 100x&#8217;d the traffic?&#8221; Every later trade-off comes back to these inputs.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://systemdr.systemdrd.com/subscribe&quot;,&quot;text&quot;:&quot;Get Access to GitHub Repo&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://systemdr.systemdrd.com/subscribe"><span>Get Access to GitHub Repo</span></a></p>
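<p>As a rough check on the Step 2 arithmetic, the working assumptions can be written as a few lines of Python. This is only a sketch: the function name, the field names, and the 30-day month are assumptions made for illustration, not part of the original walkthrough.</p>

```python
# Back-of-envelope capacity estimate for the URL shortener.
# Inputs are the working assumptions listed in Step 2; the function and
# field names are illustrative only.

SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumes a 30-day month


def estimate_capacity(new_urls_per_month=100_000_000,
                      reads_per_write=100,
                      retention_years=5,
                      bytes_per_record=500):
    """Return average request rates and total storage as a dict."""
    writes_per_sec = new_urls_per_month / SECONDS_PER_MONTH
    reads_per_sec = writes_per_sec * reads_per_write
    total_records = new_urls_per_month * 12 * retention_years
    return {
        "writes_per_sec": writes_per_sec,
        "reads_per_sec": reads_per_sec,
        "total_records": total_records,
        "storage_tb": total_records * bytes_per_record / 1e12,  # decimal TB
        "read_mb_per_sec": reads_per_sec * bytes_per_record / 1e6,
    }


if __name__ == "__main__":
    for name, value in estimate_capacity().items():
        print(f"{name}: {value:,.1f}")
```

<p>With the default inputs this gives 6 billion records and 3 TB of storage, with average rates just under the rounded 40 writes and 4,000 reads per second used above. Changing one argument, for example <code>new_urls_per_month=10_000_000_000</code>, is a quick way to rehearse the 100x follow-up: storage grows to about 300 TB, which is the point where the single-database assumption deserves a second look.</p>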
      <p>
          <a href="https://systemdr.systemdrd.com/p/design-a-url-shortener">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[What Interviewers Actually Score in System Design Rounds]]></title><description><![CDATA[Every engineer who has failed a system design round and wanted to understand why has hit the same wall: the feedback is generic, vague, or nonexistent.]]></description><link>https://systemdr.systemdrd.com/p/what-interviewers-actually-score</link><guid isPermaLink="false">https://systemdr.systemdrd.com/p/what-interviewers-actually-score</guid><dc:creator><![CDATA[System Design Roadmap]]></dc:creator><pubDate>Sun, 24 May 2026 03:30:52 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!RpZR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7ff6632-2ac7-49be-aafa-8fcf4e096dd0_4000x2250.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Every engineer who has failed a system design round and wanted to understand why has hit the same wall: the feedback is generic, vague, or nonexistent.</p><p>&#8220;Didn&#8217;t demonstrate sufficient depth.&#8221; &#8220;Wanted to see stronger technical judgment.&#8221; &#8220;Good fundamentals but the design wasn&#8217;t production-ready.&#8221;</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://systemdr.systemdrd.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">System Design Interview Roadmap is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>None of these tells you what to fix. They&#8217;re descriptions of an outcome, not a diagnosis.</p><p>After building prep material for engineers targeting L4&#8211;L7 at FAANG-tier companies, I&#8217;ve spent a lot of time understanding what interviewers actually score &#8212; not what they say they score, not what the official rubric says, but what the scorecard entries look like in practice.</p><p>Here&#8217;s what I&#8217;ve found.</p><div><hr></div><h2>The scoring dimensions most engineers don&#8217;t know about</h2><p>The system design interview at most major tech companies is scored on 5&#8211;8 explicit dimensions. Almost everyone knows about two of them:</p><p><strong>Technical correctness</strong> &#8212; did the candidate&#8217;s design actually solve the stated problem? 
Could it work in production?</p><p><strong>Depth</strong> &#8212; did the candidate go beyond the surface answer on at least one component?</p><p>The dimensions that separate candidates who barely pass from candidates who get &#8220;Strong Hire&#8221; are the ones most prep resources ignore:</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://systemdr.systemdrd.com/subscribe&quot;,&quot;text&quot;:&quot;Subscribe for Question Walkthrough&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://systemdr.systemdrd.com/subscribe"><span>Subscribe for Question Walkthrough</span></a></p><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RpZR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7ff6632-2ac7-49be-aafa-8fcf4e096dd0_4000x2250.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RpZR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7ff6632-2ac7-49be-aafa-8fcf4e096dd0_4000x2250.png 424w, https://substackcdn.com/image/fetch/$s_!RpZR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7ff6632-2ac7-49be-aafa-8fcf4e096dd0_4000x2250.png 848w, https://substackcdn.com/image/fetch/$s_!RpZR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7ff6632-2ac7-49be-aafa-8fcf4e096dd0_4000x2250.png 1272w, 
https://substackcdn.com/image/fetch/$s_!RpZR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7ff6632-2ac7-49be-aafa-8fcf4e096dd0_4000x2250.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RpZR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7ff6632-2ac7-49be-aafa-8fcf4e096dd0_4000x2250.png" width="578" height="325.125" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a7ff6632-2ac7-49be-aafa-8fcf4e096dd0_4000x2250.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:578,&quot;bytes&quot;:557012,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://systemdr.systemdrd.com/i/198933214?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7ff6632-2ac7-49be-aafa-8fcf4e096dd0_4000x2250.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!RpZR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7ff6632-2ac7-49be-aafa-8fcf4e096dd0_4000x2250.png 424w, https://substackcdn.com/image/fetch/$s_!RpZR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7ff6632-2ac7-49be-aafa-8fcf4e096dd0_4000x2250.png 848w, 
https://substackcdn.com/image/fetch/$s_!RpZR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7ff6632-2ac7-49be-aafa-8fcf4e096dd0_4000x2250.png 1272w, https://substackcdn.com/image/fetch/$s_!RpZR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7ff6632-2ac7-49be-aafa-8fcf4e096dd0_4000x2250.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Problem scoping</strong></p><p>The interviewer is watching whether you negotiate the scope of the problem before you start 
designing. Not ask questions &#8212; <em>negotiate scope</em>.</p><p>There&#8217;s a difference between &#8220;what&#8217;s the expected scale?&#8221; (a question) and &#8220;I&#8217;m going to focus on the timeline and the fan-out architecture, acknowledge that notifications and search exist, and go deep on the hardest part &#8212; does that work for you?&#8221; (a negotiation).</p><p>Candidates who score well on this dimension don&#8217;t try to design everything. They explicitly choose what they&#8217;re designing and why, and they do this in the first 3&#8211;5 minutes.</p><p>Candidates who score poorly try to design everything and end up designing nothing deeply. The 50-minute interview is not enough time to produce depth across a system with the surface area of Twitter. Everyone who attempts it runs out of time before the interesting parts.</p><div><hr></div><p><strong>Communication clarity</strong></p><p>This dimension is scored throughout the round, not at any specific moment. The question: can the interviewer follow your reasoning in real time?</p><p>The failure mode is surprisingly common among technically strong candidates: they know what they&#8217;re building, but they don&#8217;t say it out loud. They draw boxes on the whiteboard and connect them with arrows, and the interviewer has to ask &#8220;why did you connect those two boxes?&#8221; to understand the design decision.</p><p>The fix is specific: every decision should be stated before it&#8217;s drawn. Not &#8220;here&#8217;s a Redis cache&#8221; while drawing &#8212; &#8220;I&#8217;m going to put a Redis cache in front of Postgres here because the read:write ratio is 100:1 and the hot read pattern is point lookups by user_id, which Redis handles well at sub-millisecond latency. So here&#8217;s the cache.&#8221; The decision, then the reasoning, then the diagram.</p><p>This sounds slow. It&#8217;s actually faster, because the interviewer is never confused about what you&#8217;re doing or why. 
Confusion is the time-killer, not explanation.</p><div><hr></div><p><strong>Operational maturity</strong></p><p>This is the dimension that most separates L5 from L6 scorecards.</p><p>An answer that lacks operational maturity designs a system that works on the happy path. Everything goes right. The database responds. The messages are delivered. The cache is warm.</p><p>An answer with operational maturity designs a system and then asks: <em>what happens when it breaks?</em></p><p>&#8220;The failure mode here is the thundering herd when the Redis node recovers. Without jitter on the retry logic, every request that failed during the outage retries simultaneously the moment Redis comes back. I&#8217;d add exponential backoff with &#177;20% jitter to spread the retry load.&#8221;</p><p>&#8220;I&#8217;m targeting p99 &lt; 100ms for the feed read. The leading indicator I&#8217;d alert on is cache hit ratio dropping below 90% &#8212; that tells me Postgres load is about to spike before users start feeling degraded latency.&#8221;</p><p>These additions take 30 seconds to say. They produce a measurable difference in how the round is scored.</p><div><hr></div><p><strong>Level calibration</strong></p><p>Interviewers are calibrating your answer against the level you&#8217;re being hired for. A brilliant answer that&#8217;s calibrated for L4 will not earn a Strong Hire at L6. An adequate answer calibrated correctly for L5 can earn a Hire at L5.</p><p>This is worth saying explicitly: you don&#8217;t need to demonstrate the maximum possible depth. You need to demonstrate the right depth for your level.</p><p>At L4, the bar is: run the framework correctly. Estimate. Design the read and write paths. Mention at least one deep dive.</p><p>At L5, the bar is: defend your decisions under probing. Name failure modes. Cover the hard part of the question &#8212; the thing that makes this question different from the trivial version.</p><p>At L6, the bar is: SLOs with monitoring signals. 
Cross-system trade-offs. Operational layer. The three things above, plus the &#8220;what would tell me this design is failing&#8221; for each major component.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DtIR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faddcc91e-7809-4056-99b9-14f8b1e58478_4000x2250.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DtIR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faddcc91e-7809-4056-99b9-14f8b1e58478_4000x2250.png 424w, https://substackcdn.com/image/fetch/$s_!DtIR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faddcc91e-7809-4056-99b9-14f8b1e58478_4000x2250.png 848w, https://substackcdn.com/image/fetch/$s_!DtIR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faddcc91e-7809-4056-99b9-14f8b1e58478_4000x2250.png 1272w, https://substackcdn.com/image/fetch/$s_!DtIR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faddcc91e-7809-4056-99b9-14f8b1e58478_4000x2250.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DtIR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faddcc91e-7809-4056-99b9-14f8b1e58478_4000x2250.png" width="546" height="307.125" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/addcc91e-7809-4056-99b9-14f8b1e58478_4000x2250.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:546,&quot;bytes&quot;:578846,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://systemdr.systemdrd.com/i/198933214?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faddcc91e-7809-4056-99b9-14f8b1e58478_4000x2250.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DtIR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faddcc91e-7809-4056-99b9-14f8b1e58478_4000x2250.png 424w, https://substackcdn.com/image/fetch/$s_!DtIR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faddcc91e-7809-4056-99b9-14f8b1e58478_4000x2250.png 848w, https://substackcdn.com/image/fetch/$s_!DtIR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faddcc91e-7809-4056-99b9-14f8b1e58478_4000x2250.png 1272w, https://substackcdn.com/image/fetch/$s_!DtIR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faddcc91e-7809-4056-99b9-14f8b1e58478_4000x2250.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>Knowing your level and calibrating your answer to it is itself a scored dimension.</p><div><hr></div><h2>The one thing most guides don&#8217;t tell you</h2><p>The design round is not a knowledge test. It&#8217;s a judgment test.</p><p>Two candidates can produce completely different architectures for the same question and both get Strong Hire. Two candidates can produce very similar architectures and one gets Strong Hire while the other gets No Hire. The architecture is the medium. The judgment &#8212; in how you scope, how you communicate, how you reason under pressure, how you handle failure cases &#8212; is what&#8217;s being evaluated.</p><p>This changes how you should prepare.</p><p>Memorizing architectures is necessary. It&#8217;s not sufficient. 
The prep that moves the needle is timed practice under interview conditions, with feedback specifically on the four dimensions above &#8212; not just on whether your architecture was technically correct.</p><div><hr></div><p>Every Tuesday, this newsletter publishes one full named-question walkthrough: the exact answer that would earn a Hire signal at your target level, plus the follow-up probes interviewers actually use, plus the common mistakes that fail the round. It covers what the architecture is, why each decision was made, and what the interviewer is specifically looking for.</p><p>The first ten questions are published.</p><p><a href="https://systemdr.systemdrd.com/subscribe">https://systemdr.systemdrd.com/subscribe</a></p><p>The free archive has a question bank, the six-step framework cheatsheet, and the estimation numbers reference. All three are downloadable without a subscription.</p><p>&#8212; Sumedh</p><div class="callout-block" data-callout="true"><p>The Question Vault has all 52 walkthroughs organized by archetype &#8212; so you can see the pattern across questions, not just the surface answer.</p><p><strong>Access all 52 questions <a href="https://systemdrd.com/ebooks/52-faang-questions-drillcards-cheatsheets/">here</a></strong></p></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://systemdr.systemdrd.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">System Design Interview Roadmap is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Building a Research Chat App on LangChain Managed Deep Agents (With Human Approval Before Web Search)]]></title><description><![CDATA[Master the Blueprint of Modern AI Engineering Go Beyond Prompting and Learn How Real AI Systems Are Built, Scaled, and Deployed in Production.]]></description><link>https://systemdr.systemdrd.com/p/building-a-research-chat-app-on-langchain</link><guid isPermaLink="false">https://systemdr.systemdrd.com/p/building-a-research-chat-app-on-langchain</guid><dc:creator><![CDATA[System Design Roadmap]]></dc:creator><pubDate>Thu, 21 May 2026 13:51:11 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!DqcK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62f80629-6682-4f21-ac83-a559ec139048_1194x1162.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="callout-block" data-callout="true"><p>Master the Blueprint of Modern AI Engineering Go Beyond Prompting and Learn How Real AI Systems Are Built, Scaled, and Deployed in Production. AI engineering is no longer about calling...</p><p>https://systemdrd.com/ebooks/ai-engineers-blueprint</p></div><p>Most &#8220;AI demos&#8221; are a text box wired to an LLM. That works until the model tries to <strong>search the web</strong>, <strong>read a URL</strong>, or <strong>spend money on tools</strong> without you noticing.</p><p>This project is different. 
It is a small but complete app: a React chat UI, a FastAPI backend, and an agent definition you keep in Git. The interesting part is not the chat bubbles; it is how the same UI talks to <strong>three different runtimes</strong> (cloud managed agent, local open-source agent, or your own LangGraph deployment) and how it <strong>pauses the agent</strong> until a human approves a web search.</p><p>If you have been following system design topics (timeouts, idempotency, backpressure, &#8220;who owns state?&#8221;), you will recognize the same questions here, just with agents instead of microservices.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DqcK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62f80629-6682-4f21-ac83-a559ec139048_1194x1162.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!DqcK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62f80629-6682-4f21-ac83-a559ec139048_1194x1162.png 424w, https://substackcdn.com/image/fetch/$s_!DqcK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62f80629-6682-4f21-ac83-a559ec139048_1194x1162.png 848w, https://substackcdn.com/image/fetch/$s_!DqcK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62f80629-6682-4f21-ac83-a559ec139048_1194x1162.png 1272w, https://substackcdn.com/image/fetch/$s_!DqcK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62f80629-6682-4f21-ac83-a559ec139048_1194x1162.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DqcK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62f80629-6682-4f21-ac83-a559ec139048_1194x1162.png" width="1194" height="1162" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/62f80629-6682-4f21-ac83-a559ec139048_1194x1162.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1162,&quot;width&quot;:1194,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:86610,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://langchaineco.substack.com/i/198707391?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62f80629-6682-4f21-ac83-a559ec139048_1194x1162.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!DqcK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62f80629-6682-4f21-ac83-a559ec139048_1194x1162.png 424w, https://substackcdn.com/image/fetch/$s_!DqcK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62f80629-6682-4f21-ac83-a559ec139048_1194x1162.png 848w, https://substackcdn.com/image/fetch/$s_!DqcK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62f80629-6682-4f21-ac83-a559ec139048_1194x1162.png 1272w, https://substackcdn.com/image/fetch/$s_!DqcK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62f80629-6682-4f21-ac83-a559ec139048_1194x1162.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><div><hr></div><h2><strong>What you are looking at</strong></h2><p>Open the app and you get a <strong>Research Assistant</strong>. You type a question. The agent can plan, take notes in a virtual filesystem, search the web, read pages, and even call a <strong>fact-checker subagent</strong> for specific claims.</p><p>GitHub: <a href="https://github.com/sysdr/langchain-echosystem">https://github.com/sysdr/langchain-echosystem</a></p><p>The repo is called <code>langchain-echosystem</code>. 
Layout:</p><p>langchain-echosystem/</p><p>&#9500;&#9472;&#9472; agent/          &#8592; what the agent *is* (instructions, tools, skills)</p><p>&#9500;&#9472;&#9472; backend/        &#8592; API + runtime switch</p><p>&#9500;&#9472;&#9472; frontend/       &#8592; chat UI + approval modal</p><p>&#9492;&#9472;&#9472; langgraph.json  &#8592; optional deploy to LangSmith</p><p>Three layers, one product:</p><table><thead><tr><th>Layer</th><th>Job</th></tr></thead><tbody><tr><td><code>agent/</code></td><td>Personality, tools, when to ask a human</td></tr><tr><td><code>backend/</code></td><td>Pick runtime, stream tokens, handle interrupts</td></tr><tr><td><code>frontend/</code></td><td>Show chat, block input until you Approve/Reject</td></tr></tbody></table><div><hr></div><h2><strong>The agent is just files in </strong><code>agent/</code></h2><p>Managed Deep Agents let you define an agent from the repo instead of clicking around a dashboard. That matters for newsletters and teams: <strong>version control beats copy-paste</strong>.</p><h3><strong>Instructions (</strong><code>agent/AGENTS.md</code><strong>)</strong></h3><p>The agent is told to behave like a careful researcher:</p><ul><li><p>Clarify vague questions</p></li><li><p>Search when facts need the outside world</p></li><li><p><strong>Never invent citations</strong></p></li><li><p>Use bullet points and links</p></li><li><p>Delegate doubtful claims to a fact-checker</p></li></ul><p>There is also a <code>/memories/preferences.txt</code> convention&#8212;durable user prefs saved across chats. That is a simple pattern for &#8220;long-term memory&#8221; without a separate database in this demo.</p><h3><strong>Tools (</strong><code>agent/tools.json</code><strong>)</strong></h3><p>Two tools come from LangChain&#8217;s <strong>Fleet</strong> MCP server:</p><ul><li><p><code>tavily_web_search</code></p></li><li><p><code>read_url_content</code></p></li></ul><p>The important setting is <code>interrupt_config</code>. 
Web search is set to <strong>require human approval</strong>; reading a URL does not:</p><p>&quot;interrupt_config&quot;: {</p><p>&quot;https://tools.langchain.com::tavily_web_search::Fleet&quot;: true,</p><p>&quot;https://tools.langchain.com::read_url_content::Fleet&quot;: false</p><p>}</p><p>From a system design angle: this is <strong>policy as data</strong>. You are not hard-coding &#8220;if tool == search then pause&#8221; in Python; you declare it once and the provision script pushes it to the cloud agent.</p><h3><strong>Skills and subagents</strong></h3><ul><li><p><code>agent/skills/research/SKILL.md</code> &#8212; multi-step research workflow (plan &#8594; search &#8594; notes &#8594; synthesis).</p></li><li><p><code>agent/subagents/fact-checker.md</code> &#8212; narrow job: verify claims, label Supported / Contradicted / Insufficient evidence, cite URLs.</p></li></ul><p>Subagents are the agent equivalent of <strong>calling a specialist service</strong> instead of bloating one prompt.</p><div><hr></div><h2><strong>One backend, three ways to run the brain</strong></h2><p>The backend does not assume you always have LangSmith preview access. <code>AGENT_RUNTIME</code> in <code>.env</code> can be <code>auto</code>, <code>managed</code>, <code>local</code>, or <code>deployment</code>.</p><p><code>backend/app/config.py</code> resolves <code>auto</code> like this:</p><ol><li><p>If <code>LANGGRAPH_DEPLOYMENT_URL</code> is set &#8594; <strong>deployment</strong></p></li><li><p>Else if <code>MANAGED_AGENT_ID</code> and an API key are set &#8594; <strong>managed</strong></p></li><li><p>Else &#8594; <strong>local</strong></p></li></ol><p><code>get_runtime()</code> in <code>backend/app/runtime/__init__.py</code> returns one of three classes with the <strong>same interface</strong>: create thread, stream chat, resolve interrupt, resume stream.</p><p>That is classic <strong>strategy pattern</strong> thinking: one API contract, pluggable implementations. 
Your frontend never branches on &#8220;are we local today?&#8221;</p><h3><strong>Managed mode (production-shaped)</strong></h3><p><code>ManagedRuntime</code> talks to LangSmith&#8217;s <code>/v1/deepagents</code> API via <code>DeepAgentsClient</code>. It creates a thread, starts a streamed run, and maps LangChain events to SSE:</p><ul><li><p><code>messages</code> &#8594; token chunks for the UI</p></li><li><p><code>values</code> &#8594; if there is an interrupt, emit an <code>interrupt</code> event</p></li></ul><p>You provision the cloud agent once:</p><p><code>make provision</code></p><p><code>backend/scripts/provision_agent.py</code> reads everything under <code>agent/</code>, builds a JSON payload (instructions, tools, subagents, skills), POSTs or PATCHes the Managed Deep Agents API, and writes <code>MANAGED_AGENT_ID</code> back into <code>.env</code>. Change <code>AGENTS.md</code>, run provision again, and the cloud agent updates. <strong>Git is the source of truth.</strong></p><p>The <code>REQUIRE_HITL_APPROVAL</code> environment variable controls, at provision time, whether web search needs approval. That is useful for switching between relaxed demos and stricter production.</p><h3><strong>Local mode (laptop-friendly)</strong></h3><p><code>LocalRuntime</code> uses open-source <code>deepagents</code> with an in-memory checkpointer. It still loads <code>AGENTS.md</code> as the system prompt, but web search is a <strong>stub</strong> that tells you to use managed mode for real search.</p><p>Good for UI work and backend tests without cloud keys. Bad for &#8220;did it really find that paper?&#8221;&#8212;by design.</p><h3><strong>Deployment mode (your own graph)</strong></h3><p><code>backend/agent.py</code> defines a LangGraph-compatible graph with <code>create_deep_agent</code>. <code>langgraph.json</code> points at it for <code>langgraph up</code>. 
Point <code>LANGGRAPH_DEPLOYMENT_URL</code> and <code>LANGGRAPH_ASSISTANT_ID</code> at that deployment and set <code>AGENT_RUNTIME=deployment</code>.</p><p>Same agent instructions file; different hosting. Useful when you want <strong>your</strong> infra and observability, not only the managed API.</p><div><hr></div><h2><strong>How a message travels through the system</strong></h2><p>Here is the happy path in managed mode:</p><p>You (browser)</p><p>  &#8594; POST /api/conversations          (new thread_id)</p><p>  &#8594; POST /api/chat/stream            (SSE: tokens + maybe interrupt)</p><p>  &#8594; [optional] POST resolve-interrupt</p><p>  &#8594; [optional] POST resume-stream    (more SSE tokens)</p><p><strong>SSE (Server-Sent Events)</strong> means the server pushes many small events on one HTTP response. The frontend&#8217;s <code>api.ts</code> parses <code>event:</code> and <code>data:</code> lines&#8212;no WebSocket server required. For token streaming, that is often enough and simpler to operate behind proxies.</p><p><code>backend/app/routes/chat.py</code> wraps the runtime iterator in <code>EventSourceResponse</code>. Event types include <code>token</code>, <code>interrupt</code>, <code>error</code>, and <code>done</code>.</p><p>On the React side (<code>App.tsx</code>):</p><ol><li><p>User sends message &#8594; append user bubble + empty assistant bubble.</p></li><li><p><code>streamChat</code> feeds tokens into the assistant bubble (markdown via <code>react-markdown</code>).</p></li><li><p>If <code>onInterrupt</code> fires &#8594; show <code>InterruptPrompt</code> modal, disable composer.</p></li><li><p>Approve &#8594; <code>resolveInterrupt</code> then <code>resumeStream</code> continues the same assistant message.</p></li><li><p>Reject &#8594; run stops; status says tool rejected.</p></li></ol><p>The modal (<code>InterruptPrompt.tsx</code>) is deliberately plain: tool name, description, Approve / Reject. 
No mystery about what the agent wanted to do.</p><p><strong>System design takeaway:</strong> the interrupt is a <strong>synchronization point</strong>. The agent&#8217;s run is not &#8220;failed&#8221;; it is <strong>blocked</strong> until an external decision arrives&#8212;like waiting on a human task in a workflow engine, or a payment authorization hold.</p><div><hr></div><h2><strong>Human-in-the-loop in one paragraph</strong></h2><p>Without HITL, an agent can issue searches you did not intend (wrong query, leaked context, cost). With HITL:</p><ol><li><p>Agent decides it needs <code>tavily_web_search</code>.</p></li><li><p>Runtime surfaces an interrupt in the stream.</p></li><li><p>UI stops; user approves or rejects.</p></li><li><p><code>resolve_interrupt</code> tells the API the decision.</p></li><li><p><code>resume-stream</code> continues generation.</p></li></ol><p>That is <strong>fail-safe by default</strong> for the risky tool only. Reading URLs stays automatic&#8212;policy choice, not universal slowdown.</p><p>For interviews: relate this to <strong>circuit breakers</strong>, <strong>approval workflows</strong>, and <strong>least privilege</strong>. The agent does not get unfettered egress; it gets egress <strong>after</strong> a human gate for the sensitive action.</p><div><hr></div><h2><strong>Frontend: small surface, clear states</strong></h2><p>The UI is one main component plus the interrupt overlay. 
State that matters:</p><ul><li><p><code>health</code> &#8212; from <code>/api/health</code>: which runtime, is it <code>ready</code>?</p></li><li><p><code>threadId</code> / <code>agentId</code> &#8212; conversation scope</p></li><li><p><code>interrupt</code> &#8212; blocks send until resolved</p></li><li><p><code>loading</code> / <code>resolving</code> &#8212; button and textarea disabled appropriately</p></li></ul><p>The header shows <strong>Managed Deep Agents</strong> vs <strong>Local</strong> vs <strong>LangSmith Deployment</strong> so you are never confused about which brain is answering.</p><p>Sample prompts on the empty state nudge system-design-style questions (&#8220;tradeoffs in agent memory&#8221;, &#8220;LangGraph durable execution&#8221;, &#8220;RAG vs long context&#8221;)&#8212;aligned with what your readers care about.</p><div><hr></div><h2><strong>Docker and Makefile: how you actually run it</strong></h2><p>cp .env.example .env</p><p>make install</p><p>make provision    <em># managed mode</em></p><p>make backend      <em># :8000</em></p><p>make frontend     <em># :5173</em></p><p>Or <code>make docker-up</code> &#8594; frontend on <strong>3000</strong>, backend on <strong>8000</strong>, healthcheck on the API before the UI container starts. Compose wires CORS for local and container hostnames.</p><p>The Makefile is thin on purpose: install, provision, run, docker. 
No hidden magic.</p><div><hr></div><h2><strong>What I would tell a system design reader</strong></h2><ol><li><p><strong>Separate &#8220;agent definition&#8221; from &#8220;runtime.&#8221;</strong> Files in <code>agent/</code> vs Python runtimes&#8212;same product, different ops models.</p></li><li><p><strong>Stream tokens; don&#8217;t buffer the whole answer.</strong> SSE keeps latency honest and UX responsive.</p></li><li><p><strong>Treat tool calls as side effects.</strong> Search is a side effect; gate it with HITL config, not hope.</p></li><li><p><strong>Subagents bound blast radius.</strong> Fact-checking is a focused delegate, not a bigger main prompt.</p></li><li><p><strong>Provision script = deployment pipeline for agents.</strong> CI could run <code>provision_agent.py</code> on every merge to <code>agent/</code>.</p></li></ol><p>This is not a billion-user architecture. It is a <strong>correct end-to-end slice</strong>: auth via API keys in env, threaded conversations, streaming, interrupts, multi-runtime fallback, and deploy hooks. That is exactly what you want before scaling traffic&#8212;get the state machine right first.</p><div><hr></div><h2><strong>Try it yourself</strong></h2><p>Clone the repo, set keys per <code>.env.example</code>, run <code>make provision</code> if you have LangSmith managed access, then ask something that needs the web. When the approval modal appears, you are seeing the <code>interrupt_config</code> from <code>tools.json</code> at work, not a mock.</p><p>If you only have model API keys, stay on the <strong>local</strong> runtime: the UI and streaming still work; search returns the stub message until you point at managed or deployment.</p><div><hr></div><h2><strong>Closing thought</strong></h2><p>Agents are moving from &#8220;chat completion&#8221; to <strong>systems</strong>: tools, memory, subgraphs, pauses, resumes. 
This app is a readable map of that shift&#8212;files for behavior, a router for where the graph runs, SSE for the wire, and a modal for the one tool call you refused to automate.</p><p>For <a href="https://systemdr.systemdrd.com/">System Design Interview Roadmap</a> readers, the interview question is no longer only &#8220;design Twitter.&#8221; It is increasingly &#8220;design a worker that can call external APIs&#8212;who approves, where is state, what happens on retry?&#8221; This codebase is one honest answer.</p><div><hr></div>]]></content:encoded></item><item><title><![CDATA[Kernel Tuning for High-Load Systems: File Descriptors, TCP Buffers, and Ephemeral Ports]]></title><description><![CDATA[The Wall Nobody Sees Coming]]></description><link>https://systemdr.systemdrd.com/p/kernel-tuning-for-high-load-systems</link><guid isPermaLink="false">https://systemdr.systemdrd.com/p/kernel-tuning-for-high-load-systems</guid><dc:creator><![CDATA[System Design Roadmap]]></dc:creator><pubDate>Mon, 18 May 2026 06:47:13 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!xJ9Z!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa40af0a5-db93-40a0-a162-6063faa5fe28_4500x3100.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>The Wall Nobody Sees Coming</h2><blockquote><p>Your application is humming. Load balancer green. CPU at 30%. Memory comfortable. Then at 4 AM, on-call fires: thousands of connections timing out, upstream services unreachable, health checks failing &#8212; yet every metric dashboard looks normal. The culprit isn&#8217;t your code. It&#8217;s the operating system silently running out of file descriptors, exhausted ephemeral ports, or TCP buffers too small to fill a gigabit pipe. Kernel limits are invisible until they snap. This is what that looks like, and how to prevent it.</p></blockquote><div><hr></div><h2>Core Concept: The Kernel as a Silent Resource Broker</h2><blockquote><p>Every TCP connection your process opens requires a file descriptor (FD) &#8212; a kernel-managed integer handle. The same FD table also tracks open files, sockets, pipes, and device handles. The default limit on most Linux distributions is 1,024 per process (<code>ulimit -n</code>). A single Nginx worker handling 10,000 connections needs 10,000+ FDs. If the limit is 1,024, the 1,025th <code>accept()</code> call returns <code>EMFILE: Too many open files</code> &#8212; silently dropped from the application&#8217;s perspective unless you instrument for it.</p></blockquote><p><strong>The three-layer limit model:</strong> Linux enforces FD limits at three levels simultaneously. The system-wide hard cap lives in <code>/proc/sys/fs/file-max</code>. The per-process soft limit is the one <code>ulimit -n</code> reports. The per-process hard limit is the ceiling a process can raise its own soft limit to without root. All three must be in alignment. 
Raising one without the others produces confusing partial failures: the process can&#8217;t open more FDs even though <code>file-max</code> looks healthy.</p><p><strong>TCP buffers and throughput math:</strong> Each established TCP socket has a send buffer and receive buffer allocated in kernel memory. On many stock kernels these default to roughly 87KB receive and 16KB send (the middle values of <code>net.ipv4.tcp_rmem</code> and <code>net.ipv4.tcp_wmem</code>; exact values vary by kernel version). Buffer size bounds throughput on high-latency links via the bandwidth-delay product (BDP): a 1 Gbps link with 50ms RTT requires <code>1,000,000,000 &#215; 0.05 / 8 = 6.25 MB</code> per connection to keep the pipe full. An 87KB buffer caps that connection at ~14 Mbps (<code>87,380 &#215; 8 / 0.05</code>) regardless of link speed. On AWS cross-region or trans-oceanic connections, under-buffered TCP is a consistent, measurable bottleneck.</p><p>The relevant kernel parameters:</p><ul><li><p><code>net.core.rmem_max</code> / <code>net.core.wmem_max</code> &#8212; per-socket maximums the application can request</p></li><li><p><code>net.ipv4.tcp_rmem</code> / <code>net.ipv4.tcp_wmem</code> &#8212; kernel&#8217;s auto-tuning range (min, default, max)</p></li><li><p><code>net.core.netdev_max_backlog</code> &#8212; packets queued when NIC receives faster than the kernel processes</p></li></ul><p><strong>Ephemeral port exhaustion:</strong> Outbound connections require a source port. The kernel allocates from the ephemeral range, default <code>32768&#8211;60999</code> on Linux &#8212; 28,232 ports. Each connection to the same destination IP:port consumes one 4-tuple <code>(src_ip, src_port, dst_ip, dst_port)</code>. A service making 30,000 outbound connections per second to a single backend (common in reverse proxies, API gateways, or connection pools) will exhaust this range and start receiving <code>EADDRNOTAVAIL</code>. 
The fix is expanding the range via <code>net.ipv4.ip_local_port_range</code> to <code>1024&#8211;65535</code> (64,512 ports) and enabling <code>net.ipv4.tcp_tw_reuse</code> to recycle TIME_WAIT sockets safely.</p><p>TIME_WAIT itself is intentional &#8212; the kernel holds closed connections for 2&#215;MSL (typically 60 seconds) to prevent delayed packets from poisoning new connections on the same 4-tuple. At 30K connections/second, that&#8217;s 1.8 million simultaneous TIME_WAIT entries. Each consumes ~350 bytes of kernel memory &#8212; 630MB just for state. <code>tcp_tw_reuse</code> allows reuse for outbound connections when timestamps confirm no stale data, eliminating the explosion without compromising correctness.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xJ9Z!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa40af0a5-db93-40a0-a162-6063faa5fe28_4500x3100.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xJ9Z!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa40af0a5-db93-40a0-a162-6063faa5fe28_4500x3100.png 424w, https://substackcdn.com/image/fetch/$s_!xJ9Z!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa40af0a5-db93-40a0-a162-6063faa5fe28_4500x3100.png 848w, https://substackcdn.com/image/fetch/$s_!xJ9Z!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa40af0a5-db93-40a0-a162-6063faa5fe28_4500x3100.png 1272w, 
https://substackcdn.com/image/fetch/$s_!xJ9Z!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa40af0a5-db93-40a0-a162-6063faa5fe28_4500x3100.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xJ9Z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa40af0a5-db93-40a0-a162-6063faa5fe28_4500x3100.png" width="1456" height="1003" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a40af0a5-db93-40a0-a162-6063faa5fe28_4500x3100.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1003,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3460021,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://systemdr.substack.com/i/190349934?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa40af0a5-db93-40a0-a162-6063faa5fe28_4500x3100.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xJ9Z!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa40af0a5-db93-40a0-a162-6063faa5fe28_4500x3100.png 424w, https://substackcdn.com/image/fetch/$s_!xJ9Z!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa40af0a5-db93-40a0-a162-6063faa5fe28_4500x3100.png 848w, 
https://substackcdn.com/image/fetch/$s_!xJ9Z!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa40af0a5-db93-40a0-a162-6063faa5fe28_4500x3100.png 1272w, https://substackcdn.com/image/fetch/$s_!xJ9Z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa40af0a5-db93-40a0-a162-6063faa5fe28_4500x3100.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><div><hr></div><div class="callout-block" data-callout="true"><p><strong>Preparing for a distributed systems interview?</strong></p>
<p>&#8594; <strong><a href="https://systemdrd.com/ebooks/sdcourse-distributed-systems-interview">Download the free Interview Pack</a></strong></p><p><strong>&#8594; <a href="https://systemdr.systemdrd.com/subscribe">Subscribe</a> now to access the source code repository (200+ coding lessons)</strong></p></div>
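<p>The arithmetic quoted earlier in this post (bandwidth-delay product, the throughput ceiling of a small buffer, ephemeral port counts, and TIME_WAIT memory) can be checked with a short Python sketch. The function names are illustrative helpers, not part of any library, and the inputs (an 87,380-byte buffer, a 60-second TIME_WAIT hold, roughly 350 bytes per entry) are the figures assumed in the text rather than values read from a live kernel.</p>

```python
"""Back-of-envelope checks for the kernel-limit figures quoted in the post.

All function names are hypothetical helpers, not a library API.
Inputs are the post's assumed figures, not values read from a live kernel.
"""


def bdp_bytes(link_bps: int, rtt_ms: int) -> int:
    """Bandwidth-delay product: bytes that must be in flight to fill the link."""
    return link_bps * rtt_ms // 8000  # bits/s * ms -> bytes (divide by 8 * 1000)


def throughput_cap_bps(buffer_bytes: int, rtt_ms: int) -> int:
    """Upper bound on throughput when at most buffer_bytes are in flight per RTT."""
    return buffer_bytes * 8 * 1000 // rtt_ms


def port_count(low: int, high: int) -> int:
    """Number of ports in an inclusive range such as ip_local_port_range."""
    return high - low + 1


def time_wait_bytes(conns_per_s: int, hold_s: int = 60, entry_bytes: int = 350) -> int:
    """Steady-state memory held by TIME_WAIT entries at a given close rate."""
    return conns_per_s * hold_s * entry_bytes


if __name__ == "__main__":
    print(bdp_bytes(1_000_000_000, 50))    # 1 Gbps link, 50 ms RTT
    print(throughput_cap_bps(87_380, 50))  # 87,380-byte buffer, 50 ms RTT
    print(port_count(32768, 60999))        # default ephemeral range
    print(port_count(1024, 65535))         # expanded range
    print(time_wait_bytes(30_000))         # 30K closes/s held for 60 s
```

<p>With the post's inputs this works out to 6,250,000 bytes (6.25 MB) of buffer per connection, a ceiling of about 13.98 Mbps for the small buffer, 28,232 and 64,512 ports for the two ranges, and 630,000,000 bytes (630 MB) of TIME_WAIT state. Integer arithmetic is used on purpose so the results are exact.</p>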
      <p>
          <a href="https://systemdr.systemdrd.com/p/kernel-tuning-for-high-load-systems">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Service Mesh Performance Costs: The Reality of Sidecar Latency]]></title><description><![CDATA[Introduction]]></description><link>https://systemdr.systemdrd.com/p/service-mesh-performance-costs-the</link><guid isPermaLink="false">https://systemdr.systemdrd.com/p/service-mesh-performance-costs-the</guid><dc:creator><![CDATA[System Design Roadmap]]></dc:creator><pubDate>Tue, 12 May 2026 08:30:59 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!71Jk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa254056a-3788-4be4-aa96-d1bce9afb635_4500x2900.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>Introduction</h2><blockquote><p>Your service handles 2ms p99 latency. You adopt Istio for zero-trust security and traffic management. Six weeks later, on-call engineers are staring at 11ms p99. The architecture didn&#8217;t change. The code didn&#8217;t change. Only the mesh did. This is the tax every team pays when they deploy a sidecar proxy, and understanding exactly where that tax comes from &#8212; and when it&#8217;s worth paying &#8212; is what separates architects who build resilient systems from those who debug them forever.</p></blockquote><div><hr></div><h2>What a Service Mesh Actually Does at Runtime</h2><p>A service mesh inserts a proxy (almost always Envoy) as a sidecar container into every pod. All inbound and outbound traffic is intercepted via iptables rules injected by an init container, redirected to the sidecar on loopback, processed, and then forwarded to the real destination. From the application&#8217;s perspective, it&#8217;s talking to localhost. 
From the network&#8217;s perspective, every request traverses two additional software stacks.</p><p>This interception model creates four distinct latency contributors that compound differently depending on load:</p><p><strong>1. iptables traversal cost.</strong> Every packet entering or leaving the pod is evaluated against NAT rules injected by the mesh. On a lightly loaded host, this costs roughly 20&#8211;50&#181;s per packet. Under high connection churn (hundreds of new connections per second), the cost rises non-linearly because iptables is evaluated sequentially and lacks O(1) lookups for large rule sets. Cilium and eBPF-based meshes (like Cilium Service Mesh) bypass this by attaching programs at the kernel&#8217;s XDP or TC layer, reducing this overhead to near zero.</p><p><strong>2. Loopback socket handoff.</strong> Traffic redirected to the sidecar crosses two user-kernel boundaries &#8212; once into the proxy process, and once back out to the destination. Even over loopback, this is a copy operation. At low RPS, this is negligible. 
At 50k+ RPS per pod, you&#8217;re looking at consistent kernel scheduling overhead that adds 0.1&#8211;0.3ms of base latency.</p><p><strong>3. Envoy request processing.</strong> Envoy parses L7 protocol metadata (HTTP headers, gRPC frames), applies routing rules, enforces rate limits, evaluates RBAC policies, and records telemetry. Each filter in the chain adds processing time. A standard Istio deployment activates 10&#8211;15 Envoy filters by default. Disabling unused filters (e.g., <code>envoy.filters.http.cors</code> when CORS is handled upstream) directly reduces this cost.</p><p><strong>4. mTLS handshake amortization.</strong> Mutual TLS between sidecars requires a full TLS 1.3 handshake for new connections. For long-lived connections, this cost is amortized across thousands of requests. For short-lived connections &#8212; HTTP/1.1 workloads that open a new connection per request &#8212; mTLS overhead can dominate total service-to-service latency.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!71Jk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa254056a-3788-4be4-aa96-d1bce9afb635_4500x2900.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!71Jk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa254056a-3788-4be4-aa96-d1bce9afb635_4500x2900.png 424w, https://substackcdn.com/image/fetch/$s_!71Jk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa254056a-3788-4be4-aa96-d1bce9afb635_4500x2900.png 848w, 
https://substackcdn.com/image/fetch/$s_!71Jk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa254056a-3788-4be4-aa96-d1bce9afb635_4500x2900.png 1272w, https://substackcdn.com/image/fetch/$s_!71Jk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa254056a-3788-4be4-aa96-d1bce9afb635_4500x2900.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!71Jk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa254056a-3788-4be4-aa96-d1bce9afb635_4500x2900.png" width="1456" height="938" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a254056a-3788-4be4-aa96-d1bce9afb635_4500x2900.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:938,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1277071,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://systemdr.substack.com/i/190347864?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa254056a-3788-4be4-aa96-d1bce9afb635_4500x2900.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!71Jk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa254056a-3788-4be4-aa96-d1bce9afb635_4500x2900.png 424w, 
https://substackcdn.com/image/fetch/$s_!71Jk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa254056a-3788-4be4-aa96-d1bce9afb635_4500x2900.png 848w, https://substackcdn.com/image/fetch/$s_!71Jk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa254056a-3788-4be4-aa96-d1bce9afb635_4500x2900.png 1272w, https://substackcdn.com/image/fetch/$s_!71Jk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa254056a-3788-4be4-aa96-d1bce9afb635_4500x2900.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><div><hr></div><h2>Critical Insights</h2><p><strong>Connection pooling is the most underutilized optimization.</strong> Envoy maintains separate connection pools per upstream host per worker thread. With 4 Envoy worker threads and 20 upstream hosts, Envoy can hold 4 &#215; 20 = 80 simultaneous connection pools. If your application uses HTTP/1.1 without persistent connections, Envoy cannot pool effectively and will perform a TLS handshake for nearly every request. Switching to HTTP/2 or gRPC enables multiplexed streams over single connections, reducing TLS overhead by 90%+ in high-request-rate scenarios.</p><p><strong>Control plane churn is a hidden latency spike source.</strong> Every time xDS (the Envoy configuration API) pushes an update &#8212; a new endpoint, a changed routing rule, a certificate rotation &#8212; Envoy must reload affected configuration. During reload, in-flight requests using stale configuration can be dropped. In large clusters (500+ services), certificate rotation events from SPIFFE/SPIRE can trigger near-simultaneous xDS updates across all sidecars, creating a thundering herd on the control plane. Istio&#8217;s <code>PILOT_ENABLE_EDS_DEBOUNCE</code> and similar flags exist specifically to batch these updates.</p><p><strong>Telemetry cardinality is a CPU amplifier.</strong> Envoy emits metrics with labels like <code>source_workload</code>, <code>destination_workload</code>, <code>response_code</code>, and <code>grpc_response_status</code>. The number of Prometheus time series scales with the number of communicating service pairs, which in the worst case grows quadratically with the number of services: at 1000 services with 10 response codes each, that is up to 1000 &#215; 1000 &#215; 10 = 10 million series. Teams have reported Envoy spending 15&#8211;20% of CPU on stats collection alone at high cardinality. 
Removing labels you never query (for example, with tag overrides in Istio&#8217;s Telemetry API) reduces this substantially.</p><p><strong>Sidecar resource limits interact with scheduling.</strong> Envoy defaults to requesting 100m CPU and 128Mi memory in Istio. Under request pressure, the kernel CPU scheduler may preempt the sidecar mid-processing, adding latency spikes invisible in p50 metrics but glaring in p99.9. Setting CPU and memory limits equal to requests on every container in the pod (the Guaranteed QoS class in Kubernetes) reduces this variability at the cost of over-provisioning.</p><p><strong>The &#8220;ambient mesh&#8221; architecture eliminates the sidecar problem entirely.</strong> Istio&#8217;s ambient mode (beta in Istio 1.22, generally available since 1.24) moves L4 processing to a per-node <code>ztunnel</code> DaemonSet and L7 processing to optional waypoint proxies. Services with no L7 policy incur only node-level tunnel overhead (~0.1ms), not per-pod proxy overhead. This is not a future design: production clusters have validated it.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Wk6X!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fe96847-c0d3-46c3-b8bf-c9dd7e7baeaa_4500x3100.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Wk6X!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fe96847-c0d3-46c3-b8bf-c9dd7e7baeaa_4500x3100.png 424w, https://substackcdn.com/image/fetch/$s_!Wk6X!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fe96847-c0d3-46c3-b8bf-c9dd7e7baeaa_4500x3100.png 848w, 
https://substackcdn.com/image/fetch/$s_!Wk6X!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fe96847-c0d3-46c3-b8bf-c9dd7e7baeaa_4500x3100.png 1272w, https://substackcdn.com/image/fetch/$s_!Wk6X!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fe96847-c0d3-46c3-b8bf-c9dd7e7baeaa_4500x3100.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Wk6X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fe96847-c0d3-46c3-b8bf-c9dd7e7baeaa_4500x3100.png" width="1456" height="1003" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3fe96847-c0d3-46c3-b8bf-c9dd7e7baeaa_4500x3100.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1003,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:949046,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://systemdr.substack.com/i/190347864?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fe96847-c0d3-46c3-b8bf-c9dd7e7baeaa_4500x3100.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Wk6X!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fe96847-c0d3-46c3-b8bf-c9dd7e7baeaa_4500x3100.png 424w, 
https://substackcdn.com/image/fetch/$s_!Wk6X!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fe96847-c0d3-46c3-b8bf-c9dd7e7baeaa_4500x3100.png 848w, https://substackcdn.com/image/fetch/$s_!Wk6X!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fe96847-c0d3-46c3-b8bf-c9dd7e7baeaa_4500x3100.png 1272w, https://substackcdn.com/image/fetch/$s_!Wk6X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fe96847-c0d3-46c3-b8bf-c9dd7e7baeaa_4500x3100.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><div><hr></div><h2>Real-World Examples</h2><p><strong>Shopify</strong> reported that after adopting Istio at scale (2021), their p99 inter-service latency increased by 3&#8211;4ms on average. The root cause was mTLS handshake cost on their Rails services, which used short-lived HTTP/1.1 connections. Their fix was enabling Envoy connection coalescing and mandating HTTP/2 for all internal service communication, reducing handshake frequency by 80%.</p><p><strong>LinkedIn</strong> published detailed analysis showing that at 2M+ RPS, iptables redirect overhead contributed 8% of total CPU consumption cluster-wide. They migrated to Cilium&#8217;s eBPF-based transparent proxy, eliminating iptables NAT entirely and recovering that CPU headroom for application workloads.</p><p><strong>Uber</strong> addressed control plane scalability by running geographically distributed Istiod instances per region and implementing aggressive endpoint caching. Their finding: with 4000+ services, a naive Istiod deployment received xDS config updates 300+ times per minute, causing Envoy reload storms. Reducing xDS push frequency through debounce tuning cut reload events by 70%.</p><div><hr></div><p><strong>Preparing for a distributed systems interview? 
Download the free Interview Pack</strong> </p><p>&#8594;<a href="https://systemdrd.com/ebooks/sdcourse-distributed-systems-interview">https://systemdrd.com/ebooks/sdcourse-distributed-systems-interview</a></p><p><strong><a href="https://systemdr.systemdrd.com/subscribe">Subscribe</a> now to access source code repository - 200 + coding lessons &#8594; <a href="https://systemdr.systemdrd.com/subscribe">Subscribe</a></strong></p><div><hr></div><h2>Architectural Considerations</h2><h2><strong>GitHub Link</strong></h2><p><a href="https://github.com/sysdr/sdir-p/tree/main/Service_Mesh_Performance_costs/sidecar-latency-demo">https://github.com/sysdr/sdir-p/tree/main/Service_Mesh_Performance_costs/sidecar-latency-demo</a></p><blockquote><p>Service meshes belong in your stack when you need uniform mTLS, traffic splitting for canary deployments, or consistent observability across polyglot services. They do not belong when every microsecond matters &#8212; financial systems, gaming backends, or ML inference services should evaluate eBPF-native solutions or application-layer mutual authentication instead.</p></blockquote><p>Monitor proxy resource usage separately from application resources. Envoy&#8217;s <code>/stats</code> endpoint and Prometheus integration surface queue depths, connection pool saturation, and filter processing times. Alert on <code>envoy_cluster_upstream_cx_connect_ms</code> p99 as a leading indicator of mTLS overhead before it degrades user-visible latency. Cost-wise, sidecars running at 100m CPU across 500 pods add 50 cores of persistent compute &#8212; approximately $2,000&#8211;$5,000/month on major cloud providers at standard pricing.</p><div><hr></div><h2>Practical Takeaway</h2><p>Before adopting or blaming a service mesh, profile the four cost centers: iptables overhead, loopback socket cost, Envoy filter chain depth, and mTLS connection reuse. 
Most mesh-related latency problems are solved by three changes: enabling HTTP/2 end-to-end, disabling unused Envoy filters, and tuning connection pool sizes to match your concurrency profile.</p><p><strong>Run </strong><code>bash setup.sh</code><strong> to see this in action.</strong> The demo deploys two services behind Envoy sidecars and a mock control plane, generating traffic at configurable RPS. The dashboard shows real-time p50/p99/p99.9 latency broken down by sidecar cost component &#8212; iptables, TLS handshake, filter processing, and application processing &#8212; so you can see exactly where each millisecond goes. Toggle HTTP/1.1 vs HTTP/2 and watch TLS amortization change the latency profile live. Extend it by adding more Envoy filters to the chain and observing additive latency. This is the fastest way to build an intuition for what your mesh is actually costing you.</p><h3>YouTube Demo Link:</h3><div id="youtube2-RGZ3EaZhm9k" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;RGZ3EaZhm9k&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/RGZ3EaZhm9k?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div>]]></content:encoded></item><item><title><![CDATA[Handling "Hot Keys" in Distributed Databases: Detection and Splitting Strategies]]></title><description><![CDATA[The Problem That Silently Kills Your Database]]></description><link>https://systemdr.systemdrd.com/p/handling-hot-keys-in-distributed</link><guid isPermaLink="false">https://systemdr.systemdrd.com/p/handling-hot-keys-in-distributed</guid><dc:creator><![CDATA[System Design Roadmap]]></dc:creator><pubDate>Sat, 09 May 2026 08:30:54 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!LOiL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb48b6ef1-2e01-486b-be58-217dcb4a90e1_4500x3000.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>The Problem That Silently Kills Your Database</h2><blockquote><p>Your Redis cluster is running fine&#8212;CPU at 12%, memory comfortable, latency in single-digit milliseconds. Then one shard starts spiking: 98% CPU, 800ms p99, timeouts cascading into your application tier. Every other shard is idle. You&#8217;ve hit a hot key.</p></blockquote><blockquote><p>A hot key is a single cache or database key receiving a disproportionate share of traffic. In a distributed system where data is partitioned by key hash, one key always maps to one node. If that key is <code>product:iphone16-pro</code> during a product launch, that single node absorbs all reads and writes for it&#8212;regardless of how many nodes are in your cluster. More nodes don&#8217;t help. 
This is a partitioning problem, not a capacity problem.</p></blockquote><div><hr></div><h2>Core Concept: Why Distribution Fails at Extremes</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LOiL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb48b6ef1-2e01-486b-be58-217dcb4a90e1_4500x3000.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LOiL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb48b6ef1-2e01-486b-be58-217dcb4a90e1_4500x3000.png 424w, https://substackcdn.com/image/fetch/$s_!LOiL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb48b6ef1-2e01-486b-be58-217dcb4a90e1_4500x3000.png 848w, https://substackcdn.com/image/fetch/$s_!LOiL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb48b6ef1-2e01-486b-be58-217dcb4a90e1_4500x3000.png 1272w, https://substackcdn.com/image/fetch/$s_!LOiL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb48b6ef1-2e01-486b-be58-217dcb4a90e1_4500x3000.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LOiL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb48b6ef1-2e01-486b-be58-217dcb4a90e1_4500x3000.png" width="1456" height="971" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b48b6ef1-2e01-486b-be58-217dcb4a90e1_4500x3000.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1588355,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://systemdr.substack.com/i/189115655?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb48b6ef1-2e01-486b-be58-217dcb4a90e1_4500x3000.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!LOiL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb48b6ef1-2e01-486b-be58-217dcb4a90e1_4500x3000.png 424w, https://substackcdn.com/image/fetch/$s_!LOiL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb48b6ef1-2e01-486b-be58-217dcb4a90e1_4500x3000.png 848w, https://substackcdn.com/image/fetch/$s_!LOiL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb48b6ef1-2e01-486b-be58-217dcb4a90e1_4500x3000.png 1272w, https://substackcdn.com/image/fetch/$s_!LOiL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb48b6ef1-2e01-486b-be58-217dcb4a90e1_4500x3000.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Distributed databases partition data using consistent hashing or range-based sharding. The goal is uniform distribution across nodes. This works when access patterns are uniform&#8212;but real traffic is never uniform. Power-law distributions (Zipf&#8217;s law) govern most real-world access patterns: a small percentage of keys receive the vast majority of requests.</p>
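<p>A minimal Python sketch of the effect described above, under stated assumptions: the key names, the shard count of 8, the MD5-modulo placement, and the <code>1/rank</code> traffic weights are all illustrative choices, not taken from any particular database. It assigns Zipf-like traffic to 1,000 keys, hashes each key to a shard, and reports how much of the total load the busiest shard carries.</p>
<pre><code class="language-python">import hashlib
from collections import Counter


def shard_for(key, num_shards):
    # Stable hash partitioning: the same key always maps to the same shard.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def shard_load(num_keys=1000, num_shards=8, exponent=1.0):
    # Zipf-like weights (an assumption): the key at rank r receives
    # traffic proportional to 1 / r**exponent.
    weights = {f"key:{rank}": 1.0 / rank ** exponent for rank in range(1, num_keys + 1)}
    total = sum(weights.values())
    load = Counter()
    for key, weight in weights.items():
        # Every request for a key lands on the one shard that owns it.
        load[shard_for(key, num_shards)] += weight / total
    return dict(load)


load = shard_load()
hottest_shard = max(load, key=load.get)
print(f"hottest shard {hottest_shard} carries {load[hottest_shard]:.1%} of traffic")
print(f"an even split would be {1 / 8:.1%} per shard")
</code></pre>
<p>With these assumptions the top-ranked key alone accounts for roughly 13% of all requests, which already exceeds the 12.5% an even eight-way split would give each shard. Whichever shard owns that key runs hotter than the rest, no matter how well the remaining keys are spread.</p>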
      <p>
          <a href="https://systemdr.systemdrd.com/p/handling-hot-keys-in-distributed">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Database Schema Migrations with Zero Downtime: The Expand-Contract Pattern]]></title><description><![CDATA[The Problem at 3 AM]]></description><link>https://systemdr.systemdrd.com/p/database-schema-migrations-with-zero</link><guid isPermaLink="false">https://systemdr.systemdrd.com/p/database-schema-migrations-with-zero</guid><dc:creator><![CDATA[System Design Roadmap]]></dc:creator><pubDate>Wed, 06 May 2026 08:31:12 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Dl2h!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedcec279-0799-4fb4-8908-182a91c82a6d_4500x3100.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>The Problem at 3 AM</h2><blockquote><p>Your team lands a contract requiring you to split a <code>full_name</code> column into <code>first_name</code> and <code>last_name</code> across 200 million rows. The naive approach: <code>ALTER TABLE users DROP COLUMN full_name, ADD COLUMN first_name VARCHAR, ADD COLUMN last_name VARCHAR</code>. You run it during &#8220;low traffic&#8221; at 2 AM. Postgres acquires an <code>ACCESS EXCLUSIVE</code> lock. For 47 minutes, your entire application is offline because every query touching <code>users</code> is blocked. The on-call engineer gets paged. The customer escalates. You revert, losing four hours of data. This is a schema migration war story that has happened at every company operating relational databases at scale.</p></blockquote><p>The Expand-Contract pattern eliminates this failure mode entirely.</p><div><hr></div><h2>Core Concept: Expand, Then Contract</h2><p>The pattern decomposes a breaking schema change into three discrete deployment phases, each reversible and safe to ship on its own.</p><p><strong>Phase 1: Expand</strong><br>Add new columns/tables alongside the existing ones. Never drop, never rename&#8212;only add. 
The new schema coexists with the old. The application code is not yet modified to write to the new columns; only background jobs or triggers begin populating them.</p><p><strong>Phase 2: Migrate</strong><br>Backfill historical data into the new columns in batches (typically 1,000&#8211;10,000 rows per batch with a deliberate sleep between batches to avoid I/O saturation). Once backfill completes, deploy application code that writes to <em>both</em> old and new columns simultaneously. This dual-write phase is critical: any new writes land in both locations, ensuring the new schema never falls behind.</p><p><strong>Phase 3: Contract</strong><br>After verifying that the new columns are fully populated and the new code has been live long enough to drain in-flight requests from the old code path, drop the old column. This ALTER TABLE is now a metadata-only operation in most modern databases&#8212;it completes in milliseconds regardless of table size.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Dl2h!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedcec279-0799-4fb4-8908-182a91c82a6d_4500x3100.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Dl2h!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedcec279-0799-4fb4-8908-182a91c82a6d_4500x3100.png 424w, https://substackcdn.com/image/fetch/$s_!Dl2h!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedcec279-0799-4fb4-8908-182a91c82a6d_4500x3100.png 848w, 
https://substackcdn.com/image/fetch/$s_!Dl2h!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedcec279-0799-4fb4-8908-182a91c82a6d_4500x3100.png 1272w, https://substackcdn.com/image/fetch/$s_!Dl2h!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedcec279-0799-4fb4-8908-182a91c82a6d_4500x3100.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Dl2h!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedcec279-0799-4fb4-8908-182a91c82a6d_4500x3100.png" width="1456" height="1003" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/edcec279-0799-4fb4-8908-182a91c82a6d_4500x3100.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1003,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3186627,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://systemdr.substack.com/i/188999974?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedcec279-0799-4fb4-8908-182a91c82a6d_4500x3100.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Dl2h!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedcec279-0799-4fb4-8908-182a91c82a6d_4500x3100.png 424w, 
https://substackcdn.com/image/fetch/$s_!Dl2h!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedcec279-0799-4fb4-8908-182a91c82a6d_4500x3100.png 848w, https://substackcdn.com/image/fetch/$s_!Dl2h!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedcec279-0799-4fb4-8908-182a91c82a6d_4500x3100.png 1272w, https://substackcdn.com/image/fetch/$s_!Dl2h!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedcec279-0799-4fb4-8908-182a91c82a6d_4500x3100.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
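<p>The expand and backfill steps can be sketched in a few lines of Python. This is an illustration under assumptions rather than a production migration: it uses an in-memory SQLite database instead of Postgres, the <code>expand</code> and <code>backfill</code> helpers are hypothetical names, and splitting on the first space is a deliberately naive way to derive <code>first_name</code> and <code>last_name</code>.</p>
<pre><code class="language-python">import sqlite3
import time


def expand(conn):
    # Phase 1 (expand): only add columns; the old column stays in place.
    conn.execute("ALTER TABLE users ADD COLUMN first_name TEXT")
    conn.execute("ALTER TABLE users ADD COLUMN last_name TEXT")


def backfill(conn, batch_size=1000, pause_seconds=0.0):
    # Phase 2 (migrate): copy data in small batches, pausing between them.
    # Returns the number of batches that were processed.
    batches = 0
    while True:
        rows = conn.execute(
            "SELECT id, full_name FROM users "
            "WHERE first_name IS NULL AND full_name IS NOT NULL "
            "ORDER BY id LIMIT ?",
            (batch_size,),
        ).fetchall()
        if not rows:
            return batches
        for row_id, full_name in rows:
            # Naive split (an assumption): everything after the first space
            # is treated as the last name.
            first, _, last = full_name.partition(" ")
            conn.execute(
                "UPDATE users SET first_name = ?, last_name = ? WHERE id = ?",
                (first, last, row_id),
            )
        conn.commit()  # one short transaction per batch
        batches += 1
        time.sleep(pause_seconds)  # deliberate pause to limit I/O pressure


# Demo data: 2,500 rows in a throwaway in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, full_name TEXT)")
conn.executemany(
    "INSERT INTO users (full_name) VALUES (?)",
    [(f"User{i} Example{i}",) for i in range(2500)],
)
expand(conn)
batches_run = backfill(conn, batch_size=1000)
print(f"backfill finished in {batches_run} batches")
</code></pre>
<p>Each batch commits on its own, so no single transaction holds locks for long, and the loop can be stopped and restarted safely because it only selects rows whose <code>first_name</code> is still NULL. The contract step (dropping <code>full_name</code>) is left out on purpose; it should only run once dual writes are verified.</p>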
      <p>
          <a href="https://systemdr.systemdrd.com/p/database-schema-migrations-with-zero">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Capacity Planning Modeling: Using Little's Law to Predict Hardware Needs]]></title><description><![CDATA[Introduction]]></description><link>https://systemdr.systemdrd.com/p/capacity-planning-modeling-using</link><guid isPermaLink="false">https://systemdr.systemdrd.com/p/capacity-planning-modeling-using</guid><dc:creator><![CDATA[System Design Roadmap]]></dc:creator><pubDate>Sun, 03 May 2026 08:31:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!J3-h!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff67f1e1d-9f86-4b20-b594-c6c2d751b8c6_4500x2900.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>Introduction</h2><blockquote><p>Your checkout service just fell over during a flash sale. Post-mortem reveals the root cause: you had 3&#215; the expected concurrent users, but your server pool was sized for peak <em>throughput</em>, not peak <em>concurrency</em>. These are different numbers, and conflating them is what burned you. Little&#8217;s Law is the one equation that connects throughput, concurrency, and latency, and it&#8217;s the reason capacity planning at companies like Amazon and Stripe is grounded in queueing theory rather than guesswork.</p></blockquote><div><hr></div><h2>The Core Equation</h2><p>Little&#8217;s Law states:</p><blockquote><p><strong>L = &#955;W</strong></p></blockquote><ul><li><p><strong>L</strong> &#8212; average number of requests in the system (concurrency)</p></li><li><p><strong>&#955;</strong> (lambda) &#8212; average arrival rate (requests per second)</p></li><li><p><strong>W</strong> &#8212; average time a request spends in the system (latency in seconds)</p></li></ul><p>This identity holds for any stable, ergodic system, regardless of the arrival or service-time distribution. Poisson arrivals, bursty traffic, uniform load: the math holds. 
That&#8217;s what makes it powerful.</p><p><strong>Worked example:</strong> Your API handles 500 RPS (&#955; = 500). Average response time is 200ms (W = 0.2s). Little&#8217;s Law gives you L = 500 &#215; 0.2 = <strong>100 concurrent requests</strong> in flight at any moment. To handle this, your thread pool, connection pool, and in-flight request budget all need headroom above 100. Size them at 100 and you&#8217;ll queue under any latency spike.</p><p>The dangerous trap: engineers look at throughput (500 RPS) and provision servers to handle that rate in isolation. They miss that latency is a multiplier on concurrency. A 2&#215; latency spike &#8212; say, from a slow DB query &#8212; doubles concurrent load without changing RPS at all. Your servers are now saturated even though your traffic didn&#8217;t increase.</p><p><strong>Rearranging for capacity planning:</strong></p><ul><li><p><strong>Max throughput given concurrency limit:</strong> &#955; = L / W</p></li><li><p><strong>Latency budget given target throughput:</strong> W = L / &#955;</p></li><li><p><strong>Required concurrency for a throughput target:</strong> L = &#955; &#215; W</p></li></ul><blockquote><p>If your load balancer enforces a max concurrency of 200 (connection limit), and your average latency is 400ms, your max sustainable throughput is 200 / 0.4 = <strong>500 RPS</strong>. Use the mean here, not p99: Little&#8217;s Law is a statement about averages, so plugging in a tail percentile understates the ceiling. If marketing promises the system will handle 800 RPS, you need either more concurrency slots or lower latency &#8212; not more CPU cores alone.</p></blockquote><p><strong>Segmenting by service tier:</strong> Apply Little&#8217;s Law independently to each layer. Your API gateway, application servers, database connection pool, and downstream services all have their own L, &#955;, and W. A bottleneck in any one layer creates backpressure upstream. 
The system&#8217;s actual throughput ceiling is the minimum across all layers &#8212; classic bottleneck analysis meets queueing theory.</p><p><strong>Stability condition:</strong> Little&#8217;s Law only applies to <em>stable</em> queues where arrival rate doesn&#8217;t permanently exceed service rate (&#955; &lt; &#956;, where &#956; is your service rate). When load exceeds capacity, queues grow without bound. This is why your 95th-percentile latency explodes suddenly at saturation &#8212; you&#8217;ve crossed the stability boundary.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!J3-h!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff67f1e1d-9f86-4b20-b594-c6c2d751b8c6_4500x2900.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!J3-h!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff67f1e1d-9f86-4b20-b594-c6c2d751b8c6_4500x2900.png 424w, https://substackcdn.com/image/fetch/$s_!J3-h!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff67f1e1d-9f86-4b20-b594-c6c2d751b8c6_4500x2900.png 848w, https://substackcdn.com/image/fetch/$s_!J3-h!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff67f1e1d-9f86-4b20-b594-c6c2d751b8c6_4500x2900.png 1272w, https://substackcdn.com/image/fetch/$s_!J3-h!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff67f1e1d-9f86-4b20-b594-c6c2d751b8c6_4500x2900.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!J3-h!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff67f1e1d-9f86-4b20-b594-c6c2d751b8c6_4500x2900.png" width="1456" height="938" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f67f1e1d-9f86-4b20-b594-c6c2d751b8c6_4500x2900.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:938,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2966556,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://systemdr.substack.com/i/188879948?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff67f1e1d-9f86-4b20-b594-c6c2d751b8c6_4500x2900.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!J3-h!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff67f1e1d-9f86-4b20-b594-c6c2d751b8c6_4500x2900.png 424w, https://substackcdn.com/image/fetch/$s_!J3-h!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff67f1e1d-9f86-4b20-b594-c6c2d751b8c6_4500x2900.png 848w, https://substackcdn.com/image/fetch/$s_!J3-h!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff67f1e1d-9f86-4b20-b594-c6c2d751b8c6_4500x2900.png 1272w, https://substackcdn.com/image/fetch/$s_!J3-h!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff67f1e1d-9f86-4b20-b594-c6c2d751b8c6_4500x2900.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><p></p>
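<p>The three rearrangements and the per-tier bottleneck rule can be sketched in a few lines of Python. This is a minimal illustration rather than a sizing tool: the <code>Tier</code> class, the function names, and the gateway and database-pool numbers are assumptions invented for the example. Only the 500 RPS at 200ms and the 200-slot at 400ms figures come from the text.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    """One layer of the stack. Names and numbers are hypothetical."""
    name: str
    concurrency_limit: int  # L: max in-flight requests the tier allows
    avg_latency_s: float    # W: average time a request spends in the tier


def required_concurrency(arrival_rate_rps: float, avg_latency_s: float) -> float:
    """L = lambda * W: in-flight requests needed to sustain a given rate."""
    return arrival_rate_rps * avg_latency_s


def max_throughput(concurrency_limit: int, avg_latency_s: float) -> float:
    """lambda = L / W: highest rate a concurrency limit can sustain."""
    return concurrency_limit / avg_latency_s


def latency_budget(concurrency_limit: int, arrival_rate_rps: float) -> float:
    """W = L / lambda: average latency allowed at a target rate."""
    return concurrency_limit / arrival_rate_rps


def bottleneck(tiers: list[Tier]) -> Tier:
    """Return the tier with the lowest throughput ceiling."""
    return min(tiers, key=lambda t: max_throughput(t.concurrency_limit, t.avg_latency_s))


# Illustrative tiers: only the "app" row mirrors the 200-slot / 400ms example.
example_tiers = [
    Tier("gateway", concurrency_limit=1000, avg_latency_s=0.25),
    Tier("app", concurrency_limit=200, avg_latency_s=0.4),
    Tier("db_pool", concurrency_limit=50, avg_latency_s=0.05),
]

if __name__ == "__main__":
    print(f"in flight at 500 RPS, 200ms: {required_concurrency(500, 0.2):.0f}")
    print(f"same traffic, 2x latency:    {required_concurrency(500, 0.4):.0f}")
    slowest = bottleneck(example_tiers)
    ceiling = max_throughput(slowest.concurrency_limit, slowest.avg_latency_s)
    print(f"bottleneck: {slowest.name} at about {ceiling:.0f} RPS")
```

<p>Treating each tier&#8217;s ceiling independently is a simplification: in a real call chain the upstream latency includes the downstream latency, so the inputs should come from measured per-tier averages rather than guesses.</p>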
      <p>
          <a href="https://systemdr.systemdrd.com/p/capacity-planning-modeling-using">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Immutable Infrastructure: Why You Should Never Patch Production Servers]]></title><description><![CDATA[Introduction]]></description><link>https://systemdr.systemdrd.com/p/immutable-infrastructure-why-you</link><guid isPermaLink="false">https://systemdr.systemdrd.com/p/immutable-infrastructure-why-you</guid><dc:creator><![CDATA[System Design Roadmap]]></dc:creator><pubDate>Thu, 30 Apr 2026 08:30:10 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!g-WA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff87895df-2119-4198-8397-aff8a5d57b53_3600x2240.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>Introduction</h2><blockquote><p>Your on-call rotation fires at 2 AM. A CVE dropped six hours ago, and your security team wants the patch deployed to 400 production nodes before morning. One engineer starts SSHing into boxes one-by-one. Another runs Ansible. A third realizes the first 50 boxes now have a slightly different kernel version than the rest. By dawn, you have a fleet that can&#8217;t be described accurately by any manifest, and the next incident will be twice as hard to debug because you no longer know what&#8217;s actually running.</p></blockquote><p>That is the mutable infrastructure trap, and immutable infrastructure exists specifically to make that scenario impossible.</p><div><hr></div><h2>What Immutable Infrastructure Actually Means</h2><blockquote><p>The word &#8220;immutable&#8221; is borrowed from functional programming: once a value is created, it never changes. Applied to servers, it means: <strong>once a machine image is baked and deployed, that instance is never modified</strong>. No SSH sessions. No config patches. No live package upgrades. 
If something needs to change&#8212;a new config value, a library update, a bug fix&#8212;you build a new image, replace the old instances, and terminate them.</p></blockquote><p>The operational model becomes:</p><ol><li><p><strong>Build</strong>: Code change triggers a CI pipeline that bakes a new OS image (AMI, container image, VM snapshot). Every dependency is pinned and installed fresh.</p></li><li><p><strong>Test</strong>: The image is validated in a staging environment that mirrors production.</p></li><li><p><strong>Deploy</strong>: New instances launch from the validated image. Traffic shifts via load balancer or service mesh.</p></li><li><p><strong>Terminate</strong>: Old instances drain connections and are destroyed. No orphan configs survive.</p></li></ol><p>This is fundamentally different from Ansible playbooks or Chef recipes that mutate existing machines. Those tools are applying changes to an unknown prior state. Immutable infrastructure eliminates the prior state entirely.</p><p><strong>The underlying insight</strong>: configuration drift is cumulative and invisible. Every hotfix applied directly to a server, every manually tweaked sysctl, every &#8220;temporary&#8221; cron job added during an incident&#8212;these accumulate over months until your fleet is a snowflake collection where no two boxes are identical. Automated tools can&#8217;t reliably detect what they didn&#8217;t apply. Immutable infrastructure makes drift structurally impossible because instances are never modified, only replaced.</p><p><strong>Replacement vs. In-Place Update</strong>: When you replace rather than patch, you also solve the partial-failure problem. A rolling patch across 400 nodes can leave you in a mixed state if it fails halfway. A rolling image replacement can be rolled back atomically: keep old instances, shift traffic back, terminate new ones.</p><p><strong>Image baking vs. runtime configuration</strong>: There&#8217;s an important nuance. 
Some configuration&#8212;environment-specific secrets, feature flags, endpoints&#8212;should not be baked into an image (that would mean a different image per environment). The split is: infrastructure configuration goes into the image; application configuration is injected at runtime via environment variables or a secrets manager. This keeps images environment-agnostic while still preventing runtime mutation.</p><p><strong>Immutable does not mean stateless</strong>: Stateless application tiers are the most natural fit, but databases and stateful services can participate too. The data plane (the database files) lives on persistent volumes that survive instance replacement; the control plane (the database binary, OS, config files) is replaced via the same image pipeline.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!g-WA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff87895df-2119-4198-8397-aff8a5d57b53_3600x2240.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!g-WA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff87895df-2119-4198-8397-aff8a5d57b53_3600x2240.png 424w, https://substackcdn.com/image/fetch/$s_!g-WA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff87895df-2119-4198-8397-aff8a5d57b53_3600x2240.png 848w, https://substackcdn.com/image/fetch/$s_!g-WA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff87895df-2119-4198-8397-aff8a5d57b53_3600x2240.png 1272w, 
https://substackcdn.com/image/fetch/$s_!g-WA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff87895df-2119-4198-8397-aff8a5d57b53_3600x2240.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!g-WA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff87895df-2119-4198-8397-aff8a5d57b53_3600x2240.png" width="1456" height="906" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f87895df-2119-4198-8397-aff8a5d57b53_3600x2240.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:906,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1695454,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://systemdr.substack.com/i/188577319?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff87895df-2119-4198-8397-aff8a5d57b53_3600x2240.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!g-WA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff87895df-2119-4198-8397-aff8a5d57b53_3600x2240.png 424w, https://substackcdn.com/image/fetch/$s_!g-WA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff87895df-2119-4198-8397-aff8a5d57b53_3600x2240.png 848w, 
https://substackcdn.com/image/fetch/$s_!g-WA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff87895df-2119-4198-8397-aff8a5d57b53_3600x2240.png 1272w, https://substackcdn.com/image/fetch/$s_!g-WA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff87895df-2119-4198-8397-aff8a5d57b53_3600x2240.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p></p>
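<p>The split between baked infrastructure configuration and runtime application configuration can be sketched as follows. This is one possible shape, not a prescribed pattern: the variable names <code>DB_ENDPOINT</code> and <code>FEATURE_NEW_CHECKOUT</code>, the <code>RuntimeConfig</code> class, and the hostname are hypothetical. The idea is that the same image reads environment-specific values once at startup and then treats them as read-only.</p>

```python
import os
from collections.abc import Mapping
from dataclasses import dataclass


@dataclass(frozen=True)
class RuntimeConfig:
    """Environment-specific values injected at launch, never baked into the image."""
    db_endpoint: str
    feature_new_checkout: bool


def load_runtime_config(env: Mapping[str, str] = os.environ) -> RuntimeConfig:
    """Read config once at process start; a missing required key fails fast."""
    return RuntimeConfig(
        db_endpoint=env["DB_ENDPOINT"],  # required: raises KeyError if absent
        feature_new_checkout=env.get("FEATURE_NEW_CHECKOUT", "false").lower() == "true",
    )


# The same code (and image) runs in every environment; only the injected
# mapping differs. A plain dict stands in for the process environment here.
staging = load_runtime_config({"DB_ENDPOINT": "db.staging.internal:5432"})
```

<p>Freezing the dataclass mirrors the infrastructure rule at the process level: changing a value means launching a new instance with new inputs, not mutating a running one.</p>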
      <p>
          <a href="https://systemdr.systemdrd.com/p/immutable-infrastructure-why-you">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Secret Management in Production: Vault, KMS, and Rotation Strategies]]></title><description><![CDATA[Introduction]]></description><link>https://systemdr.systemdrd.com/p/secret-management-in-production-vault</link><guid isPermaLink="false">https://systemdr.systemdrd.com/p/secret-management-in-production-vault</guid><dc:creator><![CDATA[System Design Roadmap]]></dc:creator><pubDate>Mon, 27 Apr 2026 08:31:02 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Wdcf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F516db3b2-092a-4644-a73c-a0fabe9a2eba_3600x2480.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>Introduction</h2><blockquote><p>A team ships a microservice. Six months later, a security scan finds a PostgreSQL password buried in git history &#8212; committed in a <code>.env</code> file, pushed before the <code>.gitignore</code> was set up. The password was never rotated in production, and three developers still have the original credential memorized. That&#8217;s not a hypothetical; it&#8217;s how breaches start. Secret management is the infrastructure that sits between &#8220;we have credentials&#8221; and &#8220;those credentials never leak, expire gracefully, and rotate without a deployment.&#8221;</p></blockquote><div><hr></div><h2>The Three-Layer Hierarchy</h2><p>Secret management operates across three distinct layers, and conflating them causes architectural mistakes.</p><p><strong>KMS (Key Management Service)</strong> &#8212; AWS KMS, GCP Cloud KMS, Azure Key Vault HSM &#8212; manages <em>cryptographic keys only</em>. It does not store your database passwords. Its job is to encrypt and decrypt other keys. 
You call KMS to wrap a key; KMS never sees your actual application secrets.</p><p><strong>Secret Store (HashiCorp Vault, AWS Secrets Manager, GCP Secret Manager)</strong> &#8212; manages secret <em>lifecycle</em>: storage, access control, rotation, and auditing. Vault encrypts secrets at rest using an internal barrier key. That barrier key is itself encrypted by KMS. Vault without KMS integration requires a manual unseal ceremony (covered below).</p><p><strong>Application Layer</strong> &#8212; consumes secrets via the Vault SDK, environment injection at container start, or a Vault Agent sidecar. 
The right choice depends on your rotation requirements.</p><h2>Envelope Encryption</h2><p>KMS introduces a pattern called envelope encryption, which solves a fundamental problem: you can&#8217;t send a 10 GB database to KMS to encrypt &#8212; KMS API requests have a 4 KB payload limit, and the cost would be enormous.</p><p>Instead: (1) generate a random 256-bit Data Encryption Key (DEK) locally, (2) encrypt your data with the DEK using AES-256-GCM locally, (3) send <em>only</em> the DEK to KMS to encrypt using your Customer Master Key (CMK), receiving an Encrypted DEK (EDEK) back, (4) store the EDEK alongside the ciphertext.</p><p>To decrypt: call KMS with the EDEK, receive the DEK, decrypt locally. AWS KMS charges $0.03 per 10,000 API calls. At 1,000 decrypt operations per second, that&#8217;s about 2.6 billion calls and roughly $7,800 per month (about $260 per day), which is worth metering and a common reason to cache plaintext DEKs in memory for a short time instead of calling KMS on every operation.</p><p>The critical benefit: rotating the master key means re-encrypting only the DEK (a few bytes), not re-encrypting all data. Shopify uses exactly this pattern for PCI-DSS compliance &#8212; DEK rotation is a millisecond operation regardless of database size.</p><h2>HashiCorp Vault Architecture</h2><p>Vault&#8217;s core is a cryptographic <em>barrier</em>. All data written to storage crosses this barrier and gets encrypted. The barrier key is split using Shamir&#8217;s Secret Sharing: with a 3-of-5 configuration, five key shares are generated at initialization and any three are required to unseal. On restart, Vault starts sealed &#8212; it cannot serve any requests until unsealed.</p><p>Inside the barrier, Vault has three primitives:</p><ul><li><p><strong>Auth Methods</strong>: How identities are verified. Kubernetes JWT, AWS IAM, AppRole, LDAP. The Kubernetes auth method is the dominant choice for cloud-native: pods present a bound service account token, and Vault validates it against the Kubernetes API.</p></li><li><p><strong>Secrets Engines</strong>: Plugins that generate secrets. 
KV v2 stores static secrets with versioning. The database engine generates dynamic credentials. The PKI engine issues X.509 certificates.</p></li><li><p><strong>Policies</strong>: HCL rules mapping paths to capabilities (<code>create</code>, <code>read</code>, <code>update</code>, <code>delete</code>, <code>list</code>).</p></li></ul><h2>Dynamic Secrets: The Core Value Proposition</h2><p>Static secrets in KV have a fundamental problem: they exist until you explicitly rotate them. An attacker who exfiltrates a static credential has indefinite access.</p><p>Dynamic secrets invert this model. When an app requests a PostgreSQL credential from Vault&#8217;s database engine, Vault connects to PostgreSQL, executes <code>CREATE USER vault_&lt;random&gt; WITH PASSWORD '&lt;random&gt;'</code>, grants permissions per the configured role, and returns the credential with a lease TTL (e.g., 1 hour). When the TTL expires &#8212; or when explicitly revoked &#8212; Vault executes <code>DROP USER vault_&lt;random&gt;</code>. The credential was useful for exactly its lifetime.</p><p>Every dynamic credential is unique per requester, per request. 
A compromised credential from Pod A cannot be used by Pod B, and it self-destructs at TTL.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Wdcf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F516db3b2-092a-4644-a73c-a0fabe9a2eba_3600x2480.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Wdcf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F516db3b2-092a-4644-a73c-a0fabe9a2eba_3600x2480.png 424w, https://substackcdn.com/image/fetch/$s_!Wdcf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F516db3b2-092a-4644-a73c-a0fabe9a2eba_3600x2480.png 848w, https://substackcdn.com/image/fetch/$s_!Wdcf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F516db3b2-092a-4644-a73c-a0fabe9a2eba_3600x2480.png 1272w, https://substackcdn.com/image/fetch/$s_!Wdcf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F516db3b2-092a-4644-a73c-a0fabe9a2eba_3600x2480.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Wdcf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F516db3b2-092a-4644-a73c-a0fabe9a2eba_3600x2480.png" width="1456" height="1003" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/516db3b2-092a-4644-a73c-a0fabe9a2eba_3600x2480.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1003,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1261437,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://systemdr.substack.com/i/188362358?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F516db3b2-092a-4644-a73c-a0fabe9a2eba_3600x2480.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Wdcf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F516db3b2-092a-4644-a73c-a0fabe9a2eba_3600x2480.png 424w, https://substackcdn.com/image/fetch/$s_!Wdcf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F516db3b2-092a-4644-a73c-a0fabe9a2eba_3600x2480.png 848w, https://substackcdn.com/image/fetch/$s_!Wdcf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F516db3b2-092a-4644-a73c-a0fabe9a2eba_3600x2480.png 1272w, https://substackcdn.com/image/fetch/$s_!Wdcf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F516db3b2-092a-4644-a73c-a0fabe9a2eba_3600x2480.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<div><hr></div><h2>Critical Insights</h2><p><strong>Static secrets in environment variables are a rotation anti-pattern.</strong> A secret injected at container start cannot be rotated without restarting the container. Worse, env vars are readable by any code in the process &#8212; including third-party SDKs. Use Vault Agent with templated files instead: the agent rewrites a credentials file on rotation, and the app watches the file for changes.</p><p><strong>Auto-unseal via KMS changes the security threat model.</strong> Traditional unsealing requires multiple key holders to physically present their key shares &#8212; a ceremony analogous to nuclear launch authorization. Auto-unseal (KMS wraps the barrier key) is operationally convenient and necessary for HA deployments, but it means control of the KMS CMK = control of Vault. 
Document this dependency in your threat model and restrict CMK access rigorously.</p><p><strong>Lease renewal storms are a hyperscale failure mode.</strong> If 50,000 pods start simultaneously &#8212; as happens after a cluster-wide deployment or a mass restart &#8212; all their 1-hour leases are issued within seconds of each other. With a naive renew-at-half-TTL policy, all 50,000 pods attempt lease renewal at the 30-minute mark simultaneously. Vault&#8217;s Raft FSM processes renewals serially. Solution: set renewal trigger at 70% TTL plus <code>&#177;(TTL * 0.1 * random())</code> jitter. Vault Agent handles this automatically.</p><p><strong>The &#8220;secret zero&#8221; problem</strong> &#8212; how an app authenticates to Vault without a pre-shared secret &#8212; is solved by platform identity. Kubernetes workload identity tokens, AWS IAM role assumption, and GCP service accounts attached to the workload all establish identity without an initial credential. (Exported GCP service account key files do not qualify: they are long-lived secrets in their own right.) AppRole (Vault&#8217;s own auth method) still requires bootstrapping a RoleID and SecretID through an external mechanism; use it only when platform identity isn&#8217;t available.</p><p><strong>Vault namespace isolation (Enterprise feature)</strong> allows teams to operate independent Vault instances within a shared cluster. Each namespace has its own auth methods, secrets engines, and policies. A credential leak in the <code>payments</code> namespace cannot access secrets in the <code>ml-training</code> namespace. Lyft adopted this pattern to enforce blast radius boundaries across 300+ microservices.</p><p><strong>Raft leader elections introduce a 1&#8211;3 second unavailability window</strong> when the active Vault node fails. Applications must implement retry logic with exponential backoff. 
A naive app that fails immediately on a 503 from Vault will shed the Vault HA benefit entirely.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Msib!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05895e32-bff5-47ce-bcac-63a2c9dfbfee_3600x2320.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Msib!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05895e32-bff5-47ce-bcac-63a2c9dfbfee_3600x2320.png 424w, https://substackcdn.com/image/fetch/$s_!Msib!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05895e32-bff5-47ce-bcac-63a2c9dfbfee_3600x2320.png 848w, https://substackcdn.com/image/fetch/$s_!Msib!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05895e32-bff5-47ce-bcac-63a2c9dfbfee_3600x2320.png 1272w, https://substackcdn.com/image/fetch/$s_!Msib!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05895e32-bff5-47ce-bcac-63a2c9dfbfee_3600x2320.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Msib!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05895e32-bff5-47ce-bcac-63a2c9dfbfee_3600x2320.png" width="1456" height="938" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/05895e32-bff5-47ce-bcac-63a2c9dfbfee_3600x2320.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:938,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1534516,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://systemdr.substack.com/i/188362358?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05895e32-bff5-47ce-bcac-63a2c9dfbfee_3600x2320.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Msib!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05895e32-bff5-47ce-bcac-63a2c9dfbfee_3600x2320.png 424w, https://substackcdn.com/image/fetch/$s_!Msib!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05895e32-bff5-47ce-bcac-63a2c9dfbfee_3600x2320.png 848w, https://substackcdn.com/image/fetch/$s_!Msib!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05895e32-bff5-47ce-bcac-63a2c9dfbfee_3600x2320.png 1272w, https://substackcdn.com/image/fetch/$s_!Msib!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05895e32-bff5-47ce-bcac-63a2c9dfbfee_3600x2320.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><div><hr></div><h2>Real-World Examples</h2><p><strong>Netflix</strong> manages secrets across 100,000+ microservices using Vault with a Spinnaker pipeline integration that injects Vault tokens at deploy time. Their PKI engine issues 4-hour TLS certificates, eliminating certificate revocation list (CRL) maintenance &#8212; short TTLs make revocation unnecessary, since a compromised certificate expires before it can be significantly abused.</p><p><strong>Shopify</strong> combines AWS KMS envelope encryption for data-at-rest with Vault for runtime secret injection. Their database credentials are dynamic, generated per-deploy with a 24-hour TTL aligned to their deployment cycle. 
Critically, they pin credential TTL to slightly longer than their P99 deployment duration to prevent mid-deploy credential expiry.</p><p><strong>Square</strong> uses Vault&#8217;s transit secrets engine as a shared encryption-as-a-service layer. Rather than distributing encryption keys to individual services, services send plaintext to Vault&#8217;s transit engine and receive ciphertext back &#8212; Vault never stores the data, and the encryption key never leaves Vault&#8217;s barrier.</p><div><hr></div><h2><strong>GitHub Link</strong></h2><pre><code><a href="https://github.com/sysdr/sdir/tree/main/Secret_Management_in_Production/vault-secrets-demo">https://github.com/sysdr/sdir/tree/main/Secret_Management_in_Production/vault-secrets-demo</a></code></pre><div><hr></div><h2>Architectural Considerations</h2><p>Monitor Vault&#8217;s <code>/v1/sys/health</code> for sealed status, <code>vault.token.ttl</code> metric for approaching token expirations, and lease creation/revocation rates for anomalies. An alert on &#8220;lease creation rate drops to zero&#8221; catches Vault outages before apps notice.</p><p>The main cost drivers are KMS API calls, Vault Enterprise licensing (~$30k/year/cluster), and the operational overhead of managing Vault HA. For smaller teams, AWS Secrets Manager at $0.40/secret/month with automatic rotation built in can be more cost-effective than operating Vault.</p><p>Do not put non-sensitive configuration in Vault. Feature flags, timeout values, and service URLs belong in a config store (etcd, Consul KV, LaunchDarkly). Vault is optimized for secrets &#8212; its audit logging, encryption, and access control add overhead inappropriate for high-frequency configuration reads.</p><div><hr></div><h2>Practical Takeaway</h2><p>Run <code>bash setup.sh</code> to deploy a complete secret management stack: HashiCorp Vault, PostgreSQL, and a Node.js service with a real-time dashboard. 
The demo shows KV v2 secrets with versioning, dynamic PostgreSQL credentials with live TTL countdowns, lease revocation, and a simulated rotation event stream.</p><p>Specifically: watch a dynamic credential get created, connect to PostgreSQL with it directly, then watch Vault revoke the DB user when the lease expires &#8212; the credential becomes invalid in real time, no deployment required.</p><p>After the demo, explore extending it with Vault Agent sidecar injection (add a <code>vault_agent</code> service to the compose file), or enable the PKI engine to issue short-lived TLS certificates. Both patterns are production-standard at major tech companies and directly relevant to Staff+ system design interviews where secret lifecycle management and zero-trust credential issuance are increasingly common topics.</p><p>Run <code>bash cleanup.sh</code> to remove all containers and volumes when finished.</p><h3>YouTube Demo Link:</h3><div id="youtube2-qYq0bqBW3Jw" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;qYq0bqBW3Jw&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/qYq0bqBW3Jw?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div>]]></content:encoded></item><item><title><![CDATA[Distributed Tracing Sampling Strategies: Balancing Visibility vs. Storage Costs]]></title><description><![CDATA[Introduction]]></description><link>https://systemdr.systemdrd.com/p/distributed-tracing-sampling-strategies</link><guid isPermaLink="false">https://systemdr.systemdrd.com/p/distributed-tracing-sampling-strategies</guid><dc:creator><![CDATA[System Design Roadmap]]></dc:creator><pubDate>Fri, 24 Apr 2026 08:31:10 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!fS3n!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78eb85fc-8d51-454b-ab8f-8753253d0b74_3680x2080.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>Introduction</h2><blockquote><p>At 10 million requests per minute, storing a complete trace for every request would flood your Jaeger backend with roughly 400&#8211;600 GB of span data per hour, depending on service depth. Nobody does that. You sample. But sampling is not just &#8220;keep 1% of traces and move on.&#8221; The decision of <em>which</em> traces to keep, <em>when</em> to make that decision, and <em>how</em> to adapt under load separates teams that debug in minutes from teams that fly blind during incidents.</p></blockquote><div><hr></div><h2>What Sampling Actually Does</h2><blockquote><p>A distributed trace is a tree of spans &#8212; each span recording one unit of work (an RPC call, a database query, a cache lookup) with timestamps, metadata, and status codes. 
In a system with 30 microservices and 8-hop average request depth, a single user request can generate up to ~240 spans (30 &#215; 8). At 10M RPM, that&#8217;s up to 2.4 billion spans per minute.</p></blockquote><p>Sampling is the process of deciding which trace trees to persist and which to discard. Every sampling strategy must answer two questions: <strong>when does the decision happen</strong>, and <strong>what information is available at decision time</strong>?</p><h3>Head-Based Sampling</h3><p>The sampling decision is made at the trace&#8217;s <em>entry point</em> &#8212; before any downstream spans exist. The API gateway or load balancer flips a weighted coin: with 10% probability the trace is marked sampled; the other 90% of the time it is marked discarded. All downstream services check the trace context header and skip recording if the trace is marked discarded.</p><p><strong>Mechanism</strong>: The trace context (W3C Trace Context spec, or Jaeger&#8217;s <code>uber-trace-id</code>) carries a <code>sampled</code> flag. Downstream services read this flag and skip span creation entirely, saving both CPU and network overhead.</p><p><strong>The fatal flaw</strong>: You make the keep/drop decision before you know whether anything interesting happened. A payment that timed out at step 7 of 8 &#8212; dropped at step 0 because the coin flip went against it. A 4-second database stall &#8212; dropped. An auth service returning 403 for a premium user &#8212; dropped. Head-based sampling is statistically unbiased but operationally blind.</p><h3>Tail-Based Sampling</h3><p>The decision is deferred until the trace is <em>complete</em>. All spans from all services flow into a central buffer. After a configurable window (typically 2&#8211;30 seconds), a tail-sampling processor evaluates the complete trace tree and decides: does this trace contain an error? Was end-to-end latency above the P99 threshold? Did it hit a rare code path?</p><p><strong>Mechanism</strong>: The buffer stores spans in-memory, grouped by trace ID. 
When a trace is complete (all spans received, or the timeout fires), a set of rules runs: <code>has_error OR latency &gt; threshold OR service_count &gt; N</code>. Matching traces are written to storage; non-matching traces are discarded.</p><p><strong>The cost</strong>: You buffer <em>everything</em> before deciding. Memory scales with <code>(RPS) &#215; (avg trace duration) &#215; (spans per trace) &#215; (avg span size)</code>. At 10K RPS, 500ms average duration, and 8 spans of 2KB each, about 160MB/sec flows through the buffer and roughly 80MB is resident in RAM at any instant, more if traces are held for the full decision window. Manageable until your latency distribution has a long tail &#8212; a few 30-second traces balloon your buffer by orders of magnitude.</p><h3>Adaptive (Dynamic) Sampling</h3><p>The sampling rate adjusts automatically based on observed traffic volume, aiming for a target throughput: &#8220;keep 100 interesting traces per second regardless of incoming load.&#8221; When traffic is 1,000 RPS, sample at 10%. When traffic spikes to 50,000 RPS, drop to 0.2%.</p><p><strong>Mechanism</strong>: A feedback controller tracks the actual kept-trace rate against the target. If the actual rate exceeds the target for N consecutive seconds, it tightens the sampling probability. If under target, it relaxes. Per-operation tracking (Jaeger&#8217;s adaptive sampler) sets different rates per endpoint &#8212; <code>/health</code> at 0.001%, <code>/checkout</code> at 5%.</p><p><strong>Non-obvious failure</strong>: Adaptive samplers can oscillate. Traffic spikes &#8594; rate drops &#8594; fewer traces &#8594; pressure decreases &#8594; rate rises &#8594; kept-trace volume overshoots the target again. 
Use exponential smoothing (EWMA) on the rate adjustment, not raw instantaneous values.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fS3n!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78eb85fc-8d51-454b-ab8f-8753253d0b74_3680x2080.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fS3n!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78eb85fc-8d51-454b-ab8f-8753253d0b74_3680x2080.png 424w, https://substackcdn.com/image/fetch/$s_!fS3n!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78eb85fc-8d51-454b-ab8f-8753253d0b74_3680x2080.png 848w, https://substackcdn.com/image/fetch/$s_!fS3n!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78eb85fc-8d51-454b-ab8f-8753253d0b74_3680x2080.png 1272w, https://substackcdn.com/image/fetch/$s_!fS3n!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78eb85fc-8d51-454b-ab8f-8753253d0b74_3680x2080.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fS3n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78eb85fc-8d51-454b-ab8f-8753253d0b74_3680x2080.png" width="1456" height="823" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/78eb85fc-8d51-454b-ab8f-8753253d0b74_3680x2080.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:823,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1196641,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://systemdr.substack.com/i/188340862?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78eb85fc-8d51-454b-ab8f-8753253d0b74_3680x2080.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fS3n!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78eb85fc-8d51-454b-ab8f-8753253d0b74_3680x2080.png 424w, https://substackcdn.com/image/fetch/$s_!fS3n!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78eb85fc-8d51-454b-ab8f-8753253d0b74_3680x2080.png 848w, https://substackcdn.com/image/fetch/$s_!fS3n!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78eb85fc-8d51-454b-ab8f-8753253d0b74_3680x2080.png 1272w, https://substackcdn.com/image/fetch/$s_!fS3n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78eb85fc-8d51-454b-ab8f-8753253d0b74_3680x2080.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
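<p>The adaptive feedback loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Jaeger&#8217;s implementation: the <code>AdaptiveSampler</code> class, its <code>observe</code> and <code>should_sample</code> methods, and the <code>alpha</code> smoothing weight are hypothetical names, and the EWMA is applied to the observed traffic rate, which is only one of several places the smoothing could go.</p><pre><code>import random


class AdaptiveSampler:
    """Hypothetical adaptive sampler that steers kept traces toward a target rate."""

    def __init__(self, target_per_sec: float, alpha: float = 0.2, seed=None):
        self.target_per_sec = target_per_sec  # e.g. 100 kept traces per second
        self.alpha = alpha                    # EWMA weight given to the newest measurement
        self.smoothed_rps = None              # EWMA of incoming requests per second
        self.probability = 1.0                # current keep probability
        self._rng = random.Random(seed)

    def observe(self, incoming_rps: float) -> float:
        """Feed one per-second traffic measurement; return the new keep probability."""
        if self.smoothed_rps is None:
            # First measurement seeds the average.
            self.smoothed_rps = incoming_rps
        else:
            # Move only a fraction (alpha) of the way toward the new measurement.
            self.smoothed_rps += self.alpha * (incoming_rps - self.smoothed_rps)
        if self.smoothed_rps > 0:
            self.probability = min(1.0, self.target_per_sec / self.smoothed_rps)
        else:
            self.probability = 1.0
        return self.probability

    def should_sample(self) -> bool:
        """Head-style coin flip using the current keep probability."""
        return self.probability > self._rng.random()


sampler = AdaptiveSampler(target_per_sec=100)
print(sampler.observe(1_000))    # steady traffic
print(sampler.observe(50_000))   # first second of a spike
</code></pre><p>With a target of 100 kept traces per second, steady traffic of 1,000 RPS yields a keep probability of 10%. The first second of a spike to 50,000 RPS only moves the smoothed estimate to 10,800 RPS, so the probability falls to roughly 0.9% rather than jumping straight to 0.2%; it converges on 0.2% only if the spike persists. The trade-off is that the sampler briefly keeps more than its target during the spike. A smaller <code>alpha</code> damps oscillation further at the cost of reacting more slowly to sustained load.</p>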
      <p>
          <a href="https://systemdr.systemdrd.com/p/distributed-tracing-sampling-strategies">
              Read more
          </a>
      </p>
   ]]></content:encoded></item></channel></rss>