System Design Interview Roadmap

System Design Interview Roadmap

Designing a Chat System: Storing History, Read Receipts, and Online Status (Presence)

Feb 12, 2026
∙ Paid

The Three-Second Problem

You send a message in a group chat with 47 people. Within three seconds, you need to know: did it save? Who saw it? Who’s actually online right now? That three-second window contains six distinct database writes, 47 WebSocket notifications, and a presence check across three data centers. Miss any piece, and users lose trust in your platform. This is why WhatsApp processes over 100 billion messages daily while maintaining sub-second delivery guarantees—every architectural decision cascades into user perception of reliability.

The Three Pillars of Chat Architecture

Building a chat system requires solving three interconnected challenges simultaneously: durable message storage, accurate read receipt tracking, and real-time presence management. Each pillar has failure modes that aren’t immediately obvious.

Message Storage: Write-Through vs Write-Behind

Most chat systems face a critical decision: do you confirm message delivery after writing to the database (write-through) or after acknowledging receipt in memory (write-behind)? WhatsApp uses write-through—your message doesn’t show a checkmark until it’s persisted to their distributed storage. This adds 50-100ms of latency but guarantees durability. Slack, conversely, uses write-behind with append-only logs, acknowledging messages immediately then asynchronously persisting to their data warehouse. The trade-off: Slack feels faster but has a small window where messages could be lost in a catastrophic failure.

The storage schema matters enormously at scale. Discord stores messages in ScyllaDB using a compound partition key: (channel_id, bucket) where bucket represents a 10-day window. This prevents hotspot partitions when popular channels receive thousands of messages per minute. Without bucketing, all writes to a single channel would hit the same partition, creating a bottleneck. They learned this the hard way when a server with 1.5 million users experienced 90-second message delays because all messages shared a single partition key.

Read Receipts: The Phantom Update Problem

Read receipts seem simple—just update a timestamp when someone reads a message, right? Wrong. At scale, you face the “phantom update problem”: if 1,000 people are in a channel and one person sends a message, you potentially need to track 1,000 read states. If those users read the message within the same minute, you’re writing 1,000 database updates in rapid succession.

Telegram solves this with a distributed counter architecture. Each read receipt increments a Redis counter, and every 10 seconds, a background job flushes aggregated counts to PostgreSQL. The sender sees “1,847 views” instead of individual timestamps, drastically reducing write amplification. For one-on-one chats where precision matters, they use a hybrid approach: instant updates for direct messages, aggregated counts for groups.

The critical edge case: what happens when a user goes offline mid-read? Signal handles this by storing “last_read_message_id” per conversation. When you reconnect, the client sends a batch read receipt for all messages since that ID. This prevents the “unread badge flickering” problem where offline periods cause read states to become inconsistent.

User's avatar

Continue reading this post for free, courtesy of System Design Roadmap.

Or purchase a paid subscription.
© 2026 SystemDR Inc · Privacy ∙ Terms ∙ Collection notice
Start your SubstackGet the app
Substack is the home for great culture