Skip to main content
Essay//23 MIN READ

Real-Time at Scale: How Distributed Systems Handle Millions of Events Per Second

A deep dive into how real-time systems are built, how they break under load, and how engineers scale them, covering Kafka, WebSockets, backpressure, ordering guarantees, and the tradeoffs that define production architecture.

real-time·distributed-systems·system-design·kafka·backend·scalability·streaming
Real-Time at Scale: How Distributed Systems Handle Millions of Events Per Second

You watch a live score tick up without refreshing. You see a friend's cursor glide across a shared doc. You get matched with a driver in under three seconds. Every one of these moments is a real-time system doing its job underneath.

At small scale, these systems are about as simple as they look. But turn the numbers up and you are suddenly solving a completely different problem. This post is about that gap, the space between "works on my laptop" and "works for ten million people at once." The primitives barely change. What changes is which failures you are willing to accept, and that turns out to be the entire job.

What "Real-Time" Actually Means

"Real-time" is a slippery word. People use it for everything from sub-millisecond trading systems to dashboards that refresh twice a minute. Before going further, it helps to split it into three categories.

The first is hard real-time, where missing a deadline counts as a full system failure. Think aircraft flight controllers, pacemakers, and industrial control systems. The deadline is absolute, and blowing past it has physical consequences.

The second is soft real-time, where deadlines matter but the occasional miss is survivable. A live video stream that buffers for 200ms is annoying, not dangerous. Most consumer software that calls itself real-time lives here.

The third is near real-time, where a delay of a few seconds is perfectly fine. Most analytics dashboards, monitoring pipelines, and "live" feeds actually sit in this bucket. They feel instant only because nobody can perceive a two-second lag in aggregated data.

Almost all of us are building soft or near-real-time systems. That is what this post is about. And the reason the distinction matters is not pedantic: the category you are actually in is your latency budget, and your latency budget is what every later tradeoff gets spent against. Engineers who skip this step end up paying for hard-real-time guarantees on a feed where two seconds of staleness would have been completely fine.

The Building Blocks

Before we talk architecture, it is worth knowing the individual parts and what each one is actually responsible for.

Event streams are the raw material of real-time data. An event is an immutable record that something happened: a user clicked, a sensor fired, an order was placed. Once it exists, you never edit it. A stream is just an ordered sequence of these records, and that ordering is the thing that lets you rebuild state over time.

Message brokers sit between the things producing events and the things consuming them. Their whole purpose is decoupling. The producer does not need to know who is listening, and the consumer does not need to know who published. Kafka is the broker you will see most often in production. It groups events into topics, splits each topic into partitions so work can happen in parallel, and lets multiple consumer groups read the same data independently without tripping over each other. Every event has an offset, which is just its position inside a partition. Consumers track their own offset, so they decide how far behind they are willing to read. That last detail is quietly the most important one in this whole post: because the consumer owns its offset, the broker never pushes harder than the consumer pulls. Most of the scaling story below is really a story about what happens to that offset gap under load.

WebSockets and SSE are how data actually reaches the client. Plain HTTP is request-and-response, which makes it a bad fit for pushing data out. WebSockets fix this by holding open a persistent two-way connection. SSE is the simpler cousin: a one-way stream from server to client over a normal HTTP connection, with less overhead. Which one you reach for comes down to whether the client ever needs to talk back. The thing to internalize early is that a held-open connection is not free the way a stateless HTTP request is. It is a live object sitting in memory on some server for as long as the user has the tab open, and that single fact is what makes fan-out the hardest part of the whole system later on.

In-memory stores like Redis are your fast-read layer. Kafka is durable, but it is not built for instant point lookups. Anything you need to read in microseconds, like who is currently online, live leaderboard positions, or rate limit counters, belongs in Redis. It is the bridge between your event pipeline and the layer that serves users.

Low-latency networking is the constraint sitting quietly underneath all of this. Physical distance is something no amount of clever code can erase. A request from Mumbai to a server in Virginia eats roughly 200ms of round-trip time before your application even gets a look at it. That is physics, not a bug you can fix, and it is why the last section of this post is about moving compute physically closer to people rather than optimizing code that was never the bottleneck.

Stitched together, the basic flow looks like this:

Producer --> Broker (Kafka) --> Consumer --> Application State (Redis) --> Client (WebSocket)

Everything else in this post is either a refinement of this pipeline or a story about what happens when one piece of it breaks.

DiagramClick to expand

A Real Architecture: Live Match Scores

Let's make this concrete with a live cricket score feed serving 5 million people at once during an India match. It is a great example because it slams into every real-time scaling problem at the same time: high write volume, enormous fan-out, and an audience that notices a two-second delay instantly.

A score ingestion service takes in ball-by-ball events from the official data provider. Each event gets published to a Kafka topic called match-events, partitioned by match ID. Partitioning by match ID keeps all events for a single match in the same partition, which preserves per-match ordering without any global coordination.

A score processor reads from that topic and works out the derived state: current score, run rate, partnership data, wagon wheel. It writes the result into Redis with a TTL. This consumer is stateful. It keeps a running tally in memory, which means every event for a given match has to land on the same processor instance.

A fan-out service reads the processed state from Redis and pushes updates to connected clients over WebSockets. This is where the scale problem really concentrates, and it is worth being honest about why. The write side is almost trivial here: a cricket match produces maybe one meaningful event every few seconds. You could ingest every live international match on earth on a single Kafka broker without noticing. The hard number is on the other end. Five million open connections, each one a live object holding a socket, a send buffer, and a bit of session state, is on the order of tens of gigabytes of memory that exists whether or not anything is happening. A well-tuned Node or Go gateway holds somewhere around 50,000 to 100,000 concurrent WebSocket connections before memory and event-loop scheduling start to degrade. Do the division and 5 million users is not a tuning problem, it is 50 to 100 machines whose entire job is holding connections open. The bottleneck was never throughput. It was the connection count, and the connection count is set by your product's popularity, not by anything you can optimize in the hot path.

A presence service tracks who is watching which match, stored in Redis sorted sets. With that, the fan-out service can push only to the connections watching a given match instead of blasting everyone. Across 5 million users split over dozens of live matches, that targeting is not a nicety, it is the difference between pushing one update to 200,000 relevant sockets and waking up all 5 million to discard 96% of the work. At this scale, knowing precisely who to ignore is the optimization.

DiagramClick to expand

The shape is simple. Every hard part lives in the gap between the processor writing one update and 200,000 sockets needing it 50 milliseconds later.

How Real-Time Systems Break at Scale

Now for the interesting part. Here is what actually happens when 10 million people open the app in the same moment, say at the toss of an India-Pakistan World Cup final.

Backpressure

This is the most common failure mode and the easiest to miss. Your consumers cannot keep up with how fast producers are writing. The Kafka topic starts piling up lag, which is just the growing gap between the newest produced offset and the newest consumed offset. If nothing reacts to that signal, the consumer keeps falling further behind. Users start seeing scores from three overs ago. Nothing has crashed. The system is simply serving the past.

Watch it happen on a single partition. A wicket falls and a flurry of events lands at once: produced offset jumps to 8,400 while your processor sits at 8,200. That 200-event gap is recoverable, and with a little processing headroom it closes in well under a second, so nobody notices. Now the processor slows down, say a Redis write starts taking 40ms instead of 2ms because Redis is busy. Production holds steady at its old rate, consumption drops to a fraction of it, and the gap stops being a transient. It grows by a few hundred events every second, linearly, with no ceiling. A minute and a half in, the consumer is tens of thousands of events behind and the score on screen is from two overs ago. There was never an error. There was a line on a graph with a positive slope, and the only question that ever mattered was whether anyone was looking at its slope rather than its current value.

That is the whole tell with backpressure: it does not throw, it accumulates. Graph consumer lag, alert on the derivative and not the number, and the failure announces itself well before your users do. Skip that one graph and the system will lie to everyone, politely, until the complaints arrive.

Thundering Herd

Ten million clients connect inside the same thirty-second window. Every service behind the WebSocket gateway gets hit with a vertical wall of load at once. Connection pools drain. Redis starts queuing commands because its event loop is maxed out. The score processor's CPU pins at 100% because it is also fielding presence lookups from 10 million fresh connections. Each struggling component triggers retries from whatever depends on it, which only piles onto the original spike. That is the thundering herd: demand arriving all at once, hard enough to flatten even a system you thought was well provisioned.

The detail that separates people who have lived through this from people who have read about it: the herd is rarely your users alone. It is your own retries. The first wave of failures generates a second wave of retry traffic that is often larger than the original spike, and that secondary wave is what actually takes you down. Which is why the fixes that matter are jittered backoff, request coalescing, and circuit breakers, all aimed at your own clients, not just rate limits aimed at users.

Head-of-Line Blocking

One slow or malformed event at the front of a partition stalls everything behind it. The processor is stuck on a retry or an error handler. Meanwhile every later event for that match just sits in the partition, unread, until the jam clears. Users watching that match see a frozen feed. Every other match on a different partition keeps humming along. From the outside the system looks healthy, but it is quietly broken for exactly the audience you cannot afford to lose during a marquee game.

This is the case for a dead-letter queue. If an event fails twice, shunt it aside to be dealt with out of band and let the partition keep moving. You give up perfect per-event completeness in exchange for liveness, and on a score feed that is the right trade every time.

Out-of-Order Delivery

Events show up at the consumer in the wrong order. Ball 47.3 arrives before ball 47.2 because they got retried along different network paths. The score display flashes the wrong state for a moment before correcting itself. In a score feed that is a cosmetic glitch. In a trading system it is a correctness bug that can cost real money. Either way, you design for it on purpose: a sequence number on every event and a consumer that buffers anything arriving early instead of applying it. That quietly converts an ordering problem into a buffering problem, which is one you can actually reason about. Hoping it never happens is not a strategy.

Clock Skew

The ingestion service and the processor run on different machines, and their clocks drift apart over time even with NTP keeping them roughly honest. An event stamped at time T on the producer gets processed at a wall-clock moment that does not line up with the broker's clock. If your processor does any time-windowed math, like "events in the last 5 seconds," that drift quietly poisons the window boundaries. You end up with wrong aggregates and no error to chase. So never trust wall-clock time for ordering or windowing across machines. Stamp the timestamp at the source, carry it on the event, and let the stream processor reason about lateness explicitly.

Cascading Failures

Redis slows down under the connection surge. The fan-out service starts timing out on Redis reads and retries them. Those retries make Redis even slower. The score processor, which also reads Redis for match state, begins timing out too, so it backs up. Kafka consumer lag climbs. The WebSocket layer is now pushing stale data because the processor has fallen too far behind to refresh Redis. One component's slowness has spread into every component's failure. The system never crashed. It degraded in a coordinated, tangled way that is far harder to debug than a clean outage.

The failure almost never starts where the symptoms show up. Your alert fires on stale WebSocket data; the root cause is three hops upstream in a saturated Redis. That is what makes cascades miserable to debug and why the defense is structural rather than reactive: bulkheads between dependencies, and timeouts that fail fast instead of retrying into a brownout. Done right, one slow dependency degrades one feature instead of laminating into a system-wide stall.

DiagramClick to expand

How Real-Time Systems Scale

Each technique below answers one or more of the failures above directly.

Partitioning

Partitions are Kafka's basic unit of parallelism. More partitions means more consumer instances reading at the same time, which is the direct cure for backpressure. For the score feed, partitioning by match ID buys you natural isolation: one match's thundering herd cannot starve another match's partition. The price is global ordering. You get strong ordering inside a partition and zero ordering guarantees across partitions. Picking the right partition key is really the decision about which ordering guarantees you keep and which you let go, and it is one of the few decisions here that is genuinely hard to walk back once you have data flowing, so it deserves more thought than it usually gets.

Consumer Groups

A consumer group is a set of consumer instances splitting the work of reading a topic. Kafka hands each partition to exactly one consumer in the group at a time, so nothing gets processed twice. To scale throughput, you add consumers up to the number of partitions. Past that point, extra consumers do nothing because there are no partitions left to give them. That is why partition count is a capacity decision you make when you create the topic: it sets the hard ceiling on parallelism, and raising it later forces a repartition that breaks your ordering guarantees mid-flight. Over-provision partitions slightly at creation time. Empty partitions cost almost nothing; a repartition under load costs you the thing you were trying to protect.

Backpressure Strategies

Instead of letting lag creep up in silence, you measure it and build deliberate responses. You can slow producers down by switching to a pull model where consumers ask for batches when they are ready. You can shed load by dropping lower-priority events once lag crosses a threshold. Or you can auto-scale consumer instances when lag breaks your SLA. Which one fits depends on a single question you should answer before writing any code: can this system stomach dropped events? A score feed can. A payment pipeline cannot. That one answer eliminates most of the option space for you.

Stream Processing Engines

Plain event-to-state updates fit fine in a basic consumer. But anything fancier, like windowed aggregations, joins across streams, or stateful pattern matching, needs a real stream processor.

Kafka Streams runs right inside your application process and uses Kafka itself to store state. It is the lowest-overhead choice if you are already on Kafka. Apache Flink is a separate cluster built for complex event processing, with exactly-once guarantees and serious windowing support. It is what Uber runs their real-time data platform on. Spark Streaming is the older option and fits micro-batch workloads better than true continuous streaming. It is tempting to reach for Flink because it is powerful, but Flink is a distributed system you now have to run, monitor, and recover. Earn that operational weight by letting a plain consumer fail you first, then adopt the engine that fixes the specific thing that broke.

Load Shedding

When demand outstrips what the system can handle, deliberately throw away low-priority work rather than letting everything sink together. For a score feed, that might mean dropping historical data queries to protect live event delivery. Rate limiting at the API gateway stops any single client from amplifying the load during a spike. The core idea is to decide ahead of time which traffic is expendable, instead of letting the system make that call by failing at random. A system that has not been told what to drop will drop whatever happens to break first, and that is never the thing you would have chosen.

CQRS

CQRS, short for Command Query Responsibility Segregation, means splitting the write path and the read path completely. Writes flow through Kafka and the score processor. Reads go straight to Redis. The two paths have very different shapes: writes are low-volume and stateful, reads are high-volume and stateless. Separating them lets you scale each one on its own, which is exactly what makes it possible to serve 5 million concurrent reads without that load ever touching the processor writing new state. Once you see the read and write paths as two systems with different scaling laws, half the architecture in this post stops looking like a collection of tricks and starts looking like one idea applied repeatedly.

Event Sourcing

Instead of storing the current state, you store the full sequence of events that produced it. The current score is not a row in a database. It is what you get by replaying every ball event since the start of the innings. This gives you a full audit trail and the ability to reconstruct state at any past moment, and it maps cleanly onto Kafka's log structure. The catch is that reads now require computation instead of a simple lookup, and replaying long histories gets expensive as a match drags on. The standard answer is periodic snapshots so you replay from the last checkpoint rather than the beginning, which is worth knowing before you adopt the pattern and discover the cost yourself.

Edge Computing

For last-mile latency, push compute closer to the people. A CDN edge node in Mumbai handling WebSocket connections for Indian users wipes out the round-trip to a central data center. The fan-out service runs at the edge and reads from a regional Redis replica. Tools like Cloudflare Durable Objects are making this pattern far easier to build on without having to run your own edge infrastructure. The judgment call is that the edge is where you put the read and fan-out path, the part that benefits from being near users, while writes still flow back to a central source of truth. Pushing the write path to the edge buys you a consistency problem that is almost never worth it.

DiagramClick to expand

The Tradeoffs

Every architectural choice in a real-time system trades one thing away to gain another. No configuration wins on every axis at once.

Latency vs Consistency

The CAP theorem says that in a distributed system, during a network partition, you have to pick between consistency and availability. Consistency means every read returns the most recent write. Availability means the system keeps answering, even if the answer is a little stale. Most real-time systems pick availability on purpose. A score feed showing data that is two seconds old is a minor annoyance. A score feed that goes dark during the final is a product disaster. That choice, availability over consistency, is not a flaw in the design. It is a deliberate bet that staleness hurts less than downtime, and the reason it should be made explicitly, written down and agreed on, is that the day it bites you is the day everyone forgets it was ever a choice.

Throughput vs Ordering

Ordering guarantees cost you, because enforcing them takes coordination. Kafka gives you strong ordering inside a partition, but spreading work across partitions breaks global ordering. More throughput means more partitions, which means weaker ordering. The real question is not whether to keep ordering but at what granularity it actually matters. A score feed needs per-match ordering. A bank ledger needs per-account ordering. A click analytics pipeline might need none at all. Find the smallest unit that actually needs to stay ordered, make that your partition key, and you have bought yourself the maximum throughput your correctness requirements allow.

Exactly-Once vs At-Least-Once Delivery

At-least-once is the default almost everywhere. If an event might have been lost, retry it. The side effect is the occasional duplicate. Exactly-once delivery is possible in Kafka through transactions, but it adds latency and a lot of operational complexity. The practical test is whether your consumer is idempotent. If processing the same event twice lands you in the same place as processing it once, at-least-once is fine and you skip the complexity. If duplicates cause correctness problems, the cheaper fix is almost always to make the consumer idempotent rather than to buy exactly-once. An idempotent consumer is easier to operate and easier to reason about than a distributed transaction, and it degrades more gracefully when something downstream misbehaves.

Stateful vs Stateless Processing

Stateless processors are easy to scale. Any instance can handle any event. Stateful processors, the ones holding running tallies or windowed aggregations, need consistent routing. The same match's events have to reach the same processor instance, or the state has to be pushed out to Redis. Stateful processing is more powerful but much harder to scale sideways and to recover after a failure. When a stateful processor restarts, it has to rebuild its in-memory state before it can correctly handle anything new, and that recovery window is a real outage you have to design for, not an edge case you can wave away.

Cost vs Freshness

More real-time means more infrastructure, plain and simple. Holding 5 million WebSocket connections open burns memory and CPU in proportion to the connection count, even when nothing is happening. Polling once a second is cheaper but adds up to a second of lag. Polling every thirty seconds is cheaper still. Most product debates framed as "should this be real-time" are really debates about where on the cost-versus-freshness curve you want to sit. The engineering conversation and the product conversation are the same conversation, and it goes better when the latency options arrive with dollar figures attached, in front of the people who own the product rather than buried in an engineering thread.

What Real Companies Actually Did

Discord

Discord's early setup ran a single Elixir process per channel to manage message fan-out. It worked nicely at small scale, because the Erlang/OTP model is genuinely great at juggling lots of lightweight concurrent processes. As they grew into billions of messages a day, two problems surfaced: Erlang's per-process memory overhead added up across millions of channels, and Mnesia, their original database, could not keep up with the write load.

So they moved message storage to Cassandra, which is built for high write throughput. They spread their WebSocket gateway connections across a large fleet instead of routing by channel. They also built a dedicated presence system, because tracking who is online turned out to be its own hard problem at scale. The lesson from their engineering blog is worth repeating: the hardest problem was not raw throughput. It was the thundering herd inside their own system, when one large server with hundreds of thousands of members got a single message that had to fan out instantly to every connected member. That is the same fan-out math from our score feed wearing different clothes, and it is why "fan-out is the hard part" is a pattern and not a coincidence.

Uber

Uber's dispatch system is a soft real-time problem with unusually tight latency demands: match a rider to a driver in under a second, continuously, across a whole city, while GPS positions update every few seconds per driver. The matching alone is computationally expensive. Doing it in real time at city scale takes careful partitioning, and the way they partitioned is the genuinely interesting part.

The naive approach is to carve the city into a square grid and handle each square independently. That breaks immediately, because squares have a nasty property: the distance from the center of a square to its corner is much larger than the distance to its edge, so "drivers near this rider" pulls in wildly different real-world areas depending on which direction you look. Uber partitions on H3, a hexagonal grid, instead. Hexagons have only one kind of neighbor and a roughly uniform center-to-edge distance in every direction, which makes "expand the search to adjacent cells" behave consistently no matter which way the search grows. That uniformity is the whole point: proximity queries stay sane at cell boundaries.

The cost they take on in return is the boundary problem that all spatial sharding shares. A rider on the edge of one hexagon may be closest to a driver sitting just inside the next one, so the matcher cannot look at a single cell in isolation. It has to gather the cell plus its ring of neighbors, which means a single match touches several shards at once. They run this on Flink with Kafka as the event backbone, using stateful stream processing to keep the live position of every driver in a cell continuously updated. The reason their writing on this is worth reading is that it shows distributed-systems ideas, partitioning, stateful processing, exactly-once semantics, applied to a problem that looks nothing like a message queue and yet decomposes the exact same way.

Robinhood

The GameStop event in January 2021 is the clearest recent example of a correlated thundering herd in the wild. Trade volume jumped to many times normal load inside a window of minutes, not the gentle ramp that capacity planning usually assumes. Robinhood's real-time systems, the order routing, portfolio updates, and market data feeds, were not provisioned for that load profile and could not shed load selectively.

The result was degraded performance, then trading restrictions that turned into a major public controversy. The engineering takeaway is narrow but important: your load shedding and capacity buffers have to account for correlated spikes across your entire user base, not just average growth or typical peaks. The mistake is not under-provisioning, it is modeling your load as if users behave independently. They do not. The moment an external event makes every user do the same thing at the same time, the averages that your capacity plan was built on stop describing reality entirely.

Closing Thoughts

Real-time at scale is not really a technology problem. Kafka, Flink, Redis, and WebSockets are well understood, and the implementations are mature. The hard part is the discipline of deciding which failures you can live with and which you cannot, then building the system to reflect those decisions honestly.

Every real-time architecture is a stack of explicit bets. Accepting stale reads to stay available. Dropping events to protect throughput. Choosing at-least-once delivery and designing for idempotency instead of paying for exactly-once. None of these bets is right or wrong in the abstract. They are right or wrong relative to what your product actually needs, and the engineers who build these systems well are the ones who make the bets out loud, write them down, and revisit them on purpose, rather than discovering them in a postmortem.

The field is slowly making some of these tradeoffs cheaper. CRDTs, or Conflict-free Replicated Data Types, let distributed state merge without coordination, which eases the consistency-versus-latency tension for certain data structures. Edge computing keeps pushing fan-out physically closer to users, cutting last-mile latency that no backend tuning could ever touch. Systems like FoundationDB are making exactly-once semantics cheaper down at the storage layer.

But the core constraint never disappears. You cannot have low latency, high consistency, and high availability all at once during a partition. That is a property of distributed systems that falls out of the physics of networks, not a limitation of today's tools. The job is to figure out which corner of that triangle your product truly needs, choose it on purpose, and build everything else around that choice. Everything in this post is a tool for living inside that constraint gracefully. None of it lets you escape it.

Related