Designing a Real-Time Voice AI Architecture with WebSockets and LLMs

A deep dive into the system architecture behind real-time voice AI, exploring how WebSockets, streaming STT, LLMs, and TTS work together to deliver low-latency conversational experiences.

Introduction

Most LLM applications start with a simple question–answer flow: a user sends text, the backend forwards it to an LLM, and a complete response is returned. This works well for text because both input and output are discrete and finite.

Voice breaks this model.

Audio is continuous, not atomic. Waiting for a user to finish speaking before processing introduces noticeable delays, and waiting for a full response before playback makes conversations feel slow and unnatural. Traditional HTTP-based, request–response architectures are not designed for this kind of interaction.

This is where WebSockets become essential. By providing a persistent, bidirectional connection, WebSockets allow audio, transcripts, model responses, and synthesized speech to flow incrementally instead of in isolated requests.

As a result, real-time voice AI systems require a different architectural approach. At a minimum, they consist of a streaming-capable client, a WebSocket gateway, a Speech-to-Text (STT) service, an LLM orchestration layer, and a Text-to-Speech (TTS) system.

This article focuses on how these components fit together architecturally, and why streaming and low-latency design are fundamental to building real-time voice-based LLM systems.


The Problem: High Latency

High latency is the biggest reason most voice-based LLM systems feel unnatural.

A typical naïve architecture follows a strict, sequential flow: record audio → transcribe → send to LLM → generate response → synthesize speech → play audio. Each step waits for the previous one to finish, delaying the first audible response by several seconds.

This happens not because LLMs are slow, but because the system treats audio, text, and responses as static payloads instead of streams.

  • Audio is processed only after recording completes
  • STT runs on full clips, not partial speech
  • LLMs respond only once the full prompt is ready
  • TTS waits for the entire response before playback

The result is a batch-oriented pipeline that is fundamentally incompatible with real-time conversation.

Solving this problem requires a streaming-first architecture where STT, LLM, and TTS operate concurrently, minimizing end-to-end latency rather than optimizing individual components.
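To make the contrast concrete, here is a back-of-the-envelope latency model in Python. The stage durations and the overlap fraction are illustrative assumptions, not measurements from any real system:

```python
# Illustrative latency budget: sequential vs. streaming pipeline.
# Stage durations (ms) are assumed round numbers, not benchmarks.
STAGES = {"stt": 800, "llm_first_token": 600, "tts_first_chunk": 300}

def sequential_time_to_first_audio(stages):
    """Batch pipeline: every stage waits for the previous one to finish."""
    return sum(stages.values())

def streaming_time_to_first_audio(stages, overlap=0.6):
    """Streaming pipeline: downstream stages start on partial input.

    `overlap` is the assumed fraction of total work hidden by running
    stages concurrently on partial data.
    """
    return sequential_time_to_first_audio(stages) * (1 - overlap)

print(sequential_time_to_first_audio(STAGES))  # 1700 ms before first audio
print(streaming_time_to_first_audio(STAGES))   # 680.0 ms with 60% overlap
```

The exact numbers matter less than the shape of the model: overlap attacks the sum, while per-component tuning only attacks individual terms.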

Why WebSockets Over REST and SSE

Real-time voice AI systems require continuous, low-latency, bidirectional communication. This immediately rules out traditional REST APIs.

REST is request–response by design. Each interaction requires a new HTTP request, making it unsuitable for streaming audio, partial transcriptions, or incremental model responses. The overhead of repeated connections and the lack of real-time bidirectional flow introduce unnecessary latency.

Server-Sent Events (SSE) improve on REST by enabling the server to push updates to the client over a persistent connection. However, SSE is fundamentally unidirectional. While it works for streaming text responses, it cannot efficiently handle upstream audio streaming from the client to the server.

WebSockets solve both problems.

They establish a single, long-lived connection that supports full-duplex communication, allowing the client to stream audio while simultaneously receiving partial transcripts, LLM responses, and synthesized speech. This makes it possible to overlap STT, LLM, and TTS processing instead of executing them sequentially.

In a real-time voice architecture, WebSockets are not an optimization—they are a requirement. They provide the communication model needed to treat voice interactions as continuous streams rather than isolated requests.


High-Level Architecture Overview

A real-time voice AI system is best understood as a streaming pipeline, where components operate concurrently rather than in isolation. Each layer is designed to handle partial data, reducing end-to-end latency and enabling natural conversational flow.

[Figure: Voice AI architecture diagram]

Core Components

Client

  • Captures microphone audio and plays synthesized speech
  • Maintains a continuous connection to the backend

WebSocket Gateway

  • Provides a persistent, bidirectional communication channel
  • Acts as the system's real-time coordination layer

Speech-to-Text (STT)

  • Transcribes incoming audio streams
  • Produces partial and final text output

LLM Orchestration Layer

  • Handles conversation state and turn management
  • Streams generated tokens downstream

Text-to-Speech (TTS)

  • Converts model output into audio
  • Streams synthesized speech back to the client

Understanding the Diagram: Data Flow in Action

The architecture diagram above illustrates how data flows through the system in real time. Let's trace a typical interaction:

  1. User speaks → Audio frames flow from the client to the WebSocket gateway
  2. STT processes → Partial transcripts are emitted while the user is still speaking
  3. LLM generates → Tokens stream as soon as sufficient context is available
  4. TTS synthesizes → Audio chunks are produced from incoming tokens
  5. Client plays → Speech playback begins before the LLM finishes generating

Notice the overlapping timelines. Unlike a sequential pipeline where each stage waits for the previous one to complete, this architecture allows multiple operations to run concurrently. When the user finishes speaking, the system has already begun synthesizing a response.

The WebSocket gateway acts as the coordination point, routing messages between subsystems and managing control signals for interruptions and turn-taking.

End-to-End Data Flow

Once the architecture is in place, data moves through the system as a continuous stream rather than discrete requests.

  1. The client streams audio frames to the backend over WebSockets
  2. STT transcribes speech incrementally and emits partial text
  3. The LLM consumes transcripts and streams generated tokens
  4. TTS converts tokens into audio chunks
  5. Audio is streamed back to the client for immediate playback

Because each stage processes data as it arrives, transcription, reasoning, and synthesis overlap in time. This streaming flow is the key architectural decision that keeps conversational latency within human expectations.
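The overlap described above can be sketched with standard-library asyncio queues. The stage functions below are toy stand-ins for real STT and LLM calls; only the wiring pattern is the point:

```python
import asyncio

# Toy sketch of the five-step flow: stages run as concurrent tasks,
# connected by queues, each consuming partial input as it arrives.
# `None` is used as an end-of-stream sentinel.

async def stt_stage(audio_q, text_q):
    # Emit a partial transcript per audio frame (stand-in for real STT)
    while (frame := await audio_q.get()) is not None:
        await text_q.put(f"partial:{frame}")
    await text_q.put(None)  # propagate end-of-stream

async def llm_stage(text_q, token_q):
    # Emit a token per transcript update (stand-in for a real LLM stream)
    while (text := await text_q.get()) is not None:
        await token_q.put(f"token-for:{text}")
    await token_q.put(None)

async def run_pipeline(frames):
    audio_q, text_q, token_q = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    tasks = [
        asyncio.create_task(stt_stage(audio_q, text_q)),
        asyncio.create_task(llm_stage(text_q, token_q)),
    ]
    for frame in frames:
        await audio_q.put(frame)
    await audio_q.put(None)

    tokens = []
    while (tok := await token_q.get()) is not None:
        tokens.append(tok)
    await asyncio.gather(*tasks)
    return tokens

tokens = asyncio.run(run_pipeline(["f1", "f2"]))
print(tokens)  # ['token-for:partial:f1', 'token-for:partial:f2']
```

Note that tokens for the first frame are produced before the last frame is even enqueued; that is the overlap a batch pipeline cannot achieve.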

Designing the WebSocket Event Protocol

With a persistent WebSocket connection in place, the next architectural challenge is defining how data moves across that connection. A real-time voice system is not just streaming audio—it is coordinating multiple asynchronous subsystems. This makes the event protocol as important as the transport itself.

A well-designed WebSocket protocol should be:

  • Explicit about intent (audio, text, control)
  • Stream-friendly, supporting partial and incremental data
  • Extensible, allowing new events without breaking clients

Event Categories

At a high level, WebSocket messages can be grouped into three categories.

Audio Events

  • Carry raw or encoded audio frames from the client
  • Are sent at a high frequency
  • Represent an ongoing stream rather than discrete actions

Text Events

  • Contain partial and final transcripts from STT
  • Carry streamed LLM tokens or text chunks
  • Flow in both directions depending on system design

Control Events

  • Signal boundaries such as start/stop speaking
  • Handle interruptions, cancellations, and resets
  • Coordinate turn-taking across the pipeline

Separating events by intent avoids overloading a single message type and keeps the protocol understandable as the system grows.
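One way to keep this separation concrete is to dispatch incoming messages on the prefix of their `type` field. A minimal sketch, with hypothetical handler names standing in for real subsystem calls:

```python
import json

# Route messages by the category encoded in the "type" field
# (e.g. "audio.input" -> "audio"). Handlers are hypothetical stand-ins.
def handle_audio(event):
    return f"audio frame ({event.get('format', '?')})"

def handle_text(event):
    return f"text: {event.get('text', '')}"

def handle_control(event):
    return f"control: {event['type']}"

ROUTES = {
    "audio": handle_audio,
    "transcript": handle_text,
    "llm": handle_text,
    "control": handle_control,
}

def dispatch(raw_message):
    event = json.loads(raw_message)
    category = event["type"].split(".", 1)[0]
    handler = ROUTES.get(category)
    if handler is None:
        # Unknown categories are skipped, keeping the protocol extensible
        return "ignored"
    return handler(event)

print(dispatch('{"type": "audio.input", "format": "pcm16"}'))  # audio frame (pcm16)
print(dispatch('{"type": "control.interrupt"}'))               # control: control.interrupt
```

Because unknown categories are ignored rather than rejected, new event types can be added server-side without breaking older clients.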

Streaming Semantics

The protocol must treat all major data types—audio, text, and speech output—as streams, not single payloads. This means:

  • Audio frames are sent continuously until a stop signal is emitted
  • Transcripts are updated incrementally
  • LLM responses are streamed token-by-token or chunk-by-chunk
  • TTS output begins before the full response is available

Crucially, control events allow the system to interrupt or cancel downstream processing when new audio arrives, preventing stale responses from being synthesized or played.

Designing the WebSocket event protocol around streams and control signals is what enables low-latency, interruption-aware voice interactions, rather than a simple request–response exchange over a persistent connection.

Example Protocol Structure

Here's a concrete example of how WebSocket events might be structured:

// Client → Server: Audio stream
{
  "type": "audio.input",
  "data": "base64EncodedAudioChunk",
  "format": "pcm16",
  "sampleRate": 16000
}

// Server → Client: Partial transcript
{
  "type": "transcript.partial",
  "text": "Hello, I need help with",
  "isFinal": false,
  "timestamp": 1234567890
}

// Server → Client: Final transcript
{
  "type": "transcript.final",
  "text": "Hello, I need help with my account.",
  "timestamp": 1234567895
}

// Server → Client: LLM token stream
{
  "type": "llm.token",
  "token": "I'd",
  "isComplete": false
}

// Server → Client: Audio output
{
  "type": "audio.output",
  "data": "base64EncodedSynthesizedAudio",
  "format": "pcm16",
  "sampleRate": 24000
}

// Client → Server: Interrupt current response
{
  "type": "control.interrupt",
  "reason": "user_speaking"
}

// Server → Client: Response complete
{
  "type": "control.response_complete",
  "tokenCount": 45,
  "audioLengthMs": 3200
}

This structure keeps events lightweight while providing all necessary metadata for debugging, monitoring, and coordinating the pipeline.

Real-World Reference: OpenAI Realtime API

OpenAI's Realtime API (released in 2024) demonstrates many of these architectural patterns in production. It uses WebSockets for bidirectional communication, supports function calling during voice conversations, and implements server-side Voice Activity Detection (VAD) to handle turn-taking automatically.

Similarly, providers like Deepgram and AssemblyAI offer streaming STT APIs that emit partial transcripts over WebSocket connections, allowing applications to begin processing before a user finishes speaking. On the TTS side, services like ElevenLabs and PlayHT stream audio chunks incrementally, enabling near-instantaneous playback.

These real-world implementations validate the architectural patterns discussed here and demonstrate that streaming-first design is becoming the industry standard for voice AI systems.

Latency Optimizations

In a real-time voice AI system, latency is not solved by a single optimization but by a series of small, compounding decisions across the pipeline. Each layer must be designed to reduce waiting and maximize overlap.

Small Audio Buffer Sizes

Audio should be captured and transmitted in small frames rather than large chunks. Smaller buffers reduce the delay between speech and transcription, allowing downstream components to begin processing almost immediately. The goal is to balance responsiveness with network overhead, favoring lower latency over throughput.

// Example: Client-side audio buffering
// Note: ScriptProcessorNode is deprecated in favor of AudioWorklet,
// but the buffering principle shown here is the same.
const BUFFER_SIZE = 4096; // Small buffer for low latency
const SAMPLE_RATE = 16000;

navigator.mediaDevices.getUserMedia({ audio: true }).then((stream) => {
  const audioContext = new AudioContext({ sampleRate: SAMPLE_RATE });
  const source = audioContext.createMediaStreamSource(stream);
  const processor = audioContext.createScriptProcessor(BUFFER_SIZE, 1, 1);

  processor.onaudioprocess = (e) => {
    const audioData = e.inputBuffer.getChannelData(0);
    // Send immediately - don't wait for large chunks.
    // `websocket` and `encodeAudioData` are assumed to be defined elsewhere.
    websocket.send(
      JSON.stringify({
        type: "audio.input",
        data: encodeAudioData(audioData),
        timestamp: Date.now(),
      }),
    );
  };

  source.connect(processor);
  processor.connect(audioContext.destination);
});

Partial STT Usage

Streaming STT services emit partial transcripts before speech segments are complete. Forwarding these partial results to the LLM allows response generation to start early, instead of waiting for final transcripts. Even imperfect partial text is often sufficient to prime the model and reduce perceived response time.

# Example: Server-side partial transcript handling
import asyncio

async def handle_stt_stream(audio_stream, websocket):
    async for transcript_event in stt_service.stream(audio_stream):
        if transcript_event.is_partial:
            # Send partial transcript to client
            await websocket.send_json({
                'type': 'transcript.partial',
                'text': transcript_event.text,
                'isFinal': False
            })

            # Start LLM processing early with partial text
            if len(transcript_event.text) > 20:  # Sufficient context
                asyncio.create_task(
                    prepare_llm_context(transcript_event.text)
                )
        else:
            # Final transcript - trigger full LLM response
            await websocket.send_json({
                'type': 'transcript.final',
                'text': transcript_event.text,
                'isFinal': True
            })
            await generate_and_stream_response(
                transcript_event.text,
                websocket
            )

Token-to-TTS Piping

Rather than waiting for a full LLM response, generated tokens should be streamed directly into the TTS system. This enables speech synthesis to begin while the model is still reasoning, significantly reducing the time before the first audio output is heard.

# Example: Streaming tokens directly to TTS
import base64

async def generate_and_stream_response(transcript, websocket):
    buffer = []

    async for token in llm.stream(transcript):
        buffer.append(token)

        # Send token to client for display
        await websocket.send_json({
            'type': 'llm.token',
            'token': token
        })

        # When we have enough tokens for natural speech, synthesize
        if len(buffer) >= 10 or is_punctuation(token):
            text_chunk = ''.join(buffer)
            buffer.clear()

            # Stream to TTS immediately
            async for audio_chunk in tts.synthesize_stream(text_chunk):
                await websocket.send_json({
                    'type': 'audio.output',
                    'data': base64.b64encode(audio_chunk).decode()
                })

Early Stream Cancellation

Voice interactions are interrupt-driven. When new audio arrives, ongoing LLM inference or TTS synthesis should be canceled immediately. Early cancellation prevents wasted computation and avoids playing outdated responses, keeping the system responsive to user intent.

# Example: Cancellation handling
class ConversationSession:
    def __init__(self):
        self.current_llm_task = None
        self.current_tts_task = None

    async def handle_interrupt(self, websocket):
        # Cancel ongoing operations immediately
        if self.current_llm_task and not self.current_llm_task.done():
            self.current_llm_task.cancel()

        if self.current_tts_task and not self.current_tts_task.done():
            self.current_tts_task.cancel()

        # Notify client
        await websocket.send_json({
            'type': 'control.cancelled',
            'reason': 'user_interrupt'
        })

        # Clear audio playback buffer on client
        await websocket.send_json({
            'type': 'control.clear_audio'
        })

Minimal Payload Size

WebSocket messages should carry only what is necessary. Compact audio encoding, lightweight event metadata, and minimal JSON structures reduce serialization overhead and network latency. In streaming systems, even small payload savings matter when multiplied across hundreds of messages per interaction.
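To illustrate the overhead, the sketch below compares a base64-in-JSON audio frame (as in the protocol examples earlier) with a raw binary WebSocket frame carrying a hypothetical one-byte type tag. The 20 ms frame size is an assumption:

```python
import base64
import json

# One 20 ms frame of 16 kHz, 16-bit mono audio: 640 bytes of raw PCM.
FRAME_BYTES = 16000 * 2 // 50
pcm = bytes(FRAME_BYTES)

# Option 1: base64 inside a JSON envelope (as in the protocol examples above)
json_payload = json.dumps({
    "type": "audio.input",
    "data": base64.b64encode(pcm).decode(),
})

# Option 2: raw binary frame with a hypothetical 1-byte type tag
binary_payload = bytes([0x01]) + pcm

print(len(json_payload))    # 891 bytes: base64 adds ~33% plus JSON framing
print(len(binary_payload))  # 641 bytes
```

At dozens of frames per second in each direction, that ~40% saving compounds quickly, which is why many production systems reserve JSON for text and control events and send audio as binary WebSocket frames.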

Taken together, these optimizations shift the system from a batch-oriented pipeline to a truly streaming architecture, where useful work happens continuously instead of waiting on artificial boundaries.

Final Thoughts

Building real-time voice AI systems is less about choosing the right models and more about designing the right architecture. Most latency and reliability issues emerge not from STT, LLMs, or TTS themselves, but from how they are connected and coordinated.

By shifting from a request–response mindset to a streaming-first design, and by using WebSockets as the backbone for bidirectional communication, voice interactions can feel immediate and natural rather than delayed and mechanical.

The key takeaway is simple: real-time voice AI is a systems problem. Small architectural decisions—buffer sizes, streaming boundaries, cancellation semantics, and protocol design—compound into massive differences in user experience.

Get the architecture right, and the models can shine. Get it wrong, and no amount of model quality will hide the latency.
