This AI Switches Languages Mid-Sentence
Most voice agents fail the moment you switch languages. We're breaking down the tech that allows an AI to seamlessly transition from English to Polish to Spanish in a single conversation.
Your Voicebot is Linguistically Trapped
Ask any smart speaker a question in English, then slip into Spanish halfway through the sentence. Most systems freeze, mis-transcribe, or snap back with something uncanny in the wrong language. Today’s mainstream voicebots effectively run in single-language lockstep: one language per session, chosen in a settings menu or hard-coded by a developer.
Humans do the opposite. Bilingual speakers “code switch” constantly—“Can you book la cita for mañana?”—without thinking about what model supports which locale. In cities like London, New York, or Mexico City, a single conversation can bounce between English, Polish, and French in under 10 seconds, and nobody fills out a form first to declare their language.
Voice AI mostly lives in what Hugo Pod calls Tier 1: it can handle multiple languages, but only if you tell it upfront which one to expect. That works for rigid call flows and IVRs, but it breaks the moment a caller asks in English, “Do you speak Spanish?” and then actually switches to Spanish. The agent either keeps replying in English, or worse, mangles the transcription and derails the LLM.
Tier 2 is the upgrade: a multilingual agent that detects and switches languages mid-sentence, with no manual toggles, no “press 2 for Español,” no restart. A user can start in English, pivot to Polish, then toss in a French phrase, and the system tracks all of it in real time. That kind of fluidity turns a voicebot from a settings panel into a conversation.
Building that Tier 2 agent demands three components working in lockstep:

- A smart framework like LiveKit to orchestrate real-time audio and agent logic
- A powerful brain (an LLM) that can respond naturally in many languages
- A hyper-aware ear (STT) that performs low-latency, high-accuracy code switching
Most LLMs and text-to-speech engines already handle multiple languages reasonably well. The real bottleneck is speech-to-text that can hear “Do you speak Spanish?” and seamlessly follow when the rest of the sentence arrives in Spanish—no reconfiguration, no hard reset, just continuous, multilingual understanding.
Tier 1 vs. Tier 2: The Multilingual Divide
Tier 1 multilingual agents sound flexible on paper: one system, many languages. In practice, they only work if you declare the language up front, before anyone says a word. You configure “Spanish,” “Polish,” or “French” as a session parameter, then the entire conversation stays locked to that choice.
That design shows up everywhere from IVR phone trees to customer support bots. You pick from a dropdown, press “2 for Español,” or tap a flag icon, and only then does the speech-to-text pipeline load the right acoustic and language models. Change your mind mid-call, or mix in another language, and the system either mishears you or ignores the switch.
Logistically, Tier 1 feels clumsy. Forms need an extra “preferred language” field, call flows need a menu, and kiosks need UI affordances just to get started. Every added step increases friction and abandonment; many consumer apps lose users if onboarding takes more than 10–20 seconds.
Tier 2 multilingual agents work differently. They listen first and decide on the fly which language—or languages—you are using, with no prior declaration. A conversation can start in English, jump to Spanish for a question, then slip into Polish, and the agent tracks those transitions in real time.
That shift turns multilingual from a checkbox feature into actual conversational fluency. A Tier 2 system supports natural “code switching,” where a user blends languages inside a single sentence, like “Can you send the factura to my work email?” or “Czy mówisz Spanish as well?” The agent must transcribe, reason, and respond appropriately at every switch.
For global products, Tier 2 is the gold standard. One agent can serve users across dozens of markets without separate phone numbers, separate bots, or hard language routing rules. Companies avoid maintaining parallel flows for English, French, and Polish, and instead deploy a single logic layer that adapts to whatever the user speaks.
Hugo Pod’s “How to Build a Multilingual Voice Agent with LiveKit & Gladia” explicitly targets this Tier 2 model. Using Gladia for low-latency code switching and LiveKit for real-time audio, his stack aims at that higher bar: an agent that behaves less like a form and more like a person.
Why 'Code-Switching' is the Holy Grail
Code-switching describes how bilingual people flip languages mid-sentence without thinking: “Oye, did you send that report?” or “Ça marche, I’ll ping you later.” Psycholinguists see it as a feature, not a bug—research shows bilinguals switch based on topic, emotion, or who they’re talking to, often several times per minute.
For AI voice agents, that behavior is the holy grail. A Spanish-speaking customer might open in English for the IVR menu, slide into Spanish to explain a billing problem, then jump back to English for card numbers. Any system that freezes on the first language loses trust, time, and often the user.
Real-world stakes are high. Global support centers in Mexico City, Manila, or Warsaw routinely juggle English plus 2–4 local languages on the same line. International sales calls in fintech, travel, or SaaS bounce between English, Hindi, and regional dialects. Public services in cities like New York or London must handle mixed-language conversations across healthcare, housing, and education.
Technically, this is brutal because raw audio is ambiguous without linguistic context. A two-second clip might map to plausible words in English, Polish, or Portuguese, all with different meanings. Background noise, accents, and domain jargon multiply the confusion, so naïve models “lock in” to the wrong language and never recover.
All three pillars—STT (speech-to-text), LLM, and TTS—have to stay in perfect sync on language choice. LLMs already handle multilingual prompts well, and modern TTS engines like 11Labs can speak convincing Polish or Spanish once they get clean text. Speech recognition is the real boss fight.
Multilingual STT has to detect language boundaries in real time, sometimes on a single word, while keeping latency under ~300 ms for a natural call. It must decide “was that ‘no’ in English or ‘não’ in Portuguese?” on the fly and switch models or vocabularies instantly. Tools like Gladia’s code-switching models and the frameworks covered in LiveKit’s Voice AI quickstart docs are emerging, but perfect code-switching remains a frontier problem.
Our Tech Stack for Fluid Conversations
Modern code-switching voice AI stands on four pillars: real-time routing, speech recognition, language reasoning, and synthetic speech. Swap any one of them for a weaker component and the whole illusion of a fluid, bilingual conversation breaks instantly.
At the center sits LiveKit, the real-time communication framework that behaves like the agent’s nervous system. It manages low-latency audio streams, session state, and backpressure, making sure audio packets, transcripts, and responses arrive in under a few hundred milliseconds instead of seconds.
LiveKit wires together three specialized services that each own a different part of the stack:

- Gladia for Speech-to-Text
- OpenAI GPT-4.1 for language understanding
- 11Labs for Text-to-Speech
Gladia acts as the agent’s ears, continuously transcribing raw audio into text while the user is still talking. Its multilingual sea-salaria-1 model supports code-switching across dozens of languages, detecting when a sentence jumps from English to Spanish to Polish without resetting the session.
That code-switching ability matters because speech-to-text is the most fragile link in this chain. If Gladia mislabels Spanish as accented English, GPT-4.1 never sees the correct words, and the entire “multilingual” experience collapses into nonsense or awkward clarifying questions.
Once Gladia emits text, OpenAI GPT-4.1 steps in as the brain. The LLM tracks conversation history, user intent, and language shifts, then decides not just what to say, but in which language to say it. Prompting can nudge GPT-4.1 to mirror the user’s language automatically or to switch when explicitly asked (“¿Puedes hablar polaco?”).
11Labs closes the loop as the voice. Feed it Polish, French, or English tokens and it returns natural-sounding audio in that same language, using the same synthetic voice so the agent feels like one consistent persona, not a patchwork of different systems.
Together, LiveKit, Gladia, GPT-4.1, and 11Labs form a tight real-time circuit. Audio flows in, language-aware text flows through, and correctly localized speech flows out—fast enough that code-switching feels casual, not like switching apps.
The STT Bottleneck: Why Gladia is the Key
Speech-to-text quietly decides whether a multilingual voice agent works or falls apart. For Tier 2 systems that need to follow a caller from English to Spanish to Polish in a single sentence, STT is the hardest part of the stack by far. LLMs and TTS can already juggle dozens of languages from clean text; STT has to do it from noisy, overlapping, heavily accented audio in real time.
Gladia’s sea-salaria-1 model sits at that choke point. It supports 40+ languages out of the box, with native code-switching, so a phrase like “Can you call mi mamá en Madrid?” doesn’t confuse it into one mangled language. Instead, it cleanly segments and transcribes English and Spanish as they actually appear in the waveform.
Regional routing is where sea-salaria-1 becomes viable for live products rather than just demos. Gladia lets you pin processing to specific regions, such as EU West, so if your users sit in London or Paris you avoid the 100–200 ms penalty of transatlantic hops. For a voice agent, shaving that latency keeps back-and-forth responses under the ~300 ms threshold where “AI pause” becomes obvious.
Without an STT engine that can detect language changes directly from audio, nothing else in the pipeline ever has a chance to be smart. The LLM only sees whatever text transcription it gets; if the STT mislabels Polish as English and outputs gibberish tokens, even the best model will confidently respond in the wrong language. TTS then happily speaks that mistake back to the user, locking in the failure.
Code-switching support at the STT layer also prevents brittle pre-routing hacks. You no longer have to guess a caller’s language from their phone number, a menu choice, or the first sentence. Sea-salaria-1 can listen from second zero, recognize that the user just switched from English instructions to rapid-fire French, and adjust character sets and language models on the fly.
Deepgram and other STT providers do advertise multilingual and even code-switching features, and they work for many use cases. For this specific Tier 2 agent, though, Gladia won on raw transcription accuracy across mixed-language audio, especially with fast switches and less common combinations like English–Polish. When your entire experience depends on nailing those edge cases, that accuracy gap is decisive.
Orchestration with the LiveKit Agent Framework
LiveKit no longer acts only as a WebRTC router; it behaves like an agent runtime that owns the entire call loop. Instead of wiring STT, LLM, and TTS together by hand, you define an agent that reacts to events—audio frames, messages, timeouts—and LiveKit orchestrates the rest in real time.
At the center is the LiveKit Agent Framework, which runs your Python (or Node) logic close to the media pipeline. That proximity matters: fewer hops between media, inference, and business logic translate into lower end-to-end latency, which is life-or-death for a code-switching voice agent.
LiveKit Inference slots directly into this loop as a managed LLM and TTS layer. You point your agent at models—OpenAI, local, or vendor-hosted—and LiveKit handles streaming tokens out and audio back without you juggling three different SDKs.
Using LiveKit Inference also sidesteps a mess of operational headaches. You avoid per-vendor rate limits on LLM and TTS calls, consolidate usage into one bill, and often get lower latency because LiveKit talks to providers over enterprise-tier links instead of public API gateways.
Billing consolidation is not just convenience; it changes how you architect. Instead of building custom throttling and fallback logic for each provider, you treat inference as a single resource pool with predictable quotas and monitoring.
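As a rough sketch of that single-pool idea, the snippet below selects the LLM and TTS through LiveKit Inference with provider/model descriptor strings; the exact identifiers and the string form are assumptions based on LiveKit's Inference docs, not values from Hugo's repo.

```python
# Rough sketch: picking LLM and TTS through LiveKit Inference via descriptor
# strings instead of separate vendor SDKs and API keys. The model identifiers
# below are illustrative assumptions, not the tutorial's exact values.
from livekit.agents import AgentSession


def build_session() -> AgentSession:
    return AgentSession(
        llm="openai/gpt-4.1",                     # swap providers by editing one string
        tts="elevenlabs/eleven_multilingual_v2",  # billed and monitored through LiveKit
    )
```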
LiveKit’s structure makes swapping components almost mechanical. In Hugo Pod’s agent.py, Gladia plugs in as the STT provider via a simple configuration block: model name (sea-salaria-1), region (EU West), and a list of supported languages.
That design means you can experiment aggressively. Want to A/B test two TTS voices or two LLM prompts? You change a few lines in the agent definition; LiveKit still handles session state, media routing, and reconnection logic.
For teams coming from raw WebRTC or DIY gRPC services, this is a different abstraction level. You stop thinking in sockets and codecs and start thinking in “agent sessions” and “jobs” that can be scaled horizontally.
LiveKit’s documentation leans into this model; the “Building voice agents” guide in the LiveKit docs walks through patterns like background jobs, multi-agent routing, and custom tools that you can reuse across multilingual projects.
The Brain and Voice: Easy Wins for LLM & TTS
Modern LLMs barely break a sweat when you ask them to juggle languages. Models in the GPT-4 class train on trillions of tokens scraped from the multilingual web, books, forums, and code repositories, covering everything from English and Spanish to Polish and niche dialects. If you prompt, “Answer in French, then summarize in English,” they just do it, token by token.
That multilingual behavior is not a bolt-on feature; it falls out of how these models learn. During training, they see parallel concepts expressed in different languages and optimize one gigantic shared embedding space. So when a user flips mid-sentence from “Can you book a flight?” to “para mañana a Madrid,” the model simply continues predicting the most likely next token, now in Spanish.
Prompting gives you precise control. You can tell the LLM, “Always respond in the caller’s language,” or “Speak English but mirror any quoted foreign phrases.” With a single system message, the same GPT-4 instance can handle customer support in German, tech onboarding in Portuguese, and follow-up questions in English, all in one continuous session.
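One way to encode that control is a short system prompt; the wording below is an illustrative example rather than the prompt used in the tutorial.

```python
# Illustrative system prompt (not the tutorial's) that keeps one GPT-4-class
# model mirroring the caller's language for the whole session.
LANGUAGE_POLICY_PROMPT = """
You are a helpful voice assistant.
- Always reply in the language the user most recently spoke.
- If the user mixes languages in one sentence, answer in the dominant one,
  keeping proper nouns and quoted phrases in their original language.
- If the user explicitly asks you to switch languages, switch immediately
  and stay in that language until asked to change again.
"""
```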
On the output side, TTS systems like 11Labs are even more straightforward. They do not need to infer what language you meant; they just synthesize whatever language the text already uses. Feed them Polish text, you get Polish audio; swap in French, you get French, often with consistent voice timbre across languages.
Multilingual TTS mainly depends on two things: language coverage and voice quality. If a provider supports, say, 28 languages and cross-lingual voices, your app can keep the same “agent persona” while hopping from English to Spanish to Polish in real time. No reconfiguration, no separate voice per language.
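To illustrate that “one persona, many languages” idea, the sketch below feeds mixed-language text to a single multilingual ElevenLabs voice. It assumes the current ElevenLabs Python SDK’s text_to_speech.convert call and uses a placeholder voice ID, so treat the names as approximate rather than taken from the tutorial.

```python
# Rough sketch: one ElevenLabs voice speaking three languages. SDK call and
# parameter names are assumptions based on the current Python SDK; the voice
# ID is a placeholder.
import os

from elevenlabs.client import ElevenLabs

client = ElevenLabs(api_key=os.environ["ELEVENLABS_API_KEY"])

lines = [
    "Your order has shipped and should arrive on Friday.",   # English
    "Tu pedido ha sido enviado y llegará el viernes.",       # Spanish
    "Twoje zamówienie zostało wysłane i dotrze w piątek.",    # Polish
]

for i, line in enumerate(lines):
    audio = client.text_to_speech.convert(
        voice_id="YOUR_VOICE_ID",           # same persona for every language
        model_id="eleven_multilingual_v2",  # multilingual model named in the article
        text=line,
    )
    # `convert` streams audio bytes; write them out so each reply can be played back.
    with open(f"reply_{i}.mp3", "wb") as f:
        for chunk in audio:
            f.write(chunk)
```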
All that elegance collapses if the words going into the LLM are wrong. The real magic—and the real risk—sits upstream in STT, where models like Gladia must detect language shifts, segment them correctly, and hand the LLM clean, code-switched transcripts.
Anatomy of the Agent: Code Deep Dive
The agent.py file acts as the wiring diagram for this multilingual setup, and almost all of the magic comes from configuration, not custom algorithms. Hugo defines a single `Agent` that binds GladiaSpeechToText, LiveKit’s inference services, and some conversation controls into one real-time loop.
Speech recognition gets the most detailed tuning. The `GladiaSpeechToText` block specifies three critical parameters: `model="sea-salaria-1"`, `region="eu-west"`, and a `languages` array. That `sea-salaria-1` model is Gladia’s code-switching workhorse, designed to handle mid-sentence flips between English, Spanish, Polish, and more.
Region selection matters for latency. By pinning `region="eu-west"` from London, Hugo keeps round-trip times low instead of bouncing audio across the Atlantic to a default US endpoint. Many STT providers hide region routing; Gladia exposes it directly, which is rare and extremely useful for real-time voice.
The `languages` parameter is where this jumps from Tier 1 to Tier 2. Instead of telling the model “this call is French,” Hugo passes a list of allowed options, for example:

- `"en"`
- `"fr"`
- `"es"`
- `"pl"`

Gladia then auto-detects which language is being spoken at any given moment and switches transcription rules on the fly.
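A minimal sketch of that block is shown below; the constructor and parameter names mirror the article’s description of agent.py, while the import path is an assumption, so the real plugin API may differ slightly.

```python
# Sketch of the STT configuration described above. Constructor and parameter
# names follow the article's description of agent.py; the import path is an
# assumption about where the plugin lives.
from livekit.plugins.gladia import GladiaSpeechToText  # assumed import path

stt = GladiaSpeechToText(
    model="sea-salaria-1",               # Gladia's code-switching model
    region="eu-west",                    # keep audio processing close to London
    languages=["en", "fr", "es", "pl"],  # candidates for on-the-fly detection
)
```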
LiveKit’s side looks almost boring by comparison, which is exactly the point. For LLM inference, Hugo wires in a `LiveKitInference` client with a model like `"gpt-4o-realtime-preview"`, plus a short system prompt: “You are a helpful voice assistant.” No extra multilingual flags, no routing logic, just one model that already understands dozens of languages.
Text-to-speech uses the same pattern: a `LiveKitInference` TTS client pointing at a model such as `"eleven_multilingual_v2"` with a chosen voice ID. As long as the TTS engine supports the target language, feeding it Polish or Spanish text simply works, so the code stays almost configuration-only.
Turn-taking is where tiny config changes dramatically affect user experience. Hugo swaps LiveKit’s `turn_detection` model from `"english"` to `"multilingual"`, so the agent detects pauses and end-of-utterance correctly in non-English languages and mixed-language sentences.
Finally, `preemptive_generation=False` disables the agent’s habit of talking over users. Many real-time systems start speaking as soon as they “think” you are done; that breaks code-switching when users add a clause in another language. Forcing the agent to wait for a clear turn boundary keeps conversations natural and prevents mid-sentence interruptions.
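Pulled together, the settings from this section might look roughly like the sketch below. It leans on the LiveKit Agents Python API (AgentSession, the multilingual turn-detector plugin) plus the constructor names described in the article; the Inference descriptor strings and import paths are illustrative assumptions, not lines copied from Hugo’s repo.

```python
# Condensed sketch of an agent.py along the lines described above. LiveKit
# Agents names are as documented; the Gladia constructor mirrors the article's
# description; Inference descriptor strings are illustrative assumptions.
from livekit import agents
from livekit.agents import Agent, AgentSession, JobContext, WorkerOptions
from livekit.plugins.gladia import GladiaSpeechToText              # assumed import
from livekit.plugins.turn_detector.multilingual import MultilingualModel


async def entrypoint(ctx: JobContext):
    await ctx.connect()

    session = AgentSession(
        stt=GladiaSpeechToText(
            model="sea-salaria-1",
            region="eu-west",
            languages=["en", "fr", "es", "pl"],
        ),
        llm="openai/gpt-4o-realtime-preview",     # via LiveKit Inference (illustrative)
        tts="elevenlabs/eleven_multilingual_v2",  # via LiveKit Inference (illustrative)
        turn_detection=MultilingualModel(),       # "multilingual" rather than "english"
        preemptive_generation=False,              # wait for a clear end of turn
    )

    await session.start(
        room=ctx.room,
        agent=Agent(instructions="You are a helpful voice assistant."),
    )


if __name__ == "__main__":
    agents.cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
```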
Deconstructing the Demo: From English to Polish
The code-switching moment in the demo starts innocently enough. The user opens in English, chatting with the agent as if it were any other Tier 1 system. Then comes the pivot line that would break most production voicebots: “I just wanted to know if you can speak Polish.”
Instead of replying in English or freezing, the agent instantly flips. It answers in fluent, natural-sounding Polish, with correct phonetics and prosody from the TTS stack, signaling that the LLM, prompt, and voice settings all accepted the language switch without a reset. No manual language toggle, no reinitialization, no “switching languages, please wait” delay.
What matters more is what happens next. The user continues in Polish, pushing into a full back-and-forth that stays entirely in that language. The agent understands follow-up Polish phrases, keeps context, and returns coherent, on-topic Polish responses—exactly the Tier 2 behavior multilingual products promise but rarely deliver.
Under the hood, that performance hangs on STT. Gladia’s model receives audio that begins in English, then mid-conversation shifts into Polish, and still produces accurate transcripts with low latency. That transcription quality is what lets the LLM maintain a single conversation state instead of spawning “English mode” and “Polish mode” threads.
Logs from the run surface an intriguing wrinkle: `turn detector does not support language Polish`. Turn detection decides when a user has finished speaking, so this warning means a secondary component only knows how to segment certain languages. Despite that, the system never visibly stutters because the core STT pipeline continues to recognize and transcribe Polish reliably.
This is a subtle but important architectural point. You can have non-critical pieces—like a language-limited turn detector—throw warnings while the main **Gladia** transcription engine keeps operating flawlessly across languages. In real deployments, that separation of concerns means you can iterate on ancillary modules without risking the multilingual brain that actually powers the experience.
The Future is a Polyglot AI
Polyglot agents stop being a research toy once you wire a high-level framework like LiveKit into a purpose-built STT engine like Gladia. LiveKit handles the messy real-time plumbing—WebRTC, sessions, agent lifecycle—while Gladia’s low-latency, code-switching model (like its sea-salaria-1 variant) does the one job generic models still fumble: detecting and transcribing multiple languages in the same breath. That pairing upgrades a simple voice bot into a Tier 2 agent that tracks human conversation instead of forcing humans to track system settings.
Stacked together, these pieces unlock products that actually work at global scale. A single support line can route customers from Mexico City, Warsaw, and Paris into the same multilingual voice agent, which follows them as they bounce between English for product names and native language for everything else. No IVR trees, no “Press 3 for Spanish,” just one endpoint that adapts in real time.
Meetings change too. Imagine a Zoom or Meet companion that listens to a 10-person call where participants swap between English, German, and Polish, and still produces:

- Live captions in each participant’s preferred language
- Searchable transcripts tagged by speaker and language
- Summaries that preserve when and why code-switching happened
Consumer assistants benefit just as much. A bilingual family can talk to a home device in English, switch to French mid-sentence to address a grandparent, then flip back without a wake-word reset or app setting change. Accessibility jumps when users with limited proficiency in a “default” language no longer have to stick to it just to stay understood.
Barriers that once demanded a research lab—fast ASR, robust code-switching, low-latency streaming—now fit into a weekend project. LiveKit abstracts the real-time stack; Gladia handles multilingual STT; mainstream LLMs and TTS already speak dozens of languages out of the box. The hard part is no longer “Can this be built?” but “What should this agent actually do?”
You can answer that yourself. Check out the GitHub repo from “How to Build a Multilingual Voice Agent with LiveKit & Gladia,” plug in your own prompts and voices, and start shipping agents that speak to users the way users already speak to each other.
Frequently Asked Questions
What is AI code-switching?
Code-switching is the ability for an AI voice agent to detect and switch between multiple languages within the same conversation, just like a bilingual human would. This requires advanced speech-to-text technology.
Why is Gladia recommended for multilingual voice agents?
Gladia's speech-to-text is highlighted for its high accuracy across many languages, low latency, and its specific support for code-switching, which is the most critical feature for this type of agent.
What is LiveKit's role in this project?
LiveKit acts as the underlying framework for the voice agent, managing real-time communication (WebRTC) and providing an agent development kit. Its inference feature also simplifies using models like GPT-4 and 11Labs by proxying API calls.
Can I use a different LLM or TTS with this LiveKit setup?
Yes. LiveKit's framework is flexible. While the tutorial uses OpenAI's GPT-4 and 11Labs via LiveKit Inference, you can integrate other language models and text-to-speech services that fit your needs.