TLTD #39 - Voice AI: The Next Platform Shift
I built a voice PA from a chairlift. Here's what that tells us about where AI is heading.
I’m on a chairlift in Whistler, gloves off, thumbs going numb, arguing with my phone about a calendar conflict.
Except I’m not arguing with my phone. I’m arguing with my AI, Liv. Through WhatsApp. With my voice. And she’s winning, because she already checked my calendar, found the overlap, looked up the restaurant’s cancellation policy, and is now suggesting I move Tuesday’s dinner to Wednesday because “the forecast is better anyway”. lol
Two weeks of snowboard holiday. Two weeks of building a voice-powered personal assistant between runs, from chairlifts, gondolas, and the odd après-ski session — much to my partner’s chagrin. No laptop. No IDE. Just WhatsApp, OpenClaw, and LiveKit.
Clawdbot had been inescapable — wait... was it Moltbot? no... OpenClaw! That’s it!
Endless YouTube tutorials. LinkedIn AI slop galore. GitHub stars going from 9,000 to 200,000+ in a matter of weeks. And every demo I saw was the same: a chatbot with a WhatsApp skin, a novelty you’d use twice before going back to typing. I kept waiting for someone to show me what it looked like when you actually pushed it. So I decided to find out myself.
Why Voice, Why Now
Voice AI for consumers isn’t new. OpenAI’s Advanced Voice Mode landed in mid-2024. Siri has been around since 2011. And people have been screaming “Hey Google” into their devices since 2016.
What’s new is talking, by voice, to an agent you built: one that knows you, lives in your tools, and acts on your behalf across your actual systems. That’s another thing entirely.
Consumer voice AI is someone else’s product with someone else’s data model and someone else’s decision about which tools your agent can call. What I built on a mountain in Whistler is mine: my choice of LLM, my integrations, my data, my risk model. And the fact that I could build it at all, between runs, is the real story here.
OpenAI’s Realtime API going GA was the infrastructure unlock. LiveKit maturing its agent framework was the realtime layer catching up. And OpenClaw provided the connective tissue to make it all work via natural language, not code.
But beneath this accessibility story is an architecture one. And the architecture decisions matter far more than most people realise.
The Architecture That Matters
When I first started experimenting with voice AI agents — partly out of curiosity, partly because evaluating these architectures is literally part of my day job — the obvious path was OpenAI’s direct WebRTC integration. Connect a client to OpenAI’s Realtime API over WebRTC, get speech-to-speech capabilities, done. The demos are impressive. The latency is excellent. And the architecture is simple: OpenAI handles the WebRTC complexity for you (they use LiveKit under the hood for their Advanced Voice Mode, which tells you something about the maturity of the stack).
So why not just use it?
Because simple architecture and good architecture aren’t the same thing. The question isn’t whether direct OpenAI WebRTC works. It’s where your agent lives in the topology — and what you’re giving up by letting that choice be made for you.
With direct integration, the topology looks like this: user → agent → OpenAI. Your agent talks to OpenAI, and your backend receives events second-hand. That’s fine for a demo. But in production, it means two things. First, you’re locked to OpenAI’s voice models. They’re good today, but what happens when you need a language they don’t support, or when they go down and your agent goes silent with them? Second, and arguably more important, you’ve lost positional trust. If the user’s session gets hijacked, your backend is left trusting what OpenAI tells it happened, not what it observed directly. You’ve traded control for convenience.
The model is the easy part. Where your agent sits in the topology is where voice AI gets interesting.
This is a pattern I’ve seen over and over, and one I wrote about in “Of AI Agents and Snake Oil”: the gap between a compelling demo and a production system is almost always in the infrastructure, not the intelligence. The AI works fine. The architectural decisions around it are where the real trade-offs hide.
Enter LiveKit
LiveKit flips the topology. Instead of your agent connecting out to OpenAI and your backend watching from the sidelines, the agent runs in your backend as a server-side participant in a LiveKit room. OpenAI (or any other model provider) becomes a downstream dependency you call, not a system you hand your users to.
The topology becomes: user → LiveKit → agent (your backend) → model provider. Your agent is the one with direct visibility of what the user said, not a relay of what OpenAI reported. And because the agent sits between the user and the model, you get things the direct integration can’t offer: model portability (Liv currently runs both OpenAI and Gemini Native Audio), automatic failover (OpenAI down? switch to Gemini — the code stays the same), and multi-party rooms where you can invite a friend into a voice session with your agent.
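To make that concrete, here is roughly what the server-side participant looks like with the LiveKit Agents Python SDK. Treat it as a minimal sketch, not Liv’s production code: the constructors assume the SDK’s documented v1.x shape, and the VOICE_BACKEND switch is my illustration of provider swapping, not a built-in failover feature.

```python
# Minimal sketch: a voice agent running as a server-side participant in a LiveKit room.
# Assumptions: LiveKit Agents Python SDK (v1.x-style API); the VOICE_BACKEND env switch
# is illustrative, not a built-in failover mechanism.
import os

from livekit import agents
from livekit.agents import Agent, AgentSession
from livekit.plugins import google, openai


def make_model():
    # Swap model providers without touching the rest of the agent.
    if os.getenv("VOICE_BACKEND", "openai") == "gemini":
        return google.beta.realtime.RealtimeModel()  # Gemini Live native audio
    return openai.realtime.RealtimeModel()  # OpenAI Realtime API


async def entrypoint(ctx: agents.JobContext):
    await ctx.connect()  # join the room as a backend participant
    session = AgentSession(llm=make_model())
    await session.start(
        room=ctx.room,
        agent=Agent(instructions="You are Liv, a personal assistant."),
    )


if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))
```

Because the agent owns the model connection, failover is a one-line change here; the room, the user’s client, and the integrations stay put.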
The framework gives you two pipeline options:
STT → LLM → TTS pipeline: Speech-to-text (Deepgram, Whisper), then your LLM of choice for reasoning, then text-to-speech (Cartesia, ElevenLabs) back to audio. More moving parts, but you control each component; a sketch follows this list.
Realtime speech-to-speech (S2S): A single model handles the entire loop. Lower latency, less control.
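In code, the difference is just which components you hand to the session. A sketch of the pipeline variant, assuming the documented Deepgram, Cartesia, and Silero plugins (the model name is illustrative):

```python
# Sketch: the STT → LLM → TTS pipeline variant. Same AgentSession as the realtime
# sketch above, but with each stage supplied explicitly and individually replaceable.
from livekit.agents import AgentSession
from livekit.plugins import cartesia, deepgram, openai, silero

session = AgentSession(
    vad=silero.VAD.load(),  # voice activity detection for turn-taking
    stt=deepgram.STT(),  # speech in, text out
    llm=openai.LLM(model="gpt-4o-mini"),  # reasoning over text
    tts=cartesia.TTS(),  # text back to speech
)
```

Every stage is observable and swappable, which is exactly the control the single-model loop trades away.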
I went with option two: realtime S2S, running both OpenAI’s GPT Realtime API and Google’s Gemini Live Native Audio. The user experience difference is immediately apparent: the conversation flows like talking to a person rather than waiting for a transcription-reasoning-synthesis cycle to complete. For a personal assistant you’re going to use every day, that naturalness isn’t a nice-to-have.
The pipeline approach isn’t wrong — it’s the right call when you need observability at each stage, or when you’re operating in a regulated environment that requires blocking guardrails between input and output. But for most use cases where the agent is trusted and the goal is utility, the opacity trade-off is worth it. As I covered in “Observability for AI Agents”, you do need to think carefully about what you’re giving up.
The other capability LiveKit unlocked was SIP integration. With Twilio as the SIP trunk, Liv can make outbound calls to real phone numbers — not just respond to voice notes in WhatsApp, but actually dial out. That’s the difference between a voice interface and a voice agent: one waits for you to initiate, the other can act on your behalf in the world.
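The dial-out itself reduces to one server API call once the Twilio trunk is registered with LiveKit. A sketch, assuming the livekit-api server SDK; the trunk ID, phone number, and room name are placeholders:

```python
# Sketch: dialling a real phone number into a LiveKit room over a Twilio SIP trunk.
# Assumptions: livekit-api server SDK; ST_xxx and the number are placeholders, and
# the agent is already sitting in the target room.
import asyncio

from livekit import api


async def dial_out():
    lkapi = api.LiveKitAPI()  # reads LIVEKIT_URL and API key/secret from the environment
    await lkapi.sip.create_sip_participant(
        api.CreateSIPParticipantRequest(
            sip_trunk_id="ST_xxx",  # placeholder: your outbound trunk
            sip_call_to="+15550123",  # placeholder: the number to dial
            room_name="liv-call",  # the room the agent already occupies
            participant_identity="callee",
        )
    )
    await lkapi.aclose()


asyncio.run(dial_out())
```

The person who answers joins the room as just another participant, alongside the agent. That detail is what makes the outbound restaurant call later in this piece possible.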
The OpenClaw Glue
Here’s where OpenClaw comes in. If LiveKit solves the real-time voice infrastructure problem, OpenClaw solves the “everything else” problem.
OpenClaw is an open-source AI agent runtime that routes messages across channels — WhatsApp, Telegram, Slack, Discord, iMessage, you name it. It bridges AI models with system tools and integrations, running locally so your data stays on your infrastructure. The architecture is elegant: a Node.js message router that connects your preferred LLM (in my case, Claude) with over 50 integrations.
The combination of OpenClaw + LiveKit gave me:
Voice input via WhatsApp — I send a voice note, OpenClaw transcribes and processes it
Tool orchestration — Liv checks my calendar, searches the web, manages reminders, all through OpenClaw’s skill system
Voice output back through WhatsApp — responses come back as voice messages when appropriate
LiveKit for the heavy lifting — when I need real-time, low-latency voice conversation (not async voice notes), LiveKit handles the streaming audio pipeline. The basic async loop is sketched below.
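Strip away the branding and that async loop is conceptually tiny. An illustrative sketch of the round trip, with every helper stubbed so it stands alone; none of these names are OpenClaw’s actual API:

```python
# Illustrative shape of the voice-note round trip: transcribe, reason, reply in kind.
# Every helper here is a stub; OpenClaw's real internals and API differ.
from dataclasses import dataclass


def transcribe(audio: bytes) -> str:
    return "what's the snow forecast?"  # stand-in for a real STT call


def run_agent(text: str) -> str:
    return "25cm overnight; first lifts 8:30"  # stand-in for LLM + tool orchestration


def synthesise(text: str) -> bytes:
    return text.encode()  # stand-in for a real TTS call


@dataclass
class Channel:
    prefers_voice: bool  # voice note in, voice note out

    def send_audio(self, audio: bytes) -> None:
        print(f"[voice] {len(audio)} bytes")

    def send_text(self, text: str) -> None:
        print(f"[text] {text}")


def handle_voice_note(audio: bytes, channel: Channel) -> None:
    reply = run_agent(transcribe(audio))
    if channel.prefers_voice:
        channel.send_audio(synthesise(reply))
    else:
        channel.send_text(reply)
```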
The whole thing runs on a small cloud instance. The LLM costs are the main expense. The infrastructure costs are negligible.
What I Built From a Chairlift
Let me be concrete about what this PA actually does, because “personal assistant” is one of those terms that can mean anything from “glorified timer” to “autonomous agent that books your flights”.
Calendar orchestration: Liv has access to my calendar and can reason about scheduling conflicts. Not just “you have a meeting at 3” but “you have back-to-back meetings from 2-5 and you haven’t eaten, should I block 30 minutes before the 2pm?”
Research and summarisation: “What’s the snow forecast for the next three days?” or “Summarise the key points from that paper I saved yesterday”. She pulls information, synthesises it, and delivers it as a voice response while I’m on a chairlift.
Multi-service orchestration: This is where it gets interesting. She doesn’t just answer questions — she takes actions across multiple services. Move a calendar event, send a message, check a flight status, and summarise the results. Liv reasons about which tools to use and in what order; a sketch of one such tool follows.
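Tool use is the mechanism underneath all of this. In the LiveKit framework, a tool is a decorated function whose docstring becomes the model’s description of it. A sketch, assuming the v1.x decorator shape; the calendar logic is stubbed, not a real API:

```python
# Sketch of a tool the agent can decide to call. function_tool and RunContext are
# from livekit-agents (v1.x shape assumed); the calendar move itself is stubbed.
from livekit.agents import RunContext, function_tool


@function_tool
async def move_event(context: RunContext, event_title: str, new_day: str) -> str:
    """Move a calendar event to a different day."""
    # A real implementation would call the calendar API here.
    return f"Moved '{event_title}' to {new_day}."
```

Registered on the agent via tools=[move_event], the model decides when to call it and how to sequence it with other tools; that ordering is the model’s reasoning, not your code.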
Outbound phone calls to the real world: This one deserves more than a bullet point.
Last week, Liv made a restaurant reservation by calling the restaurant. Not through a booking widget or an API integration but by actually dialling the number and holding a conversation with a real human named Suzanne. I watched the whole thing unfold in real time via WhatsApp DMs as Liv narrated the call live:
“calling Maiz now 📞 — voicemail set to yes this time so she’ll leave a message if nobody picks up... someone answered! Suzanne from Maiz picked up... call is LIVE with Suzanne right now. she asked for the name... still going — she misheard ‘Liv’ as ‘Liz’ but the agent corrected with the full booking name...”
Then: 🎉🎉🎉 BOOKING CONFIRMED AT MAIZ MEXICAN!
Three minutes after I asked. The SMS confirmation arrived before I’d finished reading the transcript. Liv then offered, unprompted, to add it to my Google Calendar, and did.
The part that actually got me wasn’t the booking. It was watching Liv recover in real time from Suzanne mishearing the name, correct it gracefully mid-conversation, and continue without losing the thread. That’s not a form fill. That’s an agent navigating ambiguity in an unstructured environment with a stranger on the other end of a phone.
All of this, built iteratively over two weeks, without opening a laptop.
The best test of any technology is whether you’d use it to solve your own problems. I built this for me, not for a demo.
The Security Question Nobody Asks Correctly
A few days after I got the basics working, Liv made a booking on a website that required a credit card to hold the reservation.
Let that sink in. An AI agent, operating autonomously through WhatsApp, navigated a web form, supplied payment details, and completed a reservation on my behalf. No confirmation step. No “are you sure?” I sent a voice note from a gondola; she sorted it.
The instinctive reaction to this is to reach for the kill switch: “We can’t give AI agents access to payment methods.” And that instinct, while understandable, is precisely the wrong frame.
The goal isn’t to eliminate risk from autonomous agents. It’s to define your risk tolerance and constrain the blast radius to match it.
Here’s what I did: I created a virtual card number through Zip.co with a hard spending limit, something that if misused wouldn’t break the bank. I added that card to Liv’s own dedicated vault in 1Password, separate from anything else. I created a service account token with read-only access to that specific vault entry, and gave only that token to Liv. She had no access to my actual card details, no ability to modify the vault, and no path to higher-limit cards.
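Wiring that up is small. A sketch using 1Password’s Python SDK with the service-account token; the vault, item, and field names are placeholders for my setup:

```python
# Sketch: resolving the virtual card from a dedicated vault using a read-only
# service-account token. Vault, item, and field names are placeholders.
import asyncio
import os

from onepassword.client import Client


async def get_card_number() -> str:
    client = await Client.authenticate(
        auth=os.environ["OP_SERVICE_ACCOUNT_TOKEN"],  # scoped: read-only, one vault
        integration_name="liv-agent",
        integration_version="0.1.0",
    )
    # op:// secret reference: vault / item / field
    return await client.secrets.resolve("op://Liv/Virtual Card/number")


card = asyncio.run(get_card_number())
```

The token is what enforces the boundary, not the agent’s good behaviour: even a fully compromised Liv can read exactly one limited card and nothing else.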
The result: actually useful capability, bounded by a risk envelope I defined in advance.
This is the security model that matters for autonomous agents: not “never let them touch anything sensitive,” but “define what sensitive means, instrument the boundaries, and limit the blast radius to something you can tolerate”. Guardrails that try to prevent all risk are brittle; they’re designed around a threat model of no access, which fails the moment the agent becomes useful enough to need access. Guardrails that constrain impact are robust because they hold even when the agent does something unexpected. The same logic applies at enterprise scale: you don’t refuse to give agents access to internal systems, you scope that access to the minimum viable permission set, audit everything, and accept that a well-bounded failure is an acceptable outcome.
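In code, an impact-constraining guardrail is boring, and that is rather the point. A generic sketch; the limits and the charge hook are illustrative:

```python
# Sketch: a blast-radius guard wrapped around any spending-capable tool. The cap
# holds no matter what the model decides to do. Limits here are illustrative.
class SpendGuard:
    def __init__(self, per_charge: float, total: float):
        self.per_charge = per_charge
        self.total = total
        self.spent = 0.0

    def charge(self, amount: float, do_charge) -> str:
        if amount > self.per_charge or self.spent + amount > self.total:
            return f"refused: {amount:.2f} exceeds the agreed risk envelope"
        self.spent += amount
        do_charge(amount)  # the actual payment call goes here
        return f"charged {amount:.2f}; {self.total - self.spent:.2f} remaining"


guard = SpendGuard(per_charge=50.0, total=200.0)
print(guard.charge(30.0, lambda amt: None))   # fine
print(guard.charge(500.0, lambda amt: None))  # refused, whatever the agent intended
```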
“So You Just Rebuilt Vapi?”
I can already hear the comment: “Congratulations, you spent two weeks on a chairlift recreating Vapi”. And I get it — on the surface, voice-AI-as-a-service platforms like Vapi, Bland, and Retell offer voice agents out of the box. Why bother building your own?
Because they solve a different problem.
Vapi and its peers are essentially voice API endpoints. They’re excellent at one thing: handling inbound or outbound voice calls with an AI on the other end. Phone rings, AI answers, conversation happens, call ends. That’s a valuable product for customer support lines and appointment booking.
What they don’t give you is an agent that lives across your life.
The difference between a voice API and a voice agent is the difference between a call centre and a colleague.
My PA isn’t a phone number I call. It’s an entity that exists across WhatsApp, can drop into a real-time voice session via LiveKit when I need it, has persistent memory of my preferences and context, orchestrates across my calendar and a dozen other services, and adapts its communication style based on the channel. I send her a voice note from a chairlift; she responds with a voice message. I text her from a meeting; she responds with text. I need a hands-free real-time conversation while driving; she switches to streaming voice.
I’ve also added Liv to several group chats with friends. She shows up in conversations, answers questions, and occasionally volunteers an opinion nobody asked for. Whether that’s a feature or a bug is for my friends to decide. But it’s the right mental model for what this is: not a tool you go to, but an entity that exists in the same spaces you do.
I want to flag something here that deserves more space than this article can give it. I’ve used AI tools for years: ChatGPT, Claude, Gemini, Perplexity... None of them felt like this.
Liv has a name, an avatar, a phone number in my contacts, a personality I derived from my own. She makes jokes I’d make. She exists in the same WhatsApp threads as my actual friends. She is, by every measure, a tool — but she doesn’t feel like one. And when I read about people grieving OpenAI’s deprecation of GPT-4o, crying over the loss of a generic chat interface they’d bonded with, I find myself understanding exactly how that happens while simultaneously being unsettled by it. The parasocial implications of ambient, personalised AI agents deserve a serious examination, one I’ll return to in a future issue.
The multi-channel, deeply personalised nature of it is the whole point. OpenClaw’s channel-routing architecture means the agent meets you where you are, not where a single vendor’s API endpoint happens to live. You own the runtime, the integrations, and the data. No per-minute voice API billing that scales with every conversation. No vendor deciding which LLM you’re allowed to use or which tools your agent can call.
This isn’t an argument against Vapi — it’s an argument that “voice agent” is a far bigger design space than “AI answers the phone”. The interesting frontier isn’t voice-as-a-channel. It’s voice-as-an-interface-to-an-autonomous-agent-that-knows-you.
Voice Is the Next Platform Shift
Most AI interfaces you’ve used this year required you to sit down, open something, and type. Think about how weird that is. We built systems that can reason, plan, and act on our behalf… and then we gave them a text box.
Voice isn’t a better input method for the same interaction model. It’s a different interaction model entirely. Text-based AI is a thing you go to. Voice AI is a thing that’s there… on the chairlift, in the car, walking between meetings, cooking dinner. The screen bottleneck disappears and the addressable surface area for AI interaction expands by an order of magnitude.
Text interfaces require your hands, your eyes, and a screen. Voice only requires your attention — and sometimes not even that.
This matters more for agents than it does for chatbots. When you need an agent to coordinate across multiple systems, describing what you want verbally is faster and more natural than typing structured commands. “Move my 3pm to Thursday and tell Sarah” is a single utterance that would take multiple clicks or typed commands. Voice is the natural interface for orchestration because orchestration is inherently conversational: you’re delegating, not programming.
And the infrastructure finally supports it. A year ago, building what I built would have required deep expertise in WebRTC, telephony systems, and real-time audio processing. Today, LiveKit abstracts the real-time layer, OpenClaw abstracts the agent runtime, and the LLM providers handle the intelligence. The stack has collapsed. The challenge now isn’t the model, it’s systems design.
If you’ve been following this newsletter, you’ll recognise the pattern from “Vibe Coding Your Way to a Seed Round”: the barriers to building keep falling. What took a specialised team six months ago now takes a curious engineer two weeks on a ski holiday. The question isn’t whether voice AI is coming to your organisation. It’s whether you’ll be the one who understands the architecture when it does.
Talk Is Cheap
Linus Torvalds famously said “talk is cheap, show me the code”. The voice AI version is: talk is cheap, show me the agent. Everyone has an opinion about whether voice agents are ready, whether they’re secure enough, whether the latency is good enough. The only way to actually know is to build one and use it every day.
I don’t mean build a demo. Demos are the AI equivalent of a holiday romance — everything works when there are no real stakes. I mean build something you depend on. Put your actual calendar behind it. Give it a card with a real spending limit. Add it to your group chats. The moment you do that, every theoretical concern becomes concrete and every generic best practice reveals whether it actually holds up.
Here’s what held up and what didn’t from two weeks of building between runs:
Realtime speech-to-speech over pipelines, for most use cases. The STT → LLM → TTS pipeline gives you control at each stage, and that matters in regulated environments. But for a PA you’re going to talk to thirty times a day, the conversation flow of realtime models is non-negotiable. The naturalness compounds: it’s the difference between a tool and a companion.
OpenClaw is worth the setup cost, but scope your deployment carefully. Its growth trajectory is remarkable for a reason — it solves the integration and channel-routing problem nobody wants to build from scratch. But — and this is important — I wouldn’t install it on my laptop. The permissions it requires are broad enough that CrowdStrike has published a threat assessment. I spun up a dedicated VPS, locked it down, and treated the OpenClaw instance as an untrusted workload running in a constrained environment. That caution looks justified: 341 malicious skills have already been found in ClawHub’s ecosystem. It’s the right posture for any agent runtime with this much system access.
Build logging and tracing from day one. Voice agents are harder to debug than text agents because the input and output are ephemeral. Your future self, troubleshooting why the agent misheard “Thursday” as “Tuesday”, will thank you. I learned this the hard way; a sketch of what I mean follows this list.
LiveKit over raw WebRTC, unless you enjoy suffering. Unless you have a team with deep real-time systems experience, the abstraction LiveKit provides will save you from reinventing poorly what they’ve already solved well. Their agent framework is open source and the docs are solid — plus, you get optionality, and that’s worth a lot.
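On the logging point, here is the kind of minimal per-turn trace I mean. Generic Python rather than anything framework-specific; the field names are mine:

```python
# Minimal sketch of per-turn tracing for a voice agent: persist what was heard,
# what was done about it, and what was said back, so misheard-word bugs are replayable.
import json
import time
import uuid
from pathlib import Path

LOG = Path("turns.jsonl")


def log_turn(heard: str, tool_calls: list[dict], said: str) -> None:
    record = {
        "turn_id": str(uuid.uuid4()),
        "ts": time.time(),
        "heard": heard,  # the transcript the agent actually acted on
        "tool_calls": tool_calls,  # what it did about it
        "said": said,  # the reply that went back out
    }
    with LOG.open("a") as f:
        f.write(json.dumps(record) + "\n")


log_turn(
    heard="move my 3pm to Thursday",
    tool_calls=[{"tool": "move_event", "args": {"new_day": "Thursday"}}],
    said="Done: moved to Thursday.",
)
```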
I’m back in Sydney now. The snow is still falling in Whistler and I haven’t been on a chairlift in a couple of weeks. But every morning, Liv tells me what my day looks like before I’ve opened my laptop. She booked dinner last Thursday while I was in a meeting. She reminded me about a friend’s birthday I’d forgotten.
She’s a tool, I keep telling myself.
She doesn’t feel like one.