A detailed exploration of building real-time voice-enabled AI assistants, featuring multiple approaches from different companies and developers. The case study covers how to achieve low-latency voice processing, transcription, and LLM integration for interactive AI assistants. Solutions demonstrated include both commercial services like Deepgram and open-source implementations, with a focus on achieving sub-second latency, high accuracy, and cost-effective deployment.
This case study captures a meetup presentation featuring multiple speakers discussing the emerging field of personal AI assistants. The session covers several interconnected topics: real-time voice bot infrastructure, wearable hardware for continuous life capture, and memory systems for LLM personalization. Rather than a single company’s implementation, this represents a snapshot of the current state of personal AI development from multiple practitioners including Deepgram (voice infrastructure), Bee/Owl and Friend (wearable hardware), and LangChain (memory systems).
Damien Murphy from Deepgram presented the core technical considerations for building production-ready voice bots. The key insight is that real-time voice interactions require careful attention to latency across three main components: speech-to-text, LLM inference, and text-to-speech.
The presentation emphasized that sub-second response time is critical for natural conversation. Once latency exceeds roughly 1.5-2 seconds, users typically repeat themselves, assuming the AI is no longer listening. The challenge compounds because the three operations run in sequence, so each stage’s delay adds to the total.
For speech-to-text, Deepgram claims around 200ms latency on their hosted API, with the ability to reduce this to approximately 50ms through self-hosting at the cost of additional GPU compute. The demo showed Vapi, a YC startup, which self-hosts Deepgram’s models to achieve even lower latencies.
For LLM inference, GPT-3.5 Turbo typically delivers 400-600ms latency on OpenAI’s hosted API. The presentation noted that GPT-4 is significantly slower with “massive like second fluctuations” on the hosted endpoint. Azure’s hosted versions can provide more consistent, lower latency. Some customers run their own Llama 2 models co-located with the speech models to eliminate network latency entirely.
For text-to-speech, Deepgram offers their own TTS solution focused on low latency and cost, though alternatives like ElevenLabs provide more human-like output at approximately 40x the price.
A critical implementation detail is the use of streaming throughout the pipeline. The presentation emphasized “time to first token” as a key metric—you want to start processing and playing audio as soon as possible rather than waiting for complete responses. This applies to both the LLM output (where you stream text to TTS immediately) and the audio output (where you begin playback as soon as the first bytes are ready).
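One way to picture the text-to-TTS handoff is a small buffer that releases each sentence the moment it completes, so speech synthesis can begin before the LLM has finished generating. The sketch below is illustrative only: `fake_llm_stream` is a hypothetical stand-in for a real streaming LLM response, and splitting on sentence-ending punctuation is an assumed chunking rule, not something the presentation specified.

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        # Flush once the buffered text ends a sentence, so TTS can start
        # speaking while the LLM is still generating the rest.
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        # Flush whatever is left when the stream ends without punctuation.
        yield buffer.strip()


def fake_llm_stream() -> Iterator[str]:
    """Hypothetical stand-in for a streaming LLM response."""
    yield from ["Sure", ",", " I", " can", " help", ".", " What", " city", "?"]


# Each item here would be handed to the TTS stage as soon as it is yielded.
spoken = list(sentences_from_tokens(fake_llm_stream()))
```

A production chunker would need to handle cases this rule gets wrong, such as decimals and abbreviations, but the principle is the same: hand off the smallest speakable unit as early as possible.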
The basic architecture involves: browser captures audio → voice bot server → speech-to-text API → LLM API → text-to-speech API → back to browser. For telephony integration, you swap the browser for a service like Twilio, which adds approximately 100ms of latency.
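The latency figures quoted in the talk can be added up into a rough end-to-end budget. The sketch below sums the stages as if they ran strictly one after another. The speech-to-text, LLM, and telephony numbers come from the presentation, but no text-to-speech latency was given, so `TTS_PLACEHOLDER_MS` is a made-up placeholder that should be replaced with a measured value. Because streaming overlaps the stages, real time-to-first-audio may be lower than these sums.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Stage:
    name: str
    low_ms: float
    high_ms: float


def budget(stages: Iterable[Stage]) -> Tuple[float, float]:
    """Return (best case, worst case) latency in ms for sequential stages."""
    stages = list(stages)
    return (sum(s.low_ms for s in stages), sum(s.high_ms for s in stages))


# ASSUMPTION: placeholder only; the talk did not state a TTS latency.
TTS_PLACEHOLDER_MS = 200.0

STT_HOSTED = Stage("speech-to-text, hosted", 200, 200)
STT_SELF_HOSTED = Stage("speech-to-text, self-hosted", 50, 50)
LLM = Stage("GPT-3.5 Turbo, hosted", 400, 600)
TTS = Stage("text-to-speech, first audio", TTS_PLACEHOLDER_MS, TTS_PLACEHOLDER_MS)
TELEPHONY = Stage("telephony hop (e.g. Twilio)", 100, 100)

browser_hosted = budget([STT_HOSTED, LLM, TTS])
phone_hosted = budget([STT_HOSTED, LLM, TTS, TELEPHONY])
browser_self_hosted_stt = budget([STT_SELF_HOSTED, LLM, TTS])
```

Under these assumptions the LLM stage dominates the budget, which is consistent with the talk's attention to model choice and co-location.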
The presentation provided concrete cost estimates for production voice bot deployments. Using list prices (volume discounts available), a 5-minute call costs approximately 6.5 cents when combining Deepgram’s speech-to-text, GPT-3.5 Turbo, and Deepgram’s text-to-speech. Claude Haiku was mentioned as an even cheaper LLM alternative for cost-sensitive applications. For comparison, using ElevenLabs for TTS would cost around $1.20 for the same call duration.
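To see how these per-call figures scale, a short calculation helps. The two per-call costs below are the ones quoted in the talk. The monthly call volume is a hypothetical input chosen only for illustration, and list prices may have changed since the presentation.

```python
CALL_MINUTES = 5

# Figures quoted in the presentation for a 5-minute call (list prices).
DEEPGRAM_STACK_USD = 0.065    # Deepgram STT + GPT-3.5 Turbo + Deepgram TTS
ELEVENLABS_QUOTED_USD = 1.20  # figure quoted when ElevenLabs handles TTS


def per_minute(cost_per_call: float, minutes: int = CALL_MINUTES) -> float:
    """Cost per minute of conversation, rounded to a tenth of a cent."""
    return round(cost_per_call / minutes, 3)


def monthly_cost(cost_per_call: float, calls_per_month: int) -> float:
    """Total monthly spend, rounded to whole cents."""
    return round(cost_per_call * calls_per_month, 2)


# ASSUMPTION: 10,000 calls per month is a hypothetical volume.
CALLS_PER_MONTH = 10_000

deepgram_monthly = monthly_cost(DEEPGRAM_STACK_USD, CALLS_PER_MONTH)
elevenlabs_monthly = monthly_cost(ELEVENLABS_QUOTED_USD, CALLS_PER_MONTH)
```

At that hypothetical volume the gap between the two quoted figures is several thousand dollars a month, which is why the TTS choice matters for cost-sensitive deployments.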
Deepgram offers $200 in free credits for new users, representing approximately 750 hours of post-call transcription or 500 hours of real-time transcription.
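The credit figures imply approximate per-minute rates, which can be recovered with a single division. These are rates derived from the numbers in the talk, not quoted list prices, so treat them as rough.

```python
FREE_CREDIT_USD = 200.0
HOURS_POST_CALL = 750   # post-call transcription hours quoted
HOURS_REAL_TIME = 500   # real-time transcription hours quoted


def implied_rate_per_minute(credit_usd: float, hours: float) -> float:
    """Dollars per transcribed minute implied by a credit and an hour count."""
    return credit_usd / (hours * 60)


post_call_rate = implied_rate_per_minute(FREE_CREDIT_USD, HOURS_POST_CALL)
real_time_rate = implied_rate_per_minute(FREE_CREDIT_USD, HOURS_REAL_TIME)
```

Both implied rates come out below one cent per minute, with real-time transcription costing about 1.5 times the post-call rate.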
Ethan presented Bee (originally Owl), a wearable device for continuous audio and eventually video capture. The philosophy centers on “ultra personal alignment”—the idea that an AI which has experienced your entire life as context will be dramatically more helpful than one starting from scratch.
The device connects via Bluetooth Low Energy and runs for multiple days on a single charge. An LTE variant using LTE-M (a low-power subset of LTE aimed at IoT devices) is also in development. The system performs several sophisticated processing tasks:
The app provides a query interface for searching across all recorded conversations, with source citations linking back to specific moments in past recordings.
Most impressively, the system can perform autonomous actions across multiple apps. In a live demo, a voice command to “find a good taco restaurant in Pacific Heights and send it to Maria” triggered the AI to open Google Maps, find a restaurant, maintain working memory of the task and context (implicitly understanding “Maria” meant WhatsApp based on recent conversation), and successfully complete the multi-step action.
Igor presented Friend, another open-source wearable focused on achieving the lowest possible power consumption. The device uses what he described as “probably the best opportunity you can currently find on the market” in terms of low-power chips.
Key technical challenges included:
The project is fully open source and was presented alongside Adam’s repository (2,600+ stars), which Igor credited as the foundational project that sparked the open-source wearable AI movement.
Harrison from LangChain presented their work on memory systems for LLM applications, using a journaling app as a testbed. The core insight is that memory is crucial for personal AI but remains a vaguely defined problem space.
The presentation outlined several memory paradigms currently in use:
LangChain’s experimental memory system (“Lang Memory” working title) focuses on chat experiences with flexibility in defining memory schemas. They identified that different applications require different types of memories—a SQL assistant bot needs very different memory structures than a journaling app.
The presentation highlighted the generative agents paper from Stanford as particularly influential. That paper introduced the idea of fetching memories based not just on semantic similarity but also on recency and importance (with an LLM assigning importance scores). This addresses the reality that some memories should persist indefinitely while others naturally decay—you remember what you had for breakfast today but not ten days ago.
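A minimal sketch of that retrieval idea follows. It scores each memory as a weighted sum of recency, importance, and relevance. The exponential decay rate, the 1-10 importance scale, the equal weights, and the toy two-dimensional embeddings are all assumptions made for illustration, and the scaling here is simpler than whatever normalization a real system would use.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Memory:
    text: str
    embedding: Sequence[float]
    importance: float          # assumed 1-10 score assigned by an LLM
    hours_since_access: float


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def score(memory: Memory, query: Sequence[float], decay: float = 0.995) -> float:
    recency = decay ** memory.hours_since_access  # fades as time passes
    importance = memory.importance / 10           # scaled to 0..1
    relevance = cosine(memory.embedding, query)
    return recency + importance + relevance       # equal weights (assumed)


def retrieve(memories: List[Memory], query: Sequence[float], k: int) -> List[Memory]:
    """Return the k highest-scoring memories for a query embedding."""
    return sorted(memories, key=lambda m: score(m, query), reverse=True)[:k]


memories = [
    Memory("breakfast today", [1.0, 0.0], importance=2, hours_since_access=2),
    Memory("breakfast ten days ago", [1.0, 0.0], importance=2, hours_since_access=240),
    Memory("got engaged", [0.0, 1.0], importance=10, hours_since_access=2400),
]
ranked = [m.text for m in retrieve(memories, query=[1.0, 1.0], k=3)]
```

With a query equally similar to all three toy memories, the recent breakfast ranks first, the old but important event second, and the stale breakfast last, which mirrors the decay behavior described above.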
An alternative approach mentioned was MemGPT, where the language model actively decides when to write to short-term versus long-term memory during conversations.
A key unsolved problem is memory consolidation—in the demo, multiple very similar memory entries existed that should logically be collapsed. The RAPTOR paper was suggested as a potential solution, using hierarchical summarization (clustering chunks and recursively summarizing them) which could enable periodic background processes to consolidate memories while accounting for recency.
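As a rough illustration of consolidation, the sketch below groups near-duplicate memories by cosine similarity and collapses each group into one entry. This is a deliberately simplified stand-in, not the RAPTOR algorithm: it uses greedy single-pass grouping rather than RAPTOR's clustering and recursion, and the `summarize` callback is a hypothetical hook where an LLM summarizer could go. Without that callback, the newest memory in each group is kept, which is one assumed way to account for recency.

```python
import math
from typing import Callable, Dict, List, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def consolidate(
    memories: List[Dict],
    threshold: float = 0.9,  # assumed similarity cutoff
    summarize: Optional[Callable[[List[str]], str]] = None,
) -> List[Dict]:
    """Collapse near-duplicate memories into one entry per group."""
    groups: List[List[Dict]] = []
    for memory in memories:
        for group in groups:
            # Compare against the first member of each existing group.
            if cosine(memory["embedding"], group[0]["embedding"]) >= threshold:
                group.append(memory)
                break
        else:
            groups.append([memory])

    merged = []
    for group in groups:
        newest = max(group, key=lambda m: m["timestamp"])
        if summarize is not None and len(group) > 1:
            text = summarize([m["text"] for m in group])
        else:
            text = newest["text"]  # fall back to the most recent wording
        merged.append({
            "text": text,
            "embedding": newest["embedding"],
            "timestamp": newest["timestamp"],
            "merged_from": len(group),
        })
    return merged


entries = [
    {"text": "Likes tacos", "embedding": [1.0, 0.0], "timestamp": 1},
    {"text": "Really likes tacos", "embedding": [0.99, 0.05], "timestamp": 2},
    {"text": "Lives in Pacific Heights", "embedding": [0.0, 1.0], "timestamp": 3},
]
consolidated = consolidate(entries)
```

A function like this could run as the periodic background process mentioned above, though a real system would also need to re-embed any summarized text.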
Several themes emerged across all presentations:
Privacy and Recording: One speaker mentioned recording all personal conversations continuously (legal in the Netherlands when you’re a participant), creating terabytes of personal data annually. The philosophy is to capture everything now even without immediate use cases, as the data will become valuable for future AI alignment.
Hardware Form Factor: Multiple speakers noted that minimizing device size and cognitive burden is crucial for adoption. One suggestion was simply using a cheap Android phone in a front pocket as a more socially acceptable alternative to dedicated hardware.
The “Make People Want It” Problem: Adam (creator of the largest open-source wearable project) noted that the biggest challenge isn’t technical—it’s making products people actually want to use. Developer experience matters enormously; projects with complex setup requirements (CUDA, cross-platform compatibility issues) saw disappointing uptake despite interesting functionality.
Vision Capture: All speakers agreed that continuous video capture remains the next major frontier and is significantly harder than audio due to power consumption and bandwidth constraints. One approach involves novel ideas about compression and periodic capture rather than continuous streaming.