Bee: Building Voice-Enabled AI Assistants with Real-Time Processing

Overview

This case study captures a meetup presentation in which multiple speakers discussed the emerging field of personal AI assistants. The session covers three interconnected topics: real-time voice bot infrastructure, wearable hardware for continuous life capture, and memory systems for LLM personalization. Rather than documenting a single company’s implementation, it offers a snapshot of the current state of personal AI development, drawn from several practitioners: Deepgram (voice infrastructure), Bee/Owl and Friend (wearable hardware), and LangChain (memory systems).

Real-Time Voice Bot Architecture (Deepgram)

Damien Murphy from Deepgram presented the core technical considerations for building production-ready voice bots. The key insight is that real-time voice interactions require careful attention to latency across three main components: speech-to-text, LLM inference, and text-to-speech.

Latency Requirements

The presentation emphasized that sub-second response time is critical for natural conversation. If latency exceeds approximately 1.5-2 seconds, users typically repeat themselves, assuming the AI is no longer listening. The challenge compounds because speech-to-text, LLM inference, and text-to-speech run in sequence, so their latencies add up and each stage has to be optimized.

For speech-to-text, Deepgram claims around 200ms latency on their hosted API, with the ability to reduce this to approximately 50ms through self-hosting at the cost of additional GPU compute. The demo showed Vapi, a YC startup, which self-hosts Deepgram’s models to achieve even lower latencies.

For LLM inference, GPT-3.5 Turbo typically delivers 400-600ms latency on OpenAI’s hosted API. The presentation noted that GPT-4 is significantly slower with “massive like second fluctuations” on the hosted endpoint. Azure’s hosted versions can provide more consistent, lower latency. Some customers run their own Llama 2 models co-located with the speech models to eliminate network latency entirely.

For text-to-speech, Deepgram offers their own TTS solution focused on low latency and cost, though alternatives like ElevenLabs provide more human-like output at approximately 40x the price.
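Because the three stages run one after another, a rough latency budget is simply their sum. The sketch below uses the hosted speech-to-text and GPT-3.5 Turbo figures quoted above. The talk gives no text-to-speech latency figure, so `ASSUMED_TTS_MS` is a placeholder assumption, and the function name is illustrative only.

```python
# Rough per-turn latency budget for a voice bot (all values in milliseconds).
# STT and LLM numbers come from the figures quoted in the text; the TTS value
# is a hypothetical placeholder, not a number from the presentation.

ASSUMED_TTS_MS = 200       # assumption: time until the first audio is ready
SUB_SECOND_TARGET_MS = 1000


def turn_latency_ms(stt_ms: float, llm_ms: float, tts_ms: float) -> float:
    """Sequential stages add: nothing plays until each stage has produced output."""
    return stt_ms + llm_ms + tts_ms


hosted_best = turn_latency_ms(200, 400, ASSUMED_TTS_MS)
hosted_worst = turn_latency_ms(200, 600, ASSUMED_TTS_MS)
self_hosted_stt = turn_latency_ms(50, 400, ASSUMED_TTS_MS)

for label, total in [
    ("hosted, fast LLM response", hosted_best),
    ("hosted, slow LLM response", hosted_worst),
    ("self-hosted STT, fast LLM response", self_hosted_stt),
]:
    status = "within" if total <= SUB_SECOND_TARGET_MS else "over"
    print(f"{label}: {total:.0f} ms ({status} the 1 s target)")
```

Under these assumptions the hosted stack sits right at the sub-second boundary when the LLM is slow, which is why self-hosting and co-location come up as options.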

Streaming Architecture

A critical implementation detail is the use of streaming throughout the pipeline. The presentation emphasized “time to first token” as a key metric—you want to start processing and playing audio as soon as possible rather than waiting for complete responses. This applies to both the LLM output (where you stream text to TTS immediately) and the audio output (where you begin playback as soon as the first bytes are ready).
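One way to picture streaming LLM text into TTS is to buffer tokens only until a sentence boundary appears, then hand that chunk off right away. The sketch below is a minimal illustration, not Deepgram's implementation: `chunk_for_tts` and the token list are hypothetical, and the punctuation-based boundary rule is a simplification that would split on decimals such as "3.5".

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?")


def chunk_for_tts(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each sentence as soon as it completes instead of waiting for the full reply."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    # Flush whatever is left when the stream ends without closing punctuation.
    if buffer.strip():
        yield buffer.strip()


# Stand-in for a streamed LLM response (hypothetical tokens).
fake_llm_stream = ["Sure", ".", " The", " nearest", " branch", " opens",
                   " at", " nine", ".", " Anything", " else", "?"]

for sentence in chunk_for_tts(fake_llm_stream):
    # In a real bot this is where the sentence would be sent to the TTS API.
    print(sentence)
```

The first chunk ("Sure.") is available after two tokens, so audio synthesis can begin while the rest of the reply is still being generated.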

The basic architecture involves: browser captures audio → voice bot server → speech-to-text API → LLM API → text-to-speech API → back to browser. For telephony integration, you swap the browser for a service like Twilio, which adds approximately 100ms of latency.

Cost Economics

The presentation provided concrete cost estimates for production voice bot deployments. Using list prices (volume discounts available), a 5-minute call costs approximately 6.5 cents when combining Deepgram’s speech-to-text, GPT-3.5 Turbo, and Deepgram’s text-to-speech. Claude Haiku was mentioned as an even cheaper LLM alternative for cost-sensitive applications. For comparison, using ElevenLabs for TTS would cost around $1.20 for the same call duration.

Deepgram offers $200 in free credits for new users, equivalent to approximately 750 hours of post-call transcription or 500 hours of real-time transcription.
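The figures above can be reduced to per-minute and per-hour rates with simple division. The sketch below does only that arithmetic on the quoted numbers; the derived rates are implied by those round figures and are not official list prices, and the function names are illustrative.

```python
# Arithmetic on the figures quoted in the text. Derived rates are implied
# by those round numbers and should not be read as official pricing.

CALL_MINUTES = 5
DEEPGRAM_STACK_USD = 0.065    # quoted cost of a 5-minute call (STT + GPT-3.5 Turbo + Deepgram TTS)
ELEVENLABS_CALL_USD = 1.20    # quoted cost for the same call duration with ElevenLabs TTS
FREE_CREDIT_USD = 200


def per_minute(call_cost_usd: float, minutes: float = CALL_MINUTES) -> float:
    """Cost of one minute of conversation."""
    return call_cost_usd / minutes


def implied_hourly_rate(credit_usd: float, hours: float) -> float:
    """Hourly transcription price implied by a credit amount and the hours it buys."""
    return credit_usd / hours


print(f"Deepgram stack: ${per_minute(DEEPGRAM_STACK_USD):.3f} per minute")
print(f"ElevenLabs figure: ${per_minute(ELEVENLABS_CALL_USD):.3f} per minute")
print(f"Call-level ratio: {ELEVENLABS_CALL_USD / DEEPGRAM_STACK_USD:.1f}x")
print(f"Post-call transcription: ${implied_hourly_rate(FREE_CREDIT_USD, 750):.2f} per hour")
print(f"Real-time transcription: ${implied_hourly_rate(FREE_CREDIT_USD, 500):.2f} per hour")
```

Note that the call-level ratio is smaller than the 40x TTS price difference mentioned earlier, since the two figures compare different things.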

Wearable Hardware for Continuous Capture

Bee/Owl (Ethan’s Project)

Ethan presented Bee (originally Owl), a wearable device for continuous audio and eventually video capture. The philosophy centers on “ultra personal alignment”—the idea that an AI which has experienced your entire life as context will be dramatically more helpful than one starting from scratch.

The device connects via Bluetooth Low Energy and runs for multiple days on a single charge. A cellular variant using LTE-M (a low-power variant of LTE aimed at IoT devices) is also in development. The system performs several processing tasks:

Real-time transcription with speaker verification to distinguish the user’s speech from others
Conversation endpoint detection: determining when one conversation ends and another begins, which requires not just voice activity detection but also location signals and topic-change detection at the LLM level
Post-conversation processing using larger models to generate summaries, takeaways, atmosphere descriptions, and relevant links through RAG
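The endpoint-detection step described above combines several signals. The sketch below shows one plausible way to combine them; it is not Bee's actual logic. The `Signals` fields, the thresholds, and the decision rule are all hypothetical, and `topic_changed` stands in for whatever judgment an LLM would make over the recent transcript.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    silence_seconds: float   # time since voice activity detection last heard speech
    moved_meters: float      # distance from where the conversation started
    topic_changed: bool      # stand-in for an LLM's judgment on the transcript


# Hypothetical thresholds, chosen only for illustration.
LONG_SILENCE_S = 300.0
SHORT_SILENCE_S = 30.0
MOVE_THRESHOLD_M = 100.0


def conversation_ended(s: Signals) -> bool:
    """Silence alone ends a conversation only when long; a short pause needs a second signal."""
    if s.silence_seconds >= LONG_SILENCE_S:
        return True
    if s.silence_seconds >= SHORT_SILENCE_S:
        return s.moved_meters >= MOVE_THRESHOLD_M or s.topic_changed
    return False


print(conversation_ended(Signals(45, 0, False)))    # short pause, nothing else changed
print(conversation_ended(Signals(45, 250, False)))  # short pause plus a change of location
```

Requiring a second signal for short pauses avoids splitting one conversation at every lull, which voice activity detection alone would do.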

The app provides a query interface for searching across all recorded conversations, with source citations linking back to specific moments in past recordings.

Most impressively, the system can perform autonomous actions across multiple apps. In a live demo, the voice command “find a good taco restaurant in Pacific Heights and send it to Maria” prompted the AI to open Google Maps, find a restaurant, and send it to Maria. Throughout the multi-step action it maintained working memory of the task and context, inferring from a recent conversation that Maria should be reached via WhatsApp.

Friend (Igor’s Project)

Igor presented Friend, another open-source wearable focused on achieving the lowest possible power consumption. The device uses what he described as “probably the best opportunity you can currently find on the market” in terms of low-power chips.

Key technical challenges included:

Limited onboard memory, which required innovative compression to store audio at 16,000 Hz (up from an initial 4,000 Hz that was “completely horrible”)
Streaming compressed audio continuously over Bluetooth to a phone app
Supporting both iOS and Android platforms
Effective range of approximately 2 feet for high-quality capture, with accuracy degrading to about 50% at 4 feet

The project is fully open source and was presented alongside Adam’s repository (2,600+ stars), which Igor credited as the foundational project that sparked the open-source wearable AI movement.

Memory Systems for LLM Applications (LangChain)

Harrison from LangChain presented their work on memory systems for LLM applications, using a journaling app as a testbed. The core insight is that memory is crucial for personal AI but remains a vaguely defined problem space.

Types of Memory

The presentation outlined several memory paradigms currently in use:

Conversational memory: The simplest form—just remembering previous messages in a conversation
Semantic memory: Storing memory fragments in a vector store and retrieving relevant ones based on semantic similarity
Knowledge graph memory: Constructing a graph of relationships over time, though this is often “overly complex” for many use cases

LangChain’s experimental memory system (working title “Lang Memory”) focuses on chat experiences and allows flexibility in defining memory schemas. The team found that different applications require different types of memory: a SQL assistant bot needs very different memory structures from a journaling app.

Memory Types Under Development

Thread-level memory: Summaries of conversations and extracted follow-up items
User profile: A JSON schema that updates over time with user characteristics
Append-only lists: For things like “restaurants I’ve mentioned” where you want to accumulate items rather than overwrite
Knowledge triplets: More structured semantic extraction
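The difference between a profile that is overwritten and a list that only grows is easiest to see in code. The sketch below is a plain-Python illustration of three of these memory types; it does not use or reflect LangChain's API, and the class, method, and field names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryStore:
    # User profile: one value per key, updated in place over time.
    profile: dict[str, str] = field(default_factory=dict)
    # Append-only lists: items accumulate and are never overwritten.
    lists: dict[str, list[str]] = field(default_factory=dict)
    # Knowledge triplets: (subject, predicate, object).
    triplets: list[tuple[str, str, str]] = field(default_factory=list)

    def update_profile(self, patch: dict[str, str]) -> None:
        """Newer values replace older ones for the same key."""
        self.profile.update(patch)

    def append_to_list(self, name: str, item: str) -> None:
        """Add an item to the named list, creating the list if needed."""
        self.lists.setdefault(name, []).append(item)

    def add_triplet(self, subject: str, predicate: str, obj: str) -> None:
        self.triplets.append((subject, predicate, obj))


store = MemoryStore()
store.update_profile({"city": "Amsterdam"})
store.update_profile({"city": "Berlin"})            # overwrites the earlier value
store.append_to_list("restaurants", "Taqueria A")
store.append_to_list("restaurants", "Noodle Bar B")  # accumulates
store.add_triplet("user", "works_at", "a bank")

print(store.profile)
print(store.lists["restaurants"])
```

In a real system the patches and list items would be extracted from conversations by an LLM; here they are hard-coded to keep the example self-contained.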

Retrieval Strategies

The presentation highlighted the generative agents paper from Stanford as particularly influential. That paper introduced the idea of fetching memories based not just on semantic similarity but also on recency and importance (with an LLM assigning importance scores). This addresses the reality that some memories should persist indefinitely while others naturally decay—you remember what you had for breakfast today but not ten days ago.
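A minimal sketch of this kind of retrieval combines the three signals into one score and returns the top-ranked memories. It follows the idea described above only loosely: the equal weighting, the hourly decay rate, the tiny hand-written embeddings, and all names are assumptions made for illustration, and importance is assumed to be already normalized to the range 0 to 1.

```python
import math
from dataclasses import dataclass


@dataclass
class Memory:
    text: str
    embedding: list[float]
    importance: float          # 0..1, e.g. an LLM-assigned score after normalization
    hours_since_access: float


DECAY_PER_HOUR = 0.995  # assumption: exponential decay applied per hour


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def score(memory: Memory, query_embedding: list[float]) -> float:
    """Equal-weight sum of recency, importance, and semantic relevance."""
    recency = DECAY_PER_HOUR ** memory.hours_since_access
    relevance = cosine(memory.embedding, query_embedding)
    return recency + memory.importance + relevance


def retrieve(memories: list[Memory], query_embedding: list[float], k: int = 3) -> list[Memory]:
    return sorted(memories, key=lambda m: score(m, query_embedding), reverse=True)[:k]


memories = [
    Memory("breakfast ten days ago", [1.0, 0.0], importance=0.1, hours_since_access=240),
    Memory("wedding dinner", [0.6, 0.8], importance=0.9, hours_since_access=2000),
    Memory("breakfast today", [1.0, 0.0], importance=0.1, hours_since_access=2),
]

for m in retrieve(memories, query_embedding=[1.0, 0.0]):
    print(m.text)
```

With these toy values, the old but important memory outranks the equally relevant but trivial one from ten days ago, which mirrors the breakfast example.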

An alternative approach mentioned was MemGPT, where the language model actively decides when to write to short-term versus long-term memory during conversations.

Challenges Discussed

A key unsolved problem is memory consolidation: in the demo, multiple very similar memory entries existed that should logically be collapsed into one. The RAPTOR paper was suggested as a potential solution. It uses hierarchical summarization (clustering chunks and recursively summarizing them), which could enable periodic background processes to consolidate memories while accounting for recency.
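To make the consolidation idea concrete, the sketch below runs a single pass that groups near-duplicate entries and keeps one representative per group. It is deliberately much simpler than RAPTOR: it swaps in greedy clustering over word-overlap (Jaccard) similarity instead of embedding-based clustering, and it keeps the longest entry where a real system would ask an LLM to write a summary. The function names and the threshold are hypothetical.

```python
def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity between two strings, from 0.0 to 1.0."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def consolidate(memories: list[str], threshold: float = 0.6) -> list[str]:
    """Greedily group similar entries, then keep one representative per group."""
    clusters: list[list[str]] = []
    for text in memories:
        for cluster in clusters:
            # Compare against the first entry of each existing cluster.
            if jaccard(text, cluster[0]) >= threshold:
                cluster.append(text)
                break
        else:
            clusters.append([text])
    # Stand-in for an LLM-written summary: keep the longest entry in each cluster.
    return [max(cluster, key=len) for cluster in clusters]


entries = [
    "user likes spicy tacos",
    "user likes spicy tacos a lot",
    "user works at a bank",
]
print(consolidate(entries))
```

A background job could run a pass like this periodically, and repeating it over the resulting summaries would give the recursive, hierarchical structure described above.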

Cross-Cutting Production Considerations

Several themes emerged across all presentations:

Privacy and Recording: One speaker mentioned recording all personal conversations continuously (legal in the Netherlands when you’re a participant), creating terabytes of personal data annually. The philosophy is to capture everything now even without immediate use cases, as the data will become valuable for future AI alignment.

Hardware Form Factor: Multiple speakers noted that minimizing device size and cognitive burden is crucial for adoption. One suggestion was simply using a cheap Android phone in a front pocket as a more socially acceptable alternative to dedicated hardware.

The “Make People Want It” Problem: Adam (creator of the largest open-source wearable project) noted that the biggest challenge isn’t technical—it’s making products people actually want to use. Developer experience matters enormously; projects with complex setup requirements (CUDA, cross-platform compatibility issues) saw disappointing uptake despite interesting functionality.

Vision Capture: All speakers agreed that continuous video capture remains the next major frontier and is significantly harder than audio due to power consumption and bandwidth constraints. One approach involves novel ideas about compression and periodic capture rather than continuous streaming.
