ZenML

Building an AI-Powered Email Writing Assistant with Personalized Style Matching

Ghostwriter 2024

Shortwave developed Ghostwriter, an AI writing feature that helps users compose emails that match their personal writing style. The system uses embedding-based semantic search to find relevant past emails, combines them with system prompts and custom instructions, and uses fine-tuned LLMs to generate contextually appropriate suggestions. The solution addresses two key challenges: making AI-generated text sound authentic to each user's style and incorporating accurate, relevant information from their email history.

Industry

Tech

Overview

Shortwave is a startup building an AI-first email client that integrates with Gmail, positioning itself as a smarter alternative to traditional email interfaces. Their feature, Ghostwriter, provides AI-powered email writing assistance that learns from individual users to produce drafts and autocomplete suggestions that authentically match their personal writing style. This case study, presented by one of the co-founders and CTO (Johnny), provides a detailed look at the architecture and production challenges of building a personalized AI writing system.

The company has been working on this product for approximately four years, with AI integration exploration beginning around two and a half years ago—notably before the widespread ChatGPT frenzy. This early start gave them experience with the fundamental challenges that led to the creation of many modern LLMOps frameworks, while also requiring them to adapt rapidly as the landscape evolved.

The Problem Space

The case study identifies two fundamental problems with using generic LLMs (like GPT-4) for email writing: the generated text sounds nothing like the individual user's voice, and the model has no access to the accurate, relevant information buried in that user's email history. Both must be solved for suggestions to feel authentic and useful.

Solution Architecture

The Ghostwriter system employs a Retrieval-Augmented Generation (RAG) architecture specifically designed for personalized email writing. The system consists of two main pipelines:

Indexing Pipeline (Offline Stage)

The indexing pipeline processes all incoming emails through an embedding model, storing the resulting vectors in a vector database. This creates a semantic index of the user’s entire email history, making it readily available for similarity searches when needed. A critical prerequisite for this pipeline is a robust email data cleaning system that extracts the actual text content from emails—separating it from nested reply threads, quoted text, and other formatting artifacts that make raw email data notoriously messy.
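The offline stage can be sketched in a few lines. This is a minimal illustration, not Shortwave's implementation: `embed` is a toy character-frequency stand-in for a real embedding model, and a plain list stands in for the vector database.

```python
from dataclasses import dataclass

@dataclass
class IndexedEmail:
    text: str
    vector: list[float]

def embed(text: str) -> list[float]:
    # Toy embedding (character-frequency histogram); a production system
    # would call an embedding model here instead.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def index_emails(cleaned_bodies: list[str]) -> list[IndexedEmail]:
    # Store each cleaned email body alongside its vector, making the
    # whole history available for later similarity search.
    return [IndexedEmail(text=body, vector=embed(body)) for body in cleaned_bodies]
```

The important property is that indexing happens ahead of time, so query-time lookups only need a similarity search over precomputed vectors.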

The speaker acknowledges that building this data cleaning pipeline was substantial work but proved invaluable across multiple features. Having clean text representations of emails enables not just the writing feature but also AI search, summarization, and other capabilities. However, they note there’s still ongoing work to improve data cleanup, as email remains “really really messy.”
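A rough sketch of the kind of heuristic such a cleaner starts from, assuming conventional English reply markers; real email threads demand far more handling (HTML parts, forwarded blocks, localized reply headers), which is why the speaker calls this substantial ongoing work:

```python
import re

# Start-of-quoted-thread marker, e.g. "On Mon, Jan 1 ... Alice wrote:"
REPLY_HEADER = re.compile(r"^On .+ wrote:\s*$")

def clean_email_body(raw: str) -> str:
    kept = []
    for line in raw.splitlines():
        if REPLY_HEADER.match(line):   # everything below is the quoted thread
            break
        if line.startswith(">"):       # inline quoted text
            continue
        if line.strip() == "--":       # conventional signature delimiter
            break
        kept.append(line)
    return "\n".join(kept).strip()
```

Only the text the user actually wrote survives, which is what the embedding and prompting stages need.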

Query Pipeline (Real-time Stage)

When a user requests an email draft or is actively writing in the editor, the current context is embedded, the vector database is searched for semantically similar past emails, and the closest matches are inserted into the prompt alongside the system prompt and any custom instructions before the LLM generates its suggestion.

The key insight is that providing the LLM with examples of the user’s own writing accomplishes both goals simultaneously: the AI can mimic the user’s writing style from the examples, and if those examples contain relevant factual information, the AI will appropriately incorporate that information into new drafts.
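The retrieval step can be illustrated with plain cosine similarity over the stored vectors; this is a stand-in for a production vector database, and the short vectors below are placeholders for real embeddings.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_vec: list[float], index: list[tuple[str, list[float]]], k: int = 3) -> list[str]:
    # Rank indexed emails by similarity to the draft-in-progress and keep
    # the top k as few-shot examples for the prompt.
    ranked = sorted(index, key=lambda entry: cosine(query_vec, entry[1]), reverse=True)
    return [text for text, _ in ranked[:k]]
```

The k most similar past emails become the few-shot examples that carry both the user's style and any relevant facts.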

Approaches Tried and Evaluated

The team experimented with multiple approaches before arriving at their current solution:

Psychological Style Descriptions

Their first version created a textual description of the user’s writing style by analyzing 10 representative emails with an LLM. This produced a “psychological profile” covering sentence structure, tone (professional vs. casual), emoji usage, technical terminology, and other dimensions. This description was added to the system prompt. The team reports this worked “fairly well” and was “surprisingly accurate.”
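A sketch of what that one-time profiling prompt might look like; the dimension list mirrors those named in the talk, but the exact wording is illustrative, not Shortwave's actual prompt.

```python
# Style dimensions the profile covers, per the case study.
STYLE_DIMENSIONS = [
    "sentence structure",
    "tone (professional vs. casual)",
    "emoji usage",
    "use of technical terminology",
]

def style_profile_prompt(sample_emails: list[str]) -> str:
    # Analyze a handful of representative sent emails once; the resulting
    # description is then prepended to the system prompt.
    samples = "\n---\n".join(sample_emails)
    dims = ", ".join(STYLE_DIMENSIONS)
    return (
        f"Describe this user's writing style along these dimensions: {dims}.\n"
        f"Sample emails:\n{samples}"
    )
```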

Per-User Fine-Tuning on Sent Emails

They experimented with fine-tuning models on individual users’ sent emails, which worked “really really well” at capturing unique writing styles. However, this approach proved “very challenging to do at scale across all users” at the time. The speaker notes that per-user fine-tuning may become more feasible in the future as costs decrease.

Few-Shot Learning with Examples

They discovered that simply including examples of the user’s past emails in the prompt allows the LLM to effectively mimic the style. This became the foundation of their production system.
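Assembling such a few-shot prompt is mostly string concatenation. The section labels below are illustrative; the structure (system prompt, custom instructions, retrieved examples, then the task) follows the architecture described above.

```python
def build_prompt(system: str, instructions: str, examples: list[str], request: str) -> str:
    # Inline the user's retrieved past emails as style/fact examples
    # ahead of the actual writing request.
    parts = [system]
    if instructions:
        parts.append(f"Custom instructions:\n{instructions}")
    for i, example in enumerate(examples, 1):
        parts.append(f"Example email {i} (written by the user):\n{example}")
    parts.append(f"Task:\n{request}")
    return "\n\n".join(parts)
```

Because the examples are the user's own words, no explicit style description is needed: the model imitates what it sees.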

Model Selection and Fine-Tuning

For the AI assistant’s draft generation, Shortwave uses “off-the-shelf GPT-4.” However, the autocomplete feature (which suggests completions as you type in the editor) required custom fine-tuning to achieve acceptable quality.

Autocomplete-Specific Challenges

The autocomplete use case, which must seamlessly continue the user's text from the cursor as they type, presents requirements that proved difficult to satisfy with prompt engineering alone.

Despite extensive prompt engineering attempts, they “could not get it to reliably output the right thing” with instructions alone. This necessitated fine-tuning.

Fine-Tuning Approach

A key finding was that fine-tuning is remarkably effective with small datasets: only 400-500 examples were needed for "massive" improvements. The training data was synthesized from real emails.

For fact lookup capabilities, they handcrafted approximately a dozen examples where specific facts (Wi-Fi passwords, office addresses, phone numbers) appeared in the example emails and should be incorporated into the output. Critically, they also included examples where information was missing to teach the model not to hallucinate facts.
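One plausible shape for such training records, in the chat-style JSONL format common for fine-tuning APIs; the field names and synthesis details are assumptions, not Shortwave's actual pipeline. The "negative" record, where the fact is absent, teaches the model to output nothing rather than hallucinate.

```python
import json

def make_record(context_emails: list[str], prefix: str, completion: str) -> str:
    # One JSONL line: retrieved context plus the user's partial text as
    # input, the desired continuation as the target.
    prompt = "\n---\n".join(context_emails) + f"\n\nUser is typing: {prefix}"
    return json.dumps({
        "messages": [
            {"role": "system", "content": "Complete the user's email."},
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": completion},
        ]
    })

positive = make_record(
    ["The office Wi-Fi password is hunter2."],
    "Sure, the Wi-Fi password is",
    " hunter2.",
)
negative = make_record(
    ["See you at the offsite next week!"],  # no password anywhere in context
    "Sure, the Wi-Fi password is",
    "",                                     # teach the model not to invent one
)
```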

Safety and User Experience Considerations

A guiding principle mentioned is "no actions without user confirmation." The team explicitly acknowledges that LLMs are vulnerable to hallucination and prompt injection attacks. Their mitigation strategy is to keep humans in the loop for all consequential actions: the AI proposes, but the user reviews and confirms before anything is sent or changed.

This represents a thoughtful production approach to deploying AI in sensitive contexts where incorrect actions could have real consequences.

Modular Architecture for Rapid Iteration

The speaker emphasizes building systems where components can be swapped out independently. For example, they don't rely on a specific embedding model—the infrastructure supports changing models, with the understanding that switching would require reprocessing embeddings and additional compute. This modularity allows them to adopt better models and techniques quickly as the landscape evolves.
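One common way to achieve that swappability is to hide the model behind a small interface, as in this sketch (the interface and class names are illustrative): switching providers means re-embedding the corpus, but no call-site changes.

```python
from typing import Protocol

class Embedder(Protocol):
    # Structural interface: anything with this method shape qualifies,
    # whether it wraps a local model or a hosted API.
    def embed(self, text: str) -> list[float]: ...

class ToyEmbedder:
    def embed(self, text: str) -> list[float]:
        return [float(len(text))]  # placeholder for a real model call

def reindex(corpus: list[str], embedder: Embedder) -> list[list[float]]:
    # Reprocessing the whole corpus is the cost of swapping models.
    return [embedder.embed(doc) for doc in corpus]
```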

Future Directions

The team outlined several areas for continued development.

Production Realities and Lessons Learned

Several candid observations emerged about building AI products in production.

The case study provides a realistic view of the iterative, experimental nature of LLMOps work, while demonstrating how thoughtful architecture decisions and targeted fine-tuning can solve problems that prompt engineering alone cannot address.
