Etsy explored using prompt engineering as an alternative to fine-tuning for AI-assisted employee onboarding, focusing on Travel & Entertainment policy questions and community forum support. They implemented a RAG-style approach using embeddings-based search to augment prompts with relevant Etsy-specific documents. The system achieved 86% accuracy on T&E policy questions and 72% on community forum queries, with various prompt engineering techniques like chain-of-thought reasoning and source citation helping to mitigate hallucinations and improve reliability.
Etsy Engineering conducted a comprehensive investigation into using large language models for AI-assisted employee onboarding through prompt engineering rather than traditional fine-tuning approaches. This case study presents a balanced exploration of both the capabilities and limitations of prompt-based LLM systems in production environments, specifically focusing on two use cases: internal Travel & Entertainment (T&E) policy questions and external community forum support.
The motivation behind this initiative was practical and cost-effective. Rather than investing in the expensive process of fine-tuning models with Etsy-specific datasets, the engineering team wanted to assess whether prompt engineering alone could deliver reliable, truthful answers to company-specific questions. This approach treats the LLM as a black box while leveraging prompt optimization to achieve task-specific performance.
Etsy implemented a Retrieval-Augmented Generation (RAG) style architecture that represents a solid LLMOps pattern for production question-answering systems. The core technical approach involved embedding Etsy-specific documents into the rich latent space of foundation models, with OpenAI's models and Google's Gemini family cited as the underlying LLMs.
The system architecture followed a standard RAG pipeline: documents were processed and indexed using embeddings, user queries were converted to embeddings through the embeddings API, similarity search identified relevant text sections, and these sections were incorporated into prompts to augment the LLM's context before generating responses. This approach effectively updates the underlying index to account for newly added documents without requiring model parameter updates.
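As a sketch, this pipeline can be illustrated with a toy in-memory index. The real system used an embeddings API; the hashed bag-of-words embedder and policy snippets below are offline stand-ins, not Etsy's implementation:

```python
import hashlib
import math

def embed(text: str, dim: int = 64) -> list[float]:
    # Stand-in for a real embeddings API: hash each word into a
    # fixed-size vector so the sketch runs without network access.
    vec = [0.0] * dim
    for word in text.lower().split():
        token = word.strip(".,;?!")
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

# Hypothetical policy snippets standing in for the document collection.
sections = [
    "Etsy pays corporate card balances directly; employees are not billed.",
    "Meals while traveling are reimbursed up to the daily limit.",
]
index = [(s, embed(s)) for s in sections]

def retrieve(query: str, k: int = 1) -> list[str]:
    # Similarity search: rank indexed sections against the query embedding.
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [s for s, _ in ranked[:k]]

def build_prompt(query: str) -> str:
    # Augment the LLM's context with the retrieved section(s).
    context = "\n".join(retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

The key design point carried over from the write-up is that new documents only require re-indexing, not any change to the model itself.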
The embedding-based search mechanism represents a core LLMOps pattern where production systems must efficiently retrieve relevant context from large document collections. Etsy’s implementation demonstrates how organizations can leverage existing foundation model capabilities while incorporating domain-specific knowledge through careful prompt construction and context management.
The first production deployment focused on answering T&E policy questions, chosen as a well-circumscribed domain with clear, unambiguous rules. This represents an ideal starting point for LLMOps deployments, as it provides clear ground truth for evaluation while addressing a genuine business need: most Etsy employees still had questions about nearly every trip despite the existing documentation.
The system was evaluated on a manually curated test set of 40 question-and-answer pairs, demonstrating proper LLMOps evaluation practices with human-in-the-loop assessment. The initial results showed approximately 86% accuracy, which, while impressive, left significant reliability concerns in the remaining 14% of cases. These errors weren't minor inaccuracies but potentially harmful misinformation, such as incorrectly stating that employees are responsible for corporate card balances when Etsy actually pays these directly.
This evaluation approach highlights crucial LLMOps considerations around assessment methodology. Rather than relying solely on automated metrics, Etsy employed human expert judgment to compare LLM-generated answers with extracted policy document answers. This human-centric evaluation is particularly important for high-stakes applications like policy compliance where incorrect information could have real consequences.
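A minimal version of such an evaluation loop might look like the following. The questions and human verdicts here are invented for illustration; Etsy's actual test set contained 40 curated pairs judged by human experts:

```python
# Hypothetical curated test set with human expert verdicts comparing the
# LLM's answer against the answer extracted from the policy document.
judgments = [
    {"question": "Who pays corporate card balances?",
     "llm_answer": "Etsy pays the balance directly.",
     "reference": "Etsy pays corporate card balances directly.",
     "correct": True},
    {"question": "Is airport lounge access reimbursable?",
     "llm_answer": "Yes, always.",
     "reference": "Lounge access is not reimbursable.",
     "correct": False},
]

def accuracy(results: list[dict]) -> float:
    # Fraction of answers the human reviewers judged correct.
    return sum(r["correct"] for r in results) / len(results)

def errors_for_review(results: list[dict]) -> list[str]:
    # Surface failures so a human can inspect potentially harmful answers.
    return [r["question"] for r in results if not r["correct"]]
```

Keeping the failing questions queryable, not just the aggregate score, is what lets a team spot that a "miss" is actually dangerous misinformation rather than a harmless vagueness.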
The case study provides valuable insights into practical hallucination mitigation techniques that are essential for production LLMOps systems. Etsy experimented with several prompt engineering strategies to address the 14% error rate, demonstrating both successes and limitations.
The first approach involved explicit uncertainty instructions, asking the LLM to say “I have no idea” when uncertain. While this eliminated false confident statements, it also led to missed opportunities where the correct information was actually available in the document collection. This trade-off between avoiding hallucinations and maintaining system utility is a classic LLMOps challenge.
More successful was the implementation of chain-of-thought reasoning prompts. By asking “why do you think so?” the system not only produced correct answers but also provided source citations (e.g., “This information is mentioned on page 42”). This approach effectively raised the factual checking bar and provided transparency that users could verify, representing a crucial pattern for production LLM systems requiring accountability.
The chain-of-thought technique demonstrates how simple prompt modifications can dramatically improve system reliability. The same question that initially produced a completely wrong answer, then an unhelpfully vague response, finally yielded the correct answer with source attribution through progressive prompt refinement. This iterative improvement process exemplifies practical LLMOps development cycles.
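The three stages of refinement described above can be sketched as prompt templates. Etsy's exact wording is not published, so the template text below is an assumption, not the production prompts:

```python
def make_prompt(question: str, context: str, style: str = "baseline") -> str:
    # Illustrative templates for the progressive refinement described
    # in the case study; the wording is invented.
    base = f"Context:\n{context}\n\nQuestion: {question}\n"
    if style == "baseline":
        return base + "Answer:"
    if style == "hedged":
        # Explicit uncertainty instruction: fewer confidently wrong
        # answers, at the cost of some missed correct ones.
        return base + 'If you are not certain, answer "I have no idea." Answer:'
    if style == "cot":
        # Chain-of-thought plus source citation: ask why, and where.
        return (base + "Answer, then explain why you think so and cite "
                "the page of the document that supports your answer. Answer:")
    raise ValueError(f"unknown style: {style}")
```

The point of the "cot" variant is that demanding a justification and a page reference gives users something checkable, which is where the observed reliability gain came from.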
The second deployment expanded to external community forum data, presenting significantly greater complexity than internal policy documents. This use case involved publicly available content where sellers ask questions about shop optimization, SEO strategies, and platform usage, with answers provided by both Etsy staff and community members.
The forum data presented multiple LLMOps challenges: much higher heterogeneity in style and scope, more opinionated and subjective content, and potential divergence between community answers and official Etsy policies. The system architecture remained identical to the T&E implementation, but the evaluation revealed the impact of data complexity on system performance.
Performance dropped to 72% accuracy on a 50-question test set, with 28% generating wrong or misleading answers. This performance degradation illustrates how data characteristics significantly impact LLMOps system reliability, even when the underlying technical architecture remains constant. The case study notes that LLMs performed better when query formulations closely matched reference document wording and worse when answers required satisfying multiple conditions.
The forum use case revealed important limitations in current prompt engineering approaches. Adding contextual information didn’t always improve performance, and the system sometimes punted with statements like “Without prior knowledge, it is impossible to determine…” even when information was available. These limitations suggest that while prompt engineering can be effective, it’s not a panacea for all LLMOps challenges.
Etsy’s experience reveals several crucial LLMOps considerations for production deployments. The study emphasizes that asking LLMs to disclose specific sources emerged as the most effective technique for flagging potential hallucinations. This source citation approach provides both transparency and a mechanism for users to verify information, essential features for production question-answering systems.
The research also highlighted that complex reasoning scenarios require carefully structured chain-of-thought prompting. Generic prompting strategies don’t universally apply across different types of queries or data complexity levels. This finding has significant implications for LLMOps teams who must develop domain-specific prompt optimization strategies rather than relying on one-size-fits-all approaches.
The case study demonstrates responsible LLMOps evaluation practices by acknowledging both successes and limitations. Rather than overselling the technology’s capabilities, Etsy provides a balanced assessment that other organizations can use to set realistic expectations for similar implementations. The authors explicitly note that “care should still be taken when assessing answer truthfulness” even with advanced reasoning models.
From an operational perspective, the system leverages embedding-based search as a core component, effectively implementing a form of query expansion to improve information retrieval. The prompt engineering approaches include instruction prompting, role prompting, and few-shot prompting, representing standard techniques in the LLMOps toolkit.
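These techniques compose naturally in a single prompt. A hedged sketch follows; the role text, instruction, and few-shot pair are invented for illustration rather than taken from Etsy's system:

```python
ROLE = "You are an assistant answering Etsy T&E policy questions."
INSTRUCTION = "Answer only from the provided context and cite your source."

# Invented few-shot example demonstrating the expected answer format.
FEW_SHOT = [
    ("Can I expense a hotel minibar?",
     "No. Minibar purchases are personal expenses (Policy, p. 12)."),
]

def compose_prompt(question: str, context: str) -> str:
    # Role prompting + instruction prompting + few-shot prompting,
    # combined into one prompt string.
    lines = [ROLE, INSTRUCTION, ""]
    for q, a in FEW_SHOT:
        lines += [f"Q: {q}", f"A: {a}", ""]
    lines += [f"Context: {context}", f"Q: {question}", "A:"]
    return "\n".join(lines)
```

In practice the few-shot pairs double as format constraints: the model imitates the citation style shown in the examples, reinforcing the source-disclosure behavior discussed earlier.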
The architectural choice to treat LLMs as black boxes while focusing on prompt optimization reflects a pragmatic LLMOps approach that balances capability with cost-effectiveness. This strategy avoids the substantial computational resources required for fine-tuning while still achieving task-specific performance through careful context management.
The system’s reliance on foundation models like OpenAI’s GPT family and Google’s Gemini demonstrates the typical production LLMOps pattern of building on top of established model providers rather than training custom models. This approach allows organizations to focus on application-specific challenges rather than model development.
Etsy’s evaluation methodology represents LLMOps best practices with manual curation of test sets and human expert assessment of answer quality. The comparison between LLM-generated answers and extracted reference answers provides a robust evaluation framework that goes beyond simple automated metrics.
The case study reveals the importance of continuous monitoring and evaluation in LLMOps deployments. The authors note that, all else being equal, performance varies with how closely a query's formulation matches the reference documents' wording, highlighting the need for ongoing system observation and optimization.
The research approach of testing multiple prompt variations for the same question demonstrates the iterative nature of prompt optimization in production systems. This experimental mindset, combined with systematic evaluation, represents mature LLMOps practices that other organizations can emulate.
This case study provides valuable insights into the practical realities of deploying LLM-based systems in production environments. The 86% and 72% accuracy rates, while impressive, illustrate that current LLM technology still requires careful implementation and monitoring to achieve production reliability standards.
The research demonstrates how prompt engineering can serve as a viable alternative to fine-tuning for many use cases, particularly when combined with proper evaluation frameworks and hallucination mitigation strategies. However, it also shows the limits of this approach, especially when dealing with complex, heterogeneous data or with tasks requiring multi-step reasoning.
Etsy’s approach represents a thoughtful, measured implementation of LLMOps practices that prioritizes reliability and transparency over aggressive claims about AI capabilities. This balanced perspective provides a valuable reference point for other organizations considering similar implementations, emphasizing both the potential benefits and the continued need for human oversight in AI-assisted systems.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
This case study presents Ragas' comprehensive approach to improving AI applications through systematic evaluation practices, drawn from their experience working with various enterprises and early-stage startups. The problem addressed is the common challenge of AI engineers making improvements to LLM applications without clear measurement frameworks, leading to ineffective iteration cycles and poor user experiences. The solution involves a structured evaluation methodology encompassing dataset curation, human annotation, LLM-as-judge scaling, error analysis, experimentation, and continuous feedback loops. The results demonstrate that teams can move from subjective "vibe checks" to objective, data-driven improvements that systematically enhance AI application performance and user satisfaction.
This case study explores how Prosus builds and deploys AI agents across e-commerce and food delivery businesses serving two billion customers globally. The discussion covers critical lessons learned from deploying conversational agents in production, with a particular focus on context engineering as the most important factor for success—more so than model selection or prompt engineering alone. The team found that successful production deployments require hybrid approaches combining semantic and keyword search, generative UI experiences that mix chat with dynamic visual components, and sophisticated evaluation frameworks. They emphasize that technology has advanced faster than user adoption, leading to failures when pure chatbot interfaces were tested, and success only came through careful UI/UX design, contextual interventions, and extensive testing with both synthetic and real user data.