Etsy explored using prompt engineering as an alternative to fine-tuning for AI-assisted employee onboarding, focusing on Travel & Entertainment policy questions and community forum support. They implemented a RAG-style approach using embeddings-based search to augment prompts with relevant Etsy-specific documents. The system achieved 86% accuracy on T&E policy questions and 72% on community forum queries, with various prompt engineering techniques like chain-of-thought reasoning and source citation helping to mitigate hallucinations and improve reliability.
Etsy Engineering conducted a comprehensive investigation into using large language models for AI-assisted employee onboarding through prompt engineering rather than traditional fine-tuning approaches. This case study presents a balanced exploration of both the capabilities and limitations of prompt-based LLM systems in production environments, specifically focusing on two use cases: internal Travel & Entertainment (T&E) policy questions and external community forum support.
The motivation behind this initiative was practical and cost-effective. Rather than investing in the expensive process of fine-tuning models with Etsy-specific datasets, the engineering team wanted to assess whether prompt engineering alone could deliver reliable, truthful answers to company-specific questions. This approach treats the LLM as a black box while leveraging prompt optimization to achieve task-specific performance.
Etsy implemented a Retrieval-Augmented Generation (RAG) style architecture that represents a solid LLMOps pattern for production question-answering systems. The core technical approach involved embedding Etsy-specific documents into the rich latent space of foundation models, with OpenAI's models and Google's Gemini family cited as the underlying LLMs.
The system architecture followed a standard RAG pipeline: documents were processed and indexed using embeddings, user queries were converted to embeddings through the embeddings API, similarity search identified relevant text sections, and these sections were incorporated into prompts to augment the LLM's context before generating responses. This approach effectively updates the underlying index to account for newly added documents without requiring model parameter updates.
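As a sketch, this pipeline can be illustrated with a toy in-memory index. The real system used an embeddings API; the hashed bag-of-words embedder and policy snippets below are offline stand-ins, not Etsy's implementation:

```python
import hashlib
import math

def embed(text: str, dim: int = 64) -> list[float]:
    # Stand-in for a real embeddings API: hash each word into a
    # fixed-size vector so the sketch runs without network access.
    vec = [0.0] * dim
    for word in text.lower().split():
        token = word.strip(".,;?!")
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

# Hypothetical policy snippets standing in for the document collection.
sections = [
    "Etsy pays corporate card balances directly; employees are not billed.",
    "Meals while traveling are reimbursed up to the daily limit.",
]
index = [(s, embed(s)) for s in sections]

def retrieve(query: str, k: int = 1) -> list[str]:
    # Similarity search: rank indexed sections against the query embedding.
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [s for s, _ in ranked[:k]]

def build_prompt(query: str) -> str:
    # Augment the LLM's context with the retrieved section(s).
    context = "\n".join(retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

The key design point carried over from the write-up is that new documents only require re-indexing, not any change to the model itself.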
The embedding-based search mechanism represents a core LLMOps pattern where production systems must efficiently retrieve relevant context from large document collections. Etsy’s implementation demonstrates how organizations can leverage existing foundation model capabilities while incorporating domain-specific knowledge through careful prompt construction and context management.
The first production deployment focused on answering T&E policy questions, chosen as a well-circumscribed domain with clear, unambiguous rules. This represents an ideal starting point for LLMOps deployments, as it provides clear ground truth for evaluation while addressing a genuine business need: most Etsy employees still had questions about nearly every trip despite the existing documentation.
The system was evaluated on a manually curated test set of 40 question-and-answer pairs, demonstrating proper LLMOps evaluation practices with human-in-the-loop assessment. The initial results showed approximately 86% accuracy, which, while impressive, left significant reliability concerns in the remaining 14% of cases. These errors weren't minor inaccuracies but potentially harmful misinformation, such as incorrectly stating that employees are responsible for corporate card balances when Etsy actually pays these directly.
This evaluation approach highlights crucial LLMOps considerations around assessment methodology. Rather than relying solely on automated metrics, Etsy employed human expert judgment to compare LLM-generated answers with extracted policy document answers. This human-centric evaluation is particularly important for high-stakes applications like policy compliance where incorrect information could have real consequences.
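A minimal version of such an evaluation loop might look like the following. The questions and human verdicts here are invented for illustration; Etsy's actual test set contained 40 curated pairs judged by human experts:

```python
# Hypothetical curated test set with human expert verdicts comparing the
# LLM's answer against the answer extracted from the policy document.
judgments = [
    {"question": "Who pays corporate card balances?",
     "llm_answer": "Etsy pays the balance directly.",
     "reference": "Etsy pays corporate card balances directly.",
     "correct": True},
    {"question": "Is airport lounge access reimbursable?",
     "llm_answer": "Yes, always.",
     "reference": "Lounge access is not reimbursable.",
     "correct": False},
]

def accuracy(results: list[dict]) -> float:
    # Fraction of answers the human reviewers judged correct.
    return sum(r["correct"] for r in results) / len(results)

def errors_for_review(results: list[dict]) -> list[str]:
    # Surface failures so a human can inspect potentially harmful answers.
    return [r["question"] for r in results if not r["correct"]]
```

Keeping the failing questions queryable, not just the aggregate score, is what lets a team spot that a "miss" is actually dangerous misinformation rather than a harmless vagueness.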
The case study provides valuable insights into practical hallucination mitigation techniques that are essential for production LLMOps systems. Etsy experimented with several prompt engineering strategies to address the 14% error rate, demonstrating both successes and limitations.
The first approach involved explicit uncertainty instructions, asking the LLM to say “I have no idea” when uncertain. While this eliminated false confident statements, it also led to missed opportunities where the correct information was actually available in the document collection. This trade-off between avoiding hallucinations and maintaining system utility is a classic LLMOps challenge.
More successful was the implementation of chain-of-thought reasoning prompts. By asking “why do you think so?” the system not only produced correct answers but also provided source citations (e.g., “This information is mentioned on page 42”). This approach effectively raised the factual checking bar and provided transparency that users could verify, representing a crucial pattern for production LLM systems requiring accountability.
The chain-of-thought technique demonstrates how simple prompt modifications can dramatically improve system reliability. The same question that initially produced a completely wrong answer, then an unhelpfully vague response, finally yielded the correct answer with source attribution through progressive prompt refinement. This iterative improvement process exemplifies practical LLMOps development cycles.
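The three stages of refinement described above can be sketched as prompt templates. Etsy's exact wording is not published, so the template text below is an assumption, not the production prompts:

```python
def make_prompt(question: str, context: str, style: str = "baseline") -> str:
    # Illustrative templates for the progressive refinement described
    # in the case study; the wording is invented.
    base = f"Context:\n{context}\n\nQuestion: {question}\n"
    if style == "baseline":
        return base + "Answer:"
    if style == "hedged":
        # Explicit uncertainty instruction: fewer confidently wrong
        # answers, at the cost of some missed correct ones.
        return base + 'If you are not certain, answer "I have no idea." Answer:'
    if style == "cot":
        # Chain-of-thought plus source citation: ask why, and where.
        return (base + "Answer, then explain why you think so and cite "
                "the page of the document that supports your answer. Answer:")
    raise ValueError(f"unknown style: {style}")
```

The point of the "cot" variant is that demanding a justification and a page reference gives users something checkable, which is where the observed reliability gain came from.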
The second deployment expanded to external community forum data, presenting significantly greater complexity than internal policy documents. This use case involved publicly available content where sellers ask questions about shop optimization, SEO strategies, and platform usage, with answers provided by both Etsy staff and community members.
The forum data presented multiple LLMOps challenges: much higher heterogeneity in style and scope, more opinionated and subjective content, and potential divergence between community answers and official Etsy policies. The system architecture remained identical to the T&E implementation, but the evaluation revealed the impact of data complexity on system performance.
Performance dropped to 72% accuracy on a 50-question test set, with 28% generating wrong or misleading answers. This performance degradation illustrates how data characteristics significantly impact LLMOps system reliability, even when the underlying technical architecture remains constant. The case study notes that LLMs performed better when query formulations closely matched reference document wording and worse when answers required satisfying multiple conditions.
The forum use case revealed important limitations in current prompt engineering approaches. Adding contextual information didn’t always improve performance, and the system sometimes punted with statements like “Without prior knowledge, it is impossible to determine…” even when information was available. These limitations suggest that while prompt engineering can be effective, it’s not a panacea for all LLMOps challenges.
Etsy’s experience reveals several crucial LLMOps considerations for production deployments. The study emphasizes that asking LLMs to disclose specific sources emerged as the most effective technique for flagging potential hallucinations. This source citation approach provides both transparency and a mechanism for users to verify information, essential features for production question-answering systems.
The research also highlighted that complex reasoning scenarios require carefully structured chain-of-thought prompting. Generic prompting strategies don’t universally apply across different types of queries or data complexity levels. This finding has significant implications for LLMOps teams who must develop domain-specific prompt optimization strategies rather than relying on one-size-fits-all approaches.
The case study demonstrates responsible LLMOps evaluation practices by acknowledging both successes and limitations. Rather than overselling the technology’s capabilities, Etsy provides a balanced assessment that other organizations can use to set realistic expectations for similar implementations. The authors explicitly note that “care should still be taken when assessing answer truthfulness” even with advanced reasoning models.
From an operational perspective, the system leverages embedding-based search as a core component, effectively implementing a form of query expansion to improve information retrieval. The prompt engineering approaches include instruction prompting, role prompting, and few-shot prompting, representing standard techniques in the LLMOps toolkit.
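These techniques compose naturally in a single prompt. A hedged sketch follows; the role text, instruction, and few-shot pair are invented for illustration rather than taken from Etsy's system:

```python
ROLE = "You are an assistant answering Etsy T&E policy questions."
INSTRUCTION = "Answer only from the provided context and cite your source."

# Invented few-shot example demonstrating the expected answer format.
FEW_SHOT = [
    ("Can I expense a hotel minibar?",
     "No. Minibar purchases are personal expenses (Policy, p. 12)."),
]

def compose_prompt(question: str, context: str) -> str:
    # Role prompting + instruction prompting + few-shot prompting,
    # combined into one prompt string.
    lines = [ROLE, INSTRUCTION, ""]
    for q, a in FEW_SHOT:
        lines += [f"Q: {q}", f"A: {a}", ""]
    lines += [f"Context: {context}", f"Q: {question}", "A:"]
    return "\n".join(lines)
```

In practice the few-shot pairs double as format constraints: the model imitates the citation style shown in the examples, reinforcing the source-disclosure behavior discussed earlier.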
The architectural choice to treat LLMs as black boxes while focusing on prompt optimization reflects a pragmatic LLMOps approach that balances capability with cost-effectiveness. This strategy avoids the substantial computational resources required for fine-tuning while still achieving task-specific performance through careful context management.
The system’s reliance on foundation models like OpenAI’s GPT family and Google’s Gemini demonstrates the typical production LLMOps pattern of building on top of established model providers rather than training custom models. This approach allows organizations to focus on application-specific challenges rather than model development.
Etsy’s evaluation methodology represents LLMOps best practices with manual curation of test sets and human expert assessment of answer quality. The comparison between LLM-generated answers and extracted reference answers provides a robust evaluation framework that goes beyond simple automated metrics.
The case study reveals the importance of continuous monitoring and evaluation in LLMOps deployments. The authors note that, all else being equal, performance varies with how closely a query's formulation matches the reference documents' wording, highlighting the need for ongoing system observation and optimization.
The research approach of testing multiple prompt variations for the same question demonstrates the iterative nature of prompt optimization in production systems. This experimental mindset, combined with systematic evaluation, represents mature LLMOps practices that other organizations can emulate.
This case study provides valuable insights into the practical realities of deploying LLM-based systems in production environments. The 86% and 72% accuracy rates, while impressive, illustrate that current LLM technology still requires careful implementation and monitoring to achieve production reliability standards.
The research demonstrates how prompt engineering can serve as a viable alternative to fine-tuning for many use cases, particularly when combined with proper evaluation frameworks and hallucination mitigation strategies. However, it also shows the limits of this approach, especially when dealing with complex, heterogeneous data or with tasks requiring multi-step reasoning.
Etsy’s approach represents a thoughtful, measured implementation of LLMOps practices that prioritizes reliability and transparency over aggressive claims about AI capabilities. This balanced perspective provides a valuable reference point for other organizations considering similar implementations, emphasizing both the potential benefits and the continued need for human oversight in AI-assisted systems.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
This case study presents Ragas' comprehensive approach to improving AI applications through systematic evaluation practices, drawn from their experience working with various enterprises and early-stage startups. The problem addressed is the common challenge of AI engineers making improvements to LLM applications without clear measurement frameworks, leading to ineffective iteration cycles and poor user experiences. The solution involves a structured evaluation methodology encompassing dataset curation, human annotation, LLM-as-judge scaling, error analysis, experimentation, and continuous feedback loops. The results demonstrate that teams can move from subjective "vibe checks" to objective, data-driven improvements that systematically enhance AI application performance and user satisfaction.
This case study explores how Prosus builds and deploys AI agents across e-commerce and food delivery businesses serving two billion customers globally. The discussion covers critical lessons learned from deploying conversational agents in production, with a particular focus on context engineering as the most important factor for success—more so than model selection or prompt engineering alone. The team found that successful production deployments require hybrid approaches combining semantic and keyword search, generative UI experiences that mix chat with dynamic visual components, and sophisticated evaluation frameworks. They emphasize that technology has advanced faster than user adoption, leading to failures when pure chatbot interfaces were tested, and success only came through careful UI/UX design, contextual interventions, and extensive testing with both synthetic and real user data.