ZenML

LLMOps Lessons from W&B's Wandbot: Manual Evaluation & Quality Assurance of Production LLM Systems

Weights & Biases 2023
The case study details Weights & Biases' manual evaluation of their production LLM system Wandbot, which measured a baseline accuracy of 66.67%. The study offers insights into LLMOps practices, demonstrating the importance of systematic evaluation, clear metrics, and expert annotation in production LLM systems. It highlights key challenges in language handling, retrieval accuracy, and hallucination prevention, and shows how tools like Argilla.io can be used for annotation management. The findings emphasize the need for continuous improvement cycles and the critical role of high-quality documentation in LLM system performance, providing a practical template for other organizations deploying LLMs in production.

Industry

Tech

Overview

Weights & Biases, a company known for providing machine learning experiment tracking and MLOps tools, developed an internal LLM-powered documentation assistant called Wandbot. This case study focuses on their approach to evaluating this LLM system, specifically highlighting manual evaluation methodologies. The work represents a practical example of how organizations building LLM-powered applications approach the critical challenge of evaluation in production systems.

Context and Background

Weights & Biases operates in the MLOps and AI tooling space, providing infrastructure for machine learning practitioners to track experiments, manage datasets, and deploy models. The development of Wandbot appears to be an internal initiative to leverage LLM technology to improve their documentation experience and provide users with an intelligent assistant capable of answering questions about their platform and tools.

Documentation assistants powered by LLMs have become a common use case in the tech industry, as they can significantly reduce the burden on support teams while providing users with immediate, contextual answers to their questions. These systems typically rely on Retrieval-Augmented Generation (RAG) architectures, where the LLM is grounded in the company’s actual documentation to provide accurate and relevant responses.

The Evaluation Challenge

One of the most significant challenges in deploying LLM-powered systems in production is evaluation. Unlike traditional software where outputs are deterministic and can be tested with standard unit and integration tests, LLM outputs are probabilistic and can vary in subtle ways that are difficult to assess automatically. This makes evaluation a critical component of the LLMOps lifecycle.

The title of the source material suggests this is “Part 2” of a series on LLM evaluation, indicating that Weights & Biases has developed a comprehensive, multi-part approach to assessing their Wandbot system. The focus on “manual evaluation” suggests they recognize that automated metrics alone are insufficient for understanding LLM performance in real-world scenarios.

Manual Evaluation in LLMOps

Manual evaluation serves several critical purposes in the LLMOps workflow:

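For a documentation assistant like Wandbot, evaluators would typically assess factors such as whether the answer is accurate, whether it is grounded in the retrieved documentation rather than hallucinated, and whether it handles the language of the query correctly.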
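As a minimal sketch of how manual judgments can be turned into a headline accuracy figure, the snippet below aggregates annotator labels into a single score. The `Annotation` schema, the two-value label set, and the sample questions are assumptions made for illustration; they are not taken from the Wandbot study or from Argilla's API.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Annotation:
    """One manually reviewed question/answer pair (hypothetical schema)."""
    question: str
    label: str      # assumed label set: "correct" or "incorrect"
    note: str = ""  # optional free-text reason recorded by the annotator


def accuracy(annotations: list[Annotation]) -> float:
    """Return the fraction of annotated responses labeled "correct"."""
    if not annotations:
        raise ValueError("no annotations to score")
    counts = Counter(a.label for a in annotations)
    return counts["correct"] / len(annotations)


# Illustrative data only, not results from the case study.
sample = [
    Annotation("How do I log a metric?", "correct"),
    Annotation("How do I resume a run?", "correct"),
    Annotation("How do I export a table?", "incorrect", "hallucinated API"),
]
print(f"accuracy = {accuracy(sample):.2%}")
```

With two of three illustrative responses labeled correct, the score is two-thirds, which is the same proportion as the reported 66.67% baseline. Keeping the annotator's note alongside the label makes it possible to group failures by cause later.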

RAG System Considerations

Documentation assistants like Wandbot typically employ RAG architectures, which introduce additional evaluation dimensions. In a RAG system, the evaluation must consider both the retrieval component (are the right documents being retrieved?) and the generation component (is the LLM synthesizing the retrieved information correctly?).

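This dual nature of RAG systems means that evaluation frameworks must be able to assess retrieval and generation separately, so that an incorrect answer can be traced to the component that caused it.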
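One way to picture this separation is to record two manual verdicts per query and attribute each failure to a component. The sketch below assumes a hypothetical `RagJudgment` record with one boolean per component; the category names are illustrative and are not drawn from the Wandbot evaluation.

```python
from dataclasses import dataclass


@dataclass
class RagJudgment:
    """Manual verdicts for one query (hypothetical fields)."""
    retrieved_relevant: bool  # did retrieval surface a document with the answer?
    answer_correct: bool      # was the generated answer judged correct?


def attribute(j: RagJudgment) -> str:
    """Map the two verdicts to a single outcome category."""
    if j.answer_correct:
        # A correct answer without supporting context may come from the
        # model's pretrained knowledge and deserves a closer look.
        return "ok" if j.retrieved_relevant else "correct_without_support"
    # A wrong answer despite good context points at generation;
    # a wrong answer with poor context points at retrieval.
    return "generation_error" if j.retrieved_relevant else "retrieval_error"


def summarize(judgments: list[RagJudgment]) -> dict[str, int]:
    """Count how many queries fall into each outcome category."""
    totals: dict[str, int] = {}
    for j in judgments:
        key = attribute(j)
        totals[key] = totals.get(key, 0) + 1
    return totals


# Illustrative verdicts only.
judgments = [
    RagJudgment(retrieved_relevant=True, answer_correct=True),
    RagJudgment(retrieved_relevant=True, answer_correct=False),
    RagJudgment(retrieved_relevant=False, answer_correct=False),
]
print(summarize(judgments))
```

Splitting the counts this way indicates where to invest effort: a high share of retrieval errors suggests work on indexing or the documentation itself, while generation errors suggest work on prompts or the model.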

LLMOps Best Practices Demonstrated

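While the source text provides limited technical detail, the existence of this evaluation framework demonstrates several LLMOps best practices that Weights & Biases appears to be following: systematic evaluation, clearly defined metrics, expert annotation, and continuous improvement cycles.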

Limitations and Considerations

The source material for this case study is extremely limited, consisting only of a page title and URL. The full evaluation methodology, the specific metrics used, the results obtained, and the lessons learned are not available in the provided text. This summary is therefore an inference based on the title and on general knowledge of Weights & Biases’ work in the MLOps space.

Organizations considering similar evaluation approaches should be aware that manual evaluation, while valuable, has its own limitations:

Broader Implications for LLMOps

This case study, despite its limited detail, highlights the importance of evaluation as a core component of LLMOps practices. As organizations increasingly deploy LLM-powered applications in production, the need for robust evaluation frameworks becomes critical. The combination of manual and automated evaluation approaches appears to be emerging as a best practice in the industry.
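A simple way to combine the two approaches is to check how often an automated metric agrees with human labels on the same responses before trusting it at scale. The sketch below assumes both sources have been reduced to pass/fail verdicts; the function name and the sample data are hypothetical.

```python
def agreement_rate(human: list[bool], automated: list[bool]) -> float:
    """Return the fraction of responses where both verdicts match."""
    if len(human) != len(automated):
        raise ValueError("verdict lists must have the same length")
    if not human:
        raise ValueError("no verdicts to compare")
    matches = sum(h == a for h, a in zip(human, automated))
    return matches / len(human)


# Illustrative verdicts only: True means the response was judged correct.
human_labels = [True, False, True, True]
automated_labels = [True, True, True, False]
print(f"agreement = {agreement_rate(human_labels, automated_labels):.0%}")
```

A low agreement rate suggests the automated metric is not yet a reliable stand-in for manual review on this kind of data.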

Weights & Biases’ work on Wandbot evaluation also demonstrates the value of “eating your own dog food” – using their own MLOps tools to build and evaluate AI systems. This provides them with firsthand experience of the challenges their customers face and helps inform the development of their platform.

The focus on documentation assistants as a use case is particularly relevant, as this represents one of the most common enterprise applications of LLM technology. The evaluation challenges and solutions developed for Wandbot are likely applicable to similar systems across many industries and organizations.
