ZenML

Scaling and Optimizing Self-Hosted LLMs for Developer Documentation

Various 2023

A tech company needed to make its developer documentation easier to navigate and understand. They implemented a self-hosted LLM solution using retrieval-augmented generation (RAG), with guardrails for content safety. The team optimized performance using vLLM for faster inference and Ray Serve for horizontal scaling, achieving significant improvements in latency and throughput while maintaining cost efficiency. The solution helped developers better understand and adopt the company's products while keeping proprietary information secure.

Industry

Tech

Overview

This case study covers a mini-summit featuring three speakers discussing different aspects of LLM production optimization. The presentations complement each other well: Matt from Fuzzy Labs covers infrastructure scaling and benchmarking, Vaibhav from Boundary presents BAML for structured output optimization, and Tom from SAS discusses combining traditional NLP with LLMs for document analysis. Together, they provide a comprehensive view of LLMOps challenges and solutions across different layers of the stack.

Fuzzy Labs: Self-Hosted RAG System for Developer Documentation

Business Context and Problem

Fuzzy Labs, a UK-based MLOps consultancy, worked with an unnamed tech company (described as a hardware and software platform provider) that faced a significant challenge: their technical documentation was difficult to navigate, and developers struggled to understand their products. The business goal was to improve developer experience and grow their developer community.

The customer had specific requirements that shaped the technical approach:

Architecture Overview

The solution implemented a retrieval-augmented generation (RAG) architecture with several key components:

The guardrails component is noteworthy as it represents a second model in the system that also requires scaling considerations, demonstrating that production LLM systems often involve multiple models with different resource requirements.

Benchmarking Philosophy and Methodology

The speaker emphasized a critical principle often attributed to Donald Knuth: “Premature optimization is the root of all evil.” Before attempting any optimization, the team established a rigorous benchmarking methodology.

They used Locust, a Python-based load testing tool, to simulate different traffic scenarios:

The key metrics tracked were:

Importantly, the team maintained scientific rigor by recording test conditions, software versions, git commits, environment details, and dataset versions for reproducibility.
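
A minimal sketch (not Fuzzy Labs' actual tooling) of capturing the test conditions listed above — git commit, environment details, and a dataset version label — so each benchmark run can be reproduced later. The dataset label shown is a placeholder.

```python
# Record run metadata alongside each benchmark result for reproducibility.
import json
import platform
import subprocess
import sys
from datetime import datetime, timezone

def run_metadata(dataset_version: str) -> dict:
    try:
        commit = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        commit = "unknown"  # not inside a git repo
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "git_commit": commit,
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "dataset_version": dataset_version,
    }

# Persist next to the load-test output so every run is traceable.
print(json.dumps(run_metadata("docs-snapshot-v1"), indent=2))
```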

Initial Performance and Bottlenecks

Initial testing with standard Hugging Face pipelines plus FastAPI revealed severe limitations:

Vertical Scaling: vLLM for Inference Optimization

The team adopted vLLM, which targets the key bottleneck in transformer inference: GPU memory, specifically the key-value (KV) cache that stores attention state for previously processed tokens. vLLM implements PagedAttention, which the speaker described as analogous to virtual memory for LLMs, while noting the metaphor is somewhat imprecise.

The claimed benefits of vLLM are significant:

In practice, Fuzzy Labs observed:
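
For context, vLLM exposes an OpenAI-compatible HTTP server, so swapping it in behind an existing API is mostly a deployment change. The model name and flags below are illustrative, not the ones used in this project.

```shell
# Launch vLLM's OpenAI-compatible server (model is a placeholder)
python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mistral-7B-Instruct-v0.2 \
    --gpu-memory-utilization 0.90

# Query it like any OpenAI-style completions endpoint
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "mistralai/Mistral-7B-Instruct-v0.2",
         "prompt": "How do I flash firmware to the dev board?",
         "max_tokens": 128}'
```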

Horizontal Scaling: Ray Serve for Distributed Inference

For handling concurrent users beyond what a single server can manage, the team implemented Ray Serve, a framework developed by Anyscale and used by OpenAI, Uber, and LinkedIn.

Key capabilities of Ray Serve in this deployment:

The integration between Ray Serve and vLLM was described as “pretty recent” at the time of the project and not without challenges, though the speaker expected improvements over time.

Key Takeaways from Fuzzy Labs

The speaker emphasized that this represents a single data point in an industry still figuring out best practices for LLM deployment. The main lessons:

Boundary: BAML for Structured Output Optimization

The Problem with Structured Outputs

Vaibhav from Boundary presented BAML (Boundary AI Markup Language), an open-source DSL for improving structured output reliability from LLMs. The core insight is that LLMs struggle with strict output formats like JSON, which requires quotes, proper comma placement, and no comments. Smaller, cheaper models particularly struggle with format compliance, leading to parsing failures even when the underlying data is correct.

BAML’s Technical Approach

BAML takes a fundamentally different approach from both traditional prompt engineering and OpenAI's structured outputs:
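
To make the core idea concrete, here is a toy pure-Python illustration (emphatically not BAML's implementation) of tolerant, schema-aware parsing: accept near-JSON output with comments and trailing commas, then coerce it into a typed structure instead of failing on strict-format violations.

```python
# Toy "lenient parse into a schema" sketch. Real schema-aligned parsers
# handle far more (unquoted keys, markdown fences, partial output).
import json
import re
from dataclasses import dataclass

@dataclass
class Invoice:
    vendor: str
    total: float

def lenient_parse(raw: str) -> Invoice:
    # Strip // comments, which strict JSON forbids but LLMs often emit.
    cleaned = re.sub(r"//[^\n]*", "", raw)
    # Strip trailing commas before a closing brace/bracket.
    cleaned = re.sub(r",\s*([}\]])", r"\1", cleaned)
    data = json.loads(cleaned)
    # Coerce fields toward the schema's types rather than rejecting them.
    return Invoice(vendor=str(data["vendor"]), total=float(data["total"]))

raw_output = """{
  "vendor": "Acme Ltd",  // model added a comment
  "total": "42.50",
}"""
print(lenient_parse(raw_output))  # → Invoice(vendor='Acme Ltd', total=42.5)
```

A strict `json.loads` on `raw_output` would raise immediately, which is exactly the failure mode described above for smaller models.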

Developer Experience Features

BAML provides a VS Code extension with hot-reload capabilities that show:

The speaker emphasized that developers should be able to see the full request without abstractions, similar to how web developers wouldn’t ship CSS changes without seeing them rendered.

Benchmark Results

Boundary ran benchmarks against function calling datasets and found:

Advanced Capabilities

BAML supports chain-of-thought reasoning through prompt templates. The speaker demonstrated a surprisingly simple prompt pattern that enables the model to outline key details before producing structured output, improving reasoning quality while maintaining output reliability.
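
The exact template was not reproduced in this summary, but the pattern described — reason in prose first, emit the structured answer last, parse only the final JSON — can be sketched as follows. The prompt wording and example response are fabricated for illustration.

```python
# Assumed shape of the chain-of-thought-then-structure pattern.
import json
import re

PROMPT_TEMPLATE = """\
Extract the order details from the text below.

First, outline the key details in plain English.
Then output the answer as JSON on the final lines.

Text: {text}
"""

def extract_final_json(response: str) -> dict:
    # Ignore the prose reasoning; keep only the trailing {...} block.
    matches = re.findall(r"\{.*\}", response, flags=re.DOTALL)
    return json.loads(matches[-1])

prompt = PROMPT_TEMPLATE.format(text="Two widgets at 5.00 each")
# Example model response (fabricated):
response = """The customer ordered two widgets at 5.00 each, so the total is 10.00.

{"item": "widget", "quantity": 2, "total": 10.0}"""
print(extract_final_json(response))
```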

One customer example cited: parsing 18+ page bank statements without a single penny of error, demonstrating production-grade accuracy.

SAS: Combining Traditional NLP with LLMs

Text Analytics as a Pre-Processing Layer

Tom from SAS presented an approach that combines traditional NLP techniques (text analytics, information extraction) with LLMs to improve accuracy and reduce hallucinations. The key insight is that filtering and structuring data before sending to an LLM helps the model focus on relevant information.
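
A toy sketch of the filter-before-LLM idea (not SAS's actual pipeline): cheap rule-based matching narrows raw text down to only the statements relevant to a topic, so the LLM summarizes a focused subset instead of everything. The topic lexicon is an assumption for illustration.

```python
# Rule-based pre-filtering before any LLM call.
import re

TOPIC_TERMS = {"reimbursement", "billing", "payment"}  # assumed lexicon

def relevant_statements(comments: list[str], terms: set[str]) -> list[str]:
    """Split comments into sentences; keep those mentioning a topic term."""
    hits = []
    for comment in comments:
        for sentence in re.split(r"(?<=[.!?])\s+", comment):
            words = {w.lower() for w in re.findall(r"[a-zA-Z]+", sentence)}
            if words & terms:
                hits.append(sentence.strip())
    return hits

comments = [
    "The billing rules are unclear. My dog likes the park.",
    "Reimbursement timelines should be shortened to 30 days.",
]
filtered = relevant_statements(comments, TOPIC_TERMS)
print(filtered)  # only the billing/reimbursement sentences survive
# These filtered statements, not the raw comments, are what the LLM
# would be asked to summarize.
```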

Public Comment Analysis Use Case

Government agencies must respond to all salient points in public comments on proposed regulations or face potential lawsuits. This creates enormous manual processing burdens—the speaker cited a health equipment services regulation that required 4,500 hours of manual processing.

Technical Approach

The pipeline works as follows:

Results and Validation

The approach reduced processing time from 4,500 to approximately 600 hours for one regulation. The visualization layer is crucial for validation—users can drill down from LLM summaries to specific terminology and source documents to verify accuracy.

This represents a “beyond RAG” approach where instead of retrieving a few documents based on similarity, the system retrieves thousands of pre-filtered statements that are specifically relevant to the query.

Broader Applications

The technique has been applied to:

Cross-Cutting Themes

Several themes emerged across all three presentations:

The importance of benchmarking: All speakers emphasized measuring before optimizing and maintaining traceability to understand what’s actually happening in production systems.

Complementary techniques: The presentations showed how infrastructure optimization (vLLM, Ray Serve), output formatting (BAML), and pre-processing (text analytics) can work together to create more robust production systems.

Cost consciousness: GPU costs are significant, and all approaches aimed to maximize efficiency—whether through better GPU utilization, reduced token counts, or filtering data before expensive LLM calls.

The industry is early: The Fuzzy Labs speaker explicitly noted that best practices are still evolving and what works today may change next year, reflecting the rapid evolution of LLMOps practices.

Self-hosting considerations: While managed APIs offer convenience, self-hosting remains important for data privacy, learning, and cost control, but brings significant engineering challenges that require specialized tooling.
