ZenML

Building an AI Legal Assistant: From Early Testing to Production Deployment

Casetext 2023

Casetext transformed their legal research platform into an AI-powered legal assistant called CoCounsel using GPT-4, leading to a $650M acquisition by Thomson Reuters. The company shifted their entire 120-person team to focus on building this AI assistant after early access to GPT-4 showed promising results. Through rigorous testing, prompt engineering, and a test-driven development approach, they created a reliable AI system that could perform complex legal tasks like document review and research that previously took lawyers days to complete. The product achieved rapid market acceptance and true product-market fit within months of launch.

Industry

Legal


Overview

Casetext represents one of the most significant early success stories of deploying large language models in a production environment for mission-critical applications. Founded in 2012 by Jake Heller, a lawyer with computer science training, the company spent a decade building legal technology products before gaining early access to GPT-4 and making a dramatic pivot that resulted in a $650 million acquisition by Thomson Reuters. This case study, drawn from a Y Combinator Light Cone podcast interview with Heller, provides invaluable insights into how to successfully productionize LLMs in high-stakes, accuracy-critical domains.

The Pre-LLM Journey and Context

Before the LLM breakthrough, Casetext had spent approximately ten years navigating what Heller describes as the “idea maze” of legal technology. The company’s original vision was to improve legal research tools, which at the time required lawyers to spend days manually searching through documents and case law. Their initial approach attempted to create a user-generated content platform where lawyers would annotate case law, inspired by Stack Overflow and Wikipedia. This failed because lawyers, who bill by the hour, had no incentive to contribute free content.

The company then pivoted to natural language processing and machine learning approaches, building features like recommendation algorithms similar to those powering Spotify’s music suggestions. They analyzed citation networks between legal cases to help lawyers identify relevant precedents they might have missed. While these approaches generated real revenue ($15-20 million in ARR, growing 70-80% year over year), Heller describes them as “incremental improvements” that were easy for conservative, well-compensated lawyers to ignore.

The GPT-4 Moment and Company Pivot

Casetext’s close relationship with OpenAI and other research labs positioned them to receive early access to what would become GPT-4. The transformation was immediate and dramatic. Within 48 hours of seeing GPT-4’s capabilities, Heller made the decision to redirect all 120 employees from their existing projects to focus exclusively on building what would become CoCounsel.

The contrast between model generations was stark. GPT-3.5 scored at the 10th percentile on the bar exam—essentially random performance. GPT-4, on the same test (verified to not be in the training data), scored above the 90th percentile. More importantly, when given legal documents and asked to write research memos, GPT-4 could actually cite the provided sources accurately rather than hallucinating information.

The pivot required significant founder leadership. Heller led by example, personally building the first prototype even with a 120-person engineering team available. The NDA with OpenAI initially extended only to Heller and his co-founder, which he describes as “a blessing” because it forced them to deeply understand the technology themselves before delegating. When the NDA was extended to select customers months before GPT-4’s public launch, these early users experienced what Heller describes as a “Godlike feeling”: watching tasks that had previously taken entire days finish in ninety seconds.

Technical Architecture and Workflow Design

The core product insight behind CoCounsel was treating it as “a new member of the firm”—an AI legal assistant that could be given natural language instructions to complete complex tasks. The key tasks included reviewing millions of documents for evidence of fraud, summarizing legal documents, and conducting legal research to produce comprehensive memos with citations.

Heller’s approach to building these capabilities began with a fundamental question: “How would the best attorney in the world approach this problem?” For legal research, this meant breaking the workflow down into discrete steps.

Each of these steps became a separate prompt in what Heller describes as a “chain of actions.” The final result might require a dozen or two dozen individual prompts, each with its own evaluation criteria and test suite.
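The steps-become-prompts idea can be sketched as a simple pipeline in which each prompt’s output feeds the next. The step templates, the `call_model` stub, and `run_chain` below are illustrative assumptions, not Casetext’s actual code:

```python
# A minimal sketch of a "chain of actions": each step is its own prompt,
# and each step's output becomes part of the next step's input.
# `call_model` is a stand-in for a real LLM API call.

def call_model(prompt: str) -> str:
    # Placeholder: a production system would call an LLM API here.
    return f"<output for: {prompt[:40]}...>"

# Hypothetical decomposition of a legal-research task into discrete prompts.
CHAIN = [
    "Identify the legal questions raised in this request:\n{input}",
    "List search queries that would surface relevant case law for:\n{input}",
    "Summarize how each retrieved case bears on the question:\n{input}",
    "Draft a research memo with citations based on:\n{input}",
]

def run_chain(user_request: str, steps=CHAIN) -> str:
    """Run each prompt in order, threading outputs forward."""
    result = user_request
    for template in steps:
        result = call_model(template.format(input=result))
    return result

memo = run_chain("Is a non-compete enforceable in California?")
```

In a real deployment, each step would also carry its own evaluation criteria and test suite, as the case study describes.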

Test-Driven Development for Prompt Engineering

One of the most significant operational insights from Casetext’s experience is their rigorous application of test-driven development principles to prompt engineering. Heller, who admits he “never really believed in test-driven development before prompting,” found it essential for LLM work because of the unpredictable nature of language models.

For each prompt in their chain, the team defined clear success criteria based on what “good looks like” for that specific task. They wrote gold-standard answers specifying expected outputs for given inputs. The test suites grew from dozens to hundreds to thousands of test cases per prompt.

The methodology followed a pattern: write tests first, craft prompts to pass those tests, and stay vigilant, because adding an instruction to fix one problem could break previously passing tests. Heller notes that once a prompt passes even 100 well-chosen tests, the odds that it will handle the next 100,000 inputs from a random distribution of user queries accurately are, in his words, “very high.”
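A minimal sketch of this gold-standard loop, with a canned `run_prompt` stub standing in for a real LLM call (all names and cases below are hypothetical):

```python
# Test-driven prompt development in miniature: gold-standard input/expected
# pairs are run against the current prompt, and the pass rate is recomputed
# after every prompt change so regressions are visible immediately.
# `run_prompt` is a stand-in for calling an LLM with the prompt under test.

def run_prompt(prompt: str, case_input: str) -> str:
    # Placeholder model: returns a canned answer keyed on the input.
    canned = {
        "Was the contract signed?": "yes",
        "Who is the plaintiff?": "Acme Corp",
    }
    return canned.get(case_input, "unknown")

GOLD_CASES = [
    {"input": "Was the contract signed?", "expected": "yes"},
    {"input": "Who is the plaintiff?", "expected": "Acme Corp"},
    {"input": "What is the governing law?", "expected": "Delaware"},
]

def evaluate(prompt: str, cases=GOLD_CASES):
    """Return (passed, failed) case lists for the prompt against the suite."""
    passed, failed = [], []
    for case in cases:
        output = run_prompt(prompt, case["input"])
        (passed if output == case["expected"] else failed).append(case)
    return passed, failed

passed, failed = evaluate("Answer strictly from the provided document.")
pass_rate = len(passed) / len(GOLD_CASES)
```

The essential discipline is the same at any scale: the expected outputs are written before the prompt is tuned, and the suite only grows.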

This stands in stark contrast to what Heller observes many companies doing—what the interviewer colorfully describes as “raw dogging” prompts with “vibes only” engineering. The rigor was essential because lawyers react negatively to even small mistakes, and a single bad first experience could cause users to abandon the product entirely.

Beyond the “GPT Wrapper” Criticism

Heller directly addresses the criticism that companies building on LLMs are merely creating “GPT wrappers” without real intellectual property. He argues that solving real customer problems requires many layers of additional work on top of the base model.

The analogy offered is that successful SaaS companies like Salesforce are essentially “SQL wrappers”—the underlying technology is commodity, but the business logic, integrations, and user experience built around it create enormous value. Similarly, going from the 70% accuracy available through raw ChatGPT to the 100% accuracy required for production use cases represents significant engineering and domain expertise investment.

Accuracy Requirements in Mission-Critical Applications

The legal domain’s requirements for accuracy are extreme. Heller notes that “every time we made the smallest mistake in anything that we did we heard about it immediately.” This shaped their entire development philosophy. They understood that users could “lose faith in these things really quickly” and that the first encounter had to work flawlessly.

The test-driven approach enabled systematic debugging. When tests failed, they would analyze patterns, add specific instructions to address those patterns, and verify the changes didn’t break other cases. Heller describes it as “root causing” failures—usually the solution involves clearer instructions, better context management (neither too much nor too little information), or improved input formatting.
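The fix-one-break-another dynamic can be caught mechanically by diffing which test cases pass before and after a prompt edit. This is a small sketch of that idea; the case ids and function name are hypothetical:

```python
# Regression guard for prompt edits: compare the set of passing test cases
# before and after a change, so a fix for one failure that silently breaks
# previously passing cases is caught before it ships.

def diff_pass_sets(before: set, after: set):
    """Return (newly_fixed, newly_broken) case ids between two test runs."""
    return after - before, before - after

# Case ids that passed with prompt v1 versus the edited prompt v2.
passed_v1 = {"case_01", "case_02", "case_04"}
passed_v2 = {"case_01", "case_03", "case_04"}

fixed, broken = diff_pass_sets(passed_v1, passed_v2)
# The edit fixed case_03 but regressed case_02, so it should not ship as-is.
```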

OpenAI o1 and the Evolution of Chain-of-Thought Reasoning

The interview also touches on OpenAI’s o1 model and its implications for legal AI. Heller describes o1 as capable of the kind of “precise detail thinking” that earlier models struggled with. A specific test involved giving the model a 40-page legal brief in which quotations had been subtly altered (such as changing “and” to “neither nor”), along with the original case text. Previous models would claim nothing was wrong; o1 correctly identified the alterations.
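For exact quotations, this kind of alteration check can also be approximated deterministically: extract each quoted passage from the brief and verify it appears verbatim in the cited source. The sketch below handles only that exact-match case, and the example texts are invented:

```python
# Deterministic cross-check for altered quotations: flag any quoted passage
# in a brief that does not appear verbatim in the cited source text.
# (A model-based check can catch subtler, paraphrase-level alterations.)
import re

def find_altered_quotes(brief: str, source: str):
    """Return quoted passages from `brief` missing verbatim from `source`."""
    quotes = re.findall(r'"([^"]+)"', brief)
    return [q for q in quotes if q not in source]

source_case = 'The court held that "the parties shall arbitrate and mediate."'
brief = 'Counsel quotes the holding: "the parties shall neither arbitrate nor mediate."'

altered = find_altered_quotes(brief, source_case)
# `altered` contains the subtly changed quotation from the brief.
```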

The discussion frames this as moving from “System 1” thinking (fast, intuitive, pattern-based) to “System 2” thinking (slower, deliberate, logical). Heller speculates that o1’s training may have involved contractors documenting their reasoning process rather than just input-output pairs.

An intriguing future direction mentioned is the possibility of “teaching it not just how to answer the question” but “how to think”—injecting the reasoning approaches of the best lawyers into the model’s deliberation process. While results are preliminary, this represents a potential new frontier in prompt engineering where domain expertise shapes not just the answer but the reasoning path.
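One concrete reading of “teaching it how to think” is to prepend an expert’s reasoning procedure to the task prompt, so the model follows a named deliberation path rather than jumping to an answer. The procedure text below is invented for illustration and is not Casetext’s actual prompt:

```python
# Sketch of injecting an expert's reasoning approach into a prompt:
# the deliberation procedure is stated before the task itself.

EXPERT_PROCEDURE = """\
Before answering, reason the way a senior litigator would:
1. State the precise legal question.
2. Identify the controlling jurisdiction and standard.
3. Weigh the strongest authority on each side.
4. Only then state a conclusion, with citations."""

def build_reasoning_prompt(task: str) -> str:
    """Combine the expert procedure with the concrete task."""
    return f"{EXPERT_PROCEDURE}\n\nTask: {task}"

prompt = build_reasoning_prompt("Assess enforceability of the covenant in Exhibit A.")
```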

Organizational and Cultural Factors

The case study also highlights important organizational aspects of successful LLM deployment. The intense focus of having all 120 employees working on a single product enabled rapid iteration. Heller notes that some competitors “are stuck where we were in the first month of seeing GPT-4” because they lack the same intensity of focus.

Convincing skeptical employees required demonstrating customer reactions in real-time. When skeptical team members saw customers on Zoom calls experiencing the product—watching lawyers go through “existential crises” as they realized the implications—it quickly changed minds.

The early customer engagement, even under NDA, served multiple purposes: validating the product, generating testimonials, refining the experience, and building internal conviction. This created a flywheel where customer feedback accelerated development, which improved customer reactions, which further motivated the team.

Market Timing and Results

The timing of ChatGPT’s public release (while Casetext was secretly building on GPT-4) created a “pull” market effect. Lawyers who previously resisted technology changes because they “make $5 million a year” suddenly felt the ground shifting and began proactively seeking AI solutions. For the first time in Casetext’s decade of operation, customers were calling them rather than needing to be convinced.

The product launch met all of Marc Andreessen’s criteria for product-market fit: servers crashed from demand, they couldn’t hire support and sales people fast enough, and they received major press coverage on CNN and MSNBC. Two months after launch, acquisition discussions with Thomson Reuters began, closing at $650 million—roughly six times the company’s pre-GPT-4 valuation.

Key Takeaways for LLMOps Practitioners

The Casetext case study offers several actionable lessons for teams deploying LLMs in production:

- Apply test-driven development to prompt engineering: define gold-standard outputs first and grow per-prompt test suites from dozens to thousands of cases.
- Decompose complex tasks into chains of single-purpose prompts, each with its own evaluation criteria and test suite.
- Treat the first user experience as make-or-break in accuracy-critical domains; users lose faith quickly after even small mistakes.
- Root-cause failures systematically: clearer instructions, better context management, and improved input formatting resolve most of them.
- Put early versions in front of real customers, even under NDA, to validate the product and build internal conviction.

The story also serves as a reminder that successful LLM deployment often builds on years of domain expertise and existing technology infrastructure. Casetext’s decade of experience in legal technology, proprietary data assets, and deep understanding of lawyer workflows positioned them to capitalize on GPT-4’s capabilities faster than potential competitors starting from scratch.
