ZenML

Building an AI Legal Assistant: From Early Testing to Production Deployment

Casetext 2023

Casetext transformed their legal research platform into an AI-powered legal assistant called CoCounsel using GPT-4, leading to a $650M acquisition by Thomson Reuters. The company shifted their entire 120-person team to focus on building this AI assistant after early access to GPT-4 showed promising results. Through rigorous testing, prompt engineering, and a test-driven development approach, they created a reliable AI system that could perform complex legal tasks like document review and research that previously took lawyers days to complete. The product achieved rapid market acceptance and true product-market fit within months of launch.

Industry

Legal


Overview

Casetext represents one of the most significant early success stories of deploying large language models in a production environment for mission-critical applications. Founded in 2012 by Jake Heller, a lawyer with computer science training, the company spent a decade building legal technology products before gaining early access to GPT-4 and making a dramatic pivot that resulted in a $650 million acquisition by Thomson Reuters. This case study, drawn from a Y Combinator Light Cone podcast interview with Heller, provides invaluable insights into how to successfully productionize LLMs in high-stakes, accuracy-critical domains.

The Pre-LLM Journey and Context

Before the LLM breakthrough, Casetext had spent approximately ten years navigating what Heller describes as the “idea maze” of legal technology. The company’s original vision was to improve legal research tools, which at the time required lawyers to spend days manually searching through documents and case law. Their initial approach attempted to create a user-generated content platform where lawyers would annotate case law, inspired by Stack Overflow and Wikipedia. This failed because lawyers, who bill by the hour, had no incentive to contribute free content.

The company then pivoted to natural language processing and machine learning approaches, building features like recommendation algorithms similar to those powering Spotify’s music suggestions. They analyzed citation networks between legal cases to help lawyers identify relevant precedents they might have missed. While these approaches generated real revenue ($15-20 million in ARR, growing 70-80% year over year), Heller describes them as “incremental improvements” that were easy for conservative, well-compensated lawyers to ignore.

The GPT-4 Moment and Company Pivot

Casetext’s close relationship with OpenAI and other research labs positioned them to receive early access to what would become GPT-4. The transformation was immediate and dramatic. Within 48 hours of seeing GPT-4’s capabilities, Heller made the decision to redirect all 120 employees from their existing projects to focus exclusively on building what would become CoCounsel.

The contrast between model generations was stark. GPT-3.5 scored at the 10th percentile on the bar exam—essentially random performance. GPT-4, on the same test (verified to not be in the training data), scored above the 90th percentile. More importantly, when given legal documents and asked to write research memos, GPT-4 could actually cite the provided sources accurately rather than hallucinating information.

The pivot required significant founder leadership. Heller led by example, personally building the first prototype even with a 120-person engineering team available. The NDA with OpenAI initially extended only to Heller and his co-founder, which he describes as “a blessing” because it forced them to deeply understand the technology themselves before delegating. When the NDA was extended to select customers months before GPT-4’s public launch, these early users experienced what Heller describes as a “Godlike feeling”: watching tasks that had previously taken entire days finish in ninety seconds.

Technical Architecture and Workflow Design

The core product insight behind CoCounsel was treating it as “a new member of the firm”—an AI legal assistant that could be given natural language instructions to complete complex tasks. The key tasks included reviewing millions of documents for evidence of fraud, summarizing legal documents, and conducting legal research to produce comprehensive memos with citations.

Heller’s approach to building these capabilities began with a fundamental question: “How would the best attorney in the world approach this problem?” For legal research, this meant breaking the workflow down into discrete steps.

Each of these steps became a separate prompt in what Heller describes as a “chain of actions.” The final result might require a dozen or two dozen individual prompts, each with its own evaluation criteria and test suite.
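The steps-become-prompts idea can be sketched as a simple pipeline in which each prompt’s output feeds the next. The step templates, the `call_model` stub, and `run_chain` below are illustrative assumptions, not Casetext’s actual code:

```python
# A minimal sketch of a "chain of actions": each step is its own prompt,
# and each step's output becomes part of the next step's input.
# `call_model` is a stand-in for a real LLM API call.

def call_model(prompt: str) -> str:
    # Placeholder: a production system would call an LLM API here.
    return f"<output for: {prompt[:40]}...>"

# Hypothetical decomposition of a legal-research task into discrete prompts.
CHAIN = [
    "Identify the legal questions raised in this request:\n{input}",
    "List search queries that would surface relevant case law for:\n{input}",
    "Summarize how each retrieved case bears on the question:\n{input}",
    "Draft a research memo with citations based on:\n{input}",
]

def run_chain(user_request: str, steps=CHAIN) -> str:
    """Run each prompt in order, threading outputs forward."""
    result = user_request
    for template in steps:
        result = call_model(template.format(input=result))
    return result

memo = run_chain("Is a non-compete enforceable in California?")
```

In a real deployment, each step would also carry its own evaluation criteria and test suite, as the case study describes.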

Test-Driven Development for Prompt Engineering

One of the most significant operational insights from Casetext’s experience is their rigorous application of test-driven development principles to prompt engineering. Heller, who admits he “never really believed in test-driven development before prompting,” found it essential for LLM work because of the unpredictable nature of language models.

For each prompt in their chain, the team defined clear success criteria based on what “good looks like” for that specific task. They wrote gold-standard answers specifying expected outputs for given inputs. The test suites grew from dozens to hundreds to thousands of test cases per prompt.

The methodology followed a pattern: write tests first, craft prompts to pass those tests, and stay vigilant, because adding an instruction to fix one problem could break previously passing tests. Heller notes that once a prompt passes even 100 well-chosen tests, the odds that it will handle the next 100,000 inputs from a random distribution of user queries accurately are, in his words, “very high.”
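A minimal sketch of this gold-standard loop, with a canned `run_prompt` stub standing in for a real LLM call (all names and cases below are hypothetical):

```python
# Test-driven prompt development in miniature: gold-standard input/expected
# pairs are run against the current prompt, and the pass rate is recomputed
# after every prompt change so regressions are visible immediately.
# `run_prompt` is a stand-in for calling an LLM with the prompt under test.

def run_prompt(prompt: str, case_input: str) -> str:
    # Placeholder model: returns a canned answer keyed on the input.
    canned = {
        "Was the contract signed?": "yes",
        "Who is the plaintiff?": "Acme Corp",
    }
    return canned.get(case_input, "unknown")

GOLD_CASES = [
    {"input": "Was the contract signed?", "expected": "yes"},
    {"input": "Who is the plaintiff?", "expected": "Acme Corp"},
    {"input": "What is the governing law?", "expected": "Delaware"},
]

def evaluate(prompt: str, cases=GOLD_CASES):
    """Return (passed, failed) case lists for the prompt against the suite."""
    passed, failed = [], []
    for case in cases:
        output = run_prompt(prompt, case["input"])
        (passed if output == case["expected"] else failed).append(case)
    return passed, failed

passed, failed = evaluate("Answer strictly from the provided document.")
pass_rate = len(passed) / len(GOLD_CASES)
```

The essential discipline is the same at any scale: the expected outputs are written before the prompt is tuned, and the suite only grows.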

This stands in stark contrast to what Heller observes many companies doing—what the interviewer colorfully describes as “raw dogging” prompts with “vibes only” engineering. The rigor was essential because lawyers react negatively to even small mistakes, and a single bad first experience could cause users to abandon the product entirely.

Beyond the “GPT Wrapper” Criticism

Heller directly addresses the criticism that companies building on LLMs are merely creating “GPT wrappers” without real intellectual property. He argues that solving real customer problems requires many layers of additional work on top of the base model.

The analogy offered is that successful SaaS companies like Salesforce are essentially “SQL wrappers”—the underlying technology is commodity, but the business logic, integrations, and user experience built around it create enormous value. Similarly, going from the 70% accuracy available through raw ChatGPT to the 100% accuracy required for production use cases represents significant engineering and domain expertise investment.

Accuracy Requirements in Mission-Critical Applications

The legal domain’s requirements for accuracy are extreme. Heller notes that “every time we made the smallest mistake in anything that we did we heard about it immediately.” This shaped their entire development philosophy. They understood that users could “lose faith in these things really quickly” and that the first encounter had to work flawlessly.

The test-driven approach enabled systematic debugging. When tests failed, they would analyze patterns, add specific instructions to address those patterns, and verify the changes didn’t break other cases. Heller describes it as “root causing” failures—usually the solution involves clearer instructions, better context management (neither too much nor too little information), or improved input formatting.
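The fix-one-break-another dynamic can be caught mechanically by diffing which test cases pass before and after a prompt edit. This is a small sketch of that idea; the case ids and function name are hypothetical:

```python
# Regression guard for prompt edits: compare the set of passing test cases
# before and after a change, so a fix for one failure that silently breaks
# previously passing cases is caught before it ships.

def diff_pass_sets(before: set, after: set):
    """Return (newly_fixed, newly_broken) case ids between two test runs."""
    return after - before, before - after

# Case ids that passed with prompt v1 versus the edited prompt v2.
passed_v1 = {"case_01", "case_02", "case_04"}
passed_v2 = {"case_01", "case_03", "case_04"}

fixed, broken = diff_pass_sets(passed_v1, passed_v2)
# The edit fixed case_03 but regressed case_02, so it should not ship as-is.
```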

OpenAI o1 and the Evolution of Chain-of-Thought Reasoning

The interview also touches on OpenAI’s o1 model and its implications for legal AI. Heller describes o1 as capable of the kind of “precise detail thinking” that earlier models struggled with. A specific test involved giving the model a 40-page legal brief in which quotations had been subtly altered (such as changing “and” to “neither nor”), along with the original case text. Previous models would claim nothing was wrong; o1 correctly identified the alterations.
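For exact quotations, this kind of alteration check can also be approximated deterministically: extract each quoted passage from the brief and verify it appears verbatim in the cited source. The sketch below handles only that exact-match case, and the example texts are invented:

```python
# Deterministic cross-check for altered quotations: flag any quoted passage
# in a brief that does not appear verbatim in the cited source text.
# (A model-based check can catch subtler, paraphrase-level alterations.)
import re

def find_altered_quotes(brief: str, source: str):
    """Return quoted passages from `brief` missing verbatim from `source`."""
    quotes = re.findall(r'"([^"]+)"', brief)
    return [q for q in quotes if q not in source]

source_case = 'The court held that "the parties shall arbitrate and mediate."'
brief = 'Counsel quotes the holding: "the parties shall neither arbitrate nor mediate."'

altered = find_altered_quotes(brief, source_case)
# `altered` contains the subtly changed quotation from the brief.
```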

The discussion frames this as moving from “System 1” thinking (fast, intuitive, pattern-based) to “System 2” thinking (slower, deliberate, logical). Heller speculates that o1’s training may have involved contractors documenting their reasoning process rather than just input-output pairs.

An intriguing future direction mentioned is the possibility of “teaching it not just how to answer the question” but “how to think”—injecting the reasoning approaches of the best lawyers into the model’s deliberation process. While results are preliminary, this represents a potential new frontier in prompt engineering where domain expertise shapes not just the answer but the reasoning path.
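One concrete reading of “teaching it how to think” is to prepend an expert’s reasoning procedure to the task prompt, so the model follows a named deliberation path rather than jumping to an answer. The procedure text below is invented for illustration and is not Casetext’s actual prompt:

```python
# Sketch of injecting an expert's reasoning approach into a prompt:
# the deliberation procedure is stated before the task itself.

EXPERT_PROCEDURE = """\
Before answering, reason the way a senior litigator would:
1. State the precise legal question.
2. Identify the controlling jurisdiction and standard.
3. Weigh the strongest authority on each side.
4. Only then state a conclusion, with citations."""

def build_reasoning_prompt(task: str) -> str:
    """Combine the expert procedure with the concrete task."""
    return f"{EXPERT_PROCEDURE}\n\nTask: {task}"

prompt = build_reasoning_prompt("Assess enforceability of the covenant in Exhibit A.")
```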

Organizational and Cultural Factors

The case study also highlights important organizational aspects of successful LLM deployment. The intense focus of having all 120 employees working on a single product enabled rapid iteration. Heller notes that some competitors “are stuck where we were in the first month of seeing GPT-4” because they lack the same intensity of focus.

Convincing skeptical employees required demonstrating customer reactions in real-time. When skeptical team members saw customers on Zoom calls experiencing the product—watching lawyers go through “existential crises” as they realized the implications—it quickly changed minds.

The early customer engagement, even under NDA, served multiple purposes: validating the product, generating testimonials, refining the experience, and building internal conviction. This created a flywheel where customer feedback accelerated development, which improved customer reactions, which further motivated the team.

Market Timing and Results

The timing of ChatGPT’s public release (while Casetext was secretly building on GPT-4) created a “pull” market effect. Lawyers who previously resisted technology changes because they “make $5 million a year” suddenly felt the ground shifting and began proactively seeking AI solutions. For the first time in Casetext’s decade of operation, customers were calling them rather than needing to be convinced.

The product launch met all of Marc Andreessen’s criteria for product-market fit: servers crashed from demand, they couldn’t hire support and sales people fast enough, and they received major press coverage on CNN and MSNBC. Two months after launch, acquisition discussions with Thomson Reuters began, closing at $650 million—roughly six times the company’s pre-GPT-4 valuation.

Key Takeaways for LLMOps Practitioners

The Casetext case study offers several actionable lessons for teams deploying LLMs in production:

- Apply test-driven development to prompt engineering: define gold-standard outputs first and grow per-prompt test suites from dozens to thousands of cases.
- Decompose complex tasks into chains of single-purpose prompts, each with its own evaluation criteria and test suite.
- Treat the first user experience as make-or-break in accuracy-critical domains; users lose faith quickly after even small mistakes.
- Root-cause failures systematically: clearer instructions, better context management, and improved input formatting resolve most of them.
- Put early versions in front of real customers, even under NDA, to validate the product and build internal conviction.

The story also serves as a reminder that successful LLM deployment often builds on years of domain expertise and existing technology infrastructure. Casetext’s decade of experience in legal technology, proprietary data assets, and deep understanding of lawyer workflows positioned them to capitalize on GPT-4’s capabilities faster than potential competitors starting from scratch.
