AngelList transformed their investment document processing from manual classification to an automated system using LLMs. They initially used AWS Comprehend for news article classification but transitioned to OpenAI's models, which proved more accurate and cost-effective. They built Relay, a product that automatically extracts and organizes investment terms and company updates from documents, achieving 99% accuracy in term extraction while significantly reducing operational costs compared to manual processing.
This case study comes from a podcast interview with Tebow, an engineering lead at AngelList, discussing how the investment platform evolved its machine learning infrastructure from traditional ML approaches to LLM-powered solutions. AngelList is a platform that connects startups with investors, handling investment documents, company news tracking, and portfolio management for venture capital funds and individual investors.
The conversation provides an excellent window into the pragmatic decisions involved in transitioning from custom-trained ML models to LLM-based systems, and the subsequent productization of internal AI capabilities into a customer-facing product called Relay.
Tebow joined AngelList approximately two years before the interview as one of the first engineers with machine learning experience. The organization had no data scientists, research scientists, or dedicated ML competency when he arrived. The culture emphasized autonomy, with employees encouraged to act as “founders of their one-person startup.”
The first ML use case was news article classification to route relevant articles to investor dashboards for companies they had invested in. This was implemented using AWS Comprehend with a custom-trained model. Key challenges with this approach included the cost of keeping an inference endpoint running continuously, the need to retrain the model to improve quality, and the difficulty of adding more nuanced classification without another training cycle.
The team was able to deprecate the entire Comprehend-based system and rewrite it using OpenAI in a single day. The new approach used simple prompts with GPT-3 (later upgraded to GPT-3.5 and GPT-4) to classify incoming articles and determine which portfolio companies they were relevant to.
The benefits were immediate: lower costs (pay-per-request vs. always-on server), automatic quality improvements as OpenAI released better models, and the ability to add nuanced extraction through prompt modifications rather than model retraining.
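The specifics of AngelList's prompts are not public, but the shape of a prompt-based classifier like the one described is easy to sketch. The prompt wording, function names, and model choice below are illustrative assumptions, not the team's actual code:

```python
def build_classification_prompt(article_text: str, portfolio: list[str]) -> str:
    """Ask the model which portfolio companies, if any, an article concerns."""
    companies = ", ".join(portfolio)
    return (
        "You route news articles to investor dashboards.\n"
        f"Portfolio companies: {companies}\n"
        "Reply with the matching company names, or NONE.\n\n"
        f"Article:\n{article_text}"
    )

def classify_article(article_text: str, portfolio: list[str]) -> str:
    # Imported lazily so the prompt helper above has no hard dependency.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-4",  # the team moved from GPT-3 up through GPT-3.5/GPT-4
        temperature=0,
        messages=[{
            "role": "user",
            "content": build_classification_prompt(article_text, portfolio),
        }],
    )
    return resp.choices[0].message.content.strip()
```

Adding a new nuance (say, distinguishing fundraising news from product news) is a one-line prompt edit here, versus a full retraining cycle under the Comprehend approach.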
The production system uses LangChain as the orchestration library for the document-processing pipeline.
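The exact chain composition was not described in the interview, so the following is a hypothetical reconstruction of what a LangChain-orchestrated extraction pipeline could look like: chunk a long document, then run each chunk through a prompt → model → parser chain. The prompt text, field names, and helper are assumptions:

```python
def chunk_document(text: str, max_chars: int = 4000) -> list[str]:
    """Split a long document into model-sized pieces on paragraph boundaries."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

def build_extraction_chain():
    # Imported lazily so the chunking helper above stays dependency-free.
    from langchain_openai import ChatOpenAI
    from langchain_core.prompts import ChatPromptTemplate
    from langchain_core.output_parsers import JsonOutputParser

    prompt = ChatPromptTemplate.from_template(
        "Extract the key investment terms from this document chunk "
        "and return them as JSON:\n\n{document}"
    )
    # LangChain's pipe operator composes prompt -> model -> output parser.
    return prompt | ChatOpenAI(model="gpt-4", temperature=0) | JsonOutputParser()
```

A caller would invoke the chain per chunk, e.g. `build_extraction_chain().invoke({"document": chunk})`, and merge the resulting term dictionaries.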
A critical aspect of their approach is the inherent verifiability of their use case. Since they’re doing data extraction from source documents (not generation), they can always cross-reference extracted information against the original text. This makes their domain particularly well-suited for LLM applications: every extracted value can be checked against the source document, so hallucinations are detectable rather than silent.
The team claims 99% accuracy on extraction tasks, validated through extensive back-testing against their historical document corpus.
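The cross-referencing idea lends itself to a simple automated check: if an extracted value never appears in the source document, flag it as a likely hallucination. This is a minimal sketch of that principle, not AngelList's actual validation code; the normalization rules are assumptions:

```python
import re

def normalize(s: str) -> str:
    # Strip whitespace, commas, and dollar signs so "$10,000,000"
    # matches "10 000 000" or "$10000000" in the source.
    return re.sub(r"[\s,$]", "", s).lower()

def verify_extraction(source_text: str, extracted: dict[str, str]) -> dict[str, bool]:
    """Cross-reference each extracted value against the source document.

    Because this is extraction, not generation, a value absent from the
    source is a likely hallucination and can be flagged for review.
    """
    src = normalize(source_text)
    return {field: normalize(value) in src for field, value in extracted.items()}
```

Back-testing against a historical corpus, as the team did, amounts to running checks like this over documents whose correct terms are already known.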
The team acknowledged that prompt engineering remains somewhat “artisanal” in their current process. They built an internal system that maintains human-in-the-loop capabilities: operators can review extracted terms, validate edge cases, and correct outputs where the extraction is wrong.
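One plausible way to wire such a review flow, sketched here as an assumption rather than a description of their internal system: records that fail automatic verification are queued for an operator, who can attach a correction.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ExtractionRecord:
    field_name: str
    value: str
    verified: bool                    # passed the automatic cross-reference?
    corrected: Optional[str] = None   # set by a human operator if needed

def review_queue(records: list[ExtractionRecord]) -> list[ExtractionRecord]:
    """Only records that failed automatic verification reach a human."""
    return [r for r in records if not r.verified]

def apply_correction(record: ExtractionRecord, corrected_value: str) -> None:
    record.corrected = corrected_value

def final_value(record: ExtractionRecord) -> str:
    """An operator correction always wins over the model's output."""
    return record.corrected if record.corrected is not None else record.value
```

Keeping the human step scoped to unverified records is what lets a small operations team sit behind a high-volume pipeline.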
While they don’t yet have fully structured regression testing (where new prompts are automatically validated against known-good outputs), this is on their roadmap. The eventual goal is an automated test suite that can validate prompt changes don’t introduce regressions, even if it’s computationally expensive.
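The regression suite they describe as a roadmap item could take roughly this shape: replay a corpus of documents with known-good outputs through the candidate prompt and report mismatches. The fixture format and function names below are assumptions, not their planned design:

```python
from typing import Callable

def run_regression_suite(
    extract: Callable[[str], dict],
    fixtures: dict[str, dict],
) -> list[tuple]:
    """Validate a new prompt against known-good outputs.

    `extract` is any callable mapping document text to extracted terms
    (e.g. a wrapper around the production chain with the candidate prompt);
    `fixtures` maps document text to the expected terms from back-testing.
    Returns a list of (doc_snippet, field, expected, got) failures.
    """
    failures = []
    for document, expected in fixtures.items():
        got = extract(document)
        for field_name, want in expected.items():
            if got.get(field_name) != want:
                failures.append((document[:40], field_name, want, got.get(field_name)))
    return failures
```

Running this over the full historical corpus is exactly the "computationally expensive" part the team mentions: every prompt change re-invokes the model once per fixture document.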
A strategic decision was to put prompt engineering in the hands of domain experts rather than solely engineers. Since prompts are natural language, lawyers and operations staff who understand the business domain can iterate on prompts directly. This provides higher leverage than having engineers attempt to encode domain knowledge.
The biggest operational challenges weren’t model quality but rather scaling API access: reliability and throughput limits when calling the OpenAI API at production volume.
To address reliability and scaling concerns, AngelList implemented a dual-provider approach, running against both Azure OpenAI and the direct OpenAI API.
They load balance between both providers, providing fallback capability if either experiences issues. The team noted that Azure was more operationally mature and stable compared to direct OpenAI API access.
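The load-balancing-with-fallback pattern can be sketched provider-agnostically: pick a provider at random, and if it fails, try the other. The routing logic below is an illustrative assumption (their actual implementation was not described); the providers are injected as callables so the same logic covers Azure OpenAI and direct OpenAI clients alike:

```python
import random
from typing import Callable

def call_with_fallback(prompt: str, providers: dict[str, Callable[[str], str]]) -> str:
    """Load-balance across providers and fall back if one raises.

    `providers` maps a name (e.g. "azure", "openai") to a callable that
    sends the prompt to that provider and returns the completion text.
    """
    names = list(providers)
    random.shuffle(names)  # naive load balancing: random provider first
    errors = []
    for name in names:
        try:
            return providers[name](prompt)
        except Exception as exc:
            errors.append(f"{name}: {exc}")  # record and try the next provider
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

In production the callables would wrap real SDK clients and the exception handling would distinguish retryable errors (rate limits, timeouts) from permanent ones, but the routing shape is the same.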
The internal document extraction capabilities were productized into Relay, a customer-facing tool launched publicly. Relay lets investors upload investment documents and have the key terms and company updates automatically extracted and organized.
The product offers a free tier (5 documents per month) to allow users to experience the value proposition directly.
The team articulated a deliberate 70/30 strategy for allocating engineering effort between proven infrastructure and exploration.
Their rationale for sticking with OpenAI rather than exploring alternatives like Anthropic or Cohere was focused prioritization. Adding another provider would increase system complexity without clear benefit since they weren’t hitting roadblocks with OpenAI’s capabilities. The time saved is better spent building new features with known-working infrastructure.
However, they recognize future scenarios requiring more control, such as the point where scale or data requirements make self-hosted models worth the added complexity.
The case study highlights several practical insights:
Start with verifiable use cases: Document extraction where outputs can be validated against source material is an ideal LLM application because hallucinations are detectable.
Value over optimization: The team deliberately chose breadth of application over cost optimization initially. As costs for AI inference continue to decrease, getting features to market matters more than premature optimization.
Embrace managed services initially: Despite eventual limitations, starting with OpenAI’s API allowed rapid iteration. The complexity of self-hosted models can be deferred until scale demands it.
Domain experts should own prompts: Moving prompt engineering to lawyers and business experts rather than keeping it solely with engineers increases velocity and quality.
Build fallback infrastructure early: Multi-provider routing between Azure OpenAI and direct OpenAI APIs provides both reliability and scaling headroom.
Human-in-the-loop remains valuable: Even at 99% accuracy, having review processes and the ability for humans to validate edge cases maintains quality and builds trust in the system.
The interview presents a refreshingly pragmatic view of LLMOps: focus on delivering value to users, defer complexity until it’s necessary, and maintain clear connections to ground truth data to ensure reliability.