ZenML

Scaling LLM-Powered Financial Insights with Continuous Evaluation

Fintool 2025

Fintool, an AI equity research assistant, faced the challenge of processing massive amounts of financial data (1.5 billion tokens across 70 million document chunks) while maintaining high accuracy and trust for institutional investors. They implemented a comprehensive LLMOps evaluation workflow using Braintrust, combining automated LLM-based evaluation, golden datasets, format validation, and human-in-the-loop oversight to ensure reliable and accurate financial insights at scale.

Industry

Finance

Overview

Fintool is an AI equity research assistant that helps investors make better decisions by processing large volumes of unstructured financial data, including SEC filings and earnings call transcripts. The company serves prominent institutional investors such as Kennedy Capital and First Manhattan, as well as enterprise clients like PricewaterhouseCoopers. Their flagship product, Fintool Feed, provides a Twitter-like interface where key sections of financial documents are summarized based on user-configured prompts and alerts.

It’s worth noting that this case study is presented by Braintrust, the evaluation platform that Fintool uses, so the narrative naturally emphasizes the benefits of their tooling. While the technical approaches described are sound and represent genuine LLMOps best practices, readers should be aware of the promotional context.

The Production Challenge

The core challenge Fintool faced is a classic LLMOps scaling problem: how do you maintain quality and reliability when processing massive amounts of data through LLM pipelines? The specific numbers cited are impressive—over 1.5 billion tokens across 70 million document chunks, with gigabytes of new data processed daily. In the financial services context, the stakes are particularly high since a single overlooked disclosure or inaccurate summary could have serious consequences for institutional investors making decisions based on this information.

The problem is further complicated by the diversity of user prompts. Some users want broad compliance monitoring across entire sectors, while others need very specific alerts about particular disclosures like board membership changes. This variability means the system cannot rely on a one-size-fits-all approach to quality assurance.

Continuous Evaluation Workflow

Fintool’s approach to maintaining quality at scale centers on what they describe as a “continuous evaluation workflow.” This represents a mature LLMOps practice where evaluation is not a one-time gate but an ongoing process integrated into the production system.

Quality Standards and Format Validation

The first component involves defining and enforcing quality standards through custom validation rules. Every insight generated by the system must include a reliable source, specifically an SEC document ID. The system automatically flags anything that’s missing or malformed. This goes beyond simple presence checking—they validate that sources are properly formatted and directly tied to the insights they support.
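
The case study doesn't include the validation code itself. A minimal sketch of such a rule, assuming a hypothetical insight record and the SEC EDGAR accession-number format (ten digits, two digits, six digits, e.g. 0000320193-24-000123), might look like:

```python
import re

# SEC EDGAR accession numbers follow a 10-2-6 digit pattern,
# e.g. "0000320193-24-000123". The record field names are hypothetical.
ACCESSION_RE = re.compile(r"^\d{10}-\d{2}-\d{6}$")

def validate_insight(insight: dict) -> list[str]:
    """Return a list of validation failures (empty list = passes)."""
    errors = []
    source_id = insight.get("sec_document_id")
    if not source_id:
        errors.append("missing source: no SEC document ID")
    elif not ACCESSION_RE.match(source_id):
        errors.append(f"malformed SEC document ID: {source_id!r}")
    if not insight.get("summary", "").strip():
        errors.append("empty summary")
    return errors

# A well-formed insight passes; a missing or malformed source is flagged.
ok = validate_insight({"sec_document_id": "0000320193-24-000123",
                       "summary": "Apple 10-K risk factor summary"})
bad = validate_insight({"sec_document_id": "320193", "summary": "..."})
```

Returning a list of failures rather than a boolean lets the pipeline log exactly which rule an insight broke, which is what makes automatic flagging of "missing or malformed" sources actionable.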

A particularly interesting implementation detail is the use of “span iframes” to show citations within trace spans. This allows expert reviewers to quickly validate content by seeing the original source material alongside the generated insight. This kind of traceability is essential in financial contexts where regulatory compliance often requires demonstrating the provenance of any claim or recommendation.

Golden Dataset Curation

Fintool maintains curated golden datasets that serve as benchmarks for evaluating LLM output quality. These datasets are tailored to specific industries and document types, such as healthcare compliance or technology KPIs. The approach combines production logs with handpicked examples that reflect real-world scenarios, which helps ensure the benchmarks remain relevant as the system evolves.

The dynamic nature of these golden datasets is noteworthy. Rather than treating evaluation data as static, Fintool continuously refreshes their benchmarks based on production data. This is a mature practice that helps prevent the common problem of evaluation datasets becoming stale and unrepresentative of actual production workloads.
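
A refresh of this kind reduces to merging sampled production cases with curated examples. The function and record schema below are illustrative assumptions, not Fintool's actual pipeline:

```python
import random

def refresh_golden_dataset(production_logs, handpicked, sample_size=50, seed=0):
    """Blend a sample of production cases with curated examples.

    Each case is a dict like {"input": ..., "expected": ..., "review": ...};
    the schema and function are illustrative, not Fintool's actual code.
    """
    rng = random.Random(seed)
    # Keep only logged cases a human approved, so the benchmark reflects
    # verified-good behavior rather than raw model output.
    approved = [c for c in production_logs if c.get("review") == "approved"]
    sampled = rng.sample(approved, min(sample_size, len(approved)))
    # Handpicked examples always stay in; deduplicate by input text so a
    # curated case overrides a logged duplicate.
    merged = {c["input"]: c for c in sampled}
    merged.update({c["input"]: c for c in handpicked})
    return list(merged.values())
```

Re-running something like this on a schedule against fresh logs is what keeps the benchmark from drifting away from production traffic.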

LLM-as-a-Judge Automation

Perhaps the most technically interesting aspect of the workflow is the use of LLM-as-a-judge for automated evaluation. Each generated insight is scored on metrics including accuracy, relevance, and completeness. The case study provides a concrete code example showing a format validation scorer that uses an LLM to check whether output follows a specific structure (business description followed by a markdown list of product lines).

The format validation prompt template demonstrates a simple but effective pattern: describe the exact structure the output must follow, then ask the judge for a binary verdict rather than an open-ended critique.
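
That template isn't reproduced in this write-up; a plausible reconstruction of the pattern, with a stub standing in for a real LLM client, is:

```python
# A plausible reconstruction of a format-validation judge; the prompt
# wording and the `format_scorer` helper are assumptions, not Fintool's code.
FORMAT_JUDGE_PROMPT = """\
You are evaluating the FORMAT of a financial insight, not its content.
The output must contain:
1. A short business description paragraph.
2. A markdown bullet list of product lines.

Output to evaluate:
{output}

Answer with exactly one word: PASS or FAIL."""

def format_scorer(output: str, judge) -> float:
    """Return 1.0 if the judge says the output matches the required
    structure, else 0.0. `judge` is any callable prompt -> completion;
    in production it would wrap an LLM client."""
    verdict = judge(FORMAT_JUDGE_PROMPT.format(output=output))
    return 1.0 if verdict.strip().upper().startswith("PASS") else 0.0

# Stub judge for local testing: approves anything containing a bullet list.
stub_judge = lambda prompt: "PASS" if "\n- " in prompt else "FAIL"
```

The binary PASS/FAIL verdict keeps the judge's job narrow, which tends to make LLM-as-a-judge scores more reproducible than open-ended grading.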

This automated scoring approach is configured to run whenever prompts are adjusted or new data is ingested, providing continuous regression detection. The automation serves a dual purpose: it ensures consistent quality monitoring across the massive volume of generated content, and it frees up human reviewers to focus on the most challenging or ambiguous cases.

Human-in-the-Loop Oversight

Despite heavy automation, Fintool maintains human oversight as an essential component of their quality assurance process. When content receives a low score from automated evaluation or is downvoted by users, a human expert is immediately notified. These experts can approve, reject, or directly edit the content to fix issues like poor formatting or incorrect information.

The integration between Fintool’s database and Braintrust is highlighted as enabling rapid intervention—reviewers can update live content directly from the evaluation UI. This tight integration between monitoring, evaluation, and content management reduces the friction involved in addressing quality issues when they’re detected.
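
The escalation logic described above reduces to a small routing rule. The threshold and field names below are hypothetical; the pattern is what matters:

```python
from dataclasses import dataclass

SCORE_THRESHOLD = 0.7  # illustrative cutoff; the real value isn't disclosed

@dataclass
class Insight:
    id: str
    score: float    # automated LLM-as-judge score in [0, 1]
    downvotes: int  # negative user feedback from the feed

def needs_human_review(insight: Insight) -> bool:
    """Route an insight to an expert when automated scoring or user
    feedback signals a problem; everything else ships without review."""
    return insight.score < SCORE_THRESHOLD or insight.downvotes > 0

def triage(insights: list[Insight]) -> list[str]:
    """IDs of insights an expert should approve, reject, or edit."""
    return [i.id for i in insights if needs_human_review(i)]
```

Keeping the rule this simple means the scarce resource, expert attention, is spent only where automated scores or user feedback disagree with expectations.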

Technical Architecture Considerations

While the case study doesn’t provide deep technical architecture details, several aspects can be inferred from the workflow described:

- A chunked ingestion pipeline that splits filings and transcripts into retrievable units (70 million chunks to date), with gigabytes of new data flowing in daily.
- Trace-level instrumentation in Braintrust, including span iframes that render cited source documents alongside each generated insight.
- Automated evaluation runs triggered whenever prompts are adjusted or new data is ingested, functioning as regression tests.
- A two-way integration between Fintool’s production database and the evaluation UI, allowing reviewers to edit live content in place.

Results and Claimed Benefits

The case study reports several positive outcomes, though specific metrics are notably absent:

- Consistent quality monitoring across the full volume of generated content, beyond what human review alone could cover.
- Human experts freed to focus on the most challenging or ambiguous cases rather than routine checks.
- Faster remediation, since reviewers can correct live content directly from the evaluation UI.
- Sustained accuracy and trust for the institutional investors who rely on the insights.

It’s worth noting that while these claims are plausible given the described architecture, the case study doesn’t provide quantitative improvements (e.g., error rate reductions, reviewer time savings). This is common in vendor case studies but limits the ability to objectively assess the magnitude of benefits.

Key LLMOps Takeaways

This case study illustrates several important LLMOps patterns for production LLM systems:

- Treating evaluation as a continuous process embedded in production rather than a one-time pre-launch gate.
- Curating golden datasets from production logs so benchmarks stay representative of real workloads.
- Using LLM-as-a-judge scoring to cover volumes no human team could review, with humans handling the hard cases.
- Escalating to human experts on low automated scores or negative user feedback.
- Enforcing source traceability (document IDs, citations in trace spans) as a first-class quality requirement in a regulated domain.

The approach described represents a relatively mature LLMOps practice, particularly the integration of evaluation into the production feedback loop and the combination of automated and human oversight. For organizations processing high volumes of LLM-generated content in high-stakes domains, this case study provides a useful reference architecture.

More Like This

Agentic AI Copilot for Insurance Underwriting with Multi-Tool Integration

Snorkel 2025

Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.


Migration of Credit AI RAG Application from Multi-Cloud to AWS Bedrock

Octus 2025

Octus, a leading provider of credit market data and analytics, migrated their flagship generative AI product Credit AI from a multi-cloud architecture (OpenAI on Azure and other services on AWS) to a unified AWS architecture using Amazon Bedrock. The migration addressed challenges in scalability, cost, latency, and operational complexity associated with running a production RAG application across multiple clouds. By leveraging Amazon Bedrock's managed services for embeddings, knowledge bases, and LLM inference, along with supporting AWS services like Lambda, S3, OpenSearch, and Textract, Octus achieved a 78% reduction in infrastructure costs, 87% decrease in cost per question, improved document sync times from hours to minutes, and better development velocity while maintaining SOC2 compliance and serving thousands of concurrent users across financial services clients.


Multi-Agent Financial Research and Question Answering System

Yahoo! Finance 2025

Yahoo! Finance built a production-scale financial question answering system using multi-agent architecture to address the information asymmetry between retail and institutional investors. The system leverages Amazon Bedrock Agent Core and employs a supervisor-subagent pattern where specialized agents handle structured data (stock prices, financials), unstructured data (SEC filings, news), and various APIs. The solution processes heterogeneous financial data from multiple sources, handles temporal complexities of fiscal years, and maintains context across sessions. Through a hybrid evaluation approach combining human and AI judges, the system achieves strong accuracy and coverage metrics while processing queries in 5-50 seconds at costs of 2-5 cents per query, demonstrating production viability at scale with support for 100+ concurrent users.
