ZenML

Scaling LLM-Powered Financial Insights with Continuous Evaluation

Fintool 2025

Fintool, an AI equity research assistant, faced the challenge of processing massive amounts of financial data (1.5 billion tokens across 70 million document chunks) while maintaining high accuracy and trust for institutional investors. They implemented a comprehensive LLMOps evaluation workflow using Braintrust, combining automated LLM-based evaluation, golden datasets, format validation, and human-in-the-loop oversight to ensure reliable and accurate financial insights at scale.

Industry

Finance

Overview

Fintool is an AI equity research assistant that helps investors make better decisions by processing large volumes of unstructured financial data, including SEC filings and earnings call transcripts. The company serves prominent institutional investors such as Kennedy Capital and First Manhattan, as well as enterprise clients like PricewaterhouseCoopers. Their flagship product, Fintool Feed, provides a Twitter-like interface where key sections of financial documents are summarized based on user-configured prompts and alerts.

It’s worth noting that this case study is presented by Braintrust, the evaluation platform that Fintool uses, so the narrative naturally emphasizes the benefits of their tooling. While the technical approaches described are sound and represent genuine LLMOps best practices, readers should be aware of the promotional context.

The Production Challenge

The core challenge Fintool faced is a classic LLMOps scaling problem: how do you maintain quality and reliability when processing massive amounts of data through LLM pipelines? The specific numbers cited are impressive—over 1.5 billion tokens across 70 million document chunks, with gigabytes of new data processed daily. In the financial services context, the stakes are particularly high since a single overlooked disclosure or inaccurate summary could have serious consequences for institutional investors making decisions based on this information.

The problem is further complicated by the diversity of user prompts. Some users want broad compliance monitoring across entire sectors, while others need very specific alerts about particular disclosures like board membership changes. This variability means the system cannot rely on a one-size-fits-all approach to quality assurance.

Continuous Evaluation Workflow

Fintool’s approach to maintaining quality at scale centers on what they describe as a “continuous evaluation workflow.” This represents a mature LLMOps practice where evaluation is not a one-time gate but an ongoing process integrated into the production system.

Quality Standards and Format Validation

The first component involves defining and enforcing quality standards through custom validation rules. Every insight generated by the system must include a reliable source, specifically an SEC document ID. The system automatically flags anything that’s missing or malformed. This goes beyond simple presence checking—they validate that sources are properly formatted and directly tied to the insights they support.
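
The case study doesn't include the validation code itself. A minimal sketch of such a rule, assuming a hypothetical insight record and the SEC EDGAR accession-number format (ten digits, two digits, six digits, e.g. 0000320193-24-000123), might look like:

```python
import re

# SEC EDGAR accession numbers follow a 10-2-6 digit pattern,
# e.g. "0000320193-24-000123". The record field names are hypothetical.
ACCESSION_RE = re.compile(r"^\d{10}-\d{2}-\d{6}$")

def validate_insight(insight: dict) -> list[str]:
    """Return a list of validation failures (empty list = passes)."""
    errors = []
    source_id = insight.get("sec_document_id")
    if not source_id:
        errors.append("missing source: no SEC document ID")
    elif not ACCESSION_RE.match(source_id):
        errors.append(f"malformed SEC document ID: {source_id!r}")
    if not insight.get("summary", "").strip():
        errors.append("empty summary")
    return errors

# A well-formed insight passes; a missing or malformed source is flagged.
ok = validate_insight({"sec_document_id": "0000320193-24-000123",
                       "summary": "Apple 10-K risk factor summary"})
bad = validate_insight({"sec_document_id": "320193", "summary": "..."})
```

Returning a list of failures rather than a boolean lets the pipeline log exactly which rule an insight broke, which is what makes automatic flagging of "missing or malformed" sources actionable.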

A particularly interesting implementation detail is the use of “span iframes” to show citations within trace spans. This allows expert reviewers to quickly validate content by seeing the original source material alongside the generated insight. This kind of traceability is essential in financial contexts where regulatory compliance often requires demonstrating the provenance of any claim or recommendation.

Golden Dataset Curation

Fintool maintains curated golden datasets that serve as benchmarks for evaluating LLM output quality. These datasets are tailored to specific industries and document types, such as healthcare compliance or technology KPIs. The approach combines production logs with handpicked examples that reflect real-world scenarios, which helps ensure the benchmarks remain relevant as the system evolves.

The dynamic nature of these golden datasets is noteworthy. Rather than treating evaluation data as static, Fintool continuously refreshes their benchmarks based on production data. This is a mature practice that helps prevent the common problem of evaluation datasets becoming stale and unrepresentative of actual production workloads.
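
A refresh of this kind reduces to merging sampled production cases with curated examples. The function and record schema below are illustrative assumptions, not Fintool's actual pipeline:

```python
import random

def refresh_golden_dataset(production_logs, handpicked, sample_size=50, seed=0):
    """Blend a sample of production cases with curated examples.

    Each case is a dict like {"input": ..., "expected": ..., "review": ...};
    the schema and function are illustrative, not Fintool's actual code.
    """
    rng = random.Random(seed)
    # Keep only logged cases a human approved, so the benchmark reflects
    # verified-good behavior rather than raw model output.
    approved = [c for c in production_logs if c.get("review") == "approved"]
    sampled = rng.sample(approved, min(sample_size, len(approved)))
    # Handpicked examples always stay in; deduplicate by input text so a
    # curated case overrides a logged duplicate.
    merged = {c["input"]: c for c in sampled}
    merged.update({c["input"]: c for c in handpicked})
    return list(merged.values())
```

Re-running something like this on a schedule against fresh logs is what keeps the benchmark from drifting away from production traffic.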

LLM-as-a-Judge Automation

Perhaps the most technically interesting aspect of the workflow is the use of LLM-as-a-judge for automated evaluation. Each generated insight is scored on metrics including accuracy, relevance, and completeness. The case study provides a concrete code example showing a format validation scorer that uses an LLM to check whether output follows a specific structure (business description followed by a markdown list of product lines).

The format validation prompt template demonstrates a simple but effective pattern: describe the exact structure the output must follow, then ask the judge for a binary verdict rather than an open-ended critique.
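
That template isn't reproduced in this write-up; a plausible reconstruction of the pattern, with a stub standing in for a real LLM client, is:

```python
# A plausible reconstruction of a format-validation judge; the prompt
# wording and the `format_scorer` helper are assumptions, not Fintool's code.
FORMAT_JUDGE_PROMPT = """\
You are evaluating the FORMAT of a financial insight, not its content.
The output must contain:
1. A short business description paragraph.
2. A markdown bullet list of product lines.

Output to evaluate:
{output}

Answer with exactly one word: PASS or FAIL."""

def format_scorer(output: str, judge) -> float:
    """Return 1.0 if the judge says the output matches the required
    structure, else 0.0. `judge` is any callable prompt -> completion;
    in production it would wrap an LLM client."""
    verdict = judge(FORMAT_JUDGE_PROMPT.format(output=output))
    return 1.0 if verdict.strip().upper().startswith("PASS") else 0.0

# Stub judge for local testing: approves anything containing a bullet list.
stub_judge = lambda prompt: "PASS" if "\n- " in prompt else "FAIL"
```

The binary PASS/FAIL verdict keeps the judge's job narrow, which tends to make LLM-as-a-judge scores more reproducible than open-ended grading.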

This automated scoring approach is configured to run whenever prompts are adjusted or new data is ingested, providing continuous regression detection. The automation serves a dual purpose: it ensures consistent quality monitoring across the massive volume of generated content, and it frees up human reviewers to focus on the most challenging or ambiguous cases.

Human-in-the-Loop Oversight

Despite heavy automation, Fintool maintains human oversight as an essential component of their quality assurance process. When content receives a low score from automated evaluation or is downvoted by users, a human expert is immediately notified. These experts can approve, reject, or directly edit the content to fix issues like poor formatting or incorrect information.

The integration between Fintool’s database and Braintrust is highlighted as enabling rapid intervention—reviewers can update live content directly from the evaluation UI. This tight integration between monitoring, evaluation, and content management reduces the friction involved in addressing quality issues when they’re detected.
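
The escalation logic described above reduces to a small routing rule. The threshold and field names below are hypothetical; the pattern is what matters:

```python
from dataclasses import dataclass

SCORE_THRESHOLD = 0.7  # illustrative cutoff; the real value isn't disclosed

@dataclass
class Insight:
    id: str
    score: float    # automated LLM-as-judge score in [0, 1]
    downvotes: int  # negative user feedback from the feed

def needs_human_review(insight: Insight) -> bool:
    """Route an insight to an expert when automated scoring or user
    feedback signals a problem; everything else ships without review."""
    return insight.score < SCORE_THRESHOLD or insight.downvotes > 0

def triage(insights: list[Insight]) -> list[str]:
    """IDs of insights an expert should approve, reject, or edit."""
    return [i.id for i in insights if needs_human_review(i)]
```

Keeping the rule this simple means the scarce resource, expert attention, is spent only where automated scores or user feedback disagree with expectations.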

Technical Architecture Considerations

While the case study doesn’t provide deep technical architecture details, several aspects can be inferred from the workflow described:

- A chunked ingestion pipeline that splits filings and transcripts into retrievable units (70 million chunks to date), with gigabytes of new data flowing in daily.
- Trace-level instrumentation in Braintrust, including span iframes that render cited source documents alongside each generated insight.
- Automated evaluation runs triggered whenever prompts are adjusted or new data is ingested, functioning as regression tests.
- A two-way integration between Fintool’s production database and the evaluation UI, allowing reviewers to edit live content in place.

Results and Claimed Benefits

The case study reports several positive outcomes, though specific metrics are notably absent:

- Consistent quality monitoring across the full volume of generated content, beyond what human review alone could cover.
- Human experts freed to focus on the most challenging or ambiguous cases rather than routine checks.
- Faster remediation, since reviewers can correct live content directly from the evaluation UI.
- Sustained accuracy and trust for the institutional investors who rely on the insights.

It’s worth noting that while these claims are plausible given the described architecture, the case study doesn’t provide quantitative improvements (e.g., error rate reductions, reviewer time savings). This is common in vendor case studies but limits the ability to objectively assess the magnitude of benefits.

Key LLMOps Takeaways

This case study illustrates several important LLMOps patterns for production LLM systems:

- Treating evaluation as a continuous process embedded in production rather than a one-time pre-launch gate.
- Curating golden datasets from production logs so benchmarks stay representative of real workloads.
- Using LLM-as-a-judge scoring to cover volumes no human team could review, with humans handling the hard cases.
- Escalating to human experts on low automated scores or negative user feedback.
- Enforcing source traceability (document IDs, citations in trace spans) as a first-class quality requirement in a regulated domain.

The approach described represents a relatively mature LLMOps practice, particularly the integration of evaluation into the production feedback loop and the combination of automated and human oversight. For organizations processing high volumes of LLM-generated content in high-stakes domains, this case study provides a useful reference architecture.

More Like This

Agentic AI Copilot for Insurance Underwriting with Multi-Tool Integration

Snorkel 2025

Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.


Migration of Credit AI RAG Application from Multi-Cloud to AWS Bedrock

Octus 2025

Octus, a leading provider of credit market data and analytics, migrated their flagship generative AI product Credit AI from a multi-cloud architecture (OpenAI on Azure and other services on AWS) to a unified AWS architecture using Amazon Bedrock. The migration addressed challenges in scalability, cost, latency, and operational complexity associated with running a production RAG application across multiple clouds. By leveraging Amazon Bedrock's managed services for embeddings, knowledge bases, and LLM inference, along with supporting AWS services like Lambda, S3, OpenSearch, and Textract, Octus achieved a 78% reduction in infrastructure costs, 87% decrease in cost per question, improved document sync times from hours to minutes, and better development velocity while maintaining SOC2 compliance and serving thousands of concurrent users across financial services clients.


Multi-Agent Financial Research and Question Answering System

Yahoo! Finance 2025

Yahoo! Finance built a production-scale financial question answering system using multi-agent architecture to address the information asymmetry between retail and institutional investors. The system leverages Amazon Bedrock Agent Core and employs a supervisor-subagent pattern where specialized agents handle structured data (stock prices, financials), unstructured data (SEC filings, news), and various APIs. The solution processes heterogeneous financial data from multiple sources, handles temporal complexities of fiscal years, and maintains context across sessions. Through a hybrid evaluation approach combining human and AI judges, the system achieves strong accuracy and coverage metrics while processing queries in 5-50 seconds at costs of 2-5 cents per query, demonstrating production viability at scale with support for 100+ concurrent users.
