ANNA, a UK business banking provider, implemented LLMs to automate transaction categorization for tax and accounting purposes across diverse business types. They achieved this by combining traditional ML with LLMs, particularly focusing on context-aware categorization that understands business-specific nuances. Through strategic optimizations including offline predictions, improved context utilization, and prompt caching, they reduced their LLM costs by 75% while maintaining high accuracy in their AI accountant system.
ANNA is a UK-based business banking application designed for sole traders, small businesses, freelancers, and startups. What distinguishes ANNA from competitors is its chat-based interface and integrated AI accountant that can file corporation tax, VAT, and self-assessment on behalf of customers. The presentation, delivered by Nick Teriffen (Lead Data Scientist at ANNA), focuses primarily on their transaction categorization system and the cost optimization strategies they developed to make LLM-powered categorization economically viable at scale.
The company has multiple LLM applications in production: a customer-facing chatbot that automates approximately 80% of customer requests (payments, invoicing, payroll), an AI accountant for transaction categorization and tax assistance, and LLM-generated account summaries for financial crime prevention. Transaction categorization represented roughly two-thirds of their LLM costs, making it the primary target for optimization efforts.
Transaction categorization for tax purposes requires understanding nuanced business contexts that traditional ML approaches struggle to handle. ANNA illustrates this with a concrete example: a builder and an IT freelancer both purchasing from the same hardware store (B&Q) should have those transactions categorized differently. For the builder, it represents direct costs needed to operate their business, while for the IT freelancer purchasing home office supplies, it’s a home office expense with different tax implications. With customers operating across hundreds of industries, each with their own accounting rules, building comprehensive rule-based systems or traditional ML models for every scenario would be impractical.
ANNA had existing XGBoost and rule-based systems that handled transactions with high confidence, but LLMs were needed for the complex, context-dependent cases. The LLM approach offered several advantages: understanding complex business context, following internal accounting rules, providing clear explanations to customers, handling multimodal inputs (documents attached to transactions), and leveraging real-world knowledge about merchants.
The categorization pipeline works as follows: after a customer's financial year ends, transactions initially categorized by simpler models (XGBoost) enter the LLM pipeline. Transactions with high-confidence predictions are filtered out, leaving only those requiring LLM processing. Customers typically have nine months after year-end to review and correct categorizations for tax purposes, which is critical: it means real-time processing is not required, enabling batch optimization strategies.
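The routing step described above can be sketched as follows. The class, field names, and the 0.9 confidence threshold are illustrative assumptions, not ANNA's actual implementation.

```python
# Sketch of the hybrid routing: high-confidence XGBoost predictions are
# accepted as-is; the rest are queued for LLM categorization.
from dataclasses import dataclass

@dataclass
class Transaction:
    tx_id: str
    description: str
    ml_category: str      # category proposed by the XGBoost model
    ml_confidence: float  # model confidence score in [0, 1]

CONFIDENCE_THRESHOLD = 0.9  # assumed cut-off, not ANNA's real value

def route_transactions(transactions):
    """Split transactions into (accepted, needs_llm) by ML confidence."""
    accepted, needs_llm = [], []
    for tx in transactions:
        if tx.ml_confidence >= CONFIDENCE_THRESHOLD:
            accepted.append(tx)
        else:
            needs_llm.append(tx)
    return accepted, needs_llm
```

Only the `needs_llm` portion ever reaches the expensive batch LLM pipeline, which is what makes the hybrid architecture cost-effective.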
Each LLM request consists of three components: a system prompt containing the accounting rulebook (roughly 50,000 tokens), client-specific business context, and the batch of transactions to categorize.
The output structure includes the predicted category, explanation/reasoning for internal review, citations to the accounting rulebook, and customer-friendly explanations that appear in the app.
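The per-transaction output described above could be modeled as a small structured record; the field names here are assumptions for illustration, not ANNA's actual schema.

```python
# Illustrative shape of one categorization result parsed from the
# model's structured output.
import json
from dataclasses import dataclass

@dataclass
class CategorizationResult:
    transaction_id: str
    category: str               # predicted tax/accounting category
    reasoning: str              # explanation for internal review
    rulebook_citations: list    # references into the accounting rulebook
    customer_explanation: str   # friendly text shown in the app

def parse_result(raw: str) -> CategorizationResult:
    """Parse one JSON object emitted by the model into a typed record."""
    return CategorizationResult(**json.loads(raw))
```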
ANNA chose Anthropic’s Claude 3.5 via Vertex AI in mid-2024 based on evaluation against test sets annotated by accountants. While metrics were similar across models (including OpenAI’s offerings), the team found Claude produced more transparent and customer-friendly explanations, which was a decisive factor.
A critical constraint with Claude 3.5 was the 8,000 token output limit against a 200,000 token context window. This meant that despite having ample input capacity, they could only process approximately 15 transactions per request due to output constraints. For a client with 45 transactions, this required three separate API calls, each consuming the full 50,000+ token system prompt—highly inefficient.
When Anthropic released Claude 3.7 in February 2025 at the same pricing but with the output limit raised to 128,000 tokens (a 16x increase), ANNA could process up to 145 transactions per request, dramatically reducing the number of API calls and therefore costs.
Since transaction categorization didn’t require real-time processing, ANNA leveraged batch prediction APIs. The agreement with LLM providers is straightforward: accept up to 24-hour wait times for predictions in exchange for a 50% discount. All top three providers (OpenAI, Anthropic, Google) offer this same discount rate.
Implementation involves creating a JSON file with requests, uploading to cloud storage (Google Cloud Storage for Vertex AI), submitting a batch job request, and polling for results. Important considerations include TTL (time-to-live)—Anthropic gives 29 days to fetch predictions before they’re deleted.
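The preparation step above can be sketched as building one JSON request per line of the batch input file. The field names here are generic assumptions; each provider documents its own exact batch request schema.

```python
# Sketch: assemble batch-prediction requests, one JSON object per line.
import json

def build_batch_requests(system_prompt, transaction_batches):
    """Return one JSON string per line of the batch input file."""
    lines = []
    for i, batch in enumerate(transaction_batches):
        request = {
            "custom_id": f"batch-{i}",   # lets results be matched back
            "system": system_prompt,     # shared accounting rulebook
            "messages": [
                {"role": "user", "content": json.dumps(batch)},
            ],
        }
        lines.append(json.dumps(request))
    return lines

# Remaining steps, omitted here: write the lines to a .jsonl file,
# upload it (Google Cloud Storage for Vertex AI), submit the batch job,
# poll for completion, and fetch results before the ~29-day TTL.
```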
An interesting note from Google’s documentation suggested batch predictions may improve consistency by processing descriptions simultaneously, maintaining uniform tone and style. While unexpected, this suggests batch processing might affect quality in ways beyond just cost.
By upgrading to Claude 3.7 with its expanded output window, ANNA reduced the number of API calls significantly. Instead of consuming 180,000 input tokens to get 45 predictions (three calls), they could now get all 45 predictions in a single call. This alone provided approximately 22 percentage points of additional savings on top of the 50% from offline predictions.
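The arithmetic behind that reduction can be made explicit. The system prompt size follows the case study's "50,000+ tokens"; the per-transaction token count is an assumed round number for illustration.

```python
# Back-of-the-envelope input-token comparison: three calls vs one call.
SYSTEM_PROMPT_TOKENS = 55_000   # "50,000+"-token rulebook system prompt
TOKENS_PER_TRANSACTION = 300    # assumed, for illustration
N_TRANSACTIONS = 45

def total_input_tokens(n_calls, n_transactions):
    """Input tokens when the work is split across n_calls requests."""
    return n_calls * SYSTEM_PROMPT_TOKENS + n_transactions * TOKENS_PER_TRANSACTION

before = total_input_tokens(3, N_TRANSACTIONS)  # Claude 3.5: ~15 tx per call
after = total_input_tokens(1, N_TRANSACTIONS)   # Claude 3.7: all 45 in one call
reduction = 1 - after / before                  # fraction of input tokens saved
```

Under these assumptions `before` lands near the ~180,000 input tokens cited above, and the single-call version cuts input tokens by well over half, because the dominant cost is re-sending the system prompt.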
Prompt caching allows providers to reuse computed key-value caches for static portions of prompts. For ANNA, the 50,000 token accounting rulebook is static across all requests, making it an ideal caching candidate.
Pricing varies by provider: Anthropic charges a surcharge (roughly 25%) to write a prompt into the cache but discounts cache reads by about 90%, while OpenAI applies a flat discount (typically 50%) to cached input tokens. TTL considerations are also important: cached prompts expire quickly unless refreshed, which affects implementation strategy. OpenAI's approach of automatically caching all sufficiently long prompts simplifies implementation.
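With Anthropic-style explicit caching, the static rulebook block is marked with `cache_control` so its computed key-value cache can be reused across requests. The request body is shown as a plain dict; the model name is illustrative.

```python
# Sketch of a request body with the static rulebook marked cacheable
# (Anthropic-style explicit prompt caching).
def build_cached_request(rulebook_text, transactions_text):
    return {
        "model": "claude-3-7-sonnet",  # illustrative model name
        "max_tokens": 8192,
        "system": [
            {
                "type": "text",
                "text": rulebook_text,                   # ~50k static tokens
                "cache_control": {"type": "ephemeral"},  # cache this block
            }
        ],
        "messages": [
            # only this part varies request to request
            {"role": "user", "content": transactions_text},
        ],
    }
```

Keeping the cacheable block first and byte-identical across requests is what makes cache hits possible; any change to the rulebook text invalidates the cached prefix.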
Beyond cost savings, prompt caching reduces latency by up to 80%. ANNA noted this is particularly valuable for their upcoming phone-based tax consultation feature, where conversational latency is critical.
ANNA implemented application-level caching by categorizing unique transaction patterns per client once and storing results for reuse, avoiding redundant LLM calls entirely.
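The application-level cache described above amounts to memoization keyed on the client and a normalized transaction pattern. The normalization rule here is an assumption for illustration.

```python
# Sketch: categorize each unique (client, pattern) once, then reuse.
_cache = {}

def normalize(description: str) -> str:
    """Collapse a raw description to a reusable pattern key."""
    return " ".join(description.lower().split())

def categorize(client_id, description, llm_call):
    key = (client_id, normalize(description))
    if key not in _cache:
        _cache[key] = llm_call(description)  # LLM invoked only on a miss
    return _cache[key]
```

Because the cache is keyed per client, the same merchant can still be categorized differently for a builder and an IT freelancer, as in the B&Q example above.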
Applying these strategies to transaction categorization yielded cumulative savings of roughly 75%: 50% from offline (batch) predictions, approximately 22 further percentage points from better context utilization with Claude 3.7's larger output window, and about 3 points from prompt caching. The order of implementation matters and depends on use case specifics; ANNA's sequence was offline predictions first, then context utilization improvements, then caching.
When processing more than 100 transactions per request, ANNA observed increased hallucinations—specifically, transaction IDs that were either fabricated or contained errors, making predictions impossible to attribute. This increased from 0% to 2-3% of outputs. Their mitigation was to reduce batch size from the theoretical maximum of 145 transactions to approximately 120, finding a balance between efficiency and quality.
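A cheap guard against this failure mode is to validate predicted IDs against the input batch and flag anything unknown for retry or review; a minimal sketch, with the prediction shape assumed:

```python
# Flag predictions whose transaction ID was never in the input batch
# (fabricated or corrupted IDs cannot be attributed to a transaction).
def validate_ids(input_ids, predictions):
    """Split predictions into (valid, hallucinated) by transaction ID."""
    known = set(input_ids)
    valid = [p for p in predictions if p["transaction_id"] in known]
    hallucinated = [p for p in predictions if p["transaction_id"] not in known]
    return valid, hallucinated
```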
Anthropic explicitly states prompt caching works on a “best effort basis” with no guarantees. Combined with offline predictions, cache hits become less predictable since request scheduling is less deterministic. ANNA’s actual benefit was only 3 percentage points, suggesting real-world gains may be modest.
ANNA discovered an error in their input preparation script that performed double JSON serialization, adding unnecessary characters that increased token consumption. Manual inspection of inputs after changes is recommended.
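The bug class is easy to reproduce: serializing an already-serialized string escapes every quote and inflates the payload. The payload contents here are illustrative.

```python
# Double JSON serialization: json.dumps applied to a string that is
# already JSON wraps it in quotes and escapes its inner quotes,
# silently inflating token counts.
import json

payload = {"description": "B&Q purchase", "amount": 42.5}
once = json.dumps(payload)   # correct: a JSON object
twice = json.dumps(once)     # bug: a JSON *string* containing escaped JSON
# `twice` is strictly longer than `once`; a quick manual inspection of
# prepared inputs (as recommended above) catches this immediately.
```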
Every optimization potentially affects output quality. Google’s claim about batch predictions improving consistency, combined with ANNA’s experience with hallucinations in longer contexts, reinforces the need to re-evaluate metrics after any optimization—even those seemingly unrelated to quality.
ANNA considered RAG for the accounting rulebook but found multi-hop query requirements problematic. Tax questions often require multiple pieces of information (tax rates, dividend allowances, etc.), and LLM-generated search queries for RAG performed poorly. They opted to include the entire rulebook in context, accepting higher token costs for better quality, with plans to potentially revisit RAG as capabilities improve.
ANNA evaluated self-hosted LLM solutions but determined they weren’t large enough to justify the infrastructure investment and expertise required. They noted this might change with scale, particularly for real-time categorization in new markets (they launched in Australia), where latency constraints make self-hosting more attractive.
The presenter noted the dramatic decrease in LLM costs—approximately 10x per year at equivalent capability levels. However, this doesn’t necessarily translate to lower absolute spending, as improved capabilities unlock new business opportunities that may increase overall consumption.
For error handling, ANNA uses simple retry logic for failed predictions rather than sophisticated fallback mechanisms. Combined with the non-real-time nature of their use case, this proves sufficient.
The hybrid approach—using traditional ML (XGBoost, rules) for high-confidence cases and reserving LLMs for complex, context-dependent scenarios—represents a pragmatic production architecture that balances cost, quality, and latency requirements.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
Stripe, processing approximately 1.3% of global GDP, has evolved from traditional ML-based fraud detection to deploying transformer-based foundation models for payments that process every transaction in under 100ms. The company built a domain-specific foundation model treating charges as tokens and behavior sequences as context windows, ingesting tens of billions of transactions to power fraud detection, improving card-testing detection from 59% to 97% accuracy for large merchants. Stripe also launched the Agentic Commerce Protocol (ACP) jointly with OpenAI to standardize how agents discover and purchase from merchant catalogs, complemented by internal AI adoption reaching 8,500 employees daily using LLM tools, with 65-70% of engineers using AI coding assistants and achieving significant productivity gains like reducing payment method integrations from 2 months to 2 weeks.
Digits, a company providing automated accounting services for startups and small businesses, implemented production-scale LLM agents to handle complex workflows including vendor hydration, client onboarding, and natural language queries about financial books. The company evolved from a simple 200-line agent implementation to a sophisticated production system incorporating LLM proxies, memory services, guardrails, observability tooling (Phoenix from Arize), and API-based tool integration using Kotlin and Golang backends. Their agents achieve a 96% acceptance rate on classification tasks with only 3% requiring human review, handling approximately 90% of requests asynchronously and 10% synchronously through a chat interface.