
Building a Secure and Scalable LLM Gateway for Financial Services

Wealthsimple 2023

Wealthsimple, a Canadian FinTech company, developed a comprehensive LLM platform to securely leverage generative AI while protecting sensitive financial data. They built an LLM gateway with built-in security features, PII redaction, and audit trails, eventually expanding to include self-hosted models, RAG capabilities, and multi-modal inputs. The platform achieved widespread adoption with over 50% of employees using it monthly, leading to improved productivity and operational efficiencies in client service workflows.

Industry

Finance


Wealthsimple’s LLM Platform Journey: From Gateway to Production Operations

Overview

Wealthsimple is a Canadian FinTech company focused on helping Canadians achieve financial independence through a unified app for investing, saving, and spending. Their generative AI efforts are organized into three streams: employee productivity (the original focus), optimizing operations for clients, and the underlying LLM platform that powers both. This case study documents approximately two years of LLM journey from late 2022 through 2024, covering the technical implementations, organizational learnings, and strategic pivots that characterized their experience bringing LLMs into production.

The company’s approach represents a pragmatic evolution from initial excitement through a “trough of disillusionment” to more deliberate, business-aligned applications. Their key wins include an open-sourced LLM gateway used by over half the company, an in-house PII redaction model, self-hosted open-source LLMs, platform support for fine-tuning with hardware acceleration, and production LLM systems optimizing client operations.

The LLM Gateway: Security-First Foundation

When ChatGPT launched in November 2022, Wealthsimple recognized both the potential and the security risks. Several companies, most famously Samsung, banned ChatGPT after employees inadvertently shared sensitive data with a third party. Rather than banning the technology, Wealthsimple built an LLM gateway to address security concerns while enabling exploration.

The first version of the gateway was relatively simple: it maintained an audit trail tracking what data was sent externally, where it was sent, and who sent it. The gateway was deployed behind a VPN, gated by Okta for authentication, and proxied conversations to various LLM providers like OpenAI. Users could select different models from a dropdown, and production systems could interact programmatically through an API endpoint that included retry and fallback mechanisms for reliability. Early features included the ability to export and import conversations across platforms.
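The retry-and-fallback behavior described above can be sketched in a few lines; the provider functions and error type here are hypothetical stand-ins for the gateway's proxied endpoints, not Wealthsimple's actual implementation:

```python
import time

class ProviderError(Exception):
    """Transient failure from an upstream LLM provider (rate limit, timeout)."""

def call_with_fallback(prompt, providers, retries=2, backoff=0.01):
    """Try each provider in order; retry transient failures before falling back."""
    errors = []
    for provider in providers:
        for attempt in range(retries):
            try:
                return provider(prompt)
            except ProviderError as exc:
                errors.append(f"{provider.__name__}: {exc}")
                time.sleep(backoff * (2 ** attempt))  # simple exponential backoff
    raise RuntimeError("all providers failed: " + "; ".join(errors))

# Hypothetical providers standing in for proxied external and self-hosted models.
def flaky_openai(prompt):
    raise ProviderError("rate limited")

def local_llama(prompt):
    return f"[local] answer to: {prompt}"

print(call_with_fallback("What is an RRSP?", [flaky_openai, local_llama]))
# → [local] answer to: What is an RRSP?
```

In the real gateway, each call would also append to the audit trail (who sent what, where) before leaving the VPC.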

A significant challenge was adoption. Many employees viewed the gateway as a “bootleg version of ChatGPT” with little incentive to use it. Wealthsimple applied their philosophy of making “the right way the easy way” through a combination of carrots and soft sticks. The carrots included free usage (company-paid costs), optionality across multiple LLM providers, improved reliability through retry mechanisms, negotiated rate limit increases with OpenAI, and integrated APIs with staging and production environments. The soft sticks included “nudge mechanisms” - gentle Slack reminders when employees visited ChatGPT directly. Interestingly, the nudges were later removed in 2024 because they proved ineffective; people became conditioned to ignore them, and platform improvements proved to be stronger drivers of behavioral change.

PII Redaction: Closing Security Gaps

In June 2023, Wealthsimple shipped their own PII redaction model, leveraging Microsoft’s Presidio framework along with an internally-developed Named Entity Recognition (NER) model. This system detected and redacted potentially sensitive information before sending to external LLM providers.
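Wealthsimple's pipeline combined Presidio's recognizers with a custom NER model; a heavily simplified regex stand-in is enough to illustrate the redact-before-send flow (patterns and labels here are illustrative only):

```python
import re

# Toy stand-in for the Presidio + NER pipeline: two regex-detectable PII types.
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "SIN": re.compile(r"\b\d{3}[- ]\d{3}[- ]\d{3}\b"),  # Canadian SIN format
}

def redact(text):
    """Replace detected PII spans with typed placeholders before the text
    leaves the gateway for an external provider."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text

print(redact("Client jane@example.com, SIN 046-454-286, asked about fees."))
# → Client <EMAIL>, SIN <SIN>, asked about fees.
```

The user-experience gap discussed next follows directly from this design: any false positive silently rewrites the prompt the model actually sees.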

However, closing the security gap introduced a user experience gap. The PII redaction model wasn’t always accurate, interfering with answer relevance. More fundamentally, many employees needed to work with PII data as part of their jobs. This friction led to the next major investment: self-hosted open-source LLMs.

Self-Hosted LLMs: Data Sovereignty

To address the PII challenge, Wealthsimple built a framework around llama.cpp, a C++ inference engine for running quantized models, to self-host open-source LLMs within their own VPCs. The first three self-hosted models were Llama 2, Mistral, and Whisper (OpenAI's open-source speech-to-text model, included under the LLM platform umbrella for simplicity). By hosting within their cloud environment, they eliminated the need for PII redaction since data never left their infrastructure.

RAG Implementation and Vector Database

Following self-hosted LLMs, Wealthsimple introduced retrieval-augmented generation (RAG) as an API. They deliberately chose Elasticsearch (later AWS’s managed OpenSearch) as their vector database because it was already part of their stack, making it an easy initial choice. They built pipelines and DAGs in Airflow, their orchestration framework, to update and index common knowledge bases. The initial RAG offering was a simple semantic search API.
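The shape of such a semantic search API can be sketched with plain cosine similarity; the real system stored embeddings in OpenSearch and refreshed them via Airflow DAGs, so the in-memory index and toy vectors below are purely illustrative:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical pre-computed document embeddings (real pipeline: Airflow jobs
# embedding knowledge bases into OpenSearch).
INDEX = {
    "refund-policy": [0.9, 0.1, 0.0],
    "rrsp-limits": [0.1, 0.8, 0.2],
    "oncall-runbook": [0.0, 0.2, 0.9],
}

def semantic_search(query_embedding, k=2):
    """Return the top-k document ids ranked by similarity to the query."""
    ranked = sorted(INDEX, key=lambda doc: cosine(query_embedding, INDEX[doc]),
                    reverse=True)
    return ranked[:k]

print(semantic_search([0.85, 0.15, 0.05], k=1))  # → ['refund-policy']
```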

Despite demand for grounding capabilities and the intuitive value proposition, engagement and adoption were low. People weren’t expanding knowledge bases or extending APIs as expected. The team realized there was still too much friction in experimentation and feedback loops.

Data Applications Platform: Enabling Experimentation

To address the experimentation gap, Wealthsimple built an internal Data Applications Platform running on Python and Streamlit—technologies familiar to their data scientists. Behind Okta and VPN, this platform made it easy to build and iterate on applications with fast feedback loops. Stakeholders could quickly see proof-of-concept applications and provide input.

Within two weeks of launch, seven applications were running on the platform, and two eventually made it to production, optimizing operations and improving client experience. This represented a key insight: reducing friction in the experimentation-to-production pipeline was crucial for adoption.

Platform Architecture

The mature LLM platform layered the capabilities described above: the security-focused gateway at the base, self-hosted models and the RAG API in the middle, and the Data Applications Platform for building end-user tools on top.

Boosterpack: Internal Knowledge Assistant

At the end of 2023, Wealthsimple built “Boosterpack,” a personal assistant grounded against company context. It featured three types of knowledge bases: public (accessible to everyone with source code, help articles, financial newsletters), private (personal documents for each employee), and limited (shared with specific coworkers by role and project). Built on the data applications platform, it included question-answering with source attribution for fact-checking.
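The three-tier knowledge-base model reduces to a simple visibility check at query time; the tier names follow the text, while the field names and sample data below are hypothetical:

```python
# Toy registry of knowledge bases across the three access tiers.
KNOWLEDGE_BASES = [
    {"name": "help-articles", "tier": "public"},
    {"name": "jane-notes", "tier": "private", "owner": "jane"},
    {"name": "fraud-playbook", "tier": "limited", "members": {"jane", "ali"}},
]

def visible_bases(user):
    """Return the knowledge bases a given user may ground queries against."""
    out = []
    for kb in KNOWLEDGE_BASES:
        if kb["tier"] == "public":
            out.append(kb["name"])
        elif kb["tier"] == "private" and kb.get("owner") == user:
            out.append(kb["name"])
        elif kb["tier"] == "limited" and user in kb.get("members", ()):
            out.append(kb["name"])
    return out

print(visible_bases("jane"))  # → ['help-articles', 'jane-notes', 'fraud-playbook']
print(visible_bases("sam"))   # → ['help-articles']
```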

Despite initial excitement, Boosterpack didn’t achieve the transformative adoption expected. The team learned that bifurcating tools created friction—even when intuitively valuable, user behavior often surprised them.

2024: Strategic Evolution

2024 marked a significant shift in strategy. After the “peak of inflated expectations” in 2023, the team became more deliberate about business alignment. There was less appetite for speculative bets and more focus on concrete value creation.

Key 2024 developments included:

Removing nudge mechanisms: The gentle Slack reminders were abandoned because they weren’t changing behavior—the same people kept getting nudged and conditioned to ignore them.

Expanding LLM providers: Starting with Gemini (attracted by the 1M+ token context window), followed by other providers. The focus shifted from chasing state-of-the-art models (which changed weekly) to tracking higher-level trends.

Multi-modal inputs: Leveraging Gemini's capabilities, they added image and PDF upload features. Within weeks, nearly a third of users were using multi-modal features at least weekly. A common use case was sharing screenshots of error messages for debugging help, an antipattern that frustrates human developers but that LLMs handle gracefully.

Adopting Bedrock: This marked a shift in build-versus-buy strategy. Bedrock (AWS's managed service for foundation models) overlapped significantly with internal capabilities, but 2024's more deliberate strategy led to reevaluation. Key considerations included baseline security requirements, time-to-market and cost, and the opportunity cost of building versus buying. External vendors had improved their security practices (zero-day data retention, cloud integration), and the team recognized their leverage lay in business-specific applications rather than recreating marketplace tools.

API standardization: The initial API structure didn't mirror OpenAI's spec, causing integration headaches with LangChain and other frameworks, which had to be monkey-patched around. In September 2024, they shipped v2 with an OpenAI-compatible API, an important lesson about adopting emerging industry standards early.
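The payoff of OpenAI compatibility is that off-the-shelf clients and frameworks only need a base-URL swap. A minimal sketch of the request/response shapes involved, with a hypothetical gateway URL and model name:

```python
import json

# Hypothetical internal endpoint; with an OpenAI-compatible spec, SDKs and
# LangChain can point here instead of api.openai.com with no monkey patching.
GATEWAY_BASE = "https://llm-gateway.internal/v2"

def build_chat_request(model, user_message):
    """Assemble an OpenAI-style /chat/completions request payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }

def extract_answer(response):
    """Pull the assistant text out of an OpenAI-style response body."""
    return response["choices"][0]["message"]["content"]

req = build_chat_request("llama-2-13b", "Summarize this ticket.")
print(json.dumps(req, indent=2))
```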

Production Use Case: Client Experience Triaging

A concrete production success was optimizing client experience triaging. Previously, a dedicated team manually read tickets and phone calls to route them appropriately—an unenjoyable and inefficient workflow. They developed a transformer-based classification model for email tickets.

With the LLM platform, two improvements were made: Whisper transcribed phone calls to text, extending triaging beyond emails, and self-hosted LLM generations enriched the classification. This resulted in significant performance improvements, translating to hours saved for both agents and clients.
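A toy sketch of this enriched triage flow, with stub functions standing in for the self-hosted Whisper and LLM calls (all names, routes, and keyword rules are hypothetical):

```python
# Stubs standing in for self-hosted Whisper (speech-to-text) and an LLM.
def transcribe(ticket):
    return ticket["transcript"]  # real system: Whisper on the call audio

def generate(prompt):
    return f"summary: {prompt}"  # real system: self-hosted LLM enrichment

# Hypothetical routing rules; the real system used a transformer classifier.
ROUTES = {"transfer": "money-movement-team", "tax": "tax-team"}

def triage(ticket):
    """Route emails and transcribed phone calls using LLM-enriched text."""
    text = transcribe(ticket) if ticket.get("kind") == "call" else ticket["body"]
    features = text + " " + generate(f"Summarize the request: {text}")
    for keyword, team in ROUTES.items():
        if keyword in features:
            return team
    return "general-queue"

print(triage({"kind": "call", "transcript": "I sent the same transfer twice"}))
# → money-movement-team
```

The key structural point survives the simplification: transcription lets one classifier serve both channels, and the generated enrichment becomes an extra feature for routing.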

Usage Patterns and Learnings

Internal analysis revealed a key behavioral insight: tools are most valuable when injected where work already happens, and context-switching between platforms is a major detractor. Even as tools proliferated, most users stuck to a single tool. The Boosterpack experience reinforced that centralizing tools matters more than adding features.

Current State and Future Direction

As of late 2024, Wealthsimple’s LLM platform sees over 2,200 daily messages, with approximately one-third of the company as weekly active users and over half as monthly active users. The foundations built for employee productivity have paved the way for production systems optimizing client operations.

The team positions themselves as ascending the “slope of enlightenment” after the sobering experience of 2024’s trough of disillusionment. The emphasis on security guardrails, platform investments, and business alignment—rather than chasing the latest models—appears to be paying off in sustainable adoption and genuine productivity gains. For 2025, they’re evaluating deeper Bedrock integration and continuing to refine their build-versus-buy strategy as the vendor landscape evolves.
