RBC developed an internal RAG (Retrieval Augmented Generation) system called Arcane to help financial advisors quickly access and interpret complex investment policies and procedures. The system addresses the challenge of finding relevant information across semi-structured documents, reducing the time specialists spend searching through documentation. The solution combines advanced parsing techniques, vector databases, and LLM-powered generation with a chat interface, while implementing robust evaluation methods to ensure accuracy and prevent hallucinations.
This case study comes from a presentation by Dr. Ahsan Mujahid, Head of AI Solution Acceleration and Innovation at RBC (Royal Bank of Canada), describing their internal RAG system called “Arcane.” The system was designed to help investment advisors quickly locate and understand investment policies, addressing a critical bottleneck in financial services where highly trained specialists often face significant backlogs when trying to answer complex policy questions.
The presentation was delivered at a TMLS (Toronto Machine Learning Summit) event and provides valuable insights into deploying RAG systems in highly regulated financial environments with proprietary data. Dr. Mujahid emphasized that while they completed the pilot phase and tested everything in production, the scaling to millions of users would be handled by Borealis AI, RBC’s enterprise AI branch.
Financial operations, particularly in direct investment advisory services, are extremely complex. The domain encompasses core banking systems, treasury management, mobile banking, asset liability management, foreign exchange, and various investment types. Training specialists in these areas takes years, and these individuals typically arrive with 5-10 years of post-secondary STEM education, often including graduate studies.
The core problem was a service bottleneck: when clients have questions about investment policies, advisors need to quickly find the right information from vast amounts of semi-structured documentation. Every second of delay has multiplicative productivity impacts, often translating to millions of dollars in bottom-line impact. The challenge was compounded by the fact that information about programs like the Home Buyers’ Plan (HBP) and other investment policies was scattered across multiple semi-structured sources that were difficult to search efficiently.
The Arcane system follows a standard RAG architecture but with significant customization for the financial domain. The user interface includes a chat interface with suggested frequent questions for the specific use case, plus chat memory/management functionality on the left side. The chat memory feature proved particularly important for agents specializing in subdomains, though it also introduced challenges that will be discussed later.
The speaker identified parsing of semi-structured data as the single biggest technical challenge they faced. Semi-structured data in their context included XML-like formats, HTML, PDF files, and internal/external websites. These formats are notoriously difficult to work with because structure and layout vary widely from source to source, and meaningful content is interleaved with markup and formatting noise.
The solution emphasized investing heavily in robust parsing capabilities before focusing on other components. Without proper parsing and chunking, even the best LLMs struggle to retrieve the right information, especially given that these are highly specialized internal materials that external models would never have seen during training.
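The talk does not detail RBC's actual parser, but the idea of structure-aware parsing can be sketched with Python's standard-library `html.parser`: text is grouped under its section headings, so downstream chunks retain the policy context they came from. The HTML snippet and the `SectionParser` class below are illustrative, not the production implementation.

```python
from html.parser import HTMLParser

class SectionParser(HTMLParser):
    """Collect text under each heading so chunks keep their policy-section context."""

    def __init__(self):
        super().__init__()
        self.sections = {}          # heading text -> list of paragraph strings
        self._current_heading = "preamble"
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self._in_heading = True
            self._current_heading = ""

    def handle_endtag(self, tag):
        if tag in ("h1", "h2", "h3"):
            self._in_heading = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_heading:
            self._current_heading += text
        else:
            self.sections.setdefault(self._current_heading, []).append(text)

html_doc = """
<h2>Home Buyers' Plan</h2>
<p>Eligible participants may withdraw funds for a first home purchase.</p>
<h2>Withdrawal Fees</h2>
<p>Standard withdrawal fees may be waived for registered plans.</p>
"""

parser = SectionParser()
parser.feed(html_doc)
for heading, paragraphs in parser.sections.items():
    print(heading, "->", " ".join(paragraphs))
```

Keeping the heading alongside each paragraph is what lets a retriever later distinguish, say, an HBP withdrawal rule from a generic fee clause.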
The presentation devoted significant attention to RAG evaluation, which was described as essential for mitigating the risks of irrelevant or hallucinated responses. In financial advisory contexts, the stakes are particularly high - providing false information to clients could have serious consequences.
The team utilized RAGAS (Retrieval-Augmented Generation Assessment), an open-source evaluation framework, as their primary evaluation tool. RAGAS's core metrics include faithfulness (is the answer grounded in the retrieved context?), answer relevancy (does the answer actually address the question?), and context-quality measures such as context precision and context recall.
The distinction between faithfulness and relevancy is crucial - a response can be accurate with respect to the context but completely irrelevant to what was actually asked.
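That distinction can be made concrete with two toy scoring functions. Real RAGAS uses an LLM judge to extract and verify claims and embedding similarity for relevancy; the substring and token-overlap proxies below are simplifications for illustration only.

```python
def faithfulness(claims, context):
    """RAGAS-style faithfulness: fraction of answer claims supported by the
    retrieved context. (Toy proxy: a claim counts as supported if it appears
    verbatim in the context; real RAGAS verifies claims with an LLM.)"""
    if not claims:
        return 0.0
    supported = sum(1 for claim in claims if claim.lower() in context.lower())
    return supported / len(claims)

def answer_relevancy(question, answer):
    """Toy proxy for answer relevancy: token overlap between question and
    answer. (Real RAGAS compares embeddings of LLM-generated questions.)"""
    q = set(question.lower().split())
    a = set(answer.lower().split())
    return len(q & a) / len(q) if q else 0.0

context = "HBP withdrawals are limited to $35,000 per eligible participant."
claims = ["withdrawals are limited to $35,000", "funds must be repaid in 5 years"]
print(faithfulness(claims, context))  # 0.5: only one of the two claims is grounded
```

An answer that restates the context perfectly scores high on faithfulness, yet if the user asked about withdrawal fees rather than limits, its relevancy score would be low, which is exactly the failure mode the two metrics separate.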
Beyond RAGAS, the team employed multiple complementary evaluation approaches, including techniques aimed specifically at assessing relevance.
The TruEra evaluation triad visualization was also referenced as a useful conceptual framework, showing the relationships between query-context (context relevance), query-response (answer relevance), and context-response (groundedness).
The team experimented with different OpenAI models and found notable trade-offs between generation quality and latency: stronger models produced better answers but responded more slowly.
For embeddings, smaller models were deemed acceptable, though better models do produce better results. The key insight was that generation quality (where users interact directly) warranted the best available models, while embeddings for indexing and retrieval could use more efficient options.
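That tiering could be captured in a configuration like the sketch below; the model names and parameters are illustrative assumptions, not RBC's actual choices.

```python
# Illustrative model tiering: generation is user-facing, so it gets the
# strongest model; embeddings run over every chunk at index time, so a
# smaller, cheaper model keeps indexing costs manageable.
MODEL_CONFIG = {
    "generation": {"model": "gpt-4", "temperature": 0.0},
    "embedding": {"model": "text-embedding-3-small", "dimensions": 1536},
}
```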
The pilot concluded “a few months ago” relative to the presentation, and the speaker acknowledged that model performance may have improved since then. The latency wasn’t a showstopper, since even slower responses were better than waiting minutes on a call, but it remained a consideration for production scaling.
The speaker emphasized the importance of profiling the entire solution to identify bottlenecks - not just computational bottlenecks but also where things are most likely to go wrong. For their system, parsing was the biggest bottleneck and required the most effort. Vector databases themselves were not a bottleneck, though understanding how to use them properly was important.
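A minimal way to do that profiling is to time each pipeline stage with a context manager and compare totals; the stage names and sleeps below are stand-ins for real parse/retrieve/generate work.

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def profile(stage):
    """Accumulate wall-clock time per pipeline stage to locate the bottleneck."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = timings.get(stage, 0.0) + time.perf_counter() - start

# Stand-in stages for a RAG pipeline (real work would replace the sleeps).
with profile("parse"):
    time.sleep(0.05)   # parsing semi-structured documents
with profile("embed_and_retrieve"):
    time.sleep(0.01)   # vector search
with profile("generate"):
    time.sleep(0.02)   # LLM call

bottleneck = max(timings, key=timings.get)
print(bottleneck)  # -> "parse"
```

The same pattern extends beyond latency: counting parse failures or empty retrievals per stage reveals where things are "most likely to go wrong," not just where they are slow.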
A critical lesson was the importance of not applying RAG indiscriminately. Technologically-biased individuals may be tempted to use the latest tools for every problem, while non-technical stakeholders might see RAG as a magic solution after seeing demonstrations. Technical team members should not shy away from telling partners when RAG isn’t the right solution for a particular problem.
One of the most significant production risks identified was the “curse of joint probabilities” in multi-turn conversations. While chat memory and ongoing conversation support provide user experience benefits (not having to re-explain context), they introduce compounding error risks. The speaker noted that if you ask five questions back-to-back, the probability that the fifth answer is completely false becomes very high, even when the correct answer exists in the context.
Careful prompt engineering can help mitigate this but cannot completely solve it. Fine-tuning could help but requires significant data and has cost implications. Perhaps most concerningly, the fifth incorrect answer will often look very convincing, making it harder for users to identify errors.
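The arithmetic behind the "curse of joint probabilities" is easy to make concrete. Assuming (optimistically) that errors are independent across turns, the chance that every answer in an n-turn conversation is correct is pⁿ:

```python
def p_all_correct(per_turn_accuracy, turns):
    """Probability that every answer in a multi-turn conversation is correct,
    assuming (optimistically) independent errors across turns."""
    return per_turn_accuracy ** turns

# Even a strong 95%-per-turn system degrades quickly over a conversation:
for n in range(1, 6):
    print(n, round(p_all_correct(0.95, n), 3))
# By turn 5 the chance of an error-free conversation is ~0.774, so roughly
# one in four five-turn conversations contains at least one wrong answer.
```

In practice the situation is worse than this model suggests, because chat memory feeds earlier errors back into later turns, so the errors are not independent.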
Given the limitations of systematic and systemic risk mitigation, educating users becomes essential. Users need to understand the system’s limitations, in particular that answers late in a multi-turn conversation can be both wrong and highly convincing, and that critical answers should be verified against the source policies.
This human guardrail cannot be replaced by technical measures alone in high-stakes financial advisory contexts.
The speaker referenced a separate hour-long talk on security and privacy but highlighted several key concerns:
One highlighted privacy attack allows adversaries with programmatic API access to reconstruct training data, and potentially the model itself, through repeated, consistent queries. This is particularly concerning for models fine-tuned on proprietary financial data.
Another concern is prompt-based leakage: models can be manipulated into revealing information they have been exposed to, which is especially dangerous when the model has access to proprietary organizational information.
The location of model execution also matters significantly in regulated industries: on-premises and cloud deployments carry different data-residency, privacy, and compliance implications.
Data loss prevention (DLP) was likewise mentioned as “very important” with “a lot of things that can go wrong,” though specifics were deferred due to time constraints.
In response to a question, the speaker provided practical guidance on chunking strategy.
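The specific recommendations were not captured here, but a common baseline is fixed-size chunking with overlap, so that a policy clause straddling a chunk boundary still appears whole in at least one chunk. The sizes below are illustrative, not the speaker's numbers.

```python
def chunk_text(text, chunk_size=500, overlap=100):
    """Split text into overlapping character windows; the overlap reduces the
    chance that a policy clause is cut in half at a chunk boundary."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # avoid a degenerate tail chunk that is pure overlap
    return chunks

doc = "The Home Buyers' Plan lets eligible participants withdraw funds. " * 20
chunks = chunk_text(doc, chunk_size=200, overlap=40)
# Adjacent chunks share 40 characters of context:
assert chunks[0][-40:] == chunks[1][:40]
```

In production, structure-aware chunking (splitting on the section boundaries recovered during parsing) usually beats raw character windows, which is consistent with the talk's emphasis on parsing quality as the foundation for retrieval.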
The pilot phase was completed and tested in production, with the system handling questions about investment policies including the Home Buyers’ Plan (HBP), withdrawal fees, eligibility criteria, and family-related policy questions. The next phase involving scaling to millions of users was being handed off to Borealis AI, RBC’s dedicated AI branch, demonstrating a thoughtful approach to organizational responsibility - innovation teams prototype and validate, while enterprise teams handle production scaling.
Yahoo! Finance built a production-scale financial question answering system using multi-agent architecture to address the information asymmetry between retail and institutional investors. The system leverages Amazon Bedrock Agent Core and employs a supervisor-subagent pattern where specialized agents handle structured data (stock prices, financials), unstructured data (SEC filings, news), and various APIs. The solution processes heterogeneous financial data from multiple sources, handles temporal complexities of fiscal years, and maintains context across sessions. Through a hybrid evaluation approach combining human and AI judges, the system achieves strong accuracy and coverage metrics while processing queries in 5-50 seconds at costs of 2-5 cents per query, demonstrating production viability at scale with support for 100+ concurrent users.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
Martin Der, a data scientist at Xomnia, presents practical approaches to GenAI governance addressing the challenge that only 5% of GenAI projects deliver immediate ROI. The talk focuses on three key pillars: access and control (enabling self-service prototyping through tools like Open WebUI while avoiding shadow AI), unstructured data quality (detecting contradictions and redundancies in knowledge bases through similarity search and LLM-based validation), and LLM ops monitoring (implementing tracing platforms like LangFuse and creating dynamic golden datasets for continuous testing). The solutions include deploying Chrome extensions for workflow integration, API gateways for centralized policy enforcement, and developing a knowledge agent called "Genie" for internal use cases across telecom, healthcare, logistics, and maritime industries.