Ramp, a financial services company, struggled to classify its customers by industry because it relied on a homegrown taxonomy that produced inconsistent, overly broad, and non-auditable categorizations across teams. The company built an in-house Retrieval-Augmented Generation (RAG) model to migrate to the standardized NAICS (North American Industry Classification System) taxonomy. The solution is a two-stage RAG pipeline: optimized embeddings generate recommendations, and a dual-prompt LLM system selects the final prediction. The deployed system improved recommendation retrieval accuracy by up to 60% and final prediction accuracy by 5-15%, while providing full ownership, auditability, and interpretability of classification decisions. This enabled consistent cross-team collaboration, improved compliance capabilities, and more precise customer understanding across sales, risk, and analytics functions.
Ramp, a financial services platform focused on helping customers save time and money, implemented a production RAG-based system to solve a critical data quality and operational challenge: industry classification of their business customers. The case study presents a comprehensive example of building and deploying an LLM-powered system from scratch, with particular emphasis on systematic optimization, production architecture considerations, and measurable business impact.
The core business problem stemmed from Ramp’s reliance on a homegrown industry taxonomy cobbled together from third-party data, sales-entered information, and customer self-reporting. This system created multiple sources of truth across the organization, with different teams using incompatible taxonomies (homegrown categories, SIC codes, NAICS codes) that required complex many-to-many translation layers. The consequences were significant: the Risk and Sales teams couldn’t establish quick feedback loops on targeting and segmentation, classifications were often incorrect or overly generic (like categorizing diverse businesses all as “Professional Services”), similar businesses received different classifications, and the system lacked auditability. For example, WizeHire, a hiring platform, was broadly classified as “Professional Services” alongside law firms, dating apps, and consulting firms, making it difficult to understand specific customer needs or profile credit risk accurately.
Ramp’s solution was to standardize on NAICS (North American Industry Classification System), a six-digit hierarchical taxonomy used across North America, replacing the older SIC system and their homegrown approach. This standardization would enable internal consistency while facilitating communication with external partners already using NAICS. The hierarchical nature of NAICS codes proved particularly valuable, allowing teams to choose appropriate granularity levels - from broad categories down to specific six-digit codes. Under the new system, WizeHire would be classified as “561311 - Employment Placement Agencies,” with clear rollup categories for different analytical purposes.
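The rollup idea can be illustrated with a minimal sketch. It assumes the standard NAICS level names (sector, subsector, industry group, NAICS industry, national industry), which correspond to the first two through six digits of a code; the function name `naics_rollups` is hypothetical and not taken from Ramp's system.

```python
# Standard NAICS hierarchy levels, keyed by the number of leading digits.
NAICS_LEVELS = {
    2: "sector",
    3: "subsector",
    4: "industry group",
    5: "NAICS industry",
    6: "national industry",
}


def naics_rollups(code: str) -> dict[str, str]:
    """Return the prefix of a six-digit NAICS code at each hierarchy level."""
    if len(code) != 6 or not code.isdigit():
        raise ValueError(f"expected a six-digit NAICS code, got {code!r}")
    return {name: code[:digits] for digits, name in NAICS_LEVELS.items()}


# WizeHire's code from the example above.
rollups = naics_rollups("561311")
print(rollups["sector"], rollups["industry group"], rollups["national industry"])
```

A team that only needs coarse segmentation could group on the `sector` prefix, while a risk model could use the full six digits.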
The critical challenge was building a classification model capable of predicting six-digit NAICS codes for all Ramp businesses. While third-party solutions existed, Ramp determined that their unique needs with complex, proprietary data justified building an in-house model. They chose a RAG architecture specifically for its ability to constrain LLM outputs to a valid knowledge base domain - essentially converting an open-ended generation problem into a multiple-choice question where all choices are valid NAICS codes.
The RAG system runs in three steps in a production environment. First, the system calculates text embeddings of both the query (business information) and the knowledge base (NAICS code descriptions). Second, it computes similarity scores to generate recommendations from the knowledge base. Third, an LLM makes final predictions from the filtered recommendations. The first two steps make up the recommendation generation stage, and the third is the prediction selection stage. This multi-stage design required careful consideration of production infrastructure, data flow, and monitoring capabilities.
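The retrieval part of this flow can be sketched as follows. This is an illustration only: the three-dimensional vectors are toy stand-ins for real embeddings, cosine similarity is an assumed choice of similarity score (the case study does not name one), and `top_k` is a hypothetical helper.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], kb_vecs: dict[str, list[float]], k: int) -> list[str]:
    """Return the k knowledge-base codes most similar to the query."""
    ranked = sorted(
        kb_vecs.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [code for code, _ in ranked[:k]]


# Toy vectors standing in for pre-computed NAICS description embeddings.
kb = {
    "561311": [0.9, 0.1, 0.0],
    "541110": [0.1, 0.9, 0.0],
    "722511": [0.0, 0.1, 0.9],
}
query = [0.8, 0.2, 0.1]  # toy embedding of a business description

recommendations = top_k(query, kb, k=2)
print(recommendations)
```

The resulting list is what the LLM would then choose from in the final step.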
The production architecture integrates several key components. Internal services handle embeddings for new businesses and LLM prompt evaluations. Knowledge base embeddings are pre-computed and stored in ClickHouse, enabling fast retrieval of recommendations through similarity score calculations. Intermediate results are logged via Kafka, providing diagnostic capabilities for pathological cases and enabling prompt iteration. This architecture reflects thoughtful LLMOps practices around observability, data management, and system modularity.
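As a rough sketch of what logging intermediate results might look like, the snippet below serializes one classification trace as JSON and hands it to a publisher. Everything here is an assumption for illustration: the field names, the topic name, and the `publish` callable (a stand-in for whatever Kafka producer the real system uses) do not come from the case study.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ClassificationTrace:
    """One record of the intermediate outputs for a single business (hypothetical schema)."""
    business_id: str
    recommendations: list[str]  # codes returned by retrieval
    shortlist: list[str]        # codes kept after the first prompt
    prediction: str             # code chosen by the second prompt
    justification: str = ""


def log_trace(trace: ClassificationTrace, publish) -> bytes:
    """Serialize the trace and pass it to `publish(topic, value)`."""
    payload = json.dumps(asdict(trace), sort_keys=True).encode("utf-8")
    publish("naics-classification-traces", payload)  # hypothetical topic name
    return payload


# In-memory stand-in for a message producer, so the sketch runs without a broker.
sent: list[tuple[str, bytes]] = []
trace = ClassificationTrace(
    business_id="biz_123",
    recommendations=["561311", "541612", "561320"],
    shortlist=["561311", "561320"],
    prediction="561311",
)
log_trace(trace, lambda topic, value: sent.append((topic, value)))
```

Keeping retrieval output, shortlist, and final prediction in one record makes it possible to see at which step a wrong classification first went off track.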
The case study demonstrates sophisticated thinking about evaluation for multi-stage systems. Rather than using a single end-to-end metric, Ramp decomposed the problem and identified metrics for each stage, ensuring metrics wouldn’t interfere with each other while remaining aligned with overall system goals.
For the recommendation generation stage, they selected accuracy at k (acc@k) as the primary metric, measuring how often the correct NAICS code appears in the top k recommendations. This metric represents a performance ceiling for the full system - if the correct code isn’t among recommendations, the LLM cannot select it. This framing shows clear understanding of how retrieval quality bounds downstream performance in RAG systems.
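A minimal sketch of acc@k, assuming each business has one labeled correct code and a ranked list of recommended codes (the function name and toy data are illustrative):

```python
def accuracy_at_k(recommendations: list[list[str]], truths: list[str], k: int) -> float:
    """Fraction of businesses whose correct code appears in their top-k recommendations."""
    if not truths or len(recommendations) != len(truths):
        raise ValueError("need one non-empty, equally sized list of recommendations and truths")
    hits = sum(truth in recs[:k] for recs, truth in zip(recommendations, truths))
    return hits / len(truths)


# Three toy businesses, all with the same correct code.
recs = [
    ["561311", "541110"],  # correct code ranked first
    ["541110", "561311"],  # correct code ranked second
    ["722511", "541110"],  # correct code missing
]
truths = ["561311", "561311", "561311"]

acc_1 = accuracy_at_k(recs, truths, k=1)
acc_2 = accuracy_at_k(recs, truths, k=2)
```

In this toy data the third business can never be classified correctly from its recommendations, which is the sense in which acc@k caps the accuracy of the downstream LLM step.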
For the prediction selection stage, they defined a custom fuzzy-accuracy metric that accounts for NAICS’s hierarchical structure. Rather than treating all incorrect predictions equally, predictions correct for part of the hierarchy receive partial credit. For instance, if the correct code is 123456, a prediction of 123499 (matching the first four digits) scores better than 999999 (completely wrong). This custom metric reflects domain expertise and recognition that partial correctness has business value - a somewhat wrong industry classification is better than a completely wrong one for most downstream use cases.
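The case study does not give the exact scoring rule, so the sketch below assumes one simple possibility: credit equal to the fraction of leading digits that match. Ramp's actual metric may weight hierarchy levels differently.

```python
def prefix_match_score(predicted: str, actual: str) -> float:
    """Fraction of leading digits shared by the predicted and actual codes (assumed rule)."""
    matched = 0
    for p, a in zip(predicted, actual):
        if p != a:
            break
        matched += 1
    return matched / len(actual)


def fuzzy_accuracy(predictions: list[str], truths: list[str]) -> float:
    """Mean prefix-match score over all labeled businesses."""
    if not truths or len(predictions) != len(truths):
        raise ValueError("need equally sized, non-empty predictions and truths")
    return sum(prefix_match_score(p, t) for p, t in zip(predictions, truths)) / len(truths)


# The example from the text: correct code 123456.
partial = prefix_match_score("123499", "123456")  # first four digits match
wrong = prefix_match_score("999999", "123456")    # nothing matches
exact = prefix_match_score("123456", "123456")
```

Under this assumed rule the partially correct prediction earns four sixths of the credit and the completely wrong one earns none, which preserves the ordering the text describes.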
The recommendation generation stage involved optimizing multiple hyperparameters through systematic experimentation. Key parameters included which knowledge base field to embed, which query field to embed, the embedding model selection, and the number of recommendations to generate. Each parameter presented tradeoffs - for example, certain business attributes might be more informative but have higher missing rates, while different embedding models have varying resource requirements that don’t necessarily correlate with performance on Ramp’s specific data.
The team profiled performance across different configurations, creating acc@k curves to identify optimal settings. They explicitly avoided naive optimization that would maximize acc@k without considering downstream LLM performance (which would lead to recommending the entire knowledge base). Through this systematic optimization, they achieved up to 60% improvement in acc@k. Importantly, they identified economical embedding models suitable for production deployment without sacrificing performance compared to larger models - a critical consideration for cost management in production LLM systems.
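One way to read an acc@k curve without simply maximizing k is to stop where extra recommendations no longer add much. The sketch below assumes that stopping rule (a minimum marginal gain, `min_gain`) purely for illustration; the case study does not say how Ramp picked its setting, and the rank data is made up.

```python
from typing import Optional


def acc_at_k_curve(true_ranks: list[Optional[int]], max_k: int) -> list[float]:
    """acc@k for k = 1..max_k, given the 1-based rank of the correct code (None if not retrieved)."""
    n = len(true_ranks)
    return [
        sum(rank is not None and rank <= k for rank in true_ranks) / n
        for k in range(1, max_k + 1)
    ]


def smallest_k_with_flat_gain(curve: list[float], min_gain: float) -> int:
    """Smallest k after which adding one more recommendation gains less than min_gain."""
    for i in range(len(curve) - 1):
        if curve[i + 1] - curve[i] < min_gain:
            return i + 1  # curve[i] is acc@(i + 1)
    return len(curve)


# Toy ranks of the correct code for ten businesses under one configuration.
ranks = [1, 1, 2, 3, 3, 7, None, 1, 2, 12]
curve = acc_at_k_curve(ranks, max_k=5)
chosen_k = smallest_k_with_flat_gain(curve, min_gain=0.05)
```

Running the same profile for each embedding model or embedded field would give one curve per configuration to compare.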
The prediction selection stage similarly involved optimizing multiple hyperparameters: the number of recommendations to include, which fields to include in prompts (for both business descriptions and knowledge base entries), prompt variations, number of prompts, and structured output schema design. Again, tradeoffs were carefully considered. More recommendations give the LLM better chances of finding the correct code but increase context size and potentially degrade performance if the model cannot focus on relevant options. Likewise, longer or more descriptive information helps the LLM understand businesses and NAICS codes better but significantly increases context size.
Ramp implemented a sophisticated two-prompt system to balance competing considerations. The first prompt includes many recommendations without the most specific descriptions, asking the LLM to return a small list of the most relevant codes. The second prompt then asks the LLM to choose the best single code from this filtered list, providing more detailed context for each remaining option. This staged approach gets “the best of both worlds” - comprehensive initial coverage with focused final decision-making. Through optimization of these parameters, they achieved 5-15% improvement in fuzzy accuracy.
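The two-prompt flow can be sketched as below. The prompt wording, the `llm` callable (any function from prompt text to reply text), the candidate descriptions, and the comma-separated reply format are all assumptions made for illustration; `stub_llm` is a canned stand-in so the example runs without a model.

```python
from typing import Callable


def build_shortlist_prompt(business: str, candidates: dict[str, dict], n: int) -> str:
    """First prompt: many candidates, titles only."""
    lines = [f"{code}: {info['title']}" for code, info in candidates.items()]
    return (
        f"SHORTLIST up to {n} NAICS codes for this business, comma-separated.\n"
        f"Business: {business}\nCandidates:\n" + "\n".join(lines)
    )


def build_final_prompt(business: str, candidates: dict[str, dict], shortlist: list[str]) -> str:
    """Second prompt: few candidates, with fuller descriptions."""
    lines = [
        f"{code}: {candidates[code]['title']} - {candidates[code]['description']}"
        for code in shortlist
    ]
    return (
        "FINAL: choose the single best NAICS code and reply with the code only.\n"
        f"Business: {business}\nCandidates:\n" + "\n".join(lines)
    )


def classify(business: str, candidates: dict[str, dict], llm: Callable[[str], str], n: int = 2) -> str:
    reply = llm(build_shortlist_prompt(business, candidates, n))
    # Simplification: keep only codes we have descriptions for, since prompt two needs them.
    shortlist = [c.strip() for c in reply.split(",") if c.strip() in candidates][:n]
    if not shortlist:
        raise ValueError("first prompt returned no usable codes")
    return llm(build_final_prompt(business, candidates, shortlist)).strip()


def stub_llm(prompt: str) -> str:
    """Canned replies standing in for a real model."""
    return "561311, 561320" if prompt.startswith("SHORTLIST") else "561311"


# Descriptions are placeholder text, not official NAICS wording.
candidates = {
    "561311": {"title": "Employment Placement Agencies", "description": "places job seekers with employers"},
    "561320": {"title": "Temporary Help Services", "description": "supplies temporary workers"},
    "541612": {"title": "Human Resources Consulting Services", "description": "advises on HR practices"},
}
prediction = classify("WizeHire, a hiring platform", candidates, stub_llm)
```

Only the shortlisted codes carry their longer descriptions into the second prompt, which is how the context stays small at the point of the final decision.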
The use of structured output is notable, constraining the LLM to return predictions in a specific format that can be reliably parsed and validated. This is a production-critical consideration that prevents parsing failures and enables systematic validation of outputs.
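As an illustration of why a fixed output format helps, the sketch below parses a JSON reply into a typed record and rejects anything malformed. The schema (`naics_code`, `justification`) is a hypothetical one; the case study does not publish Ramp's actual output schema.

```python
import json
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    naics_code: str
    justification: str


def parse_prediction(raw: str) -> Prediction:
    """Parse an LLM reply that is expected to be a JSON object with a six-digit code."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    code = data.get("naics_code")
    if not isinstance(code, str) or not re.fullmatch(r"[0-9]{6}", code):
        raise ValueError(f"naics_code must be a six-digit string, got {code!r}")
    justification = data.get("justification", "")
    if not isinstance(justification, str):
        raise ValueError("justification must be a string")
    return Prediction(naics_code=code, justification=justification)


parsed = parse_prediction('{"naics_code": "561311", "justification": "Hiring platform."}')
```

A reply that fails these checks can be retried or logged instead of silently producing a bad classification.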
While RAG systems inherently constrain LLM outputs to the knowledge base domain, Ramp implemented additional guardrails. They validate that output NAICS codes from each LLM prompt are valid codes that actually exist. Interestingly, they discovered cases where the LLM correctly predicted codes not present in the recommendations (a form of “positive hallucination”), so their validation focuses on filtering out just “bad” hallucinations rather than all hallucinations. This nuanced approach shows sophisticated understanding of LLM behavior and willingness to leverage beneficial emergent capabilities while protecting against harmful ones.
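The guardrail described here can be sketched as a three-way check. The function name, the return labels, and the tiny code set are assumptions for illustration; a real check would use the full set of NAICS codes.

```python
def check_prediction(code: str, valid_codes: set[str], recommended: list[str]) -> str:
    """Classify an LLM-predicted code against the taxonomy and the retrieved recommendations."""
    if code not in valid_codes:
        return "reject"  # not a real NAICS code: a "bad" hallucination
    if code not in recommended:
        return "accept_outside_recommendations"  # valid code the retrieval step missed
    return "accept"


# Small illustrative subset of the taxonomy.
valid_codes = {"561311", "561320", "541612"}
recommended = ["561320", "541612"]

in_list = check_prediction("561320", valid_codes, recommended)
outside = check_prediction("561311", valid_codes, recommended)
invalid = check_prediction("999999", valid_codes, recommended)
```

Tracking how often the middle case occurs would also indicate how often retrieval misses a code the LLM can still identify.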
The logging of all intermediate steps through Kafka provides crucial observability for a production system. This enables the team to pinpoint where issues arise - whether in retrieval or re-ranking stages - and make targeted improvements. The ability to diagnose pathological cases and iterate on prompts represents operational maturity in LLMOps practices. The team can analyze failures systematically rather than treating the system as a black box.
The case study also highlights requesting LLM justifications to clarify reasoning behind predictions. This interpretability capability serves multiple purposes: building user trust, debugging incorrect predictions, identifying systematic biases or errors, and potentially improving the system through analysis of reasoning patterns.
Since deployment, Ramp has realized several key benefits beyond accuracy improvements. Full ownership and control over the algorithm allows them to make adjustments across dozens of hyperparameters to address concerns as they emerge. They can tune the system in response to different concerns: performance degradation, latency requirements, or cost sensitivity. This flexibility would be impossible with a third-party solution where they’d be constrained by the vendor’s roadmap, pricing, and iteration speed.
The auditability and interpretability of the model’s decisions addresses compliance requirements and builds stakeholder confidence. The ability to examine intermediate steps and LLM reasoning provides transparency lacking in their previous system.
The case study provides concrete examples of improvement. Businesses previously classified inconsistently despite similarity (three similar businesses categorized differently in the homegrown system) are now correctly grouped under the same NAICS code. Conversely, businesses previously lumped into overly broad categories are now distinguished with more descriptive, specific NAICS codes. For instance, businesses all categorized as generic “Professional Services” are now properly differentiated into their specific industry segments.
The stakeholder quotes, while promotional in nature, reflect genuine improvement in data quality and operational capabilities: enabling industry exclusion requirements for compliance, supporting business diversification efforts, and providing the nuanced understanding necessary for sophisticated customer segmentation and risk profiling.
While the case study presents a clear success story, certain aspects deserve balanced consideration. The claimed improvements (up to 60% in acc@k, 5-15% in fuzzy accuracy) lack context about baseline performance, and it is not stated whether they are relative or absolute gains - we don’t know if they went from 30% to 90% or 85% to 95%, which would have very different business implications. The case study doesn’t discuss failure modes, error rates, or cases where the system still struggles, which would provide a more complete understanding.
The cost comparison with third-party solutions is mentioned but not quantified. Building and maintaining an in-house system requires significant engineering investment, ongoing maintenance, and specialized expertise. Whether this represents better total cost of ownership compared to third-party solutions depends on factors not fully explored in the case study.
The two-prompt strategy is elegant but implies two LLM calls per classification, which raises inference cost and latency compared to single-prompt approaches (up to roughly double, depending on the size of each prompt). The case study doesn’t discuss whether they explored or benchmarked against single-prompt alternatives with optimized context management.
The handling of data sparsity and non-uniform distribution (mentioned as challenges) isn’t fully addressed in the solution description. How does the system perform for rare or unusual business types with limited training examples? What strategies handle businesses that don’t fit neatly into NAICS categories?
Overall, this case study demonstrates substantial LLMOps maturity. The systematic approach to evaluation with stage-specific metrics, comprehensive hyperparameter optimization, thoughtful production architecture with appropriate infrastructure choices (ClickHouse for fast retrieval, Kafka for logging), implementation of guardrails and validation, and focus on observability and iteration all represent production-grade LLM deployment practices. The two-stage prompting strategy shows sophisticated prompt engineering going beyond basic approaches.
The migration from a heterogeneous, unauditable system to a standardized, interpretable one addresses real organizational challenges around data quality, cross-functional collaboration, and compliance. The business impact - enabling better customer understanding, risk assessment, and regulatory compliance - justifies the engineering investment in building a custom solution rather than relying on off-the-shelf alternatives.
Ramp, a financial services company, replaced their fragmented homegrown industry classification system with a standardized NAICS-based taxonomy powered by an in-house RAG model. The old system relied on stitched-together third-party data and multiple non-auditable sources of truth, leading to inconsistent, overly broad, and sometimes incorrect business categorizations. By building a custom RAG system that combines embeddings-based retrieval with LLM-based re-ranking, Ramp achieved significant improvements in classification accuracy (up to 60% in retrieval metrics and 5-15% in final prediction accuracy), gained full control over the model's behavior and costs, and enabled consistent cross-team usage of industry data for compliance, risk assessment, sales targeting, and product analytics.