A case study detailing lessons learned from processing over 250k LLM calls on 100k corporate documents at Credal. The team discovered that successful LLM implementations require careful data formatting and focused prompt engineering. Key findings included the importance of structuring data to maximize LLM understanding, especially for complex documents with footnotes and tables, and concentrating prompts on the most challenging aspects of tasks rather than trying to solve multiple problems simultaneously.
Credal is an enterprise AI platform that helps companies safely use their data with generative AI. This case study documents their learnings from processing over 250,000 LLM calls on more than 100,000 corporate documents for enterprise customers with thousands of employees. The insights provided offer a practical look at what it takes to move from demo-quality LLM applications to production-grade systems that handle real-world data complexity at scale.
The core thesis of this case study is that LLM attention is a scarce resource that must be carefully managed. When models are asked to perform multiple logical steps or process large amounts of potentially relevant context, their performance degrades significantly. This fundamental constraint shapes most of the technical decisions and workarounds described in the case study.
One of the most significant findings is that the way data is represented to LLMs dramatically impacts answer quality. Credal discovered that out-of-the-box document loaders from libraries like LangChain fail to preserve the semantic relationships within documents that humans take for granted.
When processing documents with footnotes (common in academic papers, legal documents, and research reports), standard parsers place all footnotes at the end of the document. This creates a critical problem for retrieval-based systems: when a user asks “Which author said X?” the citation information needed to answer that question is semantically unrelated to the quote itself in embedding space. The author’s name and the quotation will almost never appear together in search results.
Credal’s solution was to inline footnote content directly where the reference appears in the text. Instead of seeing 'Said Bin Taimur oversaw a "domestic tyranny"[75]', the LLM sees 'Said Bin Taimur oversaw a "domestic tyranny" [Calvin Allen Jr, W. Lynn Rigsbee II, Oman under Qaboos...]'. This simple restructuring transforms an impossible question into a trivial one for the LLM.
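A minimal sketch of this inlining step follows. It assumes the parser has already produced the body text plus a map from footnote numbers to citation strings; the case study does not show Credal's actual implementation, so the function and data shape here are illustrative.

```python
import re

def inline_footnotes(body: str, footnotes: dict[str, str]) -> str:
    """Replace numeric markers like [75] with the footnote text itself,
    so citations sit next to the passages they support.

    `footnotes` maps marker numbers ("75") to citation strings; this is
    an assumed data shape, not Credal's actual code.
    """
    def replace(match: re.Match) -> str:
        citation = footnotes.get(match.group(1))
        return f" [{citation}]" if citation else match.group(0)

    return re.sub(r"\[(\d+)\]", replace, body)

# Example: the marker [75] becomes the inlined citation
text = 'Said Bin Taimur oversaw a "domestic tyranny"[75].'
notes = {"75": "Calvin Allen Jr, W. Lynn Rigsbee II, Oman under Qaboos..."}
print(inline_footnotes(text, notes))
```

After this transformation, any chunk retrieved for "Which author said X?" carries the author's name alongside the quote, so embedding-based search no longer has to bridge two semantically unrelated passages.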
Similarly, tables parsed from Google Docs or similar sources come through as confusingly formatted strings with special characters and unclear structure. Credal found that even the most powerful models (GPT-4-32k, Claude 2) failed to correctly reason about data in poorly formatted tables. When asked to count monarchs whose reign started between 1970 and 1980, GPT-4 incorrectly included a monarch who started in 1986, demonstrating a failure in date-based reasoning that was exacerbated by the confusing data format.
The solution involved converting tables to CSV format before sending them to the LLM. This representation is both 36% more token-efficient than the raw parsed format and significantly easier for models to reason about correctly. The efficiency gain matters not just for cost but also for performance, as every unnecessary token “dissipates the model’s attention” from the actual question.
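A short sketch of that conversion, assuming the upstream parser already yields the table as a list of rows of cell strings (the helper name and prompt wording are hypothetical):

```python
import csv
import io

def table_to_csv(rows: list[list[str]]) -> str:
    """Serialize a parsed table as CSV before prompting: denser than raw
    parser output, and free of the special characters that confuse models."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()

rows = [
    ["Monarch", "Reign start", "Reign end"],
    ["Qaboos bin Said", "1970", "2020"],
]
prompt = (
    "Using the table below, count the monarchs whose reign started "
    "between 1970 and 1980.\n\n" + table_to_csv(rows)
)
```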
For summary questions like “What is the main thesis of the paper?”, traditional semantic search fails because the relevant passage often doesn’t contain the query keywords. The section that actually summarizes a thesis might never use the words “thesis” or “summary.” Keyword-based hybrid search doesn’t help either when the semantic mismatch is this fundamental.
Credal’s solution was to use LLMs at ingestion time to generate metadata tags for each document section. These tags categorize content by type (high-level summary vs. exposition) and by entities mentioned (customers, products, features, etc.). When a user asks a summary question, the system can pre-filter to summary sections before performing semantic search, dramatically improving retrieval quality.
This represents an interesting pattern in LLMOps: using LLMs to preprocess and enrich data at ingestion time to improve downstream LLM performance at query time. The approach requires human experts to define the relevant tag taxonomy for their domain, but the actual tagging work is automated. Credal frames this as “human-computer symbiosis” where humans direct AI attention and computers handle the reading and summarization at scale.
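The pattern can be sketched as two functions, one run at ingestion and one at query time. The tag taxonomy, prompt wording, and the `llm` / `vector_store` interfaces below are all assumptions for illustration; only the overall pre-filter-then-search flow comes from the case study.

```python
# Assumed taxonomy: in practice, domain experts define these tags.
SECTION_TAGS = ["high-level summary", "exposition"]

TAGGING_PROMPT = (
    "Classify this document section as one of: "
    f"{', '.join(SECTION_TAGS)}. Also list any customers, products, "
    "or features it mentions.\n\nSection:\n{section}"
)

def ingest(section: str, llm, vector_store) -> None:
    # One cheap LLM call per section at ingestion time...
    tags = llm(TAGGING_PROMPT.format(section=section))
    # ...stored as metadata alongside the embedding.
    vector_store.add(text=section, metadata={"tags": tags})

def answer_summary_question(query: str, llm, vector_store) -> str:
    # Pre-filter to summary sections before semantic search, so a query
    # like "What is the main thesis?" never competes with detail chunks.
    hits = vector_store.search(query, filter={"tags": "high-level summary"})
    context = "\n\n".join(hit.text for hit in hits)
    return llm(f"Answer using only this context:\n{context}\n\nQ: {query}")
```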
The case study discusses the tradeoffs between RAG (Retrieval Augmented Generation) and full-context approaches. For a single document, using Claude’s 100k context window can produce excellent summaries, but at a cost that can exceed $1-2 per query. With thousands of users, this becomes prohibitively expensive.
More importantly, context window approaches don’t scale to enterprise use cases involving thousands of documents. When dealing with a corpus of legal contracts, a company’s entire Google Drive, or 4,000 written letters, you cannot fit everything in context. RAG becomes necessary, but it requires careful attention to data formatting and retrieval strategy to work reliably.
The case study also notes that even identifying whether a question requires a summary (full-context) approach versus a detail-lookup (RAG) approach is non-trivial and needs to be handled dynamically.
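One way to handle that dynamically is a cheap classification call in front of both pipelines. The prompt and the `llm` callable below are illustrative assumptions, not Credal's documented approach:

```python
ROUTE_PROMPT = (
    "Does answering this question require reading the whole document "
    "(SUMMARY) or looking up a specific detail (DETAIL)? "
    "Reply with exactly one word.\n\nQuestion: {question}"
)

def route(question: str, llm) -> str:
    """Decide between the full-context path and the RAG path.
    Falls back to DETAIL (the cheaper RAG path) on unexpected output."""
    decision = llm(ROUTE_PROMPT.format(question=question)).strip().upper()
    return decision if decision in {"SUMMARY", "DETAIL"} else "DETAIL"
```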
The second major learning concerns how to structure prompts for reliable production performance. Credal built a system where multiple domain-specific “AI experts” live in Slack channels, and incoming questions must be routed to the correct expert with 95%+ accuracy.
Using GPT-3.5 for cost and latency reasons, Credal initially tried LangChain’s StructuredOutputParser to get JSON responses. The problem was that the extensive formatting instructions (10-20 lines about JSON structure) distracted the model from the actual hard part: correctly matching user questions to expert descriptions. GPT-3.5’s accuracy dropped to only 50% even with a single expert in the channel.
The solution was counter-intuitive: remove the sophisticated LangChain tooling and hand-roll a simpler approach. By making the few-shot examples the bulk of the prompt (using JSON naturally within the examples rather than extensive format instructions), they focused model attention on the matching task itself.
When the simplified prompt occasionally produced malformed JSON, Credal added a second GPT-3.5 call specifically for JSON formatting. This two-call approach (with accuracy checking between calls) was both faster and cheaper than a single GPT-4 call while achieving better reliability. This pattern of sequential, specialized prompts with intermediate validation emerged as more robust than trying to accomplish multiple tasks in a single call.
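The shape of that two-call pattern, sketched below with hypothetical prompts and a generic `llm` callable (the actual few-shot examples and expert descriptions are Credal's and are not shown in the case study):

```python
import json

def route_to_expert(question: str, experts: list[str], llm) -> dict:
    """Two sequential GPT-3.5-style calls: one focused purely on matching,
    one reserved for JSON repair. Only runs the second call on failure."""
    # Call 1: few-shot examples carry the output format implicitly,
    # instead of 10-20 lines of explicit JSON instructions.
    matching_prompt = (
        "Q: How do I reset my password?\n"
        'A: {"expert": "IT Support"}\n\n'
        f"Experts: {', '.join(experts)}\n"
        f"Q: {question}\nA:"
    )
    raw = llm(matching_prompt)

    # Intermediate validation: most responses parse cleanly, so the
    # second call (and its cost and latency) is only paid on failure.
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Call 2: a specialist prompt whose only job is formatting.
        fixed = llm(f"Rewrite this as valid JSON, nothing else:\n{raw}")
        return json.loads(fixed)
```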
A recurring theme is that libraries like LangChain, while useful for demos and simple use cases, proved insufficient for production enterprise requirements. Credal still uses some LangChain components but found that solving “the hard parts” required custom implementations.
The specific failure modes included:

- Document loaders that strip footnotes away from the passages that reference them, making citation questions unanswerable by retrieval
- Table parsers that emit confusingly formatted strings full of special characters, which even the strongest models cannot reason about reliably
- Output-formatting tooling (such as StructuredOutputParser) whose lengthy format instructions consumed model attention and cut routing accuracy to 50%
The case study notes that when building demos with controlled data and hand-picked questions, naive approaches work fine. Production systems face long, strangely formatted documents, cost and latency constraints, and diverse user phrasings that break simple approaches.
The case study provides some interesting observations about different models:

- Even GPT-4-32k and Claude 2 failed to reason correctly about poorly formatted tables, with GPT-4 making date-comparison errors on the monarch question
- Claude’s 100k context window produces excellent single-document summaries, but at a per-query cost that doesn’t scale to thousands of users
- GPT-3.5 is attractive for cost and latency, but only delivers reliable results when the prompt is narrowly focused on the hard part of the task
- Two specialized GPT-3.5 calls with validation in between proved faster, cheaper, and more reliable than a single GPT-4 call
Throughout the case study, cost consciousness is apparent. Making GPT-4 calls on every message in a 5,000-person company Slack channel would be “painful.” The solutions consistently optimize for using cheaper, faster models (GPT-3.5) where possible, through better prompting and data formatting rather than simply throwing more powerful models at problems.
The insight that more efficient data representation (CSV vs. raw parsed format) saves tokens while also improving accuracy demonstrates how optimization for cost and quality can align in LLMOps.
The case study concludes with several principles that emerged from real-world deployment:

- LLM attention is a scarce resource; every unnecessary token dilutes it
- Format data the way a careful human reader would want it: footnotes inlined at their references, tables as clean CSV
- Focus each prompt on the single hardest part of the task, and chain specialized calls with intermediate validation rather than asking one call to do everything
- Use LLMs at ingestion time to enrich data so that query-time retrieval can be precise
- Tooling that works for demos breaks under production constraints of scale, cost, latency, and messy real-world documents
This represents a valuable practitioner’s view of LLMOps challenges, grounded in real deployment experience rather than theoretical concerns. While Credal is naturally promoting their platform, the technical insights about document formatting, prompt engineering, and system architecture are broadly applicable to anyone building production LLM applications.