Qatar Computing Research Institute developed a novel question-answering system for organizational documents combining RAG, finetuning, and a tree-based entity structure. The system, called T-RAG, handles confidential documents on-premise using open source LLMs and achieves 73% accuracy on test questions, outperforming baseline approaches while maintaining robust entity tracking through a custom tree structure.
This case study from Qatar Computing Research Institute (QCRI) presents T-RAG (Tree-RAG), a practical LLM application built for question answering over private governance documents for a large non-profit organization. The work is notable for its focus on real-world deployment constraints and the lessons learned from building production LLM systems. The system was built to handle queries about organizational governance manuals, which contain descriptions of governing principles, duties of governing bodies, and hierarchical information about entities within the organization.
The researchers highlight several key production constraints that shaped their architecture: data security concerns requiring on-premise deployment (ruling out API-based proprietary models), limited computational resources, and the critical need for reliable and accurate responses. This represents a common enterprise scenario where organizations cannot simply use cloud-based LLM APIs due to confidentiality requirements.
T-RAG combines three main components: standard RAG for document retrieval, a finetuned open-source LLM for answer generation, and a novel tree-based entity context system for handling hierarchical organizational information.
The team chose Llama-2 7B as their base model, specifically the chat variant, due to its open-source nature enabling on-premise deployment. This decision was driven by practical considerations: larger models (like Llama-2 70B) would require more computational resources than typically available to small and medium enterprises, and geographic restrictions on GPU access can further limit options. The smaller 7B model proved sufficient for their use case while remaining manageable for finetuning and inference.
For inference, they used greedy decoding (temperature 0) with a repetition penalty of 1.1 to generate responses. This conservative approach to generation helps ensure more consistent and predictable outputs in a production setting.
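The effect of these settings can be shown in isolation. The sketch below is pure Python and purely illustrative (the real system runs Llama-2's generation loop): it applies the Hugging Face-style repetition penalty to a logit vector and then takes the greedy argmax, which is what temperature-0 decoding amounts to.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.1):
    """Penalize tokens that were already generated (Hugging Face convention):
    positive logits are divided by the penalty, negative ones multiplied."""
    adjusted = list(logits)
    for tid in set(generated_ids):
        if adjusted[tid] > 0:
            adjusted[tid] /= penalty
        else:
            adjusted[tid] *= penalty
    return adjusted

def greedy_next_token(logits, generated_ids, penalty=1.1):
    """Temperature-0 decoding: always pick the highest adjusted logit."""
    adjusted = apply_repetition_penalty(logits, generated_ids, penalty)
    return max(range(len(adjusted)), key=adjusted.__getitem__)
```

With a penalty of 1.1, a token just emitted with logit 2.0 is rescored to about 1.82, so a close runner-up at 1.9 wins instead, which is exactly how the penalty discourages loops without ever forbidding a token outright.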
The RAG pipeline follows a standard architecture but with careful attention to implementation details. Documents are chunked based on section headers rather than arbitrary token counts, which preserves semantic coherence. The Chroma DB vector database stores document embeddings, and they use the Instructor embedding model for vectorization, which can produce embeddings optimized for different domains.
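Header-based chunking is simple to sketch. The exact format of the governance manuals isn't described, so the example below assumes Markdown-style headers purely for illustration; the point is that each chunk keeps a header together with the body that follows it.

```python
import re

HEADER = re.compile(r"^#{1,6}\s")  # assumed Markdown-style section headers

def chunk_by_headers(text):
    """Split a document into chunks at section headers, keeping each
    header attached to the body below it (preserves semantic coherence)."""
    chunks, current = [], []
    for line in text.splitlines():
        if HEADER.match(line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return chunks
```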
For retrieval, they employ Maximum Marginal Relevance (MMR) rather than simple similarity search. MMR optimizes for both relevance to the query and diversity among retrieved documents, helping to avoid redundant context. The system retrieves chunks that are then merged into context for the LLM.
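The MMR scoring rule itself fits in a few lines. This is a minimal pure-Python sketch over raw embedding vectors, not the retriever the team actually used (vector stores typically expose MMR as a built-in search mode); the trade-off parameter `lam` is conventionally called lambda.

```python
import math

def cos(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr(query, docs, k, lam=0.5):
    """Select k documents balancing query relevance (weight lam) against
    redundancy with already-selected documents (weight 1 - lam)."""
    selected, candidates = [], list(range(len(docs)))
    while candidates and len(selected) < k:
        def score(i):
            redundancy = max((cos(docs[i], docs[j]) for j in selected), default=0.0)
            return lam * cos(query, docs[i]) - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

If the candidate pool contains a near-duplicate of the top hit, a relevance-heavy `lam` keeps the duplicate while a diversity-heavy `lam` swaps it for a distinct document, which is precisely the redundancy-avoidance the team wanted.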
The finetuning process demonstrates practical approaches to creating training data when resources are limited. The team generated an instruction dataset of 1,614 Q&A pairs through an iterative process in which Llama-2 itself was prompted to produce question-answer pairs from document chunks, with the outputs reviewed and refined over successive rounds.
Quality control included manual inspection and duplicate removal. The dataset was split 90/10 for training and validation.
For the actual finetuning, they used QLoRA (Quantized Low-Rank Adaptation), which combines 4-bit quantization of model weights with LoRA for parameter-efficient finetuning. With rank r=64, this reduced trainable parameters to approximately 33.5 million—about 200x fewer than the full model parameters. Training was performed on 4 Quadro RTX 6000 GPUs with 24GB memory each, using the Hugging Face PEFT library.
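The ~33.5 million figure is consistent with LoRA adapters on only the query and value projections (the common PEFT default); the paper's exact target-module list isn't restated here, so that is an assumption. Each adapted d×d weight gains two low-rank factors, A (r×d) and B (d×r):

```python
# LoRA trainable-parameter count for Llama-2 7B, assuming rank r=64 adapters
# on q_proj and v_proj only (an assumption matching the common PEFT default).
hidden = 4096           # Llama-2 7B hidden size
r = 64                  # LoRA rank
layers = 32             # transformer layers in the 7B model
matrices_per_layer = 2  # q_proj and v_proj

params_per_matrix = r * hidden + hidden * r  # A (r x d) + B (d x r)
trainable = params_per_matrix * matrices_per_layer * layers
print(trainable)  # 33554432, i.e. about 33.5 million
```

Against roughly 6.7 billion base parameters, that is the ~200x reduction cited above.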
A critical production consideration was testing for overfitting. The team evaluated their finetuned model against the base model on the MMLU benchmark to ensure general language capabilities weren’t degraded. The finetuned model achieved 43% overall accuracy compared to 45.3% for the base model, indicating minimal capability loss.
The most distinctive aspect of T-RAG is the tree structure for representing organizational hierarchies. Standard RAG struggles with questions about entity relationships within complex hierarchies because the relevant information may be scattered across documents or represented in organizational charts that don’t embed well.
The solution encodes the organizational hierarchy as a tree where each node represents an entity and parent-child relationships indicate categorical membership. For example, in a UNHCR-style structure, “Global Service Center in Budapest” would be a child of “Deputy High Commissioner” which would be under “High Commissioner Executive Office.”
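A minimal sketch of such a tree, with a helper that renders an entity's chain of command as text suitable for the LLM context. The node schema and the phrasing of the generated statement are illustrative; the entity names follow the UNHCR-style example above.

```python
class EntityNode:
    """One organizational entity; parent links encode categorical membership."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent:
            parent.children.append(self)

    def path_to_root(self):
        """Entity names from this node up to the root of the hierarchy."""
        node, path = self, []
        while node:
            path.append(node.name)
            node = node.parent
        return path

def describe(node):
    """Render the hierarchy above an entity as a textual statement."""
    path = node.path_to_root()
    if len(path) == 1:
        return path[0] + " is the top-level entity."
    return path[0] + " falls under " + ", which falls under ".join(path[1:]) + "."

root = EntityNode("High Commissioner Executive Office")
deputy = EntityNode("Deputy High Commissioner", parent=root)
gsc = EntityNode("Global Service Center in Budapest", parent=deputy)
```

Here `describe(gsc)` yields a sentence tracing the center up through the Deputy High Commissioner to the Executive Office, giving the LLM the relationship in plain text rather than as a chart that embeds poorly.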
The entity detection pipeline uses spaCy with custom rules for named entity recognition tailored to the organization's entities. Standard NER would detect "Budapest" as a location, but custom rules identify "Global Service Center in Budapest" as an organizational entity. When a query mentions known entities, the system detects them, looks each one up in the tree, and converts its position in the hierarchy into textual statements that are appended to the retrieved context.
This adaptive approach means the tree search is only performed when entity-related questions are detected, avoiding unnecessary context pollution for other query types.
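The paper's detector is built on spaCy with custom rules; as a simplified pure-Python stand-in, longest-match lookup over the tree's entity names captures the key behavior, preferring the full organizational entity "Global Service Center in Budapest" over the embedded location name "Budapest".

```python
def detect_entities(query, entity_names):
    """Longest-match detection of known organizational entities in a query.
    Simplified stand-in for the paper's spaCy custom-rule NER."""
    found, lowered = [], query.lower()
    # Try longer names first so full entity names beat embedded substrings.
    for name in sorted(entity_names, key=len, reverse=True):
        if name.lower() in lowered:
            found.append(name)
            lowered = lowered.replace(name.lower(), " ")  # consume the span
    return found
```

When this returns an empty list, no tree search runs and the query is handled by plain RAG, which is the adaptive behavior described above.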
The evaluation approach reflects practical deployment concerns with multiple rounds of real user testing. Three sets of questions were gathered from organizational end users across testing phases, totaling 37 questions.
Human evaluation was chosen over automated LLM-based evaluation, with responses marked as Correct (C) or Correct-Verbose (CV). The CV category captures an important production concern: responses that are factually correct but include irrelevant information. This verbosity metric is valuable because overly long responses waste user time and may bury the actual answer.
Results showed T-RAG achieved 73% total accuracy (27/37) compared to 56.8% for standard RAG (21/37) and 54.1% for finetuning alone (20/37). However, T-RAG produced more verbose responses (6 CV versus 1 for each of the other methods), indicating a tradeoff between accuracy and conciseness.
For entity-related questions, the team created separate test sets: “simple” questions (direct queries about single entities) and “complex” questions (listing entities under categories or compound questions). The tree context dramatically improved performance—the finetuned model without tree context answered 47.1% of simple questions correctly, while with tree context this jumped to 100%. For complex questions, accuracy improved from 36.4% to 77.3%.
This evaluation tested retrieval robustness by placing relevant context at different positions (top, middle, end) surrounded by varying numbers of irrelevant chunks (k=2 or k=10). Results showed models generally performed better with relevant information at context ends rather than the middle. Critically, T-RAG maintained 64.1% accuracy with relevant context buried in the middle among 10 unrelated chunks, while standard RAG dropped to 33.3%. This demonstrates the tree context’s value in providing reliable entity information regardless of position in the context window.
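This kind of "needle position" test is straightforward to reproduce: given one relevant chunk and k distractors, build contexts with the relevant chunk at the top, middle, or end. The sketch below mirrors the setup described above, not the paper's exact harness.

```python
def build_context(relevant, distractors, position):
    """Place the relevant chunk at 'top', 'middle', or 'end' among distractors."""
    chunks = list(distractors)
    if position == "top":
        idx = 0
    elif position == "middle":
        idx = len(chunks) // 2
    elif position == "end":
        idx = len(chunks)
    else:
        raise ValueError(f"unknown position: {position}")
    chunks.insert(idx, relevant)
    return "\n\n".join(chunks)
```

Running the same question set against each position then exposes the "lost in the middle" effect that standard RAG suffered from here.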
The paper provides valuable practical insights from deployment experience:
Robustness is hard: While building an initial RAG is straightforward, making it robust requires domain expertise and extensive optimization across all pipeline components (chunking strategy, embedding model selection, retrieval algorithms, etc.)
Finetuned models are phrasing-sensitive: Small changes in question wording can significantly affect outputs. For example, asking for “a comprehensive list of all the…” produced hallucinations while “a list of all the…” was answered correctly. This sensitivity to distribution shift from training data is a known challenge requiring careful prompt design or broader training data coverage.
Context window efficiency: Finetuning can reduce context requirements by encoding information in model parameters, leaving more context window space for chat history or other dynamic information. This may be more practical than pursuing ever-larger context windows.
End user involvement: Regular testing with actual end users throughout development provided valuable feedback for system refinement.
Hybrid approaches are promising: Neither RAG nor finetuning alone was optimal. RAG provides grounding that reduces hallucinations but is sensitive to retrieval quality. Finetuning adapts tone and incorporates knowledge but risks overfitting and hallucinations on out-of-distribution queries. Combining both approaches proved most effective.
Update considerations: RAG is better suited for frequently changing documents since the vector database can be updated easily. Finetuning requires regenerating training data and retraining, making it better for more stable documents like governance manuals.
The work represents a thoughtful approach to production LLM deployment that acknowledges constraints and tradeoffs rather than simply pursuing maximum benchmark performance. The combination of practical architectural choices, careful evaluation methodology, and honest assessment of limitations makes this a useful reference for enterprise LLM application development.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
DoorDash faced challenges in scaling personalization and maintaining product catalogs as they expanded beyond restaurants into new verticals like grocery, retail, and convenience stores, dealing with millions of SKUs and cold-start scenarios for new customers and products. They implemented a layered approach combining traditional machine learning with fine-tuned LLMs, RAG systems, and LLM agents to automate product knowledge graph construction, enable contextual personalization, and provide recommendations even without historical user interaction data. The solution resulted in faster, more cost-effective catalog processing, improved personalization for cold-start scenarios, and the foundation for future agentic shopping experiences that can adapt to real-time contexts like emergency situations.
Slack's Developer Experience team embarked on a multi-year journey to integrate generative AI into their internal development workflows, moving from experimental prototypes to production-grade AI assistants and agentic systems. Starting with Amazon SageMaker for initial experimentation, they transitioned to Amazon Bedrock for simplified infrastructure management, achieving a 98% cost reduction. The team rolled out AI coding assistants using Anthropic's Claude Code and Cursor integrated with Bedrock, resulting in 99% developer adoption and a 25% increase in pull request throughput. They then evolved their internal knowledge bot (Buddybot) into a sophisticated multi-agent system handling over 5,000 escalation requests monthly, using AWS Strands as an orchestration framework with Claude Code sub-agents, Temporal for workflow durability, and MCP servers for standardized tool access. The implementation demonstrates a pragmatic approach to LLMOps, prioritizing incremental deployment, security compliance (FedRAMP), observability through OpenTelemetry, and maintaining model agnosticism while scaling to millions of tokens per minute.