The research analyzes real-world prompt templates from open-source LLM-powered applications to understand their structure, composition, and effectiveness. Drawing on more than 2,000 prompt templates from production applications, including tools from Uber and Microsoft, the study identifies key components, patterns, and best practices for template design. The findings show that well-structured templates following specific patterns can significantly improve LLMs' instruction-following abilities, potentially enabling weaker models to match the performance of more advanced ones.
This research paper from Technical University of Munich presents a comprehensive empirical study of prompt templates used in production LLM-powered applications (LLMapps). The study is particularly relevant to LLMOps practitioners because it provides data-driven insights into how real-world applications structure their prompts, derived from analyzing open-source repositories from major companies including Uber and Microsoft.
The core motivation stems from a significant operational challenge: while LLMs have democratized AI adoption, crafting effective prompts remains non-trivial. Small variations in prompt structure or wording can lead to substantial differences in model output, creating reliability and maintainability challenges for production systems. Prompt templates serve as a solution by providing predefined structures that combine static text with dynamic placeholders, enabling consistent and efficient LLM interactions at scale.
The researchers constructed their dataset from PromptSet, a collection of prompts extracted from LLMapps in open-source GitHub projects as of January 2024. Their data processing pipeline employed several quality filters that are instructive for LLMOps practitioners building similar analysis systems:
The dataset includes production tools from notable organizations: Uber’s Piranha (a tool for refactoring code related to feature flag APIs, adopted by over 200 developers), Microsoft’s TaskWeaver (a code-first agent framework for data analytics with over 5k GitHub stars), Weaviate’s Verba RAG chatbot (6.5k stars), and LAION-AI’s Open-Assistant (37k stars).
The researchers identified seven common component types in prompt templates, derived by synthesizing insights from Google Cloud documentation, the Elvis Saravia framework, the CRISPE framework, and the LangGPT framework. The distribution of components provides insight into production template design:
A key operational finding is the common sequential order of components: Profile/Role and Directive typically appear first (establishing model identity and task intent), while examples are typically placed at the end. The researchers found that over 90% of directives are written in instruction style rather than question style, suggesting that commands like “Summarize the report” are more effective for production systems than questions like “Could you summarize this?”
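The ordering and style findings above can be sketched as a minimal template skeleton. This is an illustrative assumption, not a template from the paper; the role, directive text, and `report_text` parameter are hypothetical.

```python
# Hypothetical skeleton following the common component order reported in the
# study: Profile/Role and Directive first, few-shot examples at the end.
ROLE = "You are an experienced financial analyst."              # Profile/Role
DIRECTIVE = "Summarize the report below in three bullet points."  # instruction style, not "Could you...?"
EXAMPLES = (
    "Example summary:\n"
    "- Revenue grew 12% year over year.\n"
    "- Guidance for Q3 was raised."
)

def build_prompt(report_text: str) -> str:
    # Components ordered: role -> directive -> knowledge input -> examples.
    return f"{ROLE}\n\n{DIRECTIVE}\n\nReport:\n{report_text}\n\n{EXAMPLES}"
```

Note the directive is phrased as a command ("Summarize the report…"), matching the instruction style the study found in over 90% of production directives.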
Given that JSON is the most commonly used structured output format in LLMapps (critical for downstream processing), the researchers identified three distinct patterns for specifying JSON output:
Sample testing with both llama3-70b-8192 and gpt-4o revealed significant performance differences. Pattern 3 achieved the highest scores across both format-following and content-following metrics. For format-following specifically, the llama3 model scored 3.09 with Pattern 1 versus 4.90 with Pattern 3 (on a 1-5 scale), demonstrating that explicit attribute definitions dramatically improve structural consistency.
The researchers also found that using JSON format definitions alone is insufficient to prevent extraneous explanations in output. Combining positive instructions (output format definitions) with negative instructions (exclusion constraints like “Do not provide any other output text beyond the JSON string”) raised the format-following rate from 40% to 100% for llama3 and from 86.67% to 100% for gpt-4o.
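Combining a positive format definition with a negative exclusion constraint might look like the following sketch. The field names (`sentiment`, `product`) and the review-extraction task are hypothetical; only the exclusion-constraint wording is taken from the study.

```python
import json

# Hypothetical sketch: explicitly define each JSON attribute (positive
# instruction) and pair it with an exclusion constraint (negative
# instruction) so the model returns nothing but the JSON string.
JSON_PROMPT = """Extract the following fields from the review below and return a JSON object with exactly these attributes:
  "sentiment": one of "positive", "negative", or "neutral"
  "product": the product name mentioned, or null

Do not provide any other output text beyond the JSON string.

Review:
{review}"""

def parse_response(raw: str) -> dict:
    # Downstream parsing can rely on pure JSON only when the exclusion
    # constraint is honored; json.loads fails loudly otherwise.
    return json.loads(raw)
```

The design point is that the exclusion line is what makes `parse_response` safe to call directly on the raw model output, rather than first stripping explanatory prose.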
The study identified four primary placeholder categories in prompt templates:
Regarding positional distribution, approximately 60% of user questions appear at the end of templates, while Knowledge Input placeholders are more evenly distributed between beginning and end positions. Testing revealed that positioning task intent instructions after the Knowledge Input (rather than before) significantly improves output quality, particularly for long inputs. The llama3 model showed a +0.91 improvement in task intent adherence with this positioning strategy, compared to +0.34 for gpt-4o.
The researchers also flagged a common anti-pattern: many templates use non-semantic placeholder names like “text” (4.44%) and “input” (2.35%), which hinder maintainability. Similar to variable naming conventions in traditional software, clear and descriptive placeholder names are recommended for production systems.
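Both placeholder findings can be combined in one sketch: descriptive placeholder names instead of generic ones like "text" or "input", and the task-intent instruction placed after the knowledge input. The placeholder names below are illustrative assumptions.

```python
# Hypothetical RAG-style template: semantic placeholder names
# (retrieved_support_articles, customer_question) rather than "text"/"input",
# with the task-intent instruction AFTER the knowledge input, the positioning
# the study found beneficial for long inputs.
RAG_TEMPLATE = """Context documents:
{retrieved_support_articles}

Customer question:
{customer_question}

Answer the customer question using only the context documents above."""

def render(retrieved_support_articles: str, customer_question: str) -> str:
    return RAG_TEMPLATE.format(
        retrieved_support_articles=retrieved_support_articles,
        customer_question=customer_question,
    )
```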
Analysis of constraint types revealed that exclusion constraints are most common (46.0%), followed by inclusion constraints (35.6%) and word count constraints (10.5%). The researchers further classified exclusion constraints into subcategories:
These exclusion constraints serve as guardrails against hallucination and help narrow the generation space, which are critical concerns for production LLM deployments.
The study offers several actionable insights for production LLM systems:
For Template Maintenance: Prompt templates should adapt dynamically based on user feedback and usage analytics. Analyzing historical input patterns (lengths, content types) helps optimize placeholder positions and component ordering to prevent information decay in long prompts.
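One way to act on historical input lengths is to select a template variant at build time, moving the directive after the knowledge input when inputs are typically long (per the positioning finding above). The threshold and function names here are illustrative assumptions, not from the paper.

```python
from statistics import median

DIRECTIVE = "Summarize the document in three sentences."
LONG_INPUT_THRESHOLD = 2000  # characters; assumed cutoff for "long" inputs

def build_template(historical_input_lengths: list) -> str:
    """Pick a template layout based on observed input lengths."""
    if median(historical_input_lengths) > LONG_INPUT_THRESHOLD:
        # Long inputs: directive after the document, to resist
        # information decay in long prompts.
        return "Document:\n{document}\n\n" + DIRECTIVE
    # Short inputs: conventional directive-first layout.
    return DIRECTIVE + "\n\nDocument:\n{document}"
```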
For Model Selection and Cost Optimization: Well-designed prompt templates can significantly strengthen weaker models’ instruction-following abilities. In the long-input experiments, the output quality boost achieved with optimized templates for llama3-70b-8192 was nearly double that of gpt-4o. This suggests that developers should first consider redesigning prompt templates before switching to more expensive models.
For In-Context Learning Trade-offs: Fewer than 20% of production applications use few-shot examples in their templates. The researchers suggest that in-context learning is not a one-size-fits-all solution, and clearly defined prompt templates can sometimes outperform few-shot approaches while avoiding increased token costs and potential semantic contamination.
For LLM API Providers: The study recommends that providers offer pre-defined templates for common tasks and automated template evaluation tools that compare outputs across different template patterns and provide explainability for optimization recommendations.
The researchers employed a rigorous validation methodology combining LLM-assisted analysis with human review. Component identification using llama3-70b-8192 achieved 86% precision at the component level and 66% full-match precision at the prompt level (99% partial match). Placeholder classification achieved 81% accuracy after an iterative refinement process. Human evaluators with programming experience independently reviewed LLM-generated classifications, with final scores being averages of their assessments.
While this study provides valuable empirical insights, several limitations should be noted. The dataset is derived from open-source repositories, which may not fully represent proprietary production systems. The quality filtering based on GitHub stars and recent updates, while reasonable, may exclude valid but less popular applications. Additionally, the testing was conducted on specific model versions (llama3-70b-8192 and gpt-4o as of the study period), and results may vary with newer model versions or different LLM families.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variation across frontier models, ranging from single-digit to roughly 80% accuracy. Notable error modes included tool-use failures (36% of conversations) and hallucinations drawn from pretrained domain knowledge; OpenAI models in particular hallucinated non-existent insurance products 15-45% of the time.
Stripe, which processes approximately 1.3% of global GDP, has evolved from traditional ML-based fraud detection to transformer-based foundation models for payments that score every transaction in under 100ms. The company built a domain-specific foundation model that treats charges as tokens and behavior sequences as context windows, ingesting tens of billions of transactions to power fraud detection and improving card-testing detection accuracy from 59% to 97% for large merchants. Stripe also launched the Agentic Commerce Protocol (ACP) jointly with OpenAI to standardize how agents discover and purchase from merchant catalogs. Internally, AI adoption has reached 8,500 employees using LLM tools daily, with 65-70% of engineers using AI coding assistants and significant productivity gains such as reducing payment-method integrations from 2 months to 2 weeks.
Smartling operates an enterprise-scale, AI-first agentic translation delivery platform serving major corporations such as Disney and IBM. The company addresses challenges around automation, centralization, compliance, brand consistency, and handling diverse content types across global markets. Its solution employs multi-step agentic workflows in which different model functions validate each other's outputs, combining neural machine translation with large language models, RAG for accessing validated linguistic assets, sophisticated prompting, and automated post-editing for hyper-localization. The platform demonstrates measurable improvements in throughput (from 2,000 to 6,000-7,000 words per day), cost (4-10x cheaper than human translation), and quality approaching 70% human parity for certain language pairs and content types, while maintaining enterprise requirements for repeatability, compliance, and brand-voice consistency.