This comprehensive study examines the challenges that 26 professional software engineers faced in building AI-powered product copilots. The research reveals significant pain points across the entire engineering process, including prompt engineering difficulties, orchestration challenges, testing limitations, and safety concerns. It also highlights the need for better tooling, standardized practices, and integrated workflows for developing AI-first applications.
This academic research paper from Microsoft and GitHub presents findings from a mixed-methods study involving 26 professional software engineers who are actively building AI-powered “copilot” products. The study was conducted in late 2023 and provides a comprehensive examination of the real-world challenges faced when integrating Large Language Models (LLMs) into production software systems. Unlike marketing materials or vendor documentation, this is an empirical research study with systematic methodology, making it a valuable source for understanding the actual state of LLMOps practices across the industry.
The term “copilot” in this context refers broadly to any software system that translates user actions into prompts for an LLM and transforms the outputs into suitable formats for user interaction. Examples include GitHub Copilot for code generation, Windows Copilot for OS interactions, and Microsoft 365 Copilot for productivity applications.
The researchers recruited participants through two mechanisms: internal Microsoft engineers working on publicly announced Copilot products (14 participants) and external engineers from various companies recruited via UserInterviews.com (12 participants). Importantly, they screened out engineers with extensive data science or ML backgrounds to be representative of general software engineers encountering AI integration for the first time. They also excluded engineers who merely used AI tools rather than integrating them into products.
The study combined semi-structured interviews (45 minutes each) with structured brainstorming sessions to both identify pain points and collaboratively explore potential solutions. This balanced approach helps mitigate the inherent biases in each methodology.
The study found that prompt engineering was fundamentally different from typical software engineering processes, with participants describing it as “more of an art than a science.” Several key challenges emerged:
Trial and Error Nature: Engineers started in ad hoc environments like OpenAI’s playground, bouncing between different tools based on availability. The process was described as “stumbling around” and “playing around with prompts” without structured guidance. As one participant noted, “Experimenting is the most time-consuming if you don’t have the right tools.”
Output Wrangling: Getting consistent, machine-readable output proved extremely difficult. Engineers attempted various tactics like providing JSON schemas for responses, but discovered “a million ways you can effect it.” The models would sometimes generate malformed outputs, hallucinate stop tokens, or produce inconsistent formatting. An interesting finding was that working with the model’s natural output tendencies (like ASCII tree representations for file structures) yielded better results than forcing specific formats.
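One common way to cope with malformed output is to validate every response before using it. The following is a minimal sketch, not a method taken from the study: the `REQUIRED_KEYS` schema and the `parse_model_output` helper are hypothetical names chosen for illustration, and the brace-trimming step assumes the model wraps a single JSON object in extra prose.

```python
import json
from typing import Optional

# Hypothetical schema: the keys and types a downstream component expects.
REQUIRED_KEYS = {"action": str, "target": str}


def parse_model_output(raw: str) -> Optional[dict]:
    """Return the parsed object if it is valid JSON with the expected keys, else None."""
    # Models sometimes surround JSON with prose; keep only the outermost braces.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        data = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    # Reject responses with missing keys or wrongly typed values.
    for key, expected_type in REQUIRED_KEYS.items():
        if not isinstance(data.get(key), expected_type):
            return None
    return data
```

A caller would typically retry the request or fall back to a default when the function returns `None`, rather than passing unchecked text downstream.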
Context and Token Management: Engineers struggled with providing appropriate context while staying within token limits. Participants described challenges in “distilling a really large dataset” and “selectively truncating” conversation history. Testing the impact of different prompt components on overall performance proved particularly difficult.
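The idea of "selectively truncating" conversation history can be sketched as keeping the most recent messages that fit a token budget. This is an illustrative assumption about one possible strategy, not the approach any participant described. `count_tokens` is a crude whitespace-based stand-in for a real tokenizer, and `truncate_history` is a hypothetical helper.

```python
from typing import List


def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


def truncate_history(system_prompt: str, history: List[str], budget: int) -> List[str]:
    """Keep the newest messages that fit in the budget left after the system prompt."""
    remaining = budget - count_tokens(system_prompt)
    kept: List[str] = []
    # Walk from newest to oldest so recent context is preserved first.
    for message in reversed(history):
        cost = count_tokens(message)
        if cost > remaining:
            break
        kept.append(message)
        remaining -= cost
    kept.reverse()  # restore chronological order
    return kept
```

Dropping the oldest messages first is only one policy; summarizing older turns or ranking messages by relevance are alternatives with different trade-offs.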
Asset Management: Prompts evolved into complex libraries of templates, examples, and fragments that needed to be dynamically assembled. While engineers kept these assets in version control, there was no systematic approach to tracking performance over time or validating the impact of changes.
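To illustrate what dynamically assembling prompt fragments might look like, here is a small sketch using Python's standard `string.Template`. The `FRAGMENTS` library, its contents, and both helper functions are hypothetical. The content hash is one assumed way to tie benchmark results to an exact prompt version, which the study notes teams lacked.

```python
import hashlib
from string import Template
from typing import List

# Hypothetical fragment library; real teams kept such assets in version control.
FRAGMENTS = {
    "role": "You are a helpful assistant for $product.",
    "format": "Answer in at most $max_sentences sentences.",
}


def assemble_prompt(fragment_names: List[str], **values) -> str:
    """Fill each named fragment with values and join them into one prompt."""
    parts = [Template(FRAGMENTS[name]).substitute(values) for name in fragment_names]
    return "\n".join(parts)


def prompt_version(fragment_names: List[str]) -> str:
    """Short hash of the unfilled fragments, usable as a key for tracking results."""
    joined = "\n".join(FRAGMENTS[name] for name in fragment_names)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()[:12]
```

Recording `prompt_version(...)` next to each evaluation score would make it possible to see whether a fragment change helped or hurt over time.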
Production copilots require sophisticated orchestration beyond simple single-turn interactions:
Intent Detection and Routing: Systems needed to first determine user intent from natural language inputs and then route to appropriate “skills” (like adding tests or generating documentation). After receiving model responses, additional processing was needed to interpret and apply the results appropriately.
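The detect-then-route pattern can be sketched as a lookup table from intents to skill handlers. In this simplified illustration, keyword matching stands in for the LLM-based intent detection a real copilot would use. The skill names echo the examples above, but every function and table here is hypothetical.

```python
from typing import Callable, Dict, Tuple


def add_tests(request: str) -> str:
    return f"[add_tests] {request}"


def generate_docs(request: str) -> str:
    return f"[generate_docs] {request}"


def fallback(request: str) -> str:
    return "Sorry, I can't help with that yet."


SKILLS: Dict[str, Callable[[str], str]] = {
    "add_tests": add_tests,
    "generate_docs": generate_docs,
}

# Keyword matching is a placeholder for a model-based intent classifier.
KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "add_tests": ("test",),
    "generate_docs": ("doc", "comment"),
}


def detect_intent(request: str) -> str:
    lowered = request.lower()
    for intent, words in KEYWORDS.items():
        if any(word in lowered for word in words):
            return intent
    return "unknown"


def route(request: str) -> str:
    """Dispatch the request to the matching skill, or to the fallback."""
    return SKILLS.get(detect_intent(request), fallback)(request)
```

Keeping an explicit fallback makes the gap between what users ask for and what the copilot supports visible instead of silently mishandled.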
Commanding Limitations: Engineers noted significant gaps between user expectations and actual copilot capabilities. Users expected copilots to perform any available product action, but the engineering effort required and safety concerns limited open-ended access.
Agent-Based Approaches: Some teams explored agent-based architectures for more complex workflows and multi-turn interactions. While more powerful, these approaches were described as having behaviors that are “really hard to manage and steer.” Models struggled with recognizing task completion and often got “stuck in loops or went really far off track.”
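A simple guard against agents that loop or never finish is to cap the number of steps and stop when the same action keeps recurring. The sketch below is an assumption about how such a guard could look, not a technique reported in the study. `run_agent`, the `"DONE"` sentinel, and the `next_action` callback (which stands in for a model call) are all hypothetical.

```python
from typing import Callable, List, Tuple


def run_agent(
    next_action: Callable[[List[str]], str],
    max_steps: int = 10,
    max_repeats: int = 2,
) -> Tuple[str, List[str]]:
    """Drive an agent until it signals completion, repeats itself, or hits the step cap."""
    history: List[str] = []
    for _ in range(max_steps):
        action = next_action(history)  # stand-in for asking the model what to do next
        if action == "DONE":
            return "completed", history
        # The same action proposed again after max_repeats uses suggests a loop.
        if history.count(action) >= max_repeats:
            return "stuck_in_loop", history
        history.append(action)
    return "step_limit_reached", history
```

The returned status lets the surrounding product decide whether to retry, ask the user for guidance, or abandon the task.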
Perhaps the most significant LLMOps challenge identified was testing non-deterministic systems:
Flaky Tests Everywhere: Traditional unit testing approaches broke down because each model response could differ. One participant described running “each test 10 times” and only considering it passed if 7 of 10 instances succeeded. Engineers maintained manually curated spreadsheets with hundreds of input/output examples, with multiple acceptable outputs per input. Some teams adopted metamorphic testing approaches focusing on structural properties rather than exact content.
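The "7 of 10" rule described by the participant can be expressed as a small helper that reruns a check and applies a pass threshold. The second function sketches a structural check of the kind used in property-focused testing. Both function names, and the choice of a bulleted-list property, are hypothetical illustrations.

```python
from typing import Callable


def passes_with_threshold(check: Callable[[], bool], runs: int = 10, required: int = 7) -> bool:
    """Run a non-deterministic check several times; pass if enough runs succeed."""
    successes = sum(1 for _ in range(runs) if check())
    return successes >= required


def has_expected_structure(output: str) -> bool:
    """Structural property: output is non-empty and every non-blank line starts with '- '."""
    lines = [line for line in output.splitlines() if line.strip()]
    return bool(lines) and all(line.startswith("- ") for line in lines)
```

In practice `check` would call the model and assert on its response, for example `lambda: has_expected_structure(call_model(prompt))`, where `call_model` is whatever client function a team already has.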
Benchmark Creation: No standardized benchmarks existed, forcing each team to create their own. Building manually labeled datasets was described as “mind-numbingly boring and time-consuming,” often requiring outsourcing. One team labeled approximately 10,000 responses externally.
Cost and Resource Constraints: Running benchmarks through LLM endpoints introduced significant costs (“each test would probably cost 1-2 cents to run, but once you end up with a lot of them, that will start adding up”). Some teams were asked to stop automated testing due to costs or interference with production endpoints.
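A back-of-the-envelope calculation shows how quickly these costs accumulate. The per-call cost of 1 to 2 cents comes from the participant quote. The suite size (500 tests) and the 10 repetitions per test are assumed figures chosen only for illustration.

```python
def benchmark_cost_cents(num_tests: int, runs_per_test: int, cents_per_call: int) -> int:
    """Total cost in cents of one full benchmark run."""
    return num_tests * runs_per_test * cents_per_call


# Assumed suite: 500 tests, each repeated 10 times to smooth out non-determinism.
low = benchmark_cost_cents(500, 10, 1)   # 5000 cents = $50 per run
high = benchmark_cost_cents(500, 10, 2)  # 10000 cents = $100 per run
```

Under these assumptions, running the suite on every commit would cost $50 to $100 each time, which helps explain why some teams were asked to scale back automated testing.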
Quality Thresholds: Determining what constitutes “good enough” performance remained elusive. Teams resorted to simple grading schemes (A, B, C, etc.) with averaging to mitigate biases, but lacked established guidelines.
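A letter-grade scheme with averaging might be implemented along the following lines. The grade-to-point mapping and the pass bar are assumptions made for this sketch, since the study reports no established guidelines for either.

```python
from statistics import mean
from typing import List

# Assumed mapping from letter grades to points; teams would choose their own scale.
GRADE_POINTS = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}


def average_grade(grades: List[str]) -> float:
    """Average several reviewers' letter grades for one response."""
    return mean(GRADE_POINTS[grade] for grade in grades)


def meets_bar(grades: List[str], bar: float = 3.0) -> bool:
    """Treat a response as good enough if its average grade reaches the bar."""
    return average_grade(grades) >= bar
```

Averaging across reviewers dampens individual bias, but the choice of `bar` remains a judgment call, which is exactly the difficulty participants described.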
The study highlighted significant concerns around responsible AI deployment:
Safety Guardrails: Engineers described the challenge of preventing off-topic or harmful conversations. One participant noted the stakes: “Windows runs in nuclear power plants.” Content filtering on managed endpoints was sometimes insufficient, requiring additional rule-based classifiers and manual blocklists.
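A manual blocklist layered on top of a managed endpoint's content filter could look like the sketch below. The blocklist entries are placeholders, and `guarded_reply` with its `generate` callback (standing in for the model call) is a hypothetical helper, not an API from any product named in the study.

```python
import re
from typing import Callable

# Placeholder terms; a real blocklist would be curated and reviewed by the team.
BLOCKLIST = ("example_banned_term", "another_banned_term")
REFUSAL = "Sorry, I can't help with that."

_pattern = re.compile(
    "|".join(r"\b" + re.escape(term) + r"\b" for term in BLOCKLIST),
    re.IGNORECASE,
)


def is_blocked(text: str) -> bool:
    """True if the text contains any blocklisted term as a whole word."""
    return _pattern.search(text) is not None


def guarded_reply(user_message: str, generate: Callable[[str], str]) -> str:
    """Check the input before calling the model and the output before returning it."""
    if is_blocked(user_message):
        return REFUSAL
    reply = generate(user_message)
    return REFUSAL if is_blocked(reply) else reply
```

Checking both directions matters: a harmless-looking prompt can still elicit output that should not reach the user.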
Privacy Constraints: Processing was needed to ensure outputs didn’t contain identifiable information. Some organizations established partnerships with OpenAI for internally hosted models to avoid data ingestion policies that posed compliance risks.
Telemetry Limitations: A catch-22 situation emerged where telemetry was needed to understand user interactions, but privacy constraints prevented logging user prompts. Teams could see what skills were used but not what users actually asked.
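The constraint described here, where teams can see which skill ran but not what the user typed, can be sketched as a telemetry event builder that never copies prompt text into the record. The field names and the `telemetry_event` helper are hypothetical, and what may be logged in a real product depends on its own privacy review.

```python
import json
from typing import Any, Dict


def telemetry_event(skill: str, prompt: str, latency_ms: int, succeeded: bool) -> Dict[str, Any]:
    """Build a log record that describes the interaction without storing the prompt."""
    # The prompt is accepted as an argument but deliberately left out of the event.
    return {
        "skill": skill,
        "latency_ms": latency_ms,
        "succeeded": succeeded,
    }


def serialize(event: Dict[str, Any]) -> str:
    """Stable JSON encoding suitable for a log line."""
    return json.dumps(event, sort_keys=True)
```

The record answers "which skills are used and do they work" while leaving open the question the study highlights: what users were actually trying to do.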
Responsible AI Assessments: These reviews were significantly more intensive than traditional security or privacy reviews, requiring multiple weeks of documentation and assessment work. One team needed to generate automated benchmarks covering hundreds of subcategories of potential harm before shipping.
The study documented significant challenges in building expertise:
Lack of Established Practices: Engineers described starting “from scratch” with no established learning paths. They relied heavily on social media communities, examples from others’ prompts, and even using GPT-4 itself to bootstrap understanding.
Knowledge Volatility: Investment in formal learning resources was limited because “the ecosystem is evolving quickly and moving so fast.” There was uncertainty about whether skills like prompt engineering would remain relevant.
Mindset Shift Required: Some engineers described a fundamental realization that they had to abandon deterministic thinking. As one participant stated: “You cannot expect deterministic responses, and that’s terrifying to a lot of people. There is no 100% right answer… The idea of testing is not what you thought it was.”
Tool Selection: While libraries like LangChain offered “basic building blocks and most rich ecosystem” with “clear-cut examples,” they were primarily useful for prototypes. Most participants did not adopt LangChain for actual products, citing the learning curve and preference for focusing on customer problems.
Integration Challenges: Getting frameworks running required piecing things together manually with “no consistent easy way to have everything up and running in one shot.” Behavioral discrepancies between different model hosts added complexity.
Missing Unified Workflow: There was “no one opinionated workflow” that integrated prompt engineering, orchestration, testing, benchmarking, and telemetry.
The study identified several areas for tool improvement:
The researchers acknowledge several limitations: reliance on participant recall, potential for responses reflecting ideal practices rather than actual behavior, and findings that may be specific to the professional contexts and model capabilities available at the time. As models evolve, some challenges may dissipate while new ones emerge.
This study provides valuable empirical grounding for understanding LLMOps challenges, moving beyond anecdotal evidence to systematic documentation of pain points across the production lifecycle.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
Yahoo! Finance built a production-scale financial question answering system using a multi-agent architecture to address the information asymmetry between retail and institutional investors. The system leverages Amazon Bedrock AgentCore and employs a supervisor-subagent pattern where specialized agents handle structured data (stock prices, financials), unstructured data (SEC filings, news), and various APIs. The solution processes heterogeneous financial data from multiple sources, handles temporal complexities of fiscal years, and maintains context across sessions. Through a hybrid evaluation approach combining human and AI judges, the system achieves strong accuracy and coverage metrics while processing queries in 5-50 seconds at costs of 2-5 cents per query, demonstrating production viability at scale with support for 100+ concurrent users.
Slack's Developer Experience team embarked on a multi-year journey to integrate generative AI into their internal development workflows, moving from experimental prototypes to production-grade AI assistants and agentic systems. Starting with Amazon SageMaker for initial experimentation, they transitioned to Amazon Bedrock for simplified infrastructure management, achieving a 98% cost reduction. The team rolled out AI coding assistants using Anthropic's Claude Code and Cursor integrated with Bedrock, resulting in 99% developer adoption and a 25% increase in pull request throughput. They then evolved their internal knowledge bot (Buddybot) into a sophisticated multi-agent system handling over 5,000 escalation requests monthly, using AWS Strands as an orchestration framework with Claude Code sub-agents, Temporal for workflow durability, and MCP servers for standardized tool access. The implementation demonstrates a pragmatic approach to LLMOps, prioritizing incremental deployment, security compliance (FedRAMP), observability through OpenTelemetry, and maintaining model agnosticism while scaling to millions of tokens per minute.