ZenML

Managing Memory and Scaling Issues in Production AI Agent Systems

Gradient Labs 2025
Gradient Labs experienced a series of interconnected production incidents involving their AI agent deployed on Google Cloud Run. The incidents started with memory usage alerts that initially looked like a memory leak. The team traced the root cause to Temporal workflow cache sizing, which was causing container crashes, and resolved it by tuning cache parameters. However, the fix inadvertently caused an auto-scaling problem that throttled the system's ability to execute activities and increased latency. The incidents highlight the complex interdependencies in production AI systems and the need for careful optimization across all infrastructure layers.

Industry

Tech

This case study from Gradient Labs provides a detailed technical account of production incidents involving their AI agent system, offering valuable insights into the operational challenges of running LLM-based applications at scale. The company operates an AI agent that participates in customer conversations, built using Go and deployed on Google Cloud Run with Temporal workflows managing conversation state and response generation.

System Architecture and Initial Problem

Gradient Labs’ AI agent architecture demonstrates several key LLMOps principles in practice. Their system uses long-running Temporal workflows to manage conversation state, timers, and child workflows for response generation. This design pattern is particularly relevant for conversational AI applications where maintaining context and state across extended interactions is critical. The use of Temporal as a workflow orchestration engine reflects a mature approach to handling the complex, multi-step processes typical in production LLM applications.

The initial incident began with memory usage alerts from Google Cloud Platform, indicating abnormally high memory consumption across agent containers. This type of monitoring is essential in LLMOps environments where resource usage can be highly variable and unpredictable. The team’s immediate priority of ensuring no customers were left without responses demonstrates the customer-first approach necessary when operating AI systems in production environments.

Investigation and Root Cause Analysis

The investigation process reveals several important aspects of LLMOps troubleshooting. Initially suspecting a classic memory leak, the team systematically examined memory-intensive components of their AI agent, particularly those handling variable-sized documents and parallel processing operations. This methodical approach is crucial in LLMOps where multiple components can contribute to resource consumption issues.

The breakthrough came through Google Cloud Profiler flame graphs, which identified that Temporal’s top-level execution functions were experiencing the largest growth in exclusive memory usage over time. This led to the discovery that the Temporal Workflow cache, which stores workflow execution histories to avoid repeated retrievals from Temporal Cloud, was the source of the problem. The cache serves an important optimization function by reducing network calls and improving latency, but when improperly sized relative to available memory, it can cause container crashes.
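To make the cache's role concrete, here is a minimal sketch of a fixed-capacity LRU cache that counts simulated round trips on a miss. It is not Temporal's implementation; the names `historyCache`, `entry`, and `replay` are hypothetical, and the "fetch" is a stand-in for retrieving a workflow history from the server.

```go
package main

import (
	"container/list"
	"fmt"
)

// entry is one cached workflow history (hypothetical shape, for illustration).
type entry struct {
	workflowID string
	history    []string
}

// historyCache is a hypothetical stand-in for a worker-side workflow cache:
// a fixed-capacity LRU keyed by workflow ID.
type historyCache struct {
	capacity int
	order    *list.List               // front = most recently used
	items    map[string]*list.Element // workflow ID -> element in order
	fetches  int                      // simulated round trips to the server
}

func newHistoryCache(capacity int) *historyCache {
	return &historyCache{
		capacity: capacity,
		order:    list.New(),
		items:    map[string]*list.Element{},
	}
}

// get returns the cached history, or simulates a network fetch on a miss
// and evicts the least recently used entry if the cache is over capacity.
func (c *historyCache) get(workflowID string) []string {
	if el, ok := c.items[workflowID]; ok {
		c.order.MoveToFront(el)
		return el.Value.(*entry).history
	}
	c.fetches++
	e := &entry{workflowID: workflowID, history: []string{"WorkflowStarted:" + workflowID}}
	c.items[workflowID] = c.order.PushFront(e)
	if c.order.Len() > c.capacity {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).workflowID)
	}
	return e.history
}

// replay touches each workflow in ids, in order, and reports how many
// simulated fetches were needed with the given cache capacity.
func replay(capacity int, ids []string) int {
	c := newHistoryCache(capacity)
	for _, id := range ids {
		c.get(id)
	}
	return c.fetches
}

func main() {
	ids := []string{"a", "b", "c", "a", "b", "c"}
	fmt.Println("fetches with capacity 3:", replay(3, ids)) // 3
	fmt.Println("fetches with capacity 2:", replay(2, ids)) // 6
}
```

With three active workflows, a capacity of 3 needs only the three initial fetches, while a capacity of 2 misses on every access. A larger cache saves network calls but holds more histories in memory, which is the tension the team ran into.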

Technical Resolution and Trade-offs

The resolution involved a classic LLMOps trade-off between infrastructure costs and system performance. By reducing the worker cache size by a factor of ten, the team brought memory usage down to a sustainable level, accepting in exchange potentially more network calls to Temporal Cloud and slightly higher latency. This decision exemplifies the optimization challenges common in production LLM systems, where multiple competing objectives must be balanced.
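The sizing decision can be approximated with simple arithmetic. The sketch below estimates how many cached workflows fit in a container's memory budget. Every number and name here (`maxCacheEntries`, the 1 GiB limit, the 512 KiB per-workflow footprint, the 25% headroom) is an illustrative assumption, not a figure from the case study. In the Temporal Go SDK the resulting value would typically be applied through `worker.SetStickyWorkflowCacheSize`; the source does not say which values the team used.

```go
package main

import "fmt"

// maxCacheEntries estimates how many cached workflows fit in a container.
// limitBytes is the container memory limit, baselineBytes is memory used by
// everything except the cache, perWorkflowBytes is the assumed average
// footprint of one cached workflow, and headroom is the fraction of the
// limit to keep free (0 <= headroom < 1).
// All inputs are illustrative assumptions, not figures from the case study.
func maxCacheEntries(limitBytes, baselineBytes, perWorkflowBytes int64, headroom float64) int64 {
	if perWorkflowBytes <= 0 || headroom < 0 || headroom >= 1 {
		return 0
	}
	usable := int64(float64(limitBytes)*(1-headroom)) - baselineBytes
	if usable <= 0 {
		return 0
	}
	return usable / perWorkflowBytes
}

func main() {
	const MiB = int64(1 << 20)
	// 1 GiB limit, 256 MiB baseline, 512 KiB per cached workflow, 25% headroom.
	n := maxCacheEntries(1024*MiB, 256*MiB, 512*1024, 0.25)
	fmt.Println("cache entries that fit:", n) // 1024
}
```

The per-workflow footprint is the hardest input to pin down, since conversation histories vary in size; measuring it with a profiler is safer than guessing.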

The validation process they employed (first increasing memory 5x to observe plateau behavior, then reducing the cache size to confirm the hypothesis) demonstrates a systematic approach to production troubleshooting. A plateau points to a bounded cache filling up, whereas a true leak would keep growing. Such rigor is essential in LLMOps environments, where changes can have complex, non-obvious effects.
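The plateau check can be expressed as a small heuristic. The sketch below is one assumed way to tell levelling-off from continued growth in a series of memory samples; the function name `plateaued`, the final-third window, and the 5% tolerance are hypothetical choices, not the team's method.

```go
package main

import "fmt"

// plateaued reports whether memory growth over the final third of the
// samples is below tolerance, expressed as a fraction of the peak value.
// It needs at least six samples so the final window has two or more points.
func plateaued(samples []float64, tolerance float64) bool {
	if len(samples) < 6 {
		return false
	}
	peak := samples[0]
	for _, s := range samples {
		if s > peak {
			peak = s
		}
	}
	if peak <= 0 {
		return false
	}
	tail := samples[len(samples)*2/3:]
	growth := tail[len(tail)-1] - tail[0]
	return growth/peak < tolerance
}

func main() {
	cacheLike := []float64{100, 300, 450, 500, 505, 506} // MiB, levels off
	leakLike := []float64{100, 200, 300, 400, 500, 600}  // MiB, keeps climbing
	fmt.Println("cache-like plateaued:", plateaued(cacheLike, 0.05)) // true
	fmt.Println("leak-like plateaued:", plateaued(leakLike, 0.05))   // false
}
```

A heuristic like this only works if the container has enough memory for the plateau to appear before it crashes, which is why raising the limit first was a useful experiment.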

Cascading Effects and Auto-scaling Challenges

The case study becomes particularly instructive when describing the cascading effects of the initial fix. After resolving the memory issue, the team observed increased AI agent latency across different model providers and prompt types. This cross-cutting impact initially suggested external LLM provider issues, highlighting how production AI systems are dependent on multiple external services and the importance of comprehensive monitoring across the entire stack.

The actual cause proved to be an auto-scaling problem created by their memory fix. Google Cloud Run’s auto-scaling mechanism relies on HTTP requests, event consumption, and CPU utilization metrics. Before the fix, container crashes from the memory issue had been partly responsible for keeping instance counts up. Once the crashes stopped, Cloud Run scaled down the instance count, creating a bottleneck in Temporal activity execution.

This reveals a critical insight for LLMOps practitioners: fixing one issue can inadvertently create others, particularly in cloud-native environments with auto-scaling. The team’s AI agent, implemented as Temporal workflows that poll rather than receive HTTP requests, didn’t trigger the typical auto-scaling signals, leading to under-provisioning once the “artificial” scaling driver (container crashes) was removed.
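When polling workers do not generate the signals an autoscaler watches, one option is to set an explicit instance floor sized from the workload itself. The source does not describe how the team fixed the scaling issue, so the sketch below is an assumption: it uses Little's law (concurrent work = arrival rate x mean duration) to estimate a minimum instance count. The function name `minInstances` and all example numbers are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// minInstances estimates how many instances are needed to keep up with
// activity load. By Little's law, the average number of activities in
// flight equals the arrival rate times the mean activity duration.
// slotsPerInstance is how many activities one instance runs concurrently,
// and targetUtilization (0 < u <= 1) leaves spare capacity for bursts.
func minInstances(activitiesPerSec, meanDurationSec float64, slotsPerInstance int, targetUtilization float64) int {
	if slotsPerInstance <= 0 || targetUtilization <= 0 || targetUtilization > 1 {
		return 0
	}
	concurrent := activitiesPerSec * meanDurationSec
	if concurrent <= 0 {
		return 0
	}
	perInstance := float64(slotsPerInstance) * targetUtilization
	return int(math.Ceil(concurrent / perInstance))
}

func main() {
	// 40 activities/s at 3 s each, 20 slots per instance, 75% target utilization.
	fmt.Println("minimum instances:", minInstances(40, 3, 20, 0.75)) // 8
}
```

An estimate like this could feed a platform setting such as Cloud Run's minimum instance count, so that capacity no longer depends on incidental drivers such as container restarts.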

LLMOps Best Practices Demonstrated

Several LLMOps best practices emerge from this case study. The team maintained comprehensive monitoring across platform and agent metrics with appropriate alerting thresholds. Their incident response process prioritized customer experience above all else, ensuring no conversations were abandoned during the investigation and resolution process.

The systematic approach to troubleshooting, using profiling tools and methodical hypothesis testing, demonstrates the kind of engineering rigor necessary for production LLM systems. The team’s willingness to make intentional changes with clear hypotheses and measurable outcomes reflects mature operational practices.

Multi-layered System Complexity

The case study illustrates the multi-layered nature of production AI systems, where optimization is required across prompts, LLM providers, databases, containers, and orchestration layers. Each layer has its own performance characteristics and failure modes, and changes in one layer can have unexpected effects on others. This complexity is particularly pronounced in LLMOps environments, where the AI components add further variables to traditional infrastructure challenges.

The team’s use of multiple LLM providers with failover capabilities demonstrates another important LLMOps pattern: building resilience through provider diversity. Their ability to quickly adjust LLM failover systems when providers experienced outages shows the operational agility required for production AI systems.
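The failover pattern can be sketched as an ordered list of providers tried in turn. This is a minimal illustration, not Gradient Labs' implementation; the `provider` type and the `failing` and `echoing` helpers are hypothetical stand-ins for real LLM API clients.

```go
package main

import (
	"errors"
	"fmt"
)

// provider is a hypothetical abstraction over an LLM API client.
type provider struct {
	name     string
	complete func(prompt string) (string, error)
}

// failing returns a provider that always errors, simulating an outage.
func failing(name string) provider {
	return provider{name: name, complete: func(string) (string, error) {
		return "", errors.New("simulated outage")
	}}
}

// echoing returns a provider that always succeeds by echoing the prompt.
func echoing(name string) provider {
	return provider{name: name, complete: func(prompt string) (string, error) {
		return "echo: " + prompt, nil
	}}
}

// completeWithFailover tries each provider in order and returns the name
// and reply of the first one that succeeds. If all fail, it returns an
// error wrapping the last failure.
func completeWithFailover(providers []provider, prompt string) (string, string, error) {
	if len(providers) == 0 {
		return "", "", errors.New("no providers configured")
	}
	var lastErr error
	for _, p := range providers {
		reply, err := p.complete(prompt)
		if err == nil {
			return p.name, reply, nil
		}
		lastErr = fmt.Errorf("%s: %w", p.name, err)
	}
	return "", "", fmt.Errorf("all providers failed, last error: %w", lastErr)
}

func main() {
	providers := []provider{failing("primary"), echoing("secondary")}
	name, reply, err := completeWithFailover(providers, "hi")
	fmt.Println(name, reply, err) // secondary echo: hi <nil>
}
```

Keeping the provider order in configuration rather than in code is one way to make the quick adjustments described above possible during an outage.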

Monitoring and Observability Insights

The incidents highlight the importance of comprehensive observability in LLMOps environments. The team tracked metrics for memory usage, latency, activity execution rates, and provider performance. The use of flame graphs for memory profiling and the correlation of scaling metrics with performance degradation demonstrate the kind of deep observability required for complex AI systems.

The challenge of variable traffic patterns and the difficulty of isolating changes during high-activity periods reflect real-world operational conditions in LLMOps environments, where demand can be unpredictable and debugging must often occur while systems are under load.

Resource Management and Cost Optimization

The case study touches on several cost optimization considerations relevant to LLMOps. The trade-off between cache size and infrastructure costs, the balance between memory provisioning and network calls, and the auto-scaling configuration all represent ongoing optimization challenges. These decisions become more complex in AI systems where resource usage patterns may be less predictable than traditional applications.

The team’s approach of buying time by redeploying with more memory during the initial incident shows pragmatic incident management: sometimes the immediate fix isn’t the optimal long-term solution, but maintaining system availability while conducting a thorough investigation is the right approach.

While this case study is presented by Gradient Labs themselves and may emphasize their competence in handling these issues, the technical details provided appear credible and the challenges described are consistent with real-world LLMOps operational experiences. The systematic approach to problem-solving and the honest discussion of how one fix created another issue lend credibility to their account.
