New Relic, a major observability platform processing 7 petabytes of data daily, implemented GenAI both internally for developer productivity and externally in their product offerings. They achieved a 15% increase in developer productivity through targeted GenAI implementations, while also developing sophisticated AI monitoring capabilities and natural language interfaces for their customers. Their approach balanced cost, accuracy, and performance through a mix of RAG, multi-model routing, and classical ML techniques.
This case study comes from a joint presentation at AWS re:Invent featuring Jeffrey Hammond (AWS) and Suraj Krishnan (Group Vice President of Engineering, Telemetry Data Platform at New Relic). The presentation covers both broad patterns AWS observes across ISVs implementing generative AI and a deep dive into New Relic’s specific journey—both for internal productivity and product-embedded AI capabilities.
New Relic is an AI-first observability platform serving approximately 85,000 customers with mission-critical monitoring needs. The platform ingests about seven petabytes of data daily with billions of events, running what they claim to be one of the largest streaming platforms globally, backed by their proprietary New Relic Database (NRDB) and S3-based relational storage architecture.
New Relic began their generative AI journey approximately two years before the presentation (around 2023), notably before “agents” became a mainstream concept. Their approach centered on “proof of value” rather than just “proof of concept”—a critical distinction that guided their prioritization of use cases.
The primary internal use case targeted developer productivity, recognizing developers as the scarcest resource. They identified several key bottlenecks:
Code Reviews: Traditional code review processes created significant delays, with teams waiting days or weeks for principal or senior engineers to review code. This represents a major bottleneck in most engineering organizations.
Engineering Standards Enforcement: Rather than relying on senior developers to manually teach coding standards to junior team members, they automated standards compliance checking.
Code Generation: They found the most value not in direct code generation but specifically in test generation and refactoring of legacy code—a nuanced finding that challenges common assumptions about where generative AI adds the most value.
New Relic conducted rigorous cohort analysis across engineering levels (P1: entry-level, P2: mid-level, P3: senior). Their hypothesis that generative AI would benefit newer engineers more than senior ones was tested through pre- and post-implementation calibration. The results showed an average productivity increase of approximately 15% across all levels, with the data confirming that junior engineers saw the greater benefit.
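The cohort comparison described above can be sketched as a simple pre/post analysis per engineering level. This is a minimal illustration, not New Relic's actual methodology or data; the metric, cohort names, and numbers are invented.

```python
# Hypothetical sketch of a pre/post cohort analysis: compare a productivity
# metric (e.g., merged PRs per sprint) before and after a GenAI rollout,
# broken down by engineering level. All data is illustrative.
from statistics import mean

def productivity_lift(pre, post):
    """Relative change in the mean metric after rollout."""
    return (mean(post) - mean(pre)) / mean(pre)

# Illustrative per-engineer metrics by cohort (not real New Relic data).
cohorts = {
    "P1_entry":  {"pre": [4.0, 5.0, 4.5], "post": [5.2, 6.0, 5.6]},
    "P2_mid":    {"pre": [6.0, 6.5, 7.0], "post": [7.0, 7.4, 7.8]},
    "P3_senior": {"pre": [8.0, 8.5, 9.0], "post": [8.6, 9.1, 9.6]},
}

for level, data in cohorts.items():
    lift = productivity_lift(data["pre"], data["post"])
    print(f"{level}: {lift:.1%} lift")
```

The key design point is measuring each cohort against its own baseline, so level-to-level differences in raw output don't distort the comparison.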
New Relic implemented what they call a “Nova agent”—a unified agent interface that orchestrates multiple specialized domain agents internally. This architecture was designed before the current agent hype cycle, reflecting forward-thinking architectural decisions. Key aspects include:
Co-pilot Mode vs. Free-form Mode: Agents can be explicitly invoked or operate automatically within workflows. For example, pull requests automatically trigger code review agents without manual invocation.
Asynchronous Communication: The agent architecture supports asynchronous operations—acknowledging requests, performing work (which may take minutes to hours), and returning results. This includes creating pull requests for other teams and awaiting approval.
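The acknowledge-then-work pattern described above can be sketched with a small asyncio example. The names and the task label are assumptions for illustration; New Relic's actual agent implementation is not public.

```python
# Minimal sketch (assumed names, not New Relic's implementation) of an
# asynchronous agent: it immediately acknowledges a request with a tracking
# ID, performs the work in the background, and stores the result for
# later retrieval by the caller.
import asyncio
import itertools

_ids = itertools.count(1)
results: dict[int, str] = {}

async def do_work(request_id: int, task: str) -> None:
    # Stand-in for long-running work (code review, opening a PR, etc.).
    await asyncio.sleep(0.01)
    results[request_id] = f"done: {task}"

async def submit(task: str) -> int:
    """Acknowledge the request right away; finish asynchronously."""
    request_id = next(_ids)
    asyncio.create_task(do_work(request_id, task))
    return request_id

async def main() -> None:
    rid = await submit("review pull request")  # hypothetical task
    print(f"acknowledged request {rid}")       # caller is unblocked here
    await asyncio.sleep(0.05)                  # later: poll for the result
    print(results[rid])

asyncio.run(main())
```

In production this pattern typically uses a durable queue and a results store rather than in-process dictionaries, but the contract is the same: acknowledge fast, deliver later.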
External Integrations: The architecture includes integration with Amazon Q, allowing unified querying across AWS resources through a single agent interface rather than multiple separate tools.
New Relic has approximately 100 engineering teams, each with engineers on rotation for “hero work” (on-call support). They classified all tickets and activities, identifying five use cases where generative AI could handle requests automatically. The agent picks up engineer requests asynchronously, performs the work, and can even create pull requests in other teams’ repositories.
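The triage step above — classifying tickets and routing a handful of automatable categories to the agent — can be sketched as follows. The five categories and their keywords are invented for illustration; the presentation does not name New Relic's actual categories.

```python
# Hedged sketch of routing on-call ("hero work") tickets: classify each
# request and hand automatable categories to an agent, escalating
# everything else to the rotation engineer. Categories are illustrative.
AUTOMATABLE = {
    "dependency_bump": ("upgrade", "bump", "dependency"),
    "config_change":   ("config", "flag", "toggle"),
    "access_request":  ("access", "permission", "grant"),
    "doc_fix":         ("typo", "docs", "readme"),
    "test_flake":      ("flaky", "retry", "intermittent"),
}

def classify(ticket: str) -> str:
    text = ticket.lower()
    for category, keywords in AUTOMATABLE.items():
        if any(k in text for k in keywords):
            return category
    return "human_escalation"

def route(ticket: str) -> str:
    category = classify(ticket)
    if category == "human_escalation":
        return "assign to on-call engineer"
    # e.g., the agent opens a pull request in the owning team's repo
    return f"agent handles '{category}' asynchronously"

print(route("Please bump the logging dependency to v2"))
print(route("Production database is down"))
```

A real system would use an ML or LLM classifier rather than keywords; the point is the routing contract, where anything outside the known-safe categories falls back to a human.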
For their mission-critical environment, incident management automation became a key investment area.
A particularly impactful use case involved cloud cost monitoring and optimization.
New Relic’s product platform architecture has three layers: Platform of Record (data ingestion and knowledge platform), Platform of Intelligence (probabilistic and deterministic engines), and Platform of Action (anomaly alerts, recommendations, NL queries).
A key product capability addresses customer onboarding friction. While experienced users love NRQL (New Relic Query Language), new customers face a learning curve. The platform now accepts natural language queries and uses prompt engineering to translate them to NRQL, democratizing access to the platform’s capabilities.
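The prompt-engineering step behind this translation can be sketched as a few-shot prompt that an LLM completes with NRQL. The template and example queries below are assumptions for illustration; New Relic's production prompts are not public.

```python
# Illustrative sketch of prompt engineering for natural-language-to-NRQL
# translation. The few-shot examples are hypothetical, not New Relic's.
FEW_SHOT = [
    ("average response time for the checkout service today",
     "SELECT average(duration) FROM Transaction "
     "WHERE appName = 'checkout' SINCE today"),
    ("error count by host in the last hour",
     "SELECT count(*) FROM TransactionError FACET host SINCE 1 hour ago"),
]

def build_prompt(question: str) -> str:
    """Assemble a few-shot prompt an LLM would complete with NRQL."""
    lines = ["Translate the user's question into a NRQL query.", ""]
    for q, nrql in FEW_SHOT:
        lines += [f"Question: {q}", f"NRQL: {nrql}", ""]
    lines += [f"Question: {question}", "NRQL:"]
    return "\n".join(lines)

print(build_prompt("slowest transactions this week"))
```

In practice the generated NRQL would be validated against the query language's grammar before execution, so a hallucinated query fails safely rather than running against customer data.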
New Relic also offers monitoring for customers' own AI implementations.
A critical observation is that during incidents, the alert that should exist often doesn't, while many irrelevant alerts fire. NR AI is designed to close this gap.
The presentation emphasizes moving beyond POCs to measurable value. Every AI implementation should have quantifiable success metrics before significant investment.
A critical insight: not everything requires generative AI. New Relic uses classical machine learning for capacity engineering in their cellular architecture (determining when to add or decommission cells). This approach is “easier, more impactful, and less costly” than generative approaches for forecasting use cases.
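The kind of forecasting described above can be done with a plain least-squares trend, no LLM involved. This is a minimal sketch of the idea, not New Relic's actual capacity model; the utilization data and threshold are illustrative.

```python
# Sketch of classical-ML-style capacity forecasting: fit a linear trend to
# recent cell utilization and project when a new cell is needed. Data and
# the 85% threshold are invented for illustration.
def linear_fit(ys):
    """Ordinary least-squares slope/intercept for y over t = 0..n-1."""
    n = len(ys)
    ts = range(n)
    t_mean = sum(ts) / n
    y_mean = sum(ys) / n
    slope = (sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys))
             / sum((t - t_mean) ** 2 for t in ts))
    return slope, y_mean - slope * t_mean

def days_until(ys, threshold):
    """Days until the projected utilization crosses the threshold."""
    slope, intercept = linear_fit(ys)
    if slope <= 0:
        return None  # flat or shrinking: no new cell needed
    return max(0.0, (threshold - intercept) / slope - (len(ys) - 1))

utilization = [0.52, 0.55, 0.57, 0.61, 0.63, 0.66, 0.69]  # daily samples
eta = days_until(utilization, 0.85)
print(f"add a cell in roughly {eta:.1f} days")
```

This is the sense in which a classical approach is "easier, more impactful, and less costly" for forecasting: it is deterministic, cheap to run on every cell, and trivially explainable.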
Echoing AWS's framework, New Relic emphasizes balancing cost, accuracy, and performance. Using the largest LLM for every use case is cost-prohibitive; sometimes a human is more cost-effective than an LLM.
The presentation advocates a pragmatic approach to model sophistication.
New Relic maintains 40-50 use cases in experimentation. They estimate approximately 15 will be powerful enough to productize, some will provide learnings but not production value, and some will need more time. Prototypes typically take two to three weeks, not months.
A balanced perspective on AI and employment: “People who use AI will take jobs of people who don’t use AI”—emphasizing AI as an empowerment tool that removes drudgery rather than wholesale job replacement. Both organic training and specialized expertise acquisition are necessary for successful implementation.
The presentation warns against over-broad scoping of AI use cases. Disappointment often results from expectations exceeding what current capabilities can deliver. Narrow, well-defined use cases with clear success criteria perform better.