
Observability Platform's Journey to Production GenAI Integration

New Relic 2023

New Relic, a major observability platform processing 7 petabytes of data daily, implemented GenAI both internally for developer productivity and externally in their product offerings. They achieved a 15% increase in developer productivity through targeted GenAI implementations, while also developing sophisticated AI monitoring capabilities and natural language interfaces for their customers. Their approach balanced cost, accuracy, and performance through a mix of RAG, multi-model routing, and classical ML techniques.

Industry

Tech


Overview

This case study comes from a joint presentation at AWS re:Invent featuring Jeffrey Hammond (AWS) and Suraj Krishnan (Group Vice President of Engineering, Telemetry Data Platform at New Relic). The presentation covers both broad patterns AWS observes across ISVs implementing generative AI and a deep dive into New Relic’s specific journey—both for internal productivity and product-embedded AI capabilities.

New Relic is an AI-first observability platform serving approximately 85,000 customers with mission-critical monitoring needs. The platform ingests about seven petabytes of data daily with billions of events, running what they claim to be one of the largest streaming platforms globally, backed by their proprietary New Relic Database (NRDB) and S3-based relational storage architecture.

Internal Generative AI Implementation

New Relic began their generative AI journey approximately two years before the presentation (around 2023), notably before “agents” became a mainstream concept. Their approach centered on “proof of value” rather than just “proof of concept”—a critical distinction that guided their prioritization of use cases.

Developer Productivity Focus

The primary internal use case targeted developer productivity, recognizing developers as their scarcest resource, and focused on the key bottlenecks in the development workflow.

Measured Productivity Results

New Relic conducted rigorous cohort analysis across engineering levels (P1: entry-level, P2: mid-level, P3: senior). Their hypothesis that generative AI would benefit newer engineers more than senior engineers was tested through pre- and post-implementation calibration. The results showed approximately 15% average productivity increase across all levels, with the data supporting their initial hypothesis about junior engineers seeing greater benefits.

Multi-Agent Architecture

New Relic implemented what they call a “Nova agent”—a unified agent interface that orchestrates multiple specialized domain agents internally. Notably, this architecture was designed before the current agent hype cycle, reflecting forward-looking architectural decisions.
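The orchestration pattern described above can be sketched as a single entry point that routes each request to the best-matching domain agent. This is a minimal illustration only; the class names, keyword-overlap routing, and agents below are assumptions, not New Relic's actual Nova implementation.

```python
# Hypothetical sketch of a unified "Nova"-style orchestrator over
# specialized domain agents. All names and routing logic are illustrative.
from dataclasses import dataclass
from typing import Callable

@dataclass
class DomainAgent:
    name: str
    keywords: set[str]            # naive routing signal for the sketch
    handle: Callable[[str], str]  # the agent's task handler

class NovaOrchestrator:
    """Single entry point that routes a request to a specialized agent."""

    def __init__(self, agents: list[DomainAgent]):
        self.agents = agents

    def route(self, request: str) -> str:
        words = set(request.lower().split())
        # Pick the agent whose domain keywords best overlap the request.
        best = max(self.agents, key=lambda a: len(a.keywords & words))
        return best.handle(request)

nova = NovaOrchestrator([
    DomainAgent("incidents", {"incident", "outage", "alert"},
                lambda r: f"[incident-agent] triaging: {r}"),
    DomainAgent("cost", {"cost", "cloud", "spend"},
                lambda r: f"[cost-agent] analyzing: {r}"),
])
print(nova.route("why did cloud spend spike yesterday?"))
```

A production router would replace the keyword heuristic with an LLM-based or learned classifier, but the unified-interface shape stays the same.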

Hero Work Automation

New Relic has approximately 100 engineering teams, each with engineers on rotation for “hero work” (on-call support). They classified all tickets and activities, identifying five use cases where generative AI could handle requests automatically. The agent picks up engineer requests asynchronously, performs the work, and can even create pull requests in other teams’ repositories.
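The asynchronous pickup described above can be sketched as a work queue plus a triage step that sends only the automatable categories to the agent. The five category names below are placeholders for illustration; New Relic's actual classification is not public.

```python
# Illustrative "hero work" triage: classified tickets are queued, and the
# agent picks up only the categories deemed automatable. The category
# names are hypothetical placeholders.
import queue

AUTOMATABLE = {"dependency-bump", "config-change", "doc-fix",
               "flaky-test-rerun", "access-request"}  # assumed, not real

def triage(ticket: dict) -> str:
    """Route a classified ticket to the agent or the on-call engineer."""
    return "agent" if ticket["category"] in AUTOMATABLE else "human"

work = queue.Queue()
for t in [{"id": 1, "category": "dependency-bump"},
          {"id": 2, "category": "prod-outage"}]:
    work.put(t)

while not work.empty():
    t = work.get()
    print(t["id"], "->", triage(t))
```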

Incident Management

In their mission-critical environment, incident management automation became a key investment area.

Cloud Cost Optimization

A particularly impactful use case involved cloud cost monitoring and optimization.

Product-Embedded AI Capabilities

New Relic’s product platform architecture has three layers: Platform of Record (data ingestion and knowledge platform), Platform of Intelligence (probabilistic and deterministic engines), and Platform of Action (anomaly alerts, recommendations, NL queries).

Natural Language Query Translation

A key product capability addresses customer onboarding friction. While experienced users love NRQL (New Relic Query Language), new customers face a learning curve. The platform now accepts natural language queries and uses prompt engineering to translate them to NRQL, democratizing access to the platform’s capabilities.
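A natural-language-to-NRQL translation layer of this kind typically boils down to a few-shot prompt plus an LLM completion call. The prompt wording and the `call_llm` stub below are assumptions for illustration; New Relic's actual pipeline is not public.

```python
# Minimal sketch of NL-to-NRQL translation via prompt engineering.
# The prompt text and example query are illustrative assumptions.
SYSTEM_PROMPT = """You translate questions into NRQL (New Relic Query
Language). Respond with a single NRQL statement and nothing else.

Example:
Q: average response time of the checkout service last hour
NRQL: SELECT average(duration) FROM Transaction
      WHERE appName = 'checkout' SINCE 1 hour ago
"""

def build_prompt(question: str) -> str:
    """Assemble the few-shot prompt sent to the model."""
    return f"{SYSTEM_PROMPT}\nQ: {question}\nNRQL:"

def translate(question: str, call_llm) -> str:
    # call_llm is any text-completion function (Bedrock, OpenAI, local...).
    return call_llm(build_prompt(question)).strip()

# Stubbed model so the sketch runs offline.
fake_llm = lambda prompt: "SELECT count(*) FROM TransactionError SINCE 1 day ago"
print(translate("how many errors in the last day?", fake_llm))
```

In practice the generated NRQL would be validated against the query grammar before execution, so a malformed completion fails safely instead of reaching the data platform.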

AI Monitoring for Customers

New Relic also offers monitoring for customers’ own AI implementations.

Intelligent Alert Management

A critical observation: during incidents, the alert that should exist often doesn’t, while many irrelevant alerts fire. New Relic’s NR AI capabilities address this gap.

Key Lessons and LLMOps Insights

Proof of Value Over Proof of Concept

The presentation emphasizes moving beyond POCs to measurable value. Every AI implementation should have quantifiable success metrics before significant investment.

Classical ML vs. Generative AI Balance

A critical insight: not everything requires generative AI. New Relic uses classical machine learning for capacity engineering in their cellular architecture (determining when to add or decommission cells). This approach is “easier, more impactful, and less costly” than generative approaches for forecasting use cases.
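The forecasting-style decision described above needs nothing more than classical regression. The sketch below uses a plain least-squares trend to decide whether a cell should be added; the thresholds and utilization data are made up for illustration.

```python
# Illustrative capacity-engineering forecast using a simple linear trend,
# standing in for the classical ML New Relic favors over generative models
# for forecasting. Thresholds and data are hypothetical.
def linear_forecast(history: list[float], steps_ahead: int) -> float:
    """Least-squares line through the series, extrapolated forward."""
    n = len(history)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(history) / n
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, history))
             / sum((x - x_mean) ** 2 for x in xs))
    intercept = y_mean - slope * x_mean
    return intercept + slope * (n - 1 + steps_ahead)

def needs_new_cell(utilization_history: list[float],
                   capacity: float = 1.0, headroom: float = 0.8) -> bool:
    # Add a cell if projected utilization breaches the headroom threshold.
    return linear_forecast(utilization_history, steps_ahead=7) > capacity * headroom

print(needs_new_cell([0.50, 0.55, 0.60, 0.65, 0.70]))  # rising trend -> True
```

A handful of lines of deterministic math, cheap to run and easy to audit, which is exactly the "easier, more impactful, and less costly" trade-off the presentation describes.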

The Cost-Accuracy-Speed Triangle

Echoing AWS’s framework, New Relic emphasizes balancing cost, accuracy, and speed.

Using the largest LLM for every use case is cost-prohibitive; sometimes humans are more cost-effective than LLMs.

Context and Model Selection

The presentation advocates a pragmatic approach to model sophistication: match model capability to the task at hand rather than defaulting to the most capable option.
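Multi-model routing of this kind can be sketched as a complexity estimate plus a cheapest-sufficient-model lookup. The model names, cost table, and keyword heuristic below are illustrative assumptions, not an actual routing policy.

```python
# Hypothetical multi-model router matching task complexity to model size.
# Model names and relative costs are made up for the sketch.
MODELS = {                         # relative cost per 1K tokens, capability tier
    "small-fast":     {"cost": 1,  "tier": 1},
    "mid-general":    {"cost": 5,  "tier": 2},
    "large-frontier": {"cost": 25, "tier": 3},
}

def required_tier(task: str) -> int:
    """Crude complexity heuristic standing in for a learned classifier."""
    if any(k in task for k in ("multi-step", "reasoning", "plan")):
        return 3
    if any(k in task for k in ("summarize", "translate")):
        return 2
    return 1

def pick_model(task: str) -> str:
    tier = required_tier(task)
    # Cheapest model whose capability tier meets the requirement.
    candidates = [m for m, v in MODELS.items() if v["tier"] >= tier]
    return min(candidates, key=lambda m: MODELS[m]["cost"])

print(pick_model("classify this alert"))            # -> small-fast
print(pick_model("plan a multi-step remediation"))  # -> large-frontier
```

Routing the bulk of simple requests to small models is what keeps the cost corner of the cost-accuracy-speed triangle under control.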

Iteration and Experimentation Culture

New Relic maintains 40-50 use cases in experimentation. They estimate approximately 15 will be powerful enough to productize, some will provide learnings but not production value, and some will need more time. Prototypes typically take two to three weeks, not months.

Workforce Transformation

A balanced perspective on AI and employment: “People who use AI will take jobs of people who don’t use AI”—emphasizing AI as an empowerment tool that removes drudgery rather than wholesale job replacement. Both organic training and specialized expertise acquisition are necessary for successful implementation.

Scope Management

The presentation warns against over-broad scoping of AI use cases. Disappointment often results from expectations exceeding what current capabilities can deliver. Narrow, well-defined use cases with clear success criteria perform better.
