ZenML

Evolution of AI Agents: From Manual Workflows to End-to-End Training

OpenAI 2024

OpenAI's journey in developing agentic products showcases the evolution from manually designed workflows with LLMs to end-to-end trained agents. The company has developed three main agentic products - Deep Research, Operator, and Codex CLI - each addressing different use cases from web research to code generation. These agents demonstrate how end-to-end training with reinforcement learning enables better error recovery and more natural interaction compared to traditional manually designed workflows.


Overview

This case study derives from a podcast conversation between Sam Charrington and Josh Tobin, a member of technical staff at OpenAI who leads the agents research team. The discussion provides deep insights into how OpenAI approaches building production-grade agentic systems, the technical challenges involved, and the architectural philosophy that differentiates their approach from traditional LLM workflow orchestration.

Josh Tobin brings a unique perspective, having previously co-founded Gantry (an ML infrastructure startup) and worked on Full Stack Deep Learning before returning to OpenAI in September 2024. His experience spans both the pre-ChatGPT era of ML infrastructure and the current foundation model paradigm, giving him valuable context on how the industry has evolved.

The Shift from Custom Models to Foundation Model-Based Agents

A key theme throughout the conversation is the fundamental shift in how businesses should think about AI deployment. Before ChatGPT and GPT-3, the prevailing assumption in ML infrastructure was that every company would need to train their own models. This drove the creation of an entire category of ML infrastructure tools built on that premise.

However, the emergence of capable general-purpose models has largely invalidated this assumption. Tobin now advises companies not to consider training their own models until they have exhausted what foundation model providers can offer. This represents a significant operational paradigm shift—rather than building internal ML infrastructure and expertise, organizations can focus on building applications on top of commercially available APIs.

The Core Technical Challenge: Compounding Errors in Agentic Workflows

The central technical problem addressed in this case study is what Tobin describes as the compounding error problem in agentic systems. When building agents on top of traditional LLMs using workflow orchestration approaches (the dominant paradigm in 2023-2024), developers would decompose a task into a fixed sequence of steps and chain LLM calls together, with each step's output feeding the next.

The fundamental issue is that even if a model is 90% accurate on any single step, running a 10-step process causes accuracy to fall off dramatically as errors compound. Additionally, human-designed workflows often oversimplify the actual processes that real experts would follow—the real world is messy, and rigid workflows cannot account for all edge cases.
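The arithmetic behind the compounding-error claim is easy to verify: per-step reliability multiplies across steps, so even a high single-step accuracy collapses over a long chain.

```python
# Per-step accuracy compounds multiplicatively across an agent's steps:
# a 90%-reliable step, repeated 10 times, yields only ~35% end-to-end success.
def end_to_end_success(step_accuracy: float, steps: int) -> float:
    return step_accuracy ** steps

for acc in (0.90, 0.95, 0.99):
    print(f"{acc:.0%} per step over 10 steps -> {end_to_end_success(acc, 10):.1%}")
```

Even at 99% per-step accuracy, a 10-step chain succeeds only about 90% of the time, which is why rigid multi-step workflows degrade so quickly.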

Tobin provides a concrete example with web research: if an agent searches for the wrong term and gets irrelevant results, a naive agent might get confused and go down rabbit holes or conclude the user’s query doesn’t make sense. This is because traditional LLMs haven’t been trained to handle these failure modes.

The Solution: End-to-End Reinforcement Learning Training

OpenAI’s approach to solving this problem involves training agents end-to-end using reinforcement learning to succeed at complete tasks rather than designing multi-agent workflows. The key insight, which Tobin attributes to an idea from Andrej Karpathy, is that good models are simply much better than humans at designing these types of systems.

When models learn processes by being rewarded for succeeding at them, they can discover solutions that are better than what humans could easily design. The most important benefit of this RL training process is robust recovery from mid-task errors.

In the web research example, an RL-trained agent that searched for the wrong term would recognize the irrelevant results and think “maybe I had the wrong search term—let me try again with a different one.”
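The behavioral difference can be sketched as a toy simulation (the environment, function names, and "learned reformulation" here are all invented for illustration): a scripted workflow treats an empty result as terminal, while a policy trained only on end-of-task reward learns to treat it as an observation and try again.

```python
def search(term: str) -> list[str]:
    # Toy environment: only the "right" term returns results.
    return ["relevant doc"] if term == "correct term" else []

def scripted_workflow(term: str) -> str:
    # Rigid pipeline: one search step, no recovery path for bad results.
    results = search(term)
    return results[0] if results else "FAIL: query doesn't make sense"

def trained_agent(term: str, retries: int = 3) -> str:
    # A trajectory-trained policy can learn that empty results are a
    # signal to reformulate, because only final task success was rewarded.
    for _ in range(retries):
        results = search(term)
        if results:
            return results[0]
        term = "correct term"  # stand-in for a learned query reformulation
    return "FAIL"

print(scripted_workflow("wrong term"))  # gives up immediately
print(trained_agent("wrong term"))      # recovers and succeeds
```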

Production Agentic Products: Deep Research, Operator, and Codex CLI

Deep Research

Deep Research is designed to produce thorough, detailed reports on complex topics, trading response latency for depth: rather than answering immediately, it works over an extended period before synthesizing its findings into a report.

Surprising use cases have emerged beyond the anticipated market research and literature review applications.

The follow-up question functionality was added after observing user interactions, recognizing that more detailed initial specifications produce more compelling results. This represents an interesting pattern in production LLM deployment—observing user behavior and adapting the product accordingly.

Currently, Deep Research is only available through the ChatGPT interface, not via API, as the primary use case is enhancing the standard ChatGPT experience for queries requiring more thorough responses.

Operator

Operator is OpenAI’s browser automation agent that works like ChatGPT but executes tasks in a virtual browser. Users can watch the agent navigate websites, click through interfaces, and complete tasks like booking restaurant reservations.

Tobin candidly positions Operator as an “early launch” technology preview, similar to early GPT-3 API access. It’s not intended for universal adoption on day one, but power users who invest time learning the tool derive significant value. This honest framing about production readiness is notable—not every agentic product is meant for mass adoption immediately.

A key operational tip for improving Operator’s performance involves adding site-specific instructions that help the model understand how to navigate particular websites. This doesn’t require granular UI element specifications but rather clarifying intent, such as “if I want to book a flight, here’s how you do that on this site.”
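A minimal sketch of how such site-specific guidance might be composed into a task prompt for a browser agent — the helper, the site name, and the hint text are all hypothetical; the point is that plain-language intent, not UI selectors, is what the model needs:

```python
# Hypothetical site-specific hints for an Operator-style browser agent.
# Note these describe intent ("how to book a flight here"), not UI elements.
SITE_HINTS = {
    "example-airline.com": (
        "To book a flight, open the 'Flights' tab first; the search box "
        "on the homepage only covers hotel bookings."
    ),
}

def build_task_prompt(task: str, site: str) -> str:
    hint = SITE_HINTS.get(site, "")
    return f"Task: {task}\nSite: {site}\nGuidance: {hint}".strip()

print(build_task_prompt("Book a flight to SFO", "example-airline.com"))
```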

Codex CLI

Codex CLI is OpenAI's open-source local code execution agent, representing a fundamentally different deployment model than their cloud-based products: the agent runs in the user's terminal and operates directly on the local codebase.

Tobin describes the mental model as “a superhuman intern who has never seen your codebase before”—intern-sized chunks of work, but with superhuman speed at reading, writing, and understanding code.

A fascinating technical observation is that Codex CLI is essentially “contextless”—rather than using bespoke context management tools, it simply explores codebases using decades-old command-line utilities. The RL-trained models are surprisingly efficient at building understanding of codebases this way, potentially more efficient than humans at determining how much code they need to read before they can build something.
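That exploration style is, in spirit, just grep and find. A self-contained Python stand-in for what `grep -rn` does (the throwaway "codebase" here is fabricated for the demo):

```python
import os
import tempfile

# Build a throwaway "codebase" so the sketch is self-contained.
repo = tempfile.mkdtemp()
with open(os.path.join(repo, "billing.py"), "w") as f:
    f.write("def charge_card(amount):\n    ...\n")
with open(os.path.join(repo, "README.md"), "w") as f:
    f.write("Payments service\n")

def grep(root: str, needle: str) -> list[str]:
    """Roughly what `grep -rn needle root` does: walk files, report hits."""
    hits = []
    for dirpath, _, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            with open(path) as f:
                for lineno, line in enumerate(f, 1):
                    if needle in line:
                        hits.append(f"{path}:{lineno}:{line.strip()}")
    return hits

print(grep(repo, "charge_card"))  # the agent now knows which file to edit
```

An agent that can issue searches like this iteratively can locate the relevant code before editing, without any bespoke context-management machinery.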

The tool excels at well-scoped, intern-sized chunks of coding work.

Future development directions include giving the model access to external APIs and MCP servers, implementing memory between sessions, and richer customization options.

The Role of Reasoning and Adaptive Compute

Reasoning models are highlighted as critical for agentic use cases. Agentic tasks have varying difficulty levels across steps, and letting models choose how much reasoning effort to apply at each step is essential for reliable outcomes.

The evolution from o1 to o3 represents improvement in this area—o3 can provide quick responses or take extended time to think through problems, adapting compute to task complexity. This adaptive reasoning capability is core to making multi-step agentic workflows reliable.
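A trivial sketch of the adaptive-compute idea — the dispatcher, its markers, and the effort labels are invented; in a trained reasoning model this trade-off is learned rather than hand-coded:

```python
# Hypothetical dispatcher illustrating adaptive compute: easy steps get a
# low reasoning budget, hard steps a high one. The keyword heuristic is a
# stand-in for the judgment a reasoning model learns during training.
def pick_reasoning_effort(step: str) -> str:
    hard_markers = ("prove", "debug", "plan", "reconcile")
    if any(marker in step.lower() for marker in hard_markers):
        return "high"
    return "low"

print(pick_reasoning_effort("Reformat this date string"))   # low
print(pick_reasoning_effort("Debug the failing pipeline"))  # high
```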

Larger models generally perform better for agentic work due to superior generalization. When building custom applications, use cases may not match what model developers anticipated, so larger models handle novel situations more robustly. However, Tobin notes that the limits of small model capability haven’t been exhausted yet.

Trust, Safety, and Tool Use in Production Agents

The conversation addresses a critical unsolved challenge in production agent deployment: specifying and enforcing trust levels between humans, agents, tools, and tasks. Using the example of an agent booking a vacation with a credit card, Tobin outlines several possible approaches to trust enforcement.

The discussion references MCP (Model Context Protocol) as an emerging standard for exposing tools to models. Tobin frames tool exposure as one of three key components for useful agents: smart general-purpose reasoning models, appropriate tool access, and small amounts of high-quality task-specific training.

A noted gap in current protocols is the lack of standardized trust level specification—there’s no established way to tell an agent “you can use this tool freely in some contexts but need approval in others.” This is an area where the industry still needs significant development.
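To make the gap concrete, here is what a trust-level annotation on a tool manifest *could* look like — no such field exists in MCP today; the schema, tool names, and labels below are entirely hypothetical:

```python
# Hypothetical tool manifest with per-tool trust levels. This illustrates
# the missing standard: today there is no agreed way to declare "use this
# tool freely" versus "ask a human first."
TOOLS = {
    "read_calendar": {"trust": "autonomous"},
    "charge_credit_card": {"trust": "requires_approval"},
}

def needs_human_approval(tool_name: str) -> bool:
    return TOOLS[tool_name]["trust"] == "requires_approval"

print(needs_human_approval("read_calendar"))       # safe to run unattended
print(needs_human_approval("charge_credit_card"))  # must pause for a human
```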

Cost Considerations and Trajectory

Cost is acknowledged as a concern with agentic tools, but Tobin offers some perspective: per-token model prices have fallen steadily, and the cost of an agent's run should be judged against the value of the completed task rather than in isolation.

There’s no current mechanism for cost transparency (showing users anticipated cost before execution), though Tobin acknowledges this as an interesting idea worth exploring.
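A cost-transparency feature of the kind floated here could be as simple as pricing a token forecast before execution — the rates and token counts below are made up for illustration:

```python
# Hypothetical pre-execution cost estimate for an agent run.
# Prices are illustrative USD per 1M tokens, not real rates.
PRICE_PER_1M_TOKENS = {"input": 2.00, "output": 8.00}

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * PRICE_PER_1M_TOKENS["input"]
            + output_tokens * PRICE_PER_1M_TOKENS["output"]) / 1_000_000

# Show the user a forecast before the agent starts working.
print(f"Estimated cost: ${estimate_cost(500_000, 100_000):.2f}")
```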

The Future of Software Development

Tobin predicts a dramatic shift in software development where the vast majority of code will be written by AI systems “much sooner than people think.” The role of developers will evolve toward higher-level work: deciding what to build and verifying that AI-written code actually does it.

This accelerates an existing trend of roles converging—design engineers, technical product managers, and full-stack engineers represent the direction of increasing individual scope. However, Tobin maintains that learning fundamental programming remains valuable even as the day-to-day work changes, drawing an analogy to ML practitioners who benefit from understanding backpropagation at a deep level even if they rarely implement it manually.

The vision for ChatGPT’s evolution is toward a unified experience that feels like talking to “your friend and co-worker and personal assistant and coach all in one”—an entity that knows when to give quick answers versus when to conduct deep research, that can provide initial impressions while continuing to investigate in the background.
