OpenAI: Evolution of AI Agents: From Manual Workflows to End-to-End Training

Overview

This case study derives from a podcast conversation between Sam Cherington and Josh Tobin, a member of technical staff at OpenAI who leads the agents research team. The discussion provides deep insights into how OpenAI approaches building production-grade agentic systems, the technical challenges involved, and the architectural philosophy that differentiates their approach from traditional LLM workflow orchestration.

Josh Tobin brings a unique perspective, having previously co-founded Gantry (an ML infrastructure startup) and worked on Full Stack Deep Learning before returning to OpenAI in September 2024. His experience spans both the pre-ChatGPT era of ML infrastructure and the current foundation model paradigm, giving him valuable context on how the industry has evolved.

The Shift from Custom Models to Foundation Model-Based Agents

A key theme throughout the conversation is the fundamental shift in how businesses should think about AI deployment. Before ChatGPT and GPT-3, the prevailing assumption in ML infrastructure was that every company would need to train their own models. This drove the creation of an entire category of ML infrastructure tools built on that premise.

However, the emergence of capable general-purpose models has largely invalidated this assumption. Tobin now advises companies not to consider training their own models until they have exhausted what foundation model providers can offer. This represents a significant operational paradigm shift—rather than building internal ML infrastructure and expertise, organizations can focus on building applications on top of commercially available APIs.

The Core Technical Challenge: Compounding Errors in Agentic Workflows

The central technical problem addressed in this case study is what Tobin describes as the compounding error problem in agentic systems. When building agents on top of traditional LLMs using workflow orchestration approaches (which was the dominant paradigm in 2023-2024), developers would:

Design a system that breaks problems into multiple steps
Assign each step to an LLM
Build rules around the workflow

The fundamental issue is that even if a model is 90% accurate on any single step, running a 10-step process causes accuracy to fall off dramatically as errors compound. Additionally, human-designed workflows often oversimplify the actual processes that real experts would follow—the real world is messy, and rigid workflows cannot account for all edge cases.

Tobin provides a concrete example with web research: if an agent searches for the wrong term and gets irrelevant results, a naive agent might get confused and go down rabbit holes or conclude the user’s query doesn’t make sense. This is because traditional LLMs haven’t been trained to handle these failure modes.

The Solution: End-to-End Reinforcement Learning Training

OpenAI’s approach to solving this problem involves training agents end-to-end using reinforcement learning to succeed at complete tasks rather than designing multi-agent workflows. The key insight, which Tobin attributes to an idea from Andrej Karpathy, is that good models are simply much better than humans at designing these types of systems.

When models learn processes by being rewarded for succeeding at them, they can figure out solutions that are better than what humans could easily design. The RL training process has several critical benefits:

Models see failures during training and learn to recover from them
Even if each step is only 90-95% accurate, the model has learned what failure looks like and can reroute
Models develop the ability to recognize when something doesn’t look right and try alternative approaches

In the web research example, an RL-trained agent that searched for the wrong term would recognize the irrelevant results and think “maybe I had the wrong search term—let me try again with a different one.”

Production Agentic Products: Deep Research, Operator, and Codex CLI

Deep Research

Deep Research is designed to produce thorough, detailed reports on complex topics. Key operational characteristics include:

Integrates directly into ChatGPT as an elevated experience for queries requiring depth
Features an initial phase where it asks follow-up questions to clarify user intent
The quality of results is sensitive to how well the initial question is framed
Users often employ smaller models (O1, O3, O4-mini) to help craft questions before submitting to Deep Research

Surprising use cases have emerged beyond the anticipated market research and literature review applications:

Coding: The model can search GitHub, understand codebases, and return implementation plans
Finding rare facts: Excellent at locating information buried in obscure corners of the internet

The follow-up question functionality was added after observing user interactions, recognizing that more detailed initial specifications produce more compelling results. This represents an interesting pattern in production LLM deployment—observing user behavior and adapting the product accordingly.

Currently, Deep Research is only available through the ChatGPT interface, not via API, as the primary use case is enhancing the standard ChatGPT experience for queries requiring more thorough responses.

Operator

Operator is OpenAI’s browser automation agent that works like ChatGPT but executes tasks in a virtual browser. Users can watch the agent navigate websites, click through interfaces, and complete tasks like booking restaurant reservations.

Tobin candidly positions Operator as an “early launch” technology preview, similar to early GPT-3 API access. It’s not intended for universal adoption on day one, but power users who invest time learning the tool derive significant value. This honest framing about production readiness is notable—not every agentic product is meant for mass adoption immediately.

A key operational tip for improving Operator’s performance involves adding site-specific instructions that help the model understand how to navigate particular websites. This doesn’t require granular UI element specifications but rather clarifying intent, such as “if I want to book a flight, here’s how you do that on this site.”

Codex CLI

Codex CLI is OpenAI’s open-source local code execution agent, representing a fundamentally different deployment model than their cloud-based products. Key characteristics:

Fully open source with over 20,000 GitHub stars and approximately 100 contributors
Runs locally on developer machines
Can operate in a network-sandboxed mode for safety
Uses standard command-line tools (grep, sed, etc.) to navigate codebases

Tobin describes the mental model as “a superhuman intern who has never seen your codebase before”—intern-sized chunks of work, but with superhuman speed at reading, writing, and understanding code.

A fascinating technical observation is that Codex CLI is essentially “contextless”—rather than using bespoke context management tools, it simply explores codebases using decades-old command-line utilities. The RL-trained models are surprisingly efficient at building understanding of codebases this way, potentially more efficient than humans at determining how much code they need to read before they can build something.

The tool excels at:

De novo work on new projects or features
Work in unfamiliar parts of codebases
Tasks that engineers don’t enjoy (frontend work for backend engineers, data plumbing for product engineers)
Work that would otherwise fall below the priority cutline

Future development directions include adding APIs or MCPs to the model, implementing memory between sessions, and richer customization options.

The Role of Reasoning and Adaptive Compute

Reasoning models are highlighted as critical for agentic use cases. Agentic tasks have varying difficulty levels across steps, and letting models choose how much reasoning effort to apply at each step is essential for reliable outcomes.

The evolution from O1 to O3 represents improvement in this area—O3 can provide quick responses or take extended time to think through problems, adapting compute to task complexity. This adaptive reasoning capability is core to making multi-step agentic workflows reliable.

Larger models generally perform better for agentic work due to superior generalization. When building custom applications, use cases may not match what model developers anticipated, so larger models handle novel situations more robustly. However, Tobin notes that the limits of small model capability haven’t been exhausted yet.

Trust, Safety, and Tool Use in Production Agents

The conversation addresses a critical unsolved challenge in production agent deployment: specifying and enforcing trust levels between humans, agents, tools, and tasks. Using the example of an agent booking vacation with a credit card, Tobin outlines several trust enforcement approaches:

Requiring explicit approval for high-risk tool calls
Building guidelines that models follow about when to request permission
Acknowledging that even with safeguards, deception risks remain

The discussion references MCP (Model Context Protocol) as an emerging standard for exposing tools to models. Tobin frames tool exposure as one of three key components for useful agents: smart general-purpose reasoning models, appropriate tool access, and small amounts of high-quality task-specific training.

A noted gap in current protocols is the lack of standardized trust level specification—there’s no established way to tell an agent “you can use this tool freely in some contexts but need approval in others.” This is an area where the industry still needs significant development.

Cost Considerations and Trajectory

Cost is acknowledged as a concern with agentic tools, but Tobin offers several perspectives:

Historical trend shows dramatic cost reduction over time
Even at current costs, saving hours of developer time for a few dollars is often worthwhile
As models improve and produce more consistent results, cost concerns diminish relative to value delivered

There’s no current mechanism for cost transparency (showing users anticipated cost before execution), though Tobin acknowledges this as an interesting idea worth exploring.

The Future of Software Development

Tobin predicts a dramatic shift in software development where the vast majority of code will be written by AI systems “much sooner than people think.” The role of developers will evolve toward:

Thinking about desired functionality and trade-offs
Considering edge cases
Guiding AI systems and providing feedback
Finding ways to validate AI-generated work

This accelerates an existing trend of roles converging—design engineers, technical product managers, and full-stack engineers represent the direction of increasing individual scope. However, Tobin maintains that learning fundamental programming remains valuable even as the day-to-day work changes, drawing an analogy to ML practitioners who benefit from understanding backpropagation at a deep level even if they rarely implement it manually.

The vision for ChatGPT’s evolution is toward a unified experience that feels like talking to “your friend and co-worker and personal assistant and coach all in one”—an entity that knows when to give quick answers versus when to conduct deep research, that can provide initial impressions while continuing to investigate in the background.

Evolution of AI Agents: From Manual Workflows to End-to-End Training

Industry

Technologies