Weights & Biases developed an advanced AI programming agent using OpenAI's o1 model that achieved state-of-the-art performance on the SWE-Bench-Verified benchmark, successfully resolving 64.6% of software engineering issues. The solution combines o1 with custom-built tools, including a Python code editor toolset, memory components, and parallel rollouts with a cross-check selection mechanism, all developed and evaluated using W&B's Weave toolkit and the newly created Eval Studio platform.
This case study documents how Weights & Biases (W&B), a company specializing in MLOps and AI development tools, built an autonomous AI programming agent that achieved state-of-the-art performance on the SWE-Bench-Verified benchmark. The work was led by Shawn Lewis, co-founder and CTO of W&B, who spent two months iterating on the solution using the company's own tools. The agent resolved 64.6% of issues on the benchmark, topping the leaderboard and significantly outperforming OpenAI's own published o1 result, which used a more basic agent framework. This project served a dual purpose: demonstrating the capabilities of W&B's tooling ecosystem while pushing the frontier of AI programming agents.
SWE-Bench-Verified is considered one of the most rigorous benchmarks for evaluating software engineering agents. It consists of 500 GitHub issues paired with Docker images and held-out unit tests. Agents must operate autonomously within Docker containers, mimicking how a human programmer would work—iteratively reading code, writing modifications, running tests, and refining solutions until the issue is resolved. This represents a significant real-world challenge that tests an agent’s ability to understand complex codebases, diagnose problems, and implement correct fixes.
The W&B agent employs a multi-component architecture that makes strategic use of different models for different purposes:
OpenAI o1 with high reasoning mode serves as the core driver for all agent step logic and code editing operations. The choice of o1 over other models proved critical given its superior ability to analyze large code contexts and identify bugs accurately.
GPT-4o memory component handles compression of the agent’s step history, allowing the agent to maintain context over long interaction sequences without overwhelming the primary model’s context window.
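The details of W&B's memory component are not published, but the general pattern — folding older agent steps into a summary produced by a cheaper model once the history grows past a budget — can be sketched as follows. All names here are hypothetical, and the `summarize` callable stands in for a GPT-4o call:

```python
from typing import Callable, List

def compress_history(
    steps: List[str],
    summarize: Callable[[str], str],
    keep_recent: int = 4,
    max_steps: int = 10,
) -> List[str]:
    """Fold older steps into one summary once history grows too long.

    `summarize` stands in for a cheaper model (e.g. GPT-4o); recent
    steps are kept verbatim so the primary model sees fresh detail.
    """
    if len(steps) <= max_steps:
        return steps
    older, recent = steps[:-keep_recent], steps[-keep_recent:]
    summary = summarize("\n".join(older))
    return [f"[summary of {len(older)} earlier steps] {summary}"] + recent
```

Keeping the most recent steps verbatim while summarizing the rest is a common compromise: the expensive model still sees exact detail where it matters most, at a fraction of the context cost.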
Custom Python code editor toolset was built specifically to use model context efficiently. Rather than relying on generic code manipulation tools, the team designed tools optimized for the way o1 processes and reasons about code.
Auto-command system allows the agent to register commands that run automatically after every file modification, reducing the need for the model to reason about temporal ordering of events.
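A minimal sketch of this idea, assuming a simplified editor interface (the class and method names below are illustrative, not W&B's actual implementation): every file write triggers the registered commands, and their output is returned with the observation, so the model always sees post-edit results rather than having to remember whether it re-ran anything:

```python
import subprocess
from typing import List

class AutoCommandEditor:
    """Hypothetical sketch of an auto-command system: every file write
    triggers registered commands (e.g. a test run), so the model always
    sees fresh results and never has to reason about whether the tests
    were re-run after the latest edit."""

    def __init__(self) -> None:
        self.auto_commands: List[List[str]] = []

    def register(self, cmd: List[str]) -> None:
        """Register a command to run after every file modification."""
        self.auto_commands.append(cmd)

    def write_file(self, path: str, content: str) -> List[str]:
        """Write the file, then run all auto-commands and collect output
        for inclusion in the agent's next observation."""
        with open(path, "w") as f:
            f.write(content)
        observations = []
        for cmd in self.auto_commands:
            result = subprocess.run(cmd, capture_output=True, text=True)
            observations.append(
                f"$ {' '.join(cmd)}\n{result.stdout}{result.stderr}"
            )
        return observations
```

The design choice is to move a reasoning burden (temporal ordering) out of the prompt and into the harness, where it is enforced deterministically.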
Parallel rollouts with cross-check selection runs 5 parallel attempts at each problem instance, then uses an o1-based tie-breaker in a “cross-check” step to select the best solution. The author notes this mechanism “works pretty well and may be somewhat novel.”
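The article does not detail the cross-check step, but the surrounding pattern — N independent rollouts followed by a judge that picks a winner — can be sketched generically. The `cross_check` callable below stands in for the o1-based tie-breaker; its internals are an assumption:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

def best_of_n(
    attempt: Callable[[int], str],
    cross_check: Callable[[List[str]], int],
    n: int = 5,
) -> str:
    """Run n independent rollouts in parallel, then let a judge pick one.

    `attempt` runs one full agent rollout and returns its candidate patch;
    `cross_check` (here a stand-in for the o1-based tie-breaker) returns
    the index of the winning candidate.
    """
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(attempt, range(n)))
    return candidates[cross_check(candidates)]
```

This is a test-time compute scaling pattern: the per-instance cost is roughly n rollouts plus one judging call, which is why the caveats later in this article about cost and latency apply.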
The case study provides several practical insights about using o1 as an agent backbone that are valuable for LLMOps practitioners:
Improved Instruction Following: Unlike previous models, where adding more instructions to a prompt could degrade adherence to other parts of it, o1 demonstrates remarkable consistency. The author shares a 7-line section from the 58-line task instructions portion of the prompt, each line "hard-earned from grinding out evals and reviewing lots of agent trajectories." o1 respects these detailed instructions "almost all of the time."
Outcome-Oriented Prompting: The most effective approach with o1 is to specify desired outcomes rather than prescriptive step-by-step processes. The stopping condition prompt shared in the article lists five criteria that must all be true before the agent considers the task complete, allowing o1 the flexibility to determine how to achieve that outcome.
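The actual five criteria in W&B's prompt are not reproduced here, but the structure — completion defined as a conjunction of outcome checks, with the model free to choose how to satisfy them — can be illustrated with placeholder criteria:

```python
from typing import Callable, Dict

def task_complete(
    state: Dict,
    criteria: Dict[str, Callable[[Dict], bool]],
) -> bool:
    """Outcome-oriented stopping: the task is done only when every
    criterion holds. The checks are predicates over agent state, not
    prescribed steps."""
    return all(check(state) for check in criteria.values())

# Placeholder criteria -- illustrative only, not the prompt's actual five.
criteria = {
    "tests_pass": lambda s: s.get("tests_passed", False),
    "no_pending_edits": lambda s: not s.get("dirty_files"),
}
```

Encoding the stopping condition as outcomes rather than procedures leaves the model latitude in *how* it gets there, which the article identifies as the effective way to prompt o1.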
Temporal Reasoning Challenges: A significant finding is that o1 does not always reason correctly about the time ordering of events. The author observed instances where, after a sequence of edit-test-edit actions, o1 would draw incorrect conclusions about the test results without having run the tests after the most recent edit. The solution was architectural—implementing auto-commands to reduce the need for temporal reasoning rather than trying to prompt around the limitation.
The development process exemplifies rigorous LLMOps practices. The team ran 977 evaluations over the course of developing this solution, tracking everything through W&B’s Weave toolkit. This level of systematic experimentation underscores the importance of robust evaluation infrastructure when building production AI systems.
The author credits several tools for enabling this iteration velocity:
Weave served as the primary development toolkit, tracking all experiments and providing the evaluation framework. The platform apparently improved significantly during the project, with particular mention of a new playground feature supporting “first-class support for testing multiple trials of the same prompt.”
Eval Studio was built during the project as a new tool backed by Weave data. It provides charts for monitoring live runs and statistical analysis of results, plus a table view with rollout drawer for detailed investigation of instances where performance changed between model versions. The author notes these concepts will be integrated into Weave over coming months.
Phaseshift is a new TypeScript framework for composing AI agents, built around Weave’s core concepts. The choice of TypeScript was deliberate, with the author citing its “powerful type system” for reasoning about interfaces and composition. Key features include versioning of both data and code together (enabling understanding of changes during iteration) and evaluations as a first-class concept for any function or pipeline.
It’s worth noting some important caveats about this case study. First, this is written by W&B’s CTO and serves partly as a demonstration of W&B’s own tooling—there’s inherent promotional value in achieving state-of-the-art results. The claims about outperforming OpenAI’s basic o1 agent should be understood in context: different agent frameworks and evaluation setups can significantly impact results, and direct comparisons require careful interpretation.
The “cross-check” mechanism for selecting among parallel rollouts is mentioned as potentially novel but not detailed, making it difficult to assess its actual contribution versus simply running more attempts. Running 5 parallel rollouts and selecting the best is a form of test-time compute scaling that may not be practical for all production scenarios due to cost and latency considerations.
The reliance on 977 evaluations to achieve this result highlights both the value of systematic experimentation and the significant effort required. This level of iteration may not be feasible for many organizations, though it does validate W&B’s thesis that better tooling enables better outcomes.
Several aspects of this work have implications for production LLMOps:
The observation about o1’s temporal reasoning limitations is particularly valuable for anyone building agent systems. The solution of reducing the need for temporal reasoning through architectural choices (auto-commands) rather than prompt engineering represents a mature approach to working around model limitations.
The multi-model architecture (o1 for reasoning, GPT-4o for memory compression) demonstrates cost/capability tradeoffs that are common in production systems. Using cheaper models for auxiliary tasks while reserving more expensive models for core reasoning is a standard production pattern.
The emphasis on version control for both data and code together, as highlighted with Phaseshift, addresses a common pain point in LLMOps where experiments are difficult to reproduce because prompts, data, and code evolve independently.
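Phaseshift's versioning internals are not described, but the underlying idea — a version identity derived from code and data together, so a change to either produces a new version — can be sketched with a content hash. This is a simplified illustration under that assumption, not Phaseshift's API:

```python
import hashlib
import json

def version_key(code: str, data) -> str:
    """Derive a version identifier from code and data jointly: any
    change to either the function source or its inputs yields a new
    key, making experiments reproducible by construction."""
    payload = json.dumps({"code": code, "data": data}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()[:12]
```

With a scheme like this, an evaluation result can be stored under a key that pins both the prompt/pipeline code and the dataset it ran on, avoiding the drift that occurs when the two are versioned independently.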
W&B indicates plans to continue pushing the frontier of AI programming and to deliver the tools developed during this project to customers. Phaseshift is mentioned as a future release candidate, and Eval Studio concepts will be integrated into Weave. The company explicitly connects these developments to broader AI safety implications, suggesting the evaluation infrastructure could support safety-focused applications as well as capability development.