ZenML

Evolving Agent Architecture Through Model Capability Improvements

Aomni 2023

David from Aomni discusses how his company evolved from building complex agent architectures with multiple guardrails to simpler, more model-centric approaches as LLM capabilities improved. The company provides AI agents for revenue teams, helping automate research and sales workflows while keeping humans in the loop for customer relationships. Their journey demonstrates how LLMOps practices must continuously adapt as model capabilities expand, progressively removing scaffolding and simplifying architectures.

Industry

Tech

Overview

This case study is derived from a podcast interview with David from Aomni, a company building autonomous AI agents for revenue teams. The discussion provides valuable insights into the evolution of agent architectures, reliability engineering for production LLM systems, and the philosophical approach of building AI products that improve as model capabilities advance rather than becoming obsolete.

Aomni’s core product enables sales representatives to orchestrate revenue playbooks through natural language prompts instead of manually navigating the typical 5-20 pieces of software that enterprise sales teams use today. The company positions itself as an “AI support function” rather than an “AI SDR”—empowering human salespeople with better data and research rather than attempting to replace customer-facing interactions entirely.

Technical Architecture and Evolution

Early Agent Architecture (2023)

David’s journey with production agents began in mid-2023, shortly after Baby AGI and Auto-GPT emerged. His initial insight was that AI agents are fundamentally workflow orchestration systems facing the same reliability challenges as long-running microservice workflows, and his early architectural decisions followed directly from that framing.

The original research agent architecture, built on GPT-3.5 and GPT-4, required extensive scaffolding: explicit guardrails and checks around every model call to compensate for the models’ limited reliability.
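To make that scaffolding concrete, here is a minimal, hypothetical sketch of the kind of guarded step earlier models required; `model_fn` and `validate` are illustrative stand-ins, not Aomni’s actual code:

```python
import time

def guarded_step(model_fn, prompt, validate, max_retries=3, base_delay=0.0):
    """Run one agent step with validation and retries -- per-call
    scaffolding to compensate for an unreliable model."""
    last_error = None
    for attempt in range(max_retries):
        try:
            output = model_fn(prompt)
            if validate(output):
                return output
            last_error = f"validation failed on attempt {attempt + 1}"
        except Exception as exc:  # timeouts, malformed responses, etc.
            last_error = str(exc)
        time.sleep(base_delay * (2 ** attempt))  # exponential backoff
    raise RuntimeError(f"step exhausted retries: {last_error}")
```

Every model call wrapped this way adds code that later becomes dead weight once the model itself stops failing, which is exactly the scaffolding the company later removed.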

Current Architecture Philosophy

The company’s core philosophy is “never bet against the model” and the observation that model capabilities roughly double at regular intervals. This leads to a key operational principle: completely rewrite the product every time model capability doubles. David describes this as building “scaffolding” rather than “wrappers”—temporary support structures that should be progressively removed as the underlying AI becomes more capable.

The evolution shows concretely in their research agent, which has shed layer after layer of that scaffolding with each model generation.

The current deep research agent architecture is remarkably simple by comparison. This simplification was enabled by improvements in model reasoning capabilities, particularly the advent of test-time compute and reasoning models whose effort can be tuned from low to high.
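A stripped-down agent of this kind can be sketched as little more than a model in a loop with tools. This is hypothetical; the dict-based action format stands in for whatever structured output the model actually emits:

```python
def run_agent(model_fn, tools, task, max_steps=20):
    """Minimal model-centric agent: the model sees the conversation so
    far, picks a tool or answers, and the loop just executes its choice."""
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        action = model_fn(history)  # {"tool": ..., "args": ...} or {"answer": ...}
        if "answer" in action:
            return action["answer"]
        result = tools[action["tool"]](**action.get("args", {}))
        history.append({"role": "tool", "name": action["tool"], "content": result})
    raise RuntimeError("agent exceeded step budget")
```

The control flow that scaffolding-heavy architectures hard-coded (which tool, in what order, with what checks) is delegated entirely to the model.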

Production Reliability Considerations

For enterprise deployment, Aomni focuses heavily on reliability, since enterprise customers expect 99% reliability rather than hackathon-quality demos.
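One common pattern for this level of reliability, consistent with the durable-workflow framing above, is to checkpoint agent state after each step so a crashed run resumes instead of restarting. A hypothetical sketch (the file-based store and step shape are illustrative only):

```python
import json
import os

def run_with_checkpoints(steps, state_path):
    """Execute steps in order, persisting progress after each one;
    on restart, completed steps are skipped rather than re-run."""
    state = {"completed": 0, "results": []}
    if os.path.exists(state_path):
        with open(state_path) as f:
            state = json.load(f)  # resume from last checkpoint
    for i in range(state["completed"], len(steps)):
        state["results"].append(steps[i](state["results"]))
        state["completed"] = i + 1
        with open(state_path, "w") as f:
            json.dump(state, f)  # durable checkpoint after every step
    return state["results"]
```

Production systems would use a database or workflow engine rather than a JSON file, but the invariant is the same: no completed work is ever repeated.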

Context and Memory Management

A significant operational challenge discussed in the interview is providing the model with appropriate context at each step, which Aomni handles through deliberate context and memory management.
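As an illustration of the underlying problem, here is a naive keyword-overlap ranker that packs stored snippets into a fixed context budget. Real systems would use embeddings and richer memory structures; everything here is hypothetical:

```python
def build_context(task, memory, budget_chars=4000):
    """Rank memory snippets by keyword overlap with the task and pack
    the best ones into the context window until the budget is spent."""
    task_words = set(task.lower().split())
    ranked = sorted(memory, key=lambda s: -len(task_words & set(s.lower().split())))
    picked, used = [], 0
    for snippet in ranked:
        if used + len(snippet) > budget_chars:
            continue  # skip snippets that would blow the budget
        picked.append(snippet)
        used += len(snippet)
    return "\n".join(picked)
```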

The interview touches on Model Context Protocol (MCP) as a potential solution for tool integration and memory, though David notes limited community adoption and competitive dynamics where companies resist becoming “just tool makers on top of an AI platform.”

Evaluation and Testing

David offers candid insights into how they evaluate agentic systems in practice.

He acknowledges, however, that these evaluations have significant limitations.

A concrete example: an eval expected a specific sequence of tool calls (web search → web browse → contact enrichment), but a newer model achieved the same goal by calling tools in a completely different order and skipping some entirely. This represents a philosophical challenge where better models may “prove your eval dataset wrong.”

The recommendation is to redo evaluation datasets every time model performance doubles, treating eval maintenance as an ongoing operational responsibility rather than a one-time setup.
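The tool-sequence example suggests grading outcomes rather than trajectories, so that a better model taking a different route still passes. A hypothetical contrast between the two styles:

```python
def eval_trajectory(tool_calls, expected_sequence):
    """Brittle: passes only if the agent used the exact expected tool
    order -- a stronger model taking a better path will 'fail' this."""
    return tool_calls == expected_sequence

def eval_outcome(answer, required_facts):
    """Robust to route changes: passes if the final answer contains the
    facts that matter, however the agent found them."""
    answer = answer.lower()
    return all(fact.lower() in answer for fact in required_facts)
```

Outcome-based checks still need refreshing as models improve, but they age more gracefully than trajectory assertions.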

Tool Calling vs. Code Generation

The interview explores an interesting architectural tension between tool calling, the current mainstream approach, and code generation for task execution.

David experimented with service discovery patterns (a tool that loads other tools based on needs) but found models don’t reliably call the discovery tool before giving up—they lack the “instinct” for this pattern, suggesting it needs to be tuned into models by frontier labs.
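The pattern David experimented with can be sketched as a registry whose only always-loaded tool is `discover`; the names and naive keyword matching below are illustrative, not his implementation:

```python
class ToolRegistry:
    """Service discovery for tools: instead of loading every tool into
    context, the agent is given one 'discover' tool that returns the
    names of tools matching its current need."""

    def __init__(self):
        self._catalog = {}

    def register(self, name, description, fn):
        self._catalog[name] = {"description": description, "fn": fn}

    def discover(self, query):
        """The discovery tool itself: keyword match over names and descriptions."""
        q = query.lower()
        return [name for name, meta in self._catalog.items()
                if q in name.lower() or q in meta["description"].lower()]

    def call(self, name, **kwargs):
        return self._catalog[name]["fn"](**kwargs)
```

As David observes, the weak point is not the registry but the model: unless models are trained to reach for `discover` before giving up, the pattern stalls.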

Strategic Positioning

The company’s approach differs from the “AI SDR” trend of replacing customer-facing salespeople. David argues that for enterprise B2B sales with five-to-seven-figure deal sizes, human relationships remain essential: “nobody’s going to feel good talking to a robot.” Enterprise deals typically represent 60-80% of revenue for such companies, making this the economically important segment.

The long-term vision is that as models continue improving, Aomni’s value proposition shifts from scaffolding and guardrails to primarily providing tools and data that feed into increasingly capable models. This positions the product to improve with each model generation rather than competing against it.

Key Takeaways for LLMOps Practitioners

Several themes from the conversation generalize beyond Aomni:

- Never bet against the model: design products so they improve, rather than break, as model capabilities advance.
- Treat scaffolding as temporary: guardrails and orchestration built for today’s models should be progressively removed as models improve.
- Plan for rewrites: when model capability roughly doubles, revisit the architecture rather than patching it.
- Maintain evals continuously: refresh evaluation datasets as models improve, and prefer outcome-based checks over brittle tool-call trajectories.
- Keep humans in the loop where relationships matter, and let agents handle the research and data work behind them.
