Minimal developed a sophisticated multi-agent customer support system for e-commerce businesses using LangGraph and LangSmith, claiming 80%+ efficiency gains in ticket resolution. Their system combines three specialized agents (Planner, Research, and Tool-Calling) to handle complex support queries, automate responses, and execute order management tasks while maintaining compliance with business protocols. The company projects that up to 90% of support tickets will be handled autonomously, with human intervention required for only 10% of cases.
Minimal is a Dutch startup founded by Titus Ex (a machine learning engineer) and Niek Hogenboom (an aerospace engineering graduate) that focuses on automating customer support workflows for e-commerce businesses. The company has developed an AI-powered system that integrates with popular helpdesk platforms like Zendesk, Front, and Gorgias, aiming to handle customer queries autonomously while maintaining the ability to execute real actions such as order cancellations, refunds, and address updates through direct integrations with e-commerce services like Shopify.
The case study, published via LangChain’s blog in January 2025, presents Minimal’s approach to building a production-ready multi-agent system for customer support. It’s worth noting that this is a promotional piece from LangChain’s ecosystem, so the claimed metrics (80%+ efficiency gains, 90% autonomous ticket resolution) should be considered with appropriate caution, as independent verification is not provided.
E-commerce customer support presents a tiered complexity challenge. While basic support tickets (referred to as T1) are relatively straightforward to handle, more complex issues (T2 and T3) require deeper integration with business systems and nuanced understanding of company policies. Traditional approaches using monolithic LLM prompts were found to conflate multiple tasks, leading to errors and inefficient token usage. The team discovered that attempting to handle all aspects of customer support within a single prompt led to reliability issues and made it difficult to maintain and extend the system.
Minimal’s core technical innovation lies in their multi-agent architecture, which decomposes the customer support workflow into specialized components. This approach represents a significant LLMOps pattern for managing complexity in production LLM systems.
The architecture consists of three main agent types working in coordination:
Planner Agent: This agent serves as the orchestration layer, receiving incoming customer queries and breaking them down into discrete sub-problems. For example, a complex query might be decomposed into separate concerns like “Return Policy” and “Troubleshooting Front-End Issues.” The Planner Agent coordinates with specialized research agents and determines the overall flow of information through the system.
Research Agents: These specialized agents handle individual sub-problems identified by the Planner Agent. They perform retrieval and re-ranking operations against the company’s knowledge base, which includes documentation like returns guidelines, shipping rules, and other customer protocols stored in what Minimal calls their “training center.” This represents a RAG (Retrieval-Augmented Generation) pattern where agents pull relevant context before generating responses.
Tool-Calling Agent: This agent receives the consolidated “tool plan” from the Planner Agent and executes actual business operations, including consequential actions like processing refunds via Shopify or updating shipping addresses. Importantly, it also consolidates logs for post-processing and chain-of-thought validation, which is crucial for maintaining auditability in a production environment.
The final step in the pipeline produces a reasoned draft reply to the customer that references correct protocols, checks relevant data, and ensures compliance with business rules around refunds and returns.
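The Planner → Research → Tool-Calling flow described above can be sketched in plain Python. This is an illustrative stand-in only: the real system uses LangGraph with LLM-backed agents, and every name, keyword heuristic, and knowledge-base entry below is hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Ticket:
    query: str
    log: list = field(default_factory=list)  # consolidated audit trail

def planner_agent(ticket: Ticket) -> list:
    """Decompose the query into sub-problems (stand-in for an LLM planner)."""
    subproblems = []
    if "return" in ticket.query.lower():
        subproblems.append("Return Policy")
    if "refund" in ticket.query.lower():
        subproblems.append("Refund Eligibility")
    ticket.log.append(("planner", subproblems))
    return subproblems

def research_agent(subproblem: str, knowledge_base: dict) -> str:
    """Retrieve protocol text for one sub-problem (RAG stand-in)."""
    return knowledge_base.get(subproblem, "no protocol found")

def tool_calling_agent(ticket: Ticket, contexts: dict) -> str:
    """Execute the tool plan and draft a reply referencing the protocols."""
    ticket.log.append(("tools", list(contexts)))
    return "Draft reply citing: " + ", ".join(contexts)

# Hypothetical "training center" knowledge base
kb = {"Return Policy": "Items may be returned within 30 days.",
      "Refund Eligibility": "Refunds require an unused item."}

ticket = Ticket("I want to return my order and get a refund")
contexts = {s: research_agent(s, kb) for s in planner_agent(ticket)}
draft = tool_calling_agent(ticket, contexts)
```

The key property the sketch preserves is that every agent step appends to one shared log, which is what makes the final draft auditable.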
The team’s decision to adopt a multi-agent architecture over a monolithic approach was driven by several production concerns. They found that combining all tasks in a single prompt conflated responsibilities and increased error rates; splitting tasks across specialized agents reduced errors, made token usage more efficient, and left the system easier to maintain and extend.
This architectural pattern is increasingly common in production LLM systems where complexity needs to be managed systematically.
A significant portion of Minimal’s LLMOps practice centers on their use of LangSmith for testing and benchmarking. Their development workflow leverages LangSmith’s capabilities for:
Performance Tracking: The team tracks model responses and performance metrics over time, enabling them to detect regressions and measure improvements as they iterate on the system.
Prompt Comparison: LangSmith enables side-by-side comparisons of different prompting strategies, including few-shot versus zero-shot approaches and chain-of-thought variants. This systematic experimentation is essential for optimizing production LLM systems.
Sub-Agent Logging: Each sub-agent’s output is logged, allowing the team to catch unexpected reasoning loops or erroneous tool calls. This visibility into the internal workings of the multi-agent system is critical for debugging and quality assurance.
Test-Driven Iteration: When errors are discovered—such as policy misunderstandings or missing steps—the team creates new tests based on LangSmith’s trace logs. They then add more few-shot examples or further decompose sub-problems to address the issues. This iterative, test-driven approach helps maintain velocity while improving system reliability.
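One way to operationalize this test-driven loop, sketched here with stdlib Python rather than LangSmith’s actual API, is to turn each failing trace into a regression case that every new prompt or agent version must pass. The case contents and the stub agent are hypothetical.

```python
# Hypothetical regression suite: each entry is derived from a failing
# trace (e.g. a policy misunderstanding caught in a trace log).
REGRESSION_CASES = [
    {"query": "Can I return a used item after 45 days?",
     "must_mention": ["30 days"],           # correct policy window
     "must_not_mention": ["full refund"]},  # previously erroneous claim
]

def run_agent(query: str) -> str:
    """Stand-in for the real multi-agent pipeline."""
    return "Our policy allows returns within 30 days for unused items."

def run_regressions(agent) -> list:
    """Check every archived failure case against the current agent."""
    failures = []
    for case in REGRESSION_CASES:
        reply = agent(case["query"])
        for phrase in case["must_mention"]:
            if phrase not in reply:
                failures.append(f"missing: {phrase!r}")
        for phrase in case["must_not_mention"]:
            if phrase in reply:
                failures.append(f"forbidden: {phrase!r}")
    return failures

failures = run_regressions(run_agent)
```

Running the suite on every iteration gives the regression detection described above: a change that reintroduces an old policy misunderstanding surfaces as a named failure rather than a silent production error.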
Minimal built their system using the LangChain ecosystem, with LangGraph serving as the orchestration framework for their multi-agent architecture. The choice of LangGraph was motivated by several factors:
Modularity: LangGraph’s modular design allows flexible management of sub-agents without the constraints of a monolithic framework. This enables customization for specific workflow needs.
Integration Capabilities: The code-like design of the framework facilitated the development of proprietary connectors for e-commerce services including Shopify, Monta Warehouse Management Services, and Firmhouse (for recurring e-commerce).
Future-Proofing: The architecture supports easy addition of new agents and transition to next-generation LLMs. New subgraphs for new tasks can be added and connected back to the Planner Agent without major refactoring.
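The extensibility claim can be illustrated with a minimal registry pattern. This is my own simplification: LangGraph expresses the same idea as subgraphs wired into the parent graph, and the task names and handlers below are hypothetical.

```python
# Hypothetical sub-agent registry: new task handlers plug in without
# touching the planner's dispatch logic.
SUB_AGENTS = {}

def register(task_name):
    """Decorator that wires a handler into the registry by task name."""
    def decorator(fn):
        SUB_AGENTS[task_name] = fn
        return fn
    return decorator

@register("Return Policy")
def handle_returns(query):
    return "Checked return protocol."

# Adding a new capability later is one new function, no refactor:
@register("Address Update")
def handle_address_update(query):
    return "Updated shipping address via e-commerce connector."

def planner_dispatch(tasks, query):
    """The planner routes each sub-problem to its registered handler."""
    return [SUB_AGENTS[t](query) for t in tasks if t in SUB_AGENTS]

results = planner_dispatch(["Return Policy", "Address Update"],
                           "please change my address")
```

Because the planner only consults the registry, swapping in a next-generation model or a new task handler changes one entry, not the orchestration code.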
The system integrates with helpdesk platforms (Zendesk, Front, Gorgias) to provide a unified interface for handling customer queries, operating in either draft mode (co-pilot, where responses are suggested for human review) or fully automated mode.
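The two operating modes amount to a gate between the drafted reply and the send action. A minimal sketch follows; the mode names and function are illustrative assumptions, not Minimal’s API.

```python
from enum import Enum

class Mode(Enum):
    DRAFT = "draft"          # co-pilot: replies suggested for human review
    AUTOMATED = "automated"  # replies sent without review

def route_reply(draft: str, mode: Mode):
    """Gate a drafted reply based on the configured operating mode."""
    if mode is Mode.AUTOMATED:
        return ("send", draft)
    return ("queue_for_review", draft)

action, _ = route_reply("Your refund has been processed.", Mode.DRAFT)
```

A single configuration switch of this kind lets a business start in co-pilot mode and move to full automation once it trusts the system, matching the different risk tolerances discussed below.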
The case study reveals several important production considerations. The system maintains chain-of-thought validation through consolidated logging, which is essential for compliance and auditability in e-commerce contexts where refund and return policies must be followed precisely. The dual-mode operation (draft vs. automated) provides flexibility for businesses with different risk tolerances.
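Consolidated logging for auditability can be sketched as structured records per agent step, validated after the fact against business rules. The record fields and the example rule (a refund must be preceded by a policy lookup) are hypothetical illustrations of the pattern, not Minimal’s actual schema.

```python
import time

audit_log = []

def log_step(agent: str, action: str, detail: dict) -> None:
    """Append one structured, timestamped record per agent action."""
    audit_log.append({"agent": agent, "action": action,
                      "detail": detail, "ts": time.time()})

def validate_refund_steps(log: list) -> bool:
    """Post-hoc compliance check: refunds require a prior policy lookup."""
    actions = [(r["agent"], r["action"]) for r in log]
    if ("tools", "refund") in actions:
        return (("research", "policy_lookup") in actions
                and actions.index(("research", "policy_lookup"))
                    < actions.index(("tools", "refund")))
    return True

log_step("planner", "decompose", {"subproblems": ["Refund Eligibility"]})
log_step("research", "policy_lookup", {"doc": "refund_policy"})
log_step("tools", "refund", {"order": "hypothetical-123"})
ok = validate_refund_steps(audit_log)
```

Checks like this turn the chain-of-thought log from a debugging aid into an enforceable compliance artifact.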
The integration with multiple e-commerce and helpdesk platforms demonstrates the importance of robust connectors and API integrations for production LLM systems. Real-world utility depends not just on the quality of LLM outputs but on the ability to execute actual business operations.
Minimal claims 80%+ efficiency gains across a variety of e-commerce stores and projects that 90% of customer support tickets will be handled autonomously with only 10% escalated to human agents. They report having earned revenue from Dutch e-commerce clients and plan to expand across Europe.
However, these claims should be evaluated carefully. The source is a promotional case study published by LangChain, which has a commercial interest in showcasing successful applications of their ecosystem. No independent verification of the efficiency metrics is provided, and the specific methodology for measuring “efficiency gains” is not detailed. The 90% autonomous resolution target is stated as an expectation rather than a demonstrated achievement.
This case study illustrates several important patterns for production LLM deployments: decomposing a complex workflow into specialized agents, grounding responses with retrieval against a curated knowledge base, test-driven iteration backed by trace logging, and a staged rollout from human-reviewed drafts to full automation.
The case study represents an interesting example of a production multi-agent system, though practitioners should seek additional evidence before adopting similar approaches, particularly regarding the claimed efficiency metrics.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models, which hallucinated non-existent insurance products 15-45% of the time.
Snorkel developed a comprehensive benchmark dataset and evaluation framework for AI agents in commercial insurance underwriting, working with Chartered Property and Casualty Underwriters (CPCUs) to create realistic scenarios for small business insurance applications. The system leverages LangGraph and Model Context Protocol to build ReAct agents capable of multi-tool reasoning, database querying, and user interaction. Evaluation across multiple frontier models revealed tool-use errors in 36% of conversations, hallucination issues where models introduced domain knowledge not present in the guidelines, and substantial variance in performance across different underwriting tasks, with accuracy ranging from single digits to roughly 80% depending on the model and task complexity.
This panel discussion features three AI-native companies—Delphi (personal AI profiles), Seam AI (sales/marketing automation agents), and APIsec (API security testing)—discussing their journeys building production LLM systems over three years. The companies address infrastructure evolution from single-shot prompting to fully agentic systems, the shift toward serverless and scalable architectures, managing costs at scale (including burning through a trillion OpenAI tokens), balancing deterministic workflows with model autonomy, and measuring ROI through outcome-based metrics rather than traditional productivity gains. Key technical themes include moving away from opinionated architectures to let models reason autonomously, implementing state machines for high-confidence decisions, using tools like Pydantic AI and Logfire for instrumentation, and leveraging Pinecone for vector search at scale.