Entelligence addresses the challenges of managing large engineering teams by providing AI agents that handle code reviews, documentation maintenance, and team performance analytics. The platform combines LLM-based code analysis with learning from team feedback to provide contextually appropriate reviews, while maintaining up-to-date documentation and offering insights into engineering productivity beyond traditional metrics like lines of code.
Entelligence is building AI engineering agents designed to reduce operational overhead for large engineering teams. The company was founded by Ashia, who previously worked at Uber Freight (a 500-engineer organization) and experienced firsthand the challenges of coordinating large engineering teams—including performance reviews, status updates, code reviews, knowledge sharing, and team syncs. The core insight driving Entelligence is that while AI tools like Cursor have accelerated code generation, they haven’t addressed the team coordination challenges that compound as organizations grow.
The interview, conducted by Jordi from “Agents at Work,” provides valuable insights into how Entelligence operationalizes LLMs in production for software development workflows. This case study is particularly relevant given the rise of “vibe coding”—where engineers use AI to rapidly generate code without necessarily understanding every line—which creates new challenges around code quality, documentation, and knowledge management.
As engineering teams scale, operational challenges such as performance reviews, status updates, code reviews, knowledge sharing, and team syncs compound, consuming an increasing share of engineering time.
Entelligence takes model selection seriously and has developed an open-source PR review evaluation benchmark to compare different LLMs for code review tasks. When new models are released, they run comparisons—for example, finding that Deepseek outperformed Claude 3.5 for code reviews by a significant margin. This led to incorporating Deepseek into their platform.
The company is launching a human-in-the-loop PR evaluation leaderboard, similar to Berkeley’s LLM leaderboard but specifically for code review. Users can compare models like GPT-4o, Claude, or Deepseek on real PRs without knowing which model generated which review, then vote on which comments are more relevant. The goal is to gather thousands of reviews to converge on optimal model selection.
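The mechanics of such a blind, human-in-the-loop comparison can be sketched as follows. This is an illustrative sketch, not Entelligence's implementation: the class name, the anonymized "A"/"B" labels, and the win-rate aggregation are all assumptions.

```python
import random
from collections import defaultdict

class BlindReviewArena:
    """Hypothetical sketch: collect blind pairwise votes on PR reviews."""

    def __init__(self):
        self.wins = defaultdict(int)
        self.trials = defaultdict(int)

    def present(self, reviews):
        """Return (label, (model, review)) pairs in shuffled order so the
        voter cannot tell which model produced which review."""
        items = list(reviews.items())
        random.shuffle(items)
        return [("A", items[0]), ("B", items[1])]

    def vote(self, presented, winner_label):
        """Record a vote for the anonymized label the voter preferred."""
        labels = dict(presented)
        winner_model = labels[winner_label][0]
        for _, (model, _) in presented:
            self.trials[model] += 1
        self.wins[winner_model] += 1

    def win_rate(self, model):
        return self.wins[model] / max(self.trials[model], 1)

arena = BlindReviewArena()
shown = arena.present({"gpt-4o": "Consider a null check here...",
                       "deepseek": "Possible race condition in handler..."})
arena.vote(shown, "A")
```

With thousands of such votes, per-model win rates converge toward the "optimal model selection" the leaderboard is aiming for.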
The foundation of Entelligence’s system is what they describe as “elaborate agentic hybrid RAG search.” They spent their first few months building universal search across code, documentation, and issues. This context retrieval powers several features:
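A minimal sketch of what "hybrid" retrieval means in practice: fuse a lexical signal with a semantic one and rank by the blended score. Here the "embedding" is just a bag-of-words cosine stand-in, and the corpus, `alpha` weight, and scoring are illustrative assumptions, not Entelligence's system.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity over term-count vectors (embedding stand-in)."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_search(query, corpus, alpha=0.5):
    """Rank documents by a blend of keyword overlap and cosine similarity."""
    q = Counter(query.lower().split())
    scored = []
    for doc_id, text in corpus.items():
        d = Counter(text.lower().split())
        lexical = len(set(q) & set(d)) / len(set(q))  # keyword overlap
        semantic = cosine(q, d)                        # embedding stand-in
        scored.append((alpha * lexical + (1 - alpha) * semantic, doc_id))
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]

# Toy corpus spanning code, docs, and issues, mirroring the universal
# search surface described above.
corpus = {"auth.py": "handles user login tokens",
          "docs/setup.md": "install and login guide",
          "issue-42": "token refresh bug on login"}
print(hybrid_search("login token bug", corpus))
```

A production version would swap in BM25 and learned embeddings, plus the agentic query-rewriting layer the company alludes to.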
One of the most sophisticated aspects of Entelligence’s LLMOps implementation is their approach to learning from team-specific feedback. They discovered that the same product given to two different companies might see 70% comment acceptance at one and 20% at another—not because of product quality, but because engineering cultures differ dramatically.
Their solution involves tracking which comments engineers accept versus reject over time, then using this historical data to calibrate future reviews. This isn’t simple prompt engineering but rather a comprehensive approach to giving the model context about team preferences. After roughly two weeks of use, the system learns a team’s style and priorities.
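The accept/reject calibration loop described above can be approximated with a simple per-category tally: surface a comment category freely until there is enough history, then suppress it if the team rarely accepts it. The category names, sample minimum, and threshold here are invented for illustration.

```python
from collections import defaultdict

class TeamFeedbackModel:
    """Sketch of per-team calibration from accept/reject history."""

    def __init__(self, min_samples=5, threshold=0.3):
        self.accepted = defaultdict(int)
        self.total = defaultdict(int)
        self.min_samples = min_samples   # history needed before filtering
        self.threshold = threshold       # minimum acceptance rate to keep

    def record(self, category, accepted):
        self.total[category] += 1
        if accepted:
            self.accepted[category] += 1

    def should_surface(self, category):
        n = self.total[category]
        if n < self.min_samples:
            return True  # not enough history yet: show the comment
        return self.accepted[category] / n >= self.threshold

team = TeamFeedbackModel()
for _ in range(6):
    team.record("style-nit", accepted=False)  # team ignores nitpicks
for _ in range(6):
    team.record("security", accepted=True)    # team values security flags
assert not team.should_surface("style-nit")
assert team.should_surface("security")
```

The real system feeds this history back as model context rather than a hard filter, but the effect is the same: after a couple of weeks, reviews skew toward what the team actually acts on.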
A key challenge they addressed is that LLMs are inherently “pedantic”—if asked to review code, they’ll find 30-50 issues with almost any PR. However, humans don’t need or want perfection on every line. Entelligence built orchestration to force the model to eventually say “looks good to me” rather than endlessly finding issues. This required multiple layers of reflection after initial generation.
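One way to picture that reflection layer: take the model's raw issue list, keep only high-severity findings up to a cap, and emit "LGTM" when nothing clears the bar. The severity scores, threshold, and cap below are illustrative assumptions.

```python
def reflect(issues, min_severity=0.7, max_comments=5):
    """Post-generation filter: keep only significant issues, capped,
    and fall through to an approval verdict when none survive."""
    kept = sorted((i for i in issues if i["severity"] >= min_severity),
                  key=lambda i: i["severity"], reverse=True)[:max_comments]
    if not kept:
        return {"verdict": "LGTM", "comments": []}
    return {"verdict": "changes-requested", "comments": kept}

raw = [{"msg": "rename variable", "severity": 0.2},
       {"msg": "missing docstring", "severity": 0.3},
       {"msg": "SQL injection risk", "severity": 0.95}]

review = reflect(raw)
assert review["verdict"] == "changes-requested"
assert review["comments"][0]["msg"] == "SQL injection risk"
assert reflect([{"msg": "nit", "severity": 0.1}])["verdict"] == "LGTM"
```

In the actual system the filtering is itself LLM-driven reflection rather than a numeric threshold, but the orchestration goal is identical: force a terminal "looks good to me" instead of an endless stream of pedantic findings.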
Entelligence deploys across multiple touchpoints in the development workflow rather than as a single standalone tool.
The documentation agent represents another significant production AI use case, keeping documentation current as the codebase evolves.
For certain validation tasks, Entelligence runs mini sandboxed environments to verify that lint passes and that tests execute properly. However, they acknowledge limitations—full latency testing would require dockerized containers with access to the complete production environment and external services, which is beyond current scope.
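The shape of those lightweight checks is roughly a subprocess per tool with a timeout. The commands below are placeholders, and this sketch does not isolate the filesystem or network the way a real sandbox (e.g. a container) would.

```python
import subprocess

def run_check(cmd, cwd, timeout=120):
    """Run one validation command; return (passed, combined output)."""
    try:
        result = subprocess.run(cmd, cwd=cwd, capture_output=True,
                                text=True, timeout=timeout)
        return result.returncode == 0, result.stdout + result.stderr
    except subprocess.TimeoutExpired:
        return False, "timed out"

def validate(repo_dir):
    """Run a small battery of checks against a checked-out PR."""
    checks = {"lint": ["python", "-m", "pyflakes", "."],
              "tests": ["python", "-m", "pytest", "-q"]}
    return {name: run_check(cmd, repo_dir)[0] for name, cmd in checks.items()}
```

Full latency or integration testing would, as the interview notes, require dockerized environments with access to production services, which is why this layer stays deliberately small.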
The system has customized support for the top 20 programming languages (Python, Go, Java, C, etc.). For less common languages, it falls back to baseline LLM capabilities without specialized optimization.
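That "top-20 with fallback" behaviour amounts to a dispatch table with a generic default. The rule names here are invented for illustration.

```python
# Language-specific review rule sets; anything not listed falls back
# to the baseline LLM review with no specialized optimization.
LANGUAGE_RULES = {
    "python": ["pep8-naming", "mutable-default-args"],
    "go": ["error-handling", "goroutine-leaks"],
    "java": ["null-safety", "resource-closing"],
}

def rules_for(language):
    return LANGUAGE_RULES.get(language.lower(), ["generic-llm-review"])

assert "mutable-default-args" in rules_for("Python")
assert rules_for("COBOL") == ["generic-llm-review"]
```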
A recurring theme is managing the tension between comprehensive AI review and practical usability. Models naturally want to flag every possible issue, but this creates noise that engineers ignore. Entelligence’s solution involves multiple layers of filtering and learning from team behavior.
An unexpected production requirement emerged around emotional intelligence. One engineering leader specifically requested that the system include positive feedback, not just criticism. Even automated systems benefit from making users feel good about their work before pointing out issues. The platform includes “sprint assessments” with badges and rewards highlighting each engineer’s main contributions.
For enterprise customers with existing documentation systems, Entelligence had to build export capabilities to multiple platforms. Rather than forcing customers to use their editor, they “hydrate” documentation wherever customers prefer to host it while maintaining the source of truth in their system.
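The hydration pattern is essentially one canonical store fanning out to pluggable exporters. The `Exporter` protocol and destination names below are assumptions for illustration, not Entelligence's API.

```python
from typing import Protocol

class Exporter(Protocol):
    def publish(self, doc_id: str, markdown: str) -> None: ...

class InMemoryExporter:
    """Stand-in for a Confluence/Notion-style destination."""
    def __init__(self):
        self.store = {}
    def publish(self, doc_id, markdown):
        self.store[doc_id] = markdown

class DocHub:
    """Source of truth; pushes every update to all registered destinations."""
    def __init__(self):
        self.docs = {}
        self.exporters = []
    def register(self, exporter):
        self.exporters.append(exporter)
    def update(self, doc_id, markdown):
        self.docs[doc_id] = markdown          # canonical copy stays here
        for exporter in self.exporters:       # hydrate each destination
            exporter.publish(doc_id, markdown)

hub = DocHub()
confluence_like = InMemoryExporter()
notion_like = InMemoryExporter()
hub.register(confluence_like)
hub.register(notion_like)
hub.update("auth-guide", "# Auth\nTokens rotate daily.")
assert confluence_like.store["auth-guide"] == notion_like.store["auth-guide"]
```

Keeping the canonical copy in one place is what lets customers read documentation wherever they prefer without it drifting out of sync.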
Interestingly, adoption is not strongest among “elite” engineering organizations. Engineers who consider themselves cream-of-the-crop tend to be more resistant to AI feedback. The strongest adoption comes from engineering teams focused on delivering business value who want operational overhead removed. Enterprise sales often proceed team-by-team rather than organization-wide.
The platform serves multiple personas across the engineering organization, from individual engineers to engineering leaders.
Entelligence partners with open source projects, offering their suite of tools for code reviews, contributor onboarding, codebase change tracking, and contribution analytics. This serves as both community support and a pathway to adoption.
The company is moving beyond traditional engineering metrics toward impact-based measurement. Their PR review analysis generates metrics on features built, impact of features, and complexity of changes—attempting to measure meaningful contribution rather than code volume.
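A toy version of impact-weighted measurement: score a PR by the kind of work and apply diminishing returns to raw size so line count alone cannot dominate. The weights, fields, and formula are invented for illustration and are not Entelligence's actual metric.

```python
# Hypothetical weights: a feature counts for more than a chore of the
# same size, and size contributes sub-linearly.
IMPACT_WEIGHTS = {"feature": 3.0, "bugfix": 2.0, "refactor": 1.5, "chore": 0.5}

def pr_impact(pr):
    weight = IMPACT_WEIGHTS.get(pr["kind"], 1.0)
    # Diminishing returns: a 2000-line PR is not 100x a 20-line one.
    size_factor = min(pr["lines_changed"], 500) ** 0.5
    return weight * size_factor * (1 + 0.1 * pr["files_touched"])

small_feature = {"kind": "feature", "lines_changed": 40, "files_touched": 3}
same_size_chore = {"kind": "chore", "lines_changed": 40, "files_touched": 3}
assert pr_impact(small_feature) > pr_impact(same_size_chore)
```

The real system derives the "kind" and "complexity" signals from its PR review analysis rather than from hand-labeled fields.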
The interview candidly discusses the platform's current limitations.
Entelligence uses their own product internally, which has helped identify issues like overly permissive initial reviews and the need for the feedback learning loop. The founder notes they’ve iterated since day one when the system “would not catch anything.”
Lance Martin from LangChain discusses the emerging discipline of "context engineering" through his experience building Open Deep Research, a deep research agent that evolved over a year to become the best-performing open-source solution on Deep Research Bench. The conversation explores how managing context in production agent systems—particularly across dozens to hundreds of tool calls—presents challenges distinct from simple prompt engineering, requiring techniques like context offloading, summarization, pruning, and multi-agent isolation. Martin's iterative development journey illustrates the "bitter lesson" for AI engineering: structured workflows that work well with current models can become bottlenecks as models improve, requiring engineers to continuously remove structure and embrace more general approaches to capture exponential model improvements.
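Two of the techniques named above, context offloading and pruning, can be sketched together: full tool outputs live outside the prompt, while only trimmed summaries stay inside a character budget. The summarizer here is a truncation stub standing in for an LLM call, and all names are illustrative.

```python
def summarize(text, max_chars=80):
    """Stand-in for an LLM summarization call."""
    return text[:max_chars] + ("..." if len(text) > max_chars else "")

class ContextManager:
    """Offload full tool outputs; keep a pruned summary window in-prompt."""

    def __init__(self, budget_chars=400):
        self.offloaded = {}   # full outputs, retrievable by call id
        self.window = []      # (call_id, summary) pairs sent to the model
        self.budget = budget_chars

    def add_tool_result(self, call_id, output):
        self.offloaded[call_id] = output                  # offload full text
        self.window.append((call_id, summarize(output)))
        while sum(len(s) for _, s in self.window) > self.budget:
            self.window.pop(0)                            # prune oldest first

    def prompt_context(self):
        return "\n".join(f"[{cid}] {s}" for cid, s in self.window)

ctx = ContextManager(budget_chars=200)
for i in range(10):
    ctx.add_tool_result(f"search-{i}", "result text " * 50)
assert len(ctx.offloaded) == 10   # nothing is lost...
assert len(ctx.window) < 10       # ...but the prompt window stays pruned
```

Multi-agent isolation, the remaining technique, extends the same idea by giving each sub-agent its own independent window.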
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
Manus AI demonstrates their production-ready AI agent platform through a technical workshop showcasing their API and application framework. The session covers building complex AI applications including a Slack bot, web applications, browser automation, and invoice processing systems. The platform addresses key production challenges such as infrastructure scaling, sandboxed execution environments, file handling, webhook management, and multi-turn conversations. Through live demonstrations and code walkthroughs, the workshop illustrates how their platform enables developers to build and deploy AI agents that handle millions of daily conversations while providing consistent pricing and functionality across web, mobile, Slack, and API interfaces.