Cursor, founded by MIT graduates, developed an AI-powered code editor that goes beyond simple code completion to reimagine how developers interact with AI while coding. By focusing on innovative features like instructed edits and codebase indexing, along with developing custom models for specific tasks, they achieved rapid growth to $100M in revenue. Their success demonstrates how combining frontier LLMs with custom-trained models and careful UX design can transform developer productivity.
Cursor is an AI-powered code editor founded by four MIT graduates (including Amon, the speaker) who all graduated in 2022. The company represents a compelling case study in LLMOps, demonstrating how to build a production AI system that combines frontier models from major labs with custom-trained models, all while maintaining a lean team of 30 people and achieving $100 million in revenue. The product launched in 2023, and this interview captures insights from the company’s rapid growth and technical evolution.
The founders were early users of GitHub Copilot when it launched in summer 2021, recognizing it as “the first language model based product that actually worked.” However, they observed a critical gap: while the underlying models were improving rapidly in research, the product experience remained largely static—simply swapping in newer OpenAI models without reimagining the interface. Their thesis was that if language models were going to produce 90-95% of software, the interface needed fundamental reinvention, not just better autocomplete.
A key insight came from understanding that Copilot’s success was partially due to Microsoft’s ownership of both VS Code and GitHub, which allowed them to introduce the “ghost text” UX pattern for autocomplete. This replaced the older “intellisense bubble” approach and dramatically improved usefulness. The founders realized that to push the boundaries of AI-assisted coding, they needed to own the IDE itself.
Cursor employs a pragmatic hybrid approach to model deployment that reflects mature LLMOps thinking:
Frontier Model Usage: The company primarily uses Claude Sonnet as the main frontier model, with options for users to select other models like GPT-4, O1, O3, and DeepSeek’s R1 reasoning model. These are accessed via API calls and handle the complex reasoning tasks that require the most capable models.
Custom Model Development: Cursor has trained approximately 50 model variants, with around 10 actively served in production. These custom models handle specific use cases where frontier models are suboptimal:
Low-latency requirements: For features that need to feel instantaneous (like autocomplete and next-edit prediction), the latency and cost of frontier model APIs is prohibitive. Custom models provide the speed necessary for a responsive user experience.
Unique data distribution: The company identified that pre-training data (primarily code from GitHub) captures code states at commit time, but not the edit patterns between commits. Cursor’s proprietary data on how users actually make edits represents an out-of-distribution advantage that improves model performance for next-edit prediction.
Embedding models: The company trains custom embedding models for codebase indexing, noting that the best researchers at major labs focus on advancing reasoning capabilities rather than embeddings, leaving room for specialized improvement.
The speaker emphasizes being “pragmatic” about model choices, warning against the 2023-2024 trend of companies raising large sums and spending heavily on GPUs before achieving product-market fit. Instead, Cursor advocates using “the best tool for the job,” which often means calling an API rather than training custom models prematurely.
Cursor’s development process is characterized by extensive experimentation, with the speaker noting that for every shipped feature, there are approximately 10 failed experiments. Key production features include:
Command K (Instructed Edit): A core feature that allows users to invoke instructed edits with a keyboard shortcut, representing a new UX paradigm for AI-assisted code modification.
Codebase Indexing: Enables users to ask questions about their entire codebase, likely leveraging custom embedding models and retrieval systems.
Cursor Tab (formerly Copilot++): Perhaps the most technically interesting feature from an LLMOps perspective. Unlike traditional autocomplete that predicts what comes next after the cursor, Cursor Tab predicts the user’s next edit to existing code. The development history is instructive:
This pattern demonstrates a key LLMOps principle: some features require sufficient user data before they become viable, and companies should be prepared to revisit failed experiments as data accumulates.
The speaker discusses competitive positioning with notable candor, acknowledging that Cursor doesn’t yet have strong moats but is building toward them:
Current Advantages:
Future Stickiness Strategy: The company plans to build enterprise stickiness by specializing models on individual codebases. When thousands of developers at a company use Cursor daily, the accumulated edit data can be used to improve models specifically for that company (without cross-contamination with other customers’ data). This creates a flywheel where increased usage leads to better performance, making migration less attractive.
The speaker notes that growth currently takes priority over stickiness, reflecting a strategic choice about when to invest in defensive capabilities.
Cursor takes an aggressive approach to releasing features, shipping “half-finished things” that competitors refuse to release. The rationale is that real-world feedback enables rapid improvement. The first version of Cursor Tab “sucked,” but releasing it allowed the team to observe user reactions and iterate quickly.
This philosophy requires careful consideration of user experience and trust, but has enabled faster learning cycles. The company releases features “as soon as something shows signs of usefulness to the team,” leveraging their position as power users of their own product.
The company maintained an extremely lean structure during its growth phase—just four technical co-founders for the first year, reaching 30 employees by the time of this interview. Early hiring came from MIT connections and, notably, from power users of the product who demonstrated high code quality through analytics. This represents an unusual recruiting channel enabled by the nature of developer tools.
The team culture emphasizes curiosity about the future of programming, with conversations at dinner focusing on speculation about model capabilities and UI/UX possibilities. The speaker estimates the product is “2-3% done,” suggesting significant room for continued development.
The path to $100 million in revenue came without a traditional sales team (sales hiring only began 1-2 months before the interview, with 3 salespeople). Revenue comes from a mix of individual developers and teams, with individual developers representing a significant portion.
Two factors enabled this unusual self-serve growth for developer tools (a category historically known for poor monetization): the normalization of paying $20/month for AI tools (established by ChatGPT and Copilot), and word-of-mouth driven by product quality.
The speaker offers a nuanced view on the future of AI coding tools. Rather than believing in a pure “prompt to code” future, they envision a spectrum:
A notable observation is that AI colleagues may not be the optimal interface pattern. The overhead of communicating with human colleagues (implicit knowledge, back-and-forth iterations, ideas that are hard to articulate) suggests that truly instant, responsive AI tools that iterate in real-time may be more effective than autonomous agents that work independently.
The company aims to make it possible for anyone to produce sophisticated software like Figma, but acknowledges this is a long-term vision requiring continued advancement in both models and interfaces.
This lecture transcript from Yangqing Jia, VP at NVIDIA and founder of Lepton AI (acquired by NVIDIA), explores the evolution of AI system design from an engineer's perspective. The talk covers the progression from research frameworks (Caffe, TensorFlow, PyTorch) to production AI infrastructure, examining how LLM applications are built and deployed at scale. Jia discusses the emergence of "neocloud" infrastructure designed specifically for AI workloads, the challenges of GPU cluster management, and practical considerations for building consumer and enterprise LLM applications. Key insights include the trade-offs between open-source and closed-source models, the importance of RAG and agentic AI patterns, infrastructure design differences between conventional cloud and AI-specific platforms, and the practical challenges of operating LLMs in production, including supply chain management for GPUs and cost optimization strategies.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
Codeium addressed the limitations of traditional embedding-based retrieval in code generation by developing a novel approach called M-query, which leverages vertical integration and custom infrastructure to run thousands of parallel LLM calls for context analysis. Instead of relying solely on vector embeddings, they implemented a system that can process entire codebases efficiently, resulting in more accurate and contextually aware code generation. Their approach has led to improved user satisfaction and code generation acceptance rates while maintaining rapid response times.