ZenML

Building and Deploying a Code Generation LLM at Scale

Replit 2024
View original source

Replit, a software development platform, aimed to democratize coding by developing their own code completion LLM. Using Databricks' Mosaic AI Training infrastructure, they successfully built and deployed a multi-billion parameter model in just three weeks, enabling them to launch their code completion feature on time with a small team. The solution allowed them to abstract away infrastructure complexity and focus on model development, resulting in a production-ready code generation system that serves their 25 million users.

Industry

Tech

Technologies

Overview

Replit is a software development platform that aims to democratize coding by providing an intuitive integrated development environment (IDE) accessible to beginners and experienced developers alike. With approximately 25 million users, Replit recognized the transformative potential of generative AI in accelerating software development. Their goal was to build AI-powered features that could complete, explain, and transform code, thereby reducing the time from idea to deployed product while making software development more accessible to a broader audience.

The case study focuses on how Replit built a custom code completion model from scratch using Databricks’ Mosaic AI Training infrastructure, achieving a remarkably fast development cycle despite having a small engineering team. This represents a compelling example of LLMOps in practice, where the challenges of training large language models at scale were addressed through careful platform selection and operational efficiency.

The Challenge

Replit’s AI engineering team, led by VP of AI Michele Catasta (who previously worked on LLM training at Google), understood that building an effective code completion feature required training a specialized model rather than relying solely on off-the-shelf solutions. The key challenges they faced were multifaceted:

The team recognized that assembling an end-to-end training infrastructure for large language models is extremely difficult. This includes managing GPU clusters, handling distributed training, ensuring fault tolerance, and optimizing for cost-efficiency. For a small team, these infrastructure concerns could easily consume engineering resources that would be better spent on model architecture, data pipelines, and domain-specific optimizations.

They needed to focus on differentiating aspects of their model, particularly fine-tuning on Replit-specific code patterns and developing custom vocabulary suited to their platform’s multi-language coding environment. The operational overhead of managing training infrastructure at scale was a significant barrier to achieving this focus.

Additionally, Replit operated under an accelerated product launch timeline, meaning they needed a solution that could support rapid experimentation and scaling without extended setup or stabilization periods.

The Solution: Mosaic AI Training Infrastructure

Replit chose to partner with Databricks and leverage the Mosaic AI Training infrastructure, which is designed to simplify the process of large-scale model training. The platform abstracts away hardware complexity, allowing organizations to pretrain and fine-tune AI models securely and cost-effectively.

The selection of Mosaic AI Training was driven by several factors. According to Michele Catasta, alternative solutions they explored were either underdeveloped or exposed a much higher level of complexity than their small team could handle. The Mosaic AI platform provided what he described as the “best white glove experience” available on the market, combining ease of use with enterprise-grade capabilities.

The training approach followed an iterative methodology. Replit started by experimenting with smaller models using the Mosaic AI Training tools. This allowed them to validate their data pipelines, custom vocabulary, and training configurations before committing to larger-scale runs. The ability to experiment efficiently at smaller scales before scaling up is a critical LLMOps practice that reduces both risk and cost.

As they approached their product launch deadline, Replit scaled up to a larger allocation of 256 GPUs just one week before their developer day event. The rapid scaling capability of the Mosaic AI platform was crucial here, as provisioning and configuring such compute resources typically requires significant lead time and operational expertise.

Technical Implementation Details

While the case study does not provide exhaustive technical specifications, several important aspects of the implementation can be inferred:

The model developed was described as a “multi-billion parameter code generation model,” indicating a substantial model size that would require distributed training across many GPUs. Training models of this scale involves complexities such as data parallelism, model parallelism, gradient synchronization, and checkpoint management, all of which the Mosaic AI platform helped abstract.

The training process incorporated Replit-specific data, allowing them to fine-tune the model on code patterns and idioms common to their platform. This domain-specific training data was a key differentiator, as it would enable the model to provide more relevant code completions for Replit users.

The development of custom vocabulary was mentioned as a focus area, suggesting that Replit may have used specialized tokenization approaches suited to their multi-language coding environment. Proper tokenization is particularly important for code models, as standard natural language tokenizers often perform poorly on programming constructs.

The team conducted what was colorfully described as a “YOLO run” of their LLM in the final week before launch. While the terminology suggests a high-risk approach, this was enabled by the confidence the team had built through their smaller-scale experiments and the reliability of the Mosaic AI platform.

Operational Outcomes and Results

The collaboration produced several notable outcomes from an LLMOps perspective:

The entire journey from initial development to production-ready code completion model took approximately three weeks. For a multi-billion parameter model developed by a small team, this represents a remarkable acceleration compared to traditional approaches that might take months to achieve similar results.

The small team was able to achieve significant milestones without needing to scale up their engineering headcount proportionally. This productivity multiplier is a key benefit of using managed AI training infrastructure, as it allows organizations to focus engineering effort on value-adding activities rather than infrastructure management.

The development cycle demonstrated what was described as a “continuous cycle of learning, adapting and optimizing.” This iterative approach is characteristic of mature LLMOps practices, where models are continuously refined based on feedback and performance metrics.

The streamlined development process facilitated a “smoother, more reliable development cycle,” which is essential for managing the inherent complexities and volatility associated with training LLMs. Large model training can be prone to various failure modes, and having robust tooling and support helps navigate these challenges.

Critical Assessment

It’s worth noting that this case study is published by Databricks and features their Mosaic AI product, so it should be read with appropriate skepticism regarding the marketing claims. The specific metrics around “3 weeks to launch” and productivity gains are presented without detailed baseline comparisons, making it difficult to independently verify the magnitude of improvement.

That said, the fundamental value proposition is credible: managed AI training infrastructure can significantly reduce the operational burden of training large models, particularly for smaller teams without dedicated ML infrastructure engineers. The challenges of GPU cluster management, distributed training, and fault-tolerant checkpointing are well-documented, and platforms that address these concerns can genuinely accelerate development cycles.

The quote from Michele Catasta describing this as his “best career accomplishment to date” suggests genuine satisfaction with the outcome, though this should be balanced against the promotional context of the case study.

Lessons for LLMOps Practitioners

Several key takeaways emerge for organizations considering similar initiatives:

The importance of abstracting infrastructure complexity cannot be overstated for small teams. While building custom training infrastructure provides maximum flexibility, the engineering overhead can be prohibitive for teams that need to focus on model quality and product development.

Iterative scaling from smaller experiments to full-scale training runs is a prudent approach. It allows teams to validate their approaches and build confidence before committing significant compute resources, reducing both risk and cost.

Having access to on-demand GPU scaling is critical for meeting aggressive timelines. The ability to rapidly provision 256 GPUs when needed enabled Replit to meet their launch deadline despite starting larger-scale training just one week before.

Domain-specific training data and custom vocabulary remain important differentiators even when using managed infrastructure. The platform choice addresses operational concerns but does not substitute for thoughtful model design and training data curation.

Finally, the partnership model between Replit and Databricks, including what appears to be close technical collaboration and support, highlights the value of vendor relationships that go beyond simple platform access to include expertise and guidance throughout the development process.

More Like This

Building an AI-Powered IDE at Scale: Architectural Deep Dive

Cursor 2025

Cursor, an AI-powered IDE built by Anysphere, faced the challenge of scaling from zero to serving billions of code completions daily while handling 1M+ queries per second and 100x growth in load within 12 months. The solution involved building a sophisticated architecture using TypeScript and Rust, implementing a low-latency sync engine for autocomplete suggestions, utilizing Merkle trees and embeddings for semantic code search without storing source code on servers, and developing Anyrun, a Rust-based orchestrator service. The results include reaching $500M+ in annual revenue, serving more than half of the Fortune 500's largest tech companies, and processing hundreds of millions of lines of enterprise code written daily, all while maintaining privacy through encryption and secure indexing practices.

code_generation code_interpretation chatbot +34

Building and Deploying Enterprise-Grade LLMs: Lessons from Mistral

Mistral 2023

Mistral, a European AI company, evolved from developing academic LLMs to building and deploying enterprise-grade language models. They started with the successful launch of Mistral-7B in September 2023, which became one of the top 10 most downloaded models on Hugging Face. The company focuses not just on model development but on providing comprehensive solutions for enterprise deployment, including custom fine-tuning, on-premise deployment infrastructure, and efficient inference optimization. Their approach demonstrates the challenges and solutions in bringing LLMs from research to production at scale.

code_generation code_interpretation translation +25

Agentic AI Copilot for Insurance Underwriting with Multi-Tool Integration

Snorkel 2025

Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.

healthcare fraud_detection customer_support +90