Numbers Station addresses the challenge of integrating foundation models into the modern data stack for data processing and analysis. The company tackles key tasks including SQL query generation from natural language, data cleaning, and data linkage across sources, and develops solutions for common LLMOps issues such as scale limitations, prompt brittleness, and domain knowledge integration through techniques like model distillation, prompt ensembling, and domain-specific pre-training.
Numbers Station is a company focused on bringing foundation model technology into the modern data stack to accelerate time-to-insights for organizations. This case study, presented as a lightning talk in collaboration with Stanford AI Lab, explores how large language models and foundation models can be applied to automate various data engineering tasks that traditionally require significant manual effort. The presentation offers a balanced view of both the capabilities and the operational challenges of deploying these models in production data environments.
The modern data stack encompasses a set of tools used to process, store, and analyze data—from data origination in apps like Salesforce or HubSpot, through extraction and loading into data warehouses like Snowflake, transformation with tools like dbt, and visualization with Tableau or Power BI. Despite the maturity of these tools, there remains substantial manual work throughout the data pipeline, which Numbers Station aims to address with foundation model automation.
The talk begins with a recap of foundation models—very large neural networks trained on massive amounts of unlabeled data using self-supervised learning techniques. The key innovation highlighted is the emergence of in-context learning capabilities at scale, where models can generalize to downstream tasks through carefully crafted prompts without requiring task-specific fine-tuning. This represents a paradigm shift from traditional AI, enabling rapid prototyping of AI applications by users who may not be AI experts themselves.
The auto-regressive language model architecture, trained to predict the next word in a sequence, serves as the foundation. By casting any task as a generation task through prompt crafting, the same underlying model can be reused across many different applications. This flexibility is central to Numbers Station’s approach to data engineering automation.
One of the primary applications discussed is generating SQL queries from natural language requests. In traditional enterprise settings, business users must submit requests to data engineering teams for ad-hoc queries, which involves multiple iterations and significant delays. Foundation models can reduce this back-and-forth by directly translating natural language questions into SQL.
However, the presentation is careful to note that while this works well for simple queries, there are significant caveats for complex queries that require domain-specific knowledge. For instance, when a table has multiple date columns, the model may not inherently know which one to use for a particular business question without additional context. This points to an important production consideration: foundation models alone may not be sufficient for enterprise-grade SQL generation without supplementary domain knowledge integration.
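As a concrete illustration of why supplementary context matters, a text-to-SQL prompt can inject the missing domain knowledge alongside the schema. This is a minimal sketch, not Numbers Station's actual interface; `build_sql_prompt` and the hint format are invented for illustration:

```python
def build_sql_prompt(schema: str, question: str, hints: list[str]) -> str:
    """Assemble a text-to-SQL prompt that pairs the table schema with
    organization-specific hints (e.g. which of several date columns
    answers revenue questions) so the model is not left to guess."""
    hint_block = "\n".join(f"-- {h}" for h in hints)
    return (
        f"Given the table:\n{schema}\n"
        f"Notes:\n{hint_block}\n"
        f"Write a SQL query that answers: {question}\nSQL:"
    )


prompt = build_sql_prompt(
    schema="orders(id, created_at, shipped_at, amount)",
    question="What was total revenue last month?",
    hints=["Use shipped_at, not created_at, for revenue reporting."],
)
```

Without the hint line, a model seeing both `created_at` and `shipped_at` has no principled way to choose between them for a revenue question.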
Data cleaning—fixing typos, correcting missing values, and standardizing formats—is traditionally handled through extensive SQL rule development. This process is time-consuming and fragile, as rules often break when encountering edge cases not anticipated during development.
Foundation models offer an alternative approach using in-context learning. By creating a prompt with a few examples of correct transformations, the model can generalize these patterns across entire datasets. The model derives patterns automatically from the provided examples, potentially eliminating much of the manual rule-crafting process.
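A few-shot cleaning prompt of this kind can be assembled mechanically from a handful of labeled pairs. The sketch below is illustrative; `cleaning_prompt` is a hypothetical helper, and the model call itself is omitted:

```python
def cleaning_prompt(examples: list[tuple[str, str]], value: str) -> str:
    """Build a few-shot prompt: each (dirty, clean) pair demonstrates
    the desired transformation, and the model is asked to continue
    the pattern for the final, unlabeled value."""
    shots = [f"Input: {dirty}\nOutput: {clean}" for dirty, clean in examples]
    shots.append(f"Input: {value}\nOutput:")
    return "\n\n".join(shots)


p = cleaning_prompt(
    examples=[("N.Y.", "New York"), ("calif.", "California")],
    value="Tex.",
)
```

The same two demonstrations generalize across every abbreviation variant in the column, which is exactly the pattern a handcrafted rule set would need many cases to cover.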
The presentation acknowledges scalability issues with this approach when applied to large datasets, which will be addressed in the technical challenges section below.
Data linkage involves finding connections between different data sources that lack common identifiers—for example, linking customer records between Salesforce and HubSpot when there’s no shared ID for joins. Traditional approaches require engineers to develop complex matching rules, which can be brittle in production.
With foundation models, the approach involves feeding both records to the model and asking in natural language whether they represent the same entity. The presentation notes that the best production solution often combines rules with foundation model inference: use rules for the 80% of cases that are straightforward, then call the foundation model for complex edge cases that would otherwise require extensive rule engineering.
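The hybrid pattern can be sketched as a function that applies cheap deterministic rules first and only consults the model for ambiguous pairs. All names here (`link_records`, the stub model) are hypothetical, and the matching rules are examples rather than Numbers Station's production logic:

```python
def link_records(a: dict, b: dict, ask_model) -> bool:
    """Hybrid entity linkage: deterministic rules resolve the clear-cut
    majority of pairs; only ambiguous pairs reach the expensive model
    call (`ask_model` is any prompt -> bool callable)."""
    norm = lambda s: (s or "").strip().lower()
    # Rule: identical normalized email is a definite match.
    if a.get("email") and norm(a.get("email")) == norm(b.get("email")):
        return True
    # Rule: identical name and city is a definite match.
    if a.get("name") and norm(a.get("name")) == norm(b.get("name")) \
            and norm(a.get("city")) == norm(b.get("city")):
        return True
    # Edge case: defer to the foundation model.
    return ask_model(f"Same customer?\nA: {a}\nB: {b}")


# Demo with a stub model that records when it is consulted.
model_calls = []
def stub_model(prompt):
    model_calls.append(prompt)
    return False

clear_match = link_records(
    {"email": "jo@x.com", "name": "Jo"},
    {"email": "JO@x.com ", "name": "Joanna"},
    stub_model,
)
edge_case = link_records(
    {"name": "Jo Smith", "city": "Austin"},
    {"name": "J. Smith", "city": "Austin"},
    stub_model,
)
```

In the demo, the exact-email pair never touches the model; only the fuzzy name pair triggers an inference call, which is the cost profile the talk describes.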
Foundation models are extremely large and can be expensive and slow to run at scale. The presentation distinguishes between two usage patterns with different scalability requirements: interactive prototyping, where a handful of model calls is affordable, and production inference over entire datasets, where a model call per row quickly becomes cost-prohibitive.
The primary solution discussed is model distillation, where a large foundation model is used for prototyping, then its knowledge is transferred to a smaller model through fine-tuning. This distilled model can achieve comparable performance with significantly reduced computational requirements. The presentation claims this approach can effectively “bridge the gap” between large and small models with good prototyping and fine-tuning practices.
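The first step of such a distillation pipeline, generating student training data from teacher outputs, might look like the following sketch. The `teacher` callable is stubbed here; in practice it would be the large hosted model:

```python
import json

def distillation_dataset(unlabeled_values, teacher) -> list[str]:
    """Label raw production values with the large 'teacher' model to
    create JSONL fine-tuning records for a small 'student' model.
    `teacher` is any prompt -> completion callable."""
    records = []
    for value in unlabeled_values:
        prompt = f"Standardize this company name: {value}"
        records.append({"prompt": prompt, "completion": teacher(prompt)})
    return [json.dumps(r) for r in records]


# Stub teacher standing in for a large hosted model.
teacher = lambda p: p.split(": ", 1)[1].rstrip(".,").title() + " Inc"
dataset = distillation_dataset(["acme corp.", "globex, llc"], teacher)
```

The resulting JSONL file is what a standard fine-tuning job for the smaller student model would consume; after fine-tuning, production traffic hits only the cheap student.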
Another strategy is to use foundation models selectively—only invoking the model when truly necessary. For tasks simple enough to be solved with rules, the model can be used to automatically derive those rules from data rather than making predictions directly. This approach is described as “always better than handcrafting rules” while avoiding the computational overhead of model inference at scale.
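Deriving a rule from the model rather than calling it per row can be sketched as follows: the model is asked once for a reusable pattern (a regex here), which is then applied cheaply across the whole dataset. The stub model and `derive_phone_rule` helper are hypothetical:

```python
import re

def derive_phone_rule(model):
    """Ask the model once for a regex covering the observed phone
    formats, then compile it so it can be applied to millions of rows
    with no further inference. A real deployment would validate the
    model's output against held-out examples before trusting it."""
    pattern = model(
        "Write a regex matching US phone numbers like "
        "555-123-4567 or (555) 123-4567"
    )
    return re.compile(pattern)


# Stub model returning a plausible regex.
rule = derive_phone_rule(lambda p: r"\(?\d{3}\)?[ -]?\d{3}-\d{4}")
```

One model call amortizes over the entire table, which is the sense in which rule derivation avoids the overhead of per-row inference.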
A significant operational challenge is the sensitivity of foundation models to prompt formatting. The same logical prompt expressed differently can yield different predictions, which is problematic for data applications where users expect deterministic outputs. The presentation provides an example showing that manual demonstration selection versus random demonstration selection produces a “huge performance gap.”
To address this, Numbers Station developed techniques published in academic venues (referenced as an “AMA paper”). The core idea is to apply multiple prompts to the same input and aggregate the predictions to produce a final result. This ensemble approach reduces variance and improves reliability compared to single-prompt methods.
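The aggregation idea can be sketched as a majority vote over several phrasings of the same question. This is an illustrative simplification of the published approach, with a stub model standing in for real inference:

```python
from collections import Counter

def ensemble_predict(value: str, templates: list[str], model) -> str:
    """Render several phrasings of the same question, query the model
    with each, and return the majority answer, damping the variance a
    single brittle prompt would introduce."""
    answers = [model(t.format(value=value)) for t in templates]
    return Counter(answers).most_common(1)[0][0]


templates = [
    "Is '{value}' a company name? Answer yes or no.",
    "Answer yes or no: does '{value}' refer to a company?",
    "'{value}' is a company name, yes or no?",
]
# Stub model that is wrong on one phrasing but right on the other two.
answers = iter(["yes", "no", "yes"])
result = ensemble_predict("Numbers Station", templates, lambda p: next(answers))
```

Even though one phrasing misleads the stub model, the vote recovers the majority answer, which is the variance-reduction effect the ensemble is designed for.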
Additional techniques beyond multi-prompt aggregation are mentioned in the talk but not elaborated in detail.
Foundation models are trained on public data and lack organizational knowledge critical for enterprise tasks. The example given is generating a query for “active customers” when no explicit is_active column exists—the model needs to understand the organization’s definition of customer activity.
Two solution categories are presented:
Training-time solutions: Continual pre-training of open-source models on organizational documents, logs, and metadata. This approach makes models “aware of domain knowledge” by incorporating internal knowledge during the training process.
Inference-time solutions: Augmenting the foundation model with external memory that is queried for relevant organizational context at inference time.
This is essentially a retrieval-augmented generation (RAG) approach, where relevant context is retrieved and provided to the model at inference time to supplement its base knowledge.
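A minimal sketch of this inference-time pattern: a toy keyword-overlap retriever (standing in for a real vector store) selects the relevant organizational definition and prepends it to the prompt. All names and documents here are illustrative:

```python
import re

def retrieve(question: str, docs: list[str], k: int = 1) -> list[str]:
    """Toy keyword-overlap retriever standing in for a vector store:
    rank documents by how many tokens they share with the question."""
    tokens = lambda s: set(re.findall(r"\w+", s.lower()))
    q = tokens(question)
    return sorted(docs, key=lambda d: len(q & tokens(d)), reverse=True)[:k]


def rag_prompt(question: str, docs: list[str]) -> str:
    """Prepend the retrieved organizational context to the question."""
    context = "\n".join(retrieve(question, docs))
    return f"Context:\n{context}\n\nQuestion: {question}\nSQL:"


docs = [
    "An active customer has placed an order in the last 90 days.",
    "Revenue is recognized on shipped_at, not created_at.",
]
prompt = rag_prompt("How many active customers do we have?", docs)
```

With the 90-day definition retrieved into the prompt, the model no longer has to guess what "active" means for this organization, which is exactly the missing `is_active` scenario described above.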
This case study provides several important lessons for LLMOps practitioners.
The hybrid approach of combining rules with model inference is particularly noteworthy. Rather than treating foundation models as a complete replacement for traditional systems, the optimal production architecture often involves using models selectively—either to handle edge cases that rules cannot address or to generate rules automatically. This reduces both computational costs and the risk of model errors propagating through the data pipeline.
The emphasis on prompt engineering techniques like demonstration selection and multi-prompt aggregation highlights that production LLM systems require careful attention to input design, not just model selection. The brittleness of prompts means that seemingly minor formatting changes can significantly impact output quality.
The distillation approach offers a practical path from prototype to production. Large models can be used for initial development and to generate training data, while smaller distilled models handle production inference workloads. This addresses both cost and latency concerns that would otherwise make foundation model deployment impractical for high-volume data applications.
The domain knowledge integration strategies—whether through continual pre-training or RAG—are essential for enterprise deployments where generic models lack necessary business context. The choice between training-time and inference-time solutions likely depends on how dynamic the organizational knowledge is and the resources available for model customization.
It’s worth noting that while the presentation highlights significant potential, the specific quantitative results and production deployments are not detailed. The techniques discussed appear to be primarily research contributions from the collaboration with Stanford AI Lab, though Numbers Station is described as building products that incorporate this technology. Organizations considering similar approaches should validate performance claims in their specific contexts.
The work presented is done in collaboration with the Stanford AI Lab, suggesting a research-oriented approach to these production challenges. Multiple papers are referenced (though not cited by name in the transcript), indicating that the techniques have undergone academic peer review. This collaboration between industry application and academic research is a valuable model for developing robust LLMOps practices.