ZenML

RoBERTa for Large-Scale Merchant Classification

Square 2025
Square developed and deployed a RoBERTa-based merchant classification system to accurately categorize millions of merchants across their platform. The system replaced unreliable self-selection methods with an ML approach that combines business names, self-selected information, and transaction data to achieve a 30% improvement in accuracy. The solution runs daily predictions at scale using distributed GPU infrastructure and has become central to Square's business metrics and strategic decision-making.

Industry

Finance

Overview

Square, a major payment processing and business services company, developed a RoBERTa-based machine learning model to accurately categorize the types of businesses (merchants) that use their platform. This case study provides an excellent example of deploying a large language model for production classification tasks at scale, dealing with tens of millions of merchants requiring daily inference.

The core business problem was that Square’s previous approach to merchant categorization—primarily relying on merchant self-selection during onboarding—was highly inaccurate. Merchants often rushed through onboarding, faced confusion with granular category options, or lacked clear definitions of what each category meant. This led to significant business challenges including misinformed product strategy, ineffective marketing targeting, incorrect product eligibility determinations, and potentially overpaying on interchange fees to card issuers.

Technical Approach and Architecture

Square’s solution leveraged the RoBERTa (Robustly Optimized BERT Pretraining Approach) architecture, specifically the roberta-large model from Hugging Face. While the blog post refers to this as utilizing “LLMs,” it’s worth noting that RoBERTa is technically a masked language model designed for understanding tasks rather than generation, making it well-suited for classification problems. The model was fine-tuned for multi-class sequence classification using the AutoModelForSequenceClassification class from Hugging Face Transformers.
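The model setup described above can be sketched as follows. This is a minimal illustration, not Square's code: the number of categories and the sample merchant text are invented placeholders, since the post does not publish the actual taxonomy.

```python
# Sketch of the fine-tuning setup: roberta-large with a fresh
# classification head. NUM_CATEGORIES is a hypothetical label count.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

NUM_CATEGORIES = 12  # illustrative; Square's category count is not published

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-large",
    num_labels=NUM_CATEGORIES,  # attaches a randomly initialized classifier head
)

# Merchant features (business name, self-selected category, transaction
# signals) are combined into one text input for classification.
inputs = tokenizer("Joe's Coffee Roasters", truncation=True, return_tensors="pt")
```

Because the classification head starts from random weights, the model only produces meaningful category predictions after fine-tuning on the labeled merchant dataset.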

Data Preparation and Quality

A crucial differentiator in this approach was the emphasis on high-quality training data. The team assembled a dataset of over 20,000 merchants with manually reviewed ground truth labels, addressing a key weakness of previous ML approaches that had relied on inaccurate self-selected labels as training targets. This manual review process, while labor-intensive, established a reliable foundation for the model.

The data preprocessing pipeline included several thoughtful steps to turn raw merchant attributes into clean model inputs.

Model Training Infrastructure

Training was conducted on Databricks using GPU clusters equipped with NVIDIA A10G GPUs, and the team employed several memory optimization techniques essential for working with large models.

The training configuration used the AdamW optimizer with a learning rate of 1e-05, linear learning rate scheduler with 5% warmup, batch size of 16, and 4 epochs. The team used the Hugging Face Trainer class for managing the training loop, a common pattern for production ML workflows using transformer models.

Production Inference at Scale

One of the most operationally significant aspects of this case study is the daily inference pipeline that processes predictions for tens of millions of merchants. The team employed multiple techniques to make this feasible.

The inference code used Hugging Face’s pipeline abstraction for text classification, making it straightforward to apply the trained model to new data: classifier = pipeline('text-classification', model=model_path, truncation=True, padding=True, device=device, batch_size=batch_size).
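Put together, the daily scoring step looks roughly like the following. Since Square's fine-tuned checkpoint is not public, a small off-the-shelf sentiment model stands in for model_path here, and the merchant strings and batch size are invented for illustration.

```python
# Batched inference via the text-classification pipeline. The public
# checkpoint below is a stand-in for the fine-tuned merchant classifier.
import torch
from transformers import pipeline

model_path = "distilbert-base-uncased-finetuned-sst-2-english"  # stand-in
device = 0 if torch.cuda.is_available() else -1  # GPU if available, else CPU

classifier = pipeline(
    "text-classification",
    model=model_path,
    truncation=True,
    padding=True,
    device=device,
    batch_size=32,  # tuned to GPU memory in a real deployment
)

merchants = ["Joe's Coffee Roasters", "Main Street Auto Repair"]
predictions = classifier(merchants)  # one {'label', 'score'} dict per merchant
```

The batch_size argument is what makes the abstraction viable at this scale: the pipeline groups inputs into GPU-sized batches internally instead of scoring one merchant at a time.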

Output and Monitoring

The system produces two output tables serving different operational needs.

Results and Business Impact

The model achieved approximately 30% absolute improvement in accuracy compared to existing methods. Performance varied by category, with some showing particularly strong gains: retail (38% improvement), home and repair (35%), and beauty and personal care (32%). Even categories with smaller improvements like food and drink (13%) still represented meaningful accuracy gains.

The model outputs now power Square’s business metrics that require merchant category segmentation, influencing product strategy, go-to-market targeting, email campaigns, and other applications. The team indicates future work will explore using the model to reduce interchange fees associated with payment processing, an area with direct financial impact.

Critical Assessment

While this case study demonstrates a well-executed production ML system, a few considerations are worth noting.

Overall, this case study provides valuable insights into deploying transformer-based classification models at scale, with practical solutions for data quality, memory optimization, distributed inference, and incremental processing that are broadly applicable to production LLM/ML systems.
