
Domain Adaptation of LLMs for Enterprise Use Through Multi-Task Fine-Tuning

Wix 2024

Wix developed a customized LLM for their enterprise needs by applying multi-task supervised fine-tuning (SFT) and domain-adaptive pre-training (DAPT) with full-weight fine-tuning. Despite having limited data and tokens, their smaller customized model outperformed GPT-3.5 on various Wix-specific tasks. The project focused on three key components: comprehensive evaluation benchmarks, extensive data collection methods, and advanced modeling processes to achieve full domain adaptation capabilities.

Industry

Tech


Overview

Wix, the cloud-based web development platform serving over 200 million users worldwide, embarked on a comprehensive journey to customize Large Language Models for their enterprise-specific use cases. The primary motivation was to achieve what they term “full domain adaptation” — the ability for a single model to tackle multiple domain-specific tasks simultaneously, rather than being limited to single-task optimization. Their target use cases included customer intent classification, sentiment detection, customer segmentation, domain-specific summarization, and question answering about the Wix platform.

The project emerged from fundamental limitations they experienced with more common LLM customization approaches. While prompt engineering and Retrieval Augmented Generation (RAG) are widely adopted and relatively easy to implement, Wix identified several inherent problems:

- lack of multitasking capability (only one domain task handled at a time)
- training data that isn't domain-aware
- excessive model sizes that reduce accuracy while increasing cost and latency
- prompt complexity leading to higher token counts and potential overfitting
- vendor-provided prompt fine-tuning services that often simply overfit to specific prompts rather than achieving genuine cross-domain capabilities

Technical Approach

Evaluation-First Methodology

Wix adopted a principle that every data science project should start with evaluation. They emphasized that understanding model goals drives better decisions in model building and dataset preparation. While open LLM benchmarks estimate general-purpose capabilities, custom models require custom benchmarks to assess domain knowledge and task performance.

For knowledge estimation, they built a custom Question and Answer (Q&A) dataset using existing customer service live chats and FAQs. Since answers are free text rather than categorical labels, they implemented an LLM-as-a-judge approach. This involves a prompt that compares LLM-suggested answers against ground-truth answers. After evaluating several open-source LLMs as judges, they built their own judging prompt which outperformed open solutions due to their team’s superior domain expertise. They stress that having a solid, reliable metric is essential — otherwise, optimization efforts are essentially “shooting in the air.”
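The LLM-as-a-judge setup described above can be sketched as a small harness: build a grading prompt, parse the judge's verdict, and aggregate verdicts into a benchmark score. The prompt wording and rubric here are illustrative assumptions, not Wix's actual judging prompt.

```python
def build_judge_prompt(question: str, reference: str, candidate: str) -> str:
    """Build a judging prompt that asks an LLM to grade a free-text answer
    against the ground-truth answer (wording and rubric are illustrative)."""
    return (
        "You are a strict domain expert grading customer-support answers.\n"
        f"Question: {question}\n"
        f"Ground-truth answer: {reference}\n"
        f"Candidate answer: {candidate}\n"
        "Does the candidate answer convey the same facts as the ground truth?\n"
        "Reply with exactly one word: CORRECT or INCORRECT."
    )

def parse_verdict(judge_output: str) -> bool:
    """Map the judge's raw completion to a boolean correctness label."""
    return judge_output.strip().upper().startswith("CORRECT")

def accuracy(verdicts: list[bool]) -> float:
    """Aggregate per-question verdicts into a single benchmark score."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0
```

In practice `build_judge_prompt` would be sent to a judge LLM for each benchmark question, and `accuracy` over the parsed verdicts becomes the "solid, reliable metric" the team insists on before any optimization.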

For task capability estimation, they used domain-specific text-based learning tasks including customer intent classification, customer segmentation, custom domain summarization, and sentiment analysis. They also incorporated a technique from Microsoft that transforms Q&A from free-text to multiple choice format: using the correct answer to generate three alternatives (one similar but slightly wrong, two completely wrong). This allows simultaneous assessment of domain knowledge and instruction-following ability.
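The free-text-to-multiple-choice transformation can be sketched as follows. In practice the three distractors (one near-miss, two clearly wrong) would be generated by an LLM from the correct answer; here they are passed in as inputs, and the formatting is an assumption rather than the exact scheme used.

```python
import random

def to_multiple_choice(question, correct, near_miss, wrong1, wrong2, seed=0):
    """Turn a free-text Q&A pair into a 4-way multiple-choice item.
    Distractor generation is assumed to happen upstream (e.g. via an LLM);
    this function only shuffles options and records the answer key."""
    options = [correct, near_miss, wrong1, wrong2]
    rng = random.Random(seed)          # seeded shuffle for reproducible benchmarks
    rng.shuffle(options)
    letters = "ABCD"
    answer_key = letters[options.index(correct)]
    body = "\n".join(f"{letters[i]}. {opt}" for i, opt in enumerate(options))
    return f"{question}\n{body}\nAnswer with a single letter.", answer_key
```

Scoring then reduces to exact-match on a single letter, which tests domain knowledge and instruction following at once, exactly the dual purpose the technique is used for.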

A critical methodological note: for task evaluation, they fixed one simple prompt per task that hadn’t been optimized for specific LLM model families, preventing evaluation bias toward particular architectures.

Training Data Strategy

The training data component proved central to their approach. Wix acknowledges the challenge that LLMs can learn everything — including typos, inappropriate language, and confidential information — making data quality paramount alongside quantity.

Industry best practice suggests billions of tokens for full LLM fine-tuning, but they recognized that pre-trained LLMs already contain relevant domain knowledge. For instance, all common LLMs already have some awareness of website building and Wix products. The goal of domain-specific fine-tuning is to increase domain knowledge ratio from essentially negligible (0.00000001% in pre-trained models) to approximately 2% of training data.

Their data strategy included completion-based training data (raw free text such as articles, technical documentation, real-world dialogues) and instructive training data using labeled datasets from other Wix NLP projects like sentiment analysis and customer intent classification. Using internal unprocessed data posed challenges due to potential mistakes and confidential information.

To address limited manually-created data, they implemented synthetic data generation using organizational sources: knowledge base articles, customer support chats, technical documentation, and internal reports. They generated Q&As and reading comprehension tasks synthetically, though specific methodologies for this generation process aren’t detailed in the source material.
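Since the source does not detail the generation process, the following is one common pattern for producing synthetic Q&A from knowledge-base articles: a prompt template that constrains answers to the article's content. The template wording is a hypothetical illustration, not Wix's pipeline.

```python
def qa_generation_prompt(article_text: str, n_questions: int = 3) -> str:
    """Prompt template for turning a knowledge-base article into synthetic
    Q&A pairs. Constraining answers to the article reduces the risk of the
    generator inventing facts that then get trained into the model."""
    return (
        f"Read the following help-center article and write {n_questions} "
        "question/answer pairs a customer might ask, answerable only from "
        "the article.\n\n"
        f"ARTICLE:\n{article_text}\n\n"
        "Format each pair as:\nQ: <question>\nA: <answer>"
    )
```

The same template shape works for reading-comprehension tasks by asking for a passage-grounded question plus the span that answers it.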

A crucial hyperparameter mentioned is sampling between data sources. Additionally, to maintain performance on common knowledge while specializing on domain knowledge, models must continue training on public data alongside domain-specific content.
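The source-sampling hyperparameter can be sketched as a batch mixer that draws domain examples at a fixed ratio while keeping public data in the stream. The ratio, batch size, and generator shape here are illustrative assumptions.

```python
import random

def mix_batches(domain_data, public_data, domain_ratio=0.02, batch_size=8, seed=0):
    """Yield training batches mixing domain and public examples.
    domain_ratio is the crucial hyperparameter: the expected fraction of
    domain-specific examples per batch. Public data stays in the mix so the
    model keeps its general-purpose capabilities while specializing."""
    rng = random.Random(seed)
    while True:
        yield [
            rng.choice(domain_data) if rng.random() < domain_ratio
            else rng.choice(public_data)
            for _ in range(batch_size)
        ]
```

With `domain_ratio=0.02` this realizes the roughly 2% domain-token share mentioned above; sweeping that value is part of hyperparameter tuning.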

Modeling Decisions

Wix explicitly avoided reinventing the wheel, leveraging existing optimized training recipes from established LLM families. Their main selection criterion for base models was prior familiarity with Wix and website-building concepts: choosing models that already demonstrated this knowledge eliminated the need for training from scratch. Among the key hyperparameters they identified is the sampling ratio between training data sources.

Infrastructure

Training was conducted on AWS P5 instances with high-power GPUs, enabling full-weight fine-tuning and high-rank LoRA experimentation on LLaMA 2 7B. Constraining experiments to a single powerful machine suggests a pragmatic approach to resource management while retaining capacity for substantial experiments.
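A back-of-the-envelope calculation shows why high-rank LoRA on a 7B model still demands serious hardware. The count below assumes adapters on the four square attention projections (q, k, v, o) only, which is a simplification of real configurations.

```python
def lora_trainable_params(d_model, n_layers, rank, matrices_per_layer=4):
    """Rough count of trainable LoRA parameters: each adapted (d x d) weight
    matrix gets two low-rank factors A (r x d) and B (d x r), i.e. 2*r*d
    parameters. Assumes square attention projections only (q, k, v, o)."""
    return n_layers * matrices_per_layer * 2 * rank * d_model

# LLaMA 2 7B-ish dimensions: d_model=4096, 32 layers
low = lora_trainable_params(4096, 32, rank=8)     # 8,388,608 (~8.4M) params
high = lora_trainable_params(4096, 32, rank=256)  # 268,435,456 (~268M) params
```

Trainable parameters scale linearly with rank, so pushing rank high enough to absorb real domain knowledge moves LoRA's memory footprint toward full fine-tuning territory, consistent with the choice of P5-class hardware.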

Results and Outcomes

The customized Wix LLM demonstrated better performance than GPT-3.5 on various Wix-specific tasks. This is a notable achievement given GPT-3.5’s general capabilities, suggesting that domain adaptation provides genuine value for specialized enterprise use cases. The smaller model also addresses the cost and latency concerns that motivated the project.

The team identified that knowledge-based tasks, particularly Q&A with or without context, truly require customization. This insight emerged because their most common benchmark was Q&A, and off-the-shelf LLMs and RAG solutions weren’t meeting performance requirements.

Critical Assessment

While Wix presents a compelling case for full domain adaptation, several aspects warrant balanced consideration: the headline comparison is against GPT-3.5 rather than more recent frontier models, and specifics of the synthetic data generation process and training recipes are not disclosed in the source material.

Broader LLMOps Implications

The Wix case study illustrates several mature LLMOps practices: an evaluation-first methodology with custom benchmarks, deliberate training data curation with synthetic data generation, and pragmatic reuse of proven training recipes rather than building from scratch.

The project demonstrates that achieving production-ready domain-adapted LLMs requires coordinated effort across evaluation design, data engineering, model training, and infrastructure — a genuinely cross-functional LLMOps undertaking requiring collaboration between AI researchers, engineers, and domain experts (AI curators in their terminology).
