ZenML

Fine-tuning LLMs for Toxic Speech Classification in Gaming

Large Gaming Company 2023

AWS Professional Services helped a major gaming company build an automated toxic speech detection system by fine-tuning Large Language Models. Starting with only 100 labeled samples, they experimented with different BERT-based models and data augmentation techniques, ultimately moving from a two-stage to a single-stage classification approach. The final solution achieved 88% precision and 83% recall while reducing operational complexity and costs compared to the initial proof of concept.

Industry

Media & Entertainment

Overview

This case study documents AWS Professional Services’ engagement with a large gaming company to build an automated toxic speech detection system for online player communications. The video gaming industry serves over 3 billion users worldwide, and maintaining a socially responsible gaming environment requires effective moderation of player interactions. The customer’s goal was to replace manual moderation processes with an automated system that could classify voice and text excerpts into custom-defined toxic language categories, improving both speed and quality of detection.

The project was executed as a joint effort between two AWS teams: the Generative AI Innovation Center (GAIIC) for proof of concept development, and the ProServe ML Delivery Team (MLDT) for productionization. This two-team handoff model represents an interesting organizational approach to LLMOps, where specialized research teams develop and validate solutions before handing them to production-focused teams.

The Data Challenge and Transfer Learning Approach

One of the central LLMOps challenges in this case study was the severe scarcity of labeled training data. The customer initially provided only approximately 100 labeled samples—far below the commonly recommended minimum of 1,000 samples for fine-tuning LLMs. Training a custom language model from scratch was not viable due to cost and time constraints.

The solution leveraged transfer learning, specifically fine-tuning pre-trained foundation models. The key insight was to find models pre-trained on data with similar characteristics to gaming chat: short-form, casual text from diverse user populations. Twitter data proved to be an excellent proxy, as tweets share similar length and vocabulary diversity characteristics with gaming chat messages.

The team selected BERTweet-based models from the Hugging Face Model Hub. BERTweet uses the RoBERTa pre-training procedure, which improves upon standard BERT training through several modifications: larger batch sizes, removal of the next sentence prediction objective, training on longer sequences, and dynamic masking patterns. The base BERTweet model was pre-trained on 850 million English tweets, making it the first large-scale language model specifically pre-trained for English tweets.

Model Selection and Experimentation

Three models were evaluated during the proof of concept phase:

- bertweet-base: the general-purpose BERTweet model, pre-trained on English tweets
- bertweet-base-offensive: BERTweet further pre-trained on an offensive speech detection task
- bertweet-base-hate: BERTweet further pre-trained on a hate speech detection task

The progressive pre-training approach—starting with a general language model, then domain-specific pre-training, then task-specific fine-tuning—represents a best practice in transfer learning for NLP tasks. Models pre-trained on toxic language detection tasks provided a better starting point than general-purpose language models.

Two-Stage vs. Single-Stage Architecture

The proof of concept employed a two-stage prediction architecture: a binary classifier first determined whether text was toxic or non-toxic, and only toxic content proceeded to a second fine-grained classifier that categorized the type of toxicity according to customer-defined categories. This cascading approach achieved strong results: 92% precision, 90% recall, and 91% F1 for the binary classifier, with 81% precision, 80% recall, and 81% F1 for the fine-grained classifier.

However, the production team identified significant operational challenges with the two-stage approach:

- Two separate models must be trained, deployed, monitored, and retrained, roughly doubling the operational footprint
- Toxic content requires two sequential inference calls, increasing latency and cost
- Any toxic excerpt the binary classifier misses never reaches the fine-grained classifier, so first-stage errors propagate through the cascade

These are critical LLMOps considerations that often don’t surface during proof of concept phases but become significant in production environments. The team opted to consolidate into a single-stage multi-class classifier that includes non-toxic as one of the classification labels.
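The consolidated design can be illustrated with a toy decision function: a single forward pass produces scores over all categories, and non-toxic is treated as just another label rather than a separate binary pre-filter. The label names below are hypothetical stand-ins, since the customer's actual categories are not disclosed in the case study:

```python
import math

# Hypothetical label set; the customer's real toxicity categories
# are not disclosed, so these names are illustrative only.
LABELS = ["non-toxic", "insult", "threat", "profanity"]

def softmax(logits):
    """Numerically stable softmax over a list of raw model scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(logits):
    """Single-stage decision: one pass over the class scores picks the
    winning label directly, with 'non-toxic' competing like any other."""
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[idx], probs[idx]
```

With this shape, one model call replaces the binary-then-fine-grained cascade, which is the operational simplification the production team was after.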

Data Augmentation Strategy

To address the limited labeled data, the team employed data augmentation by incorporating third-party labeled data from the Jigsaw Toxicity Kaggle competition. They mapped the Jigsaw labels to customer-defined toxicity categories and combined this with the original customer data. This approach of leveraging publicly available labeled datasets to supplement limited proprietary data is a practical strategy when training data is scarce.
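The label-mapping step can be sketched as a small lookup. The six Jigsaw Toxic Comment Classification labels below are the real competition labels; the target categories and the priority ordering are hypothetical, since the case study does not disclose the actual mapping:

```python
# Jigsaw source labels (real) mapped to illustrative target categories
# (hypothetical), in priority order so that multi-labeled rows resolve
# to their most severe category first.
JIGSAW_PRIORITY = [
    ("threat", "threats"),
    ("identity_hate", "hate-speech"),
    ("insult", "insults"),
    ("obscene", "profanity"),
    ("severe_toxic", "other-toxic"),
    ("toxic", "other-toxic"),
]

def map_jigsaw_row(row):
    """Map one Jigsaw example (a dict of 0/1 flags) to a single target
    category, taking the highest-priority toxic flag that is set."""
    for jigsaw_label, target in JIGSAW_PRIORITY:
        if row.get(jigsaw_label, 0) == 1:
            return target
    return "non-toxic"
```

A priority order is needed because Jigsaw rows are multi-label (a comment can be both `toxic` and `threat`) while the single-stage classifier expects exactly one class per example.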

For the production model, the customer provided an additional 5,000 labeled samples (3,000 non-toxic, 2,000 toxic), bringing the total training dataset to approximately 10,000 samples with the PoC data included. This demonstrates the iterative nature of ML projects—the initial data constraint was addressed through a combination of transfer learning, data augmentation, and customer data collection efforts.

Fine-Tuning Implementation

The implementation used the Hugging Face Transformers library, which provides a unified API for working with different transformer architectures. The code demonstrates the standard fine-tuning workflow:

The pre-trained model is loaded with a modified classification head using AutoModelForSequenceClassification.from_pretrained(), where the num_labels parameter specifies the number of output classes. This automatically replaces the pre-trained model’s classification head with a new randomly initialized head sized appropriately for the task.

Key training hyperparameters that were tuned include the learning rate, batch size, and number of training epochs.

The training used an 80/20 train/test split for validation, with model checkpointing based on evaluation metrics at each epoch. The load_best_model_at_end=True parameter ensures the best-performing checkpoint is retained rather than simply the final epoch.
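Selecting the best checkpoint via load_best_model_at_end requires an evaluation metric computed at each epoch. A minimal sketch of the macro-averaged precision/recall/F1 calculation that such a metrics callback could perform (plain Python; the engagement's actual metric code is not shown in the case study):

```python
def macro_precision_recall_f1(y_true, y_pred, labels):
    """Per-class precision, recall, and F1, averaged equally over all
    labels (macro averaging), from parallel lists of true and predicted
    class ids."""
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n
```

Macro averaging weights each toxicity category equally regardless of frequency, which matters here because non-toxic examples heavily outnumber the individual toxic categories.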

Production Model Performance

The single-stage bertweet-base-offensive model achieved 88% precision, 83% recall, 86% F1, and 89% AUC. While these metrics represent a slight decrease from the two-stage approach (91% precision, 90% recall, 90% F1, 92% AUC), the customer accepted this trade-off in favor of the operational benefits.

Interestingly, the results also show the impact of pre-training domain relevance: the offensive-speech pre-trained model outperformed both the general BERTweet model and the hate-speech pre-trained model, suggesting that offensive language detection aligns more closely with the gaming toxicity classification task than hate speech detection does.

Infrastructure and Tooling

The development utilized Amazon SageMaker notebooks for experimentation and model training. The Hugging Face Transformers library provided the model loading, tokenization, and training infrastructure. The productionization (detailed in a separate Part 2 not included in this text) was built on SageMaker for scalable deployment.

This case study highlights several important LLMOps lessons: the value of transfer learning for low-data scenarios, the operational trade-offs between model accuracy and system complexity, the importance of selecting pre-trained models with domain-relevant training data, and the need for separate consideration of proof of concept and production requirements. The handoff model between research and production teams, while adding coordination overhead, allowed each team to focus on their respective strengths.
