ZenML

Fine-tuning LLMs for Toxic Speech Classification in Gaming

Large Gaming Company 2023

AWS Professional Services helped a major gaming company build an automated toxic speech detection system by fine-tuning Large Language Models. Starting with only 100 labeled samples, they experimented with different BERT-based models and data augmentation techniques, ultimately moving from a two-stage to a single-stage classification approach. The final solution achieved 88% precision and 83% recall while reducing operational complexity and costs compared to the initial proof of concept.

Industry

Media & Entertainment

Overview

This case study documents AWS Professional Services’ engagement with a large gaming company to build an automated toxic speech detection system for online player communications. The video gaming industry serves over 3 billion users worldwide, and maintaining a socially responsible gaming environment requires effective moderation of player interactions. The customer’s goal was to replace manual moderation processes with an automated system that could classify voice and text excerpts into custom-defined toxic language categories, improving both speed and quality of detection.

The project was executed as a joint effort between two AWS teams: the Generative AI Innovation Center (GAIIC) for proof of concept development, and the ProServe ML Delivery Team (MLDT) for productionization. This two-team handoff model represents an interesting organizational approach to LLMOps, where specialized research teams develop and validate solutions before handing them to production-focused teams.

The Data Challenge and Transfer Learning Approach

One of the central LLMOps challenges in this case study was the severe scarcity of labeled training data. The customer initially provided only approximately 100 labeled samples—far below the commonly recommended minimum of 1,000 samples for fine-tuning LLMs. Training a custom language model from scratch was not viable due to cost and time constraints.

The solution leveraged transfer learning, specifically fine-tuning pre-trained foundation models. The key insight was to find models pre-trained on data with similar characteristics to gaming chat: short-form, casual text from diverse user populations. Twitter data proved to be an excellent proxy, as tweets share similar length and vocabulary diversity characteristics with gaming chat messages.

The team selected BERTweet-based models from the Hugging Face Model Hub. BERTweet uses the RoBERTa pre-training procedure, which improves upon standard BERT training through several modifications: larger batch sizes, removal of the next sentence prediction objective, training on longer sequences, and dynamic masking patterns. The base BERTweet model was pre-trained on 850 million English tweets, making it the first large-scale language model specifically pre-trained for English tweets.

Model Selection and Experimentation

Three models were evaluated during the proof of concept phase:

- bertweet-base: the general-purpose BERTweet model, pre-trained on English tweets
- bertweet-base-offensive: BERTweet further pre-trained on an offensive speech detection task
- bertweet-base-hate: BERTweet further pre-trained on a hate speech detection task

The progressive pre-training approach—starting with a general language model, then domain-specific pre-training, then task-specific fine-tuning—represents a best practice in transfer learning for NLP tasks. Models pre-trained on toxic language detection tasks provided a better starting point than general-purpose language models.

Two-Stage vs. Single-Stage Architecture

The proof of concept employed a two-stage prediction architecture: a binary classifier first determined whether text was toxic or non-toxic, and only toxic content proceeded to a second fine-grained classifier that categorized the type of toxicity according to customer-defined categories. This cascading approach achieved strong results: 92% precision, 90% recall, and 91% F1 for the binary classifier, with 81% precision, 80% recall, and 81% F1 for the fine-grained classifier.

However, the production team identified significant operational challenges with the two-stage approach:

- Two separate models must be trained, deployed, monitored, and retrained, roughly doubling the operational footprint
- Toxic content requires two sequential inference calls, increasing latency and cost
- Any toxic excerpt the binary classifier misses never reaches the fine-grained classifier, so first-stage errors propagate through the cascade

These are critical LLMOps considerations that often don’t surface during proof of concept phases but become significant in production environments. The team opted to consolidate into a single-stage multi-class classifier that includes non-toxic as one of the classification labels.
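The consolidated design can be illustrated with a toy decision function: a single forward pass produces scores over all categories, and non-toxic is treated as just another label rather than a separate binary pre-filter. The label names below are hypothetical stand-ins, since the customer's actual categories are not disclosed in the case study:

```python
import math

# Hypothetical label set; the customer's real toxicity categories
# are not disclosed, so these names are illustrative only.
LABELS = ["non-toxic", "insult", "threat", "profanity"]

def softmax(logits):
    """Numerically stable softmax over a list of raw model scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(logits):
    """Single-stage decision: one pass over the class scores picks the
    winning label directly, with 'non-toxic' competing like any other."""
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[idx], probs[idx]
```

With this shape, one model call replaces the binary-then-fine-grained cascade, which is the operational simplification the production team was after.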

Data Augmentation Strategy

To address the limited labeled data, the team employed data augmentation by incorporating third-party labeled data from the Jigsaw Toxicity Kaggle competition. They mapped the Jigsaw labels to customer-defined toxicity categories and combined this with the original customer data. This approach of leveraging publicly available labeled datasets to supplement limited proprietary data is a practical strategy when training data is scarce.
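The label-mapping step can be sketched as a small lookup. The six Jigsaw Toxic Comment Classification labels below are the real competition labels; the target categories and the priority ordering are hypothetical, since the case study does not disclose the actual mapping:

```python
# Jigsaw source labels (real) mapped to illustrative target categories
# (hypothetical), in priority order so that multi-labeled rows resolve
# to their most severe category first.
JIGSAW_PRIORITY = [
    ("threat", "threats"),
    ("identity_hate", "hate-speech"),
    ("insult", "insults"),
    ("obscene", "profanity"),
    ("severe_toxic", "other-toxic"),
    ("toxic", "other-toxic"),
]

def map_jigsaw_row(row):
    """Map one Jigsaw example (a dict of 0/1 flags) to a single target
    category, taking the highest-priority toxic flag that is set."""
    for jigsaw_label, target in JIGSAW_PRIORITY:
        if row.get(jigsaw_label, 0) == 1:
            return target
    return "non-toxic"
```

A priority order is needed because Jigsaw rows are multi-label (a comment can be both `toxic` and `threat`) while the single-stage classifier expects exactly one class per example.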

For the production model, the customer provided an additional 5,000 labeled samples (3,000 non-toxic, 2,000 toxic), bringing the total training dataset to approximately 10,000 samples with the PoC data included. This demonstrates the iterative nature of ML projects—the initial data constraint was addressed through a combination of transfer learning, data augmentation, and customer data collection efforts.

Fine-Tuning Implementation

The implementation used the Hugging Face Transformers library, which provides a unified API for working with different transformer architectures. The code demonstrates the standard fine-tuning workflow:

The pre-trained model is loaded with a modified classification head using AutoModelForSequenceClassification.from_pretrained(), where the num_labels parameter specifies the number of output classes. This automatically replaces the pre-trained model’s classification head with a new randomly initialized head sized appropriately for the task.

Key training hyperparameters that were tuned include the learning rate, batch size, and number of training epochs.

The training used an 80/20 train/test split for validation, with model checkpointing based on evaluation metrics at each epoch. The load_best_model_at_end=True parameter ensures the best-performing checkpoint is retained rather than simply the final epoch.
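Selecting the best checkpoint via load_best_model_at_end requires an evaluation metric computed at each epoch. A minimal sketch of the macro-averaged precision/recall/F1 calculation that such a metrics callback could perform (plain Python; the engagement's actual metric code is not shown in the case study):

```python
def macro_precision_recall_f1(y_true, y_pred, labels):
    """Per-class precision, recall, and F1, averaged equally over all
    labels (macro averaging), from parallel lists of true and predicted
    class ids."""
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n
```

Macro averaging weights each toxicity category equally regardless of frequency, which matters here because non-toxic examples heavily outnumber the individual toxic categories.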

Production Model Performance

The single-stage bertweet-base-offensive model achieved 88% precision, 83% recall, 86% F1, and 89% AUC. While these metrics represent a slight decrease from the two-stage approach (91% precision, 90% recall, 90% F1, 92% AUC), the customer accepted this trade-off in favor of the operational benefits.

Interestingly, the results also show the impact of pre-training domain relevance: the offensive-speech pre-trained model outperformed both the general BERTweet model and the hate-speech pre-trained model, suggesting that offensive language detection aligns more closely with the gaming toxicity classification task than hate speech detection does.

Infrastructure and Tooling

The development utilized Amazon SageMaker notebooks for experimentation and model training. The Hugging Face Transformers library provided the model loading, tokenization, and training infrastructure. The productionization (detailed in a separate Part 2 not included in this text) was built on SageMaker for scalable deployment.

This case study highlights several important LLMOps lessons: the value of transfer learning for low-data scenarios, the operational trade-offs between model accuracy and system complexity, the importance of selecting pre-trained models with domain-relevant training data, and the need for separate consideration of proof of concept and production requirements. The handoff model between research and production teams, while adding coordination overhead, allowed each team to focus on their respective strengths.
