ZenML

LLMOps Tag: caption_generation

35 tools with this tag

Accelerating Game Asset Creation with Fine-Tuned Diffusion Models

Rovio

Rovio, the Finnish gaming company behind Angry Birds, faced challenges in meeting the high demand for game art assets across multiple games and seasonal events, with artists spending significant time on repetitive tasks. The company developed "Beacon Picasso," a suite of generative AI tools powered by fine-tuned diffusion models running on AWS infrastructure (SageMaker, Bedrock, EC2 with GPUs). By training custom models on proprietary Angry Birds art data and building multiple user interfaces tailored to different user needs—from a simple Slackbot to advanced cloud-based workflows—Rovio achieved an 80% reduction in production time for specific use cases like season pass backgrounds, while maintaining brand quality standards and keeping artists in creative control. The solution enabled artists to focus on high-value creative work while AI handled repetitive variations, ultimately doubling content production capacity.

Agentic AI Copilot for Insurance Underwriting with Multi-Tool Integration

Snorkel

Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models, which hallucinated non-existent insurance products 15-45% of the time.

AI-Driven Media Analysis and Content Assembly Platform for Large-Scale Video Archives

Bloomberg Media

Bloomberg Media, facing challenges in analyzing and leveraging 13 petabytes of video content growing at 3,000 hours per day, developed a comprehensive AI-driven platform to analyze, search, and automatically create content from their massive media archive. The solution combines multiple analysis approaches including task-specific models, vision language models (VLMs), and multimodal embeddings, unified through a federated search architecture and knowledge graphs. The platform enables automated content assembly using AI agents to create platform-specific cuts from long-form interviews and documentaries, dramatically reducing time to market while maintaining editorial trust and accuracy. This "disposable AI strategy" emphasizes modularity, versioning, and the ability to swap models and embeddings without re-engineering entire workflows, allowing Bloomberg to adapt quickly to evolving AI capabilities while expanding reach across multiple distribution platforms.

AI-Powered Audio Enhancement for TV and Movie Dialogue Clarity

Amazon

Amazon developed Dialogue Boost, an AI-powered audio processing technology that enhances dialogue clarity in TV shows, movies, and podcasts by suppressing background music and sound effects. The system uses deep neural networks for sound source separation and runs directly on-device (Echo smart speakers and Fire TV devices) thanks to breakthroughs in model compression and knowledge distillation. Originally launched on Prime Video in 2022 using cloud-based processing, the technology was compressed to less than 1% of its original size while maintaining nearly identical performance, enabling real-time processing across multiple streaming platforms including Netflix, YouTube, and Disney+. Research shows over 86% of participants preferred Dialogue-Boost-enhanced audio, with 100% approval among users with hearing loss, significantly reducing listening effort and improving accessibility for millions of viewers globally.

AI-Powered Fan Engagement and Content Personalization for Global Football Audiences

DFL / Bundesliga

DFL / Bundesliga, the organization behind Germany's premier football league, partnered with AWS to enhance fan engagement for their 1 billion global fans through AI and generative AI solutions. The primary challenges included personalizing content at scale across diverse geographies and languages, automating manual content creation processes, and making decades of archival footage searchable and accessible. The solutions implemented included an AI-powered live ticker providing real-time commentary in multiple languages and styles within 7 seconds of events, an intelligent metadata generation (IGM) system to analyze 9+ petabytes of historical footage using multimodal AI, automated content localization for speech-to-speech and speech-to-text translation, AI-generated "Stories" format content from existing articles, and personalized app experiences. Results demonstrated significant impact: 20% increase in overall app usage, 67% increase in articles read through personalization, 75% reduction in processing time for localized content with 5x content output, 2x increase in app dwell time from AI-generated stories, and 67% story retention rate indicating strong user engagement.

AI-Powered Marketing Content Generation and Compliance Platform at Scale

Volkswagen

Volkswagen Group Services partnered with AWS to build a production-scale generative AI platform for automotive marketing content generation and compliance evaluation. The problem was a slow, manual content supply chain that took weeks to months, created confidentiality risks with pre-production vehicles, and faced massive compliance bottlenecks across 10 brands and 200+ countries. The solution involved fine-tuning diffusion models on proprietary vehicle imagery (including digital twins from CAD), automated prompt enhancement using LLMs, and multi-stage image evaluation using vision-language models for both component-level accuracy and brand guideline compliance. Results included massive time savings (weeks to minutes), automated compliance checks across legal and brand requirements, and a reusable shared platform supporting multiple use cases across the organization.

AI-Powered Personalized Year-in-Review Campaign at Scale

Canva

Canva launched DesignDNA, a year-in-review campaign in December 2024 to celebrate their community's design achievements. The campaign needed to create personalized, shareable experiences for millions of users while respecting privacy constraints. Canva leveraged generative AI to match users to design trends using keyword analysis, generate design personalities, and create over a million unique personalized poems across 9 locales. The solution combined template metadata analysis, prompt engineering, content generation at scale, and automated review processes to produce 95 million unique DesignDNA stories. Each story included personalized statistics, AI-generated poems, design personality profiles, and predicted emerging design trends, all dynamically assembled using URL parameters and tagged template elements.

AI-Powered Video Workflow Orchestration Platform for Broadcasting

Cires21

Cires21, a Spanish live streaming services company, developed MediaCoPilot to address the fragmented ecosystem of applications used by broadcasters, which resulted in slow content delivery, high costs, and duplicated work. The solution is a unified serverless platform on AWS that integrates custom AI models for video and audio processing (ASR, diarization, scene detection) with Amazon Bedrock for generating complex metadata like subtitles, highlights, and summaries. The platform uses AWS Step Functions for orchestration, exposes capabilities via API for integration into client workflows, and recently added AI agents powered by AWS Agent Core that can handle complex multi-step tasks like finding viral moments, creating social media clips, and auto-generating captions. The architecture delivers faster time-to-market, improved scalability, and automated content workflows for broadcast clients.

Building a Multi-Model AI Platform and Agent Marketplace

Quora

Quora built Poe as a unified platform providing consumer access to multiple large language models and AI agents through a single interface and subscription. Starting with experiments using GPT-3 for answer generation on Quora, the company recognized the paradigm shift toward chat-based AI interactions and developed Poe to serve as a "web browser for AI" - enabling users to access diverse models, create custom agents through prompting or server integrations, and monetize AI applications. The platform has achieved significant scale with creators earning millions annually while supporting various modalities including text, image, and voice models.

Building an AI API Gateway for Streamlined GenAI Service Development

DeliveryHero

DeliveryHero's Woowa Brothers division developed an AI API Gateway to address the challenges of managing multiple GenAI providers and streamlining development processes. The gateway serves as a central infrastructure component to handle credential management, prompt management, and system stability while supporting various GenAI services like AWS Bedrock, Azure OpenAI, and GCP Imagen. The initiative was driven by extensive user interviews and aims to democratize AI usage across the organization while maintaining security and efficiency.

Building an AI Co-Pilot Application: Patterns and Best Practices

Thoughtworks

Thoughtworks built Boba, an experimental AI co-pilot for product strategy and ideation, to learn about building generative AI experiences beyond chat interfaces. The team implemented several key patterns including templated prompts, structured responses, real-time progress streaming, context management, and external knowledge integration. The case study provides detailed insights into practical LLMOps patterns for building production LLM applications with enhanced user experiences.

Building an AI Co-pilot for Product Strategy with LLM Integration Patterns

Thoughtworks

Thoughtworks built Boba, an experimental AI co-pilot for product strategy and ideation, to explore effective patterns for LLM-powered applications beyond simple chat interfaces. The team developed and documented key patterns including templated prompts, structured responses, real-time progress streaming, context management, and external knowledge integration. The case study provides detailed implementation insights for building sophisticated LLM applications with better user experiences.

Building Production-Grade Heterogeneous RAG Systems

AWS GenAIIC

AWS GenAIIC shares practical insights from implementing RAG systems with heterogeneous data formats in production. The case study explores using routers for managing diverse data sources, leveraging LLMs' code generation capabilities for structured data analysis, and implementing multimodal RAG solutions that combine text and image data. The solutions include modular components for intent detection, data processing, and retrieval across different data types with examples from multiple industries.

Building Robust Evaluation Systems for Auto-Generated Video Titles

Loom

Loom developed a systematic approach to evaluating and improving their AI-powered video title generation feature. They created a comprehensive evaluation framework combining code-based scorers and LLM-based judges, focusing on specific quality criteria like relevance, conciseness, and engagement. This methodical approach to LLMOps enabled them to ship AI features faster and more confidently while ensuring consistent quality in production.
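
The code-based half of such an evaluation framework can be sketched roughly as follows. This is an illustrative sketch only, not Loom's implementation: the function names, the 10-word limit, and the crude lexical-overlap check for relevance are all assumptions, and the LLM-based judges mentioned above are left out.

```python
# Hypothetical code-based scorers for generated video titles.
# Thresholds and scoring rules are illustrative assumptions.

def conciseness_score(title: str, max_words: int = 10) -> float:
    """Return 1.0 if the title is non-empty and within the word limit, else 0.0."""
    word_count = len(title.split())
    return 1.0 if 0 < word_count <= max_words else 0.0


def relevance_score(title: str, transcript: str) -> float:
    """Fraction of title words that also appear in the transcript (crude proxy)."""
    title_words = [w.lower() for w in title.split()]
    if not title_words:
        return 0.0
    transcript_words = {w.lower() for w in transcript.split()}
    hits = sum(1 for w in title_words if w in transcript_words)
    return hits / len(title_words)


def evaluate(title: str, transcript: str) -> dict:
    """Run every code-based scorer and collect the results by criterion name."""
    return {
        "conciseness": conciseness_score(title),
        "relevance": relevance_score(title, transcript),
    }


if __name__ == "__main__":
    transcript = "In this video we walk through the quarterly roadmap review for the team"
    print(evaluate("Quarterly roadmap review", transcript))
```

Deterministic scorers like these are cheap to run on every change, which is one reason to pair them with slower LLM-based judges for subjective criteria such as engagement.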

Company-Wide GenAI Transformation Through Hackathon-Driven Culture and Centralized Infrastructure

Agoda

Agoda transformed from GenAI experiments to company-wide adoption through a strategic approach that began with a 2023 hackathon, grew into a grassroots culture of exploration, and was supported by robust infrastructure including a centralized GenAI proxy and internal chat platform. Starting with over 200 developers prototyping 40+ ideas, the initiative evolved into 200+ applications serving both internal productivity (73% employee adoption, 45% of tech support tickets automated) and customer-facing features, demonstrating how systematic enablement and community-driven innovation can scale GenAI across an entire organization.

Democratizing Prompt Engineering Through Platform Architecture and Employee Empowerment

Pinterest

Pinterest developed a comprehensive LLMOps platform strategy to enable its visual discovery platform, which serves 570 million users, to rapidly adopt generative AI capabilities. The company built a multi-layered architecture with vendor-agnostic model access, centralized proxy services, and employee-facing tools, combined with innovative training approaches like "Prompt Doctors" and company-wide hackathons. Their solution included automated batch labeling systems, a centralized "Prompt Hub" for prompt development and evaluation, and an "AutoPrompter" system that uses LLMs to automatically generate and optimize prompts through iterative critique and refinement. This approach enabled non-technical employees to become effective prompt engineers, resulted in the fastest-adopted platform at Pinterest, and demonstrated that democratizing AI capabilities across all employees can lead to breakthrough innovations.

Edge AI Architecture for Wearable Smart Glasses with Real-Time Multimodal Processing

Meta / Ray Ban

Meta Reality Labs developed a production AI system for Ray-Ban Meta smart glasses that brings AI capabilities directly to wearable devices through a four-part architecture combining on-device processing, smartphone connectivity, and cloud-based AI services. The system addresses unique challenges of wearable AI including power constraints, thermal management, connectivity limitations, and real-time performance requirements while enabling features like visual question answering, photo capture, and voice commands with sub-second response times for on-device operations and under 3-second response times for cloud-based AI interactions.

Enterprise-Grade RAG System for Internal Knowledge Management

PDI

PDI Technologies, a global leader in convenience retail and petroleum wholesale, built PDIQ (PDI Intelligence Query), an AI-powered internal knowledge assistant to address the challenge of fragmented information across websites, Confluence, SharePoint, and other enterprise systems. The solution implements a custom Retrieval Augmented Generation (RAG) system on AWS using serverless technologies including Lambda, ECS, DynamoDB, S3, Aurora PostgreSQL, and Amazon Bedrock models (Nova Pro, Nova Micro, Nova Lite, and Titan Embeddings V2). The system features sophisticated document processing with image captioning, dynamic token management for chunking (70% content, 10% overlap, 20% summary), and role-based access control. PDIQ improved customer satisfaction scores, reduced resolution times, increased accuracy approval rates from 60% to 79%, and enabled cost-effective scaling through serverless architecture while supporting multiple business units with configurable data sources.
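
One way to read the 70% / 10% / 20% split is as a per-chunk token budget. The sketch below works under that assumption; the function names, the integer-percent arithmetic, and the sliding-window chunker are hypothetical and are not taken from PDIQ itself.

```python
# Hypothetical token-budget split for chunking: 70% content, 10% overlap,
# 20% summary. Names and rounding behaviour are illustrative assumptions.
from typing import List, Sequence


def split_token_budget(max_tokens: int, content_pct: int = 70, overlap_pct: int = 10) -> dict:
    """Divide a chunk's token budget; the summary receives whatever remains."""
    if content_pct + overlap_pct > 100:
        raise ValueError("content and overlap percentages exceed 100")
    content = max_tokens * content_pct // 100
    overlap = max_tokens * overlap_pct // 100
    return {"content": content, "overlap": overlap, "summary": max_tokens - content - overlap}


def chunk_with_overlap(tokens: Sequence, content: int, overlap: int) -> List[list]:
    """Cut tokens into windows of `content` new tokens, each prefixed by the
    last `overlap` tokens of the preceding window."""
    chunks = []
    for start in range(0, len(tokens), content):
        chunks.append(list(tokens[max(0, start - overlap):start + content]))
    return chunks


if __name__ == "__main__":
    print(split_token_budget(1000))
    print(chunk_with_overlap(list(range(10)), content=4, overlap=1))
```

In this reading, the summary share would be reserved for a short description attached to each chunk, so the three parts together stay within the model's token limit.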

Enterprise-Scale AI-First Translation Platform with Agentic Workflows

Smartling

Smartling operates an enterprise-scale AI-first agentic translation delivery platform serving major corporations like Disney and IBM. The company addresses challenges around automation, centralization, compliance, brand consistency, and handling diverse content types across global markets. Their solution employs multi-step agentic workflows where different model functions validate each other's outputs, combining neural machine translation with large language models, RAG for accessing validated linguistic assets, sophisticated prompting, and automated post-editing for hyper-localization. The platform demonstrates measurable improvements in throughput (from 2,000 to 6,000-7,000 words per day), cost reduction (4-10x cheaper than human translation), and quality approaching 70% human parity for certain language pairs and content types, while maintaining enterprise requirements for repeatability, compliance, and brand voice consistency.

Global News Organization's AI-Powered Content Production and Verification System

Reuters

Reuters has implemented a comprehensive AI strategy to enhance its global news operations, focusing on reducing manual work, augmenting content production, and transforming news delivery. The organization developed three key tools: a press release fact extraction system, an AI-integrated CMS called Leon, and a content packaging tool called LAMP. They've also launched the Reuters AI Suite for clients, offering transcription and translation capabilities while upholding journalistic integrity and strict ethical guidelines around AI-generated imagery.

Google Photos Magic Editor: Transitioning from On-Device ML to Cloud-Based Generative AI for Image Editing

Google

Google Photos evolved from using on-device machine learning models for basic image editing features like background blur and object removal to implementing cloud-based generative AI for their Magic Editor feature. The team transitioned from small, specialized models (10MB) running locally on devices to large-scale generative models hosted in the cloud to enable more sophisticated image editing capabilities like scene reimagination, object relocation, and advanced inpainting. This shift required significant changes in infrastructure, capacity planning, evaluation methodologies, and user experience design while maintaining focus on grounded, memory-preserving edits rather than fantastical image generation.

Integrating LLMs and Diffusion Models for Website Design Automation

Wix

Wix is leveraging AI technologies, including LLMs and diffusion models, to automate and enhance the website building experience. Their AI group has developed the AI Text Creator suite using LLMs for content generation, integrated DALL-E for image creation, and introduced the Diffusion Layout Transformer (DLT) for automated layout generation. This comprehensive approach combines content generation with layout design, addressing the challenge of creating professional websites without requiring extensive design expertise.

Large Language Models for Search Relevance at Scale

Pinterest

Pinterest's search relevance team integrated large language models into their search pipeline to improve semantic relevance prediction for over 6 billion monthly searches across 45 languages and 100+ countries. They developed a cross-encoder teacher model using fine-tuned open-source LLMs that achieved 12-20% performance improvements over existing models, then used knowledge distillation to create a production-ready bi-encoder student model that could scale efficiently. The solution incorporated visual language model captions, user engagement signals, and multilingual capabilities, ultimately improving search relevance metrics internationally while producing reusable semantic embeddings for other Pinterest surfaces.

Large Language Models for Search Relevance via Knowledge Distillation

Pinterest

Pinterest tackled the challenge of improving search relevance by implementing a large language model-based system. They developed a cross-encoder LLM teacher model trained on human-annotated data, which was then distilled into a lightweight student model for production deployment. The system processes rich Pin metadata including titles, descriptions, and synthetic image captions to predict relevance scores. The implementation resulted in a 2.18% improvement in search feed relevance (nDCG@20) and over 1.5% increase in search fulfillment rates globally, while successfully generalizing across multiple languages despite being trained primarily on US data.
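
For readers unfamiliar with the metric, nDCG@20 can be computed as in the sketch below. It assumes the common linear-gain form with a log2 rank discount; the summary does not say which variant Pinterest used, and the function names are illustrative.

```python
# nDCG@k with linear gain and a log2 rank discount (one common variant;
# the exact formulation used in the case study is not stated).
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k results, ranks starting at 1."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(relevances: Sequence[float], k: int = 20) -> float:
    """DCG of the given ordering divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal == 0:
        return 0.0
    return dcg_at_k(relevances, k) / ideal


if __name__ == "__main__":
    # Relevance labels of results in the order the search system returned them.
    print(ndcg_at_k([3, 2, 3, 0, 1, 2]))
```

A score of 1.0 means the results are already in ideal order, so a 2.18% improvement reflects more relevant Pins being moved closer to the top of the feed.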

Large-Scale Deployment of On-Device and Server Foundation Models for Consumer AI Features

Apple

Apple developed and deployed a comprehensive foundation model infrastructure consisting of a 3-billion parameter on-device model and a mixture-of-experts server model to power Apple Intelligence features across iOS, iPadOS, and macOS. The implementation addresses the challenge of delivering generative AI capabilities at consumer scale while maintaining privacy, efficiency, and quality across 15 languages. The solution involved novel architectural innovations including shared KV caches, parallel track mixture-of-experts design, and extensive optimization techniques including quantization and compression, resulting in production deployment across millions of devices with measurable performance improvements in text and vision tasks.

Multi-Company Showcase: AI-Powered Development Tools and Creative Applications

Tempo Labs / Zencoder / Diffusion / Bito / Gamma / Create

This case study presents six startups showcasing production deployments of Claude-powered applications across diverse domains at Anthropic's Code with Claude conference. Tempo Labs built a visual IDE enabling designers and PMs to collaborate on code generation, Zencoder extended AI coding assistance across the full software development lifecycle with custom agents, Gamma created an AI presentation builder leveraging Claude's web search capabilities, Bito developed an AI code review platform analyzing codebases for critical issues, Diffusion deployed Claude for song lyric generation in their music creation platform, and Create built a no-code platform for generating full-stack mobile and web applications. These companies demonstrated how Claude 3.5 and 3.7 Sonnet, along with features like tool use, web search, and prompt caching, enabled them to achieve rapid growth with hundreds of thousands to millions of users within 12 months.

Multi-Industry LLM Deployment: Building Production AI Systems Across Diverse Verticals

Caylent

Caylent, a development consultancy, shares their extensive experience building production LLM systems across multiple industries including environmental management, sports media, healthcare, and logistics. The presentation outlines their comprehensive approach to LLMOps, emphasizing the importance of proper evaluation frameworks, prompt engineering over fine-tuning, understanding user context, and managing inference economics. Through various client projects ranging from multimodal video search to intelligent document processing, they demonstrate key lessons learned about deploying reliable AI systems at scale, highlighting that generative AI is not a "magical pill" but requires careful engineering around inputs, outputs, evaluation, and user experience.

Multimodal AI Vector Search for Advanced Video Understanding

Twelve Labs

Twelve Labs developed an integration with Databricks Mosaic AI to enable advanced video understanding capabilities through multimodal embeddings. The solution addresses challenges in processing large-scale video datasets and providing accurate multimodal content representation. By combining Twelve Labs' Embed API for generating contextual vector representations with Databricks Mosaic AI Vector Search's scalable infrastructure, developers can implement sophisticated video search, recommendation, and analysis systems with reduced development time and resource needs.

Native Image Generation with Multimodal Context in Gemini 2.5 Flash

Google DeepMind

Google DeepMind released an updated native image generation capability in Gemini 2.5 Flash that represents a significant quality leap over previous versions. The model addresses key production challenges including consistent character rendering across multiple angles, pixel-perfect editing that preserves scene context, and improved text rendering within images. Through interleaved generation, the model can maintain conversation context across multiple editing turns, enabling iterative creative workflows. The team tackled evaluation challenges by combining human preference data with specific technical metrics like text rendering quality, while incorporating real user feedback from social media to create comprehensive benchmarks that drive model improvements.

Production-Scale Generative AI Infrastructure for Game Art Creation

Playtika

Playtika, a gaming company, built an internal generative AI platform to accelerate art production for their game studios with the goal of reducing art production time by 50%. The solution involved creating a comprehensive infrastructure for fine-tuning and deploying diffusion models (Stable Diffusion 1.5, then SDXL) at scale, supporting text-to-image, image-to-image, and inpainting capabilities. The platform evolved from using DreamBooth fine-tuning with separate model deployments to LoRA adapters with SDXL, enabling efficient model switching and GPU utilization. Through optimization techniques including OneFlow acceleration framework (achieving 40% latency reduction), FP16 quantization, NVIDIA MIG partitioning, and careful infrastructure design, they built a cost-efficient system serving multiple game studios while maintaining quality and minimizing inference latency.

Rapid Prototyping and Scaling AI Applications Using Open Source Models

Hassan El Mghari

Hassan El Mghari, a developer relations leader at Together AI, demonstrates how to build and scale AI applications to millions of users using open source models and a simplified architecture. Through building approximately 40 AI apps over four years (averaging one per month), he developed a streamlined approach that emphasizes simplicity, rapid iteration, and leveraging the latest open source models. His applications, including commit message generators, text-to-app builders, and real-time image generators, have collectively served millions of users and generated tens of millions of outputs, proving that simple architectures with single API calls can achieve significant scale when combined with good UI design and viral sharing mechanics.

Scaling Agentic AI for Digital Accessibility and Content Intelligence

Siteimprove

Siteimprove, a SaaS platform provider for digital accessibility, analytics, SEO, and content strategy, embarked on a journey from generative AI to production-scale agentic AI systems. The company faced the challenge of processing up to 100 million pages per month for accessibility compliance while maintaining trust, speed, and adoption. By leveraging AWS Bedrock, Amazon Nova models, and developing a custom AI accelerator architecture, Siteimprove built a multi-agent system supporting batch processing, conversational remediation, and contextual image analysis. The solution achieved 75% cost reduction on certain workloads, enabled autonomous multi-agent orchestration across accessibility, analytics, SEO, and content domains, and was recognized as a leader in Forrester's digital accessibility platforms assessment. The implementation demonstrated how systematic progression through human-in-the-loop, human-on-the-loop, and autonomous stages can bridge the prototype-to-production chasm while delivering measurable business value.

Scaling Audio Content Generation with LLMs and TTS for Language Learning

Duolingo

Duolingo tackled the challenge of scaling their DuoRadio feature, a podcast-like audio learning experience, by implementing an AI-driven content generation pipeline. They transformed a labor-intensive manual process into an automated system using LLMs for script generation and evaluation, coupled with Text-to-Speech technology. This allowed them to expand from 300 to 15,000+ episodes across 25+ language courses in under six months, while reducing costs by 99% and growing daily active users from 100K to 5.5M.

Scaling Content Production and Fan Engagement with Gen AI

Bundesliga

Bundesliga (DFL), Germany's premier soccer league, deployed multiple Gen AI solutions to address two key challenges: scaling content production for over 1 billion global fans across 200 countries, and enhancing personalized fan engagement to reduce "second screen chaos" during live matches. The organization implemented three main production-scale solutions: automated match report generation that saves editors 90% of their time, AI-powered story creation from existing articles that reduces production time by 80%, and on-demand video localization that cuts processing time by 75% while reducing costs by 3.5x. Additionally, they developed MatchMade, an AI-powered fan companion featuring dynamic text-to-SQL workflows and proactive content nudging. By leveraging Amazon Nova for cost-performance optimization alongside other models like Anthropic's Claude, Bundesliga achieved a 70% cost reduction in image assignment tasks, 35% cost reduction through dynamic routing, and scaled personalized content delivery by 5x per user while serving over 100,000 fans in production.

Using Evaluation Systems and Inference-Time Scaling for Beautiful, Scannable QR Code Generation

Modal

Modal's engineering team tackled the challenge of generating aesthetically pleasing QR codes that consistently scan by implementing comprehensive evaluation systems and inference-time compute scaling. The team developed automated evaluation pipelines that measured both scan rate and aesthetic quality, using human judgment alignment to validate their metrics. They applied inference-time compute scaling by generating multiple QR codes in parallel and selecting the best candidates, achieving a 95% scan rate service-level objective while maintaining aesthetic quality and returning results in under 20 seconds.
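
The generate-in-parallel-and-select idea can be sketched as a best-of-N loop. This is a minimal sketch under stated assumptions: `generate`, `scans`, and `aesthetic` are hypothetical callables standing in for the image model, the scan check, and the aesthetic scorer, and the thread pool is only a stand-in for whatever parallel infrastructure Modal actually uses.

```python
# Best-of-N selection: generate N candidates in parallel, drop any that
# fail the scan check, and keep the one with the highest aesthetic score.
# All three callables are hypothetical placeholders.
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def best_of_n(
    generate: Callable[[int], T],
    scans: Callable[[T], bool],
    aesthetic: Callable[[T], float],
    n: int = 8,
) -> Optional[T]:
    """Return the best-scoring scannable candidate, or None if none scan."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        # Each candidate receives a distinct seed so the outputs differ.
        candidates = list(pool.map(generate, range(n)))
    scannable = [c for c in candidates if scans(c)]
    if not scannable:
        return None
    return max(scannable, key=aesthetic)


if __name__ == "__main__":
    # Toy stand-ins: a candidate is just its seed, even seeds "scan",
    # and larger values count as "prettier".
    print(best_of_n(lambda seed: seed, lambda c: c % 2 == 0, lambda c: float(c)))
```

Filtering on scannability before ranking by aesthetics reflects the priority described above: a code that does not scan is unusable regardless of how it looks, and increasing N spends more compute to raise the chance that at least one candidate passes.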