ZenML

Evaluating Product Image Integrity in AI-Generated Advertising Content

Microsoft 2024

Microsoft worked with an advertising customer to enable 1:1 ad personalization while ensuring product image integrity in AI-generated content. They developed a comprehensive evaluation system combining template matching, Mean Squared Error (MSE), Peak Signal-to-Noise Ratio (PSNR), and cosine similarity to verify that AI-generated backgrounds didn't alter the original product images. The solution successfully enabled automatic verification of product image fidelity in AI-generated advertising materials.

Industry

Media & Entertainment

Overview

This case study from Microsoft’s ISE (Industry Solutions Engineering) team documents an engagement with an advertising customer pursuing 1:1 ad personalization at scale. The fundamental goal was to create AI-generated advertisements where real product images are embedded within generated contextual backgrounds, while ensuring the integrity of the original product image is maintained. This represents an interesting intersection of generative AI capabilities and quality assurance requirements—a critical concern when deploying AI-generated content in production advertising contexts.

The project addresses a key hypothesis: given a product image and a text description of desired surroundings, can we generate high-fidelity images with appropriate environmental elements while keeping the product image completely unmodified? This is essential for advertising use cases where product representation accuracy has legal and brand implications.

The Technical Challenge

The team identified inpainting as the most promising technology for this use case. Inpainting is a feature of multi-modal text-to-image models where inputs include the product image, a mask defining the area to be generated (the background), and a textual prompt describing the desired background. The output is an image where the masked area has been filled with generated content according to the prompt.

However, the team noted a critical limitation: inpainting cannot guarantee the product remains unmodified. The generative process may subtly alter product pixels, especially when masked areas are close to product boundaries. This creates a significant quality assurance challenge for production deployment—how do you programmatically verify that thousands or millions of generated ad images have preserved product integrity?

Evaluation Framework Design

Rather than relying on a single metric, the team developed a multi-technique evaluation system that combines four distinct approaches, each with complementary strengths and weaknesses. This represents a sophisticated approach to LLMOps evaluation where no single metric captures all quality dimensions.

Mean Squared Error (MSE)

MSE calculates the average squared difference between pixel values in two images. The team implemented this using NumPy for efficient array operations. MSE performs well for detecting exact pixel color differences and disproportionate scaling, but has significant limitations: it requires images to be aligned (same product position) and cannot distinguish between acceptable transformations (like translation or proportionate scaling) and unacceptable modifications.
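A minimal NumPy sketch of this metric (the function name and sample arrays are illustrative, not taken from the original post):

```python
import numpy as np

def mse(img_a: np.ndarray, img_b: np.ndarray) -> float:
    """Average squared pixel difference; requires identically sized, aligned images."""
    if img_a.shape != img_b.shape:
        raise ValueError("MSE requires images of identical dimensions")
    diff = img_a.astype(np.float64) - img_b.astype(np.float64)
    return float(np.mean(diff ** 2))

# A single 10-level change in one of 16 pixels: MSE = 10**2 / 16 = 6.25
original = np.full((4, 4), 128, dtype=np.uint8)
altered = original.copy()
altered[0, 0] = 138
print(mse(original, original))  # 0.0
print(mse(original, altered))   # 6.25
```

Note the cast to float64 before subtracting: differencing uint8 arrays directly would wrap around and silently corrupt the result.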

Peak Signal-to-Noise Ratio (PSNR)

PSNR builds on MSE but uses a logarithmic scale to express the ratio between maximum signal power and noise (corruption). The team noted that while the logarithmic nature makes results initially less intuitive, it provides advantages for understanding orders of magnitude difference between images. PSNR shares MSE’s limitations regarding translation and rotation sensitivity, but provides a standardized way to express image fidelity that is commonly used in image processing literature.
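The relationship between the two metrics can be sketched as follows (a minimal implementation for 8-bit images; names are illustrative):

```python
import numpy as np

def psnr(img_a: np.ndarray, img_b: np.ndarray, max_val: float = 255.0) -> float:
    """Peak Signal-to-Noise Ratio in decibels; higher means closer images."""
    diff = img_a.astype(np.float64) - img_b.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images: no noise at all
    return 10.0 * np.log10((max_val ** 2) / mse)

black = np.zeros((8, 8), dtype=np.uint8)
white = np.full((8, 8), 255, dtype=np.uint8)
print(psnr(black, black))  # inf
print(psnr(black, white))  # 0.0 -- maximal corruption at 8-bit depth
```

The logarithm compresses large MSE differences into a decibel scale, which is why PSNR is convenient for comparing orders of magnitude of degradation across many images.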

Cosine Similarity with CNN Feature Vectors

This approach represents a more sophisticated computer vision technique. Instead of comparing raw pixels, the team used VGG16 (a pre-trained convolutional neural network) to extract feature vectors from images, then compared these using cosine similarity. They specifically used only the feature extraction layers of VGG16, not the classification head.

The key advantage is that cosine similarity on feature vectors captures structural and semantic image properties—edges, curves, and shapes that define the product—rather than exact pixel values. This makes it robust to translation and proportionate scaling, as the extracted features represent the product’s visual characteristics regardless of position. However, it won’t detect subtle color changes as effectively as pixel-based methods and may incorrectly approve disproportionate scaling that preserves feature ratios.

The implementation uses TensorFlow/Keras for the VGG16 model and SciPy for cosine distance calculation, representing a practical production-ready approach using established deep learning frameworks.
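The comparison step can be illustrated in isolation. The synthetic 4096-element vectors below are stand-ins for real VGG16 feature outputs (4096 matches the width of VGG16's fully-connected layers); the actual pipeline would extract them from product crops with Keras:

```python
import numpy as np
from scipy.spatial.distance import cosine

# Synthetic stand-ins for VGG16 feature vectors extracted from product crops.
rng = np.random.default_rng(seed=0)
features_original = rng.random(4096)
features_translated = features_original.copy()  # translation barely moves CNN features
features_distorted = features_original + rng.normal(0.0, 0.5, size=4096)

# SciPy returns cosine *distance*; similarity = 1 - distance.
sim_same = 1.0 - cosine(features_original, features_translated)
sim_distorted = 1.0 - cosine(features_original, features_distorted)
print(round(sim_same, 3))        # 1.0
print(sim_distorted < sim_same)  # True
```

A similarity near 1.0 indicates the product's structural features survived generation, while distortion pulls the score down even when average pixel statistics look similar.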

Template Matching with OpenCV

Template matching serves a different purpose—locating where the product exists within the generated image. Using OpenCV’s matchTemplate function, the system slides the original product image across the generated image to find the best match location. This enables extraction of the product region from the generated image for subsequent comparison.

The team noted important practical considerations: template matching requires the template and target image to have the same resolution, and OpenCV doesn’t track resolution metadata. This means production systems may need to maintain multiple resolution versions of product templates to match various GenAI model output resolutions.

Combined Evaluation Strategy

The case study’s key LLMOps insight is that combining these techniques creates a more robust evaluation system than any single approach. The complementary usage pattern is roughly: template matching first locates the product within the generated image and extracts that region; pixel-based metrics (MSE and PSNR) then check the extracted region for color and intensity changes; and cosine similarity over CNN feature vectors checks for structural or shape changes.

This layered approach means that different types of product modifications will be caught by at least one technique. A color shift would be detected by MSE/PSNR even if cosine similarity shows high structural similarity. A structural distortion would be caught by cosine similarity even if pixel colors are similar.
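The layered gate can be sketched as a simple conjunction of per-metric checks. The threshold values below are placeholders for illustration only; as the case study notes later, real thresholds would need calibration against human judgment:

```python
def product_intact(mse_value: float, psnr_value: float, cosine_sim: float) -> bool:
    """Accept a generated ad only if every metric is within tolerance.

    Thresholds are hypothetical placeholders, not values from the case study.
    """
    MSE_MAX = 25.0      # pixel-level color fidelity ceiling
    PSNR_MIN = 35.0     # decibel floor for signal-to-noise
    COSINE_MIN = 0.98   # structural/semantic similarity floor
    return (mse_value <= MSE_MAX
            and psnr_value >= PSNR_MIN
            and cosine_sim >= COSINE_MIN)

print(product_intact(0.0, float("inf"), 1.0))  # True  -- pixel-perfect copy
print(product_intact(4.0, 42.0, 0.90))         # False -- structure drifted
```

Requiring all checks to pass is what makes the system conservative: any one metric can veto an image, so each failure mode only needs to be visible to one technique.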

Limitations and Future Work

The team transparently documented several limitations, among them each metric's individual blind spots (alignment sensitivity for MSE and PSNR, reduced sensitivity to subtle color shifts for cosine similarity) and template matching's need for resolution-matched templates. This transparency is valuable for understanding production deployment considerations.

Production Considerations

While the case study focuses on the evaluation methodology rather than full production deployment details, several LLMOps considerations emerge:

The evaluation pipeline would need to run at scale for 1:1 personalization use cases generating potentially millions of unique ads. The choice of established, efficient libraries (NumPy, OpenCV, TensorFlow) supports this requirement.

Threshold calibration is implicitly required but not detailed—at what MSE, PSNR, or cosine similarity values should a generated image be rejected? This would require validation against human judgment and business requirements.

The mention that all product images in the post were AI-generated by OpenAI via ChatGPT provides useful context about the generative model being evaluated, though the evaluation framework itself is model-agnostic.

Technical Implementation Details

The code snippets provided in the original post show production-ready patterns. Notably, the VGG16 feature extraction uses the ‘fc2’ layer (the second fully-connected layer), which provides a 4096-dimensional feature vector that captures high-level visual features learned from ImageNet training.
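A sketch of that extraction pattern with Keras is below. To keep the example self-contained it builds the network with `weights=None` (random weights, no download); a real pipeline would pass `weights="imagenet"` so the features reflect pre-training, and would feed preprocessed product crops rather than a zero array:

```python
import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras.models import Model

# Build VGG16 with its classification top, then cut it at the 'fc2' layer so the
# model outputs the 4096-dimensional feature vector instead of class scores.
base = VGG16(weights=None, include_top=True)
extractor = Model(inputs=base.input, outputs=base.get_layer("fc2").output)

image_batch = np.zeros((1, 224, 224, 3), dtype=np.float32)  # placeholder product crop
features = extractor.predict(image_batch, verbose=0)
print(features.shape)  # (1, 4096)
```

Two of these feature vectors (one from the original product image, one from the product region cropped out of the generated ad) are what the cosine similarity comparison operates on.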

Conclusion

This case study demonstrates a thoughtful approach to quality assurance for AI-generated content in production advertising contexts. Rather than trusting generative AI outputs blindly, the team developed a multi-technique validation framework that can programmatically verify product integrity at scale. The combination of traditional image processing (MSE, PSNR, template matching) with deep learning feature extraction (CNN embeddings, cosine similarity) represents a practical pattern for LLMOps evaluation where different quality dimensions require different measurement approaches. The transparent discussion of limitations provides valuable guidance for teams implementing similar systems.
