ZenML

Evaluating Product Image Integrity in AI-Generated Advertising Content

Microsoft 2024

Microsoft worked with an advertising customer to enable 1:1 ad personalization while ensuring product image integrity in AI-generated content. They developed a comprehensive evaluation system combining template matching, Mean Squared Error (MSE), Peak Signal-to-Noise Ratio (PSNR), and cosine similarity to verify that AI-generated backgrounds didn't alter the original product images. The solution successfully enabled automatic verification of product image fidelity in AI-generated advertising materials.

Industry

Media & Entertainment

Overview

This case study from Microsoft’s ISE (Industry Solutions Engineering) team documents an engagement with an advertising customer pursuing 1:1 ad personalization at scale. The fundamental goal was to create AI-generated advertisements where real product images are embedded within generated contextual backgrounds, while ensuring the integrity of the original product image is maintained. This represents an interesting intersection of generative AI capabilities and quality assurance requirements—a critical concern when deploying AI-generated content in production advertising contexts.

The project addresses a key hypothesis: given a product image and a text description of desired surroundings, can we generate high-fidelity images with appropriate environmental elements while keeping the product image completely unmodified? This is essential for advertising use cases where product representation accuracy has legal and brand implications.

The Technical Challenge

The team identified inpainting as the most promising technology for this use case. Inpainting is a feature of multi-modal text-to-image models where inputs include the product image, a mask defining the area to be generated (the background), and a textual prompt describing the desired background. The output is an image where the masked area has been filled with generated content according to the prompt.

However, the team noted a critical limitation: inpainting cannot guarantee the product remains unmodified. The generative process may subtly alter product pixels, especially when masked areas are close to product boundaries. This creates a significant quality assurance challenge for production deployment—how do you programmatically verify that thousands or millions of generated ad images have preserved product integrity?

Evaluation Framework Design

Rather than relying on a single metric, the team developed a multi-technique evaluation system that combines four distinct approaches, each with complementary strengths and weaknesses. This represents a sophisticated approach to LLMOps evaluation where no single metric captures all quality dimensions.

Mean Squared Error (MSE)

MSE calculates the average squared difference between pixel values in two images. The team implemented this using NumPy for efficient array operations. MSE performs well for detecting exact pixel color differences and disproportionate scaling, but has significant limitations: it requires images to be aligned (same product position) and cannot distinguish between acceptable transformations (like translation or proportionate scaling) and unacceptable modifications.
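A minimal NumPy sketch of this metric (the function name and sample arrays are illustrative, not taken from the original post):

```python
import numpy as np

def mse(img_a: np.ndarray, img_b: np.ndarray) -> float:
    """Average squared pixel difference; requires identically sized, aligned images."""
    if img_a.shape != img_b.shape:
        raise ValueError("MSE requires images of identical dimensions")
    diff = img_a.astype(np.float64) - img_b.astype(np.float64)
    return float(np.mean(diff ** 2))

# A single 10-level change in one of 16 pixels: MSE = 10**2 / 16 = 6.25
original = np.full((4, 4), 128, dtype=np.uint8)
altered = original.copy()
altered[0, 0] = 138
print(mse(original, original))  # 0.0
print(mse(original, altered))   # 6.25
```

Note the cast to float64 before subtracting: differencing uint8 arrays directly would wrap around and silently corrupt the result.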

Peak Signal-to-Noise Ratio (PSNR)

PSNR builds on MSE but uses a logarithmic scale to express the ratio between maximum signal power and noise (corruption). The team noted that while the logarithmic nature makes results initially less intuitive, it provides advantages for understanding orders of magnitude difference between images. PSNR shares MSE’s limitations regarding translation and rotation sensitivity, but provides a standardized way to express image fidelity that is commonly used in image processing literature.
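The relationship between the two metrics can be sketched as follows (a minimal implementation for 8-bit images; names are illustrative):

```python
import numpy as np

def psnr(img_a: np.ndarray, img_b: np.ndarray, max_val: float = 255.0) -> float:
    """Peak Signal-to-Noise Ratio in decibels; higher means closer images."""
    diff = img_a.astype(np.float64) - img_b.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images: no noise at all
    return 10.0 * np.log10((max_val ** 2) / mse)

black = np.zeros((8, 8), dtype=np.uint8)
white = np.full((8, 8), 255, dtype=np.uint8)
print(psnr(black, black))  # inf
print(psnr(black, white))  # 0.0 -- maximal corruption at 8-bit depth
```

The logarithm compresses large MSE differences into a decibel scale, which is why PSNR is convenient for comparing orders of magnitude of degradation across many images.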

Cosine Similarity with CNN Feature Vectors

This approach represents a more sophisticated computer vision technique. Instead of comparing raw pixels, the team used VGG16 (a pre-trained convolutional neural network) to extract feature vectors from images, then compared these using cosine similarity. They specifically used only the feature extraction layers of VGG16, not the classification head.

The key advantage is that cosine similarity on feature vectors captures structural and semantic image properties—edges, curves, and shapes that define the product—rather than exact pixel values. This makes it robust to translation and proportionate scaling, as the extracted features represent the product’s visual characteristics regardless of position. However, it won’t detect subtle color changes as effectively as pixel-based methods and may incorrectly approve disproportionate scaling that preserves feature ratios.

The implementation uses TensorFlow/Keras for the VGG16 model and SciPy for cosine distance calculation, representing a practical production-ready approach using established deep learning frameworks.
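The comparison step can be illustrated in isolation. The synthetic 4096-element vectors below are stand-ins for real VGG16 feature outputs (4096 matches the width of VGG16's fully-connected layers); the actual pipeline would extract them from product crops with Keras:

```python
import numpy as np
from scipy.spatial.distance import cosine

# Synthetic stand-ins for VGG16 feature vectors extracted from product crops.
rng = np.random.default_rng(seed=0)
features_original = rng.random(4096)
features_translated = features_original.copy()  # translation barely moves CNN features
features_distorted = features_original + rng.normal(0.0, 0.5, size=4096)

# SciPy returns cosine *distance*; similarity = 1 - distance.
sim_same = 1.0 - cosine(features_original, features_translated)
sim_distorted = 1.0 - cosine(features_original, features_distorted)
print(round(sim_same, 3))        # 1.0
print(sim_distorted < sim_same)  # True
```

A similarity near 1.0 indicates the product's structural features survived generation, while distortion pulls the score down even when average pixel statistics look similar.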

Template Matching with OpenCV

Template matching serves a different purpose—locating where the product exists within the generated image. Using OpenCV’s matchTemplate function, the system slides the original product image across the generated image to find the best match location. This enables extraction of the product region from the generated image for subsequent comparison.

The team noted important practical considerations: template matching requires the template and target image to have the same resolution, and OpenCV doesn’t track resolution metadata. This means production systems may need to maintain multiple resolution versions of product templates to match various GenAI model output resolutions.

Combined Evaluation Strategy

The case study’s key LLMOps insight is that combining these techniques creates a more robust evaluation system than any single approach. The complementary usage pattern is roughly: template matching first locates the product within the generated image and extracts that region; pixel-based metrics (MSE and PSNR) then check the extracted region for color and intensity changes; and cosine similarity over CNN feature vectors checks for structural or shape changes.

This layered approach means that different types of product modifications will be caught by at least one technique. A color shift would be detected by MSE/PSNR even if cosine similarity shows high structural similarity. A structural distortion would be caught by cosine similarity even if pixel colors are similar.
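The layered gate can be sketched as a simple conjunction of per-metric checks. The threshold values below are placeholders for illustration only; as the case study notes later, real thresholds would need calibration against human judgment:

```python
def product_intact(mse_value: float, psnr_value: float, cosine_sim: float) -> bool:
    """Accept a generated ad only if every metric is within tolerance.

    Thresholds are hypothetical placeholders, not values from the case study.
    """
    MSE_MAX = 25.0      # pixel-level color fidelity ceiling
    PSNR_MIN = 35.0     # decibel floor for signal-to-noise
    COSINE_MIN = 0.98   # structural/semantic similarity floor
    return (mse_value <= MSE_MAX
            and psnr_value >= PSNR_MIN
            and cosine_sim >= COSINE_MIN)

print(product_intact(0.0, float("inf"), 1.0))  # True  -- pixel-perfect copy
print(product_intact(4.0, 42.0, 0.90))         # False -- structure drifted
```

Requiring all checks to pass is what makes the system conservative: any one metric can veto an image, so each failure mode only needs to be visible to one technique.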

Limitations and Future Work

The team transparently documented several limitations, among them each metric's individual blind spots (alignment sensitivity for MSE and PSNR, reduced sensitivity to subtle color shifts for cosine similarity) and template matching's need for resolution-matched templates. This transparency is valuable for understanding production deployment considerations.

Production Considerations

While the case study focuses on the evaluation methodology rather than full production deployment details, several LLMOps considerations emerge:

The evaluation pipeline would need to run at scale for 1:1 personalization use cases generating potentially millions of unique ads. The choice of established, efficient libraries (NumPy, OpenCV, TensorFlow) supports this requirement.

Threshold calibration is implicitly required but not detailed—at what MSE, PSNR, or cosine similarity values should a generated image be rejected? This would require validation against human judgment and business requirements.

The mention that all product images in the post were AI-generated by OpenAI via ChatGPT provides useful context about the generative model being evaluated, though the evaluation framework itself is model-agnostic.

Technical Implementation Details

The code snippets provided in the original post show production-ready patterns. Notably, the VGG16 feature extraction uses the ‘fc2’ layer (the second fully-connected layer), which provides a 4096-dimensional feature vector that captures high-level visual features learned from ImageNet training.
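A sketch of that extraction pattern with Keras is below. To keep the example self-contained it builds the network with `weights=None` (random weights, no download); a real pipeline would pass `weights="imagenet"` so the features reflect pre-training, and would feed preprocessed product crops rather than a zero array:

```python
import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras.models import Model

# Build VGG16 with its classification top, then cut it at the 'fc2' layer so the
# model outputs the 4096-dimensional feature vector instead of class scores.
base = VGG16(weights=None, include_top=True)
extractor = Model(inputs=base.input, outputs=base.get_layer("fc2").output)

image_batch = np.zeros((1, 224, 224, 3), dtype=np.float32)  # placeholder product crop
features = extractor.predict(image_batch, verbose=0)
print(features.shape)  # (1, 4096)
```

Two of these feature vectors (one from the original product image, one from the product region cropped out of the generated ad) are what the cosine similarity comparison operates on.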

Conclusion

This case study demonstrates a thoughtful approach to quality assurance for AI-generated content in production advertising contexts. Rather than trusting generative AI outputs blindly, the team developed a multi-technique validation framework that can programmatically verify product integrity at scale. The combination of traditional image processing (MSE, PSNR, template matching) with deep learning feature extraction (CNN embeddings, cosine similarity) represents a practical pattern for LLMOps evaluation where different quality dimensions require different measurement approaches. The transparent discussion of limitations provides valuable guidance for teams implementing similar systems.
