GetYourGuide faced challenges with their lengthy 16-step activity creation process, where suppliers spent up to an hour manually entering content that often had quality issues, leading to traveler confusion and lower conversion rates. They implemented a generative AI solution that allows activity providers to paste existing content and automatically generates descriptions and fills structured fields across 8 key onboarding steps. After an initial failed experiment due to UX confusion and measurement challenges, they iterated with improved UI/UX design and developed a novel permutation testing framework. The second rollout successfully increased activity completion rates, improved content quality, and reduced onboarding time to as little as 14 minutes, ultimately achieving positive impacts on both supplier efficiency and traveler engagement metrics.
GetYourGuide, a leading travel marketplace platform, implemented a generative AI solution to transform their activity supplier onboarding process. This case study provides valuable insights into the real-world challenges of deploying LLM-based features in production environments, particularly in two-sided marketplaces where AI impacts both supply-side (activity providers) and demand-side (travelers) participants. The initiative spanned multiple quarters and involved four cross-functional teams: Supply Data Products, Catalog Tech, Content Management, and Analytics.
The original activity creation process required suppliers to navigate a 16-step product creation wizard where they manually entered descriptions, photos, availability, pricing, and location information. Supplier feedback and research identified this process as a significant pain point: manual entry could take up to an hour, and the resulting content often had quality issues that confused travelers and lowered conversion rates.
The hypothesis driving this initiative was that generative AI could simultaneously address both efficiency and quality concerns while ensuring consistency across the experience catalog. This dual benefit would create value for suppliers (faster onboarding) and travelers (better content quality leading to higher trust and conversion).
The production solution enables activity providers to paste existing content (such as from their own websites) into a designated input box. The LLM-powered system then processes this input to generate traveler-facing descriptions and automatically fill structured fields across 8 key onboarding steps.
The approach represents a practical application of prompt engineering where the LLM needs to understand unstructured supplier input and transform it into both engaging narrative content and structured metadata that fits GetYourGuide’s platform requirements. The system must balance creativity in generating traveler-friendly descriptions with accuracy in extracting and categorizing factual information about activities.
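As a rough illustration of this dual-output pattern, the sketch below builds a prompt that asks for both a narrative description and structured metadata in a single JSON response, then validates the reply before it would reach the catalog. The field names, prompt wording, and schema are assumptions for illustration, not GetYourGuide's actual implementation, and the model call is stubbed out.

```python
import json

# Hypothetical field schema; GetYourGuide's real onboarding fields are not disclosed.
FIELDS = ["title", "duration_minutes", "meeting_point", "languages"]

def build_prompt(supplier_text: str) -> str:
    """Ask the model for an engaging description plus structured metadata in one call."""
    return (
        "You are helping a tour operator list an activity on a travel marketplace.\n"
        "From the supplier text below, write a traveler-friendly description and\n"
        f"extract these fields: {', '.join(FIELDS)}.\n"
        'Respond with JSON: {"description": str, "fields": {<field>: value}}.\n'
        "Use null for any field not present in the text. Do not invent facts.\n\n"
        f"Supplier text:\n{supplier_text}"
    )

def parse_response(raw: str) -> dict:
    """Validate the model output against the expected schema before storing it."""
    data = json.loads(raw)
    assert isinstance(data.get("description"), str) and data["description"]
    fields = data.get("fields", {})
    # Keep only known fields; unexpected keys are dropped rather than stored.
    return {
        "description": data["description"],
        "fields": {k: fields.get(k) for k in FIELDS},
    }

# Example with a stubbed model reply (no API call is made):
stub_reply = json.dumps({
    "description": "Explore the old town on a two-hour guided walk.",
    "fields": {"title": "Old Town Walking Tour", "duration_minutes": 120,
               "meeting_point": "Main Square", "languages": ["en", "de"]},
})
profile = parse_response(stub_reply)
```

Validating the structured half of the response before it enters the catalog is what lets the same call serve both creative and metadata needs safely.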
One of the most significant LLMOps challenges encountered was measuring the success of the AI feature in a two-sided marketplace context. GetYourGuide’s existing experimentation platform was primarily designed for traveler-focused A/B tests and couldn’t be directly applied to this supplier-side feature. The core measurement challenge stemmed from the fact that while activity providers could be assigned to treatment or control groups, travelers could not be separately assigned to variants—an activity created through AI couldn’t simultaneously have a non-AI version.
This constraint led to the development of a novel permutation testing framework designed to account for marketplace dynamics, pre-experiment differences between groups, and the outsized influence that individual activity providers can have on aggregate results.
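The core idea of supplier-level permutation testing can be sketched briefly: re-randomize group labels at the unit of assignment (the supplier) and compare the observed effect to the resulting null distribution, so that a single heavy supplier inflates the null distribution rather than silently biasing a t-test. The data and metric below are illustrative assumptions, not GetYourGuide's actual framework.

```python
import random

def permutation_test(treated, control, n_perm=10_000, seed=0):
    """Two-sided permutation test on the difference in per-supplier means.
    Shuffling at the supplier level respects the unit of assignment."""
    rng = random.Random(seed)
    observed = sum(treated) / len(treated) - sum(control) / len(control)
    pooled = list(treated) + list(control)
    n_t = len(treated)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = sum(pooled[:n_t]) / n_t - sum(pooled[n_t:]) / (len(pooled) - n_t)
        if abs(diff) >= abs(observed):
            hits += 1
    return observed, hits / n_perm

# Synthetic per-supplier wizard submission rates (illustrative only)
treated = [0.62, 0.58, 0.71, 0.66, 0.60, 0.69, 0.64, 0.67]
control = [0.55, 0.52, 0.57, 0.54, 0.56, 0.53, 0.58, 0.51]
obs, p = permutation_test(treated, control)
```

Because the null distribution is built from the data itself, this approach makes no normality assumptions, which matters when a handful of large suppliers dominate a metric.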
The case study emphasizes a critical LLMOps principle: the black-box nature of LLM systems makes evaluation particularly challenging. The correctness and suitability of AI-generated outputs depend on multiple factors including input data quality and algorithm design, and outputs may be technically correct but not meet platform constraints or business requirements.
The initial full-scale experiment used a 75/25 split between treatment and control groups, with the treatment group oversized to compensate for an expected 60% opt-in rate (so that approximately 50% of activities would be created via AI). This experiment revealed critical issues:
User confusion and trust deficit: The primary success metric (percentage of activities submitted out of those that started the wizard) was significantly lower in the treatment group. Root cause analysis revealed that activity providers didn’t understand how the AI tool fit into the onboarding process. The UI design of the AI input page was insufficiently clear, causing suppliers to think they were in the wrong place and restarting the activity creation process multiple times.
Expectation mismatch: Activity providers in the treatment group spent longer on pages not filled out by AI, indicating frustration about having to complete certain sections manually. The feature hadn’t adequately set expectations about which fields would be automated versus which would require manual input.
Measurement complications: The planned standard A/B analysis approach failed because experiment groups showed significant pre-experiment differences in both traveler and supplier-side metrics. Certain activity providers could significantly skew results based on their group assignment, violating fundamental assumptions of the statistical approach.
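A simple diagnostic illustrates the kind of pre-experiment imbalance described here: a standardized mean difference on a pre-period metric, where one outsized supplier can push the groups apart before the experiment even starts. The threshold and data below are illustrative assumptions, not GetYourGuide's actual checks.

```python
import statistics

def preperiod_imbalance(pre_treated, pre_control, threshold=0.25):
    """Standardized mean difference on a pre-experiment metric.
    Values above roughly 0.25 suggest the groups were not comparable
    before the experiment, invalidating a naive A/B comparison."""
    diff = statistics.mean(pre_treated) - statistics.mean(pre_control)
    pooled_sd = statistics.pstdev(pre_treated + pre_control)
    smd = abs(diff) / pooled_sd if pooled_sd else 0.0
    return smd, smd > threshold

# Illustrative pre-period bookings per supplier: one large supplier
# (240 bookings) lands in the treatment group and skews its mean.
pre_treated = [12, 9, 14, 11, 240, 10, 13, 8]
pre_control = [11, 10, 12, 9, 13, 10, 11, 12]
smd, imbalanced = preperiod_imbalance(pre_treated, pre_control)
```

When a check like this fires, comparing raw group means post-experiment mostly measures which group the big supplier landed in, which is exactly the violation the custom framework had to work around.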
The decision to close the experiment without launching demonstrates appropriate LLMOps rigor—recognizing when a deployment isn’t ready despite organizational pressure to ship AI features.
Following the failed first experiment, the team made several improvements informed by data-driven analysis:
UX refinements: The AI input page was redesigned to clearly show it as a step within the normal product creation wizard, with a visible left-side menu/progress bar providing context. Visual design and microcopy were improved to set explicit expectations about what the tool would and wouldn’t automate.
Model improvements: The LLM was refined to improve content quality and automatically fill out additional sections, reducing the manual work required from suppliers.
Measurement framework: The custom permutation testing framework was finalized to properly account for marketplace dynamics and pre-experiment group differences.
The second experiment achieved measurable success across multiple dimensions: higher activity completion rates, improved content quality, a roughly 5 percentage point reduction in drop-off, and onboarding times reduced to as little as 14 minutes.
Following the successful second experiment, the feature was rolled out to 100% of the supplier base. While specific monitoring infrastructure isn’t detailed, the case study emphasizes the importance of anticipating potential issues before they arise and setting up monitoring systems to track them. This forward-thinking approach to observability is a key LLMOps practice for production AI systems.
The deployment represents a full-scale production LLM application that directly impacts business-critical workflows. The system processes supplier-provided content in real-time during the onboarding flow, generating both creative and structured outputs that immediately become part of the product catalog visible to travelers.
The case study provides extensive insights into the organizational aspects of deploying LLM features in production:
Cross-functional collaboration: The project involved four teams over multiple quarters, requiring structured coordination through bi-weekly syncs, active Slack channels, and ad-hoc meetings. The complexity of LLM projects often requires this level of coordination across ML/AI teams, product teams, engineering infrastructure teams, and analytics teams.
Documentation and knowledge management: A centralized master document with all important links (referencing 30+ other documents) proved essential for alignment. For LLMOps projects dealing with complex metrics and multiple teams, maintaining a “master table” documenting all assumptions and logic prevents confusion and ensures consistent decision-making.
Early analytics involvement: Including analysts from the project’s inception, even before immediate analytical work was needed, ensured better context and more meaningful insights. This is particularly important for LLMOps projects where defining success metrics and measurement approaches for AI-generated outputs requires domain expertise.
Iteration and perseverance: The willingness to learn from failure and iterate rather than abandon the project after the first failed experiment represents mature LLMOps practice. The case study explicitly notes that “failure is often a part of the learning process” and that understanding why experiments fail enables turning things around.
Scope management: The team identified scope creep as a common pitfall—underestimating AI limitations and over-promising what LLMs can realistically achieve. Balancing ambitious goals with practical constraints while maintaining adaptability to rapid AI advancements proved crucial.
Several LLMOps best practices emerge from this case study:
Attention to outliers: Statistical anomalies and edge cases often highlight important user patterns and pain points. Investigating outlier behavior in the first experiment proved instrumental in refining the product for the second test.
Transparency about limitations: Clear communication about both benefits and limitations of AI tools significantly improved user satisfaction and adoption. The UI explicitly set expectations about what the tool could and couldn’t do, addressing the trust deficit observed in the first experiment.
Data-driven iteration: Close monitoring of metrics at each step of the supplier journey, segmented by key dimensions, enabled identifying who was engaging successfully versus struggling. This granular analysis informed specific improvements rather than broad changes.
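The kind of granular, segmented funnel analysis described here can be sketched with a small helper that computes, per segment, the share of wizard sessions reaching each step. The segment names, step counts, and events are illustrative assumptions, not GetYourGuide's actual instrumentation.

```python
from collections import defaultdict

def step_completion_by_segment(events):
    """events: (supplier_id, segment, furthest_step_reached) per wizard session.
    Returns, per segment and step, the share of sessions that reached at
    least that step, making it easy to see which segment drops off where."""
    sessions = defaultdict(list)
    for _, segment, step_reached in events:
        sessions[segment].append(step_reached)
    rates = {}
    for segment, steps in sessions.items():
        total = len(steps)
        rates[segment] = {
            step: sum(1 for s in steps if s >= step) / total
            for step in range(1, max(steps) + 1)
        }
    return rates

# Illustrative sessions: (supplier, segment, furthest step reached of 16)
events = [
    ("s1", "ai_opt_in", 16), ("s2", "ai_opt_in", 16), ("s3", "ai_opt_in", 9),
    ("s4", "ai_opt_out", 16), ("s5", "ai_opt_out", 5), ("s6", "ai_opt_out", 4),
]
rates = step_completion_by_segment(events)
```

Comparing the two segments' curves step by step is what turns a vague "treatment is worse" signal into a specific fix, such as the redesigned AI input page.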
Measurement framework adaptation: Recognizing when standard A/B testing approaches don’t apply and developing custom statistical frameworks represents sophisticated LLMOps practice. The permutation testing toolkit they developed for marketplace dynamics could be valuable for other two-sided platform contexts.
While the case study presents a success story, several caveats merit consideration:
Limited technical detail: The case provides minimal information about the actual LLM architecture, model selection, prompt engineering techniques, or infrastructure. It’s unclear whether they use proprietary models, commercial APIs, or open-source alternatives, and what specific technical approaches enable the dual output of creative content and structured fields.
Selective metrics disclosure: While the case mentions increases in “all success metrics” and specific improvements like the 5 percentage point drop-off reduction, many quantitative results are presented qualitatively (“solid increase,” “higher quality content”) without precise numbers. This is common in company blog posts but limits the ability to assess the magnitude of the impact.
Quality control mechanisms unclear: The case doesn’t detail how content quality is evaluated or what guardrails exist to prevent AI-generated content from containing errors, inappropriate tone, or hallucinated information. For a travel marketplace where accuracy is critical, these quality control mechanisms would be important LLMOps components.
Cost considerations absent: No discussion of computational costs, API expenses, or cost-benefit analysis compared to the previous manual process. LLMOps in production requires managing these economic tradeoffs.
Opt-in dynamics: With a 60% adoption rate, 40% of suppliers still chose not to use the AI feature even after improvements. Understanding why these suppliers opted out and whether their activities perform differently would provide useful context.
Despite these limitations, the case study provides valuable real-world insights into deploying LLM features in production environments, particularly the challenges of measurement, iteration based on user behavior, and organizational coordination required for successful LLMOps at scale in marketplace contexts. The transparency about failure and the detailed discussion of what went wrong in the first experiment makes this particularly valuable for practitioners facing similar challenges.
This panel discussion features three AI-native companies—Delphi (personal AI profiles), Seam AI (sales/marketing automation agents), and APIsec (API security testing)—discussing their journeys building production LLM systems over three years. The companies address infrastructure evolution from single-shot prompting to fully agentic systems, the shift toward serverless and scalable architectures, managing costs at scale (including burning through a trillion OpenAI tokens), balancing deterministic workflows with model autonomy, and measuring ROI through outcome-based metrics rather than traditional productivity gains. Key technical themes include moving away from opinionated architectures to let models reason autonomously, implementing state machines for high-confidence decisions, using tools like Pydantic AI and Logfire for instrumentation, and leveraging Pinecone for vector search at scale.
This case study explores how Airia developed an orchestration platform to help organizations deploy AI agents in production environments. The problem addressed is the significant complexity and security challenges that prevent businesses from moving beyond prototype AI agents to production-ready systems. The solution involves a comprehensive platform that provides agent building capabilities, security guardrails, evaluation frameworks, red teaming, and authentication controls. Results include successful deployments across multiple industries including hospitality (customer profiling across hotel chains), HR, legal (contract analysis), marketing (personalized content generation), and operations (real-time incident response through automated data aggregation), with customers reporting significant efficiency gains while maintaining enterprise security standards.
DoorDash evolved from traditional numerical embeddings to LLM-generated natural language profiles for representing consumers, merchants, and food items to improve personalization and explainability. The company built an automated system that generates detailed, human-readable profiles by feeding structured data (order history, reviews, menu metadata) through carefully engineered prompts to LLMs, enabling transparent recommendations, editable user preferences, and richer input for downstream ML models. While the approach offers scalability and interpretability advantages over traditional embeddings, the implementation requires careful evaluation frameworks, robust serving infrastructure, and continuous iteration cycles to maintain profile quality in production.
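The structured-data-to-profile step described above can be sketched as a prompt builder that serializes order history into a request for a short, human-readable summary. The schema, wording, and guardrail phrasing are assumptions for illustration, not DoorDash's actual prompts.

```python
def profile_prompt(order_history: list[dict]) -> str:
    """Serialize structured order data into a prompt asking for a short,
    human-readable (and therefore editable) preference profile."""
    lines = [
        f"- {o['item']} from {o['merchant']} ({o['cuisine']}), rated {o['rating']}/5"
        for o in order_history
    ]
    return (
        "Summarize this consumer's food preferences in 2-3 plain sentences.\n"
        "Mention cuisines, dietary patterns, and price sensitivity only if\n"
        "supported by the orders below; do not speculate.\n\n"
        + "\n".join(lines)
    )

# Illustrative order history (hypothetical data)
orders = [
    {"item": "Pad Thai", "merchant": "Thai Basil", "cuisine": "Thai", "rating": 5},
    {"item": "Green Curry", "merchant": "Thai Basil", "cuisine": "Thai", "rating": 4},
    {"item": "Veggie Burrito", "merchant": "La Mesa", "cuisine": "Mexican", "rating": 3},
]
prompt = profile_prompt(orders)
```

Because the output is plain text rather than an opaque embedding, the resulting profile can be shown to the user, edited, and fed as context to downstream models, which is the interpretability advantage the case study highlights.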