ZenML

Implementing Effective Safety Filters in a Game-Based LLM Application

JOBifAI 2025

JOBifAI, a game leveraging LLMs for interactive gameplay, encountered significant challenges with LLM safety filters in production. The developers implemented a retry-based solution to handle both technical failures and safety filter triggers, achieving a 99% success rate after three retries. However, the experience highlighted fundamental issues with current safety filter implementations, including lack of transparency, inconsistent behavior, and potential cost implications, which ultimately deterred the developers from taking the game beyond a proof of concept into full production.

Industry

Media & Entertainment

Overview

JOBifAI is an LLM-powered game that demonstrates novel gameplay mechanics enabled by large language models. The game places players in an interview scenario where they’ve submitted an AI-generated portfolio to a company. This case study, published in January 2025, offers a candid and somewhat critical perspective on the operational challenges of deploying LLMs in interactive, real-time applications where user input is unpredictable and the system must handle both legitimate queries and potential misuse gracefully.

The developers’ experience provides valuable insights into the practical realities of LLMOps, particularly around handling safety filters, managing unreliable outputs, and building robust retry mechanisms. While the tone of the original article is somewhat critical of LLM providers’ implementation choices, the technical lessons learned are broadly applicable to any production LLM deployment that involves user-generated input.

The Core Challenge: Unreliable LLM Outputs

The fundamental problem the JOBifAI team encountered is one familiar to many LLMOps practitioners: LLM outputs cannot be trusted without verification. When applications accept free-form user input, the reliability of LLM responses drops significantly. The developers identified three distinct failure modes that they had to contend with in production:

The first failure mode involves invalid JSON responses. When the LLM is instructed to return structured data in JSON format, it sometimes produces malformed output that cannot be parsed, resulting in a 400 error from the API. This is a purely technical failure unrelated to content safety.

The second failure mode involves schema non-conformance. Even when the LLM returns valid JSON, the response may not match the expected schema. The developers noted they attempted type casting (such as converting a string “2” to an integer 2) to handle minor deviations, but sometimes the response structure simply did not conform to what was requested.

The third failure mode involves safety filter rejections. When the safety filter determines a query is unsafe, the API returns a 400 error. Critically, this error code is indistinguishable from the technical failures mentioned above, making it difficult for the application to respond appropriately.

Prompt Engineering Approach

The developers structured their prompts in a specific pattern to handle the interactive nature of the game. The prompt architecture follows this general form:

The expected output format was {"choice": c, "sentence": s}, where c represents which predefined action category best matches the player’s input, and s provides a narrative description of the result. This structured output approach is a common pattern in LLMOps for ensuring that LLM responses can be programmatically processed by downstream systems.
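As a rough illustration of the first two failure modes, the following Python sketch parses a reply in the {"choice": c, "sentence": s} shape and applies the lenient type casting described earlier. It is not the developers' code: the function name parse_reply, the SchemaError class, the num_choices parameter, and the assumption that c is an integer index are all hypothetical.

```python
import json


class SchemaError(ValueError):
    """Valid JSON that does not match the expected {"choice", "sentence"} shape."""


def parse_reply(raw: str, num_choices: int) -> dict:
    """Parse and validate one LLM reply (hypothetical helper).

    Assumes "choice" is an integer index into num_choices predefined actions.
    """
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(data, dict):
        raise SchemaError("expected a JSON object")
    try:
        choice = int(data["choice"])  # tolerate "2" in place of 2
    except (KeyError, TypeError, ValueError) as exc:
        raise SchemaError("missing or non-integer 'choice'") from exc
    if not 0 <= choice < num_choices:
        raise SchemaError("'choice' is outside the predefined actions")
    sentence = data.get("sentence")
    if not isinstance(sentence, str) or not sentence.strip():
        raise SchemaError("missing or empty 'sentence'")
    return {"choice": choice, "sentence": sentence}
```

Keeping malformed JSON (json.JSONDecodeError) separate from schema problems (SchemaError) lets the caller log the two cases differently, even if both end up being retried.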

The Retry Mechanism Solution

Given the frequency of failures across all three categories, the developers implemented a retry mechanism as their primary mitigation strategy. Rather than failing immediately on any error, the system attempts the query up to three times before giving up. According to their playtesting observations (they note these are not hard metrics), roughly 75% of queries succeeded on the first attempt, and about 99% succeeded once retries were included.

This approach effectively papers over both technical failures and transient safety filter false positives. The developers acknowledge this is a “kludgy workaround” rather than an elegant solution, but it was necessary to maintain an acceptable user experience. The 25% initial failure rate is remarkably high and would be unacceptable in most production systems without some form of retry logic.
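A minimal sketch of such a retry loop is shown below, under stated assumptions: call_llm is a hypothetical callable standing in for whatever client the game uses, and LLMRequestError is a made-up exception representing the provider's undifferentiated 400 response. The source does not show the developers' actual implementation.

```python
import json
from typing import Callable, Optional


class LLMRequestError(Exception):
    """Hypothetical stand-in for the provider's undifferentiated 400 error."""


def query_with_retries(
    call_llm: Callable[[str], str],
    prompt: str,
    max_attempts: int = 3,
) -> Optional[dict]:
    """Try the same prompt up to max_attempts times; return None if all fail."""
    for _ in range(max_attempts):
        try:
            reply = json.loads(call_llm(prompt))
        except (LLMRequestError, json.JSONDecodeError):
            # A safety rejection and a technical failure look identical here,
            # so both are simply retried.
            continue
        if isinstance(reply, dict) and "choice" in reply and "sentence" in reply:
            return reply
        # Valid JSON with the wrong shape also counts as a failed attempt.
    return None
```

Returning None after the final attempt leaves it to the game to decide how to present the failure to the player.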

In-Game Safety Handling

One clever aspect of the JOBifAI implementation is how they designed the game mechanics to handle genuinely inappropriate player inputs. Rather than relying solely on the LLM safety filters, they built safety into the game’s narrative logic. The example given is illustrative: if a player attempts to ask about creating explosives, the in-game secretary character would realistically call security, resulting in an instant game over.

This approach has several advantages from an LLMOps perspective. It provides narrative consistency since the response matches what would happen in a real-world scenario. It also removes the need for the LLM itself to handle truly unsafe content, as the game logic intercepts and handles it. Additionally, it creates natural consequences for adversarial users without exposing the underlying technical limitations of the system.
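One possible way to express this in code is to give dangerous requests their own action category and let the game logic, not the model, decide the consequence. The sketch below is purely illustrative: the ACTIONS table, the Outcome type, and the resolve function are hypothetical, and the source does not describe how JOBifAI actually wires this up.

```python
from dataclasses import dataclass

# Hypothetical action categories; the source does not list the real ones.
ACTIONS = {
    0: "small_talk",
    1: "ask_about_job",
    2: "dangerous_request",  # e.g. asking how to make explosives
}


@dataclass
class Outcome:
    narration: str
    game_over: bool


def resolve(choice: int, sentence: str) -> Outcome:
    """Map a classified player input to an in-game consequence."""
    action = ACTIONS.get(choice, "small_talk")
    if action == "dangerous_request":
        # The game, not the LLM, decides what happens next.
        return Outcome("The secretary calls security. Game over.", game_over=True)
    return Outcome(sentence, game_over=False)
```

With this shape the model is only asked to classify the input, and the narrative consequence stays fully under the game's control.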

Criticism of Current API Design

A significant portion of the case study is devoted to critiquing how LLM providers implement safety filters and error handling. The developers argue for more granular error codes that would allow applications to distinguish between different types of issues. Their proposed error taxonomy includes:

The current situation, where safety rejections return the same 400 error as JSON parsing failures, prevents applications from implementing appropriate responses. A safety rejection might warrant showing the user a message about acceptable content, while a JSON parsing failure should be silently retried.
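To make the benefit concrete, the sketch below shows how an application could branch if failures were distinguishable. The FailureKind names are hypothetical labels for the three failure modes described in this write-up, not the developers' proposed taxonomy and not codes returned by any real API.

```python
from enum import Enum, auto


class FailureKind(Enum):
    # Hypothetical labels; no provider is known to return these.
    MALFORMED_JSON = auto()
    SCHEMA_MISMATCH = auto()
    SAFETY_REJECTION = auto()


def next_step(kind: FailureKind, attempts_left: int) -> str:
    """Choose how the application should react to a failed query."""
    if kind is FailureKind.SAFETY_REJECTION:
        # Retrying is unlikely to help; tell the player what is acceptable.
        return "show_content_message"
    if attempts_left > 0:
        # Technical failures are hidden from the player and retried.
        return "retry_silently"
    return "give_up"
```

With a single undifferentiated 400 error, the first branch cannot be implemented, which is why every failure ends up in the retry path.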

Cost and Business Implications

The developers raise important points about the business implications of unreliable LLM outputs. The retry mechanism effectively multiplies the cost of LLM usage, since queries that fail must be resubmitted. They note that if users were billed per query, technical failures “should theoretically not be counted as individual requests,” but in practice they often are.

More concerning is the unpredictability this introduces into cost planning. Adversarial users could potentially spam the system with queries designed to trigger safety filters, causing excessive retries and inflated costs. The developers mention the possibility of using simpler heuristics like word vectors to pre-filter obviously problematic queries, but acknowledge this adds complexity and is imperfect.

The observation that “safety triggers are generally even more unreliable than typical answers” is particularly noteworthy. This suggests that safety filtering systems may have higher variance in their decisions than the core LLM reasoning, making them a significant source of operational unpredictability.

Limitations and Caveats

It’s important to note some limitations of this case study. The success rate percentages cited are explicitly described as coming from playtesting observations rather than rigorous measurement, so they should be treated as approximate. The article also has a somewhat polemical tone regarding safety filters, which may color the presentation of the challenges encountered.

The developers explicitly describe JOBifAI as a “Proof of Concept” released for free, and note that its “unreliable foundations would deter us from developing it into a full-fledged program.” This is an honest assessment that the current state of LLM reliability was sufficient for a demo but would not meet the bar for a commercial product.

Broader LLMOps Lessons

Despite the critical tone, this case study surfaces several valuable lessons for LLMOps practitioners:

Retry mechanisms are essential when deploying LLMs in production. A single-attempt approach will result in unacceptably high failure rates for interactive applications. Planning for at least 2-3 retries should be standard practice.

Structured output formats like JSON are valuable but imperfect. Even with clear instructions, LLMs will sometimes produce malformed output, and applications need graceful degradation paths.

Safety filters introduce significant operational complexity. Their current implementation as opaque, binary decisions with undifferentiated error codes makes building robust applications more difficult than necessary.

Application-level safety design can complement or even replace reliance on model-level safety filters. Building safety into the application logic, as JOBifAI did with their game-over mechanic, provides more predictable and narratively consistent handling.

Cost estimation for LLM-powered applications must account for retries and failure rates. With roughly a quarter of first attempts failing, the effective cost per successful query may be 1.3-1.5x the nominal per-query cost once retries are factored in.
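The lower end of that range can be checked with a short calculation. The sketch below assumes every attempt fails independently with the same 25% probability and that every attempt is billed; both are simplifying assumptions rather than figures from the developers, and retry_stats is a hypothetical helper.

```python
from typing import Tuple


def retry_stats(p_success: float, max_attempts: int) -> Tuple[float, float]:
    """Return (overall success rate, expected attempts per query).

    Assumes attempts are independent with the same success probability.
    """
    p_fail = 1.0 - p_success
    overall = 1.0 - p_fail ** max_attempts
    # Attempt k+1 is only made if the first k attempts all failed.
    expected_attempts = sum(p_fail ** k for k in range(max_attempts))
    return overall, expected_attempts


overall, attempts = retry_stats(0.75, 3)
print(f"{overall:.3f} {attempts:.4f} {attempts / overall:.3f}")
# prints "0.984 1.3125 1.333"
```

Under these assumptions, three attempts yield about a 98.4% overall success rate, close to the reported 99%, at roughly 1.33 billed attempts per successful query. Correlated failures, such as an input that reliably triggers the safety filter, would push the multiplier higher.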

The case study also highlights the importance of transparent benchmarking and documentation. The developers reference the “Uncensored General Intelligence Leaderboard” as an example of efforts to quantify model censorship levels, suggesting that this information is valuable for developers choosing which models to deploy.

Conclusion

The JOBifAI case study provides a ground-level view of the challenges facing developers building interactive LLM-powered applications in 2025. While the developers successfully shipped a working proof of concept, their experience highlights significant operational hurdles that remain in the LLMOps space, particularly around error handling, safety filter implementation, and cost predictability. The technical solutions they implemented—retry mechanisms and application-level safety design—offer practical patterns for other developers facing similar challenges, even as the underlying platform limitations remain unresolved.
