Yuewen Group, a global online literature platform, transitioned from traditional NLP models to Anthropic's Claude 3.5 Sonnet on Amazon Bedrock for intelligent text processing. When unoptimized prompts initially performed worse than the traditional models, the team adopted Amazon Bedrock's Prompt Optimization feature to enhance prompts automatically. This raised accuracy on tasks such as character dialogue attribution to 90%, compared with 70% for the unoptimized prompts and 80% for the traditional NLP models.
Yuewen Group is a major global player in online literature and intellectual property (IP) operations, operating the overseas platform WebNovel, which serves approximately 260 million users across more than 200 countries and regions. The company promotes Chinese web literature globally and adapts web novels into films and animations for international markets. This case study, published by AWS in April 2025, describes how Yuewen Group transitioned from traditional NLP models to LLM-based text processing using Amazon Bedrock, and specifically how they addressed performance challenges through automated prompt optimization.
It is worth noting that this case study is presented through an AWS blog post, which means it serves promotional purposes for Amazon Bedrock’s Prompt Optimization feature. While the technical details and reported results are informative, readers should be aware that the narrative is constructed to highlight the success of AWS’s offering. The specific accuracy figures (70%, 80%, 90%) should be viewed with some caution as they lack detailed methodology about how they were measured.
Yuewen Group initially relied on proprietary NLP models for intelligent analysis of their extensive web novel texts. These traditional models faced challenges including prolonged development cycles and slow updates. To improve both performance and efficiency, the company decided to transition to Anthropic’s Claude 3.5 Sonnet through Amazon Bedrock.
The LLM approach offered several theoretical advantages: enhanced natural language understanding and generation capabilities, the ability to handle multiple tasks concurrently, improved context comprehension, and better generalization. Using Amazon Bedrock as the managed infrastructure layer significantly reduced technical overhead and streamlined the development process compared to self-hosting models.
However, the transition revealed a critical challenge: Yuewen Group’s limited experience in prompt engineering meant they could not initially harness the full potential of the LLM. In their “character dialogue attribution” task—a core text analysis function for understanding which character speaks which line in a novel—traditional NLP models achieved approximately 80% accuracy while the LLM with unoptimized prompts only reached around 70%. This 10-percentage-point gap demonstrated that simply adopting an LLM was not sufficient; strategic prompt optimization was essential.
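To make the gap concrete, an under-specified prompt of the kind described here might look like the following sketch. The prompt wording, placeholder name, and comments are hypothetical illustrations, not Yuewen Group's actual prompt.

```python
# Hypothetical example of an under-specified ("lazy") prompt for character
# dialogue attribution. The placeholder name and wording are illustrative only.
UNOPTIMIZED_PROMPT = """Identify which character speaks each line of dialogue
in the following novel excerpt.

{{novel_excerpt}}"""

# A prompt like this leaves the output format, the handling of narration, and
# the treatment of ambiguous speakers unspecified, which is the kind of gap
# that prompt optimization is meant to close.
```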
The case study articulates several key challenges in manual prompt optimization that are relevant to anyone operating LLMs in production:
Difficulty in Evaluation: Assessing prompt quality is inherently complex because effectiveness depends not just on the prompt itself but on its interaction with the specific language model’s architecture and training data. For open-ended tasks, evaluating LLM response quality often involves subjective and qualitative judgments, making it challenging to establish objective optimization criteria. This is a well-recognized problem in LLMOps—the lack of standardized evaluation metrics for many real-world tasks.
Context Dependency: Prompts that work well in one scenario may underperform in another, requiring extensive customization for different applications. This poses significant challenges for organizations looking to scale their LLM applications across diverse use cases, as each new task potentially requires its own prompt engineering effort.
Scalability: As LLM applications grow, the number of required prompts and their complexity increase correspondingly. Manual optimization becomes increasingly time-consuming and labor-intensive. The search space for optimal prompts grows exponentially with prompt complexity, making exhaustive manual exploration infeasible.
These challenges reflect real operational concerns for teams deploying LLMs at scale. The need for specialized prompt engineering expertise creates a bottleneck that can slow deployment timelines and limit the breadth of LLM adoption within an organization.
Yuewen Group adopted Amazon Bedrock’s Prompt Optimization feature, which is described as an AI-driven capability that automatically optimizes “under-developed prompts” for specific use cases. The feature is integrated into Amazon Bedrock Playground and Prompt Management, allowing users to create, evaluate, store, and use optimized prompts through both API calls and console interfaces.
The underlying system combines two components:
Prompt Analyzer: A fine-tuned LLM that decomposes the prompt structure by extracting key constituent elements such as task instructions, input context, and few-shot demonstrations. This component essentially performs prompt parsing and understanding.
Prompt Rewriter: Employs a “general LLM-based meta-prompting strategy” to improve prompt signatures and restructure prompt layout. The rewriter produces a refined version of the initial prompt tailored to the target LLM.
The workflow is relatively straightforward from a user perspective: users input their original prompt (which can include template variables represented by placeholders like {{document}}), select a target LLM from the supported list, and initiate optimization with a single click. The optimized prompt is generated within seconds and displayed alongside the original for comparison.
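For teams invoking the feature programmatically rather than through the console, the flow looks roughly like the sketch below. It assumes the boto3 `bedrock-agent-runtime` client's `optimize_prompt` operation and a streamed response containing analysis and optimized-prompt events; the region, model ID, prompt text, and event handling are illustrative, not taken from Yuewen Group's implementation.

```python
import boto3

# Assumed operation: the bedrock-agent-runtime optimize_prompt API.
# Region, model ID, and prompt text are illustrative placeholders.
client = boto3.client("bedrock-agent-runtime", region_name="us-west-2")

prompt = """Identify which character speaks each line of dialogue
in the following novel excerpt.

{{novel_excerpt}}"""

response = client.optimize_prompt(
    input={"textPrompt": {"text": prompt}},
    targetModelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
)

# The result is assumed to arrive as an event stream: analysis events describe
# how the prompt was decomposed, and the optimized-prompt event carries the
# rewritten prompt tailored to the target model.
for event in response["optimizedPrompt"]:
    if "analyzePromptEvent" in event:
        print("Analysis:", event["analyzePromptEvent"])
    elif "optimizedPromptEvent" in event:
        print("Optimized prompt:", event["optimizedPromptEvent"])
```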
According to AWS, the optimized prompts typically include more explicit instructions on processing input variables and generating desired output formats. The system has been evaluated on open-source datasets across various task types including classification, summarization, open-book QA/RAG, and agent/function-calling scenarios.
Using Bedrock Prompt Optimization, Yuewen Group reports significant improvements across various intelligent text analysis tasks. In the character dialogue attribution task, accuracy rose from roughly 70% with unoptimized prompts to 90% with optimized prompts, exceeding the 80% achieved by the company's traditional NLP models.
Beyond accuracy improvements, the case study emphasizes development efficiency gains—prompt engineering processes were completed in “a fraction of the time” compared to manual optimization approaches.
It should be noted that these results are self-reported through an AWS promotional blog post, and details about evaluation methodology, dataset sizes, and statistical significance are not provided. The 20-percentage-point improvement from unoptimized LLM prompts (70%) to optimized prompts (90%) is substantial and should be viewed in context—it suggests the initial prompts may have been significantly underspecified.
The case study includes several operational best practices that have broader applicability for LLMOps:
Use clear and precise input prompts: Even automated optimization benefits from clear intent and well-structured starting points. Separating prompt sections with new lines is recommended.
Language considerations: English is recommended as the input language, as prompts containing significant portions of other languages may not yield optimal results. This is a notable limitation for organizations operating in multilingual contexts.
Avoid overly long prompts and examples: Excessively long prompts and few-shot examples increase semantic understanding difficulty and strain output length limits. The recommendation is to structure placeholders clearly rather than embedding them within sentences (see the sketch after these practices).
Timing of optimization: Prompt Optimization is described as most effective during early stages of prompt engineering, where it can quickly optimize less-structured “lazy prompts.” The improvement is likely to be more significant for such prompts compared to those already curated by experienced prompt engineers.
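As an illustration of the placeholder guidance above, a prompt that keeps sections and variables on their own lines might look like the following sketch; the section labels and placeholder names are hypothetical, not drawn from the case study.

```python
# Hypothetical illustration of the structuring guidance: sections separated by
# new lines, and placeholders kept on their own lines rather than buried in prose.
STRUCTURED_PROMPT = """Task:
Attribute each line of dialogue in the excerpt to the character who speaks it.

Output format:
Return one line per utterance as "<character>: <dialogue>".

Novel excerpt:
{{novel_excerpt}}

Known characters:
{{character_list}}"""
```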
This case study highlights several important LLMOps considerations:
Model Migration Challenges: Transitioning from traditional ML/NLP models to LLMs is not simply a drop-in replacement. Organizations may experience performance regressions if they lack prompt engineering expertise, even when using more capable foundation models.
Infrastructure Abstraction: Using managed services like Amazon Bedrock reduces operational overhead compared to self-hosted model deployments, allowing teams to focus on application development rather than infrastructure management.
Prompt Management as a Core Capability: The integration of prompt optimization with prompt management tools reflects the growing recognition that prompts are first-class artifacts in LLM systems that require version control, testing, and optimization tooling.
Automation of Prompt Engineering: The emergence of automated prompt optimization tools suggests a trend toward reducing the specialized expertise required to deploy effective LLM applications, potentially democratizing LLM adoption across organizations with varying levels of AI expertise.
Trade-offs and Limitations: While the case study emphasizes successes, the best practices section reveals limitations—particularly around language support and prompt complexity. Organizations should evaluate these constraints against their specific requirements.
While this case study presents compelling results, several factors warrant consideration: the figures are self-reported through an AWS promotional channel, the evaluation methodology, dataset sizes, and statistical significance are not disclosed, and the feature's stated limitations around language support and prompt length may not fit every workload.
Despite these caveats, the case study provides useful insights into the practical challenges of deploying LLMs for production text processing tasks and illustrates one approach to addressing prompt engineering at scale.