ZenML

AI-Powered Medical Content Review and Revision at Scale

Flo Health 2026

Flo Health, a leading women's health app, partnered with AWS Generative AI Innovation Center to develop MACROS (Medical Automated Content Review and Revision Optimization Solution), an AI-powered system for verifying and maintaining the accuracy of thousands of medical articles. The solution uses Amazon Bedrock foundation models to automatically review medical content against established guidelines, identify outdated or inaccurate information, and propose evidence-based revisions while maintaining Flo's editorial style. The proof of concept achieved 80% accuracy and over 90% recall in identifying content requiring updates, significantly reduced processing time from hours to minutes per guideline, and demonstrated more consistent application of medical guidelines compared to manual reviews while reducing the workload on medical experts.

Industry

Healthcare

Overview

Flo Health operates one of the world’s leading women’s health apps, creating thousands of medical articles annually to provide medically credible information to millions of users worldwide. The company faced a significant operational challenge: maintaining the accuracy and currency of their vast content library as medical knowledge continuously evolves. Manual review of each article is time-consuming, expensive, and prone to inconsistency. In partnership with AWS Generative AI Innovation Center, Flo Health developed MACROS (Medical Automated Content Review and Revision Optimization Solution), a production-grade AI system that automates the review and revision of medical content at scale while maintaining rigorous accuracy standards.

The proof of concept phase, documented in this case study from January 2026, established clear success metrics including 90% content piece recall, 10x faster processing than manual review, significant cost reduction in expert review workload, and maintaining Flo’s editorial standards. The actual results exceeded several targets, achieving over 90% recall and 80% accuracy while demonstrating more consistent guideline application than manual processes. Critically, the project embraced a human-AI collaboration model rather than full automation, with medical experts remaining essential for final validation—a pragmatic approach that balances efficiency gains with medical safety requirements.

Architecture and Infrastructure

MACROS employs a comprehensive AWS-based architecture built for production deployment. The system consists of frontend and backend components orchestrated through AWS Step Functions, with Amazon Elastic Container Service (ECS) hosting the Streamlit-based user interface. Authentication flows through Amazon API Gateway, while Amazon S3 serves as the central data repository for storing input documents, processed results, and refined guidelines.

The architecture demonstrates thoughtful separation of concerns with distinct AWS Lambda functions handling different processing stages: pre-processing with Amazon Textract for PDF text extraction, rule optimization, content review, revision generation, and statistics computation. Each Lambda function has carefully scoped IAM permissions to access Amazon Bedrock for generative AI capabilities and S3 for data persistence. Amazon CloudWatch provides system monitoring and log management, though the team acknowledges that production deployments dealing with critical medical content could benefit from enhanced monitoring with custom metrics and alarms for more granular performance insights.

The team explicitly considered alternative architectures, noting their exploration of Amazon Bedrock Flows as a potential future enhancement. This forward-looking perspective suggests they’re continuously evaluating emerging AWS capabilities that could simplify workflow orchestration and enhance integration with Bedrock services as they scale.

Core Content Review and Revision Pipeline

The heart of MACROS lies in its five-stage content processing pipeline, which demonstrates sophisticated LLMOps design principles. The pipeline begins with an optional filtering stage that determines whether specific rule sets are relevant to a given article, potentially saving significant processing costs by avoiding unnecessary reviews. This filtering can be implemented through direct LLM calls via Amazon Bedrock or through non-LLM approaches such as embedding-based similarity calculations or keyword-level overlap metrics using BLEU or ROUGE scores.
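The case study does not publish the filtering implementation, but the non-LLM variant it mentions can be sketched as a simple keyword-overlap relevance check (a ROUGE-1-recall-style score). The function names and the threshold below are illustrative assumptions, not Flo Health's actual code.

```python
import re


def keyword_overlap(article: str, rule_set: str) -> float:
    """Fraction of the rule set's unique word tokens that also appear in the
    article -- a cheap, ROUGE-1-recall-like relevance signal."""
    def tokenize(text: str) -> set[str]:
        return set(re.findall(r"[a-z]+", text.lower()))

    rule_tokens = tokenize(rule_set)
    if not rule_tokens:
        return 0.0
    return len(rule_tokens & tokenize(article)) / len(rule_tokens)


def is_relevant(article: str, rule_set: str, threshold: float = 0.3) -> bool:
    """Gate the expensive LLM review: skip rule sets that share too few
    terms with the article. The threshold is a tunable assumption."""
    return keyword_overlap(article, rule_set) >= threshold
```

An embedding-based variant would replace `keyword_overlap` with cosine similarity between article and rule-set embeddings, trading a small compute cost for better handling of paraphrase.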

The chunking stage strategically splits source text into paragraphs or semantic sections, a critical design decision that facilitates high-quality assessment and prevents unintended revisions to unrelated content. The team supports both heuristic approaches using punctuation or regular expressions and LLM-based semantic chunking, demonstrating flexibility in balancing accuracy with cost.
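The heuristic variant of this chunking step can be as simple as splitting on blank lines so each paragraph is assessed in isolation. This is a minimal sketch of that approach, not the team's actual splitter:

```python
import re


def chunk_paragraphs(text: str) -> list[str]:
    """Heuristic chunking: split source text on blank lines so each
    paragraph can be reviewed (and, if needed, revised) independently,
    preventing unintended edits to unrelated content."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
```

The LLM-based alternative the team supports would instead ask a small model to mark semantic section boundaries, which handles documents without reliable paragraph formatting at higher cost.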

The review stage performs the actual adherence assessment, evaluating each text section against relevant rules and guidelines. The system outputs structured XML responses indicating adherence status (binary 1/0), specific non-adherent rules, and explanatory reasons. This structured output format proves essential for reliable parsing and downstream processing. The review can operate in standard mode with a single Bedrock call assessing all rules, or in “multi-call” mode where each rule gets an independent assessment—a configuration option that trades processing cost for potentially higher detection accuracy.

Only sections flagged as non-adherent proceed to the revision stage, where the system generates suggested updates that align with the latest guidelines while maintaining Flo’s distinctive style and tone. This selective revision approach maintains the integrity of adherent content while focusing computational resources where they’re needed. Finally, the post-processing stage seamlessly integrates revised paragraphs back into the original document structure.
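The post-processing splice can be sketched as a by-index merge of revised chunks back into the original sequence; this is an illustrative sketch, assuming chunks are tracked by position:

```python
def reintegrate(chunks: list[str], revisions: dict[int, str]) -> str:
    """Post-processing: replace only the chunks that were revised (keyed by
    their position in the original document), leaving adherent chunks
    byte-for-byte untouched, then rejoin into one document."""
    return "\n\n".join(revisions.get(i, chunk) for i, chunk in enumerate(chunks))
```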

Model Selection Strategy

The MACROS solution demonstrates sophisticated understanding of model capability-cost tradeoffs by employing a tiered approach to model selection across different pipeline stages. Simpler tasks like chunking utilize smaller, more cost-efficient models from the Claude Haiku family, while complex reasoning tasks requiring nuanced understanding—such as content review and revision—leverage larger models from the Claude Sonnet or Opus families. This differentiated model strategy optimizes both performance quality and operational costs, reflecting mature LLMOps thinking.
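A tiered strategy like this often reduces to a stage-to-model routing table. The sketch below uses real Bedrock model ID formats, but the specific IDs and stage assignments are illustrative assumptions, not the ones MACROS uses:

```python
# Illustrative stage-to-model routing for Amazon Bedrock. The model IDs
# and tier assignments are assumptions for the sketch, not Flo Health's.
STAGE_MODELS = {
    "chunking": "anthropic.claude-3-haiku-20240307-v1:0",     # simple task, cheap tier
    "review":   "anthropic.claude-3-5-sonnet-20240620-v1:0",  # nuanced reasoning
    "revision": "anthropic.claude-3-5-sonnet-20240620-v1:0",  # style-preserving rewrite
}


def model_for(stage: str) -> str:
    """Resolve the Bedrock model ID for a pipeline stage, defaulting
    unknown stages to the cheapest tier."""
    return STAGE_MODELS.get(stage, STAGE_MODELS["chunking"])
```

Keeping the mapping in one place is what makes the "upgrade models without touching the architecture" property the team values: swapping a model ID in the table upgrades a stage across the whole pipeline.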

The team explicitly values Amazon Bedrock’s model diversity, noting how the platform enables them to choose optimal models for specific tasks, achieve cost efficiency without sacrificing accuracy, and upgrade to newer models smoothly while maintaining their existing architecture. This abstraction from specific model versions provides operational flexibility crucial for production systems that need to evolve with rapidly advancing foundation model capabilities.

Rule Optimization and Knowledge Extraction

Beyond content review, MACROS includes a sophisticated Rule Optimizer feature that extracts and refines actionable guidelines from unstructured source documents—a critical capability given that medical guidelines often arrive in complex PDF formats rather than machine-readable structures. The Rule Optimizer processes raw PDFs through Amazon Textract for text extraction, chunks content based on document headers, and processes segments through Amazon Bedrock using specialized system prompts.

The system supports two distinct operational modes: “Style/tonality” mode focuses on extracting editorial guidelines about writing style, formatting, and permissible language, while “Medical” mode processes scientific documents to extract three classes of rules: medical condition guidelines, treatment-specific guidelines, and changes to medical advice or health trends. Each extracted rule receives a priority classification (high, medium, low) to guide subsequent review ordering and focus attention appropriately.

The Rule Optimizer defines quality criteria for extracted rules, requiring them to be clear, unambiguous, actionable, relevant, consistent, concise (maximum two sentences), written in active voice, and avoiding unnecessary jargon. This explicit quality framework demonstrates thoughtful prompt engineering for structured knowledge extraction. Importantly, the system includes a manual editing interface where users can refine rule text, adjust classifications, and manage priorities, with changes persisted to S3 for future use—a pragmatic acknowledgment that AI extraction requires human oversight and refinement.
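The extracted rules described above map naturally onto a small record type with validated priorities. The field and class names here are illustrative assumptions about how such rules might be represented before persistence to S3:

```python
from dataclasses import dataclass


@dataclass
class ExtractedRule:
    """One guideline produced by rule extraction. Field names are
    illustrative; the case study specifies the content constraints
    (max two sentences, active voice) and the priority levels."""
    text: str       # clear, actionable, at most two sentences, active voice
    category: str   # e.g. "medical_condition", "treatment", "advice_change"
    priority: str = "medium"

    def __post_init__(self) -> None:
        if self.priority not in {"high", "medium", "low"}:
            raise ValueError(f"unknown priority: {self.priority!r}")
```

Validating the priority at construction time keeps the manual editing interface honest: an expert's edit that drifts outside the three-level scheme fails immediately instead of silently skewing review ordering later.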

User Interface and Operating Modes

MACROS supports two primary UI modes addressing different operational scales. “Detailed Document Processing” mode provides granular content assessment for individual documents, accepting input as PDF, TXT, or JSON files, or as directly pasted text. Users select from predefined rule sets (examples include Vitamin D, Breast Health, and PMS/PMDD guidelines) or supply custom guidelines with adherent and non-adherent examples. This mode facilitates thorough interactive review with on-the-fly adjustments.

“Multi Document Processing” mode handles batch operations across numerous JSON files simultaneously, designed to mirror how Flo would integrate MACROS into their content management system for periodic automated assessment. The architecture also supports direct API calls alongside UI access, enabling both interactive expert review and programmatic pipeline integration—a dual interface approach that serves different stakeholder needs from medical experts to system integrators.

Implementation Challenges and Practical Considerations

The case study candidly discusses several implementation challenges that provide valuable insights for practitioners. Data preparation emerged as a fundamental challenge, requiring standardization of input formats for both medical content and guidelines while maintaining consistent document structures. Creating diverse test sets across different medical topics proved essential for comprehensive validation—a reminder that healthcare AI requires domain-representative evaluation data.

Cost management quickly became a priority, necessitating token usage tracking and optimization of both prompt design and batch processing strategies. The team implemented monitoring to balance performance with efficiency, reflecting real production concerns where per-token pricing makes optimization economically important.
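Token-usage tracking of the kind described usually amounts to a per-stage tally with a price model applied at the end. This is a minimal sketch; the per-1K-token rates are placeholder assumptions, not actual Bedrock pricing:

```python
class TokenCostTracker:
    """Running per-stage token tally with a simple cost estimate.
    Rates are placeholder assumptions (USD per 1K tokens)."""

    def __init__(self, usd_per_1k_input: float = 0.003, usd_per_1k_output: float = 0.015):
        self.in_rate = usd_per_1k_input
        self.out_rate = usd_per_1k_output
        self.usage: dict[str, list[int]] = {}  # stage -> [input_tokens, output_tokens]

    def record(self, stage: str, input_tokens: int, output_tokens: int) -> None:
        tally = self.usage.setdefault(stage, [0, 0])
        tally[0] += input_tokens
        tally[1] += output_tokens

    def cost(self) -> float:
        return sum(i / 1000 * self.in_rate + o / 1000 * self.out_rate
                   for i, o in self.usage.values())
```

Per-stage tallies also make the filtering stage's value measurable: skipped reviews show up directly as tokens never recorded against the expensive review and revision stages.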

Regulatory and ethical compliance considerations loom large given the sensitive nature of medical content. The team established robust documentation practices for AI decisions, implemented strict version control for medical guidelines, and maintained continuous human medical expert oversight for AI-generated suggestions. Regional healthcare regulations were carefully considered throughout implementation, though specific compliance frameworks aren’t detailed in this proof of concept phase.

Integration and scaling challenges led the team to start with a standalone testing environment while planning for future CMS integration through well-designed API endpoints. Building with modularity in mind proved valuable for accommodating future enhancements. Throughout the process, they faced common challenges including maintaining context in long medical articles, balancing processing speed with accuracy, and ensuring consistent tone across AI-suggested revisions.

Performance Monitoring and Validation

While the architecture includes Amazon CloudWatch for basic monitoring and log management, the team acknowledges that production deployments handling critical medical content warrant more sophisticated observability. They suggest future enhancements with custom metrics and alarms to provide granular insights into system performance and content processing patterns—an honest assessment that their current monitoring setup, while functional, represents an area for production hardening.

The validation approach involved close collaboration between Flo Health’s medical experts and AWS technical specialists through regular review sessions. This human-in-the-loop validation process provided critical feedback and medical expertise to continuously enhance AI model performance and accuracy. The emphasis on expert validation of parsing rules and maintaining clinical precision reflects appropriate caution when deploying AI in high-stakes medical contexts.

Results and Critical Assessment

The proof of concept delivered promising results across key success metrics. The solution maintained 80% accuracy and achieved over 90% recall in identifying content requiring updates, while processing time dropped from hours to minutes per guideline, in line with the 10x improvement target. Most notably, the AI-powered system applied medical guidelines more consistently than manual reviews and significantly reduced the time burden on medical experts.

However, the case study warrants careful interpretation as promotional AWS content. The 80% accuracy figure falls short of the stated 90% goal, though the over 90% recall meets targets. The text doesn’t detail the evaluation methodology, sample size, or how accuracy and recall were precisely defined and measured in this context. The comparison to “manual review” baseline isn’t quantified with specific time or accuracy metrics for the human process, making it difficult to assess the true magnitude of improvements.

The emphasis on “human-AI collaboration” rather than automation, while ethically appropriate for medical content, also suggests the system hasn’t achieved the level of reliability required for autonomous operation. Medical experts remain “essential for final validation,” meaning the actual operational improvement may be more modest than headline numbers suggest—the system accelerates but doesn’t eliminate expert review work.

Key Learnings and Best Practices

The team distilled several insights valuable for practitioners. Content chunking emerged as essential for accurate assessment across long documents, with expert validation of parsing rules helping maintain clinical precision. The most important conclusion: human-AI collaboration, not full automation, represents the appropriate implementation model. Regular expert feedback, clear performance metrics, and incremental improvements guided system refinements.

The project confirmed that while the system significantly streamlines review processes, it works best as an augmentation tool with medical experts remaining essential for final validation. This creates a more efficient hybrid approach to medical content management—a pragmatic conclusion that may disappoint those seeking radical automation but reflects responsible deployment in high-stakes domains.

Production Considerations and Future Directions

This case study represents Part 1 of a two-part series focusing on proof of concept results. The team indicates Part 2 will cover the production journey, diving into scaling challenges and strategies—suggesting the PoC-to-production transition presented significant additional complexity not addressed here.

The technical architecture appears production-ready in structure with appropriate service separation, authentication, monitoring, and data persistence. However, several elements suggest early-stage deployment: the acknowledgment that monitoring could be enhanced for production medical content, the exploration of alternative orchestration approaches like Bedrock Flows, and the focus on standalone testing environments with planned future CMS integration all indicate this solution is approaching but hasn’t fully reached mature production operation.

The cost optimization discussion and tiered model selection strategy demonstrate thoughtful LLMOps maturity, as does the dual API/UI interface design supporting both interactive and automated workflows. The rule optimization capability for extracting guidelines from unstructured sources addresses a real operational challenge beyond simple content review, adding significant practical value.

Overall, this case study documents a substantial and thoughtfully designed LLMOps implementation addressing genuine operational challenges in medical content management. While the promotional nature of AWS content requires critical reading and some claims lack detailed substantiation, the technical approach demonstrates solid LLMOps principles including appropriate model selection, structured output parsing, human-in-the-loop validation, cost optimization, and modular architecture supporting both expert interaction and system integration. The candid discussion of challenges and the emphasis on augmentation rather than replacement of human experts lend credibility to the narrative.
