ZenML

AI-Powered Medical Content Review and Revision at Scale

Flo Health 2026

Flo Health, a leading women's health app, partnered with AWS Generative AI Innovation Center to develop MACROS (Medical Automated Content Review and Revision Optimization Solution), an AI-powered system for verifying and maintaining the accuracy of thousands of medical articles. The solution uses Amazon Bedrock foundation models to automatically review medical content against established guidelines, identify outdated or inaccurate information, and propose evidence-based revisions while maintaining Flo's editorial style. The proof of concept achieved 80% accuracy and over 90% recall in identifying content requiring updates, significantly reduced processing time from hours to minutes per guideline, and demonstrated more consistent application of medical guidelines compared to manual reviews while reducing the workload on medical experts.

Industry

Healthcare

Overview

Flo Health operates one of the world’s leading women’s health apps, creating thousands of medical articles annually to provide medically credible information to millions of users worldwide. The company faced a significant operational challenge: maintaining the accuracy and currency of their vast content library as medical knowledge continuously evolves. Manual review of each article is time-consuming, expensive, and prone to inconsistency. In partnership with AWS Generative AI Innovation Center, Flo Health developed MACROS (Medical Automated Content Review and Revision Optimization Solution), a production-grade AI system that automates the review and revision of medical content at scale while maintaining rigorous accuracy standards.

The proof of concept phase, documented in this case study from January 2026, established clear success metrics including 90% content piece recall, 10x faster processing than manual review, significant cost reduction in expert review workload, and maintaining Flo’s editorial standards. The actual results exceeded several targets, achieving over 90% recall and 80% accuracy while demonstrating more consistent guideline application than manual processes. Critically, the project embraced a human-AI collaboration model rather than full automation, with medical experts remaining essential for final validation—a pragmatic approach that balances efficiency gains with medical safety requirements.

Architecture and Infrastructure

MACROS employs a comprehensive AWS-based architecture built for production deployment. The system consists of frontend and backend components orchestrated through AWS Step Functions, with Amazon Elastic Container Service (ECS) hosting the Streamlit-based user interface. Authentication flows through Amazon API Gateway, while Amazon S3 serves as the central data repository for storing input documents, processed results, and refined guidelines.

The architecture demonstrates thoughtful separation of concerns with distinct AWS Lambda functions handling different processing stages: pre-processing with Amazon Textract for PDF text extraction, rule optimization, content review, revision generation, and statistics computation. Each Lambda function has carefully scoped IAM permissions to access Amazon Bedrock for generative AI capabilities and S3 for data persistence. Amazon CloudWatch provides system monitoring and log management, though the team acknowledges that production deployments dealing with critical medical content could benefit from enhanced monitoring with custom metrics and alarms for more granular performance insights.

The team explicitly considered alternative architectures, noting their exploration of Amazon Bedrock Flows as a potential future enhancement. This forward-looking perspective suggests they’re continuously evaluating emerging AWS capabilities that could simplify workflow orchestration and enhance integration with Bedrock services as they scale.

Core Content Review and Revision Pipeline

The heart of MACROS lies in its five-stage content processing pipeline, which demonstrates sophisticated LLMOps design principles. The pipeline begins with an optional filtering stage that determines whether specific rule sets are relevant to a given article, potentially saving significant processing costs by avoiding unnecessary reviews. This filtering can be implemented through direct LLM calls via Amazon Bedrock or through non-LLM approaches such as embedding-based similarity calculations or keyword-level overlap metrics using BLEU or ROUGE scores.
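The case study does not publish the filtering implementation, but the non-LLM variant it mentions can be sketched as a simple keyword-overlap relevance check (a ROUGE-1-recall-style score). The function names and the threshold below are illustrative assumptions, not Flo Health's actual code.

```python
import re


def keyword_overlap(article: str, rule_set: str) -> float:
    """Fraction of the rule set's unique word tokens that also appear in the
    article -- a cheap, ROUGE-1-recall-like relevance signal."""
    def tokenize(text: str) -> set[str]:
        return set(re.findall(r"[a-z]+", text.lower()))

    rule_tokens = tokenize(rule_set)
    if not rule_tokens:
        return 0.0
    return len(rule_tokens & tokenize(article)) / len(rule_tokens)


def is_relevant(article: str, rule_set: str, threshold: float = 0.3) -> bool:
    """Gate the expensive LLM review: skip rule sets that share too few
    terms with the article. The threshold is a tunable assumption."""
    return keyword_overlap(article, rule_set) >= threshold
```

An embedding-based variant would replace `keyword_overlap` with cosine similarity between article and rule-set embeddings, trading a small compute cost for better handling of paraphrase.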

The chunking stage strategically splits source text into paragraphs or semantic sections, a critical design decision that facilitates high-quality assessment and prevents unintended revisions to unrelated content. The team supports both heuristic approaches using punctuation or regular expressions and LLM-based semantic chunking, demonstrating flexibility in balancing accuracy with cost.
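The heuristic variant of this chunking step can be as simple as splitting on blank lines so each paragraph is assessed in isolation. This is a minimal sketch of that approach, not the team's actual splitter:

```python
import re


def chunk_paragraphs(text: str) -> list[str]:
    """Heuristic chunking: split source text on blank lines so each
    paragraph can be reviewed (and, if needed, revised) independently,
    preventing unintended edits to unrelated content."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
```

The LLM-based alternative the team supports would instead ask a small model to mark semantic section boundaries, which handles documents without reliable paragraph formatting at higher cost.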

The review stage performs the actual adherence assessment, evaluating each text section against relevant rules and guidelines. The system outputs structured XML responses indicating adherence status (binary 1/0), specific non-adherent rules, and explanatory reasons. This structured output format proves essential for reliable parsing and downstream processing. The review can operate in standard mode with a single Bedrock call assessing all rules, or in “multi-call” mode where each rule gets an independent assessment—a configuration option that trades processing cost for potentially higher detection accuracy.

Only sections flagged as non-adherent proceed to the revision stage, where the system generates suggested updates that align with the latest guidelines while maintaining Flo’s distinctive style and tone. This selective revision approach maintains the integrity of adherent content while focusing computational resources where they’re needed. Finally, the post-processing stage seamlessly integrates revised paragraphs back into the original document structure.
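The post-processing splice can be sketched as a by-index merge of revised chunks back into the original sequence; this is an illustrative sketch, assuming chunks are tracked by position:

```python
def reintegrate(chunks: list[str], revisions: dict[int, str]) -> str:
    """Post-processing: replace only the chunks that were revised (keyed by
    their position in the original document), leaving adherent chunks
    byte-for-byte untouched, then rejoin into one document."""
    return "\n\n".join(revisions.get(i, chunk) for i, chunk in enumerate(chunks))
```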

Model Selection Strategy

The MACROS solution demonstrates sophisticated understanding of model capability-cost tradeoffs by employing a tiered approach to model selection across different pipeline stages. Simpler tasks like chunking utilize smaller, more cost-efficient models from the Claude Haiku family, while complex reasoning tasks requiring nuanced understanding—such as content review and revision—leverage larger models from the Claude Sonnet or Opus families. This differentiated model strategy optimizes both performance quality and operational costs, reflecting mature LLMOps thinking.
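A tiered strategy like this often reduces to a stage-to-model routing table. The sketch below uses real Bedrock model ID formats, but the specific IDs and stage assignments are illustrative assumptions, not the ones MACROS uses:

```python
# Illustrative stage-to-model routing for Amazon Bedrock. The model IDs
# and tier assignments are assumptions for the sketch, not Flo Health's.
STAGE_MODELS = {
    "chunking": "anthropic.claude-3-haiku-20240307-v1:0",     # simple task, cheap tier
    "review":   "anthropic.claude-3-5-sonnet-20240620-v1:0",  # nuanced reasoning
    "revision": "anthropic.claude-3-5-sonnet-20240620-v1:0",  # style-preserving rewrite
}


def model_for(stage: str) -> str:
    """Resolve the Bedrock model ID for a pipeline stage, defaulting
    unknown stages to the cheapest tier."""
    return STAGE_MODELS.get(stage, STAGE_MODELS["chunking"])
```

Keeping the mapping in one place is what makes the "upgrade models without touching the architecture" property the team values: swapping a model ID in the table upgrades a stage across the whole pipeline.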

The team explicitly values Amazon Bedrock’s model diversity, noting how the platform enables them to choose optimal models for specific tasks, achieve cost efficiency without sacrificing accuracy, and upgrade to newer models smoothly while maintaining their existing architecture. This abstraction from specific model versions provides operational flexibility crucial for production systems that need to evolve with rapidly advancing foundation model capabilities.

Rule Optimization and Knowledge Extraction

Beyond content review, MACROS includes a sophisticated Rule Optimizer feature that extracts and refines actionable guidelines from unstructured source documents—a critical capability given that medical guidelines often arrive in complex PDF formats rather than machine-readable structures. The Rule Optimizer processes raw PDFs through Amazon Textract for text extraction, chunks content based on document headers, and processes segments through Amazon Bedrock using specialized system prompts.

The system supports two distinct operational modes: “Style/tonality” mode focuses on extracting editorial guidelines about writing style, formatting, and permissible language, while “Medical” mode processes scientific documents to extract three classes of rules: medical condition guidelines, treatment-specific guidelines, and changes to medical advice or health trends. Each extracted rule receives a priority classification (high, medium, low) to guide subsequent review ordering and focus attention appropriately.

The Rule Optimizer defines quality criteria for extracted rules, requiring them to be clear, unambiguous, actionable, relevant, consistent, concise (maximum two sentences), written in active voice, and avoiding unnecessary jargon. This explicit quality framework demonstrates thoughtful prompt engineering for structured knowledge extraction. Importantly, the system includes a manual editing interface where users can refine rule text, adjust classifications, and manage priorities, with changes persisted to S3 for future use—a pragmatic acknowledgment that AI extraction requires human oversight and refinement.
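The extracted rules described above map naturally onto a small record type with validated priorities. The field and class names here are illustrative assumptions about how such rules might be represented before persistence to S3:

```python
from dataclasses import dataclass


@dataclass
class ExtractedRule:
    """One guideline produced by rule extraction. Field names are
    illustrative; the case study specifies the content constraints
    (max two sentences, active voice) and the priority levels."""
    text: str       # clear, actionable, at most two sentences, active voice
    category: str   # e.g. "medical_condition", "treatment", "advice_change"
    priority: str = "medium"

    def __post_init__(self) -> None:
        if self.priority not in {"high", "medium", "low"}:
            raise ValueError(f"unknown priority: {self.priority!r}")
```

Validating the priority at construction time keeps the manual editing interface honest: an expert's edit that drifts outside the three-level scheme fails immediately instead of silently skewing review ordering later.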

User Interface and Operating Modes

MACROS supports two primary UI modes addressing different operational scales. “Detailed Document Processing” mode provides granular content assessment for individual documents, accepting input as PDF, TXT, or JSON files, or as directly pasted text. Users select from predefined rule sets (examples include Vitamin D, Breast Health, and PMS/PMDD guidelines) or supply custom guidelines with adherent and non-adherent examples. This mode facilitates thorough interactive review with on-the-fly adjustments.

“Multi Document Processing” mode handles batch operations across numerous JSON files simultaneously, designed to mirror how Flo would integrate MACROS into their content management system for periodic automated assessment. The architecture also supports direct API calls alongside UI access, enabling both interactive expert review and programmatic pipeline integration—a dual interface approach that serves different stakeholder needs from medical experts to system integrators.

Implementation Challenges and Practical Considerations

The case study candidly discusses several implementation challenges that provide valuable insights for practitioners. Data preparation emerged as a fundamental challenge, requiring standardization of input formats for both medical content and guidelines while maintaining consistent document structures. Creating diverse test sets across different medical topics proved essential for comprehensive validation—a reminder that healthcare AI requires domain-representative evaluation data.

Cost management quickly became a priority, necessitating token usage tracking and optimization of both prompt design and batch processing strategies. The team implemented monitoring to balance performance with efficiency, reflecting real production concerns where per-token pricing makes optimization economically important.
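Token-usage tracking of the kind described usually amounts to a per-stage tally with a price model applied at the end. This is a minimal sketch; the per-1K-token rates are placeholder assumptions, not actual Bedrock pricing:

```python
class TokenCostTracker:
    """Running per-stage token tally with a simple cost estimate.
    Rates are placeholder assumptions (USD per 1K tokens)."""

    def __init__(self, usd_per_1k_input: float = 0.003, usd_per_1k_output: float = 0.015):
        self.in_rate = usd_per_1k_input
        self.out_rate = usd_per_1k_output
        self.usage: dict[str, list[int]] = {}  # stage -> [input_tokens, output_tokens]

    def record(self, stage: str, input_tokens: int, output_tokens: int) -> None:
        tally = self.usage.setdefault(stage, [0, 0])
        tally[0] += input_tokens
        tally[1] += output_tokens

    def cost(self) -> float:
        return sum(i / 1000 * self.in_rate + o / 1000 * self.out_rate
                   for i, o in self.usage.values())
```

Per-stage tallies also make the filtering stage's value measurable: skipped reviews show up directly as tokens never recorded against the expensive review and revision stages.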

Regulatory and ethical compliance considerations loom large given the sensitive nature of medical content. The team established robust documentation practices for AI decisions, implemented strict version control for medical guidelines, and maintained continuous human medical expert oversight for AI-generated suggestions. Regional healthcare regulations were carefully considered throughout implementation, though specific compliance frameworks aren’t detailed in this proof of concept phase.

Integration and scaling challenges led the team to start with a standalone testing environment while planning for future CMS integration through well-designed API endpoints. Building with modularity in mind proved valuable for accommodating future enhancements. Throughout the process, they faced common challenges including maintaining context in long medical articles, balancing processing speed with accuracy, and ensuring consistent tone across AI-suggested revisions.

Performance Monitoring and Validation

While the architecture includes Amazon CloudWatch for basic monitoring and log management, the team acknowledges that production deployments handling critical medical content warrant more sophisticated observability. They suggest future enhancements with custom metrics and alarms to provide granular insights into system performance and content processing patterns—an honest assessment that their current monitoring setup, while functional, represents an area for production hardening.

The validation approach involved close collaboration between Flo Health’s medical experts and AWS technical specialists through regular review sessions. This human-in-the-loop validation process provided critical feedback and medical expertise to continuously enhance AI model performance and accuracy. The emphasis on expert validation of parsing rules and maintaining clinical precision reflects appropriate caution when deploying AI in high-stakes medical contexts.

Results and Critical Assessment

The proof of concept delivered promising results across key success metrics. The solution maintained 80% accuracy and achieved over 90% recall in identifying content requiring updates, while processing time dropped from hours to minutes per guideline, in line with the 10x improvement target. Most notably, the AI-powered system applied medical guidelines more consistently than manual reviews and significantly reduced the time burden on medical experts.

However, the case study warrants careful interpretation as promotional AWS content. The 80% accuracy figure falls short of the stated 90% goal, though the over 90% recall meets targets. The text doesn’t detail the evaluation methodology, sample size, or how accuracy and recall were precisely defined and measured in this context. The comparison to “manual review” baseline isn’t quantified with specific time or accuracy metrics for the human process, making it difficult to assess the true magnitude of improvements.

The emphasis on “human-AI collaboration” rather than automation, while ethically appropriate for medical content, also suggests the system hasn’t achieved the level of reliability required for autonomous operation. Medical experts remain “essential for final validation,” meaning the actual operational improvement may be more modest than headline numbers suggest—the system accelerates but doesn’t eliminate expert review work.

Key Learnings and Best Practices

The team distilled several insights valuable for practitioners. Content chunking emerged as essential for accurate assessment across long documents, with expert validation of parsing rules helping maintain clinical precision. The most important conclusion: human-AI collaboration, not full automation, represents the appropriate implementation model. Regular expert feedback, clear performance metrics, and incremental improvements guided system refinements.

The project confirmed that while the system significantly streamlines review processes, it works best as an augmentation tool with medical experts remaining essential for final validation. This creates a more efficient hybrid approach to medical content management—a pragmatic conclusion that may disappoint those seeking radical automation but reflects responsible deployment in high-stakes domains.

Production Considerations and Future Directions

This case study represents Part 1 of a two-part series focusing on proof of concept results. The team indicates Part 2 will cover the production journey, diving into scaling challenges and strategies—suggesting the PoC-to-production transition presented significant additional complexity not addressed here.

The technical architecture appears production-ready in structure with appropriate service separation, authentication, monitoring, and data persistence. However, several elements suggest early-stage deployment: the acknowledgment that monitoring could be enhanced for production medical content, the exploration of alternative orchestration approaches like Bedrock Flows, and the focus on standalone testing environments with planned future CMS integration all indicate this solution is approaching but hasn’t fully reached mature production operation.

The cost optimization discussion and tiered model selection strategy demonstrate thoughtful LLMOps maturity, as does the dual API/UI interface design supporting both interactive and automated workflows. The rule optimization capability for extracting guidelines from unstructured sources addresses a real operational challenge beyond simple content review, adding significant practical value.

Overall, this case study documents a substantial and thoughtfully designed LLMOps implementation addressing genuine operational challenges in medical content management. While the promotional nature of AWS content requires critical reading and some claims lack detailed substantiation, the technical approach demonstrates solid LLMOps principles including appropriate model selection, structured output parsing, human-in-the-loop validation, cost optimization, and modular architecture supporting both expert interaction and system integration. The candid discussion of challenges and the emphasis on augmentation rather than replacement of human experts lend credibility to the narrative.
