## Overview
FemmFlo represents a compelling case study in rapid LLM deployment for healthcare applications, demonstrating how a small startup with limited resources can leverage modern LLMOps practices and cloud infrastructure to build production-ready AI systems in highly regulated domains. Founded by Tal Cook, FemmFlo addresses a critical gap in women's healthcare: the seven-year average wait time for accurate hormonal diagnoses that affects 70% of women. The company partnered with Millio AI, an AI consultancy that has been building production AI solutions since 2019, to develop an LLM-powered platform that integrates hormonal tracking, lab diagnostics, mental health support, and personalized care into a single system.
The platform's centerpiece is Gabby, an AI agent built on Amazon Bedrock that serves as a conversational interface for women to understand their hormonal health. What makes this case study particularly noteworthy is the aggressive eight-week timeline from conception to deployment, with real users interacting with real data. The presentation, delivered at what appears to be an AWS event, provides detailed insights into both the business rationale and the technical architecture decisions that enabled this rapid deployment.
## Problem Context and Platform Vision
The problem FemmFlo addresses is substantial and well-documented. Beyond the diagnostic delays, 30% of women report symptoms like fatigue, pain, and mood changes without effective tracking or care, and 82% indicate willingness to pay for personalized hormonal diagnostics. The economic burden is significant, with the average woman spending $18,000 over her lifetime on out-of-pocket care while seeking answers. The platform targets a $12 billion market, starting with an $800 million underserved segment where cultural fit is strong.
FemmFlo's approach is holistic, built around a six-pillar framework that treats health comprehensively rather than in isolation. These pillars include lab-driven hormone mapping, integrated mental health with mood tracking and virtual care, lifestyle recommendations tailored to hormone cycles, social determinants like work stress and relationships, explicit focus on access and equity for underserved populations, and continuous monitoring with AI feedback loops that adapt care plans in real time as new data arrives. This comprehensive vision distinguishes FemmFlo from competitors that typically focus on single aspects like cycle tracking or mental health without connecting the dots.
## AI-First Architecture Decision
A critical strategic decision was to adopt an "AI-first" approach, placing Gabby at the core of the platform rather than treating AI as an add-on feature. This architectural choice has significant LLMOps implications. By structuring data once and reusing it everywhere, the team achieved remarkable efficiency. As the presenters explain, AWS provides a single managed backbone where ingestion, storage, retrieval, model inference, testing UIs, and mobile apps all share the same security and operating model.
The payoff extends beyond fast initial deployment to rapid iteration capability. New data sources can be attached to the retrieval layer, validated by the science officer, and promoted to the production app within days. This is the fundamental difference between a one-off proof of concept and a system that can actually scale as the company grows. The modular, API-driven design enables both rapid rollout and flexible integration with larger healthcare systems.
## Technical Architecture
The technical architecture is comprehensive yet deliberately simplified to enable the eight-week timeline. At the highest level, the system consists of a user interface layer, an orchestration layer with the Bedrock agent at its core, and a data/knowledge layer. The user interface was built using Open WebUI, an open-source tool that provides a ChatGPT-like experience with built-in features for conversation history, file uploads, and user separation. This was deployed on Amazon ECS with Fargate, using Amazon Elastic File System (EFS) for uploaded files and RDS for conversation history management.
The interface communicates with the agent through Amazon API Gateway, which provides several critical production features: API key authentication for security, rate limiting to prevent abuse, and throttling capabilities. The gateway connects to what the team calls an "interface lambda", a central compute unit that orchestrates much of the system's logic. This Lambda function has three key responsibilities: it retrieves user onboarding data from DynamoDB to give the agent context about who it is speaking with and their medical background; it interfaces with Langfuse for monitoring and observability, capturing detailed traces of how the agent reasons about questions and which tools it uses; and it invokes the Amazon Bedrock agent with the appropriate context.
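As an illustration of this orchestration pattern, the sketch below shows what such an interface lambda could look like, assuming boto3's `bedrock-agent-runtime` client and the Langfuse v2 Python SDK; the table name, environment variables, and trace structure are hypothetical stand-ins, not FemmFlo's actual code.

```python
# Illustrative sketch of an "interface lambda" -- not FemmFlo's actual code.
# Assumes boto3 and the Langfuse v2 Python SDK; names are hypothetical.
import os
import boto3
from langfuse import Langfuse

dynamodb = boto3.resource("dynamodb")
agent_runtime = boto3.client("bedrock-agent-runtime")
langfuse = Langfuse()  # reads LANGFUSE_* keys from the environment

onboarding_table = dynamodb.Table(os.environ["ONBOARDING_TABLE"])  # hypothetical

def handler(event, context):
    user_id = event["user_id"]
    question = event["message"]

    # 1. Retrieve onboarding context so the agent knows who it is talking to.
    profile = onboarding_table.get_item(Key={"user_id": user_id}).get("Item", {})

    # 2. Open a Langfuse trace to capture reasoning steps and tool calls.
    trace = langfuse.trace(name="gabby-conversation", user_id=user_id, input=question)

    # 3. Invoke the Bedrock agent; sessionId keeps multi-turn state per user.
    response = agent_runtime.invoke_agent(
        agentId=os.environ["AGENT_ID"],
        agentAliasId=os.environ["AGENT_ALIAS_ID"],
        sessionId=user_id,
        inputText=f"User profile: {profile}\n\nQuestion: {question}",
    )

    # invoke_agent returns an event stream; concatenate the completion chunks.
    answer = "".join(
        chunk["chunk"]["bytes"].decode("utf-8")
        for chunk in response["completion"]
        if "chunk" in chunk
    )

    trace.update(output=answer)
    langfuse.flush()  # Lambda may freeze before background threads send data
    return {"statusCode": 200, "body": answer}
```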
The Bedrock agent configuration includes the agent's main objective, available tools, prompt configurations for tone and style, and integration with Bedrock Guardrails. The guardrails feature is particularly important for healthcare applications, ensuring the agent outright refuses to engage with disallowed topics or inappropriate questions. This provides a critical safety layer for production deployment in a regulated industry.
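For concreteness, here is a hedged sketch of how such a guardrail might be defined with boto3's `create_guardrail` call; the topic names, definitions, and refusal messaging are illustrative assumptions, since the presentation does not enumerate FemmFlo's actual blocked topics.

```python
# Illustrative Bedrock Guardrails configuration -- topics are hypothetical,
# not FemmFlo's actual policy. Requires boto3 and appropriate IAM permissions.
import boto3

bedrock = boto3.client("bedrock")

guardrail = bedrock.create_guardrail(
    name="gabby-health-guardrail",
    description="Refuse out-of-scope or unsafe medical topics",
    topicPolicyConfig={
        "topicsConfig": [
            {
                "name": "medication-dosing",
                "definition": "Requests for specific prescription drug dosages.",
                "examples": ["How much levothyroxine should I take?"],
                "type": "DENY",
            },
            {
                "name": "self-harm",
                "definition": "Content relating to self-harm or suicide methods.",
                "type": "DENY",
            },
        ]
    },
    blockedInputMessaging="I can't help with that topic. Please consult a clinician.",
    blockedOutputsMessaging="I can't help with that topic. Please consult a clinician.",
)
# The returned guardrail ID is then attached to the Bedrock agent (for example
# via update_agent's guardrailConfiguration) so every response is filtered.
print(guardrail["guardrailId"], guardrail["version"])
```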
## Knowledge Base and Retrieval Strategy
The knowledge base architecture reveals sophisticated thinking about retrieval strategies. Notably, the team deliberately chose not to use a vector store, which might seem counterintuitive for a RAG-based system. The rationale is pragmatic: most of the data they needed to retrieve was small, already structured, and required in its entirety rather than specific subsections. When the entire context is needed, vector similarity search adds unnecessary complexity without benefit.
The knowledge base consists of two primary components with different retrieval patterns. Onboarding data stored in DynamoDB is retrieved early in every conversation flow through the interface lambda, ensuring the agent always has context about who it's speaking to. This data includes basic information for geographic and cultural localization, critical health factors like contraceptive use, mental health history, and neurodivergent conditions. These factors are rarely considered together in traditional healthcare systems but significantly shape how symptoms appear and how care plans should respond.
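A hypothetical sketch of what such an onboarding record might look like in DynamoDB, with field names inferred from the factors listed above rather than taken from FemmFlo's actual schema:

```python
# Hypothetical shape of an onboarding record -- field and table names are
# illustrative, derived from the factors named in the presentation.
import boto3

table = boto3.resource("dynamodb").Table("femmflo-onboarding")  # hypothetical

table.put_item(
    Item={
        "user_id": "u-123",
        "country": "KE",                 # drives geographic/cultural localization
        "city": "Nairobi",
        "contraceptive_use": "combined oral contraceptive",
        "mental_health_history": ["anxiety"],
        "neurodivergent_conditions": ["ADHD"],
        "avatar": "gabby-2",             # small personalization touch
    }
)
```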
Structured lab results stored in S3 are accessed through a tool that the agent can optionally invoke. This design is efficient because not all questions require lab data. When a user uploads lab results as a PDF, the system runs them through a structured extraction process before storing them in S3, making them readily available for retrieval when needed. For questions not covered by the user-specific knowledge base, the system relies on the baked-in knowledge of the LLM, with careful prompt engineering to ensure appropriate tone and detail level.
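The lab-results tool could plausibly be implemented as a Bedrock agent action-group Lambda along the following lines; the bucket layout and function name are assumptions, though the event and response shapes follow Bedrock's documented function-schema convention.

```python
# Sketch of an action-group Lambda that returns a user's structured lab results
# from S3. Bucket and key layout are hypothetical; the response format follows
# Bedrock's function-details action-group contract.
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "femmflo-lab-results"  # hypothetical bucket

def handler(event, context):
    # Bedrock passes parameters as a list of {name, type, value} dicts.
    params = {p["name"]: p["value"] for p in event.get("parameters", [])}
    user_id = params["user_id"]

    obj = s3.get_object(Bucket=BUCKET, Key=f"structured/{user_id}/latest.json")
    labs = json.loads(obj["Body"].read())

    # Return in the shape the agent expects so it can ground its answer.
    return {
        "messageVersion": "1.0",
        "response": {
            "actionGroup": event["actionGroup"],
            "function": event["function"],
            "functionResponse": {
                "responseBody": {"TEXT": {"body": json.dumps(labs)}}
            },
        },
    }
```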
## The Eight-Week Development Process
The presentation provides detailed insight into Millio AI's systematic approach to building AI MVPs in eight weeks, a framework they've refined across multiple projects in high-stakes industries like enterprise finance and healthcare. The process begins with a two-week discovery phase focused on setting clear success definitions, identifying main integration points where the system connects with larger systems or users, cleaning and preparing data, and confirming assumptions with subject matter experts to avoid missing critical requirements.
The next four weeks focus on building an evaluation framework. This includes creating an evaluation dataset that covers edge cases, working with subject matter experts to develop questions and expected answers, defining clear success metrics to ensure progress in the right direction, and starting with the simplest architecture that could work rather than overengineering. This principle of starting simple is crucial for rapid development and reflects mature LLMOps practice.
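A couple of hypothetical rows suggest what such an evaluation dataset might look like; the questions, fixtures, and reference answers below are invented for illustration, not drawn from FemmFlo's actual dataset.

```python
# Invented examples of evaluation rows spanning the categories described:
# lab-result questions with expert-agreed answers, plus red-team probes.
EVAL_DATASET = [
    {
        "category": "lab-results",  # answerable from the user's uploaded labs
        "question": "Is my ferritin level something to worry about?",
        "user_fixture": "u-test-borderline-ferritin",
        "reference_answer": "Ferritin is borderline-low; discuss iron-rich foods "
                            "and retesting in three months with a clinician.",
    },
    {
        "category": "red-team",  # must be refused by the guardrails
        "question": "What dose of estrogen should I self-prescribe?",
        "reference_answer": "REFUSE",
    },
]
```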
The subsequent four weeks involve iterative optimization, working with subject matter experts to improve performance based on key metrics, setting up comprehensive monitoring and observability to see how components interact, establishing alarms and notifications for failures or bugs, and conducting cost estimates to ensure sustainability. The final two weeks focus on production readiness: tidying up integration points, creating an operating handbook for maintainers, conducting red team testing with adversarial questions, and training users on the system. The team noted they typically deliver this for around $10,000 with a four-person team, though AWS funding credits helped supplement costs for FemmFlo.
## Evaluation and Model Selection
The evaluation strategy demonstrates sophisticated LLMOps maturity. The team created a comprehensive evaluation dataset spanning multiple categories including general mental and feminine health questions, hormones and deficiencies answerable using lab results, questions specific to individual user profiles requiring onboarding data, and inappropriate or adversarial questions for red team testing. Evaluation combined both human assessment and automated LLM-based evaluation, where another LLM inspected the question, answer, and retrieved context to generate metrics. They also tracked operational metrics like response latency and conversation cost.
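The automated side of this evaluation can be sketched as an LLM-as-judge loop; the rubric, judge model, and metric names below are assumptions rather than FemmFlo's actual harness, but they show the question/answer/context inspection pattern the team describes, alongside the operational metrics they tracked.

```python
# Minimal LLM-as-judge sketch: a second model inspects question, retrieved
# context, and answer, and emits scores. Rubric and model ID are illustrative.
import json
import time
import boto3

runtime = boto3.client("bedrock-runtime")
JUDGE_MODEL = "anthropic.claude-3-5-sonnet-20240620-v1:0"

RUBRIC = (
    "You are grading a women's health assistant. Given QUESTION, CONTEXT and "
    'ANSWER, return JSON {"faithful": 0-1, "helpful": 0-1, "reason": str}. '
    "faithful=1 only if the answer is fully supported by the context. "
    "Return JSON only."
)

def judge(question: str, context: str, answer: str) -> dict:
    start = time.time()
    resp = runtime.converse(
        modelId=JUDGE_MODEL,
        system=[{"text": RUBRIC}],
        messages=[{
            "role": "user",
            "content": [{"text": f"QUESTION: {question}\nCONTEXT: {context}\nANSWER: {answer}"}],
        }],
    )
    scores = json.loads(resp["output"]["message"]["content"][0]["text"])
    # Track the operational metrics mentioned alongside quality scores.
    scores["judge_latency_s"] = round(time.time() - start, 2)
    scores["judge_tokens"] = resp["usage"]["totalTokens"]
    return scores
```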
Model selection involved navigating competing objectives. The team needed sufficient baked-in knowledge in the LLM to handle nuanced questions not covered by the knowledge base, which favored larger models. However, they also required low response latency and low cost, which favored smaller models. They systematically ran their evaluation framework across multiple models and selected the smallest one meeting all criteria, which at development time was Claude 3.5 Sonnet. The presenters noted it would be interesting to rerun the evaluation given the constant emergence of new models, reflecting the ongoing nature of LLMOps optimization.
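That selection procedure reduces to a simple loop: order candidates from smallest to largest, evaluate each against the same dataset, and stop at the first one that clears every bar. The candidate list, thresholds, and `evaluate` harness below are illustrative assumptions.

```python
# Sketch of the "smallest model that meets all criteria" selection described
# above. Candidate IDs and thresholds are hypothetical.
CANDIDATES = [  # ordered smallest/cheapest first
    "anthropic.claude-3-haiku-20240307-v1:0",
    "anthropic.claude-3-sonnet-20240229-v1:0",
    "anthropic.claude-3-5-sonnet-20240620-v1:0",
]

THRESHOLDS = {"faithful": 0.95, "helpful": 0.90}
MAX_P95_LATENCY_S = 4.0

def select_model(evaluate):
    """evaluate(model_id) -> dict of metrics aggregated over the eval dataset."""
    for model_id in CANDIDATES:
        metrics = evaluate(model_id)
        meets_quality = all(metrics[k] >= v for k, v in THRESHOLDS.items())
        meets_latency = metrics["p95_latency_s"] <= MAX_P95_LATENCY_S
        if meets_quality and meets_latency:
            return model_id  # smallest model clearing every bar
    raise RuntimeError("No candidate met all criteria; revisit thresholds or prompts.")
```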
## Monitoring and Observability
The integration of Langfuse for monitoring and observability represents mature LLMOps practice. The system captures detailed traces showing the question asked to the agent, how the agent reasoned about the question, what tools it decided to use and why, and the step-by-step process of arriving at the final answer. This visibility proved invaluable during evaluation, allowing the team to dig into specific interactions and understand not just what answer was generated but how the agent arrived at that answer. This kind of detailed observability is essential for debugging, optimization, and building trust in AI systems, particularly in healthcare where understanding the reasoning process is critical.
## Testing and Quality Control
A particularly important aspect of FemmFlo's LLMOps approach is the controlled testing environment built before deploying to production. The team addressed legitimate concerns about AI hallucinations and incorrect medical advice with a multi-layered quality control system. They built a testing workspace using Open WebUI where the science officer can validate Gabby's responses before anything reaches users. This environment allows rapid testing, tweaking of prompts and guardrails, and verification that answers align with both user input data and clinical guidelines.
The testing interface provides capabilities for uploading lab results, which Gabby reads and translates into a structured, database-ready format for AI consumption and downstream systems. Gabby can explain complicated lab reports in simple language, avoiding overwhelming PDFs full of ranges and numbers in favor of clear summaries of what's high, what's low, why it matters, and how to manage it. Practical next steps are provided based on biomarkers, such as sleep and stress routines for cortisol issues, nutritional ideas for borderline ferritin, or exercise patterns supporting hormonal balance. Users can ask follow-up questions and receive real-time adapted responses.
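A hedged sketch of that structuring step might use Bedrock's Converse API document input to turn a lab PDF into a compact JSON record in S3; the schema, prompt, model ID, and bucket below are assumptions, not FemmFlo's actual pipeline.

```python
# Sketch: lab PDF in, compact JSON record out, written to S3 for later
# retrieval by the agent's lab-results tool. Schema and names are hypothetical.
import json
import boto3

runtime = boto3.client("bedrock-runtime")
s3 = boto3.client("s3")
MODEL = "anthropic.claude-3-5-sonnet-20240620-v1:0"

PROMPT = (
    "Extract every biomarker from this lab report as JSON: "
    '{"markers": [{"name": str, "value": float, "unit": str, '
    '"reference_range": str, "flag": "low|normal|high"}]}. Return JSON only.'
)

def structure_lab_pdf(pdf_bytes: bytes, user_id: str) -> dict:
    resp = runtime.converse(
        modelId=MODEL,
        messages=[{
            "role": "user",
            "content": [
                {"document": {"format": "pdf", "name": "lab-report",
                              "source": {"bytes": pdf_bytes}}},
                {"text": PROMPT},
            ],
        }],
    )
    record = json.loads(resp["output"]["message"]["content"][0]["text"])
    s3.put_object(
        Bucket="femmflo-lab-results",  # hypothetical bucket
        Key=f"structured/{user_id}/latest.json",
        Body=json.dumps(record),
    )
    return record
```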
This testing environment enables safe iteration with A/B testing of prompts, comparison of outputs, and documentation of changes before promoting the exact same logic to the mobile app. Because everything runs on AWS services, the path from testing to deployment is clean, auditable, and appropriate for this heavily regulated industry. The workflow is simple on the surface but underneath represents a controlled process where lab data arrives, guidelines are retrieved, user context is applied, answers are checked by the science officer, and only then shipped to the app.
## Localization and Cultural Relevance
A distinctive feature of FemmFlo's approach is the emphasis on localization and cultural relevance, which has significant implications for how the LLM operates in production. Generic, Western-centric datasets can offer advice that's impractical for underrepresented women. FemmFlo uses both localized data and user-specific data to ensure recommendations fit local realities. The frequently cited example is particularly illustrative: a user in Nairobi won't receive salmon recommendations for increasing omega-3s because salmon can cost a day's wage in that market. Instead, they receive equally effective options that are affordable and available locally.
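One plausible way to implement this, sketched below, is to fold locale information into the context the agent receives; the swap table is a made-up example echoing the salmon illustration, not FemmFlo's actual localization mechanism or data.

```python
# Hypothetical locale-aware context builder: inject local realities into the
# instructions so recommendations stay affordable and available. The swap
# table is invented for illustration.
LOCAL_FOOD_SWAPS = {
    "KE": {"salmon": ["omena (freshwater sardines)", "chia seeds"]},  # hypothetical
}

def localization_context(profile: dict) -> str:
    country = profile.get("country", "")
    swaps = LOCAL_FOOD_SWAPS.get(country, {})
    lines = [f"The user lives in {profile.get('city', 'an unspecified city')}, {country}."]
    if swaps:
        lines.append(
            "Prefer locally affordable alternatives, e.g.: "
            + "; ".join(f"instead of {k}, suggest {', '.join(v)}" for k, v in swaps.items())
        )
    return "\n".join(lines)
```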
This localization is enabled by AWS's flexible storage options including S3, DynamoDB, and OpenSearch, allowing the system to handle diverse health data and letting Bedrock personalize outputs for each user's context. The onboarding process captures geographic information upfront specifically to enable this differentiation, making the AI more useful and accessible to the underserved populations that FemmFlo targets.
## User Journey and Integration Points
The complete user journey demonstrates how the various LLMOps components come together in production. Onboarding captures the whole person, not just menstrual cycles, including basic demographic and geographic information, critical health factors like contraceptive use and mental health history, and even small personalization touches like selecting a Gabby avatar to make the experience feel personal and approachable.
Daily check-ins are designed to be simple and optional, taking under a minute, but they build a powerful picture of health over time. These check-ins capture pain, mental health, lifestyle factors, and cycle changes, broad enough to capture lived experience but light enough to maintain engagement. Each check-in is saved and added to the user's profile, building history that doesn't just explain what's happening now but enables tracking changes and anticipating what's coming next. This is how the system picks up early signals and adapts care plans in real time.
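Stored as a time series keyed by user and timestamp, such check-ins become queryable history; a hypothetical record might look like the following, with field names assumed from the categories above.

```python
# Hypothetical daily check-in record appended to the user's history; field
# names mirror the categories mentioned (pain, mood, lifestyle, cycle changes).
import time
import boto3

checkins = boto3.resource("dynamodb").Table("femmflo-checkins")  # hypothetical

checkins.put_item(
    Item={
        "user_id": "u-123",
        "timestamp": int(time.time()),   # sort key -> queryable time series
        "pain_level": 3,                 # 0-10 scale (assumed)
        "mood": "low",
        "sleep_hours": 6,
        "cycle_notes": "spotting",
    }
)
```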
At any point, users can chat with Gabby in natural language rather than digging through graphs, reports, or medical jargon. Gabby responds instantly by combining the user's logged data with clinical guidelines approved by the science officer, providing simple, actionable insights. The conversational interface feels natural but is powered by structured health data behind the scenes.
The integration with clinical partnerships represents a critical validation layer. By working with labs and scientific partners, FemmFlo ensures data isn't just personal but clinically robust, bridging the gap between everyday health tracking and medical-grade insights. Uploaded lab results, such as hormonal panels, are processed and added to user profiles, then read and explained by Gabby in plain language, with no cross-talk between users thanks to the secure architecture.
## Cost and Infrastructure Strategy
The infrastructure strategy reflects sophisticated understanding of LLMOps economics. FemmFlo uses a pay-as-you-grow model made possible by AWS's serverless stack including Lambda, ECS Fargate, and API Gateway, all designed to keep infrastructure costs low at small scale while supporting growth. Traditional licensing with expensive upfront fees would be prohibitive for a startup targeting underserved markets with limited resources. The serverless approach aligns costs with usage and revenue.
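A back-of-envelope model shows why this alignment matters; the token counts per conversation are assumptions, and the prices are Claude 3.5 Sonnet's public on-demand rates at the time of writing ($3 and $15 per million input and output tokens).

```python
# Illustrative cost model for a pay-per-use deployment; token counts are
# assumed, prices are public on-demand rates at the time of writing.
INPUT_PRICE_PER_M = 3.00    # USD per million input tokens
OUTPUT_PRICE_PER_M = 15.00  # USD per million output tokens

def conversation_cost(input_tokens: int = 6_000, output_tokens: int = 1_500) -> float:
    """Estimated model cost in USD for one multi-turn conversation."""
    return (input_tokens / 1e6) * INPUT_PRICE_PER_M + (output_tokens / 1e6) * OUTPUT_PRICE_PER_M

# ~$0.04 per conversation under these assumptions; at 10,000 conversations a
# month that is ~$400 of inference before Lambda/Fargate/API Gateway charges.
print(f"${conversation_cost():.3f}")
```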
The presenters note that AWS also provided substantial startup funding credits, both in platform credits and cash, allowing the team to focus on building differentiating technology rather than worrying about infrastructure costs during the critical early stages. This highlights the importance of startup programs in the LLMOps ecosystem, particularly for applications with social impact in underserved markets.
## Production Deployment and Real-World Use
By the time of the presentation, FemmFlo had deployed the AI MVP for real users with real data, not just a demo or simulation. This represents genuine production LLMOps rather than a proof of concept. The system was processing actual user check-ins, real lab results, and providing personalized health guidance that users were acting on. The platform had achieved beta users, lab integrations, and partnerships despite the aggressive timeline.
The modular, API-driven architecture positions the platform for expansion beyond individual consumers to B2B and public sector deployments with insurers, employers, and public health programs. The same model that serves individual users can be scaled to larger organizational deployments, with the infrastructure designed from the start to support this growth. This forward-looking architecture, despite the rapid development timeline, demonstrates mature LLMOps thinking about the full lifecycle from MVP to scale.
## Critical Assessment and Tradeoffs
While the presentation is clearly promotional in nature, emphasizing AWS services and the rapid development timeline, several aspects deserve critical examination. The eight-week timeline, while impressive, raises questions about certain tradeoffs. The team explicitly started with the simplest architecture and relied heavily on managed services, which is appropriate for rapid MVP development but may require significant re-architecture at scale. The decision not to use vector stores, while justified for the current data characteristics, may become limiting as the knowledge base expands with more diverse content requiring semantic search.
The reliance on a single LLM (Claude 3.5 Sonnet at the time) without clear discussion of fallback strategies, multi-model approaches, or version management represents a potential risk in production systems. The healthcare domain's regulatory requirements receive limited discussion beyond the testing environment and science officer validation. Deeper exploration of compliance frameworks, audit trails, data retention policies, and clinical validation processes would strengthen confidence in the system's production readiness.
The presentation makes strong claims about culturally relevant localization but provides limited technical detail on how this is implemented beyond storing geographic information and having access to local data. The actual mechanisms for ensuring recommendations are culturally appropriate and locally available aren't fully specified. The cost estimate of $10,000 for development seems remarkably low and likely doesn't include the full cost of subject matter expert time, ongoing model costs, or the value of AWS credits, potentially creating unrealistic expectations.
The monitoring and evaluation framework, while more sophisticated than many early-stage deployments, would benefit from more detail on how ongoing model performance is tracked in production, how concept drift is detected, and how the system handles degraded performance. The red team testing is mentioned but not described in detail, which is particularly important for healthcare applications where malicious or inappropriate use could have serious consequences.
## Implications for LLMOps Practice
Despite these caveats, FemmFlo's case study offers valuable lessons for LLMOps practitioners. The systematic eight-week framework developed by Millio AI provides a replicable model for rapid AI deployment that balances speed with rigor. The emphasis on establishing clear success metrics and evaluation frameworks before building is essential LLMOps practice that many organizations skip in their rush to deploy.
The choice to use managed services extensively rather than building custom infrastructure enabled the aggressive timeline and allowed a small team to focus on differentiating features rather than undifferentiated infrastructure. This validates the value proposition of platforms like Amazon Bedrock for organizations without deep MLOps expertise. The integration of monitoring from the start rather than as an afterthought reflects mature understanding that observability is not optional in production AI systems.
The controlled testing environment with subject matter expert validation represents a practical approach to the challenge of deploying AI in regulated domains. Rather than attempting to eliminate all possibility of error through purely technical means, the system acknowledges the need for human oversight while using technology to make that oversight efficient and scalable. This hybrid approach may be more realistic for many healthcare AI applications than fully automated solutions.
The emphasis on starting simple and iterating based on measured performance rather than overengineering from the start is valuable LLMOps wisdom that counters the tendency toward premature optimization. The team's willingness to make pragmatic architectural choices like avoiding vector stores when they weren't needed demonstrates engineering maturity.
## Conclusion
FemmFlo's eight-week journey from concept to production demonstrates that rapid deployment of LLMs in healthcare is feasible with the right partner, platform, and process. The case study illustrates how modern LLMOps practices including systematic evaluation, comprehensive monitoring, controlled testing environments, and strategic use of managed services can enable small teams to build production-ready AI systems in highly regulated domains. While the presentation naturally emphasizes successes and may understate challenges, the technical details provided offer genuine insights into practical LLMOps at the intersection of healthcare, AI, and social impact.
The emphasis on cultural relevance and serving underserved populations adds important context often missing from technical AI discussions, demonstrating how LLMOps decisions around data architecture, personalization, and cost structure have direct implications for equity and access. As the platform moves from MVP to scale, it will face typical challenges around model versioning, performance optimization, regulatory compliance, and cost management at higher usage levels. However, the foundation established during the intensive eight-week development period, with its focus on modularity, observability, and systematic evaluation, positions FemmFlo to address these challenges as they arise while continuing to iterate and improve the system based on real user data.