Bell developed a sophisticated hybrid RAG (Retrieval Augmented Generation) system combining batch and incremental processing to handle both static and dynamic knowledge bases. The solution addresses challenges in managing constantly changing documentation while maintaining system performance. They created a modular architecture using Apache Beam, Cloud Composer (Airflow), and GCP services, allowing for both scheduled batch updates and real-time document processing. The system has been successfully deployed for multiple use cases including HR policy queries and dynamic Confluence documentation management.
This case study comes from Bell Canada, one of the largest telecommunications companies in Canada, and was presented at the Toronto Machine Learning Summit (TMLS). The presenters are Lyndon (Senior Manager of AI) and Adam (Senior Machine Learning Engineer) from Bell’s ML Engineering and MLOps team. Their team is responsible for taking data science work—including generative AI applications—and engineering it for production deployment, along with handling operations and DevOps.
The focus of this presentation is on a specific challenge within RAG (Retrieval Augmented Generation) systems: knowledge base management and document embedding pipelines. While RAG architectures have become popular for grounding LLM responses in domain-specific information, the speakers highlight that the operational complexity of maintaining dynamic knowledge bases at scale is often underestimated. Their solution treats document embedding pipelines and knowledge base management as a product, built with modularity, reusability, and maintainability in mind.
Bell uses RAG systems for various internal applications, including an HR chatbot that allows employees to query company policies (such as vacation day entitlements based on tenure, role transitions, and union considerations). The HR policy example illustrates how enterprise knowledge bases are inherently complex—vacation policies at Bell are not simple lookups but depend on numerous factors documented across extensive policy documents.
The core challenges identified include:
The team drew inspiration from two key areas to design their solution:
Data lineage and data provenance practices from traditional ML informed their approach. Just as tracking data from source to model helps explain model performance and detect drift, tracking documents from raw form through chunking and embedding helps explain chatbot responses and enables comparison of different pre-processing configurations. This is particularly important because there are multiple ways to chunk and process documents, and maintaining lineage allows for systematic experimentation and debugging.
The team emphasized modularity as the primary design principle. Each component of the system can be independently tested, changed, debugged, and scaled. Separation of concerns ensures each module has a distinct function. Test-driven development practices were applied despite the novelty of the solution space, with unit tests for every component and integration tests for the system as a whole. CI/CD pipelines are integral to making the system easily deployable and maintainable.
The team carefully defined the constraints and variables of their problem:
Constraints:
Problem Deltas (Variables):
The team developed a solution matrix mapping scenarios to appropriate processing approaches:
The batch pipeline is the primary pipeline—every use case requires one. It serves for initialization, scheduled runs, and handling large document or configuration changes.
The architecture includes:
Apache Beam is used for all processing steps, chosen for its unified programming model for batch and streaming data processing and its excellent support for parallel processing. Documents are processed as small bundles of data (individual documents in pre-processing, individual chunk embeddings in post-processing).
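The per-document pre-processing step can be sketched in plain Python as a stand-in for the Beam `DoFn` that would run inside the pipeline; the chunk size, overlap, and field names here are illustrative assumptions, not Bell's actual configuration.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    doc_id: str  # lineage: which source document this chunk came from
    index: int   # position of the chunk within the document
    text: str

def chunk_document(doc_id, text, chunk_size=500, overlap=50):
    """Split one document into overlapping chunks, keeping lineage metadata.

    In the real pipeline this logic would live inside a beam.DoFn so that
    Beam can process each document as its own small bundle in parallel.
    """
    step = chunk_size - overlap
    chunks = []
    for i, start in enumerate(range(0, max(len(text), 1), step)):
        piece = text[start:start + chunk_size]
        if piece:
            chunks.append(Chunk(doc_id=doc_id, index=i, text=piece))
    return chunks
```

Carrying `doc_id` and `index` on every chunk is what makes the lineage tracking described above possible: any chunk retrieved at query time can be traced back to its source document and offset.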
The incremental pipeline is supplementary and addresses high-velocity document scenarios. The key addition is a Pub/Sub topic that listens to changes in the knowledge base bucket (document additions, updates, deletions).
A single Dataflow job encompasses pre-processing, embedding, and post-processing, consuming messages from the Pub/Sub topic and processing changed documents atomically.
The team specifically chose Pub/Sub over Cloud Functions to avoid race conditions. The presenters give an illustrative example: if a librarian uploads a bad document and immediately tries to delete it, Cloud Functions invocations (isolated, atomic events with no coordination between them) could finish processing and syncing the bad document after the deletion completes, leaving the system out of sync. With Pub/Sub and Apache Beam, events can be windowed and grouped (e.g., sliding windows of 60 seconds), and only the most recent action within a window is processed.
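The effect of that windowed grouping can be illustrated with a small stand-in (the 60-second window comes from the talk; the event shape is hypothetical, and fixed windows are used here instead of sliding windows for simplicity):

```python
def latest_action_per_document(events, window_seconds=60):
    """Collapse a stream of knowledge-base change events so that only the
    most recent action per document within each window survives.

    This mimics what windowing plus per-key grouping in Apache Beam
    achieves on the Pub/Sub stream: an upload followed by an immediate
    delete collapses to just the delete, avoiding the race condition
    that isolated Cloud Functions invocations allow.

    Each event is a (timestamp_seconds, doc_uri, action) tuple.
    """
    latest = {}  # (window_id, doc_uri) -> (timestamp, action)
    for ts, doc, action in events:
        key = (int(ts // window_seconds), doc)
        if key not in latest or ts > latest[key][0]:
            latest[key] = (ts, action)
    # Flatten across windows, letting later windows override earlier ones.
    winners = {}
    for (_, doc), (ts, action) in sorted(latest.items(), key=lambda kv: kv[1][0]):
        winners[doc] = action
    return winners
```

Replaying the librarian scenario through this function yields a single `delete` for the bad document, so the downstream sync never sees the stale upload.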
The production deployment combines both pipelines:
The solution is highly configurable through YAML files. Each component (pre-processing, embedding, post-processing) has its own configuration section specifying:
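A configuration of this shape might look as follows; every field name below is a hypothetical illustration of the per-component structure described, not Bell's actual schema.

```yaml
# Hypothetical pipeline configuration; field names are illustrative only.
use_case: hr-policy-chatbot
pre_processing:
  loader: pdf            # which LangChain document loader to use
  chunk_size: 500
  chunk_overlap: 50
embedding:
  model: textembedding-gecko
  batch_size: 32
post_processing:
  vector_index: hr-policy-index
  sync_target: current   # subfolder the chatbot API reads from
schedule: "0 2 * * *"    # nightly batch run, cron syntax
```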
Components are treated as services following DevOps best practices. One-time resources (Dataflow Flex templates, knowledge bucket initialization, Pub/Sub and bucket notifications, vector index initialization) are managed as infrastructure as code.
Rather than defining custom pipelines for each use case, the team created a standardized pipeline process. A DAG generator automatically creates the associated Airflow DAG from a YAML configuration file when it’s uploaded to Cloud Composer. This enables essentially low-code/no-code deployment of new RAG pipelines within minutes.
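The DAG-generator idea can be sketched without Airflow: given a parsed YAML configuration, emit the DAG identity and the ordered task chain that a generated Airflow DAG would wire together. The stage and field names are assumptions for illustration.

```python
def generate_dag_spec(config):
    """Turn a parsed pipeline YAML (as a dict) into a minimal DAG spec.

    In the real system the generator would emit an actual Airflow DAG
    inside Cloud Composer; here we return just the DAG id, schedule,
    and linear task chain so the idea is testable without Airflow.
    """
    stages = ["pre_processing", "embedding", "post_processing"]
    # Only wire in the stages this use case actually configures.
    chain = [s for s in stages if s in config]
    return {
        "dag_id": f"rag-pipeline-{config['use_case']}",
        "schedule": config.get("schedule", "@daily"),
        "tasks": chain,
        # Each stage depends on the one before it.
        "dependencies": list(zip(chain, chain[1:])),
    }
```

Because the generator is driven entirely by the uploaded configuration, adding a new RAG use case reduces to writing one YAML file, which is what makes the low-code/no-code claim plausible.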
The knowledge base structure draws heavy inspiration from TensorFlow Extended (TFX) and its concept of pipeline routes and experiment runs. Each use case has its own root folder containing:
For batch pipeline runs, timestamped subfolders are created for each run, providing data lineage and provenance. The most recent timestamp folder is synced to a “current” subfolder that the chatbot API reads from. For incremental processing, files in the current folder are modified directly since there’s no concept of timestamped runs in real-time processing.
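The timestamped-run layout can be sketched as follows; the exact folder names (`runs/`, `current/`) are assumptions modeled on the TFX-style structure described, not confirmed paths.

```python
from datetime import datetime, timezone

def run_folder(use_case_root, now=None):
    """Build the timestamped subfolder path for one batch pipeline run."""
    now = now or datetime.now(timezone.utc)
    return f"{use_case_root}/runs/{now:%Y%m%dT%H%M%S}"

def latest_run(run_folders):
    """Pick the most recent run. The timestamp format sorts
    lexicographically, so the lexicographic max is the newest folder;
    its contents are what get synced into the use case's current/
    subfolder for the chatbot API to read."""
    return max(run_folders)
```

Keeping every timestamped run alongside `current/` is what provides the lineage: any past run can be diffed against the live knowledge base to explain a change in chatbot behavior.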
The team categorizes LangChain loaders into document loaders (processing specific URIs/documents atomically) and source loaders (processing undefined numbers of documents from a directory or source). They focus on document loaders because source loaders would require reprocessing the entire knowledge base with no notion of which specific documents changed.
Two methods exist for getting documents into the raw folder:
Librarian Approach: A person or group explicitly performs operations on documents—similar to using a file explorer. This is the simplest solution for low-velocity use cases like HR policies.
Automated Pipeline: For high-velocity sources like frequently updated Confluence pages, an automated pipeline determines which documents have changed at the source and reflects only those changes. The process involves:
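One common way to implement that change-detection step is a content-hash diff against what was last synced; the talk does not specify Bell's mechanism, so the approach below is an illustrative assumption.

```python
import hashlib

def diff_knowledge_base(source_docs, indexed_hashes):
    """Classify source documents as added / updated / deleted by comparing
    content hashes against the previous sync, so that only changed
    documents flow into the incremental pipeline.

    source_docs:    {uri: raw bytes fetched from the source (e.g. Confluence)}
    indexed_hashes: {uri: sha256 hex digest stored at the last sync}
    """
    changes = {"added": [], "updated": [], "deleted": []}
    for uri, content in source_docs.items():
        digest = hashlib.sha256(content).hexdigest()
        if uri not in indexed_hashes:
            changes["added"].append(uri)
        elif digest != indexed_hashes[uri]:
            changes["updated"].append(uri)
    for uri in indexed_hashes:
        if uri not in source_docs:
            changes["deleted"].append(uri)
    return changes
```

Applying only the resulting adds, updates, and deletes to the raw folder then lets the Pub/Sub notifications drive the incremental pipeline for exactly the documents that changed.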
The presenters were candid about trade-offs. When asked about handling partial document updates (re-embedding only changed sections), they acknowledged that they treat documents as atomic units and reprocess them entirely rather than attempting complex chunk-preservation logic, a pragmatic engineering decision given the marginal time savings.
The solution is not open source, but the team indicated willingness to engage with interested parties about implementation details.
While the presentation focuses heavily on infrastructure and operational aspects, it notably lacks detailed discussion of evaluation, testing of RAG quality, or metrics beyond the operational level. The feedback loop mentioned (thumbs up/down on chatbot responses) suggests iteration on chunking parameters, but the systematic approach to measuring and improving RAG performance is not elaborated.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
Octus, a leading provider of credit market data and analytics, migrated their flagship generative AI product Credit AI from a multi-cloud architecture (OpenAI on Azure and other services on AWS) to a unified AWS architecture using Amazon Bedrock. The migration addressed challenges in scalability, cost, latency, and operational complexity associated with running a production RAG application across multiple clouds. By leveraging Amazon Bedrock's managed services for embeddings, knowledge bases, and LLM inference, along with supporting AWS services like Lambda, S3, OpenSearch, and Textract, Octus achieved a 78% reduction in infrastructure costs, 87% decrease in cost per question, improved document sync times from hours to minutes, and better development velocity while maintaining SOC2 compliance and serving thousands of concurrent users across financial services clients.
Climate tech startups are leveraging Amazon SageMaker HyperPod to build specialized foundation models that address critical environmental challenges including weather prediction, sustainable material discovery, ecosystem monitoring, and geological modeling. Companies like Orbital Materials and Hum.AI are training custom models from scratch on massive environmental datasets, achieving significant breakthroughs such as tenfold performance improvements in carbon capture materials and the ability to see underwater from satellite imagery. These startups are moving beyond traditional LLM fine-tuning to create domain-specific models with billions of parameters that process multimodal environmental data including satellite imagery, sensor networks, and atmospheric measurements at scale.