ZenML

Production-Scale RAG System for Real-Time News Processing and Analysis

Emergent Methods 2023

Emergent Methods built a production-scale RAG system processing over 1 million news articles daily, using a microservices architecture to deliver real-time news analysis and context engineering. The system combines multiple open-source tools including Qdrant for vector search, vLLM for GPU-optimized inference, and their own FlowDapt framework for orchestration, addressing challenges in news freshness, multilingual processing, and hallucination prevention while maintaining low latency and high availability.

Industry

Media & Entertainment

Overview

Emergent Methods, founded by Robert (a scientist turned entrepreneur), has built a production-grade system for processing and contextualizing news at scale. The company’s flagship product, AskNews (asknews.app), adaptively models over 1 million news articles per day, providing users with timely, diverse, and accurately sourced news information. This case study, presented in a discussion format between Robert and the host Demetrios, offers deep insights into the technical architecture and operational considerations of running LLMs in production for real-time news processing.

The core problem Emergent Methods addresses is the inadequacy of general-purpose LLMs for delivering accurate, timely news. As demonstrated in the presentation, ChatGPT Plus with Bing search takes considerable time to find articles and often returns outdated information—in one example, returning a 25-day-old article when asked for current news on Gaza. Robert characterizes this as “borderline dangerous,” particularly for sensitive topics requiring accuracy and recency.

The Case for Context Engineering

Emergent Methods coined the term “Context Engineering” to describe their approach to news processing. The traditional NLP pipeline before the advent of capable LLMs like LLaMA 2 involved chunking text into 512-token segments, running them through translation models, using DistilBART for summarization, performing sentence extraction, and maybe adding text classification for sentiment. While functional, this approach was rigid and couldn’t adapt to evolving product requirements.
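The rigidity of that older pipeline is easy to see in code. Below is a minimal sketch of the fixed-stage approach described above; the whitespace tokenizer and the placeholder stages are illustrative stand-ins, not Emergent Methods' actual implementation.

```python
# Sketch of the pre-LLM pipeline: rigid 512-token chunks fed through
# separate single-purpose models. Each stage is a distinct model that
# must be swapped or retrained when requirements change.

def chunk_tokens(text: str, max_tokens: int = 512) -> list[str]:
    """Split text into fixed chunks of at most max_tokens whitespace tokens."""
    tokens = text.split()
    return [
        " ".join(tokens[i : i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]

def legacy_pipeline(article: str) -> list[dict]:
    results = []
    for chunk in chunk_tokens(article):
        # Each step below would call a dedicated model in the real pipeline:
        # translated = translation_model(chunk)
        # summary = distilbart_summarizer(translated)
        # sentiment = sentiment_classifier(summary)
        results.append({"chunk": chunk})  # placeholder for the enriched record
    return results
```

Adding a new output (say, entity extraction) to this design means deploying and wiring in yet another model, which is exactly the inflexibility the new paradigm removes.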

The new paradigm enables reading full articles, extracting rich context, flexible output generation, translation, summarization, and custom extraction—all configurable through prompt modifications rather than pipeline rewrites. This flexibility is crucial for a production system where requirements evolve based on user feedback and new use cases.
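By contrast, the LLM-based approach can be reconfigured by editing a prompt template. A minimal sketch of the idea follows; the field names and instruction wording are hypothetical, not AskNews' actual prompts.

```python
# Prompt-configurable extraction: changing product requirements means
# editing this dictionary, not rewriting a model pipeline.

EXTRACTION_INSTRUCTIONS = {
    "summary": "Summarize the article in 2-3 sentences.",
    "entities": "List the people, organizations, and places mentioned.",
    "translation": "If the article is not in English, translate the summary.",
}

def build_prompt(article: str, fields: list[str]) -> str:
    """Assemble a single prompt asking the LLM for the requested fields as JSON."""
    tasks = "\n".join(f"- {f}: {EXTRACTION_INSTRUCTIONS[f]}" for f in fields)
    return (
        "Read the following news article and return JSON with these fields:\n"
        f"{tasks}\n\nArticle:\n{article}"
    )
```

One generation call replaces several fixed models, and new fields become a one-line change.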

Robert emphasizes that enforcing journalistic standards requires dedicated resources. When covering global events, users deserve perspectives from multiple regions and languages—for Gaza coverage, understanding what Egypt, France, and Algeria are saying matters. This requires parsing a massive volume of articles to avoid outdated, stale reporting and minimize hallucination, which has particularly high costs in news contexts.

Data Sourcing and Scale

Emergent Methods sources their news from NewsCatcher (newscatcher-ai.com), which provides access to 50,000 different sources. They process news in 5-minute buckets, continuously ingesting and enriching the latest articles. This scale—over 1 million articles daily—demands careful attention to throughput, latency, and resource management.
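The 5-minute bucketing can be sketched as a simple timestamp floor; this is an assumed implementation of the windowing idea, not their actual code.

```python
from datetime import datetime, timezone

def bucket_key(published: datetime, minutes: int = 5) -> datetime:
    """Floor a timestamp to the start of its 5-minute ingestion bucket."""
    floored_minute = published.minute - (published.minute % minutes)
    return published.replace(minute=floored_minute, second=0, microsecond=0)

# Articles sharing a bucket key are ingested and enriched together,
# so the pipeline always works on a bounded, recent window of news.
```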

Technical Architecture

Microservices Philosophy

A central theme throughout the presentation is the commitment to microservices architecture with single responsibility principle. Rather than using all-in-one solutions that sacrifice performance for convenience, Emergent Methods orchestrates purpose-built components that can be independently scaled, updated, or replaced. This modularity positions them to adapt as the rapidly evolving LLM ecosystem produces better alternatives.

Vector Database: Qdrant

Qdrant serves as the vector database for semantic search and forms a cornerstone of their architecture. Robert highlights several Qdrant features critical to their use case, including payload filtering during retrieval and the built-in recommendation API.
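To make the setup concrete, here is an illustrative sketch of the Qdrant REST payloads for an article collection: creating the collection and upserting one article point with its metadata payload. The collection name, vector size, and payload fields are assumptions for illustration.

```python
# Qdrant REST payloads (sent to a Qdrant server, e.g. localhost:6333).
# Create the collection:
#   PUT /collections/news_articles
CREATE_COLLECTION = {"vectors": {"size": 768, "distance": "Cosine"}}

def article_point(point_id: int, vector: list[float], payload: dict) -> dict:
    """Build the upsert body for one article: PUT /collections/news_articles/points."""
    return {"points": [{"id": point_id, "vector": vector, "payload": payload}]}

body = article_point(
    1,
    [0.0] * 768,  # embedding produced by the dedicated embedding service
    {"title": "Example headline", "lang": "en", "published_at": 1700000000},
)
```

Storing metadata like language and timestamp in the payload is what later enables filtered, freshness-aware retrieval.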

LLM Inference: vLLM

For on-premise LLM inference, Emergent Methods uses vLLM, praising its focus on continuous batching and PagedAttention. While Robert admits the GPU memory management internals are “above his pay grade,” the practical benefit is dramatically increased throughput—essential when processing millions of articles.
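From the client side, continuous batching is transparent: vLLM can expose an OpenAI-compatible HTTP server, and independent requests are batched together on the GPU automatically. The sketch below builds such a request; the model name and endpoint are placeholders, not Emergent Methods' deployment details.

```python
import json

def completion_request(prompts: list[str],
                       model: str = "meta-llama/Llama-2-7b-hf") -> dict:
    """Body for vLLM's OpenAI-compatible /v1/completions endpoint.

    The server interleaves many such requests via continuous batching,
    so clients need no batching logic of their own.
    """
    return {"model": model, "prompt": prompts, "max_tokens": 256, "temperature": 0.0}

body = json.dumps(completion_request(["Summarize this article: ..."]))
# e.g. POST http://localhost:8000/v1/completions with this body
```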

Embeddings: Text Embedding Inference (TEI)

Rather than letting the vector database handle embeddings (which some solutions offer for convenience), Emergent Methods uses Hugging Face’s Text Embedding Inference as a dedicated microservice. This follows their single responsibility principle: the database should store and search vectors, not generate them. TEI also provides dynamic batching, valuable when dealing with heterogeneous text lengths. This isolation allows independent resource allocation and scaling.
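Calling TEI as a standalone service keeps the contract simple: the application sends text, the service returns vectors, and the database never sees raw text. A minimal request-building sketch, with the endpoint assumed to run locally:

```python
def embed_request(texts: list[str]) -> dict:
    """Body for TEI's /embed endpoint, which accepts a batch under "inputs".

    TEI applies dynamic batching across concurrent requests server-side,
    which helps when article lengths vary widely.
    """
    return {"inputs": texts, "truncate": True}

body = embed_request(["First article text...", "Second article text..."])
# e.g. POST http://localhost:8080/embed -> list of embedding vectors
```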

Orchestration: FlowDapt

FlowDapt is Emergent Methods’ own open-source orchestration framework, developed over two years and recently released publicly. It runs the AskNews production system and addresses several challenges specific to ML workloads.

Retrieval Optimization: HyDE

The presentation discusses the challenge that user queries are not semantically equivalent to embedded documents. One approach they explore is HyDE (Hypothetical Document Embedding), where the LLM generates a fake article based on the user’s question. This synthetic document is then embedded and used for search, bringing the query representation closer to the document space.

However, Robert notes limitations: computational cost and the fact that the generated content is based on the LLM’s training data, not current information. For handling ambiguous follow-up questions (like “why did they change the rules?”), they use prompt engineering to generate explicit, unambiguous queries based on conversation history.

The overarching goal is staying in a single parameter space—keeping search embeddings as close as possible to document embeddings for optimal retrieval.
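The HyDE flow described above can be sketched in a few lines. The `llm` and `embed` callables here are stand-ins for whatever generation and embedding services are deployed; the prompt wording is hypothetical.

```python
from typing import Callable

HYDE_PROMPT = (
    "Write a short news article that would directly answer this question:\n{question}"
)

def hyde_query_vector(
    question: str,
    llm: Callable[[str], str],
    embed: Callable[[str], list[float]],
) -> list[float]:
    """Embed a hypothetical article instead of the raw user question."""
    hypothetical_article = llm(HYDE_PROMPT.format(question=question))
    # Searching with this vector keeps the query in the same parameter
    # space as the embedded documents.
    return embed(hypothetical_article)
```

The costs Robert notes are visible here too: every search pays an extra LLM generation, and the synthetic article reflects the model's training data rather than today's news.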

Hybrid Model Strategy

The architecture employs both proprietary remote models (like OpenAI’s offerings) and on-premise LLMs served through vLLM. This hybrid approach balances cost, latency, and capability.

Robert acknowledges advantages and disadvantages to each approach without prescribing a universal solution.
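One common way to operationalize such a hybrid is a simple router that sends high-volume enrichment work to the on-premise models and reserves remote proprietary models for lower-volume, quality-critical calls. The task names and split below are hypothetical, not a description of AskNews' actual routing.

```python
# Hypothetical routing for a hybrid model strategy: bulk enrichment
# stays on-prem (cheap at scale), everything else goes to a remote API.

ON_PREM = "on_prem_vllm"
REMOTE = "remote_api"

BULK_TASKS = {"summarize", "translate", "extract_entities"}

def route(task: str) -> str:
    """Pick a model backend for a task by its cost/quality profile."""
    return ON_PREM if task in BULK_TASKS else REMOTE
```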

Production Considerations

Latency Optimization

In production environments handling high traffic volumes with simultaneous requests hitting multiple services, latency becomes critical. Beyond the network-level optimization of co-locating services in Kubernetes, the application layer uses asynchronous programming. While some services are written in Go for performance, ML-focused endpoints use FastAPI with Pydantic v2 (whose validation core is implemented in Rust), providing the throughput of a highly parallelized environment with strong guarantees around validation and immutability.
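The payoff of the async pattern is that a request fans out to several microservices concurrently rather than serially, so latency tracks the slowest dependency instead of the sum. A self-contained sketch with simulated service calls:

```python
import asyncio

async def call_service(name: str, delay: float) -> str:
    await asyncio.sleep(delay)  # stands in for an HTTP call to a microservice
    return f"{name}:ok"

async def handle_request() -> list[str]:
    # Concurrent fan-out: total latency is ~max(delays), not their sum.
    return await asyncio.gather(
        call_service("embeddings", 0.01),
        call_service("vector_search", 0.02),
        call_service("llm", 0.03),
    )

results = asyncio.run(handle_request())
```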

Data Freshness and Updates

A key discussion point addresses keeping vector databases current. For news, this involves timestamp-based filtering during retrieval and careful metadata management during ingestion. For other use cases like HR policy documents that update quarterly, Robert suggests maintaining clean databases by removing outdated information rather than accumulating document versions that could confuse retrieval.
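Timestamp-based filtering maps directly onto a payload range filter in the vector search request. The sketch below builds such a Qdrant search body; the field name `published_at` (stored as a unix epoch) is an assumption for illustration.

```python
from datetime import datetime, timedelta, timezone

def fresh_search_body(query_vector: list[float],
                      hours: int = 24, limit: int = 10) -> dict:
    """Qdrant search body restricted to articles newer than the cutoff."""
    cutoff = (datetime.now(timezone.utc) - timedelta(hours=hours)).timestamp()
    return {
        "vector": query_vector,
        "filter": {"must": [{"key": "published_at", "range": {"gte": cutoff}}]},
        "limit": limit,
        "with_payload": True,
    }
# e.g. POST /collections/news_articles/points/search with this body
```

Because the filter runs inside the database, stale articles never reach the LLM context at all.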

Future Capabilities: Recommendations

Emergent Methods plans to leverage Qdrant’s recommendation system to let users personalize their news experience. By allowing users to indicate what they like and don’t like, the system can outsource recommendation logic to Qdrant, building user profiles that suggest relevant stories without custom recommendation engineering.
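Qdrant's recommend API takes liked and disliked point ids directly, which is what lets the recommendation logic be outsourced to the database. A minimal request-body sketch (collection name and ids are placeholders):

```python
def recommend_body(liked_ids: list[int],
                   disliked_ids: list[int], limit: int = 5) -> dict:
    """Body for Qdrant's recommend endpoint:
    POST /collections/news_articles/points/recommend.

    Qdrant scores points toward the liked examples and away from the
    disliked ones, so no separate recommendation model is needed.
    """
    return {"positive": liked_ids, "negative": disliked_ids, "limit": limit}

body = recommend_body([101, 205], [307])
```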

Startup Advantage Perspective

Robert closes with observations on why startups are well-positioned in the current LLM landscape. Best practices remain unestablished because the technological paradigm is underexplored—no one fully knows the limits. Meanwhile, incumbents face resistance to change, legacy product maintenance, and market expectations that constrain experimentation. Startups can build everything around reasoning engines, embracing modularity and adaptability. While perhaps opinionated, this perspective reflects the approach Emergent Methods has taken: building flexible, composable systems that can evolve with rapidly advancing technology.

Product Demonstration

The discussion includes a live walkthrough of asknews.app, demonstrating features like filtering by sentiment (positive news), region (Europe, Americas), and category (sports, finance). The interface shows source citations, coverage trends, and where stories are in their coverage cycle. User accounts enable starring stories for tracking ongoing narratives, with personalized recommendations as a planned enhancement.
