## Overview
Factiva is a business intelligence platform owned by Dow Jones (part of News Corporation) that has been aggregating and licensing content for nearly three decades. The platform provides curated business intelligence from thousands of content sources across 200 countries in over 30 languages, serving corporate clients, financial services firms, and professional researchers. In November 2024, Factiva launched "Smart Summaries," an AI-powered feature that allows users to query the service using natural language and receive summarized responses with relevant sources—all built on content that has been explicitly licensed for generative AI use.
Tracy Mabery, General Manager of Factiva, discussed the implementation in a podcast interview, providing insight into how a legacy business intelligence platform has navigated the transition to generative AI while maintaining its core principles around content licensing, publisher relationships, and data security.
## The Pre-AI Foundation
Before generative AI, Factiva was already built on machine learning and AI foundations. The platform was originally created as a joint venture to serve both qualitative and quantitative research needs in financial services. Core capabilities included advanced Boolean query operators, semantic filtering, rules management, and contextual search (distinguishing, for example, between "Apple" as a company versus "apple" as a fruit). The platform also employed human experts who partnered with corporate clients to build sophisticated search queries.
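The contextual search capability described above amounts to matching tagged entities rather than raw keywords. The sketch below is a hypothetical illustration of that idea only; the `Article` fields and the `APPLE_INC` tag are invented for the example and do not reflect Factiva's actual schema or query syntax.

```python
# Hypothetical sketch of contextual (entity-aware) search: disambiguating
# "Apple" the company from "apple" the fruit by matching an entity tag rather
# than the literal keyword. Field names and tags are illustrative only.
from dataclasses import dataclass, field


@dataclass
class Article:
    headline: str
    body: str
    entities: set[str] = field(default_factory=set)  # e.g. tagged company identifiers


def apple_the_company(corpus: list[Article]) -> list[Article]:
    """Return articles tagged with the company entity, so pieces about the
    fruit never surface for a query about the company."""
    return [a for a in corpus if "APPLE_INC" in a.entities]
```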
This technical foundation proved crucial for the generative AI transition. The existing infrastructure for semantic search, content attribution, and royalty tracking provided the scaffolding upon which the new AI capabilities were built. Factiva's corpus includes nearly three billion articles dating back to 1944, representing a massive archive that needed to be indexed and made searchable for the new AI features.
## Technical Architecture and Implementation
Factiva selected Google Gemini on Google Cloud as their foundation model and infrastructure partner. According to Mabery, the selection was driven primarily by security considerations, but also by Google's ability to handle the scale of Factiva's content corpus. The decision-making process included Google providing detailed network architecture diagrams showing exactly how the infrastructure would be built and secured.
The implementation uses a closed ecosystem approach in which only content licensed for generative AI use feeds the summarization engine. This is essentially a Retrieval-Augmented Generation (RAG) architecture whose retrieval step is limited to the licensed corpus. The same semantic search layer powers both the generative summarization features and the platform's non-generative search, so the two share a single retrieval infrastructure.
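As a rough illustration of what such a closed-ecosystem RAG pipeline implies, the sketch below restricts retrieval to documents flagged as licensed for generative AI before ranking and prompt construction. All names here (the `licensed_for_genai` flag, the document fields, the prompt wording) are assumptions for the sketch, not Factiva's or Google's actual APIs.

```python
# Minimal RAG sketch of a closed-ecosystem approach: only licensed documents
# are eligible for retrieval, and the prompt instructs the model to summarize
# nothing beyond the supplied passages. Names are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class Document:
    source_id: str
    text: str
    embedding: list[float]
    licensed_for_genai: bool  # only these documents may feed summarization


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm if norm else 0.0


def retrieve(query_emb: list[float], corpus: list[Document], k: int = 10) -> list[Document]:
    """Restrict retrieval to the licensed corpus, then rank by similarity."""
    licensed = [d for d in corpus if d.licensed_for_genai]
    return sorted(licensed, key=lambda d: cosine(query_emb, d.embedding), reverse=True)[:k]


def build_prompt(question: str, docs: list[Document]) -> str:
    """Ask the model to summarize only the supplied passages, with citations."""
    context = "\n\n".join(f"[{d.source_id}] {d.text}" for d in docs)
    return (
        "Summarize the answer to the question using ONLY the passages below. "
        "Cite source IDs in brackets.\n\n"
        f"Question: {question}\n\nPassages:\n{context}"
    )
```

The resulting prompt would then be sent to the selected foundation model (Gemini, in Factiva's case); the model call itself is omitted here to avoid implying a specific SDK.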
When Factiva approached Google about indexing nearly three billion articles for their private cloud, it represented a significant technical undertaking. Mabery noted that even Google found the "three billion with a B" figure notable, suggesting the scale required careful infrastructure planning.
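Indexing an archive of that size implies streaming the corpus through embedding and upsert steps in bounded batches rather than loading it at once. The sketch below shows that batching pattern under assumed interfaces: `embed_batch` and `vector_index.upsert` are hypothetical placeholders, not the actual embedding model or vector store used.

```python
# Illustrative batching sketch for indexing a very large archive. The
# embed_batch() callable and vector_index.upsert() method are hypothetical
# placeholders for whatever embedding model and vector store are actually used.
from typing import Iterable, Iterator


def chunked(items: Iterable[dict], size: int) -> Iterator[list[dict]]:
    """Yield fixed-size batches from a (possibly huge) stream of articles."""
    batch: list[dict] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def index_archive(articles: Iterable[dict], embed_batch, vector_index, batch_size: int = 512) -> None:
    """Stream the archive in batches so memory stays bounded even when the
    corpus runs into the billions of documents."""
    for batch in chunked(articles, batch_size):
        vectors = embed_batch([a["text"] for a in batch])
        vector_index.upsert(
            ids=[a["id"] for a in batch],
            vectors=vectors,
            metadata=[{"source_id": a["source_id"], "pub_date": a["pub_date"]} for a in batch],
        )
```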
The company maintains an agnostic stance toward both models and cloud providers. While Google was selected for Smart Summaries, Dow Jones broadly aims to pick "best of breed" solutions for each specific use case. There was no explicit preference stated for proprietary versus open-source models—security, capability, and fit for purpose appear to be the driving factors.
## Content Licensing Strategy
Perhaps the most distinctive aspect of Factiva's approach is their publisher-by-publisher content licensing strategy. Rather than following the approach of many frontier AI companies that trained on publicly available internet data, Factiva went to each of their thousands of publishers to secure explicit generative AI licensing rights.
This process involved significant education, as many publishers were unfamiliar with AI terminology and concepts in the early days. Terms like "RAG model," "retrieval augmented generation," "agents," and "hallucinations" were new to the publishing community. The conversations addressed three main concerns:
- **Security**: Publishers wanted assurance that their content would remain secure and not be sent elsewhere or combined with unlicensed content
- **Content integrity**: Ensuring that only licensed content would be used in summarizations, with no external content "wrapping around" the licensed material and no assumptions introduced beyond what the sources state
- **Compensation**: Understanding how royalty structures would work in the new AI context
Factiva leveraged its existing royalty and attribution infrastructure, which has been tracking content usage and compensating publishers for nearly three decades. The concept of "royalty moments" was extended to generative AI, creating new compensation opportunities for publishers who opt into AI licensing.
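One way to picture extending "royalty moments" to generative AI is to record which licensed sources contributed to each summary and emit a compensation event per contributing source. The sketch below is a loose illustration under that assumption; the event fields and the `genai_summary` usage type are invented for the example, not Factiva's actual royalty schema.

```python
# Hypothetical sketch of per-use "royalty moments" extended to generative AI:
# each summary records its contributing licensed sources, and one royalty
# event is appended to a ledger per source. Field names are illustrative.
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class RoyaltyEvent:
    publisher_id: str
    source_id: str
    usage_type: str      # e.g. "genai_summary" vs. a traditional "article_view"
    occurred_at: datetime


def emit_royalty_moments(summary_sources: list[dict], ledger: list[RoyaltyEvent]) -> None:
    """Append one genai_summary royalty event for every source cited in a summary."""
    now = datetime.now(timezone.utc)
    for src in summary_sources:
        ledger.append(
            RoyaltyEvent(
                publisher_id=src["publisher_id"],
                source_id=src["source_id"],
                usage_type="genai_summary",
                occurred_at=now,
            )
        )
```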
The results have been impressive from a partnership perspective. At launch in November 2024, nearly 4,000 publishers had signed generative AI licensing agreements, up from roughly 2,000 six months prior. By the time of the interview (early 2025), the number had grown to nearly 5,000 sources.
## Hallucination Mitigation and Content Integrity
The closed ecosystem approach serves as a primary defense against hallucinations. Because the generative AI can only draw from licensed content within Factiva's corpus, it cannot hallucinate information from the broader internet or its training data. This is described as more of a "language task" than a "knowledge task"—the AI summarizes and synthesizes existing content rather than generating novel claims from parametric knowledge.
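One way to operationalize this framing, beyond constraining retrieval, is a post-hoc grounding check that flags summary sentences with little support in the retrieved passages. The sketch below uses simple lexical overlap for clarity; it is not Factiva's method, and the threshold is an arbitrary illustrative value.

```python
# Not Factiva's method: a simple post-hoc grounding check that flags summary
# sentences with low token overlap against the retrieved licensed passages.
import re


def _tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def unsupported_sentences(summary: str, passages: list[str], threshold: float = 0.3) -> list[str]:
    """Return summary sentences whose best token overlap with any passage
    falls below the threshold -- candidates for review or removal."""
    passage_tokens = [_tokens(p) for p in passages]
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", summary):
        sent_tokens = _tokens(sentence)
        if not sent_tokens:
            continue
        best = max((len(sent_tokens & pt) / len(sent_tokens) for pt in passage_tokens), default=0.0)
        if best < threshold:
            flagged.append(sentence)
    return flagged
```

In practice an embedding-based or NLI-based check would be more robust than lexical overlap, but the control flow is the same: every generated claim is tested against the licensed sources it is supposed to summarize.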
The system emphasizes three core search principles: relevancy, recency, and context. These guide how content is surfaced and summarized. Factiva provides explicit disclaimers noting that summaries are generative AI-initiated, along with detailed technical explanations of how the search algorithm surfaces information.
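The interview does not describe how these three principles are weighted, so the sketch below is only one plausible way to blend them into a single ranking score. The weights, the exponential recency decay, and the `context_match` signal are all assumptions for illustration.

```python
# Illustrative ranking sketch combining the three stated principles
# (relevancy, recency, context). Weights and the context_match signal are
# assumptions, not Factiva's actual search algorithm.
from datetime import datetime, timezone


def rank_score(relevance: float, published_at: datetime, context_match: float,
               half_life_days: float = 30.0,
               weights: tuple[float, float, float] = (0.6, 0.25, 0.15)) -> float:
    """Blend semantic relevance, exponential recency decay, and a context
    signal (e.g. a match against the user's industry or watchlist)."""
    age_days = (datetime.now(timezone.utc) - published_at).total_seconds() / 86400
    recency = 0.5 ** (age_days / half_life_days)   # halves every half_life_days
    w_rel, w_rec, w_ctx = weights
    return w_rel * relevance + w_rec * recency + w_ctx * context_match
```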
Mabery acknowledged that hallucinations remain an industry-wide challenge and that user tolerance for AI errors may be evolving. She noted the example of Apple Intelligence having to pull a feature due to errors, suggesting that egregious mistakes still require addressing even as users become more accustomed to AI limitations.
## Governance and Ethical Principles
Factiva's approach to generative AI is guided by three core principles:
- **Publisher-first mentality**: As a publisher itself through Dow Jones properties like the Wall Street Journal, Factiva treats intellectual property protection and mutual respect for IP as foundational
- **Arbiter responsibility**: Factiva sees itself as a convening power for the publishing industry, responsible for educating and protecting publishers who have been part of their ecosystem for decades
- **Secure innovation**: Advancing AI capabilities while maintaining governance, ethics, and security standards
The company explicitly states that they only create licensing deals they would sign themselves, suggesting alignment between their own publishing interests and what they ask of partners.
## Production Considerations and Operational Notes
Several LLMOps-relevant observations emerge from the case study:
- **Legacy system integration**: Factiva demonstrates how a 25-year-old platform can integrate generative AI by building on existing semantic search, attribution, and royalty infrastructure
- **Vendor selection process**: The decision to use Google Gemini was based on security demonstrations including network architecture diagrams, scale capability, and partner transparency
- **Incremental rollout**: Starting with Smart Summaries as the initial generative feature, with personalization and potentially agents on the roadmap
- **Separate content streams**: Managing generative AI-licensed content separately from the broader corpus, with distinct royalty flows for each
- **Rapid scaling**: Growing from roughly 2,000 to nearly 4,000 licensed sources in about six months, and to nearly 5,000 by early 2025, demonstrates the scalability of the publisher engagement model
## Future Roadmap
While specific details were not confirmed, the interview hinted at several future directions:
- Enhanced personalization based on tone and user preferences
- Conversational AI capabilities (referencing "Joanna bot" created by WSJ reporter Joanna Stern as an example of conversational AI experimentation at Dow Jones)
- Agent-based capabilities, though Mabery declined to confirm specifics
The trajectory suggests Factiva views generative AI as moving from "wish list" to "table stakes" for business intelligence platforms, with continued acceleration of AI capabilities expected.
## Critical Assessment
While the case study presents a compelling model for responsible AI deployment in content aggregation, several considerations merit attention:
- The licensing-first approach, while ethically sound, may limit content coverage compared to competitors using broader training approaches
- The "closed ecosystem" claim would benefit from more technical detail about whether any model pre-training on external data could influence outputs
- Long-term sustainability of publisher-by-publisher licensing as the number of AI applications grows remains to be seen
- Hallucination rates and evaluation metrics were not quantified
Nevertheless, Factiva's approach represents a significant case study in building production AI systems with explicit attention to content rights, publisher relationships, and transparent compensation—areas where many AI deployments have faced criticism.