# Perplexity AI: Building an AI-Powered Research Assistant
## Overview
This case study, presented by a founder of Perplexity AI at a technical talk, chronicles the journey of building a consumer AI search product that combines large language models with real-time web search. The company, incorporated in August 2022, pivoted from enterprise text-to-SQL work to building what they describe as "the world's best research assistant." The talk provides candid insights into the product evolution, technical architecture decisions, model selection strategies, and the operational challenges of running LLMs in production at scale.
Perplexity AI's core insight was that traditional search engines like Google provide links and basic factual answers, but struggle with complex queries that require synthesis and reasoning. By orchestrating LLMs with live search indices, they aimed to provide direct, cited answers to nuanced questions—essentially creating a new product category rather than competing directly for existing market share.
## Product Evolution and Rapid Iteration
The company's journey exemplifies the lean startup approach applied to LLMOps. They initially built a Slack bot for internal use, addressing their own questions about company building, coding, and enterprise software (HubSpot, Salesforce query languages). This internal dogfooding proved invaluable—they used the product to answer questions like "how to start Uber server in Ruby" or "what is SOQL" while building their SQL-based enterprise search prototype.
A critical turning point came when their CTO "casually added Bing search and summarization," enabling the system to answer real-time questions about current events. This RAG (Retrieval-Augmented Generation) architecture became foundational. They tested with a Discord bot before launching a web product, deliberately choosing platforms with existing user bases for rapid feedback.
The timeline is instructive: incorporated August 3rd, internal Slack bot by September 27th, and public launch shortly after ChatGPT's November 2022 debut. Going from company formation to a production product in roughly four months demonstrates aggressive shipping velocity.
## Technical Architecture: Orchestration as Core Competency
The speaker emphasizes that "orchestration" of different components—search index, LLM, conversational rendering, and multimodal capabilities—is non-trivial and represents genuine technical differentiation. They reference Steve Jobs' quote about "playing the orchestra" to argue that being a "wrapper" company still requires deep technical expertise.
They provide concrete examples of competitor failures in orchestration: Google Bard's extensions failing to properly query Gmail when asked about flight history, and You.com's YouChat returning weather widgets when asked for whale songs. The speaker's thesis is that connecting to APIs and data sources is easy to announce but difficult to execute reliably at inference time.
The core architecture involves:
- Custom search indices with proprietary page ranking algorithms
- LLM-based relevance ranking for selecting which sources to cite
- Conversational context management for multi-turn interactions
- Dynamic follow-up question generation
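To make the orchestration concrete, here is a minimal retrieval-augmented generation loop in the spirit of the architecture above. This is a sketch, not Perplexity's implementation: the injected `search`, `rerank`, and `llm` callables, the prompt wording, and the top-k values are all assumptions chosen for illustration.

```python
from dataclasses import dataclass

@dataclass
class Source:
    url: str
    title: str
    snippet: str

def answer(query: str, history: list[str], search, rerank, llm) -> str:
    """Minimal RAG orchestration: retrieve, rerank, then generate a cited answer.

    `search`, `rerank`, and `llm` are injected callables standing in for a
    search index, an LLM-based relevance ranker, and a chat model.
    """
    # 1. Retrieve candidate documents from the search index.
    candidates: list[Source] = search(query, top_k=20)

    # 2. Use an LLM (or cross-encoder) to keep only the most relevant sources.
    sources: list[Source] = rerank(query, candidates, keep=5)

    # 3. Build a prompt carrying conversational context and numbered sources.
    context = "\n".join(f"[{i+1}] {s.title}: {s.snippet}" for i, s in enumerate(sources))
    history_text = "\n".join(history)
    prompt = (
        "Answer the question using only the sources below. "
        "Cite sources inline as [n].\n\n"
        f"Conversation so far:\n{history_text}\n\n"
        f"Sources:\n{context}\n\nQuestion: {query}"
    )

    # 4. Generate the cited answer (follow-up question generation would be a
    #    second, cheaper LLM call over the same context).
    return llm(prompt)
```

In practice each stage is a separate service with its own latency budget, which is where the "orchestration is hard" argument bites.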
## Fine-Tuning Strategy for Cost and Latency Optimization
One of the most detailed technical sections covers their approach to model optimization through fine-tuning. They launched a feature called "Copilot" that uses "generative UI"—dynamically generating clarifying questions with appropriate UI elements (multiple choice, single choice) based on the query type before providing answers.
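One plausible way to implement such a generative-UI step is to have the model emit a structured clarifying-question object that the frontend renders as the appropriate widget. The schema and example below are assumptions for illustration, not Perplexity's actual format.

```python
import json

# Hypothetical schema the model is prompted (or constrained) to follow
# before producing its final answer.
clarifier_schema = {
    "type": "object",
    "properties": {
        "question": {"type": "string"},
        "ui": {"enum": ["single_choice", "multiple_choice", "free_text"]},
        "options": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["question", "ui"],
}

# Example of what the model might return for "what laptop should I buy?"
example = {
    "question": "What will you mainly use the laptop for?",
    "ui": "single_choice",
    "options": ["Gaming", "Programming", "Video editing", "Everyday browsing"],
}
assert set(example) >= set(clarifier_schema["required"])
print(json.dumps(example, indent=2))
```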
Initially this ran on GPT-4, but they matched its quality at much lower cost and latency by fine-tuning GPT-3.5:
**Performance Metrics:**
- 80% of users cannot distinguish between fine-tuned GPT-3.5 and GPT-4
- 8% prefer the fine-tuned model
- 11% still prefer GPT-4 (room for improvement)
**Economic Impact:**
- Prompt tokens: 3x lower cost with GPT-3.5 vs GPT-4
- Output tokens: 6x lower cost
- Significant improvements in both throughput and time-to-first-token latency
The speaker emphasizes that at their query volume, these cost differences are substantial. Moreover, the latency improvements materially affect user experience, especially on mobile devices with poor connectivity.
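A back-of-the-envelope calculation shows how those ratios compound at volume. Only the 3x and 6x multipliers come from the talk; the query volume, per-query token counts, and absolute prices below are hypothetical placeholders.

```python
# Hypothetical per-query token counts and illustrative prices (USD per 1K tokens).
QUERIES_PER_DAY = 1_000_000          # placeholder volume
PROMPT_TOKENS, OUTPUT_TOKENS = 2_000, 300

gpt4 = {"prompt": 0.03, "output": 0.06}              # illustrative prices
ft_gpt35 = {"prompt": 0.03 / 3, "output": 0.06 / 6}  # 3x / 6x cheaper, per the talk

def daily_cost(prices: dict) -> float:
    """Daily spend at the placeholder volume and token counts above."""
    return QUERIES_PER_DAY * (
        PROMPT_TOKENS / 1000 * prices["prompt"]
        + OUTPUT_TOKENS / 1000 * prices["output"]
    )

print(f"GPT-4:              ${daily_cost(gpt4):,.0f}/day")
print(f"Fine-tuned GPT-3.5: ${daily_cost(ft_gpt35):,.0f}/day")
```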
This fine-tuning work was enabled by OpenAI's GPT-3.5 fine-tuning API, and the speaker notes they shipped this faster than competitors. The approach represents a sophisticated LLMOps strategy: use the best model (GPT-4) to establish quality baselines and generate training data, then distill into smaller, faster, cheaper models for production.
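That distillation loop maps naturally onto OpenAI's fine-tuning API. The outline below is a generic sketch using the current OpenAI Python SDK, not Perplexity's pipeline; the file names, prompt sampling, and teacher/student model choices are assumptions.

```python
import json
from openai import OpenAI

client = OpenAI()

def make_example(prompt: str) -> dict:
    """Generate one training example by asking the teacher model (GPT-4)."""
    teacher = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return {
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": teacher.choices[0].message.content},
        ]
    }

# 1. Build a JSONL training file from sampled production queries (placeholder list).
prompts = ["..."]
with open("train.jsonl", "w") as f:
    for p in prompts:
        f.write(json.dumps(make_example(p)) + "\n")

# 2. Upload the file and launch a GPT-3.5 fine-tuning job (the "student").
training_file = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
print(job.id)  # poll client.fine_tuning.jobs.retrieve(job.id) until it completes
```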
## Custom Inference Infrastructure for Open Source Models
Perplexity also invested in serving open-source models, specifically Llama, through their "Perplexity Labs" offering. They claim to have the fastest Llama inference among competitors including Hugging Face and Replicate, publishing metrics on tokens per second and time-to-first-token.
Key technical approaches mentioned:
- Flash attention implementation
- NVIDIA FasterTransformer techniques
- PagedAttention
- Custom CUDA-level optimizations
- Building their own inference stack rather than relying on Hugging Face or vLLM
The rationale for custom infrastructure is control: they need to optimize for their specific use case and cannot wait for generic frameworks to implement necessary optimizations.
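As a rough illustration of how tokens-per-second and time-to-first-token figures like those above are measured, here is a minimal benchmarking sketch. The `stream_tokens` callable stands in for whatever streaming client the serving stack exposes and is an assumption, not a real API.

```python
import time
from typing import Callable, Iterator

def benchmark(stream_tokens: Callable[[str], Iterator[str]], prompt: str) -> dict:
    """Measure time-to-first-token and decode throughput for one streamed request."""
    start = time.perf_counter()
    first_token_at = None
    count = 0
    for _ in stream_tokens(prompt):
        now = time.perf_counter()
        if first_token_at is None:
            first_token_at = now  # first token marks the end of prefill latency
        count += 1
    end = time.perf_counter()
    return {
        "ttft_s": first_token_at - start if first_token_at else None,
        "tokens_per_s": count / (end - first_token_at) if count > 1 else None,
        "total_tokens": count,
    }
```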
They also launched a supervised fine-tuned Llama model (llama-2-13b-sft) designed to be more useful and less overly restrictive than the base Llama 2 chat model. The base model would refuse innocuous requests like "how to kill a Linux process" due to excessive safety tuning; their fine-tuned version aims to be more practical.
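A hedged sketch of how such helpfulness-oriented SFT data might be assembled follows; the examples and file format are illustrative only, not Perplexity's actual training set.

```python
import json

# Illustrative instruction/response pairs that teach the model to answer
# innocuous technical questions directly instead of refusing.
sft_examples = [
    {
        "prompt": "How do I kill a Linux process?",
        "response": "Find its PID with `pgrep <name>` or `ps aux | grep <name>`, "
                    "then run `kill <PID>` (or `kill -9 <PID>` if it ignores SIGTERM).",
    },
    {
        "prompt": "How do I drop a table in PostgreSQL?",
        "response": "Run `DROP TABLE table_name;` (add `IF EXISTS` to avoid an error "
                    "when the table is missing).",
    },
]

with open("sft_data.jsonl", "w") as f:
    for ex in sft_examples:
        f.write(json.dumps(ex) + "\n")
```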
## Multi-Model Strategy and Provider Diversification
The product architecture supports multiple model providers:
- OpenAI GPT-4 and GPT-3.5 (including fine-tuned variants)
- Anthropic Claude 2 (specifically for file upload and long context use cases)
- Their own fine-tuned and served models
This multi-model approach serves several purposes:
- Different models excel at different tasks (Claude 2 for document analysis)
- Reduces dependency on any single provider
- Provides pricing leverage and business continuity
- Allows A/B testing and quality comparison
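A simplified router capturing this kind of multi-model dispatch might look like the following. The routing rules, thresholds, and model identifiers are assumptions for illustration, not Perplexity's actual policy.

```python
def pick_model(query: str, has_file_upload: bool, copilot_enabled: bool) -> str:
    """Toy routing policy over multiple providers (illustrative rules and names)."""
    if has_file_upload:
        # Uploaded documents go to a long-context model.
        return "claude-2"
    if copilot_enabled:
        # The clarifying-question flow runs on the distilled model.
        return "ft:gpt-3.5-turbo:copilot"  # hypothetical fine-tune ID
    if len(query) > 400:
        # Long, open-ended questions fall back to the strongest general model.
        return "gpt-4"
    # Default path: self-hosted fine-tuned Llama for cost and latency.
    return "llama-2-13b-sft"
```

Keeping this decision in one place also makes A/B tests and provider failover a configuration change rather than a code change.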
The speaker explicitly acknowledges the business risk of not controlling model pricing: "if you don't control the pricing of something, it's always in a tricky position." This drives their investment in open-source model serving and custom fine-tuning.
## Evaluation and Quality Metrics
Stanford researchers evaluated generative search engines on two axes:
- Perceived utility (user experience quality)
- Citation F1 score (precision and recall of source attribution)
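Citation F1 combines citation precision (does each citation actually support its statement?) and citation recall (is each generated statement backed by its citations?). A minimal sketch of the aggregate metric, assuming per-statement support judgments are already available:

```python
def citation_f1(supported_citations: int, total_citations: int,
                supported_claims: int, total_claims: int) -> float:
    """F1 over citation precision and recall (simplified from the Stanford-style
    verifiability evaluation referenced above)."""
    precision = supported_citations / total_citations if total_citations else 0.0
    recall = supported_claims / total_claims if total_claims else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: 8 of 10 citations support their statements,
# and 9 of 12 generated statements are backed by their citations.
print(round(citation_f1(8, 10, 9, 12), 3))  # 0.774
```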
Perplexity achieved top performance on both metrics, notably while using only GPT-3.5 when competitor Bing was using GPT-4. This validates their orchestration approach—model capability is one factor, but retrieval quality, citation accuracy, and overall system design matter significantly.
The speaker suggests re-running evaluations with current models would show even better results, indicating continuous improvement in their pipeline.
## Platform Evolution and User Retention
Beyond point-in-time search, Perplexity launched "Collections" to become a persistent platform:
- Saved searches and research sessions
- Collaboration features
- Privacy controls (public vs private)
- Use case organization (travel planning, coding research, company research)
This represents an LLMOps pattern of moving from stateless tool to stateful platform, increasing user stickiness and enabling richer personalization over time.
## Key LLMOps Lessons
Several operational lessons emerge from this case study:
**Rapid Experimentation:** The company's identity is "shipping fast." They launched multiple experiments quickly, measured usage, and doubled down on what worked. Their SQL search product was killed when web search showed higher engagement (and Twitter API pricing made it impractical).
**Dogfooding:** Building for themselves first provided high-quality feedback loops before external launch. The initial Slack and Discord bots served as production prototypes.
**Incremental Complexity:** They started with simple search + summarization, then added conversation, then follow-up suggestions, then generative UI for clarifying questions, then file uploads, then collections. Each feature built on validated infrastructure.
**Cost-Aware Architecture:** Fine-tuning GPT-3.5 to match GPT-4 performance was driven by real cost constraints at scale. LLMOps at production volume requires aggressive cost optimization.
**Latency Obsession:** Initial response times of 7-8 seconds improved to "almost instantaneous." This required custom CUDA implementations and infrastructure investment.
**Citation and Trust:** Unlike pure chatbots, their product emphasizes verifiable sources. Early internal use revealed that answers couldn't be trusted without backing from real data—driving the search integration architecture.
The case study represents a mature LLMOps deployment combining retrieval-augmented generation, multi-model orchestration, fine-tuning for cost optimization, custom inference infrastructure, and continuous evaluation against quality metrics.