## Overview
Perplexity is an AI company that positions itself as an "answer engine" rather than a traditional search engine, with the motto "Where knowledge begins." This case study focuses on their Pro Search feature, which represents a significant advancement in how complex, multi-step queries are handled using LLM-based agent architectures. The system is designed to serve students, researchers, and enterprises who need precise, comprehensive answers to nuanced questions that would traditionally require sifting through multiple search results and manually connecting information across sources.
The core problem that Perplexity Pro Search addresses is the limitation of traditional search engines when dealing with queries that require connecting multiple pieces of information. A simple example given is searching for "What's the educational background of the founders of LangChain?" — a query that requires first identifying the founders and then researching each individual's background. This multi-step reasoning challenge is precisely what agentic AI systems are designed to solve.
## Cognitive Architecture: Planning and Execution Separation
One of the most notable LLMOps patterns demonstrated in this case study is the separation of planning from execution in the agent architecture. This is a well-established pattern in the AI agent community that yields better results for complex, multi-step tasks.
The system works as follows: when a user submits a query, the AI first creates a plan — essentially a step-by-step guide to answering the question. For each step in the plan, a list of search queries is generated and executed. The execution is sequential rather than parallel, and results from previous steps are passed forward to inform subsequent steps. This allows the system to build context progressively and make more informed decisions as the research progresses.
After search queries return documents, these results are grouped and filtered down to the most relevant ones. The highly-ranked documents are then passed to an LLM to generate a final, synthesized answer. This pipeline approach — plan generation, query execution, document filtering, and answer synthesis — represents a mature production architecture for agentic search systems.
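To make the pipeline concrete, here is a minimal sketch of the plan, execute, filter, synthesize loop described above. The function shape and the `llm`, `search`, and `rank` callables are illustrative placeholders, not Perplexity's internal APIs.

```python
from typing import Callable

def pro_search(
    question: str,
    llm: Callable[[str], str],                    # prompt -> completion
    search: Callable[[str], list[str]],           # query -> documents
    rank: Callable[[str, list[str]], list[str]],  # keep most relevant docs
) -> str:
    # 1. Plan: ask the LLM for an ordered list of research steps.
    plan_text = llm(f"List the research steps needed to answer: {question}")
    plan = [step for step in plan_text.splitlines() if step.strip()]

    context: list[str] = []
    for step in plan:
        # 2. Generate search queries for this step, informed by earlier findings.
        query_text = llm(
            f"Step: {step}\nFindings so far: {context}\n"
            "List search queries, one per line."
        )
        queries = [q for q in query_text.splitlines() if q.strip()]

        # 3. Execute queries and collect candidate documents.
        documents = [doc for q in queries for doc in search(q)]

        # 4. Filter to the most relevant documents and carry them forward.
        context.extend(rank(step, documents))

    # 5. Synthesize a final, cited answer from the accumulated context.
    return llm(
        f"Question: {question}\nSources:\n" + "\n".join(context)
        + "\nWrite a concise, cited answer."
    )
```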
The system also integrates specialized tools beyond basic web search. Code interpreters allow users to run calculations or analyze files dynamically, while mathematics evaluation tools like Wolfram Alpha provide precise computational capabilities. This tool integration pattern is common in production LLM systems and demonstrates how agents can leverage external capabilities to extend their functionality beyond pure language understanding.
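A simple way to picture this tool integration is a registry that maps tool names to callables, with the agent choosing which one to invoke. The tool names and stub implementations below are illustrative assumptions, not Perplexity's actual integrations.

```python
from typing import Callable

# Hypothetical tool registry; names and stubs are illustrative only.
TOOLS: dict[str, Callable[[str], str]] = {
    "web_search": lambda query: f"<results for {query!r}>",
    "code_interpreter": lambda code: f"<output of running {code!r}>",
    "math_engine": lambda expr: f"<evaluation of {expr!r}>",  # e.g. Wolfram Alpha
}

def run_tool(tool_name: str, tool_input: str) -> str:
    """Dispatch an agent-selected tool call and return the observation."""
    tool = TOOLS.get(tool_name)
    if tool is None:
        # Surface the error back to the agent so it can recover.
        return f"error: unknown tool {tool_name!r}"
    return tool(tool_input)
```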
## Prompt Engineering at Scale
The prompt engineering approach at Perplexity reveals several important LLMOps considerations for production systems. First, the team supports multiple language models, giving users flexibility to choose the model that best fits their specific problem. This multi-model strategy is increasingly common in production systems and requires careful prompt management.
Since different language models process and interpret prompts differently, Perplexity customizes prompts on the backend for each individual model. This model-specific prompt engineering is a crucial but often overlooked aspect of LLMOps — prompts that work well for one model may not transfer effectively to another.
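One straightforward way to implement this is a per-model prompt registry consulted at request time. The model identifiers and prompt wording below are assumptions for illustration, not Perplexity's actual prompts.

```python
# Model identifiers and prompt text are illustrative assumptions.
SYSTEM_PROMPTS: dict[str, str] = {
    "gpt-4o": "You are a research assistant. Cite every claim as [n].",
    "claude-3-5-sonnet": (
        "You are a research assistant.\n"
        "<rules>Cite every claim as [n]. Use only the provided sources.</rules>"
    ),
    "default": "Answer using only the provided sources. Cite every claim as [n].",
}

def build_messages(model: str, question: str, sources: list[str]) -> list[dict]:
    """Assemble a chat payload with a system prompt tailored to the target model."""
    system = SYSTEM_PROMPTS.get(model, SYSTEM_PROMPTS["default"])
    user = "Sources:\n" + "\n".join(sources) + f"\n\nQuestion: {question}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]
```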
The team employs several standard prompting techniques including few-shot examples and chain-of-thought prompting. Few-shot examples help steer the search agent's behavior by providing concrete demonstrations of desired outputs. Chain-of-thought prompting encourages the model to reason through problems step-by-step.
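As an illustration only (the actual prompts are not disclosed), a planning prompt that combines the two techniques might look like the following, reusing the LangChain-founders example from earlier in the case study.

```python
# Illustrative planner prompt; the wording is an assumption, not Perplexity's.
PLANNER_PROMPT = """Break the question into research steps. Think step by step.

Question: What's the educational background of the founders of LangChain?
Reasoning: I first need to identify the founders, then research each one's education.
Steps:
1. Identify the founders of LangChain.
2. Find the educational background of each founder.

Question: {question}
Reasoning:"""
```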
A key insight from the Perplexity team is the importance of prompt length balance. As William Zhang, the engineer who led this effort, noted: "It's harder for models to follow the instructions of really complex prompts." This observation aligns with broader research showing that overly complex system prompts can degrade model performance. The team's solution was to keep rules in the system prompt simple and precise, reducing the cognitive load on models.
The iteration process described is also instructive for LLMOps practitioners. Rather than just evaluating final outputs, the team checked "that not only the output made sense, but that the intermediate steps were sensible as well." This emphasis on inspecting intermediate reasoning is crucial for debugging agentic systems where errors can compound across steps.
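One practical way to support this kind of review is to persist a structured trace of every run so reviewers can audit each step, not just the answer. The trace schema below is a hypothetical illustration, not Perplexity's tooling.

```python
import json
from dataclasses import asdict, dataclass, field

# Hypothetical trace schema for inspecting intermediate steps.
@dataclass
class SearchTrace:
    question: str
    plan: list[str] = field(default_factory=list)
    queries_by_step: dict[str, list[str]] = field(default_factory=dict)
    kept_documents: dict[str, list[str]] = field(default_factory=dict)
    answer: str = ""

    def dump(self, path: str) -> None:
        # Write the full trace so intermediate reasoning can be reviewed offline.
        with open(path, "w") as f:
            json.dump(asdict(self), f, indent=2)
```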
## Evaluation Strategy
The evaluation approach used by Perplexity demonstrates a multi-layered strategy that combines automated and human assessment methods. The team relied on both answer quality metrics and internal dogfooding before shipping upgrades.
Manual evaluations involved testing Pro Search on a wide range of queries and comparing answers side-by-side with other AI products. This competitive benchmarking approach helps ensure that the product meets or exceeds market standards. The ability to inspect intermediate steps was particularly valuable for identifying common errors before shipping — a reminder that transparency in agent reasoning is not just a user feature but also a debugging necessity.
To scale evaluations beyond what manual testing could achieve, Perplexity implemented LLM-as-a-Judge evaluation. They gathered large batches of questions and used an LLM to rank answers. This approach has become increasingly popular in production LLM systems as a way to approximate human judgment at scale, though it comes with known limitations around bias and inconsistency that practitioners should be aware of.
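A common way to implement this pattern is pairwise comparison, judging each pair in both orders to partially mitigate position bias. The prompt and helpers below are a sketch under those assumptions, not Perplexity's evaluation harness.

```python
from typing import Callable

# Judge prompt and helpers are illustrative assumptions.
JUDGE_PROMPT = """Question: {question}

Answer A:
{a}

Answer B:
{b}

Which answer is more accurate, complete, and better cited? Reply with only "A" or "B"."""

def judge_pair(question: str, a: str, b: str, judge: Callable[[str], str]) -> str:
    """Return 'A', 'B', or 'TIE' for a single comparison."""
    verdict = judge(JUDGE_PROMPT.format(question=question, a=a, b=b)).strip().upper()
    return verdict if verdict in {"A", "B"} else "TIE"

def win_rate(examples: list[tuple[str, str, str]], judge: Callable[[str], str]) -> float:
    """Fraction of questions where 'ours' is preferred in both orderings."""
    wins = 0
    for question, ours, baseline in examples:
        forward = judge_pair(question, ours, baseline, judge)  # ours shown as A
        reverse = judge_pair(question, baseline, ours, judge)  # ours shown as B
        if forward == "A" and reverse == "B":
            wins += 1
    return wins / max(len(examples), 1)
```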
A/B testing was employed to gauge user reactions to different product configurations, particularly exploring tradeoffs between latency and costs across different models. This production experimentation approach is essential for understanding how technical decisions impact actual user experience and business metrics.
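For context, deterministic hash-based bucketing is a standard way to run such experiments; the experiment and variant names below are assumptions for illustration.

```python
import hashlib

# Experiment and variant names are illustrative assumptions.
def assign_variant(user_id: str, experiment: str = "pro_search_latency_vs_cost") -> str:
    """Deterministically bucket a user into 'control' or 'treatment' (50/50 split)."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "treatment" if bucket < 50 else "control"
```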
## User Experience Considerations
An often-underappreciated aspect of LLMOps is how system design impacts user experience, and this case study provides valuable insights in this area. One of the biggest challenges the team faced was designing the user interface for a system that inherently requires processing time.
The key finding was that users were more willing to wait for results if the product displayed intermediate progress. This led to an interactive UI that shows the plan being executed step-by-step. Users can expand sections to see more details on individual searches, and can hover over citations to see source snippets. This transparency serves multiple purposes: it sets expectations, maintains engagement, and builds trust by showing the system's reasoning.
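One way to power such a UI is to stream structured progress events as each stage completes, letting the frontend render the plan incrementally. The event shapes below are illustrative assumptions, not Perplexity's actual API.

```python
from typing import Iterator

# Event shapes are illustrative assumptions, not Perplexity's API.
def stream_progress(question: str) -> Iterator[dict]:
    """Yield progress events the frontend can render as the search runs."""
    # A real implementation would derive these events from the agent loop
    # executing `question`; hard-coded values here are placeholders.
    yield {"type": "plan", "steps": ["Identify the founders", "Research each founder"]}
    yield {"type": "search", "step": 0, "queries": ["LangChain founders"]}
    yield {"type": "sources", "step": 0, "urls": ["https://example.com/source"]}
    yield {"type": "answer_chunk", "text": "..."}
```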
Zhang's guiding philosophy encapsulates a broader principle for LLMOps UX design: "You don't want to overload the user with too much information until they are actually curious. Then, you feed their curiosity." This progressive disclosure pattern allows the system to serve users with varying levels of technical sophistication — from AI experts who want to see all the details to newcomers who just want answers.
## Results and Impact
While the case study is presented through a vendor lens (LangChain's customer stories), some concrete metrics are provided. Query search volume for Perplexity Pro Search reportedly increased by over 50% in recent months as users discovered its capabilities. This growth suggests genuine product-market fit, though the specific timeframe and baseline are not disclosed.
The case study presents several key lessons for building AI agents in production:
- Explicit planning steps improve results for complicated research tasks
- Speed alongside answer quality is crucial for good user experience
- Dynamic UI feedback keeps users engaged during processing time
- Multiple evaluation methods (manual, LLM-as-Judge, A/B testing) provide comprehensive quality assurance
- Simple, precise prompts outperform complex instructions
- Model-specific prompt customization is necessary when supporting multiple LLMs
## Critical Assessment
It's worth noting that this case study is published by LangChain, a vendor in the LLM tooling space, and is designed to showcase successful applications of AI agents. While the technical details appear credible and align with known best practices in the field, readers should be aware of the promotional context. The 50% growth figure, while impressive, lacks context around absolute numbers, timeframes, and whether this growth can be directly attributed to the Pro Search improvements versus other factors.
The separation of planning and execution, multi-model support, and comprehensive evaluation strategies described are all well-established patterns in production LLM systems. The insights around prompt engineering and UX design are particularly valuable as they address practical challenges that many teams face when deploying LLM-powered agents to real users.