Perplexity has built a conversational search engine that combines LLMs with various tools and knowledge sources. The team tackled key challenges in LLM orchestration, including latency optimization, hallucination prevention, and reliable tool integration. Through careful engineering and prompt management, they reduced query latency from 5-7 seconds to near-instant responses while maintaining high-quality results. The system uses multiple specialized LLMs working together with search indices, tools like Wolfram Alpha, and custom embeddings to deliver personalized, accurate responses at scale.
Perplexity AI is a conversational search engine founded by Aravind Srinivas (CEO), Denis Yarats (formerly Meta, Quora, Bing, and NYU), Johnny Ho (a former world #1-ranked competitive programmer), and Andy Konwinski (Databricks co-founder, UC Berkeley). The company’s mission is to be “the world’s most knowledge-centered company,” providing users with instant, accurate answers to any question through a combination of LLMs, search indexes, and various tools. This case study, presented by the CEO at a conference, offers valuable insights into how Perplexity approaches LLMOps challenges in building a production-grade AI search product.
The core product is described as the “most functional and well-executed version of retrieval-augmented generation (RAG),” combining LLMs with live search indexes to provide up-to-date, conversational answers with citations. This addresses the fundamental limitation of LLMs having knowledge cutoffs while retaining their reasoning and natural language capabilities.
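The RAG loop described above can be sketched in a few lines. This is a minimal illustration, not Perplexity’s implementation: `search_index` and `call_llm` are stand-in stubs for a live search index and a hosted model.

```python
# Minimal RAG sketch: retrieve fresh documents, build a grounded prompt with
# numbered citations, and ask an LLM to answer only from those sources.
# search_index and call_llm are illustrative stubs, not real APIs.

def search_index(query: str) -> list[dict]:
    """Stand-in for a live search index returning ranked documents."""
    return [
        {"url": "https://example.com/a", "snippet": "Fact A about the query."},
        {"url": "https://example.com/b", "snippet": "Fact B about the query."},
    ]

def build_prompt(query: str, docs: list[dict]) -> str:
    sources = "\n".join(
        f"[{i + 1}] {d['url']}: {d['snippet']}" for i, d in enumerate(docs)
    )
    return (
        "Answer using ONLY the sources below and cite them as [n].\n"
        f"Sources:\n{sources}\n\nQuestion: {query}\nAnswer:"
    )

def call_llm(prompt: str) -> str:
    """Stand-in for a hosted LLM call."""
    return "Fact A [1] and Fact B [2]."

def answer(query: str) -> str:
    docs = search_index(query)  # live retrieval sidesteps the knowledge cutoff
    return call_llm(build_prompt(query, docs))
```

The key property is that freshness comes from retrieval while reasoning and phrasing come from the model, which is exactly the division of labor the product relies on.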
One of the most significant LLMOps insights from this case study is the concept of “orchestration” - managing multiple LLMs and tools together to achieve results impossible with any single component. The CEO uses the analogy of “playing an orchestra” to describe this approach, referencing the Steve Jobs metaphor about orchestrating components rather than writing every line of code.
The Perplexity system employs multiple specialized LLMs working in concert, with different models handling different stages of answering a query.
This multi-LLM architecture is a sophisticated approach to production AI systems, where different models are optimized for different tasks rather than relying on a single general-purpose model for everything.
Perplexity heavily emphasizes “tool use” as a core architectural principle. The CEO cites Oriol Vinyals (VP of Research at DeepMind) saying that “the end game is not to predict the next word at a time” but rather to connect LLMs with tools like search engines, Python interpreters, and specialized knowledge systems.
Key tool integrations mentioned include search engines, Wolfram Alpha for computational queries, and code interpreters.
The CEO notes that Google’s Bard improvements (30% better on word and math problems) came from adding an “implicit code interpreter on the back end,” validating the importance of tool integration for production LLM systems.
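Tool dispatch of this kind can be illustrated with a simple routing rule: queries that look like arithmetic go to a computational tool (in the spirit of Wolfram Alpha or a code interpreter) rather than being answered token-by-token by the LLM. The heuristic and function names below are illustrative assumptions.

```python
# Illustrative tool dispatch: arithmetic goes to a deterministic tool,
# everything else falls through to the LLM. The detection heuristic is
# deliberately narrow (digits and operators only).
import re

def looks_like_math(query: str) -> bool:
    """True only for strings made of digits, whitespace, and arithmetic operators."""
    return bool(re.fullmatch(r"[\d\s+\-*/().]+", query.strip()))

def compute_tool(expression: str) -> str:
    """Stand-in for Wolfram Alpha / a sandboxed interpreter.
    eval is safe here only because input is restricted to digits and operators;
    a production system would use a real sandbox."""
    return str(eval(expression))

def dispatch(query: str) -> str:
    if looks_like_math(query):
        return compute_tool(query)
    return "LLM answer for: " + query
```

This mirrors the point about Bard: exact computation is delegated to a tool because next-token prediction is unreliable at arithmetic.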
A critical architectural decision in Perplexity’s technology stack was to avoid high-level tooling libraries like LangChain or Dust and instead build the stack end-to-end in-house. The CEO strongly recommends this approach, stating: “There’s just so many latency optimizations you’ll miss out on if you rely on these tooling libraries.” Controlling all parts of the stack themselves has been key to their performance achievements.
Latency was a major challenge that the team solved through rigorous engineering. Initially, queries took 5-7 seconds - so slow that an investor jokingly suggested they call it “submit a job” rather than “submit a query.” Through end-to-end optimization, they reduced this dramatically, with users now frequently commenting on how fast the system is.
The specific optimizations aren’t detailed, but the emphasis on avoiding abstraction layers and controlling the full stack suggests the gains came from tight end-to-end control of every step in the pipeline rather than any single trick.
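One concrete latency lever in pipelines like this is concurrency: independent sub-requests (search, embeddings, tool calls) can be issued in parallel instead of sequentially. The stages and delays below are simulated assumptions, not Perplexity’s actual pipeline; the sketch just demonstrates the speedup.

```python
# Concurrency sketch: fan out independent I/O-bound stages with asyncio
# instead of awaiting them one by one. Stage names and delays are simulated.
import asyncio
import time

async def fake_io(name: str, delay: float) -> str:
    """Simulated network call (search index, embedding service, tool)."""
    await asyncio.sleep(delay)
    return name

async def sequential() -> float:
    """Run three stages back to back; total time is the sum of delays."""
    start = time.perf_counter()
    for stage in [("search", 0.05), ("embed", 0.05), ("tool", 0.05)]:
        await fake_io(*stage)
    return time.perf_counter() - start

async def concurrent() -> float:
    """Run the same stages in parallel; total time is roughly the max delay."""
    start = time.perf_counter()
    await asyncio.gather(
        fake_io("search", 0.05), fake_io("embed", 0.05), fake_io("tool", 0.05)
    )
    return time.perf_counter() - start
```

With three 50 ms stages, the sequential path takes roughly 150 ms while the concurrent path takes roughly 50 ms, which is the kind of win that abstraction layers can hide.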
The CEO acknowledges that when “chaining things together” with multiple tools, there are many ways things can fail - “if just one part has to go wrong and everything breaks.” Their approach to quality combines evaluation frameworks with heavy internal dog-fooding.
The CEO candidly admits their approach includes “tested in production and see how it goes,” acknowledging the practical reality of production AI systems while maintaining quality through the evaluation frameworks mentioned.
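A “test in production” posture still benefits from cheap automated guardrails on live answers. As one illustration (an assumption of mine, not Perplexity’s actual evaluation framework), a system that promises citations can mechanically verify that every citation marker in a generated answer refers to a retrieved source:

```python
# Lightweight production guardrail: check that every [n] citation in an
# answer points at one of the retrieved sources. Catches a common RAG
# failure mode where the model cites sources that were never retrieved.
import re

def citations_valid(answer: str, num_sources: int) -> bool:
    """True if the answer cites at least one source and every cited index
    falls within the retrieved-source range 1..num_sources."""
    cited = {int(m) for m in re.findall(r"\[(\d+)\]", answer)}
    return bool(cited) and all(1 <= c <= num_sources for c in cited)
```

Checks like this are cheap enough to run on every response, turning “see how it goes” into a stream of measurable quality signals.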
For choosing foundation models, the CEO offers practical guidance based on their internal evaluations.
The recommendation is pragmatic: stick with OpenAI and Anthropic models for now rather than “trying to be adventurous” with less proven alternatives.
Perplexity introduced an innovative approach to personalization they call “AI Profiles,” which Amjad Masad (founder of Replit) calls “presentation layer prompting.” Instead of Web 2.0 style topic selection, users write natural language descriptions of themselves, their interests, and preferences. This information is “pre-loaded as a prompt” for all interactions.
Examples from their Discord community show practical uses: users setting language preferences (e.g., ensuring Dutch queries return Dutch responses) or specifying technical backgrounds to get appropriately detailed answers. The CEO frames this as users becoming “programmers in natural language,” writing code that controls how the AI serves them.
This approach to personalization represents a shift in how AI systems can be customized without requiring per-query prompt engineering from users.
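Mechanically, “presentation layer prompting” amounts to pre-loading the user’s free-text profile as a standing system prompt ahead of every query. The sketch below assumes a chat-style message format; the function name and wrapper text are illustrative.

```python
# Sketch of AI Profiles / presentation layer prompting: the user's
# natural-language self-description is prepended as a system message to
# every query, so preferences apply without per-query prompt engineering.

def build_messages(profile: str, query: str) -> list[dict]:
    """Prepend the standing user profile to a single query."""
    return [
        {"role": "system", "content": f"User profile (always apply): {profile}"},
        {"role": "user", "content": query},
    ]

# Example mirroring the Dutch-language use case from the Discord community:
profile = "Respond in Dutch. I am a software engineer; keep answers technical."
messages = build_messages(profile, "Hoe werkt garbage collection?")
```

The user writes the profile once in plain language, and it silently shapes every subsequent interaction - the sense in which users become “programmers in natural language.”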
The CEO addresses the common criticism that products like Perplexity are “LLM wrappers.” While acknowledging heavy reliance on OpenAI models, he argues that building such a product involves substantial engineering beyond model access: search indexing, tool integration, latency optimization, and prompt management.
The orchestration work - making all these components work together seamlessly and reliably - is where much of the engineering value lies.
Perplexity operates with a notably small team, which the CEO presents as intentional: “You can do a lot with less. In fact, you can probably only do a lot with less.” He expresses concern that scaling to 50 people would make them slower.
This philosophy aligns with their technical approach of building custom solutions rather than using abstraction layers - a small team can maintain deep understanding of the entire system when they control all parts of the stack.
Several production-focused insights emerge from the discussion.
The emphasis on reliability at scale, combined with the evaluation frameworks and dog-fooding culture, suggests a mature approach to production AI operations despite the company’s relatively young age.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
This case study examines Cursor's implementation of reinforcement learning (RL) for training coding models and agents in production environments. The team discusses the unique challenges of applying RL to code generation compared to other domains like mathematics, including handling larger action spaces, multi-step tool calling processes, and developing reward signals that capture real-world usage patterns. They explore various technical approaches including test-based rewards, process reward models, and infrastructure optimizations for handling long context windows and high-throughput inference during RL training, while working toward more human-centric evaluation metrics beyond traditional test coverage.
Ericsson's System Comprehension Lab is exploring the integration of symbolic reasoning capabilities into telecom-oriented large language models to address critical limitations in current LLM architectures for telecommunications infrastructure management. The problem centers on LLMs' inability to provide deterministic, explainable reasoning required for telecom network optimization, security, and anomaly detection—domains where hallucinations, lack of logical consistency, and black-box behavior are unacceptable. The proposed solution involves hybrid neural-symbolic AI architectures that combine the pattern recognition strengths of transformer-based LLMs with rule-based reasoning engines, connected through techniques like symbolic chain-of-thought prompting, program-aided reasoning, and external solver integration. This approach aims to enable AI-native wireless systems for 6G infrastructure that can perform cross-layer optimization, real-time decision-making, and intent-driven network management while maintaining the explainability and logical rigor demanded by production telecom environments.