## Overview
This case study is drawn from a podcast conversation featuring a senior product manager at LinkedIn discussing practical approaches to building AI and generative AI products in production environments. The conversation provides valuable insights into how product managers should think about incorporating LLMs into existing products, the operational challenges of running these systems at scale, and the evolution of ML product development in the age of large language models.
The speaker brings a unique perspective, having transitioned from a machine learning engineer role to product management while maintaining deep technical involvement. His work at LinkedIn focuses on serving creators on the platform through personalization, content distribution, and content understanding—all areas where AI plays a crucial role.
## Core Philosophy: LLMs as Tools, Not Solutions
One of the central themes emphasized throughout the discussion is the importance of viewing generative AI as a tool in service of user needs, rather than as a solution looking for problems. The speaker explicitly warns against the tendency to "bolt on" GenAI capabilities simply because investors expect it or because it's trendy. Instead, the recommendation is to start with the user problem, understand if it's worth solving, and then determine if and how AI can help address it.
This philosophy has direct implications for LLMOps practices. Rather than building AI-first products, the approach is to identify specific problems that were previously difficult or impossible to solve and then evaluate whether recent advances in LLMs have made those problems tractable. The speaker notes that he's been involved in generative AI since 2017 when he was working with GANs to generate images from text—a process that required significant GPU resources for minimal output quality. The contrast with today's capabilities, where a weekend project can produce functional MVPs, illustrates the operational transformation the field has undergone.
## Go-to-Market Strategy: Speed First, Optimization Later
The speaker articulates a clear two-phase approach to building LLM-powered features that has significant LLMOps implications:
**Phase 1: Rapid Validation with Off-the-Shelf Solutions**
The recommendation is to use commercial LLM APIs (like OpenAI) for initial validation and go-to-market speed. The reasoning is pragmatic: testing a hypothesis quickly is more valuable than building the perfect solution from scratch. The speaker describes running a back-of-envelope calculation for an experience serving 50,000 users over 30 days using OpenAI's public pricing, and finding the cost surprisingly affordable compared to building custom ML infrastructure.
This approach acknowledges that building custom ML solutions involves GPU costs, engineering time, and significant infrastructure development—all of which represent wasted resources if the underlying hypothesis about user value proves incorrect.
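As a rough illustration of that kind of back-of-envelope math, the short sketch below prices a hypothetical 30-day experiment. Every number in it is a placeholder assumption (usage rates, token counts, per-token prices), not a figure from the conversation.

```python
# Back-of-envelope API cost estimate for a 30-day experiment.
# Every number below is an illustrative placeholder, not an actual price or usage figure.

users = 50_000
requests_per_user_per_day = 2          # assumed average usage
days = 30

prompt_tokens_per_request = 1_500      # context + instructions
completion_tokens_per_request = 300    # generated output

price_per_1k_prompt_tokens = 0.0015      # USD, placeholder
price_per_1k_completion_tokens = 0.002   # USD, placeholder

total_requests = users * requests_per_user_per_day * days
cost = total_requests * (
    prompt_tokens_per_request / 1_000 * price_per_1k_prompt_tokens
    + completion_tokens_per_request / 1_000 * price_per_1k_completion_tokens
)
print(f"{total_requests:,} requests -> ~${cost:,.0f} for the experiment")
```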
**Phase 2: Transition to Specialized Models**
Once an AI feature proves valuable, the recommendation is to transition from general-purpose LLMs to narrower, specialized models. The speaker argues that these narrower models will often outperform general LLMs for specific use cases because they are pre-trained on relevant data rather than having to handle law, medicine, and every other domain simultaneously.
This two-phase approach has clear cost implications for production systems. While OpenAI APIs enable rapid validation, they won't scale cost-effectively to 100 million users. The transition to specialized models is framed not just as a cost optimization but as a potential quality improvement.
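One way to keep that transition cheap is to hide the model behind a narrow interface from day one, so swapping a commercial API for a specialized model later is a configuration change rather than a rewrite. The sketch below is a minimal illustration of that idea, not a description of LinkedIn's architecture; the class names are hypothetical and the hosted call assumes an OpenAI-compatible client.

```python
from typing import Protocol


class TextGenerator(Protocol):
    """Interface the product depends on; implementations can be swapped per phase."""
    def generate(self, prompt: str) -> str: ...


class HostedAPIGenerator:
    """Phase 1: wrap a commercial LLM API for fast validation."""
    def __init__(self, client, model: str):
        self.client = client   # assumed OpenAI-compatible client
        self.model = model

    def generate(self, prompt: str) -> str:
        resp = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content


class SpecializedModelGenerator:
    """Phase 2: call a narrower, fine-tuned or self-hosted model instead."""
    def __init__(self, endpoint_url: str):
        self.endpoint_url = endpoint_url

    def generate(self, prompt: str) -> str:
        # Placeholder for a call to an internal inference service.
        raise NotImplementedError


def suggest_post_draft(generator: TextGenerator, topic: str) -> str:
    # Product code depends only on the interface, not on a specific provider.
    return generator.generate(f"Draft a short post about: {topic}")
```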
## Push vs. Pull Interaction Patterns
A particularly insightful observation concerns the design of AI-powered user experiences. The speaker distinguishes between "pull" mechanisms (where users must explicitly interact with an assistant) and "push" mechanisms (where AI value is embedded into existing workflows).
The hot take offered is that many companies should avoid building assistant-style products, despite their apparent ease of implementation. The reasoning is that assistants require significant user effort—the user must formulate their request, engage in conversation, and then port results back to their actual task. This represents a "pull" mechanism that creates friction.
The alternative is building experiences that "meet users where they are." Examples cited include Gmail's "Help me write" feature and Notion's inline AI capabilities. These experiences abstract away complexity through meta-prompts that incorporate context (contact lists, thread history, document content) without requiring the user to explicitly provide it.
From an LLMOps perspective, this design philosophy means building more sophisticated prompt pipelines that gather and incorporate context automatically. The "heavy lifting" happens in the product infrastructure rather than in user interactions, which has implications for how prompts are constructed, how context is gathered and managed, and how the overall system is monitored.
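A minimal sketch of such a pipeline is shown below, assuming a hypothetical "help me write" style reply feature; the function, fields, and template are illustrative, intended only to show the product supplying context on the user's behalf rather than asking for it.

```python
from typing import List, TypedDict


class Message(TypedDict):
    sender: str
    text: str


def build_reply_prompt(user_name: str, user_role: str,
                       thread: List[Message], instruction: str) -> str:
    """Assemble a 'help me write' prompt from context the product already holds,
    so the user only supplies a one-line instruction."""
    history = "\n".join(f"{m['sender']}: {m['text']}" for m in thread[-5:])
    return (
        f"You are drafting an email reply on behalf of {user_name} ({user_role}).\n"
        f"Recent thread:\n{history}\n\n"
        f"User request: {instruction}\n"
        "Write a concise, professional reply in the user's voice."
    )


# Usage: the product fills in everything except the short instruction.
prompt = build_reply_prompt(
    "Dana", "product manager",
    [{"sender": "Alex", "text": "Can we move the review to Thursday?"}],
    "Agree, and ask for an agenda.",
)
```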
## Prompt Engineering and Iteration
The speaker provides candid observations about the challenges of prompt engineering in production:
**Initial Success Can Be Misleading**
A key warning is that the first successful prompt output can create false confidence. When you test a single example, the results may look brilliant; running the same experience across 10,000 different users, however, reveals anomalies and edge cases that weren't apparent in initial testing. This observation highlights the importance of robust evaluation at scale before production deployment.
**Prompt Engineering as First Line of Defense**
The speaker advocates for investing heavily in prompt engineering before resorting to more complex solutions like chained LLM calls (where a second call verifies the first). While verification calls are valid techniques, they double the cost of the experience. The recommendation is to "see how much of the mountain you can climb with just one prompt" before adding complexity.
Specific techniques mentioned include (see the sketch after this list):
- Adding as much context as possible
- Constraining the output format
- Adjusting temperature for more consistent outputs (with awareness of trade-offs)
- Using meta-prompts that incorporate user and environmental context
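The sketch below combines two of these techniques in a single request: a constrained JSON output format and a lower temperature. The model name, schema, and client call shape are illustrative assumptions (an OpenAI-compatible SDK), not settings from the discussion.

```python
import json

SYSTEM_PROMPT = (
    "You summarize a post for content matching. Respond with JSON only, "
    'using exactly these keys: {"topic": string, "audience": string, "tone": "formal" or "casual"}.'
)

def summarize_post(client, post_text: str) -> dict:
    # Constrained output format plus a lower temperature for more consistent results.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",   # placeholder model name
        temperature=0.2,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": post_text},
        ],
    )
    return json.loads(resp.choices[0].message.content)  # fails loudly if the format drifts
```

Parsing the response with `json.loads` also gives an immediate, mechanical signal whenever the model drifts from the required format.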
The speaker also mentions the emergence of tooling for prompt engineering and management, noting that when he started working with LLMs, such tools didn't exist and teams had to build their own infrastructure. Now options like Microsoft Guidance and various open-source tools are available, though the landscape changes rapidly.
## Evaluation Challenges
The discussion touches on the significant challenges of evaluating LLM outputs in production, which is one of the most difficult aspects of LLMOps:
**Defining Evaluation Criteria**
A major gotcha identified is the need to define evaluation criteria very early in the development cycle. Without clear constraints on what constitutes success, scoring outputs becomes subjective and inconsistent. When scaling evaluation, different evaluators may interpret quality differently—what one person rates 5/10, another might rate 10/10.
The recommendation is to be very strict about constraints and success criteria, then reverse-engineer prompts to solve for those criteria. This suggests a test-driven approach to prompt development.
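In practice this can look like a small harness that encodes the agreed criteria as explicit checks and reports pass rates over a batch before any prompt change ships. The criteria below are invented for illustration; the point is that a 5/10 versus 10/10 disagreement becomes traceable to named checks.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class EvalCase:
    input_text: str
    output_text: str


# Each criterion is an explicit, named check so every evaluator applies the same bar.
CRITERIA: List[Tuple[str, Callable[[EvalCase], bool]]] = [
    ("under_400_chars", lambda c: len(c.output_text) <= 400),
    ("no_first_person", lambda c: " i " not in f" {c.output_text.lower()} "),
    ("echoes_input_topic", lambda c: c.input_text.split()[0].lower() in c.output_text.lower()),
]


def score_batch(cases: List[EvalCase]) -> dict:
    """Pass rate per criterion across a batch of generated outputs."""
    return {name: sum(check(c) for c in cases) / len(cases) for name, check in CRITERIA}
```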
**User Feedback Limitations**
The conversation acknowledges that simple thumbs up/thumbs down feedback has limited value. Many users who receive poor outputs don't bother providing negative feedback; they simply abandon the experience. This makes bounce rate a potentially valuable signal, though it carries false positives (users who copy the output and leave satisfied).
The speaker notes that optimal feedback signals will vary from product to product, and that the industry is still developing best practices for LLM evaluation in production.
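One lightweight way to capture such signals is to log a small event record per generation and compute bounce and copy rates offline. The event fields below are hypothetical, not a description of LinkedIn's telemetry.

```python
import time
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class GenerationEvent:
    request_id: str
    rating: Optional[int] = None     # explicit thumbs up (+1) / down (-1); often missing
    copied_output: bool = False      # implicit positive: the user copied the text
    abandoned: bool = False          # implicit negative: the user left without acting
    timestamp: float = field(default_factory=time.time)


def feedback_summary(events: List[GenerationEvent]) -> dict:
    total = len(events)
    return {
        "explicit_rating_rate": sum(e.rating is not None for e in events) / total,
        "copy_rate": sum(e.copied_output for e in events) / total,
        "bounce_rate": sum(e.abandoned for e in events) / total,
    }
```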
## Trust as a Critical Factor
A recurring theme is that generative AI introduces new considerations around trust that don't exist with traditional ML systems:
**Confidence vs. Correctness**
LLMs are noted for being highly confident regardless of accuracy. This creates reputational risk when incorrect outputs are presented to users. The speaker emphasizes that with GenAI, you're effectively "creating a voice for your brand"—if the AI says something incorrect or inappropriate, it reflects directly on the product and company.
**Progressive Complexity**
To manage trust risks, the recommendation is to start with experiences that are less sensitive to hallucination. Text creation experiences where factual accuracy is less critical represent lower-risk starting points. As trust infrastructure matures, products can add more complex features involving factual information, potentially using retrieval-augmented generation (RAG) to ground outputs in verified sources.
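As a rough sketch of that grounding step: retrieval-augmented generation fetches relevant passages from a verified corpus and instructs the model to answer only from them. The keyword-overlap retrieval below is a naive stand-in for a real vector search.

```python
from typing import List


def retrieve(query: str, documents: List[str], k: int = 3) -> List[str]:
    """Naive keyword-overlap retrieval; a production system would use vector search."""
    terms = set(query.lower().split())
    ranked = sorted(documents, key=lambda d: len(terms & set(d.lower().split())), reverse=True)
    return ranked[:k]


def build_grounded_prompt(query: str, documents: List[str]) -> str:
    sources = "\n".join(f"- {s}" for s in retrieve(query, documents))
    return (
        "Answer the question using only the sources below. "
        "If the sources do not contain the answer, say you don't know.\n"
        f"Sources:\n{sources}\n\nQuestion: {query}"
    )
```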
**Feedback Pipelines**
Building visible feedback mechanisms serves dual purposes: it provides operational data for improvement and signals to users that the company is committed to learning and improving. This transparency can help maintain trust even when outputs are occasionally incorrect.
## Cost Considerations
The speaker provides practical perspective on LLM costs:
For enterprise-scale testing (50,000 users, 30 days), the speaker found API-based LLM costs surprisingly affordable compared to building custom infrastructure. However, this calculation changes dramatically at production scale—the explicit acknowledgment is that commercial APIs won't scale cost-effectively to 100 million users.
Open-source models are mentioned as an alternative that addresses both cost and data privacy concerns. These models are noted as being smaller than commercial alternatives, which makes them cheaper to run. For use cases with strict data requirements (investment banking, legal applications), open-source models also provide greater control and transparency over data handling.
## Content Understanding at LinkedIn
While specific product details weren't disclosed due to PR restrictions, the speaker mentions working on content understanding at LinkedIn—using AI to interpret what posts, images, and videos are about, then using that understanding for better content matching and distribution. This represents a production LLM use case focused on improving personalization and creator discoverability.
The work involves helping smaller creators grow and reach their audience, starting what the speaker describes as a "virtuous flywheel of engagement and connection." This suggests LLMs are being used not just for content generation but for content analysis and recommendation systems.
## Key Takeaways for LLMOps Practitioners
The discussion provides several actionable insights for teams building LLM-powered features:
- Start simple and add complexity only when simpler approaches hit their limits
- Use commercial APIs for validation, but plan for transition to specialized models at scale
- Invest heavily in prompt engineering before adding architectural complexity
- Define clear evaluation criteria early and test at scale before production
- Design experiences that abstract complexity from users rather than requiring explicit AI interaction
- Build feedback loops and trust mechanisms into the product from the start
- Consider trust and brand implications as first-class operational concerns
- Match the sensitivity of the use case to the maturity of your reliability infrastructure
The overall message is that successful AI product development requires the same user-centered thinking as traditional product management, with additional operational considerations around evaluation, trust, and the unique failure modes of generative systems.