ZenML

Democratizing Prompt Engineering Through Platform Architecture and Employee Empowerment

Pinterest 2025

Pinterest developed a comprehensive LLMOps platform strategy to enable its visual discovery platform, which serves 570 million users, to rapidly adopt generative AI capabilities. The company built a multi-layered architecture with vendor-agnostic model access, centralized proxy services, and employee-facing tools, combined with creative training approaches such as "Prompt Doctors" and company-wide hackathons. Their solution included automated batch labeling systems, a centralized "Prompt Hub" for prompt development and evaluation, and an "AutoPrompter" system that uses LLMs to automatically generate and optimize prompts through iterative critique and refinement. This approach enabled non-technical employees to become effective prompt engineers, produced the fastest-adopted platform in Pinterest's history, and demonstrated that democratizing AI capabilities across all employees can lead to breakthrough innovations.

Industry

Tech

Pinterest's LLMOps journey represents a comprehensive case study in how a large technology company with 570 million users can systematically democratize generative AI capabilities across its entire organization. The company embarked on this transformation in 2023 with the strategic goal of moving "from prompt to productivity as quickly as possible," recognizing that its existing 300 machine learning engineers, while experienced with transformer models and large-scale inference systems, needed new approaches to leverage the emerging capabilities of large language models.

The foundation of Pinterest's LLMOps strategy rests on a sophisticated multi-layered platform architecture designed for scalability, flexibility, and governance. At the base layer, they implemented a multimodal, multi-vendor model strategy that allows rapid onboarding of different models as they become available. This is supported by a centralized proxy layer that handles critical operational concerns including rate limiting, vendor integration, comprehensive logging, and access control. The proxy layer enables Pinterest to differentiate between unreleased models available to specific teams versus general models accessible to all employees, providing fine-grained administrative control over model access.
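Pinterest has not published the implementation of this proxy layer; the following is a minimal sketch of the behaviors described above (per-team rate limiting, per-model access control, request logging), with all class and method names hypothetical:

```python
import time
from collections import defaultdict, deque


class ModelProxy:
    """Hypothetical centralized proxy: access control, rate limiting, logging."""

    def __init__(self, rate_limit_per_minute=60):
        self.rate_limit = rate_limit_per_minute
        self.model_acl = {}                 # model name -> allowed teams, or None for all
        self.request_log = []               # audit trail of forwarded requests
        self._windows = defaultdict(deque)  # per-team sliding window of timestamps

    def register_model(self, name, allowed_teams=None):
        # allowed_teams=None marks a generally available model;
        # a set of team names restricts an unreleased model to those teams.
        self.model_acl[name] = set(allowed_teams) if allowed_teams else None

    def _within_rate_limit(self, team):
        now = time.monotonic()
        window = self._windows[team]
        while window and now - window[0] > 60:
            window.popleft()                # drop timestamps older than one minute
        if len(window) >= self.rate_limit:
            return False
        window.append(now)
        return True

    def request(self, team, model, prompt):
        if model not in self.model_acl:
            return {"ok": False, "error": "unknown model"}
        allowed = self.model_acl[model]
        if allowed is not None and team not in allowed:
            return {"ok": False, "error": "access denied"}
        if not self._within_rate_limit(team):
            return {"ok": False, "error": "rate limited"}
        self.request_log.append((team, model))
        # A real proxy would forward the prompt to the vendor API here.
        return {"ok": True, "response": f"[{model}] stubbed response to: {prompt}"}
```

Centralizing these concerns in one choke point is what lets a platform team swap vendors or gate an unreleased model without touching any downstream tool.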

Above this infrastructure layer, Pinterest built employee-facing tools and environments including development environments, prompt engineering tools, internal APIs, and various bots and assistants. The top layer implements centralized guardrails encompassing empathetic AI checks, safety validation, and content quality assurance. This layered approach, implemented starting in 2023, proved crucial for enabling rapid iteration and addition of new capabilities while maintaining operational standards.
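The talk does not detail how the guardrail layer is wired; one simple shape for a centralized check pipeline like the one described (safety, content quality, and similar validations applied to every response) might look like this, with all names hypothetical:

```python
def apply_guardrails(response, checks):
    """Run a model response through an ordered list of centralized checks.

    checks: list of (name, predicate) pairs, where predicate(response) -> bool.
    Returns (approved, failed_check_names); a response passes only if every
    check passes, so new guardrails can be added without touching callers.
    """
    failed = [name for name, check in checks if not check(response)]
    return (len(failed) == 0, failed)
```

A platform team would own the `checks` list centrally, so every employee-facing tool inherits the same safety and quality bar.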

The human enablement strategy at Pinterest demonstrates remarkable creativity in organizational change management. The team adopted "Prompt Doctor" personas, complete with Halloween costumes and medical-themed puns, to make AI expertise approachable and memorable. They conducted large-scale educational sessions, including a notable class held at a llama farm at 10,000 feet of elevation that attracted one-third of the company through organic word-of-mouth promotion. These sessions covered both the capabilities and the limitations of generative AI, including hallucination risks and best practices for prompt engineering.

Pinterest's hackathon strategy proved particularly effective in driving adoption. Following educational sessions, they provided all employees with three days to build whatever they wanted using newly acquired prompt engineering skills. Critically, they introduced no-code tools during these hackathons, enabling non-technical employees to create applications and prove concepts without traditional development skills. The hackathons were strategically timed with planning cycles, creating a pathway from learning to building to potential product integration within approximately two weeks.

The development of their batch labeling system illustrates Pinterest's approach to scaling successful proof-of-concepts. Starting with a simple Jupyter notebook that allowed users to write a prompt, select a Hive dataset, and automatically label data, they conducted approximately 40 one-hour meetings with internal teams. These meetings served dual purposes: solving immediate problems for teams while gathering requirements for more robust tooling. The success of these lightweight implementations provided justification for funding a full production-scale batch labeling system, which became the fastest-adopted platform in Pinterest's history.
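The original notebook's core loop, as described (write a prompt, point it at a dataset, label every row), can be sketched in a few lines; the function and parameter names below are illustrative, not Pinterest's:

```python
def batch_label(rows, prompt_template, call_llm, label_field="label"):
    """Apply an LLM prompt to every row of a dataset and attach the result.

    rows:            list of dicts (e.g. rows pulled from a Hive query)
    prompt_template: format string whose {placeholders} match row keys
    call_llm:        callable prompt -> completion (vendor-agnostic)
    """
    labeled = []
    for row in rows:
        prompt = prompt_template.format(**row)
        completion = call_llm(prompt)
        labeled.append({**row, label_field: completion.strip()})
    return labeled
```

Keeping the model call behind a plain callable is what makes such a tool trivially portable across the vendors behind the proxy layer.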

User research revealed significant friction in the existing workflow for generative AI projects. The typical process involved downloading data from QueryBook, sampling it, uploading to prompt engineering tools, running prompts on small datasets, downloading results, uploading to spreadsheet software for evaluation, copying prompts to version control systems, and iterating through this entire cycle. For production deployment, users still required engineering support to configure and run the batch labeling system, creating bottlenecks and delays.

In response to these findings, Pinterest developed "Prompt Hub," a centralized platform that consolidates the entire prompt development lifecycle. The platform provides access to hundreds of thousands of internal data tables, integrated prompt engineering capabilities with multi-model support, real-time evaluation metrics, and cost estimation per million tokens. The system creates centralized leaderboards for prompt performance, enabling teams to compare different approaches including fine-tuned models, distilled models, and various prompting techniques. A critical feature is the single-button deployment to production scale, eliminating the need for engineering intervention in the deployment process.
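The per-million-token cost estimate that Prompt Hub surfaces is straightforward arithmetic; a sketch of how such an estimate could be computed, with hypothetical prices:

```python
def estimate_batch_cost(n_rows, avg_input_tokens, avg_output_tokens,
                        input_price_per_m, output_price_per_m):
    """Estimated dollar cost of running a prompt over a full dataset.

    Prices are dollars per million tokens, the convention most vendors quote.
    Averages would typically be measured on a small pilot sample first.
    """
    input_cost = n_rows * avg_input_tokens * input_price_per_m / 1_000_000
    output_cost = n_rows * avg_output_tokens * output_price_per_m / 1_000_000
    return input_cost + output_cost
```

Surfacing this number next to the accuracy metric lets a leaderboard rank prompts on cost-adjusted quality rather than accuracy alone.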

The leaderboard functionality enabled Pinterest to experiment with democratized problem-solving through internal competitions. They created prompt engineering challenges where any employee could attempt to outperform professional prompt engineers on real business problems. In one notable example, participants beat a professionally developed prompt that had taken two months to create, achieving better accuracy at lower cost within 24 hours. Significantly, top-performing entries came from non-technical teams, including finance, demonstrating the potential for domain expertise to drive AI innovation when technical barriers are removed.

Pinterest's AutoPrompter system represents a sophisticated approach to automated prompt optimization. Drawing inspiration from neural network training paradigms, the system implements a "predict, critique, and refine" cycle using two LLM agents: a student that generates prompts and a teacher that provides detailed critiques and suggestions for improvement. The student agent incorporates feedback iteratively, leading to progressively improved prompts through what they term "text gradients" - error signals passed back from the teacher to guide prompt refinement.
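The predict-critique-refine cycle can be sketched as a small loop over three callables; everything beyond the cycle itself (stopping criteria, best-so-far tracking, parameter names) is an assumption, not Pinterest's implementation:

```python
def auto_prompt(seed_prompt, evaluate, teacher, student,
                max_rounds=5, target=0.95):
    """Iteratively refine a prompt via a teacher/student critique loop.

    evaluate: prompt -> accuracy on a labeled eval set   (predict step)
    teacher:  (prompt, score) -> textual critique, i.e. the "text gradient"
    student:  (prompt, critique) -> revised prompt       (refine step)
    """
    prompt, score = seed_prompt, evaluate(seed_prompt)
    best_prompt, best_score = prompt, score
    for _ in range(max_rounds):
        if best_score >= target:
            break                                  # good enough, stop early
        critique = teacher(prompt, score)          # critique step
        prompt = student(prompt, critique)         # refine step
        score = evaluate(prompt)                   # predict step
        if score > best_score:
            best_prompt, best_score = prompt, score
    return best_prompt, best_score
```

Because `evaluate` is just a callable, the same loop can be re-run against an existing evaluation set whenever a new model ships, which is how the self-improving behavior described below falls out.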

The AutoPrompter demonstrated impressive results in practice, improving accuracy from 39% to 81% on challenging problems while providing detailed cost tracking and performance analytics. The system can be integrated with existing evaluation frameworks, enabling automated optimization whenever new models become available. This creates a self-improving system where prompt performance can be continuously enhanced without human intervention.

Pinterest's approach to cost management balances thorough evaluation with practical constraints. They typically run evaluations on datasets of 1,000 to 5,000 examples, which provides statistically meaningful results while keeping costs manageable. The platform provides real-time cost estimates and tracks multiple evaluation dimensions including accuracy, toxicity scores, and other safety metrics. This multi-dimensional evaluation approach ensures that improvements in one area don't come at the expense of safety or other critical considerations.
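A minimal sketch of this style of bounded, multi-dimensional evaluation (random sample of the dataset, one score per dimension), with all names and metric choices hypothetical:

```python
import random


def evaluate_prompt(dataset, run_prompt, metrics, sample_size=1000, seed=0):
    """Score a prompt on a bounded random sample across several dimensions.

    dataset:     list of (input, expected) pairs
    run_prompt:  callable input -> model output
    metrics:     dict of name -> callable (output, expected) -> float in [0, 1]
    """
    rng = random.Random(seed)  # fixed seed: comparable runs across prompts
    sample = rng.sample(dataset, min(sample_size, len(dataset)))
    totals = {name: 0.0 for name in metrics}
    for inp, expected in sample:
        output = run_prompt(inp)
        for name, fn in metrics.items():
            totals[name] += fn(output, expected)
    return {name: total / len(sample) for name, total in totals.items()}
```

Reporting every dimension together, rather than a single accuracy number, is what makes the "improvement here must not regress safety there" check enforceable.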

The organizational philosophy underlying Pinterest's LLMOps strategy emphasizes several key principles. They prioritize simplification for accelerated adoption, providing GUI interfaces over APIs wherever possible. They've learned that deterministic workflows generally achieve higher success rates than open-ended agent systems, leading them to convert successful experimental approaches into structured workflows. The company allocates dedicated time for bottom-up innovation, immediately following training with hands-on experimentation opportunities.

Pinterest has deliberately avoided creating a specialized generative AI team, instead expecting all engineers to develop GenAI capabilities as part of their core responsibilities. This approach distributes AI expertise throughout the organization while preventing the bottlenecks that can arise from centralized AI teams. They maintain the assumption that current tools and approaches will become outdated within six months, encouraging rapid experimentation and iteration without excessive attachment to particular solutions.

The impact on non-technical employees has been particularly noteworthy. Pinterest cites examples of sales employees developing RAG-based Slack bots that became widely used company tools, and notes that some of their most effective prompt engineers come from backgrounds in philosophy and linguistics rather than computer science. This suggests that domain expertise and communication skills may be more important than technical programming knowledge for effective prompt engineering.

Pinterest maintains a support system designed to be genuinely helpful and engaging rather than bureaucratic. Their "Prompt Doctor" hotline, complete with medical-themed humor, has handled over 200 sessions of one to two hours each, helping teams accelerate use cases by approximately six months. This human-centered support approach complements their technological solutions and helps maintain adoption momentum.

The documentation and knowledge management challenges inherent in rapid AI development are addressed through quarterly "docathons" focused on deleting outdated documentation and updating current information. They've also implemented automated systems that flag conflicting information sources and alert document owners when inconsistencies are detected.

While Pinterest's presentation focuses heavily on successes, some challenges and limitations can be inferred from their approach. The need for quarterly documentation cleanup suggests ongoing struggles with information currency and consistency. The emphasis on deterministic workflows over open-ended agents indicates limitations in current agent reliability for complex tasks. The cost optimization focus suggests that token costs remain a significant operational consideration even with their efficient approaches.

Pinterest's LLMOps strategy demonstrates that successful enterprise AI adoption requires more than technical infrastructure: it demands thoughtful organizational change management, creative training approaches, and systems designed to empower rather than gate-keep AI capabilities. Their approach of treating every employee as a potential AI contributor, combined with robust technical infrastructure and support systems, provides a compelling model for democratizing AI capabilities within large organizations. The measurable success of their platform adoption and the innovative contributions from non-technical employees validate their thesis that the next breakthrough in AI applications may come from domain experts armed with prompt engineering skills rather than traditional AI specialists.

More Like This

Agentic AI Copilot for Insurance Underwriting with Multi-Tool Integration

Snorkel 2025

Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.


Building AI-Native Platforms: Agentic Systems, Infrastructure Evolution, and Production LLM Deployment

Delphi / Seam AI / APIsec 2025

This panel discussion features three AI-native companies—Delphi (personal AI profiles), Seam AI (sales/marketing automation agents), and APIsec (API security testing)—discussing their journeys building production LLM systems over three years. The companies address infrastructure evolution from single-shot prompting to fully agentic systems, the shift toward serverless and scalable architectures, managing costs at scale (including burning through a trillion OpenAI tokens), balancing deterministic workflows with model autonomy, and measuring ROI through outcome-based metrics rather than traditional productivity gains. Key technical themes include moving away from opinionated architectures to let models reason autonomously, implementing state machines for high-confidence decisions, using tools like Pydantic AI and Logfire for instrumentation, and leveraging Pinecone for vector search at scale.


Forward Deployed Engineering: Bringing Enterprise LLM Applications to Production

OpenAI 2025

OpenAI's Forward Deployed Engineering (FDE) team, led by Colin Jarvis, embeds with enterprise customers to solve high-value problems using LLMs and deliver production-grade AI applications. The team focuses on problems worth tens of millions to billions in value, working with companies across industries including finance (Morgan Stanley), manufacturing (semiconductors, automotive), telecommunications (T-Mobile, Klarna), and others. By deeply understanding customer domains, building evaluation frameworks, implementing guardrails, and iterating with users over months, the FDE team achieves 20-50% efficiency improvements and high adoption rates (98% at Morgan Stanley). The approach emphasizes solving hard, novel problems from zero-to-one, extracting learnings into reusable products and frameworks (like Swarm and Agent Kit), then scaling solutions across the market while maintaining strategic focus on product development over services revenue.
