LinkedIn extended their generative AI application tech stack to support building complex AI agents that can reason, plan, and act autonomously while maintaining human oversight. Evolving the original GenAI stack toward multi-agent orchestration meant leveraging existing infrastructure: gRPC for agent definitions, messaging systems for multi-agent coordination, and comprehensive observability through OpenTelemetry and LangSmith. The platform enables agents to work both synchronously and asynchronously, supports background processing, and includes features like experiential memory, human-in-the-loop controls, and cross-device state synchronization, ultimately powering products like LinkedIn's Hiring Assistant, which became globally available.
LinkedIn’s evolution from their initial generative AI application tech stack to supporting complex AI agents represents a significant advancement in production-scale LLMOps implementation. This case study details how LinkedIn extended their existing GenAI platform to handle autonomous and semi-autonomous AI agents that can perform complex, long-running tasks while maintaining human oversight and control.
LinkedIn’s approach to building AI agents centers around modularity and reuse of existing infrastructure. Rather than building a completely new system, they strategically extended their current GenAI tech stack to support agentic workflows. The core philosophy involves treating agents not as single monolithic applications but as facades over multiple specialized agentic applications, providing benefits in modularity, scalability, resilience, and flexibility.
The platform architecture leverages LinkedIn’s existing service-to-service communication patterns using gRPC for agent definitions. Developers annotate standard gRPC service schema definitions with platform-specific proto3 options that describe agent metadata, then register these through a build plugin into a central skill registry. This registry tracks available agents, their metadata, and invocation methods, creating a discoverable ecosystem of agent capabilities.
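The registry pattern described above can be sketched in a few lines. This is a hedged illustration only — the class and field names below (`SkillMetadata`, `SkillRegistry`, the example endpoint) are hypothetical stand-ins, not LinkedIn's actual API; in the real system the metadata comes from proto3 options extracted by a build plugin rather than hand-written objects:

```python
from dataclasses import dataclass, field

@dataclass
class SkillMetadata:
    """Metadata a build plugin might extract from annotated proto options."""
    name: str
    description: str
    endpoint: str                      # RPC endpoint used to invoke the agent
    input_schema: dict = field(default_factory=dict)

class SkillRegistry:
    """Central, discoverable catalog of agent capabilities."""
    def __init__(self):
        self._skills = {}

    def register(self, meta: SkillMetadata):
        self._skills[meta.name] = meta

    def search(self, keyword: str):
        """Let developers and other agents discover skills by keyword."""
        return [m for m in self._skills.values()
                if keyword.lower() in (m.name + m.description).lower()]

registry = SkillRegistry()
registry.register(SkillMetadata(
    name="candidate_search",
    description="Finds candidates matching a hiring spec",
    endpoint="grpc://hiring-assistant/CandidateSearch",  # illustrative address
))
print([m.name for m in registry.search("candidate")])  # ['candidate_search']
```

The key property is discoverability: any registered agent's metadata and invocation method can be looked up centrally rather than hard-wired between callers.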
One of the most significant technical decisions was using LinkedIn’s existing messaging system as the foundation for multi-agent orchestration. This choice addresses the classic distributed-systems trade-offs of consistency, availability, and partition tolerance while handling the additional complexity of highly non-deterministic GenAI workloads. The messaging system provides guaranteed first-in-first-out (FIFO) delivery, seamless message history lookup, horizontal scaling across multiple regions, and built-in resilience constructs for persistent retries and eventual delivery.
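The ordering and history guarantees are the essential properties here. A toy model (pure in-memory Python, standing in for a persistent, replicated messaging system) makes the per-conversation FIFO contract concrete:

```python
from collections import defaultdict, deque

class ConversationBus:
    """Toy model of per-conversation FIFO delivery with history lookup.

    A production messaging system adds persistence, persistent retries,
    and multi-region replication; this only illustrates the ordering
    guarantee agents rely on."""
    def __init__(self):
        self._queues = defaultdict(deque)   # conversation_id -> pending messages
        self._history = defaultdict(list)   # conversation_id -> delivered messages

    def publish(self, conversation_id: str, message: str):
        self._queues[conversation_id].append(message)

    def deliver_next(self, conversation_id: str):
        """Deliver the oldest pending message first (FIFO)."""
        if not self._queues[conversation_id]:
            return None
        msg = self._queues[conversation_id].popleft()
        self._history[conversation_id].append(msg)
        return msg

    def history(self, conversation_id: str):
        """Seamless lookup of everything delivered so far."""
        return list(self._history[conversation_id])

bus = ConversationBus()
bus.publish("conv-1", "plan the task")
bus.publish("conv-1", "execute step 1")
assert bus.deliver_next("conv-1") == "plan the task"
assert bus.deliver_next("conv-1") == "execute step 1"
```

Because ordering is scoped per conversation, independent conversations can still be processed in parallel across regions.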
The messaging-based approach enables various execution modalities. Agents can respond in single chunks, incrementally through synchronous streaming, or split responses across multiple asynchronous messages. This flexibility allows modeling a wide range of execution patterns from quick interactive responses to complex background processing tasks.
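The three modalities map naturally onto different function shapes. As a minimal sketch (function names are illustrative, not platform API), a single-chunk reply is a plain return, synchronous streaming is a generator, and asynchronous delivery is a sequence of independent messages:

```python
def respond_single(prompt: str) -> str:
    """Single chunk: the caller blocks until the full answer exists."""
    return f"full answer to: {prompt}"

def respond_streaming(prompt: str):
    """Synchronous streaming: yield pieces incrementally as produced."""
    for token in ["partial", "answer", "to:", prompt]:
        yield token

def respond_async_split(prompt: str) -> list[str]:
    """Asynchronous: the reply arrives as several independent messages,
    e.g. an acknowledgement now and the finished result later."""
    return [f"ack: working on {prompt}", f"done: result for {prompt}"]

assert respond_single("q") == "full answer to: q"
assert list(respond_streaming("q")) == ["partial", "answer", "to:", "q"]
assert respond_async_split("q")[0].startswith("ack")
```

The same messaging substrate carries all three, which is what lets one platform serve both quick interactive turns and long-running background work.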
To abstract the messaging complexity from developers, LinkedIn built adapter libraries that handle messaging-to-RPC translations through a central agent lifecycle service. This service creates, updates, and retrieves messages while invoking the appropriate agent RPC endpoints, maintaining clean separation between the messaging infrastructure and agent business logic.
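The shape of that adapter layer can be sketched as follows — a hypothetical, simplified stand-in (the class names and interfaces below are not LinkedIn's actual code): the lifecycle service persists messages on the bus and invokes the agent's RPC endpoint, so agent business logic never touches messaging directly:

```python
class InMemoryBus:
    """Stand-in for the messaging store."""
    def __init__(self):
        self.messages = {}

    def append(self, conversation_id: str, msg: dict):
        self.messages.setdefault(conversation_id, []).append(msg)

class AgentLifecycleService:
    """Translates stored messages into agent RPC calls and persists the
    responses, keeping the bus hidden behind this single seam."""
    def __init__(self, bus, rpc_clients):
        self.bus = bus
        self.rpc_clients = rpc_clients  # agent name -> callable RPC stub

    def handle(self, conversation_id: str, agent_name: str, user_text: str):
        self.bus.append(conversation_id, {"role": "user", "text": user_text})
        reply = self.rpc_clients[agent_name](user_text)   # the RPC invocation
        self.bus.append(conversation_id, {"role": "agent", "text": reply})
        return reply

bus = InMemoryBus()
svc = AgentLifecycleService(bus, {"echo_agent": lambda text: f"echo: {text}"})
assert svc.handle("c1", "echo_agent", "hi") == "echo: hi"
assert len(bus.messages["c1"]) == 2  # both sides of the turn were persisted
```

The separation means an agent can be re-pointed at a different transport (or tested without any bus at all) without changing its business logic.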
Supporting seamless user experiences across web and mobile applications required sophisticated client integration capabilities. Since agent interactions can be asynchronous and span multiple user sessions, LinkedIn developed libraries that handle server-to-client push notifications for long-running task completion, cross-device state synchronization for consistent application state, incremental streaming for optimizing large LLM response delivery, and robust error handling with fallbacks.
The human-in-the-loop (HITL) design ensures that agents seek clarification, feedback, or approvals at key decision points, balancing autonomy with user control. This approach addresses the cognitive limitations of current LLMs while maintaining user trust through transparency and control mechanisms.
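A minimal sketch of such a gate (assuming a callback standing in for the real approval UI — all names here are illustrative): the agent pauses at steps marked sensitive and proceeds only with explicit approval, otherwise recording the step as skipped:

```python
def run_with_approval(plan_steps, approve):
    """Human-in-the-loop gate: execute a plan, but stop at sensitive
    steps and defer to the user. `approve` is a callback that would be
    backed by a real prompt or review UI in production."""
    executed = []
    for step in plan_steps:
        if step.get("needs_approval") and not approve(step["action"]):
            executed.append(("skipped", step["action"]))
            continue
        executed.append(("done", step["action"]))
    return executed

steps = [
    {"action": "draft outreach message"},
    {"action": "send message to candidate", "needs_approval": True},
]
# A user who declines the sensitive step:
result = run_with_approval(steps, approve=lambda action: False)
assert result == [("done", "draft outreach message"),
                  ("skipped", "send message to candidate")]
```

The important design point is that approval is a first-class control-flow branch, not an afterthought: the agent's plan is inspectable and interruptible at exactly the points where autonomy carries risk.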
LinkedIn implemented a sophisticated observability strategy tailored to the distinct phases of agent development. In pre-production environments, they focus on rich introspection and iteration using LangSmith for tracing and evaluation. Since many agent components are built on LangGraph and LangChain, LangSmith provides seamless developer experience with detailed execution traces including LLM calls, tool usage, and control flow across chains and agents.
Production observability relies on OpenTelemetry (OTel) as the foundation, instrumenting key agent lifecycle events such as LLM calls, tool invocation, and memory usage into structured, privacy-safe OTel spans. This enables correlation of agent behavior with upstream requests, downstream calls, and platform performance at scale. While production traces are leaner than pre-production ones, they’re optimized for debugging, reliability monitoring, and compliance requirements.
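To make the span-instrumentation idea concrete without depending on any SDK, here is a minimal pure-Python stand-in (the real platform would use the OpenTelemetry libraries; every name below is illustrative). It shows the two properties the text emphasizes: lifecycle events become structured spans, and identifiers are made privacy-safe before they reach telemetry:

```python
import hashlib
from contextlib import contextmanager

SPANS = []  # stand-in for an OTel exporter

@contextmanager
def agent_span(name: str, **attributes):
    """Record a structured span for an agent lifecycle event
    (LLM call, tool invocation, memory read), including error status."""
    span = {"name": name, "attributes": dict(attributes), "status": "ok"}
    try:
        yield span
    except Exception:
        span["status"] = "error"
        raise
    finally:
        SPANS.append(span)

def privacy_safe(user_id: str) -> str:
    """Hash identifiers before attaching them to telemetry."""
    return hashlib.sha256(user_id.encode()).hexdigest()[:12]

with agent_span("llm_call", model="example-model",
                user=privacy_safe("member-123")):
    pass  # the actual LLM call would happen here

assert SPANS[0]["name"] == "llm_call"
assert "member-123" not in str(SPANS[0])  # raw identifier never leaves the app
```

Correlating these spans with upstream request IDs is what lets agent behavior be debugged alongside ordinary platform traffic.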
The observability stack tightly integrates with LinkedIn’s holistic evaluation platform. Execution traces are persisted and aggregated into datasets that power offline evaluations, model regression tests, and prompt tuning experiments, creating a continuous improvement feedback loop.
The platform includes sophisticated memory management through experiential memory systems that allow agents to remember facts, preferences, and learned information across interactions. This enables personalized, adaptive experiences that feel responsive to individual user needs and contexts. LinkedIn leverages data-intensive big data offline jobs to curate and refine long-term agent memories, ensuring the quality and relevance of retained information.
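The online-accumulation / offline-curation split can be sketched as follows. This is a deliberately crude stand-in (class names hypothetical, and the "big data job" reduced to a frequency filter) meant only to show the shape of the pipeline:

```python
from collections import Counter

class ExperientialMemory:
    """Raw observations accumulate during interactions; a periodic
    offline batch job curates them into durable long-term memories."""
    def __init__(self):
        self.raw = []        # everything observed online
        self.long_term = []  # curated facts that survive refinement

    def observe(self, fact: str):
        self.raw.append(fact)

    def offline_curation(self, min_support: int = 2):
        """Stand-in for the offline job: deduplicate and keep only facts
        seen repeatedly, as a crude quality/relevance filter."""
        counts = Counter(self.raw)
        self.long_term = [fact for fact, n in counts.items() if n >= min_support]

mem = ExperientialMemory()
for fact in ["prefers remote roles", "prefers remote roles", "typo: remoet"]:
    mem.observe(fact)
mem.offline_curation()
assert mem.long_term == ["prefers remote roles"]  # one-off noise filtered out
```

Separating cheap online writes from expensive offline refinement is what keeps memory quality high without slowing down live interactions.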
Context engineering has become a critical practice, involving the strategic feeding of LLMs with appropriate data and memory aligned with specific goals. This approach unlocks new levels of responsiveness and intelligence by making the right information available at the right time within agent workflows.
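The core loop of context engineering — rank candidate information by relevance to the goal, then pack the best of it into a fixed budget — can be sketched as below. This is a toy under loud assumptions: relevance is naive word overlap and the token count is a word count, where a real system would use embeddings and a proper tokenizer:

```python
def assemble_context(goal: str, memories: list[str], budget_tokens: int):
    """Select the most goal-relevant memories that fit the token budget."""
    goal_words = set(goal.lower().split())

    def score(memory: str) -> int:
        # Crude relevance: count of words shared with the goal.
        return len(goal_words & set(memory.lower().split()))

    context, used = [], 0
    for memory in sorted(memories, key=score, reverse=True):
        cost = len(memory.split())          # crude token estimate
        if used + cost > budget_tokens:
            continue
        context.append(memory)
        used += cost
    return context

memories = [
    "candidate prefers remote roles",
    "weather was sunny yesterday",
    "hiring manager wants senior candidate",
]
ctx = assemble_context("find senior remote candidate", memories, budget_tokens=10)
assert "weather was sunny yesterday" not in ctx  # irrelevant memory excluded
```

Whatever the ranking model, the structure is the same: the budget forces an explicit, goal-aligned decision about what the LLM gets to see.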
LinkedIn built a comprehensive Playground environment that serves as a testing ground for developers to enable rapid prototyping and experimentation. The Playground includes agent experimentation capabilities for two-way communication testing, skill exploration tools for searching registered skills and inspecting metadata, memory inspection features for examining contents and historical revisions, identity management tools for testing varied authorization scenarios, and integrated observability providing traces for quick failure insight during development.
This experimentation platform allows developers to validate concepts without extensive integration efforts, supporting a fail-fast, learn-quickly approach to agent development.
LinkedIn has embraced LangGraph as their primary agentic framework, adapting it to work with LinkedIn’s messaging and memory infrastructure through custom-built providers. This allows developers to use popular, well-supported frameworks while leveraging LinkedIn’s specialized platform capabilities.
The platform is incrementally adopting open protocols like Model Context Protocol (MCP) and Agent-to-Agent (A2A) communication standards. MCP enables agents to explore and interact through standardized tool-based interfaces, while A2A facilitates seamless collaboration among agents. This move toward open protocols supports interoperability and helps avoid fragmentation as agent ecosystems grow.
The platform supports both synchronous and asynchronous agent invocation modes. Synchronous delivery bypasses async queues and directly invokes agents with sideways message creation, significantly speeding up delivery for user-facing interactive experiences. Asynchronous delivery provides strong consistency through queued processing, allowing developers to choose the appropriate trade-offs between consistency and performance.
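A small sketch of the two delivery modes (the interface below is hypothetical, not LinkedIn's): the synchronous path calls the agent directly for low latency, while the asynchronous path enqueues work for ordered, retriable background processing:

```python
import queue

class AgentInvoker:
    """Two invocation modes over the same agent function."""
    def __init__(self, agent_fn):
        self.agent_fn = agent_fn
        self.pending = queue.Queue()

    def invoke_sync(self, request: str) -> str:
        # Direct call: bypasses the queue, fastest path for
        # user-facing interactive turns.
        return self.agent_fn(request)

    def invoke_async(self, request: str) -> None:
        # Queued path: ordered, retriable, strongly consistent.
        self.pending.put(request)

    def drain(self):
        """Process queued work, e.g. in a background worker."""
        results = []
        while not self.pending.empty():
            results.append(self.agent_fn(self.pending.get()))
        return results

invoker = AgentInvoker(lambda r: f"handled: {r}")
assert invoker.invoke_sync("quick question") == "handled: quick question"
invoker.invoke_async("long report")
assert invoker.drain() == ["handled: long report"]
```

Exposing both modes behind one interface lets product teams pick the latency/consistency trade-off per call site rather than per agent.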
Background agents represent another significant capability, enabling longer autonomous tasks that can be performed behind the scenes with finished work presented for review. This approach optimizes GPU compute usage during idle or off-peak times while handling complex workflows that don’t require immediate user interaction.
The platform implements strict data boundaries to support privacy, security, and user control. Components like Experiential Memory, Conversation Memory, and other data stores are siloed by design with privacy-preserving methods governing information flow. All sharing between domains happens through explicit, policy-driven interfaces with strong authentication and authorization checks for every cross-component call. This compartmentalized approach ensures that only permitted agents can access specific data, with all access logged and auditable.
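The compartmentalization described above amounts to an allow-list check plus an audit log on every cross-component read. A minimal sketch (all names hypothetical; a real system would also authenticate the caller and enforce policy centrally):

```python
class MemoryAccessPolicy:
    """Every cross-component read passes an explicit allow-list check,
    and every decision — allow or deny — is logged for audit."""
    def __init__(self, allowed: dict):
        self.allowed = allowed   # agent -> set of stores it may read
        self.audit_log = []

    def read(self, agent: str, store: str, stores: dict):
        permitted = store in self.allowed.get(agent, set())
        self.audit_log.append((agent, store, "allow" if permitted else "deny"))
        if not permitted:
            raise PermissionError(f"{agent} may not read {store}")
        return stores[store]

stores = {
    "experiential_memory": ["prefers remote roles"],
    "conversation_memory": ["hello"],
}
policy = MemoryAccessPolicy({"hiring_agent": {"experiential_memory"}})

assert policy.read("hiring_agent", "experiential_memory", stores)
try:
    policy.read("hiring_agent", "conversation_memory", stores)
except PermissionError:
    pass  # siloed store: access correctly denied
assert policy.audit_log[-1] == ("hiring_agent", "conversation_memory", "deny")
```

Denials being logged, not just raised, is the auditability half of the design: reviewers can see every attempted cross-silo access, not only the successful ones.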
LinkedIn emphasizes that there is no single correct path for building successfully with agents, but several key lessons emerge from their experience. Reusing existing infrastructure and providing strong developer abstractions are critical for scaling complex AI systems efficiently. Designing for human-in-the-loop control ensures trust and safety while enabling appropriate autonomy. Observability and context engineering have become essential for debugging, continuous improvement, and delivering adaptive experiences. Finally, adopting open protocols is crucial for enabling interoperability and avoiding fragmentation.
The platform’s evolution represents a mature approach to production LLMOps for agentic systems, demonstrating how established technology companies can extend existing infrastructure to support next-generation AI capabilities while maintaining reliability, security, and scalability requirements. LinkedIn’s experience provides valuable insights for organizations looking to move beyond simple GenAI applications toward more sophisticated agent-based systems that can handle complex, multi-step workflows in production environments.
Stripe, which processes approximately 1.3% of global GDP, has evolved from traditional ML-based fraud detection to transformer-based foundation models for payments that score every transaction in under 100ms. The company built a domain-specific foundation model that treats charges as tokens and behavior sequences as context windows, ingesting tens of billions of transactions to power fraud detection and improving card-testing detection accuracy from 59% to 97% for large merchants. Stripe also launched the Agentic Commerce Protocol (ACP) jointly with OpenAI to standardize how agents discover and purchase from merchant catalogs. This is complemented by internal AI adoption: 8,500 employees use LLM tools daily, 65-70% of engineers use AI coding assistants, and productivity gains include reducing payment method integrations from 2 months to 2 weeks.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
Toyota Motor North America (TMNA) and Toyota Connected built a generative AI platform to help dealership sales staff and customers access accurate vehicle information in real-time. The problem was that customers often arrived at dealerships highly informed from internet research, while sales staff lacked quick access to detailed vehicle specifications, trim options, and pricing. The solution evolved from a custom RAG-based system (v1) using Amazon Bedrock, SageMaker, and OpenSearch to retrieve information from official Toyota data sources, to a planned agentic platform (v2) using Amazon Bedrock AgentCore with Strands agents and MCP servers. The v1 system achieved over 7,000 interactions per month across Toyota's dealer network, with citation-backed responses and legal compliance built in, while v2 aims to enable more dynamic actions like checking local vehicle availability.