## Overview
Rippling represents a comprehensive case study in deploying production AI agents at scale within an enterprise software platform. The company operates an integrated suite covering HR, payroll, benefits, IT identity and device management, and finance (corporate card, travel, expense). With customers ranging from early-stage startups to enterprises with 5,000-6,000 employees (like Anthropic), Rippling faces the challenge of making complex administrative workflows more accessible through AI assistance. Anker, the head of AI who joined approximately seven months before this discussion, oversees both product-facing AI features and internal AI adoption across the organization.
## Evolution of AI Strategy
Rippling's AI journey followed a three-phase evolution that mirrors broader industry patterns. Initially, they focused on content summarization and basic document understanding use cases, leveraging unstructured information across their platform. The LangChain ecosystem was integrated early to enable rapid prototyping of these simpler features. The second phase involved creating standalone AI-powered products that could generate new revenue streams. A notable example is "Talent Signal," which analyzes individual outputs across software development, support, and sales teams to provide performance calibration insights across the entire organization. This product integrates with their existing performance management offerings. They also built forecasting capabilities for shift workers in their time-tracking product, applying more traditional ML techniques.
The third and most recent phase focuses on building production agents that act as system analysts for administrators and end users. This represents the most ambitious technical challenge, as these agents must navigate the complexity of Rippling's vast product footprint while maintaining accuracy and reliability that enterprise customers demand. The company explicitly positions these agents as assistants that help users operate Rippling day-to-day, rather than fully autonomous systems.
## Development Process and Innovation Culture
Rippling maintains a highly decentralized innovation culture with approximately 150 "founders" running different product teams independently. This founder-first culture necessitates a platform approach to AI infrastructure. The AI team runs biannual "hack weeks" where employees across the company can experiment with AI features. In the most recent hack week, 150 AI-related projects were initiated, with roughly 50 reaching completion. This bottom-up innovation is balanced with top-down strategic priorities aligned with business objectives like expanding standalone IT offerings, growing their finance product line, and international expansion.
The path from prototype to production follows a structured approach. The AI team has established foundational primitives that product teams can leverage without reinventing core infrastructure. This foundation includes a data layer built on Databricks, an agent layer developed in partnership with LangChain, and comprehensive evaluation systems. Teams have access to enterprise contracts with OpenAI, Anthropic, and Google, enabling experimentation with cutting-edge models. The philosophy is to provide a "paved path" from prototype to production while allowing exploratory freedom during early-stage development.
When prototypes show promise and align with quarterly product priorities, the AI team actively co-builds with product teams. This collaboration serves a dual purpose: shipping the specific use case to production while identifying missing primitives that can be added to the shared foundation stack. Each production deployment creates a flywheel effect where subsequent teams benefit from newly established patterns and infrastructure. The company maintains a portfolio approach with use cases simultaneously in exploratory, prototyping, and production phases each quarter.
## Challenges in Production Agent Deployment
Anker identifies three primary challenges in moving agents to production, each requiring different mindset shifts and technical approaches. First, there's a cultural challenge around experimentation and failure. Engineers from traditional software backgrounds expect deterministic outcomes where bugs can be fixed with certainty. Machine learning, including LLM-based agents, requires comfort with probabilistic outcomes where some experiments simply don't work and teams must move on. Helping software engineers adopt this mindset represents a significant organizational challenge.
Second, production data access is critical but often underestimated. Demo instances provide misleading signals about agent performance. Rippling has curated production snapshots and maintains safe production instances specifically for AI development. This allows teams to test against real-world data complexity, including edge cases, malformed data, and unexpected user behavior patterns that never appear in synthetic test environments. The diversity of their customer base—from three-person startups to multi-thousand employee enterprises—creates substantial data variability that agents must handle.
Third, establishing rapid user feedback loops proves essential for iterative improvement. Rippling benefits from a "dogfooding" culture where they use their own platform internally, making employees the first alpha testers. For Talent Signal, they rolled it out to internal engineering managers who could provide feedback on performance explanations and calibration accuracy. For administrative agents, super-admins including CEO Parker Conrad actively test features and provide real-time feedback via Slack on what works and doesn't work. This creates an immediate feedback cycle that complements more formal user research and design partnerships.
Beyond user-level feedback, technical tracing becomes invaluable. Anker emphasizes that teams spend substantial time in LangSmith tracing production behavior to understand why certain calls succeed or fail, how LLMs perform on specific queries, and where routing logic or graph execution diverges from expectations. When running controlled tests with internal users, the team accumulates thousands of traces that can be compared and contrasted to identify patterns. They also collect explicit human feedback on desired outputs, which surfaces subtle requirements like date formatting based on user locale and timezone—details that significantly impact user experience but might not be obvious during initial design.
## Agent Architecture: From Deterministic to Agentic
Rippling's architectural journey reveals important lessons about building reliable production agents. Their initial approach favored determinism, creating domain-specific sub-agents for areas like IT identity/devices and payroll. A simple router would deterministically direct queries to the appropriate domain agent based on keywords or intent classification. This approach seemed logical given the structured nature of their data and workflows.
However, they discovered that human language doesn't conform to deterministic boundaries. Users asking "how many people were onboarded last week" versus "how many people were hired last week" might trigger different domain routing despite seeking the same information. Forcing LLMs into overly rigid structures actually made the system less reliable. In the month or so before this discussion, Rippling had begun exploring "deep agent" paradigms, with surprisingly positive results. By giving LLMs appropriate tools and context while allowing them to reason about the best approach, agents became more capable of handling the inevitable edge cases and ambiguous queries that arise in real-world usage.
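The routing failure mode described above is easy to reproduce. The sketch below is purely illustrative (the domain names and keyword lists are invented, not Rippling's code): two phrasings of the same question land on different domain agents because a keyword router only sees surface forms.

```python
# Hypothetical keyword router illustrating the brittleness discussed above.
# Domains and keyword lists are invented for illustration.
KEYWORD_ROUTES = {
    "hr": ["hired", "offer", "recruiting"],
    "it": ["onboarded", "device", "account"],
    "payroll": ["paycheck", "salary", "deduction"],
}

def route(query: str) -> str:
    """Deterministically pick a domain agent from surface keywords."""
    q = query.lower()
    for domain, keywords in KEYWORD_ROUTES.items():
        if any(k in q for k in keywords):
            return domain
    return "fallback"

# Two phrasings of the same question route to different domain agents:
print(route("How many people were hired last week?"))      # -> hr
print(route("How many people were onboarded last week?"))  # -> it
```

A reasoning agent with access to the underlying data can recognize that both queries ask for the same headcount; the deterministic router cannot.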
This doesn't mean abandoning structure entirely. Anker clarifies that workflows—predefined paths with state management—still have value, particularly for accuracy and repeatability. When agents need to take actions (as opposed to just reading and synthesizing information), transactionality and sequencing matter. Their approach evolved to encapsulate deterministic action sequences as tools that agents can invoke. For example, if an agent determines that a specific payroll adjustment is needed, it might call a tool that executes that adjustment through a well-tested, deterministic workflow. This hybrid approach leverages LLM reasoning for understanding complex, ambiguous queries while maintaining reliability for critical operations.
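The workflow-as-tool pattern can be sketched as follows. This is a minimal, hypothetical illustration of the idea, not Rippling's actual API: the agent decides *whether* an adjustment is needed, but the adjustment itself runs through a fixed, ordered, well-tested sequence.

```python
from dataclasses import dataclass

# Hypothetical sketch of encapsulating a deterministic action sequence as a
# single tool the agent can invoke. All names and steps are illustrative.

@dataclass
class AdjustmentResult:
    employee_id: str
    amount: float
    status: str

def adjust_payroll(employee_id: str, amount: float) -> AdjustmentResult:
    """Deterministic workflow exposed to the agent as one tool.

    The steps run in a fixed order: validate, apply, audit. The agent never
    sequences these steps itself, which preserves transactionality.
    """
    # 1. Validate the request (stubbed compliance check)
    if amount == 0:
        return AdjustmentResult(employee_id, amount, "rejected: zero amount")
    # 2. Apply the adjustment in the payroll ledger (stubbed)
    # 3. Write an audit log entry (stubbed)
    return AdjustmentResult(employee_id, amount, "applied")

result = adjust_payroll("emp_123", 250.0)
print(result.status)  # applied
```

In a LangChain-based stack this function would be registered as a tool; the point is that the LLM's reasoning ends at the tool boundary, and everything inside it is deterministic and testable.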
The concept of "context engineering" emerges as central to their work. Rather than trying to force LLMs into rigid patterns, they focus on providing rich context about the problem space, available tools, and constraints. With proper planning steps, LLMs can handle edge cases that would break traditional rule-based systems. Anker suggests this represents an emerging paradigm shift in the industry, where leaning into LLM capabilities (reasoning, judgment, contextual understanding) produces better outcomes than trying to constrain them too tightly.
## Specific Use Cases and Implementation Details
Several concrete examples illustrate Rippling's agent implementations. In recruiting, they built a feature that records and transcribes interviews, then generates summaries for interviewers. The initial implementation was straightforward—transcribe and summarize—but iterative refinement revealed that interviewers needed role-specific insights. When hiring a staff engineer, the summary should highlight signals relevant to that level and role, not just provide a generic transcript summary. This required multiple iterations to identify the right context and prompting approach.
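The iteration from generic to role-specific summaries amounts to conditioning the prompt on the role being hired for. The sketch below is hypothetical (the signal lists and function names are invented for illustration):

```python
# Hypothetical illustration of role-conditioned interview summarization.
# The per-role signal lists are invented examples, not Rippling's rubric.
ROLE_SIGNALS = {
    "staff_engineer": [
        "system design depth",
        "cross-team influence",
        "handling of ambiguity",
    ],
    "junior_engineer": [
        "coding fundamentals",
        "learning speed",
        "communication",
    ],
}

def summary_prompt(role: str, transcript: str) -> str:
    """Build a summarization prompt that highlights role-relevant signals
    instead of producing a generic transcript summary."""
    signals = ", ".join(ROLE_SIGNALS.get(role, ["general strengths and gaps"]))
    return (
        f"Summarize this interview for a {role.replace('_', ' ')} candidate. "
        f"Highlight evidence (or absence) of: {signals}.\n\n"
        f"Transcript:\n{transcript}"
    )
```

The underlying summarization call stays the same; only the injected context changes, which matches the "context engineering" framing discussed later in the piece.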
For payroll troubleshooting, they developed an agent that investigates why an employee didn't receive correct payment. This might involve examining whether the employee changed countries or addresses, which could affect tax deductions and compliance rules. The agent needs to navigate multiple data sources, understand complex payroll rules, and synthesize a coherent explanation. Anker notes that even this seemingly simple problem space requires substantial iterative effort because the set of possible issues is large and the explanations must be comprehensible to non-expert administrators.
The sales briefing agent demonstrates their workflow-tool hybrid approach. It processes introductory calls between account executives and prospects, summarizes the conversation, creates records in Salesforce, and generates documentation. This appears to be a clean, deterministic workflow. However, prospects might ask unexpected questions about security or legal topics that fall outside the predefined flow. The agent needs flexibility to capture and surface these issues even though they're not part of the standard pipeline. The AE still needs to review and validate outputs, similar to how software engineers must review AI-generated code. This acceptance that human-in-the-loop validation remains necessary represents a pragmatic approach to production deployment.
## Tool Management and Context Scaling
As agents grow more capable, managing the number and scope of available tools becomes a critical challenge. Rippling's platform encompasses potentially hundreds of distinct operations across HR, IT, and finance domains. Providing all tools to an agent simultaneously leads to context overload and degraded performance. Their solution involves creating sub-graphs within LangGraph (their primary framework) that contain domain-specific tool sets. When the agent determines a query relates to payroll, it can access a payroll sub-graph with its specialized tools and context.
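The scoping idea can be shown with a deliberately simplified sketch. In LangGraph this corresponds to routing into a domain-specific sub-graph; here a plain dictionary stands in for that structure, and the tool names are invented:

```python
# Hypothetical domain-scoped tool registry. In the real system each domain
# would be a LangGraph sub-graph with its own tools and context; a dict
# stands in for that here, and all tool names are illustrative.
ALL_TOOLS = {
    "payroll": ["get_pay_run", "explain_deduction", "adjust_payroll"],
    "it": ["list_devices", "reset_password", "revoke_access"],
    "hr": ["get_headcount", "get_onboarding_status"],
}

def tools_for(domain: str) -> list[str]:
    """Expose only the detected domain's tools, keeping the agent's
    context small instead of loading all tools at once."""
    return ALL_TOOLS.get(domain, [])

# The payroll agent sees 3 tools, not the full catalog of 8:
print(tools_for("payroll"))
```

With hundreds of operations across HR, IT, and finance, the payoff is that each step of the agent reasons over a handful of relevant tools rather than the entire catalog.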
This architectural pattern aligns with recent industry developments around "progressive disclosure" of tools. Anker and Harrison discuss how Anthropic's "skills" concept enables agents to discover and load tools dynamically as needed, rather than front-loading everything into the initial context. Manus (the AI agent company) shared an interesting optimization in a LangChain webinar: because formal tool definitions sit at the start of the context, adding newly discovered tools there would invalidate the prompt cache. Instead, they expose a single generic bash-execution tool and run scripts converted from the discovered tool definitions. This avoids cache invalidation while still enabling dynamic tool access.
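The caching-friendly shape of that optimization can be sketched as follows. This is a toy illustration of the pattern, not Manus's implementation: one stable, generic execution tool is defined up front, and concrete operations are invoked by name at runtime.

```python
# Toy sketch of the "one generic execution tool" pattern. Because the single
# tool's schema never changes, the cached prompt prefix stays valid even as
# new operations are added. Script names and outputs are invented.
SCRIPTS = {
    "count_hires": lambda since: f"hires since {since}: 12",
    "list_devices": lambda owner: f"devices for {owner}: laptop, phone",
}

def run_script(name: str, arg: str) -> str:
    """The only tool definition the model sees up front; everything else is
    dispatched by name at call time rather than declared in the prompt."""
    script = SCRIPTS.get(name)
    if script is None:
        return f"unknown script: {name}"
    return script(arg)
```

Adding a new entry to `SCRIPTS` changes nothing in the prompt prefix, which is exactly what keeps the cache warm.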
LangChain 1.0 and the deep agents package provide key infrastructure for this approach. The ability to treat sub-graphs and tools uniformly from a development perspective simplifies implementation while enabling sophisticated context management. Teams can test and validate domain-specific sub-graphs independently, then compose them into larger agent systems. This modularity supports both the technical requirements of context management and the organizational reality of distributed product teams building features independently.
## Evaluation and Quality Assurance
While Anker doesn't extensively detail their evaluation systems, several aspects emerge from the discussion. They've implemented comprehensive tracing using LangSmith, which allows teams to inspect every step of agent execution in production. This observability is essential for debugging complex agentic systems where the path from query to answer isn't predetermined. Teams regularly review traces to understand success and failure patterns, comparing different versions during A/B tests or controlled rollouts.
Human feedback collection happens through multiple channels. Direct user feedback on specific outputs (thumbs up/down) provides signal on result quality. Structured feedback sessions with design partners and key stakeholders surface more nuanced issues around workflow fit and feature completeness. Internal dogfooding generates high-volume, real-world usage that might reveal issues not apparent in controlled testing. Anker mentions that this feedback can be "brutal" but is essential for improvement.
For code generation use cases (discussed more in the context of internal productivity), they've enabled multiple AI code review tools including Cursor, BugBot, and others. This provides rapid feedback loops that don't depend on human availability. Engineers remain accountable for code quality, but AI-assisted review can catch issues faster than waiting for human reviewers, especially for larger changes. The company even considered implementing hard limits on pull request size since AI tools make it easy to generate massive PRs that are impractical for humans to review thoroughly.
Context management practices like Cursor rules files, AGENTS.md, and CLAUDE.md conventions help teams maintain quality when generating code or content at scale. Rippling runs monthly "spotlight sessions" called SPARK (Spotlight on AI at Rippling) where AI champions present their approaches and learnings to the broader engineering organization. This creates a learning culture where best practices spread organically. One team's development of spec-driven AI development—writing detailed specifications that AI then implements—was shared in these sessions as a pattern other teams could adopt.
## Responsible AI and Security
Given Rippling's position as a system of record for sensitive employee, payroll, and financial data, their approach to AI security and privacy is necessarily rigorous. For product-facing AI, they design agents to operate "inside the product" within the security boundaries of their existing system. Agents inherit the permissions of the user who invokes them, ensuring that an employee cannot use an agent to access information they wouldn't normally have permission to view. Anker gives the example of the agent correctly refusing to provide another employee's salary information when requested by someone without appropriate permissions.
This permission inheritance model treats agents as extensions of the user rather than privileged system components. For enterprise customers running thousands of employees through Rippling, this design maintains the security model customers already trust. The complexity of enterprise permissions—varying by role, department, location, and numerous other factors—makes this challenging to implement correctly, but it's essential for production deployment in their domain.
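The permission-inheritance model reduces to checking every tool call against the invoking user's existing entitlements before executing it. The sketch below is hypothetical (roles, permission names, and data are invented), but it mirrors the salary example from the discussion:

```python
# Minimal sketch of permission inheritance: the agent's tool call carries the
# invoking user's role and is checked before any data access. All roles,
# permission names, and salary figures here are invented for illustration.
PERMISSIONS = {
    "admin": {"read_salary", "read_profile"},
    "employee": {"read_profile"},
}

SALARIES = {"emp_42": 95_000}

def get_salary(requesting_role: str, employee_id: str) -> str:
    """Agent tool that inherits the invoking user's permissions rather
    than running with elevated system privileges."""
    if "read_salary" not in PERMISSIONS.get(requesting_role, set()):
        return "permission denied"
    return f"{employee_id}: {SALARIES[employee_id]}"

print(get_salary("employee", "emp_42"))  # permission denied
print(get_salary("admin", "emp_42"))
```

Because the check sits inside the tool rather than in the LLM's prompt, a cleverly worded request cannot talk the agent past it.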
When evaluating external AI solutions or partnerships with startups, Rippling applies a comprehensive "responsible AI practice" checklist. Requirements include zero data retention guarantees and commitments not to use customer data for model training. Anker mentions they've conducted pilots with external tools but shut down some that didn't meet their responsible AI standards, even if the technology showed promise. This disciplined approach reflects the regulatory environment (GDPR, CCPA) and the trust customers place in Rippling as a system of record.
Internal AI usage follows a published "AI stance" that communicates the company's position to all employees. The core principle: "AI is your superpower, but you are still accountable." For code generation, engineers remain responsible for the quality of what they push to production. This accountability framework prevents the "AI generated it, not my fault" mindset while encouraging aggressive adoption of AI productivity tools. They provide official access to approved tools (corporate ChatGPT, Claude, etc.) to reduce shadow IT risks while creating an enablement ecosystem that helps employees use AI effectively and safely.
## Internal Productivity and Cultural Change
Beyond product features, Anker's role encompasses driving AI adoption across all internal functions—legal, marketing, recruiting, finance, engineering. His philosophy focuses on productivity rather than efficiency, a subtle but important distinction. Efficiency implies doing the same work with fewer resources, which can create anxiety around job security. Productivity means accomplishing more with the same resources, expanding what's possible. Given that no team at Rippling reports having insufficient work, the focus becomes: what parts of your role can be automated or augmented to free capacity for higher-value work?
Examples include legal teams building agents to answer repetitive questions from sales, reducing the need for legal review on common topics. Product teams create bots that answer implementation questions about feature capabilities, reducing interruptions to product managers. These agents leverage the same LangChain infrastructure and patterns that product teams use, creating a virtuous cycle where internal users become sophisticated consumers of AI capabilities and therefore better producers of customer-facing AI features.
The company provides comprehensive tooling access (enterprise contracts with major AI providers), publishes clear guidelines on acceptable use, and runs regular enablement sessions. The SPARK series mentioned earlier serves both to share technical best practices and to normalize AI usage across the organization. When engineers, lawyers, marketers, and HR professionals all actively use AI daily, it fundamentally changes how they conceptualize product features. They become the first users of their own AI products, creating natural feedback loops and intuition about what works.
This cultural transformation doesn't happen automatically. Anker estimates that 10-20% of employees will experiment with AI regardless of official policy (the "shadow AI" users). Another 70-80% are "sitting on the fence"—curious but uncertain how to start or concerned about doing something wrong. By providing official tools, clear guidelines, and visible support from leadership, the company gives this majority permission and pathways to experiment safely. Requests for custom internal agents flow regularly to the AI team, indicating healthy grassroots adoption.
## Outlook and Industry Trends
Looking forward, Anker expresses a somewhat contrarian view on agent architectures. While much industry discussion focuses on workflow-centric agents with deterministic, repeatable processes, he argues that real-world usage contains too many unknowns and edge cases for purely workflow-based approaches. The sales briefing example illustrates this: even a seemingly straightforward workflow breaks down when customers ask unexpected questions or situations deviate from the norm.
His prediction is that the industry will increasingly "lean into the power of LLMs" rather than trying to constrain them. Given that modern LLMs can perform reasoning, judgment, and complex planning, providing them with rich context and appropriate tools produces more robust systems than rigid workflows. This doesn't mean abandoning structure—deterministic tools and sub-workflows remain valuable—but the orchestration layer should leverage LLM capabilities rather than fight against them.
This perspective aligns with the evolution Harrison describes from early LangGraph implementations (very workflow-oriented) to more recent agentic systems like Claude Code, Manus, and Deep Research. The industry appears to be converging on hybrid architectures: structured components (tools, validated workflows, sub-graphs) orchestrated by flexible, reasoning-capable agents. Rippling's experience suggests this balance is necessary for production deployment where both reliability and adaptability matter.
The discussion of "AI slop"—low-quality AI-generated output that creates more work than it saves—surfaces an important concern. The antidote involves fast feedback loops (both human and automated), clear accountability, and sophisticated context management. Teams that excel at these practices produce high-quality AI-augmented output, while those that treat AI as a black box generator struggle with quality. This matches Rippling's emphasis on tracing, evaluation, and human feedback as core components of their LLMOps practice.
## Technical Stack and Partnerships
Throughout the discussion, specific technologies and partnerships emerge as central to Rippling's approach. LangChain serves as the primary framework for building agents, with LangGraph providing the orchestration layer for complex multi-step processes. The recent LangChain 1.0 release with middleware support and standardized tool definitions has apparently improved their development experience. The deep agents package specifically enables their shift toward more agentic architectures.
LangSmith provides essential observability and tracing capabilities. Teams spend significant time reviewing traces to understand agent behavior, debug issues, and compare versions during testing. This level of introspection into agent execution paths seems critical for their production deployments where reliability matters greatly.
For foundational infrastructure, they've built a data layer on Databricks, providing the source of truth for agent queries. Model providers include OpenAI, Anthropic, and Google via enterprise contracts, giving teams flexibility to choose appropriate models for different use cases. They also evaluate and integrate point solutions from AI startups when those tools provide clear value and meet their responsible AI requirements.
The partnership with LangChain appears substantial beyond just using the open-source tools. Anker and Harrison's dialogue suggests ongoing collaboration where Rippling provides feedback on real-world production needs that inform LangChain's development priorities. This type of close partnership between infrastructure providers and sophisticated users often drives innovation that benefits the broader ecosystem.
## Key Takeaways for LLMOps Practitioners
Rippling's experience offers several important lessons for organizations deploying LLMs in production. First, the progression from simple features to standalone products to complex agents represents a pragmatic maturity curve. Starting with lower-risk, high-value use cases builds organizational capability and infrastructure before tackling the hardest problems. Second, production data access and real user feedback cannot be substituted with synthetic tests or demo environments. The edge cases and real-world messiness matter enormously for agent reliability.
Third, cultural change around experimentation and accountability is as important as technical infrastructure. Helping teams understand that some experiments fail while maintaining accountability for production quality requires careful change management. Fourth, the hybrid architecture of flexible agents orchestrating structured tools and workflows appears more practical for production than either pure workflows or fully autonomous agents. This balance leverages LLM strengths while maintaining reliability for critical operations.
Fifth, comprehensive observability through tracing and monitoring is essential for debugging and improving agentic systems. Without visibility into agent execution paths, teams struggle to diagnose issues or optimize performance. Sixth, the platform approach—building shared primitives and paved paths that distributed teams can leverage—enables rapid innovation while maintaining quality and consistency. Finally, dogfooding and internal adoption create better external products by making developers sophisticated users of the technology they're building.
The overall picture is one of thoughtful, iterative deployment of increasingly sophisticated AI capabilities in a high-stakes environment. Rippling's approach balances innovation speed with the reliability requirements of enterprise customers handling payroll, benefits, and financial operations. Their willingness to share both successes and challenges provides valuable insights for others navigating similar journeys in deploying production LLM systems.