Company
Salesforce
Title
Building and Scaling Production-Ready AI Agents: Lessons from AgentForce
Industry
Tech
Year
2023
Summary (short)
Salesforce introduced AgentForce, a low-code/no-code platform for building, testing, and deploying AI agents in enterprise environments. The case study explores the challenges of moving from proof-of-concept to production, emphasizing the importance of comprehensive testing, evaluation, monitoring, and fine-tuning. Key insights include the need for automated evaluation pipelines, continuous monitoring, and the strategic use of fine-tuning to improve performance while reducing costs.
## Overview

This case study captures a presentation from a Salesforce product leader at "Small Con" (likely a conference focused on smaller language models) discussing their journey building AgentForce, Salesforce's platform for building, testing, and deploying AI agents. The presentation offers candid insights into the operational challenges of deploying LLM-based systems at enterprise scale, particularly for CRM use cases across service, sales, and marketing domains.

Salesforce, as one of the largest enterprise software providers with virtually all major companies as customers, has a unique vantage point on the practical challenges of LLMOps. The speaker emphasizes that while the technology is exciting, the path from demo to production is fraught with challenges that require disciplined engineering practices and proper tooling.

## The Demo-to-Production Gap

One of the central themes is the stark contrast between building initial demos and achieving production readiness. The speaker notes that building early demos has become "super easy," often achievable without data scientists, sometimes in days rather than months. However, this represents only about 10% of the actual work. The remaining 90% involves the difficult task of optimizing from an initial 60% accuracy to the 80-90% threshold required for production deployment.

This gap exists because LLM-based systems are fundamentally non-deterministic. Users and customers are not accustomed to systems that produce different responses to the same prompt on subsequent requests. This creates what the speaker describes as "fear of the unknown" and contributes to negative past experiences that make enterprises reluctant to re-engage with AI solutions.

## The Challenge of Black Box Systems

A significant operational challenge highlighted is the opacity of agentic systems. When a user asks a question and receives a suboptimal response, debugging is extremely difficult.
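In practice, backtracking a bad response requires recording a structured trace of every step an agent takes. The sketch below shows one minimal way to do this; the class names, event schema, and field names are illustrative assumptions, not Salesforce's actual instrumentation.

```python
# Minimal trace recorder for agent runs, so a suboptimal response can be
# backtracked to specific LLM calls, retrieved chunks, and guardrail checks.
# All names here are illustrative, not a real AgentForce API.
import time
from dataclasses import dataclass, field

@dataclass
class TraceEvent:
    step: str          # e.g. "llm_call", "retrieval", "guardrail"
    latency_ms: float
    detail: dict

@dataclass
class AgentTrace:
    request_id: str
    events: list = field(default_factory=list)

    def record(self, step: str, started: float, **detail) -> None:
        # Capture the step type, its wall-clock latency, and free-form metadata.
        self.events.append(
            TraceEvent(step, (time.monotonic() - started) * 1000, detail)
        )

    def summary(self) -> dict:
        """Aggregate view used when debugging a bad response."""
        return {
            "llm_calls": sum(e.step == "llm_call" for e in self.events),
            "total_latency_ms": round(sum(e.latency_ms for e in self.events), 1),
            "chunks_cited": [
                c for e in self.events if e.step == "retrieval"
                for c in e.detail.get("chunk_ids", [])
            ],
            "guardrails_passed": all(
                e.detail.get("passed", True)
                for e in self.events if e.step == "guardrail"
            ),
        }

trace = AgentTrace("req-123")
t0 = time.monotonic(); trace.record("retrieval", t0, chunk_ids=["kb-7", "kb-12"])
t0 = time.monotonic(); trace.record("llm_call", t0, model="small-model", tokens=512)
t0 = time.monotonic(); trace.record("guardrail", t0, check="pii", passed=True)
print(trace.summary())
```

A real deployment would emit these events to a logging or observability backend rather than keep them in memory, but the shape of the data is the point: each of the failure points discussed below maps to a recorded step.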
The speaker asks: "How do you backtrack and understand what happened?" There are numerous potential failure points: the number of LLM calls, latency at each step, which knowledge sources were consulted, what RAG chunks were retrieved and cited, and whether guardrails were properly followed.

This opacity problem becomes particularly acute with agentic workflows, where the system may loop internally, making multiple LLM calls and taking autonomous actions. The speaker emphasizes that agents are increasingly "headless" (embedded in workflows rather than just conversational interfaces), which adds complexity to monitoring and debugging.

## Agent Architecture Principles

The presentation defines five essential attributes of a production agent:

- **Job definition**: Clear understanding of the agent's purpose and jobs to be done
- **Data access**: Grounding in enterprise knowledge to prevent hallucination
- **Actionability**: Ability to execute tasks, either autonomously or with user approval
- **Guardrails**: Trust layer for security and safety, which Salesforce positions as their "number one value"
- **Channel exposure**: Deployment across websites, portals, mobile apps, and other interfaces

## The Optimization Journey

The speaker outlines a typical maturation path for LLM-based solutions: starting with standard LLMs (proprietary or open-source), teams progress through prompt engineering, then add RAG pipelines for grounding, and eventually reach a plateau where further optimization requires fine-tuning. This progression is not arbitrary; organizations don't "jump onto fine-tuning from the get-go" but arrive there after exhausting prompt optimization and RAG improvements.

Fine-tuning becomes necessary when teams need to achieve brand voice alignment (ensuring agents respond in company-specific tone and terminology), reduce token consumption to lower costs, or achieve higher accuracy for domain-specific tasks.
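As a rough illustration of the brand-voice case, fine-tuning data is commonly prepared as prompt/completion pairs serialized to JSONL. The schema and example pairs below are assumptions for illustration, not Salesforce's actual training format.

```python
# Hypothetical brand-voice fine-tuning data as prompt/completion pairs
# in JSONL (one JSON object per line), a common fine-tuning file format.
import json

# Each pair maps a neutral draft to the company's preferred tone.
pairs = [
    {
        "prompt": "Rewrite in brand voice: Your ticket was received.",
        "completion": "Thanks for reaching out! We're on it and will follow up shortly.",
    },
    {
        "prompt": "Rewrite in brand voice: The feature is unavailable.",
        "completion": "That feature isn't available just yet, but we're actively working on it.",
    },
]

def to_jsonl(records: list[dict]) -> str:
    """Serialize one JSON object per line."""
    return "\n".join(json.dumps(r) for r in records)

jsonl = to_jsonl(pairs)
print(jsonl.splitlines()[0])
```

A few hundred such pairs can replace the "20-page prompt" approach described later, shifting tone and terminology into the model weights instead of the context window.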
The speaker specifically mentions finance and legal as domains requiring very high accuracy thresholds.

## Evaluation as Foundation

A key organizational change the speaker implemented was requiring product managers to define clear evaluation criteria in every PRD (Product Requirements Document). This ensures teams start with concrete acceptance criteria, for example, defining whether 80% accuracy is sufficient for production launch.

The evaluation approach involves three components:

- **AI-assisted evaluation**: Using LLMs to evaluate responses across different metrics at scale, for example, taking 100 different question types and systematically assessing response quality.
- **Human-in-the-loop validation**: Randomly spot-checking responses, providing thumbs up/down feedback, and initiating reinforcement learning loops.
- **Batch testing**: Moving beyond single-turn testing to comprehensive batch evaluation. The speaker emphasizes that "single prompt testing is not enough" and that teams relying only on single-turn testing "hit the wall" when trying to scale usage.

## Testing Infrastructure

The presentation demonstrates Salesforce's approach to testing infrastructure, including an agent builder with batch testing capabilities. A key feature is the ability to automatically generate test cases: users can click a button to create 100 or 200 test cases rather than manually authoring each one. This synthetic data generation capability is crucial because "coming up with 5-10 examples is easy; building hundreds of examples is hard."

The speaker emphasizes iteration speed as a key differentiator: "Teams that are doing very fast iterations... when I say fast iteration means they make a change, for example they add a new topic into the agents, and then if they can test that thing within 30 minutes, those are the teams that are moving fast." This 30-minute feedback loop becomes a benchmark for operational excellence.
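The batch-testing idea above can be sketched as a small harness that grades a set of test cases against a PRD acceptance threshold. Here the grading function is stubbed with a substring check so the example runs offline; a real system would use an LLM-as-judge call in its place, and all names are assumptions.

```python
# Sketch of batch evaluation against a PRD acceptance threshold.
# `judge` stands in for an LLM-as-judge call; it is stubbed with a
# substring match here purely so the example is self-contained.
def judge(question: str, response: str, expected: str) -> bool:
    # Placeholder grader; a real system would prompt an LLM to
    # score the response against the expected answer.
    return expected.lower() in response.lower()

def batch_eval(test_cases: list[dict], threshold: float = 0.80) -> dict:
    """Grade every case and compare accuracy to the launch threshold."""
    passed = sum(
        judge(tc["question"], tc["response"], tc["expected"])
        for tc in test_cases
    )
    accuracy = passed / len(test_cases)
    return {"accuracy": accuracy, "launch_ready": accuracy >= threshold}

cases = [
    {"question": "Refund window?", "response": "Refunds within 30 days.", "expected": "30 days"},
    {"question": "Support hours?", "response": "We are open 9-5.", "expected": "9-5"},
    {"question": "Shipping cost?", "response": "Shipping is free.", "expected": "$4.99"},
    {"question": "Warranty?", "response": "One-year warranty.", "expected": "one-year"},
]
print(batch_eval(cases))  # 3 of 4 pass: accuracy 0.75, below the 80% bar
```

The point of the harness is the shape of the loop, not the grader: once hundreds of synthetic cases exist, re-running this after every change is what makes the 30-minute iteration cycle possible.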
## RAG Pipeline Monitoring

The presentation identifies six common failure points in RAG pipelines that require continuous monitoring:

- Query processing failures
- Retriever performance issues
- Chunk retrieval accuracy
- Handoff to the LLM for generation
- Response quality degradation
- Citation and grounding failures

The speaker emphasizes that building a RAG pipeline is relatively easy, but optimizing it for production requires defining metrics at each stage and monitoring improvements over time. Enterprise knowledge is "buried across many different systems: PDF files, images, documents," making this optimization particularly challenging.

## Fine-Tuning Use Cases

Specific use cases where fine-tuning delivers significant value include:

- **Brand voice alignment**: Large enterprises care deeply about consistent brand voice in customer communications. Attempting to achieve this through prompt engineering alone results in unwieldy 20-page prompts, driving up token counts and costs. Fine-tuning provides a more scalable solution.
- **Domain-specific tasks**: Finance and legal customers with specialized terminology and knowledge benefit from fine-tuned models aligned to their enterprise lingo.
- **Cost-to-serve optimization**: As usage scales, fine-tuning smaller models (the speaker mentions 10-billion-parameter models as an example) to achieve comparable quality at faster speeds and lower costs becomes critical.
- **Summarization and structured output**: These specific CRM use cases benefit from fine-tuning to improve consistency and quality.

## Data Sovereignty Concerns

A recurring customer concern is data isolation during fine-tuning. Customers explicitly require that their data not be used to train or improve base models; fine-tuning must occur within their "trust boundaries." This is a critical enterprise requirement that platforms must accommodate.
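The per-stage RAG monitoring described above can be made concrete as metrics computed from request logs. The log schema and field names below are assumptions for illustration; the metrics shown cover only three of the six failure points (retriever performance, citation rate, and grounding).

```python
# Sketch of per-stage RAG metrics computed from request logs.
# Field names ("retrieved", "relevant", "cited") are assumed, not a
# real logging schema.
def rag_stage_metrics(logged_requests: list[dict]) -> dict:
    n = len(logged_requests)
    return {
        # Retriever performance: fraction of requests where at least
        # one known-relevant chunk came back.
        "retrieval_hit_rate": sum(
            bool(set(r["retrieved"]) & set(r["relevant"]))
            for r in logged_requests
        ) / n,
        # Citation: fraction of responses citing at least one chunk.
        "citation_rate": sum(bool(r["cited"]) for r in logged_requests) / n,
        # Grounding failures: citations that were never retrieved.
        "ungrounded_citations": sum(
            len(set(r["cited"]) - set(r["retrieved"]))
            for r in logged_requests
        ),
    }

logs = [
    {"retrieved": ["c1", "c2"], "relevant": ["c2"], "cited": ["c2"]},
    {"retrieved": ["c3"], "relevant": ["c4"], "cited": []},
    {"retrieved": ["c5", "c6"], "relevant": ["c5"], "cited": ["c9"]},
]
print(rag_stage_metrics(logs))
```

Tracking numbers like these per release is what turns "optimize the RAG pipeline" from a vague goal into a measurable one: a regression in `retrieval_hit_rate` points at the retriever, while rising `ungrounded_citations` points at generation.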
## Four Pillars of Production Readiness

The speaker distills their learnings into four essential capabilities:

- **Automated eval plus human-in-the-loop**: Combining AI-assisted evaluation with human oversight
- **Fast feedback systems**: Simple mechanisms like thumbs up/down plus implicit feedback from event logs
- **Pipeline monitoring**: Particularly for RAG systems, tracking metrics at each stage
- **Continuous optimization**: Data-driven conversations about model selection and fine-tuning opportunities

## Key Takeaways

The presentation concludes with two main takeaways. The first is that "agents without good eval, monitoring, and fast feedback loops are useless": vendors and customers who neglect these operational fundamentals will be outpaced by those who invest in them. The second advocates for small language models with fine-tuning as the path to balancing speed, precision, and latency, rather than defaulting to the largest available models for every use case.

## Critical Assessment

While the presentation offers valuable operational insights, it should be noted that this is fundamentally a vendor perspective promoting Salesforce's AgentForce platform. The specific claims about accuracy improvements and cost savings are not substantiated with concrete metrics or customer case studies. The 60% to 80-90% accuracy improvement trajectory, while plausible, is presented anecdotally rather than with rigorous data.

Additionally, the emphasis on low-code/no-code approaches may understate the engineering complexity required for truly production-grade systems. The 30-minute iteration cycle benchmark, while aspirational, may be achievable only for certain types of changes rather than comprehensive system updates.
Nevertheless, the presentation provides a useful framework for thinking about LLMOps maturity and the practical challenges of enterprise AI deployment, particularly around evaluation, monitoring, and the strategic role of fine-tuning in the optimization journey.
