ZenML

Self-Improving Agentic Systems Using DSPy for Production Email Generation

Relevance AI 2025

Relevance AI implemented DSPy-powered self-improving AI agents for outbound sales email composition, addressing the challenge of building truly adaptive AI systems that evolve with real-world usage. The solution integrates DSPy's optimization framework with a human-in-the-loop feedback mechanism, where agents pause for approval at critical checkpoints and incorporate corrections into their training data. Through this approach, the system achieved emails matching human-written quality 80% of the time and exceeded human performance in 6% of cases, while reducing agent development time by 50% through elimination of manual prompt tuning. The system demonstrates continuous improvement through automated collection of human-approved examples that feed back into DSPy's optimization algorithms.

Industry

Tech

Overview and Context

Relevance AI’s case study describes their implementation of DSPy-powered self-improving agentic systems for production use, specifically focused on automated email composition for outbound sales development. The implementation represents what they characterize as a fundamental shift from static, manually-tuned prompt systems to dynamic, self-improving agents that adapt based on real-world feedback. The company claims their DSPy-powered systems generated emails matching human-written quality 80% of the time, with 6% of cases exceeding human performance, while cutting production agent building time by 50%.

The use case centers on outbound sales development automation, where email composition was selected as the primary optimization target due to its requirement for extensive human oversight and complex decision-making. The system operates at a specific integration point in the workflow—after CRM data gathering and prospect research but before email delivery—creating what they describe as an ideal opportunity for feedback-based learning.

Technical Architecture and DSPy Integration

The architecture follows DSPy’s framework with four core pillars: training data acquisition, program training, inference, and evaluation. Relevance AI emphasizes that training data quality represents the most critical component, directly impacting the entire pipeline’s performance. They acknowledge the traditional “garbage in, garbage out” principle, noting that better data and refined gold sets consistently produce superior system outputs.

For program training, the system leverages three DSPy optimizers matched to different data scales. BootstrapFewShot serves as an entry-level optimizer for fewer than 20 samples, identifying the most effective training examples for few-shot demonstrations. BootstrapFewShot with Random Search is recommended for around 50 samples, providing enhanced capability by searching across larger sample sets to find optimal combinations of few-shot examples. MIPROv2 is positioned as the most sophisticated optimizer for 200+ samples, capable of both selecting optimal examples and generating and testing candidate prompt instructions. The optimized programs are automatically cached in the knowledge base under “_dspy_programs” for efficient reuse.
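The tiering described above amounts to a simple selection rule keyed on training-set size. The sketch below is illustrative, not Relevance AI's actual code: the function name is hypothetical, and the thresholds (20 and 200 samples) are taken from the guidance in the text. In real DSPy code, each returned name corresponds to an optimizer class (e.g. `dspy.BootstrapFewShot`, `dspy.BootstrapFewShotWithRandomSearch`, `dspy.MIPROv2`) that would be compiled against the program with a task metric.

```python
def pick_optimizer(n_examples: int) -> str:
    """Choose a DSPy optimizer tier from training-set size.

    Thresholds follow the guidance in the text; the names map onto
    DSPy optimizer classes that would be compiled with a task metric.
    """
    if n_examples < 20:
        return "BootstrapFewShot"  # entry level: picks best few-shot demos
    elif n_examples < 200:
        # searches across larger sample sets for optimal demo combinations
        return "BootstrapFewShotWithRandomSearch"
    else:
        # also proposes and tests candidate prompt instructions
        return "MIPROv2"
```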

During inference, the system leverages cached optimized programs to run inference steps with minimal computational overhead. The system feeds inputs into these optimized programs to generate outputs that form the basis for evaluation metrics. The cloud-based architecture on Relevance’s platform maintains consistent response times of one to two seconds even under heavy load, achieved through sophisticated caching mechanisms and parallel processing capabilities.
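The separation of expensive optimization from fast inference can be modeled as a cache keyed by program name and training-data version, so re-optimization happens only when the data changes. This is a minimal sketch of the pattern, assuming an in-memory store standing in for the “_dspy_programs” knowledge-base namespace; the `ProgramCache` class and key scheme are illustrative, not Relevance AI's implementation.

```python
import hashlib
import json


class ProgramCache:
    """Cache optimized program state so optimization stays off the request path."""

    def __init__(self):
        self._store = {}  # stands in for the "_dspy_programs" knowledge base

    @staticmethod
    def key(program_name: str, trainset: list) -> str:
        # Fingerprint the training data so the program is re-optimized
        # only when the data actually changes.
        fingerprint = hashlib.sha256(
            json.dumps(trainset, sort_keys=True).encode()
        ).hexdigest()[:12]
        return f"_dspy_programs/{program_name}/{fingerprint}"

    def get_or_optimize(self, program_name: str, trainset: list, optimize_fn):
        k = self.key(program_name, trainset)
        if k not in self._store:
            # Expensive path: runs the DSPy optimizer once per dataset version.
            self._store[k] = optimize_fn(trainset)
        return self._store[k]  # fast path at inference time
```

At serving time, every request hits the cached program; the optimizer runs only when a new version of the training set appears.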

Human-in-the-Loop Feedback Mechanism

The feedback mechanism is positioned as central to the self-improvement capability, transforming what would otherwise be a static AI system into an adaptive solution. The implementation uses an “Approval Required” setting for output tools in agent settings, enabling human oversight and refinement of outputs, which creates a learning loop for the agent. When the agent executes its assigned task, the system pauses at key checkpoints for human approval. Humans review and potentially correct the output, and the system adds this feedback to its training data. DSPy then uses the updated training data to improve future performance in a continuous learning cycle.

This approach enables real-time feedback integration where the agent pauses at critical points for human input, with feedback flowing directly into the DSPy training set to create what they describe as a dynamic learning environment that evolves with each interaction. The automated collection of training examples is handled through a “Prompt Optimizer - Get Data” tool that collects and stores training examples automatically.
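The approve-correct-collect cycle described above fits in a few lines. The sketch below is a simplified model, not production code: the `ApprovalCheckpoint` class and its fields are hypothetical stand-ins for the “Approval Required” setting and the “Prompt Optimizer - Get Data” collection step, and the human decision is passed in as an optional correction.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ApprovalCheckpoint:
    """Pause-for-approval loop: accepted outputs become DSPy training examples."""
    trainset: list = field(default_factory=list)

    def review(self, inputs: dict, draft: str,
               correction: Optional[str] = None) -> str:
        # The human either approves the draft as-is or supplies a corrected
        # version. The accepted text is both what gets sent and what is
        # appended to the training data for the next optimization run.
        final = correction if correction is not None else draft
        self.trainset.append({"inputs": inputs, "output": final})
        return final
```

Each call adds one human-vetted example, so the training set (and hence the next DSPy optimization pass) reflects accumulated corrections.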

Evaluation Framework

The evaluation framework is built on comparative analysis, conducting parallel tests between DSPy-powered agents and control agents (non-DSPy variants) using identical inputs, tools, and workflows. At the core of the evaluation process is the semanticF1 score—a metric that uses LLMs to measure semantic precision and recall of responses, combining them into a comprehensive performance indicator. This represents a more sophisticated approach than simple string matching, as it attempts to capture semantic similarity between generated and reference texts.
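At its core the metric is a harmonic mean of the two LLM-judged scores. The function below shows only that final arithmetic, assuming the semantic precision and recall judgments have already been produced upstream by an LLM judge; it is a sketch of the combination step, not DSPy's `SemanticF1` implementation.

```python
def semantic_f1(precision: float, recall: float) -> float:
    """Combine LLM-judged semantic precision and recall into one score.

    precision: fraction of the generated email's claims supported by the
    reference; recall: fraction of the reference's key points covered by
    the generation. Both assumed to lie in [0, 1].
    """
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Because it is a harmonic mean, a generation that is precise but omits most key points (or covers everything but hallucinates) scores well below the arithmetic average of the two judgments.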

However, it’s worth noting that the claimed performance metrics of “80% matching human-written quality” and “6% exceeding human performance” lack detailed methodology explanation. The text doesn’t clarify how “matching quality” is operationalized, whether this is based solely on semanticF1 scores, or what constitutes “exceeding” human performance. These are significant claims that would benefit from more rigorous documentation of evaluation methodology and benchmarking procedures.

Development Timeline and Implementation Process

The implementation timeline is presented as relatively compact. Initial development of the agentic system is described as taking about one week, focusing on understanding business requirements, setting up basic infrastructure, configuring initial workflows, and testing basic functionality. DSPy integration is characterized as straightforward, involving creating a single additional tool within the existing system, implementing pre-built DSPy tool steps, and configuring optimization settings. The efficiency is attributed to using pre-existing components that minimize development time.

Training data collection follows two approaches: using the built-in tool for automated data collection enabling rapid deployment, or developing custom training datasets for specialized applications. While automation delivers quick results, the documentation notes that organizations needing custom datasets should allocate extra time for thorough data preparation and validation. After deployment, the system continues to evolve through integration of human-approved responses into the training set, ongoing refinement based on real-world feedback, and periodic optimization of the training pipeline.

Production Considerations and Operational Details

Several key considerations are outlined for production deployment. The system currently optimizes for positive examples, but the documentation acknowledges potential value in including negative examples in the learning framework to help the system identify and avoid problematic responses. This represents a limitation of the current implementation—the lack of explicit negative example learning may mean the system is slower to learn what not to do.

For brand voice consistency, the approach relies on DSPy learning brand voice naturally through human-approved examples rather than following rigid rules. The system adapts to brand messaging patterns through exposure to approved content, with custom message rules available to further enhance brand alignment. Content safety and compliance are addressed through customizable message rules for content modification, multi-layer content filtering systems, mandatory approval workflows for sensitive content, and automated flagging of prohibited terms and topics.

The platform’s cloud-based architecture is positioned as delivering optimal performance without local processing overhead. The caching mechanisms and parallel processing capabilities are emphasized as key to maintaining the reported one to two second response times. Future improvements under exploration include AI-driven feedback interpretation for more autonomous self-improvement and streamlined selection of gold set examples for optimization.

Critical Assessment and LLMOps Implications

From an LLMOps perspective, this case study illustrates several important patterns and considerations for production AI systems. The integration of DSPy represents an attempt to move beyond manual prompt engineering toward more systematic optimization, which aligns with broader industry trends toward treating prompts as learnable components rather than manually crafted artifacts. The human-in-the-loop approach provides a practical mechanism for continuous improvement while maintaining quality control, addressing the common challenge of deploying AI systems that need to adapt to specific organizational contexts and requirements.

However, the case study should be viewed with appropriate skepticism given its promotional nature. The claimed 50% reduction in development time lacks sufficient detail about what baseline is being compared against and what specific activities were eliminated or shortened. The performance metrics, while impressive if accurate, are presented without sufficient methodological detail to fully assess their validity. The semanticF1 metric, while more sophisticated than simple matching, still relies on LLM-based evaluation which has its own limitations and potential biases.

The approach of caching optimized programs represents a practical solution to the computational overhead that could otherwise make DSPy optimization prohibitive in production. The separation of the optimization phase (which may be computationally expensive) from the inference phase (which needs to be fast) is a sensible architectural decision. The use of different optimizers based on available training data size shows awareness of the trade-offs between optimization sophistication and practical constraints.

The reliance on human approval at critical checkpoints represents both a strength and a limitation. It ensures quality control and provides the training signal for improvement, but also means the system isn’t fully autonomous and requires ongoing human involvement. The scalability of this approach depends on the volume of emails being generated and the availability of human reviewers. For organizations with very high volumes, this could become a bottleneck.

The emphasis on training data quality as the most critical component is well-founded and aligns with broader machine learning principles. The automated collection of approved examples is a practical approach to continuously building training datasets, though the quality of this data depends entirely on the quality and consistency of human feedback. There’s potential for drift or inconsistency if different human reviewers have different standards or if standards change over time.

The lack of discussion about model selection, model versioning, or handling of model updates represents a gap in the documentation. In production LLMOps, managing which underlying LLMs are being used, how they’re versioned, and how to handle updates to base models is crucial. The case study doesn’t address how DSPy optimization transfers across different base models or how the system handles updates to underlying LLM capabilities.

The reported one to two second response times are impressive if accurate, particularly for a system that may be doing few-shot prompting with potentially lengthy example sets. This suggests effective caching and optimization, though the text doesn’t provide detail about variability in response times or how performance degrades under different load conditions.

Overall, this case study represents an interesting application of DSPy for production email generation with a practical approach to continuous improvement through human feedback. The technical architecture appears sound, with appropriate separation of concerns between optimization and inference. However, the promotional nature of the content, lack of detailed methodology for claimed improvements, and absence of discussion about limitations or challenges encountered should lead to cautious interpretation of the results. The approach shows promise for organizations looking to deploy adaptive AI systems with human oversight, but would benefit from more rigorous documentation and independent validation of the claimed performance improvements.
