Company
Unify
Title
AI Agent System for Automated B2B Research and Sales Pipeline Generation
Industry
Tech
Year
2025
Summary (short)
Unify built an AI agent system that automates B2B research and sales pipeline generation by deploying research agents at scale to answer customer-defined questions about companies and prospects. The system evolved from initial ReAct-based agents using GPT-4 Turbo and o1-preview models to a more sophisticated architecture incorporating browser automation, enhanced internet search capabilities, and cost-optimized model selection, ultimately processing 36+ billion tokens monthly while reducing per-query costs from 35 cents to 10 cents through strategic model swapping and architectural improvements.
## Company Overview and Use Case

Unify is a technology company that has developed an AI-powered system for automated business research and sales pipeline generation. The company, led by co-founder and CTO Connor, operates on the principle that "growth should be a science" and addresses the fundamental challenge that go-to-market activities are essentially search problems: finding the people and companies that have problems you can uniquely solve. Their system transforms what was traditionally a manual research process performed by human researchers into an automated, scalable solution built on AI agents.

The core value proposition centers on repeatability, observability, and scalability, benefits that were historically impossible to achieve with human-driven research. Unify's system lets customers define research questions about specific companies or individuals, provide guidance on how the research should be conducted, and then deploy agents to answer those questions across thousands or tens of thousands of companies simultaneously.

## Technical Architecture and Evolution

### Initial Agent Development (Version 1)

Unify's journey began in November with the first versions of their agent. Two founding engineers, Sam and Connor, independently built agents using the ReAct (Reasoning and Acting) framework, which allows corrective and reactive actions throughout an agent's execution trajectory. The framework was chosen specifically because it lets agents reason about their actions and make course corrections during execution.

The initial architecture included three core tools: internet search, website search, and website scraping. The two implementations diverged, however, in their model selection strategies. Sam's "Sambbot Mark1" used GPT-4 Turbo for both plan generation and revision, prioritizing speed and cost with a weaker but faster model. Connor's "Connor agent" employed o1-preview, the reasoning model available at the time, to generate more comprehensive and thorough research plans.

### Model Performance and Cost Analysis

The comparison between these approaches revealed significant insights about model selection in production environments. o1-preview consistently produced more thorough research plans, with typical outputs around 1,600 tokens compared to GPT-4 Turbo's 600 tokens for the same prompts. The added verbosity and specificity of o1-generated plans translated directly into better downstream performance, with Connor's agent outperforming Sambbot across multiple evaluation categories.

Cost, however, became paramount as the system scaled. Initially, agent runs cost approximately 35 cents each using o1 models. Through systematic model evaluation and optimization, Unify reduced this to around 10 cents per run by strategically replacing o1 with GPT-4.1 for agentic planning tasks while maintaining similar performance. The company conducted extensive model comparisons, testing alternatives including DeepSeek, Claude 3.7, and Gemini 2.5. DeepSeek showed promise but suffered from latency issues that only became acceptable within the final weeks of their evaluation period. GPT-4.1 emerged as the most cost-effective option for planning, delivering a significant operational cost reduction while maintaining quality standards.
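To make the plan-then-act pattern concrete, here is a minimal sketch of a ReAct-style research agent with a swappable planning model, assuming the OpenAI Python SDK's chat completions API. The tool definitions, prompts, and `dispatch_tool` stub are illustrative placeholders, not Unify's actual implementation.

```python
import json
from openai import OpenAI

client = OpenAI()

# The planning model is swappable: the case study describes moving plan generation
# from o1-preview to GPT-4.1 to cut per-run cost from ~35 to ~10 cents.
PLANNING_MODEL = "gpt-4.1"
ACTING_MODEL = "gpt-4.1"

# Two of the three original tools (internet search, website scraping) shown here.
TOOLS = [
    {"type": "function", "function": {
        "name": "internet_search",
        "description": "Search the web and return result snippets.",
        "parameters": {"type": "object",
                       "properties": {"query": {"type": "string"}},
                       "required": ["query"]}}},
    {"type": "function", "function": {
        "name": "scrape_website",
        "description": "Fetch and return the text content of a URL.",
        "parameters": {"type": "object",
                       "properties": {"url": {"type": "string"}},
                       "required": ["url"]}}},
]

def dispatch_tool(call) -> str:
    """Hypothetical dispatcher; a real version would hit a search API or scraper."""
    args = json.loads(call.function.arguments)
    if call.function.name == "internet_search":
        return f"[stub] top results for: {args['query']}"
    if call.function.name == "scrape_website":
        return f"[stub] page text from: {args['url']}"
    return "[stub] unknown tool"

def make_plan(question: str, company: str) -> str:
    """Ask the planning model for a step-by-step research plan."""
    resp = client.chat.completions.create(
        model=PLANNING_MODEL,
        messages=[{"role": "user",
                   "content": f"Write a step-by-step research plan to answer "
                              f"'{question}' about {company}."}])
    return resp.choices[0].message.content

def run_agent(question: str, company: str, max_steps: int = 8) -> str:
    """ReAct loop: reason, act via a tool, observe, and revise until an answer."""
    messages = [
        {"role": "system",
         "content": "Follow the research plan; revise it when observations contradict it."},
        {"role": "user", "content": make_plan(question, company)},
    ]
    for _ in range(max_steps):
        resp = client.chat.completions.create(
            model=ACTING_MODEL, messages=messages, tools=TOOLS)
        msg = resp.choices[0].message
        if not msg.tool_calls:  # no tool requested: treat the content as the final answer
            return msg.content
        messages.append(msg)
        for call in msg.tool_calls:
            messages.append({"role": "tool", "tool_call_id": call.id,
                             "content": dispatch_tool(call)})
    return "No answer within the step budget."
```

In a structure like this, swapping `PLANNING_MODEL` between a reasoning model and a cheaper general-purpose model is a one-line change, which is why per-node model selection can move per-run cost so dramatically without touching the rest of the pipeline.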
## Evaluation Framework and Metrics

### Initial Evaluation Approach

Unify's evaluation strategy evolved from informal "vibe checks" to systematic, metric-driven assessment. Initially, the team spent considerable time examining agent traces and trajectories to understand which approaches worked better. This qualitative analysis revealed that o1's more detailed planning significantly improved downstream outcomes.

The formal evaluation framework began with accuracy metrics: simply measuring the percentage of questions answered correctly. The team hand-labeled approximately 500 examples across 100 companies, creating five core datasets based on anticipated customer use cases. These datasets focused on fundamental business intelligence questions, such as whether a company is B2B or B2C, along with demographic and technographic attributes.

### Evaluation Insights and Limitations

While accuracy-based evaluations provided useful heuristics for agent performance, the team found that these metrics had limits. Even with a robust evaluation framework, they frequently returned to trace analysis for human evaluation and "vibe checks." This experience highlights a critical challenge in LLMOps: the difficulty of capturing every relevant aspect of performance through automated metrics alone.

The evaluation process also revealed that edge cases were particularly hard to catch through standard metrics. Even recent changes to OpenAI's models in ChatGPT required manual verification to understand their impact on agent performance. This underscores the ongoing need for human-in-the-loop evaluation in production LLM systems.

## Architectural Improvements and Tool Development

### Strategic Improvement Areas

After analyzing their initial results, Unify identified three primary axes for improvement: changing the agent graph architecture, modifying models and prompts, and adding new tools. Based on customer use cases and workflow requirements, they chose to focus first on model and prompt optimization and on tool expansion.

### Enhanced Internet Research Capabilities

One of the most significant improvements addressed the limitations of traditional internet search in agent workflows. Unify recognized that standard internet search, whether through SEO-optimized Google results or LLM search-grounding services such as OpenAI's and Perplexity's, left result quality largely outside their control. They also observed that their agents weren't conducting internet research the way humans naturally do.

Human internet research typically involves searching for a query, reviewing the top ten links, implicitly filtering out roughly half based on source credibility, opening multiple tabs, skimming the content, and then deciding whether to refine the search or use the information found. Their agents weren't mimicking this behavior effectively.

To address this, Unify upgraded from a naive Pydantic model that contained only query terms to a more sophisticated schema with multiple parameters: category specification, live-crawl options, inclusion of page text in summaries, domain constraints, and publication date filters. This change altered the trajectory of Google searches, letting agents pull in both URLs and actual page content in a single tool call rather than making decisions based solely on search previews.
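As a rough illustration of that schema change, the sketch below contrasts a query-only search input with an enriched one. It assumes Pydantic v2, and every field name is a hypothetical stand-in rather than Unify's actual schema.

```python
from datetime import date
from typing import Optional
from pydantic import BaseModel, Field

class NaiveSearchInput(BaseModel):
    """Original shape: the agent could only choose the query terms."""
    query: str

class DeepSearchInput(BaseModel):
    """Enriched shape: the agent also controls how results are fetched and filtered."""
    query: str = Field(
        description="Search query phrased the way a human researcher would write it.")
    category: Optional[str] = Field(
        default=None, description="Result category to prefer, e.g. news or company site.")
    live_crawl: bool = Field(
        default=False, description="Fetch pages now rather than relying on a cached index.")
    include_page_text: bool = Field(
        default=True, description="Return page text with each result so one call yields URLs and content.")
    include_domains: list[str] = Field(
        default_factory=list, description="Restrict results to these domains.")
    published_after: Optional[date] = Field(
        default=None, description="Ignore results published before this date.")
```

Because the JSON schema, including the field descriptions, is what the model sees when deciding how to call the tool, richer fields double as prompt guidance: the agent can constrain domains and publication dates and receive full page text in one call instead of reasoning over search previews.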
### Browser Automation Integration

Unify developed browser access capabilities to capture rich online data that traditional scraping couldn't handle: interactive search experiences, online datasets that require entering queries, and visual content such as Google Maps or images. Their browser access tool functions as a sub-agent that decomposes tasks into browser trajectories using GPT-4 Mini and executes actions using computer use preview capabilities. They evaluated the open-source Browser Use alternative but found that, while marginally faster, it struggled with complex browser tasks, leading them to adopt computer use preview instead.

One example of this capability involved researching whether Google has EV parking on-site: the agent navigated to Google Maps, used Street View to examine the parking lot for EV charging stations, opened new browser tabs for verification, and confirmed the presence of EV charging infrastructure.

### Tool Development Strategy

Tool development was driven by identifying use cases the system couldn't yet support and determining which single tool addition would unlock the most new customer workflows. This led to four key tools: deep internet research, browser access, HTML searching, and dataset access. Each tool was designed to address a specific limitation in existing capabilities and expand the range of research tasks the agents could perform.

## Production Scaling and Operational Insights

### Token Volume and Cost Management

Unify's system operates at significant scale, processing 36 billion tokens through OpenAI in April alone, with volume continuing to grow month over month. This scale presents both opportunities and challenges for cost optimization. The reduction in per-query cost from 35 cents to 10 cents demonstrates the importance of strategic model selection and architectural optimization in production LLM systems.

### Model Specialization and Node-Based Evaluation

An important operational insight concerned model specialization: a model that performs well for planning may not be equally effective for tool calling, reflection, or other parts of the agent workflow. This realization led Unify to consider node-based evaluation strategies, where different components of the agent pipeline can be optimized with different models based on their specific requirements.

### Edge Case Management

Production revealed numerous edge cases that weren't captured by the evaluation framework. One notable example involved date formatting: GPT-4 struggled with ambiguous formats like "5/14/2025 at 3" versus "May 15, 2024," sometimes incorrectly identifying dates as being in the future. Unify addressed this by providing multiple date format versions in their prompts, improving accuracy and standardizing behavior across models.

### Tool Calling Optimization

Tool calling presented ongoing challenges, particularly agents making "throwaway" tool calls such as overly generic searches for "B2B." Unify addressed this by modifying the Pydantic models for tool input schemas, forcing the tool-calling components to think more carefully about their inputs and improving the quality of tool use.
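A hedged sketch of what that kind of input-schema tightening can look like with Pydantic v2 validators follows; the field names, minimum lengths, and error messages are illustrative assumptions, not Unify's actual rules.

```python
from pydantic import BaseModel, Field, field_validator

class InternetSearchInput(BaseModel):
    """Tool input schema that pushes the model toward deliberate, specific searches."""
    rationale: str = Field(
        description="One sentence on how this search helps answer the research question.")
    query: str = Field(
        min_length=12,
        description="Specific query naming the company, person, or attribute being researched.")

    @field_validator("query")
    @classmethod
    def reject_generic_queries(cls, v: str) -> str:
        # A bare query like "B2B" fails here; the validation error can be fed back
        # to the model so it retries with a more specific search.
        if len(v.split()) < 3:
            raise ValueError(
                "Query too generic: include the company name and the attribute of interest.")
        return v
```

Requiring a rationale field and rejecting very short queries turns each tool call into a small forcing function; one common pattern is to return the validation message to the model as the tool result so it can retry with a better query.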
## Current Architecture and Future Directions

### Production System

The current production system, dubbed "Canal Browser Agent" (continuing their tradition of naming agents after team members), represents the culmination of this architectural evolution. It incorporates the lessons learned from the initial implementations, including cost-optimized model selection, enhanced internet research capabilities, browser automation, and improved tool calling.

### Ongoing Challenges and Development

Unify continues to face interesting challenges in agent development, particularly around evaluation and scalability. Their experience shows that running generalized research agents at scale requires continuous refinement of both technical architecture and evaluation methodology. The team emphasizes that even with sophisticated automated evaluation, human oversight and trace analysis remain crucial parts of their development process.

## LLMOps Lessons and Industry Implications

Unify's experience offers several valuable insights for the broader LLMOps community. Their systematic approach to model evaluation and cost optimization demonstrates the importance of treating model selection as an ongoing operational decision rather than a one-time architectural choice. The significant cost reductions achieved through strategic model swapping highlight the financial implications of these decisions at scale.

Their emphasis on human-in-the-loop evaluation, even with robust automated metrics, underscores the current limitations of purely automated evaluation for complex agent behaviors. The discovery that models perform differently across components of an agent workflow suggests that future LLMOps practice may need to embrace more granular, component-specific optimization strategies.

The evolution from a simple tool set to sophisticated browser automation illustrates how production LLM systems often have to expand beyond traditional API-based tools to achieve real-world utility. The progression from basic internet search to browser automation reflects a broader trend toward richer interaction capabilities in production AI systems.

Unify's experience also highlights the importance of aligning technical capabilities with customer workflows. Their tool development strategy, driven by identifying unsupported use cases and prioritizing the tools that unlock new customer workflows, shows how successful LLMOps implementations require close integration between technical capabilities and business requirements.
