ZenML

Building Production-Ready AI Agents: Lessons from BeeAI Framework Development

IBM 2025
View original source

IBM Research's team spent a year developing and deploying AI agents in production, leading to the creation of the open-source BeeAI Framework. The project addressed the challenge of making LLM-powered agents accessible to developers while maintaining production-grade reliability. Their journey included creating custom evaluation frameworks, developing novel user interfaces for agent interaction, and establishing robust architecture patterns for different use cases. The team successfully launched an open-source stack that gained particular traction with TypeScript developers.

Industry

Tech

Technologies

Overview

This case study documents a year-long journey by IBM Research’s innovation incubation team focused on building AI agents that could democratize access to generative AI capabilities. The author, Maya Murad, shares practical lessons learned from taking AI agents from prototype to production, culminating in the open-source release of the BeeAI Framework. The work represents a thoughtful exploration of the challenges and opportunities in productionizing LLM-powered agentic systems.

Problem Statement and Initial Hypothesis

The team began with two key observations from their prior work incubating generative AI models. First, they recognized that LLMs could be harnessed for higher-complexity problem-solving beyond simple few-shot learning through clever engineering and advanced prompting techniques. Retrieval-Augmented Generation (RAG) demonstrated how models could interact with documents and retrieve factual information, while tool calling enabled dynamic interaction with external systems. When combined with chain-of-thought prompting, these capabilities formed the foundation for AI agents.

Second, they observed a significant gap: non-experts struggled to capture the promised productivity gains of generative AI. While the technology offered vast potential, many users could not translate its capabilities into solving high-value problems. The teams that succeeded had deep expertise in both LLMs and systems engineering. This led to their core hypothesis: could they empower a broader user base to harness AI to solve everyday problems?

Early Prototype and Technical Architecture

To test their hypothesis, the team rapidly prototyped an AI agent called “Peri” that could break down complex queries into sub-steps, navigate multiple web pages to gather supporting information, execute code, and synthesize well-reasoned responses. Notably, this implementation relied on an open-source model, Llama 3-70B-Chat, which unlike OpenAI’s o1 or DeepSeek-R1 did not have built-in reasoning capabilities. They achieved agentic capabilities by implementing a specific architecture combining tool calling with chain-of-thought prompting.

A significant innovation was the development of a “trajectory explorer” - a visual tool that allowed users to investigate the steps taken by the agent to generate its response. This feature went beyond the traditional chat GUI experience and proved crucial for transparency and user trust.

Key Insights from Early Testing

Testing with early adopters revealed several critical insights for production deployments:

Agent trajectory exploration as a trust vector: Users did not simply want correct answers - they wanted to understand how the agent arrived at them. The ability to inspect reasoning steps significantly increased confidence in the system. This finding has important implications for LLMOps, suggesting that observability and explainability features are not optional additions but core requirements for production agent systems.

Reasoning method alignment matters: Users demonstrated strong preferences for reasoning methods that matched their expectations. For example, when solving math problems, users trusted calculator-based solutions over results scraped from search engines. This underscores the importance of designing agent behaviors that align with user mental models - a consideration that goes beyond simple accuracy metrics.

Security amplification: AI agents introduce new security risks by interacting dynamically with external systems. The team identified potential vulnerabilities ranging from unintended system modifications to incorrect or irreversible actions. This is a critical LLMOps consideration, as agentic systems expand the attack surface beyond traditional LLM deployments.

Enterprise rule enforcement: While prompting can guide AI behavior, it does not guarantee strict adherence to business rules. This highlights the need for comprehensive rule enforcement mechanisms that go beyond prompt engineering for enterprise AI agent deployments.

Framework Development and Open-Source Strategy

Based on these learnings, the team decided to develop their own agent framework to address gaps they identified in existing tools for building agents, particularly for full-stack developers. This resulted in the BeeAI Framework, a TypeScript-based library specifically designed to fill this gap. Following advice from mentors, they chose to open-source the entire stack to test what resonated with the community and measure real-world traction.

The framework found its audience primarily with TypeScript developers, who leveraged BeeAI’s capabilities to build innovative applications. Notable community implementations included Bee Canvas and UI Builder, which showcased new human-agent interaction paradigms. Interestingly, their most vocal users were developers engaging with the open-source UI that was initially designed for non-technical users - highlighting the importance of creating intuitive ways for developers to interact with, consume, and demo their AI agents.

Production-Grade Agent Lessons

The team distilled several hard-earned lessons specifically relevant to building production-grade agents:

Agent architectures exist on a spectrum: There is no one-size-fits-all approach to multi-agent architectures. Teams that successfully take agents to production typically design custom architectures tailored to their specific use cases and acceptance criteria. A robust framework should accommodate this diversity. The team’s initial implementation was opinionated - they designed BeeAgent with their own needs in mind. As they gained insights, they renamed it ReActAgent to acknowledge that no single agent design is definitive, and introduced “workflows” - a flexible multi-actor orchestration system enabling users to design agent architectures tailored to their requirements.

Consumption and interaction modalities matter: The way end users interact with agents can be as important as the technical implementation. The team references Anthropic’s Artifact feature as an example of how interaction design can significantly improve the value users derive from LLM-powered workflows. They identified agent-to-agent and agent-to-human interaction as a space ripe for innovation.

Evaluation is key - and more challenging than ever: Without existing benchmark datasets and tooling that fit their needs, the team had to develop their own evaluation processes from scratch. Their approach involved defining every feature their agent should support and constructing a custom benchmark dataset. Features ranged from simple aspects like controlling tone or verbosity of responses to accuracy of complex reasoning tasks. Whenever they introduced a new capability, they rigorously tested that it did not negatively impact existing functionalities. Beyond assessing outputs, they also analyzed the agent’s reasoning trajectory - how it arrived at answers. This remains what they describe as a complex, evolving challenge, with accelerating the path from prototyping to production representing a major opportunity in the field.

User Experience and Developer Experience Lessons

The team also learned important lessons about creating products that users love:

Focus on one persona at a time: Trying to serve multiple user types too soon can dilute impact. While they were excited about reaching everyday builders, their initial traction came from developers, which became their focus going forward. They also recognized that Python is key to unlocking broader adoption, as it remains the dominant language for AI development, and prioritized bringing the Python library to feature parity with TypeScript.

Clean developer experience: A great developer experience makes it easy for newcomers to get value while providing flexibility for advanced users. Initially, their spread of repositories led to friction, prompting consolidation efforts to create a more seamless experience.

Iterate quickly and track what resonates: The AI agent space is evolving rapidly with no clear leader yet. The team advocates staying agile, focusing on user needs, experimenting fast, and refining based on real-world usage.

Assessment and Balanced Perspective

This case study provides valuable practical insights from a research team working on the cutting edge of AI agent development. However, several points warrant consideration. The work represents research and early-stage product development rather than proven enterprise deployments at scale. While the open-source approach is commendable for gathering feedback, the actual production usage patterns and scale are not detailed.

The emphasis on TypeScript as a differentiator is notable but also reflects a potential limitation, as the acknowledgment that Python is needed for broader adoption suggests the initial technology choice may have constrained reach. The evaluation challenges described are real and significant - the fact that the team had to build custom benchmarking infrastructure highlights the immaturity of tooling in this space.

The security concerns raised about agentic systems are important but the specific mitigations implemented are not detailed, leaving this as an identified problem rather than a solved challenge. Similarly, the need for business rule enforcement beyond prompting is identified but the concrete solutions are not fully elaborated.

Overall, this case study represents honest, experience-based learning from a team actively working to productionize AI agents, with insights that should be valuable to others on similar journeys.

More Like This

Building Economic Infrastructure for AI with Foundation Models and Agentic Commerce

Stripe 2025

Stripe, processing approximately 1.3% of global GDP, has evolved from traditional ML-based fraud detection to deploying transformer-based foundation models for payments that process every transaction in under 100ms. The company built a domain-specific foundation model treating charges as tokens and behavior sequences as context windows, ingesting tens of billions of transactions to power fraud detection, improving card-testing detection from 59% to 97% accuracy for large merchants. Stripe also launched the Agentic Commerce Protocol (ACP) jointly with OpenAI to standardize how agents discover and purchase from merchant catalogs, complemented by internal AI adoption reaching 8,500 employees daily using LLM tools, with 65-70% of engineers using AI coding assistants and achieving significant productivity gains like reducing payment method integrations from 2 months to 2 weeks.

fraud_detection chatbot code_generation +57

Reinforcement Learning for Code Generation and Agent-Based Development Tools

Cursor 2025

This case study examines Cursor's implementation of reinforcement learning (RL) for training coding models and agents in production environments. The team discusses the unique challenges of applying RL to code generation compared to other domains like mathematics, including handling larger action spaces, multi-step tool calling processes, and developing reward signals that capture real-world usage patterns. They explore various technical approaches including test-based rewards, process reward models, and infrastructure optimizations for handling long context windows and high-throughput inference during RL training, while working toward more human-centric evaluation metrics beyond traditional test coverage.

code_generation code_interpretation data_analysis +61

Scaling AI Product Development with Rigorous Evaluation and Observability

Notion 2025

Notion AI, serving over 100 million users with multiple AI features including meeting notes, enterprise search, and deep research tools, demonstrates how rigorous evaluation and observability practices are essential for scaling AI product development. The company uses Brain Trust as their evaluation platform to manage the complexity of supporting multilingual workspaces, rapid model switching, and maintaining product polish while building at the speed of AI industry innovation. Their approach emphasizes that 90% of AI development time should be spent on evaluation and observability rather than prompting, with specialized data specialists creating targeted datasets and custom LLM-as-a-judge scoring functions to ensure consistent quality across their diverse AI product suite.

document_processing content_moderation question_answering +52