Lindy.ai evolved from an open-ended LLM agent platform to a more structured workflow-based approach, demonstrating how constraining LLM behavior through visual workflows and rails leads to more reliable and usable AI agents. The company found that by moving away from free-form prompts to guided, step-by-step workflows, they achieved better reliability and user adoption while maintaining the flexibility to handle complex automation tasks like meeting summaries, email processing, and customer support.
Lindy.ai is a no-code platform for building AI agents, positioning itself as “Airtable to LangChain’s MySQL.” Founded by Flo Crivello in late 2022, the platform allows users to create AI-powered automation workflows without coding. This case study documents their significant architectural evolution from an LLM-maximalist approach to a more constrained, structured system—a journey that offers valuable lessons for anyone building production AI agent systems.
The company started during a period Crivello describes as being “LLM-pilled,” believing that with enough context and tools, language models could handle anything. Lindy 1.0 followed this pattern with a big prompt field, a collection of tools, and reliance on the LLM to figure things out. This approach proved unreliable.
The key insight that drove Lindy 2.0 was that constraining agents actually makes them both more reliable and easier for users to understand. Instead of a giant text field where users typed natural language instructions hoping the model would interpret them correctly, the new approach uses a visual workflow builder with explicit components: triggers (e.g., “Zendesk ticket received”), required actions (e.g., “Check knowledge base”), and response generation steps.
This architectural shift reflects a broader principle in AI engineering: minimizing the scope of what you ask the LLM to do while maximizing deterministic, traditional software for everything else. As one of the podcast hosts put it, “Put Shoggoth in a box and make it a very small, minimal viable box. Everything else should be traditional if-this-then-that software.”
Each Lindy agent is built as a directed graph of nodes, where each node can be:
Each node that requires AI processing uses structured outputs. The platform allows configuring which model powers each individual node, with options including Claude 3.5 Sonnet and GPT-4 Turbo. This granular model selection allows optimizing cost and performance for different parts of a workflow.
The system uses an “append-only ledger paradigm” where each node appends to a growing context that subsequent nodes inherit. This design works well with Anthropic’s prompt caching, as the stable prefix of the context can be cached across nodes within a workflow execution.
Rather than asking models to generate free-form responses that are then parsed, Lindy uses structured outputs extensively. When a node like “append row to spreadsheet” executes, the fields to extract are specified as individual prompts. Each purple field in their UI is essentially a mini-prompt that gets the model to extract or generate a specific piece of information.
This approach significantly reduces hallucination risk for structured tasks. The model isn’t generating arbitrary JSON or deciding what fields to include—it’s filling in predefined slots with appropriate content based on context.
Lindy agents can save and retrieve memories between executions. However, the team found that too many memories can confuse agents, so they actively prune memories. This was especially important during the GPT-3.5 era but remains relevant for maintaining agent coherence.
A notable feature discussed is the ability for agents to manage their own memories. For example, when processing emails, users can forward emails to their Lindy and give it rules like “archive emails from X going forward” or “alert me on Slack for this kind of email.” These rules get saved as memories that persist across executions.
Lindys can invoke other Lindys, enabling modular agent design. Users create specialized agents for specific tasks (designer recruiting, customer support, personal assistance) and have them collaborate. This mirrors software engineering practices of separation of concerns and composability.
A sophisticated feature allows workflows to “backtrack” when users respond to agent outputs. For example, if a meeting recorder sends coaching notes and the user replies with a follow-up question, the agent can restore the full context of the original meeting and respond appropriately. This works across different channels—email threads, Slack threads, etc.
The team started building when context windows were 4,000 tokens—smaller than their current system prompts. This historical constraint led to significant investment in context pruning, which Crivello now considers “throwaway work” given modern context window sizes.
A key observation from production experience: with Claude 3.5 Sonnet, the model is no longer the bottleneck. The limitations have shifted to integration quality, user experience, workflow design, and traditional engineering challenges. Crivello notes he finds GPT-4o “overhyped” and doesn’t consider it suitable for agentic behavior.
The company uses different models for different node types, optimizing for the specific requirements of each step. With Anthropic’s prompt caching, they’ve seen significant cost reductions on their append-only context architecture.
Lindy built their own evaluation infrastructure early on, though Crivello acknowledges this was largely because nothing suitable existed in late 2022. They’ve built tooling to create evaluations from conversation histories in one click—essentially turning real conversations into unit tests.
A notable production incident occurred when an agent hallucinated a Rickroll link when asked for video tutorials that didn’t exist. The team caught this through post-agent review, fixed it before the customer saw it, and added a system prompt line to prevent recurrence. Database queries revealed 3-4 other instances had occurred previously—surprisingly low given the volume of interactions.
The company is considering migrating to commercial evaluation tools like Braintrust as the ecosystem has matured.
Lindy implements what Crivello calls a “poor man’s RLHF” system. Users can enable approval requirements on any workflow step, receiving the agent’s proposed action before execution. When users modify an agent’s proposed response before approving, the system:
This creates a continuous improvement loop where agent behavior gets refined through normal usage without requiring explicit training data collection or model fine-tuning.
Rather than relying on computer-use capabilities (like Anthropic’s recently released feature), Lindy focuses on API integrations. Crivello argues that anything that can be done with an API should be done via API, citing orders of magnitude differences in reliability and performance compared to computer-use approaches.
The team has built complex integrations, particularly for meeting scheduling. Their “find available times” action alone is approximately 1,000 lines of code, handling parameters like meeting duration, time windows, weekday preferences, buffer times, and slot quantity.
The team has internalized what they call “the bitter lesson”—the idea that scaling models tends to outperform clever cognitive architectures. Crivello notes he used to be “cognitive architecture-pilled” with patterns like critic-generator loops, but found that simply letting better models do their job often works better.
This philosophy extends to their development prioritization: they actively avoid building features that will improve automatically as models improve, focusing instead on integration quality, user experience, and the input/output layer around models.
Key operational insights from running Lindy in production include:
The case study illustrates a broader trend in AI product development: moving from “give the LLM everything” toward intelligent constraint, where the winners may be those who best package AI capabilities to solve real problems reliably rather than those with the most powerful models.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
Anthropic developed a production multi-agent system for their Claude Research feature that uses multiple specialized AI agents working in parallel to conduct complex research tasks across web and enterprise sources. The system employs an orchestrator-worker architecture where a lead agent coordinates and delegates to specialized subagents that operate simultaneously, achieving 90.2% performance improvement over single-agent systems on internal evaluations. The implementation required sophisticated prompt engineering, robust evaluation frameworks, and careful production engineering to handle the stateful, non-deterministic nature of multi-agent interactions at scale.
Slack's Developer Experience team embarked on a multi-year journey to integrate generative AI into their internal development workflows, moving from experimental prototypes to production-grade AI assistants and agentic systems. Starting with Amazon SageMaker for initial experimentation, they transitioned to Amazon Bedrock for simplified infrastructure management, achieving a 98% cost reduction. The team rolled out AI coding assistants using Anthropic's Claude Code and Cursor integrated with Bedrock, resulting in 99% developer adoption and a 25% increase in pull request throughput. They then evolved their internal knowledge bot (Buddybot) into a sophisticated multi-agent system handling over 5,000 escalation requests monthly, using AWS Strands as an orchestration framework with Claude Code sub-agents, Temporal for workflow durability, and MCP servers for standardized tool access. The implementation demonstrates a pragmatic approach to LLMOps, prioritizing incremental deployment, security compliance (FedRAMP), observability through OpenTelemetry, and maintaining model agnosticism while scaling to millions of tokens per minute.