Lindy.ai: Evolution from Open-Ended LLM Agents to Guided Workflows

LLMOps Database

Tech

Lindy.ai

Company

Lindy.ai

Title

Evolution from Open-Ended LLM Agents to Guided Workflows

Industry

Tech

Link

https://www.latent.space/p/lindy

Year

2024

Summary (short)

Lindy.ai evolved from an open-ended LLM agent platform to a more structured workflow-based approach, demonstrating how constraining LLM behavior through visual workflows and rails leads to more reliable and usable AI agents. The company found that by moving away from free-form prompts to guided, step-by-step workflows, they achieved better reliability and user adoption while maintaining the flexibility to handle complex automation tasks like meeting summaries, email processing, and customer support.

Tags

## Overview Lindy.ai is a no-code platform for building AI agents, positioning itself as "Airtable to LangChain's MySQL." Founded by Flo Crivello in late 2022, the platform allows users to create AI-powered automation workflows without coding. This case study documents their significant architectural evolution from an LLM-maximalist approach to a more constrained, structured system—a journey that offers valuable lessons for anyone building production AI agent systems. ## The Evolution from Lindy 1.0 to 2.0 The company started during a period Crivello describes as being "LLM-pilled," believing that with enough context and tools, language models could handle anything. Lindy 1.0 followed this pattern with a big prompt field, a collection of tools, and reliance on the LLM to figure things out. This approach proved unreliable. The key insight that drove Lindy 2.0 was that constraining agents actually makes them both more reliable and easier for users to understand. Instead of a giant text field where users typed natural language instructions hoping the model would interpret them correctly, the new approach uses a visual workflow builder with explicit components: triggers (e.g., "Zendesk ticket received"), required actions (e.g., "Check knowledge base"), and response generation steps. This architectural shift reflects a broader principle in AI engineering: minimizing the scope of what you ask the LLM to do while maximizing deterministic, traditional software for everything else. As one of the podcast hosts put it, "Put Shoggoth in a box and make it a very small, minimal viable box. Everything else should be traditional if-this-then-that software." ## Technical Architecture ### Workflow-Based Agent Design Each Lindy agent is built as a directed graph of nodes, where each node can be: - A trigger (email received, calendar event, Slack message, etc.) - An action (append to spreadsheet, send message, invoke another Lindy) - A filter/condition - An AI reasoning step Each node that requires AI processing uses structured outputs. The platform allows configuring which model powers each individual node, with options including Claude 3.5 Sonnet and GPT-4 Turbo. This granular model selection allows optimizing cost and performance for different parts of a workflow. The system uses an "append-only ledger paradigm" where each node appends to a growing context that subsequent nodes inherit. This design works well with Anthropic's prompt caching, as the stable prefix of the context can be cached across nodes within a workflow execution. ### Structured Outputs and Constrained Generation Rather than asking models to generate free-form responses that are then parsed, Lindy uses structured outputs extensively. When a node like "append row to spreadsheet" executes, the fields to extract are specified as individual prompts. Each purple field in their UI is essentially a mini-prompt that gets the model to extract or generate a specific piece of information. This approach significantly reduces hallucination risk for structured tasks. The model isn't generating arbitrary JSON or deciding what fields to include—it's filling in predefined slots with appropriate content based on context. ### Memory Management Lindy agents can save and retrieve memories between executions. However, the team found that too many memories can confuse agents, so they actively prune memories. This was especially important during the GPT-3.5 era but remains relevant for maintaining agent coherence. A notable feature discussed is the ability for agents to manage their own memories. For example, when processing emails, users can forward emails to their Lindy and give it rules like "archive emails from X going forward" or "alert me on Slack for this kind of email." These rules get saved as memories that persist across executions. ### Cross-Agent Collaboration Lindys can invoke other Lindys, enabling modular agent design. Users create specialized agents for specific tasks (designer recruiting, customer support, personal assistance) and have them collaborate. This mirrors software engineering practices of separation of concerns and composability. ### Context Restoration and Backtracking A sophisticated feature allows workflows to "backtrack" when users respond to agent outputs. For example, if a meeting recorder sends coaching notes and the user replies with a follow-up question, the agent can restore the full context of the original meeting and respond appropriately. This works across different channels—email threads, Slack threads, etc. ## Model Selection and Evolution The team started building when context windows were 4,000 tokens—smaller than their current system prompts. This historical constraint led to significant investment in context pruning, which Crivello now considers "throwaway work" given modern context window sizes. A key observation from production experience: with Claude 3.5 Sonnet, the model is no longer the bottleneck. The limitations have shifted to integration quality, user experience, workflow design, and traditional engineering challenges. Crivello notes he finds GPT-4o "overhyped" and doesn't consider it suitable for agentic behavior. The company uses different models for different node types, optimizing for the specific requirements of each step. With Anthropic's prompt caching, they've seen significant cost reductions on their append-only context architecture. ## Evaluation and Quality Control Lindy built their own evaluation infrastructure early on, though Crivello acknowledges this was largely because nothing suitable existed in late 2022. They've built tooling to create evaluations from conversation histories in one click—essentially turning real conversations into unit tests. A notable production incident occurred when an agent hallucinated a Rickroll link when asked for video tutorials that didn't exist. The team caught this through post-agent review, fixed it before the customer saw it, and added a system prompt line to prevent recurrence. Database queries revealed 3-4 other instances had occurred previously—surprisingly low given the volume of interactions. The company is considering migrating to commercial evaluation tools like Braintrust as the ecosystem has matured. ## Human-in-the-Loop and Continuous Improvement Lindy implements what Crivello calls a "poor man's RLHF" system. Users can enable approval requirements on any workflow step, receiving the agent's proposed action before execution. When users modify an agent's proposed response before approving, the system: - Saves the modification - Embeds it in a vector database - Retrieves similar examples for future tasks - Injects them into the context window This creates a continuous improvement loop where agent behavior gets refined through normal usage without requiring explicit training data collection or model fine-tuning. ## Integration Strategy Rather than relying on computer-use capabilities (like Anthropic's recently released feature), Lindy focuses on API integrations. Crivello argues that anything that can be done with an API should be done via API, citing orders of magnitude differences in reliability and performance compared to computer-use approaches. The team has built complex integrations, particularly for meeting scheduling. Their "find available times" action alone is approximately 1,000 lines of code, handling parameters like meeting duration, time windows, weekday preferences, buffer times, and slot quantity. ## The Bitter Lesson Applied The team has internalized what they call "the bitter lesson"—the idea that scaling models tends to outperform clever cognitive architectures. Crivello notes he used to be "cognitive architecture-pilled" with patterns like critic-generator loops, but found that simply letting better models do their job often works better. This philosophy extends to their development prioritization: they actively avoid building features that will improve automatically as models improve, focusing instead on integration quality, user experience, and the input/output layer around models. ## Production Lessons Key operational insights from running Lindy in production include: - Users often provision separate Google Workspace accounts for their AI agents to maintain the illusion of a human assistant - Many users intentionally add delays to agent responses to make them seem more human-like - The visual workflow builder dramatically improved user onboarding compared to the original text-field approach - Cross-channel context restoration (responding in email threads, Slack threads) is a significant differentiator - Memory pruning remains important even with larger context windows—not all stored information is useful The case study illustrates a broader trend in AI product development: moving from "give the LLM everything" toward intelligent constraint, where the winners may be those who best package AI capabilities to solve real problems reliably rather than those with the most powerful models.

Start deploying reproducible AI workflows today

Enterprise-grade MLOps platform trusted by thousands of companies in production.

Book a Demo

Use Open Source