Building a Property Management AI Copilot with LangGraph and LangSmith

AppFolio 2024
AppFolio developed Realm-X Assistant, an AI-powered copilot for property management, using tools from the LangChain ecosystem. By transitioning from LangChain to LangGraph for complex workflow management and leveraging LangSmith for monitoring and debugging, they created a system that helps property managers save over 10 hours per week. The implementation included dynamic few-shot prompting, which improved the performance of the text-to-data feature from roughly 40% to roughly 80%, along with robust testing and evaluation processes to ensure reliability.

Industry: Tech

Overview

AppFolio, a technology company serving the real estate industry, developed Realm-X Assistant—an AI-powered copilot designed to streamline the day-to-day operations of property managers. The system represents a practical application of LLMs in a production environment where users interact with complex business data and execute actions across multiple domains including residents, vendors, units, bills, and work orders. According to AppFolio, early users have reported saving over 10 hours per week, though this claim should be considered within the context of a promotional case study published on the LangChain blog.

The core problem Realm-X addresses is the need for a more intuitive natural language interface that allows property managers to engage with AppFolio’s platform without navigating complex menus or learning specialized query languages. The conversational interface aims to help users understand their business state, get contextual help, and execute bulk actions efficiently.

Agent Architecture and Framework Evolution

AppFolio’s journey with Realm-X demonstrates a common evolution pattern in production LLM systems. The team initially built the assistant using LangChain, primarily leveraging its interoperability features that allow for model provider switching without code changes. LangChain also provided convenient abstractions for tool calling and structured outputs.
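
As a minimal illustration of that interoperability (a sketch, not AppFolio's actual code), LangChain's `init_chat_model` helper makes swapping the underlying provider a configuration change rather than a rewrite; the model names and prompt below are illustrative:

```python
# Provider-agnostic model construction via LangChain's init_chat_model.
from langchain.chat_models import init_chat_model

llm = init_chat_model("gpt-4o-mini", model_provider="openai")
# Switching providers is a one-line change, not a code change:
# llm = init_chat_model("claude-3-5-sonnet-latest", model_provider="anthropic")

reply = llm.invoke("Summarize the open work orders for unit 4B.")
print(reply.content)
```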

As the system’s requirements grew more complex, the team made a strategic transition to LangGraph. This shift was motivated by the need to handle more sophisticated request processing and to simplify response aggregation from different processing nodes. LangGraph’s graph-based architecture provided clearer visibility into execution flows, which proved valuable for designing workflows that could “reason before acting”—a pattern commonly associated with more reliable agent behavior.

One of the key architectural benefits highlighted is LangGraph’s ability to run independent code branches in parallel. The system simultaneously determines relevant actions, calculates fallbacks, and runs a question-answering component over help documentation. This parallelization strategy serves a dual purpose: it reduces overall latency by not serializing independent operations, and it enables the system to provide related suggestions that enhance user experience even when the primary action determination is still in progress.
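
A minimal LangGraph sketch of that fan-out/fan-in shape, covering both the parallel branches and the "reason before acting" aggregation step, might look like the following; the node names and stubbed bodies are assumptions, not AppFolio's implementation:

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class RealmXState(TypedDict, total=False):
    query: str
    action: str
    fallback: str
    help_answer: str
    response: str

# Each branch writes to its own state key, so all three nodes can run
# in the same superstep without conflicting updates.
def determine_action(state: RealmXState) -> dict:
    return {"action": f"draft action for: {state['query']}"}  # LLM call in practice

def compute_fallback(state: RealmXState) -> dict:
    return {"fallback": "suggest related bulk actions"}

def answer_from_docs(state: RealmXState) -> dict:
    return {"help_answer": "retrieved help-doc answer"}  # QA over help documentation

def aggregate(state: RealmXState) -> dict:
    return {"response": state.get("action") or state["fallback"]}

builder = StateGraph(RealmXState)
builder.add_node("determine_action", determine_action)
builder.add_node("compute_fallback", compute_fallback)
builder.add_node("answer_from_docs", answer_from_docs)
builder.add_node("aggregate", aggregate)

# Fan out: the three independent branches run in parallel.
for branch in ["determine_action", "compute_fallback", "answer_from_docs"]:
    builder.add_edge(START, branch)
# Fan in: aggregate waits for all three branches to finish.
builder.add_edge(["determine_action", "compute_fallback", "answer_from_docs"], "aggregate")
builder.add_edge("aggregate", END)

graph = builder.compile()
print(graph.invoke({"query": "text all residents about the water shutoff"}))
```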

The transition from a simpler agent framework to a more structured graph-based approach illustrates a common maturation path for production LLM systems, where initial rapid prototyping gives way to more controlled and observable architectures as the stakes and complexity increase.

Production Monitoring with LangSmith

LangSmith serves as the observability backbone for Realm-X in production. The integration provides several capabilities that are essential for operating LLM systems at scale.

For real-time monitoring, the AppFolio team tracks error rates, costs, and latency through LangSmith’s feedback charts. These metrics are fundamental for maintaining service reliability and understanding the operational characteristics of the system. The team has also implemented automatic feedback collection triggers that activate when users submit actions drafted by Realm-X, providing a continuous stream of implicit user satisfaction data.
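
In LangSmith terms, such a trigger can be as small as attaching feedback to the run that produced the draft. The sketch below assumes a hypothetical `on_action_submitted` hook wired into the submit flow:

```python
from langsmith import Client

client = Client()

def on_action_submitted(run_id: str, accepted: bool) -> None:
    """Hypothetical hook fired when a user submits (or discards) a drafted action."""
    # Implicit satisfaction signal, attached to the trace that drafted the action.
    client.create_feedback(
        run_id,
        key="action_submitted",
        score=1.0 if accepted else 0.0,
    )
```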

Beyond passive monitoring, AppFolio employs automatic feedback generation based on both LLM-based evaluators and heuristic evaluators. This hybrid approach to continuous monitoring allows the team to catch issues that might not be immediately apparent from user feedback alone, while also leveraging the flexibility of LLM-based assessment for more nuanced quality dimensions.
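
A stripped-down version of that hybrid pattern could pair a cheap deterministic check with an LLM judge; the judge model, prompt, and feedback keys here are invented for illustration:

```python
from langsmith import Client
from langchain_openai import ChatOpenAI

client = Client()
judge = ChatOpenAI(model="gpt-4o-mini", temperature=0)  # judge model is an assumption

def heuristic_feedback(run_id: str, draft: str) -> None:
    # Heuristic evaluator: the drafted action must be non-empty.
    client.create_feedback(
        run_id, key="non_empty_draft", score=float(bool(draft.strip()))
    )

def llm_judge_feedback(run_id: str, query: str, draft: str) -> None:
    # LLM-based evaluator: a model judges whether the draft addresses the request.
    verdict = judge.invoke(
        "Does this drafted action correctly address the request? Answer YES or NO.\n"
        f"Request: {query}\nDraft: {draft}"
    )
    client.create_feedback(
        run_id,
        key="judge_addresses_request",
        score=float("YES" in verdict.content.upper()),
    )
```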

LangSmith’s tracing capabilities proved particularly valuable for debugging. When issues arise in production, the detailed traces let engineers pinpoint exactly where in the execution flow a problem occurred. During development, the team uses comparison views and the LangSmith playground to iterate on workflows before deployment. Because traces can be shared across the team, they also facilitate collaboration among stakeholders, which is especially important when debugging complex multi-step agent workflows.
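
Getting those traces typically requires little more than environment configuration plus, for code outside LangChain, the `@traceable` decorator; a minimal sketch:

```python
import os
from langsmith import traceable

os.environ["LANGCHAIN_TRACING_V2"] = "true"  # route runs to LangSmith
# LANGCHAIN_API_KEY is read from the deployment environment.

@traceable(name="determine_action")
def determine_action(query: str) -> str:
    # Appears as its own run in the trace tree, nested under any parent run.
    return f"draft action for: {query}"

determine_action("renew the lease for unit 12")
```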

Dynamic Few-Shot Prompting and Prompt Engineering

One of the more technically interesting aspects of the Realm-X system is its use of dynamic few-shot prompting. Rather than relying on static examples embedded in prompts, the system dynamically pulls relevant examples based on the context of each query. This approach enables more personalized and accurate responses by ensuring that the examples the model sees are relevant to the specific task at hand.
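
One common way to implement this pattern (not necessarily AppFolio's) is LangChain's semantic-similarity example selector, which embeds the curated examples once and retrieves the nearest ones for each incoming query; the example actions below are invented:

```python
from langchain_core.example_selectors import SemanticSimilarityExampleSelector
from langchain_core.prompts import FewShotPromptTemplate, PromptTemplate
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

examples = [
    {"query": "text all residents with overdue rent",
     "action": "send_sms(residents, filter='rent_overdue')"},
    {"query": "create a work order for the broken gate",
     "action": "create_work_order(description='broken gate')"},
    {"query": "email the plumbing vendor about unit 7",
     "action": "send_email(vendor='plumbing', re='unit 7')"},
]

# Embed the examples; at query time the selector retrieves the k most
# similar ones, keeping the prompt small and contextually relevant.
selector = SemanticSimilarityExampleSelector.from_examples(
    examples, OpenAIEmbeddings(), FAISS, k=2,
)

prompt = FewShotPromptTemplate(
    example_selector=selector,
    example_prompt=PromptTemplate.from_template("User: {query}\nAction: {action}"),
    prefix="Translate the property manager's request into an action.",
    suffix="User: {query}\nAction:",
    input_variables=["query"],
)

print(prompt.format(query="send a message to tenants about the inspection"))
```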

LangSmith played a crucial role in optimizing this dynamic few-shot system. The team used the platform to identify cases where wrong samples were being pulled, where relevant samples were missing, or where samples were poorly formatted. The comparison view feature was particularly useful for identifying subtle differences in prompt construction that could lead to output variations.

The LangSmith Playground enabled rapid iteration on prompts, base models, and tool descriptions without modifying underlying code. This separation of prompt engineering from code deployment significantly shortened the feedback cycle between stakeholders, allowing non-engineering team members to participate more directly in prompt optimization.

The results of this dynamic few-shot approach are notable: the performance of the text-to-data functionality reportedly improved from roughly 40% to roughly 80%. While the case study doesn’t define the underlying metric, this represents a significant improvement in system accuracy. The team also claims to have maintained high performance even as the number of actions and data models available to users has expanded, suggesting that the dynamic approach scales better than static prompt engineering would.

Evaluation and Testing Strategy

AppFolio has implemented a comprehensive evaluation strategy that emphasizes user experience and continuous validation. Every step in the Realm-X workflow—from individual actions to end-to-end executions—undergoes testing using custom evaluators alongside LangSmith’s evaluation tools.

The team maintains a central repository of sample cases that contain message history, metadata, and ideal outputs. This test case repository serves multiple purposes: the samples can be used as evaluation datasets, unit tests, or as examples for few-shot prompting. This multi-purpose approach to test data is an efficient pattern that ensures alignment between the behavior the system is steered toward (via few-shot examples) and the behavior it’s tested against (via evaluations).
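
In LangSmith, this maps naturally onto a dataset whose examples carry inputs (message history plus metadata) and ideal reference outputs; the dataset name and example contents below are illustrative:

```python
from langsmith import Client

client = Client()

dataset = client.create_dataset(
    dataset_name="realm-x-samples",  # hypothetical name
    description="Message history, metadata, and ideal outputs for Realm-X",
)

# Each example stores the conversation context as inputs and the ideal
# output as the reference; the same rows can back evaluations, unit
# tests, or the few-shot example pool.
client.create_examples(
    inputs=[{"query": "text all residents about the water shutoff",
             "history": []}],
    outputs=[{"action": "send_sms(residents, body='water shutoff notice')"}],
    dataset_id=dataset.id,
)
```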

The integration of evaluations into the CI/CD pipeline is a notable LLMOps practice. Evaluations run as part of CI, with results tracked and integrated into pull requests. Code changes are blocked from merging unless all unit tests pass and evaluation thresholds are met. This gating mechanism helps prevent regressions in LLM-based functionality, which can be particularly insidious since they may not manifest as traditional code failures but rather as degraded output quality.
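
A pared-down version of such a gate, runnable under pytest in CI, might iterate the dataset and fail the build below a threshold. Here `run_realm_x`, the dataset name, and the 80% bar are all assumptions; the actual gate isn't published:

```python
from langsmith import Client

PASS_THRESHOLD = 0.80  # assumed gate; the real threshold isn't published

def run_realm_x(inputs: dict) -> dict:
    """Stand-in for invoking the compiled Realm-X graph."""
    return {"action": ""}  # replace with graph.invoke(inputs)

def test_realm_x_meets_eval_threshold():
    client = Client()
    examples = list(client.list_examples(dataset_name="realm-x-samples"))
    passed = sum(
        run_realm_x(ex.inputs).get("action") == (ex.outputs or {}).get("action")
        for ex in examples
    )
    pass_rate = passed / len(examples)
    # CI blocks the merge when this assertion fails.
    assert pass_rate >= PASS_THRESHOLD, f"pass rate {pass_rate:.0%} < {PASS_THRESHOLD:.0%}"
```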

LangGraph’s structured approach also helps with testability by organizing complex conditional logic (described as “intricate if-statement logic”) into clear, flexible code paths. This structural clarity makes it easier to test individual components and understand system behavior during debugging.

Future Directions

AppFolio is continuing to evolve the Realm-X system with plans to use LangGraph for state management and self-validation loops. Self-validation loops—where the system checks its own outputs before presenting them to users—represent an interesting pattern for improving reliability in agentic systems. State management improvements suggest the team is working toward more complex multi-turn interactions that require maintaining context across extended conversations.
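
As a sketch of what a self-validation loop can look like in LangGraph (the node names, retry cap, and stubbed checks are invented, not AppFolio's design):

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class State(TypedDict, total=False):
    query: str
    draft: str
    valid: bool
    attempts: int

def draft_action(state: State) -> dict:
    # LLM drafting step, stubbed here.
    return {"draft": f"action for: {state['query']}",
            "attempts": state.get("attempts", 0) + 1}

def validate(state: State) -> dict:
    # Self-check before anything reaches the user; an LLM or heuristic
    # verdict in practice, stubbed as a trivial check here.
    return {"valid": bool(state["draft"].strip())}

def route(state: State) -> str:
    # Accept the draft, or loop back for another attempt (capped at 3).
    return "done" if state["valid"] or state["attempts"] >= 3 else "retry"

builder = StateGraph(State)
builder.add_node("draft_action", draft_action)
builder.add_node("validate", validate)
builder.add_edge(START, "draft_action")
builder.add_edge("draft_action", "validate")
builder.add_conditional_edges("validate", route, {"done": END, "retry": "draft_action"})

graph = builder.compile()
print(graph.invoke({"query": "post the June late fees"}))
```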

Critical Assessment

While this case study presents compelling results, several caveats are worth noting. The case study is published on the LangChain blog and naturally emphasizes the benefits of LangChain’s ecosystem of tools. The reported metrics (10+ hours saved per week, 40% to 80% performance improvement) are self-reported without external validation or detailed methodology.

The text doesn’t provide information about the scale of the deployment (number of users, request volume) or specific failure modes encountered. It also doesn’t discuss costs, model selection rationale beyond interoperability, or trade-offs considered when making architectural decisions.

That said, the case study does illustrate several genuine LLMOps best practices: the evolution from simple to more structured agent architectures, the importance of observability and tracing, the value of dynamic prompting strategies, and the integration of LLM evaluations into CI/CD pipelines. These patterns are widely applicable regardless of the specific tooling chosen.

The property management domain is also a reasonable fit for conversational AI, as it involves structured data operations, repetitive tasks, and users who may not have technical backgrounds—all characteristics that favor natural language interfaces over traditional software interfaces.
