ZenML

Production LLM Systems: Document Processing and Real Estate Agent Co-pilot Case Studies

Various 2023

A comprehensive webinar featuring two case studies of LLM systems in production. First, Docugami shared their experience building a document processing pipeline that leverages hierarchical chunking and semantic understanding, using custom LLMs and extensive testing infrastructure. Second, Rehat presented their development of Lucy, a real estate agent co-pilot, highlighting their journey with OpenAI function calling, testing frameworks, and preparing for fine-tuning while maintaining production quality.

Industry: Tech
Overview

This case study summarizes a webinar featuring two companies—Docugami and Rehat—sharing their experiences deploying LLMs into production. Both presenters offer complementary perspectives: Docugami focuses on document engineering with custom models and hierarchical understanding, while Rehat describes building a conversational AI copilot for real estate agents. The presentations are unified by their use of LangChain and LangSmith for orchestration, observability, and continuous improvement.


Docugami: Document Engineering at Scale

Company Background

Docugami is a document engineering company founded by Jean Paoli (co-creator of the XML standard) and Mike Tuck, with around five years of experience building systems that understand and generate complex documents. Their core product processes real-world documents—such as leases, insurance policies, and legal contracts—that contain complex layouts, tables, multi-column structures, and nested semantic hierarchies.

Challenge 1: Structural Chunking for Document Understanding

One of the fundamental challenges Docugami addresses is that real-world documents are not flat text. Standard text extraction and naive chunking approaches fail to capture the true structure of documents. For example, a legal document may have inline headings, two-column layouts, tables spanning columns, and key-value pairs that are semantically related but visually separated.

Docugami's approach is to parse each document into a hierarchy of chunks that preserves layout and semantic relationships, rather than flattening it into plain text.

This hierarchical awareness is critical for high-signal retrieval in RAG applications. Without it, a question like “What is the rentable area for the property owned by DHA group?” fails because standard retrievers cannot disambiguate references across the document.
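To make the idea concrete, here is a minimal sketch of hierarchical chunking in which every chunk carries its ancestry, so a retriever can tell "rentable area" in one lease apart from the same field in another. The class and label names are illustrative, not Docugami's actual data model.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    label: str                      # semantic label, e.g. "RentableArea"
    text: str
    children: list = field(default_factory=list)

def flatten(chunk, ancestry=()):
    """Yield (context_path, text) pairs suitable for retrieval indexing."""
    path = ancestry + (chunk.label,)
    yield " > ".join(path), chunk.text
    for child in chunk.children:
        yield from flatten(child, path)

# Hypothetical lease document as a chunk hierarchy.
lease = Chunk("Lease", "DHA Group lease agreement", [
    Chunk("Premises", "Suite 400", [
        Chunk("RentableArea", "12,500 sq ft"),
    ]),
])

for path, text in flatten(lease):
    print(path, "|", text)
# → Lease > Premises > RentableArea | 12,500 sq ft  (last line)
```

Because each indexed chunk retains its path, a query mentioning "DHA Group" can be resolved to the right subtree instead of matching a bare "12,500 sq ft" string with no context.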

Challenge 2: Semantic Knowledge Graphs for Retrieval

Beyond structure, Docugami builds semantic knowledge graphs that label chunks with domain-specific metadata. This enables advanced retrieval techniques like the LangChain self-querying retriever, which can filter documents based on metadata rather than just text similarity.

In the webinar, a demonstration showed that without Docugami metadata, the self-querying retriever returns zero documents because it tries to filter on file names. With Docugami’s enriched metadata (e.g., filtering by “landlord”), the retriever successfully finds relevant documents.

The XML data model is conceptually simple: chunks contain other chunks, and every chunk has a semantic label. Documents that discuss similar topics share the same labeling schema, enabling cross-document reasoning.
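This chunks-contain-chunks model can be sketched with stdlib XML parsing; the element names below are hypothetical, not Docugami's actual schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical document fragment: chunks nest, and every chunk's tag
# is its semantic label, shared across documents on the same topic.
doc = ET.fromstring("""
<Lease>
  <Landlord>DHA Group</Landlord>
  <Premises>
    <RentableArea>12,500 sq ft</RentableArea>
  </Premises>
</Lease>
""")

def chunks(elem, depth=0):
    """Walk the chunk tree, yielding (depth, semantic_label, text)."""
    yield depth, elem.tag, (elem.text or "").strip()
    for child in elem:
        yield from chunks(child, depth + 1)

for depth, label, text in chunks(doc):
    print("  " * depth + f"{label}: {text}")
```

Because two leases from different landlords share the `Landlord` and `RentableArea` labels, a retriever can filter and compare across documents by label rather than by surface text.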

Challenge 3: Debugging Complex Chains

Docugami shared an example of a complex chain for natural language to SQL translation, spanning input parsing, few-shot example selection, SQL generation with fix-up retries, and parallel explanation generation.

They avoid full agent architectures because agents consume many tokens and smaller custom models are not tuned for agent syntax. Instead, they use LangChain Expression Language (LCEL) to build sophisticated chains with retry logic and parallel execution.
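The two patterns named above, retry logic and parallel execution, can be sketched in plain Python. LCEL exposes them as `.with_retry()` and `RunnableParallel`; this is only the shape of the pattern, not Docugami's chain.

```python
import concurrent.futures

def with_retry(fn, attempts=3):
    """Wrap a flaky step so it is retried up to `attempts` times."""
    def wrapped(x):
        last_err = None
        for _ in range(attempts):
            try:
                return fn(x)
            except Exception as err:   # in practice, catch narrower errors
                last_err = err
        raise last_err
    return wrapped

def parallel(**steps):
    """Run independent named steps concurrently on the same input."""
    def run(x):
        with concurrent.futures.ThreadPoolExecutor() as pool:
            futures = {name: pool.submit(fn, x) for name, fn in steps.items()}
            return {name: f.result() for name, f in futures.items()}
    return run

# Toy usage: generate SQL with retries, then fan out two explanation steps.
generate_sql = with_retry(lambda q: f"SELECT * FROM leases -- {q}")
explain = parallel(short=lambda s: s[:20], upper=lambda s: s.upper())
result = explain(generate_sql("rentable area"))
print(result["short"])
```

The point of composing steps this way rather than handing control to an agent loop is that token usage and failure handling stay explicit and bounded.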

LangSmith is essential for debugging these complex chains. The team demonstrated how to trace through each step—from input parsing to few-shot example selection to parallel explanation generation—and diagnose failures like context overflow or broken SQL that fix-up logic couldn’t repair.

The team emphasized that making LangSmith traces “look good and be actionable” is an art, including naming lambdas properly and passing config correctly to conditionally invoked runnables.

End-to-End LLMOps Workflow

Docugami uses a custom LLM deployed on their own infrastructure for efficiency, serving it on Kubernetes with Nvidia Triton.

Their continuous improvement workflow centers on capturing production runs, fixing bad outputs, and feeding the corrections back into the system.

Every 500 fixed runs, they sample 10% to add to their few-shot learning set, which is reloaded weekly. This improves in-context learning for specific users and question types. However, they note that few-shot learning alone doesn't generalize well; fine-tuning remains essential for broader improvements.
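The batch size (500) and sampling rate (10%) come from the talk; the code around them is an illustrative sketch, not Docugami's pipeline.

```python
import random

BATCH_SIZE = 500
SAMPLE_RATE = 0.10

def maybe_sample(fixed_runs, few_shot_pool, rng=random):
    """Once a full batch of corrected runs accumulates, sample a fraction
    into the few-shot example pool and drain the batch from the queue."""
    if len(fixed_runs) >= BATCH_SIZE:
        batch, rest = fixed_runs[:BATCH_SIZE], fixed_runs[BATCH_SIZE:]
        few_shot_pool.extend(rng.sample(batch, int(BATCH_SIZE * SAMPLE_RATE)))
        return rest
    return fixed_runs

pool = []
runs = [f"run-{i}" for i in range(500)]
runs = maybe_sample(runs, pool)
print(len(pool), len(runs))  # → 50 0
```

A weekly job would then reload `pool` as the in-context example set for the relevant users and question types.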

For data annotation, they use V7 Labs for image-based document annotation and Prodigy (by Explosion AI) for NLP annotation. They expressed interest in deeper integration between tools like Prodigy and LangSmith for a unified workflow.


Rehat: Building Lucy, a Real Estate AI Copilot

Company Background

Rehat provides a SaaS platform for real estate agents, consolidating 10-20 different tools into a single “super app.” Lucy is their AI copilot, designed as a conversational interface that can perform any action within the platform—managing CRM, marketing, and other agent workflows.

Initial Approach and Pivot

The team initially built Lucy using LangChain’s structured chat agent, which provided reasoning capabilities and structured outputs. However, this approach was too slow and produced unreliable structured outputs because the prompt grew very large as they added tools.

The release of OpenAI’s function calling capability was transformative. Function calling embedded reasoning and structured output into the model itself, resulting in 3-4x latency improvements and much more reliable outputs. The timing of this release gave the team confidence that they were solving the right problems.
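For readers unfamiliar with the mechanism: a function is described to the model as a JSON Schema, and the model replies with a structured call instead of free text. The `create_crm_contact` tool below is an invented example in the OpenAI tool-calling format, not one of Lucy's real tools.

```python
import json

create_crm_contact = {
    "type": "function",
    "function": {
        "name": "create_crm_contact",
        "description": "Add a new contact to the agent's CRM.",
        "parameters": {
            "type": "object",
            "properties": {
                "name": {"type": "string", "description": "Full name"},
                "email": {"type": "string", "description": "Email address"},
            },
            "required": ["name"],
        },
    },
}

# Passed as tools=[create_crm_contact] on a chat completions request;
# the model then emits a tool call with validated-shape arguments.
print(json.dumps(create_crm_contact, indent=2))
```

Because the schema travels with every request, the reasoning and output structure live in the model call itself rather than in an ever-growing agent prompt, which is where the latency and reliability gains came from.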

Evaluation and Testing Challenges

A major challenge was measuring LLM performance. They needed to quantify improvements from prompt engineering and fine-tuning, but traditional pass/fail unit testing doesn't fit the non-deterministic, open-ended outputs of an LLM.

Their solution was to build a custom testing pipeline that runs hundreds of synthetic test cases concurrently. They generated test cases by manually asking ChatGPT to create 50 user requests per use case in JSON format, then imported them into their CI/CD system.

The pipeline provides statistical feedback—showing how many tests pass or fail for each use case—rather than binary yes/no results. They continuously add new assertions to catch additional failure modes.
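A statistical harness of this kind can be sketched as follows: run the synthetic cases concurrently, then report a pass rate per use case instead of a single binary verdict. The cases, keyword assertions, and the stand-in for the LLM call are all invented for illustration.

```python
import concurrent.futures
from collections import Counter

def run_case(case):
    """Run one synthetic test case and check its assertions."""
    output = case["input"].lower()          # stand-in for calling the LLM
    passed = all(kw in output for kw in case["expect_keywords"])
    return case["use_case"], passed

cases = [
    {"use_case": "email", "input": "Draft an EMAIL to the buyer",
     "expect_keywords": ["email"]},
    {"use_case": "email", "input": "Send a note",
     "expect_keywords": ["email"]},
    {"use_case": "crm", "input": "Add Jane to the CRM",
     "expect_keywords": ["crm"]},
]

passes, totals = Counter(), Counter()
with concurrent.futures.ThreadPoolExecutor() as pool:
    for use_case, ok in pool.map(run_case, cases):
        totals[use_case] += 1
        passes[use_case] += ok

for uc in totals:
    print(f"{uc}: {passes[uc]}/{totals[uc]} passed")
```

Reporting "8/10 passed for this use case" rather than "the suite failed" is what lets a team tell whether a prompt change moved quality up or down.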

While they built their own test runner (which they acknowledge may have been a mistake), they integrated deeply with LangSmith for logging, trace inspection, and reproducing failed runs in the playground.

This workflow enables rapid iteration: push code, read logs, click a failed test link, reproduce in playground, fix the prompt, commit, and validate against all test cases—all in minutes.

Prompt Engineering Insights

OpenAI function calling imposes its own constraints on how prompts can be engineered, since instructions must coexist with the function schemas the model sees.

Some behaviors cannot be fixed through prompting. For example, the model always adds “Regards, [your name]” placeholders to emails, regardless of instructions. These cases require fine-tuning or software patches.

Fine-Tuning Strategy

Because Lucy relies heavily on function calling and reasoning, Rehat has standardized on GPT models. GPT-4 is too slow for their iterative function-calling architecture, so they use GPT-3.5.

When OpenAI announced GPT-3.5 fine-tuning, they were excited—but fine-tuned models currently lose function calling capability. They are building the pipeline infrastructure so that when OpenAI releases fine-tuning with function calling support, they’ll be ready.

Their planned workflow uses LangSmith datasets to collect good and bad production conversations as training examples for future fine-tuning.

Currently, a human QA team makes the good/bad determination during hundreds of daily test conversations. They may later use LLM evaluators for scalability, but value human oversight during this phase.
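A minimal sketch of the downstream step, assuming OpenAI's chat-format fine-tuning JSONL as the target: keep only the conversations the QA team marked good and serialize each as one training record. The `verdict` field and example conversations are assumptions, not Rehat's actual schema.

```python
import json

labeled = [
    {"verdict": "good", "messages": [
        {"role": "user", "content": "Schedule a showing for 12 Elm St"},
        {"role": "assistant", "content": "Showing scheduled for 12 Elm St."},
    ]},
    {"verdict": "bad", "messages": [
        {"role": "user", "content": "Email the seller"},
        {"role": "assistant", "content": "Regards, [your name]"},
    ]},
]

# One JSONL line per approved conversation, in OpenAI chat fine-tuning format.
records = [json.dumps({"messages": ex["messages"]})
           for ex in labeled if ex["verdict"] == "good"]
print(len(records))  # → 1
```

Building this pipeline now, before fine-tuning with function calling is available, is what lets them start training the moment the capability ships.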

Philosophical Takeaways

Rehat closed with several guiding principles: iterate quickly, keep humans in the loop while quality standards are being established, and invest early in testing and observability infrastructure.


Common Themes Across Both Case Studies

Both teams emphasized the importance of LangSmith for observability, debugging, and continuous improvement. They use it to trace complex chains, identify failure modes, and build datasets for fine-tuning.

Both acknowledge that fine-tuning is essential for generalization—prompt engineering and few-shot learning can only take you so far. However, they also recognize the value of hybrid approaches: using larger models offline to correct smaller models, and combining human annotation with LLM-assisted labeling.

Cost and latency are persistent concerns. Docugami hosts models locally on Kubernetes with Nvidia Triton to control costs and protect customer data. Rehat is constrained to GPT-3.5 because GPT-4’s latency is unacceptable for their iterative function-calling architecture.

Finally, both teams stress the importance of domain expertise. Docugami leverages decades of document engineering knowledge to build hierarchical understanding. Rehat uses deep understanding of real estate workflows to design Lucy’s capabilities. Generic LLM approaches are insufficient; domain-specific structure and supervision are essential for production-quality applications.
