Company: Microsoft
Title: Building a Production-Ready Business Analytics Assistant with ChatGPT
Industry: Tech
Year: 2023
Summary (short): A detailed case study on automating data analytics using ChatGPT, where the challenge of LLMs' limitations in quantitative reasoning is addressed through a multi-agent system. The solution implements two specialized ChatGPT agents, a data engineer and a data scientist, working together to analyze structured business data. The system uses the ReAct framework for reasoning, SQL for data retrieval, and Streamlit for deployment, demonstrating how to operationalize LLMs effectively for complex business analytics tasks.
## Overview

This case study from Microsoft's Data Science team, authored by James Nguyen, presents a reference implementation and methodology for using ChatGPT as an automated business analytics assistant. The core innovation addresses a fundamental limitation of LLMs: while they excel at natural language tasks, they struggle with quantitative reasoning and structured data analysis. Rather than relying on the LLM to perform calculations directly, the solution treats ChatGPT as the "brain" that orchestrates analytical workflows using external tools, much like a human analyst would use software to perform the actual number crunching.

The approach is notable for being presented as an open-source reference implementation (available on GitHub) rather than a proprietary production system, which means the claims are more about technical feasibility than proven business results. This is an important distinction for evaluating the case study fairly.

## Problem Statement

The fundamental challenge addressed is the gap between LLM capabilities and business needs for structured data analytics. While ChatGPT and similar models have proven valuable for unstructured text tasks (summarization, information extraction, augmented data generation), they have historically been unreliable for quantitative analysis. The author explicitly acknowledges that LLMs "are less reliable for quantitative reasoning tasks, including documented instances in which ChatGPT and other LLMs have made errors in dealing with numbers."

However, structured data remains critical for business decision-making, creating a need for systems that can bridge natural language interaction with accurate data analysis. The goal is to enable users without advanced technical skills to ask complex analytical questions of business data and receive accurate answers with rich visualizations.

## Technical Architecture

### Multi-Agent Design

The solution employs a two-agent architecture that separates concerns between data acquisition and data analysis:

- **Data Engineer Agent**: Responsible for data acquisition from source systems (SQL databases). This agent receives instructions from the Data Scientist agent, identifies relevant tables, retrieves schemas, and formulates SQL queries to extract the necessary data. The separation ensures that the complexity of database interaction is isolated from analytical reasoning.
- **Data Scientist Agent**: The primary agent that interfaces with users, plans analytical approaches, requests data from the Data Engineer agent, performs analysis using Python and data science libraries, and produces visualizations. This agent orchestrates the overall workflow and is responsible for the final output.

This separation follows good software engineering principles of modularity and makes each agent's task more tractable for the LLM. The author notes that "dividing a potentially complex task into multiple sub-tasks that are easier for ChatGPT to work on" is a key design consideration.

### ReAct Framework Implementation

Both agents implement the ReAct (Reasoning and Acting) framework, which combines chain-of-thought prompting with action execution. This represents a significant advancement over simple prompt engineering because it allows agents to:

- Plan execution at both high and detailed levels
- Break complex problems into addressable steps
- Execute actions using tools
- Observe results from those actions
- Re-evaluate and adjust the planned approach based on observations
- Iterate through multiple Thought/Action/Observation cycles until reaching a final answer

The iterative nature of ReAct is crucial for handling non-trivial problems where intermediate analysis steps may lead to "unanticipated yet advanced outcomes" and where observations may change the original plan. This adaptive capability is essential for real-world analytics scenarios where the optimal approach isn't always clear from the initial question.
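The case study describes this cycle conceptually rather than publishing the agent code, but the Thought/Action/Observation loop can be sketched roughly as follows. The `llm_complete` callable, the `tool_name[input]` action syntax, and the step budget are illustrative assumptions, not the reference implementation's actual interfaces; the real tools it names include `execute_sql()`, `get_table_schema()`, and `display()`.

```python
import re
from typing import Callable, Dict

# Hypothetical tool registry: each tool takes a string argument and returns a
# string observation the model can read on its next turn.
Tools = Dict[str, Callable[[str], str]]


def react_loop(question: str, llm_complete: Callable[[str], str],
               tools: Tools, max_steps: int = 8) -> str:
    """Drive a Thought/Action/Observation cycle until a final answer appears."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        # Ask the model for its next Thought (and, usually, an Action).
        step = llm_complete(transcript + "Thought:")
        transcript += "Thought:" + step + "\n"

        if "Final Answer:" in step:
            # The model decided it has enough observations to answer.
            return step.split("Final Answer:", 1)[1].strip()

        # Expect an action of the form: Action: tool_name[tool input]
        match = re.search(r"Action:\s*(\w+)\[(.*)\]", step, re.DOTALL)
        if match is None:
            transcript += "Observation: no parseable action; please try again.\n"
            continue

        tool_name, tool_input = match.group(1), match.group(2)
        tool = tools.get(tool_name, lambda _: f"Unknown tool: {tool_name}")
        observation = tool(tool_input)

        # Feed the observation back so the model can re-plan on the next turn.
        transcript += f"Observation: {observation}\n"

    return "No final answer produced within the step budget."
```

The property the case study relies on is the feedback edge: each observation is appended to the prompt, which is what allows the agent to revise its plan mid-analysis rather than committing to the first approach it generates.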
### Prompt Engineering

The prompts are carefully structured with several key components:

- **Role definition**: Clearly establishing the agent's identity (data scientist or data engineer)
- **Task instruction**: Describing the overall objective and workflow
- **Tools and their use**: Documenting available utility functions and how to invoke them
- **Few-shot examples**: Providing concrete examples of the expected Thought/Action/Observation pattern

The author emphasizes that few-shot examples are "needed to help ChatGPT understand details that are difficult to convey with only instructions." The examples include specific Python code patterns, with "unnecessary specifics omitted for brevity and generalizability."

The prompts also establish inter-agent communication patterns, making each agent aware of the other's existence and how to collaborate. For example, the Data Scientist agent knows to issue requests like `request_to_data_engineer` to acquire necessary data.

### Tool Design

A key LLMOps principle demonstrated is the abstraction of complex operations behind simple tool interfaces. The agents are provided with utility functions that hide implementation complexity:

- **display()**: A unified function for rendering different output types (Plotly visualizations, Matplotlib figures, text, dataframes) to the end user
- **load()** and **persist()**: Functions for managing data exchange between agents using Streamlit's session state
- **execute_sql()**: A function that handles database connection complexity for both SQLite and SQL Server, including connection string management, query execution, and basic data type inference
- **get_table_names()** and **get_table_schema()**: Functions for dynamic schema discovery

This tool abstraction is essential for reliable LLM operation. The author explicitly recommends "wrapping multiple complex APIs into a simple API before exposing to ChatGPT" because "complex APIs and interaction flow may confuse ChatGPT."

## Production Considerations and Best Practices

The case study provides substantial guidance on production deployment challenges, which demonstrates mature thinking about LLMOps:

### Context Management and Token Limits

A significant practical challenge is that production data sources often have "numerous data objects and tables with complex schema" that would exceed ChatGPT's token limits if passed entirely as context. The solution implements dynamic context building in three stages: first identifying relevant tables, then retrieving only those schemas, and finally building queries based on the retrieved schema. This pattern is essential for scaling to real-world enterprise data environments.
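The write-up does not reproduce the repository's helpers, so the following is a minimal sketch of that three-stage pattern, assuming a local SQLite database (the described `execute_sql()` also supports SQL Server). The function names mirror the tools listed above, but the bodies, the `business.db` path, and the straight-line flow are illustrative; in the real system the Data Engineer agent invokes each stage across separate ReAct turns, with the outputs injected back into its context.

```python
import sqlite3
from typing import List

DB_PATH = "business.db"  # illustrative database location


def get_table_names(conn: sqlite3.Connection) -> List[str]:
    """Stage 1: expose only table names, keeping the initial context small."""
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    return [name for (name,) in rows.fetchall()]


def get_table_schema(conn: sqlite3.Connection, table: str) -> str:
    """Stage 2: fetch schemas only for the tables the agent judged relevant."""
    cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return ", ".join(f"{col[1]} {col[2]}" for col in cols)


def execute_sql(conn: sqlite3.Connection, query: str) -> List[tuple]:
    """Stage 3: run the query the agent wrote against the retrieved schema."""
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(DB_PATH)
    tables = get_table_names(conn)           # stage 1 output goes into the prompt
    schemas = {t: get_table_schema(conn, t)   # stage 2, only a pruned set of tables
               for t in tables[:3]}
    print(tables, schemas)                    # the agent would now draft its SQL
```

Keeping the first two stages as separate tool calls is what lets schema context be loaded on demand instead of being pasted wholesale into the prompt.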
### Custom Domain Knowledge

The author acknowledges that each business domain has "proprietary names, rules, and concepts that are not part of the public knowledge that ChatGPT was trained on." These must be incorporated as additional context, potentially using the same dynamic loading mechanisms as schema information. This is a crucial consideration for enterprise deployments where domain-specific terminology is common.

### Handling Complex Analytical Scenarios

For scenarios requiring complex analytical logic (revenue forecasting, causal analysis), the recommendation is to build "specialized prompt template[s] just for your scenario." While this limits generality, specialized templates can provide the detailed instructions needed for reliable execution of domain-specific logic.

### Output Consistency and Validation

A critical production concern is that "ChatGPT has a certain level of randomness in its output format, even with clear instruction." The recommended mitigation is implementing "validation and retry flow" mechanisms. This acknowledges that LLM outputs require programmatic verification before downstream processing.

### Reliability and Hallucination

The author explicitly states that "randomness and hallucination may impact the ability of ChatGPT — or any LLM — to deliver accuracy and reliability." Several mitigations are suggested:

- Training users to ask clear and specific questions
- Designing applications with "multiple validation mechanisms at intermediate processing steps"
- Making intermediate outputs available for user validation (e.g., displaying generated SQL queries and code)

This transparency approach, showing the work rather than just the final answer, is essential for building trust and enabling human oversight in production analytics systems.

### Deployment Architecture

For production deployment, the author recommends separating concerns: "deploy the agents as restful back-end APIs using a framework such as Flask and deploy the UI layer using Streamlit as a front-end application for scalability and maintainability." This separation allows independent scaling of the agent backend and user interface layers.

## Application Platform

Streamlit is chosen as the application platform for its:

- Support for Python code execution and library integration
- Web-based data visualization capabilities
- Interactive widgets for end-user interaction
- Stateful session memory management (critical for maintaining context between agent interactions)

While alternatives like Dash and Bokeh are mentioned, Streamlit's simplicity and stateful memory support make it well-suited for this prototype. The session state mechanism is particularly important as it enables data persistence and exchange between the Data Engineer and Data Scientist agents during a user session.
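The case study names `load()` and `persist()` but does not show them; a minimal sketch of how such a hand-off could sit on top of Streamlit's session state follows. The helper bodies, the `revenue_by_region` key, and the sample dataframe are illustrative assumptions rather than the reference implementation's code.

```python
import pandas as pd
import streamlit as st


def persist(key: str, value) -> None:
    """Store an intermediate result in the per-user session so it survives
    Streamlit reruns within the same conversation."""
    st.session_state[key] = value


def load(key: str):
    """Retrieve a previously persisted object, or None if nothing was stored."""
    return st.session_state.get(key)


# Illustrative hand-off: the Data Engineer agent persists its query result...
persist("revenue_by_region", pd.DataFrame({"region": ["EMEA", "APAC"],
                                            "revenue": [1.2e6, 0.9e6]}))

# ...and the Data Scientist agent later loads it to analyze and visualize.
df = load("revenue_by_region")
st.dataframe(df)  # a unified display() helper would route other output types too
```

Because session state is scoped to a single user session, this also gives each conversation its own working memory without requiring an external store.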
## Critical Assessment

While this case study presents a thoughtful architecture for LLM-powered analytics, several limitations should be noted:

- This is a reference implementation rather than a battle-tested production system, so claims about effectiveness are theoretical
- No quantitative metrics on accuracy, latency, or cost are provided
- The complexity of maintaining and updating prompts as business requirements change is not addressed
- Security considerations for SQL query generation (injection risks, access control) are not discussed
- The approach requires significant prompt engineering expertise to adapt to new domains

Nevertheless, the architectural patterns and best practices documented provide valuable guidance for teams considering similar LLM-powered analytics applications.
