ZenML

LLM-Powered Data Classification System for Enterprise-Scale Metadata Generation

Grab 2023

Grab developed an automated data classification system using LLMs to replace manual tagging of sensitive data across its petabyte-scale data infrastructure. The team built an orchestration service called Gemini that integrates GPT-3.5 to classify database columns and generate metadata tags, significantly reducing manual effort in data governance. The system processed over 20,000 data entities within a month of deployment, with 80% user satisfaction and minimal need for tag corrections.

Industry

Tech

Overview

Grab, Southeast Asia’s leading superapp platform providing ride-hailing, delivery, and financial services across 428 cities in eight countries, faced a significant challenge in managing and classifying their petabyte-level data. The company needed to understand the sensitivity of countless data entities—including database tables and Kafka message schemas—to both protect user, driver, and merchant-partner data and enable efficient data discovery for analysts and scientists.

This case study documents how Grab transitioned from manual, campaign-based data classification to an LLM-powered automated system, demonstrating a practical production deployment of large language models for enterprise data governance at scale.

The Problem

Grab’s initial approach to protecting sensitive data relied on manual processes where data producers tagged schemas with sensitivity tiers (Tier 1 being most sensitive, Tier 4 indicating no sensitive information). This approach led to over-classification: half of all schemas were marked as Tier 1, enforcing the strictest access controls even when only a single highly sensitive table existed within an otherwise non-sensitive schema.

Shifting to table-level access controls was not feasible due to the lack of granular classification. Manual classification campaigns at the table level were impractical for two key reasons: the explosive growth in data volume, velocity, and variety made manual efforts unsustainable, and inconsistent interpretation of data classification policies across app developers led to unreliable results.

The team initially built an orchestration service called Gemini (named before Google’s model of the same name) that used a third-party classification service with regex classifiers. However, this approach had limitations: the third-party tool’s ML classifiers couldn’t be customized, regex patterns produced too many false positives, and building in-house classifiers would require a dedicated data science team with significant time investment for understanding governance rules and preparing labeled training data.

LLM Integration Solution

The advent of ChatGPT and the broader LLM ecosystem presented a solution to these pain points. The team identified that LLMs provide a natural language interface that allows data governance personnel to express requirements through text prompts, enabling customization without code or model training.

Architecture and Orchestration

The production system architecture consists of three main components working together:

The orchestration layer handles several critical LLMOps concerns. Request aggregation is achieved through message queues at fixed intervals to create reasonable mini-batches. A rate limiter is attached at the workflow level to prevent throttling from cloud provider APIs.
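
The fixed-interval aggregation described above can be sketched roughly as follows. This is a minimal illustration, not Grab's implementation: the class name `MiniBatcher`, the interval, and the batch-size cap are all hypothetical, and a real service would read from a message queue rather than an in-memory list.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MiniBatcher:
    """Collects classification requests and releases them at fixed intervals.

    Hypothetical helper: interval and batch size are illustrative parameters.
    """
    interval_seconds: float
    max_batch_size: int
    _pending: List[str] = field(default_factory=list, init=False)
    _last_flush: float = field(default=0.0, init=False)

    def add(self, request: str) -> None:
        """Queue one request (for example, a table identifier)."""
        self._pending.append(request)

    def flush_if_due(self, now: float) -> Optional[List[str]]:
        """Return one mini-batch if the interval has elapsed, else None."""
        if not self._pending or now - self._last_flush < self.interval_seconds:
            return None
        batch = self._pending[: self.max_batch_size]
        self._pending = self._pending[self.max_batch_size:]
        self._last_flush = now
        return batch
```

Releasing at most one batch per interval also acts as a coarse rate limit, since the downstream API is called no more than once per interval regardless of how quickly requests arrive.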

Two specific LLM-related limits required careful management: the context length (4000 tokens for GPT-3.5 at development time, approximately 3000 words) and the overall token limit (240K tokens per minute shared across all Azure OpenAI model deployments under one account). These constraints directly influenced the batch sizing and request scheduling strategies.
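
To show how these two limits could translate into batch sizing, here is a rough sketch. The two constants come from the figures above; everything else is an assumption, including the function names, the crude word-based token estimate (derived from the 4000 tokens to 3000 words ratio), and the idea of reserving part of the context for the prompt and the response.

```python
import math
from typing import List

CONTEXT_LIMIT_TOKENS = 4000   # from the text: GPT-3.5 at development time
TOKENS_PER_MINUTE = 240_000   # from the text: quota shared across the account


def estimate_tokens(text: str) -> int:
    """Crude estimate: roughly 4 tokens per 3 words (assumption, not a tokenizer)."""
    return math.ceil(len(text.split()) * 4 / 3)


def pack_into_requests(items: List[str], overhead_tokens: int,
                       output_reserve_tokens: int) -> List[List[str]]:
    """Greedily group items so each request stays within the context limit."""
    budget = CONTEXT_LIMIT_TOKENS - overhead_tokens - output_reserve_tokens
    if budget <= 0:
        raise ValueError("prompt overhead and output reserve exceed the context limit")
    batches: List[List[str]] = []
    current: List[str] = []
    used = 0
    for item in items:
        cost = estimate_tokens(item)
        if cost > budget:
            raise ValueError(f"single item exceeds the per-request budget: {item!r}")
        if current and used + cost > budget:
            batches.append(current)
            current, used = [], 0
        current.append(item)
        used += cost
    if current:
        batches.append(current)
    return batches


def max_full_context_requests_per_minute() -> int:
    """Upper bound on requests per minute if each one used the full context."""
    return TOKENS_PER_MINUTE // CONTEXT_LIMIT_TOKENS
```

Under these numbers, at most 60 full-context requests fit into one minute, and in practice fewer, because the quota is shared with other model deployments under the same account.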

Prompt Engineering Approach

The classification task is defined as: given a data entity with a defined schema, tag each field with metadata classifications following an internal governance scheme. Tags include categories like <Personal.ID> for government-issued identification numbers, <Personal.Name> for names and usernames, <Personal.Contact_Info> for contact information, and <Geo.Geohash> for geographic data.

The team developed and refined their prompts using several key techniques:

The prompt design also includes explicit negative instructions to prevent common misclassifications. For example, the <Personal.ID> tag definition explicitly states it “should absolutely not be assigned to columns named ‘id’, ‘merchant id’, ‘passenger id’, ‘driver id’ or similar since these are not government-provided identification numbers.”
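
A prompt of this kind might be assembled along the following lines. The tag names and the <Personal.ID> exclusion are taken from the text; the remaining wording, the `build_prompt` helper, and the JSON response shape are illustrative assumptions rather than Grab's actual prompt.

```python
from typing import Dict, List

# Tag names and the <Personal.ID> exclusion come from the case study;
# the other descriptions and the JSON shape below are assumptions.
TAG_DEFINITIONS: Dict[str, str] = {
    "<Personal.ID>": (
        "Government-issued identification numbers. Should absolutely not be "
        "assigned to columns named 'id', 'merchant id', 'passenger id', "
        "'driver id' or similar since these are not government-provided "
        "identification numbers."
    ),
    "<Personal.Name>": "Names and usernames.",
    "<Personal.Contact_Info>": "Contact information.",
    "<Geo.Geohash>": "Geographic data.",
}


def build_prompt(table_name: str, columns: List[str]) -> str:
    """Build a classification prompt for one table (hypothetical format)."""
    tag_lines = "\n".join(f"- {tag}: {desc}" for tag, desc in TAG_DEFINITIONS.items())
    column_lines = "\n".join(f"- {name}" for name in columns)
    return (
        "You are classifying database columns for data governance.\n"
        "Assign zero or more of the following tags to each column.\n\n"
        f"Tags:\n{tag_lines}\n\n"
        f"Table: {table_name}\nColumns:\n{column_lines}\n\n"
        "Respond with JSON only, in the form "
        '{"columns": [{"name": "<column name>", "tags": ["<tag>", ...]}]}'
    )
```

Keeping the tag definitions in a plain dictionary reflects the point made earlier: governance staff can adjust a definition or add an exclusion by editing text, without retraining a model.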

Output Processing and Verification

Since LLM outputs are typically free text, the system requires structured JSON responses for downstream processing. The prompt specifies the exact JSON format expected, and the system processes these structured predictions for publication.
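
Because a model can still return malformed or unexpected output, a defensive parsing step is a common safeguard. The sketch below assumes a hypothetical response shape of `{"columns": [{"name": ..., "tags": [...]}]}`; the function name and validation rules are illustrative and not taken from Grab's system.

```python
import json
from typing import Dict, List, Set


def parse_predictions(raw: str, expected_columns: Set[str],
                      allowed_tags: Set[str]) -> Dict[str, List[str]]:
    """Extract and validate column tags from a model response.

    Raises ValueError if no usable JSON is found or a column is missing.
    """
    # Tolerate stray text around the JSON object.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    try:
        payload = json.loads(raw[start:end + 1])
    except json.JSONDecodeError as exc:
        raise ValueError("model output is not valid JSON") from exc

    result: Dict[str, List[str]] = {}
    for entry in payload.get("columns", []):
        if not isinstance(entry, dict):
            continue
        name = entry.get("name")
        if name not in expected_columns:
            continue  # ignore columns that are not part of the schema
        tags = entry.get("tags") or []
        # Keep only tags that exist in the governance scheme.
        result[name] = [t for t in tags if t in allowed_tags]

    missing = expected_columns - result.keys()
    if missing:
        raise ValueError(f"model output is missing columns: {sorted(missing)}")
    return result
```

A failed parse can then be retried or routed to manual review instead of being published downstream.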

Predictions are published to a Kafka queue for downstream data platforms to consume. A human verification workflow notifies data owners weekly to review classified tags. This verification serves dual purposes: improving model correctness and enabling iterative prompt improvement based on user corrections. The team plans to remove mandatory verification once accuracy reaches acceptable thresholds.

Production Results and Metrics

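The system performed well in production: it processed over 20,000 data entities within a month of deployment, reported 80% user satisfaction, and required few tag corrections, with users changing fewer than one tag on average for acknowledged tables.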

The classified tags enable downstream applications including determining sensitivity tiers for data entities, enforcing Attribute-based Access Control (ABAC) policies, and implementing Dynamic Data Masking for downstream queries.
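
As an illustration of the first application, a sensitivity tier for an entity could be derived from its column tags by taking the most restrictive tier among them. The tag-to-tier mapping below is entirely hypothetical; only the tier scale (Tier 1 most sensitive, Tier 4 no sensitive information) comes from the case study.

```python
from typing import Dict, List

# Hypothetical mapping for illustration only; not Grab's governance policy.
TAG_TIER: Dict[str, int] = {
    "<Personal.ID>": 1,
    "<Personal.Contact_Info>": 2,
    "<Personal.Name>": 2,
    "<Geo.Geohash>": 3,
}
MOST_SENSITIVE_TIER = 1
NO_SENSITIVE_DATA_TIER = 4  # from the text: Tier 4 means no sensitive information


def entity_tier(column_tags: Dict[str, List[str]]) -> int:
    """Return the most restrictive (lowest-numbered) tier across all columns.

    Unknown tags are treated conservatively as Tier 1 (a design assumption).
    """
    tiers = [
        TAG_TIER.get(tag, MOST_SENSITIVE_TIER)
        for tags in column_tags.values()
        for tag in tags
    ]
    return min(tiers, default=NO_SENSITIVE_DATA_TIER)
```

With column-level tags, a table containing only a geohash column would land in Tier 3 under this mapping instead of inheriting Tier 1 from an unrelated table in the same schema, which is the over-classification problem described earlier.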

Future Development and Lessons

The team identified several areas for ongoing improvement:

The project was validated through Grab’s participation in Singapore’s Privacy Enhancing Technology Sandbox run by the Infocomm Media Development Authority, which concluded in March 2024. This regulatory sandbox demonstrated how LLMs can efficiently perform data classification while safeguarding sensitive information.

Critical Assessment

While the results are impressive, several aspects warrant balanced consideration. The 80% user satisfaction metric, while positive, means 20% of users found the process less helpful, and the context of the survey (during initial rollout) may influence responses. The accuracy claim that users change “less than one tag” on average for acknowledged tables is promising but leaves questions about edge cases and the distribution of corrections.

The concurrent operation of the third-party tool and GPT-3.5 suggests the team is still evaluating the LLM approach against traditional methods, indicating the solution may not yet be fully proven for all use cases. The cost efficiency claims are relative to current load and may change with scaling.

Nevertheless, this case study represents a well-documented, practical application of LLMs in production for an enterprise data governance use case, with clear architectural decisions, prompt engineering strategies, and measurable business outcomes.
