ZenML

Production-Ready Question Generation System Using Fine-Tuned T5 Models

Digits 2023
Digits implemented a production system for generating contextual questions for accountants using fine-tuned T5 models. The system helps accountants interact with clients by automatically generating relevant questions about transactions. The team addressed key challenges such as hallucination and privacy through multiple validation checks, in-house fine-tuning, and comprehensive evaluation metrics. The solution was successfully deployed using TensorFlow Extended on Google Cloud Vertex AI, with careful attention to training-serving skew and model performance monitoring.

Industry

Finance

Overview

Digits is an accounting automation company that provides AI-powered bookkeeping and financial management tools for small businesses and accounting firms. This case study, published in March 2023, details how they implemented generative machine learning to assist accountants in their day-to-day client communications. The specific use case focuses on automatically generating contextual questions about financial transactions that accountants can send to their clients for clarification.

The core problem being addressed is the repetitive nature of accountant-client interactions around transaction categorization and verification. Accountants frequently need to ask clients about ambiguous transactions, and manually typing these questions for every transaction creates significant time overhead. Digits aimed to reduce this tedium by generating suggested questions that accountants can either send with a single click or edit before sending.

It’s worth noting that this article comes from Digits’ own engineering blog and serves a dual purpose of technical education and marketing. While the technical details appear genuine and substantive, readers should be aware that the narrative naturally emphasizes the positive aspects of their implementation.

Technical Architecture

Base Model Selection and Fine-Tuning Approach

Digits uses models from the T5 (Text-to-Text Transfer Transformer) family as its base models. The T5 architecture, pre-trained by Google Brain, follows the encoder-decoder transformer pattern that has become foundational for generative text tasks. Training from scratch would require massive computational resources (the article notes that OpenAI’s GPT-3 3B model required 50 petaflop/s-days of compute), so Digits instead fine-tunes these pre-trained models for its domain-specific accounting use case.

The fine-tuning approach allows them to maintain full control over the training data used for domain adaptation while leveraging the linguistic capabilities learned during pre-training. The team acknowledges a key limitation here: they don’t have visibility into the original pre-training data used by large model providers, which introduces potential implicit biases.

Training Data Structure

The training data is structured around two key inputs:

This persona-based approach is particularly interesting from a product perspective, as it allows accountants to maintain authentic communication styles with different clients while still benefiting from automation.

Data Preprocessing Pipeline

Digits uses TensorFlow Transform for data preprocessing, which runs on Google Cloud Dataflow for scalability. A key architectural decision highlighted in the case study is the export of the preprocessing graph alongside the model. This is a best practice in MLOps that helps avoid training-serving skew—a common problem where the data processing applied during training differs from what’s applied during inference.
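The idea of exporting preprocessing together with the model can be illustrated with a minimal, framework-free sketch. This is not Digits' code and does not use TensorFlow Transform; the names `preprocess`, `ExportedModel`, and `train` are hypothetical, and a toy vocabulary lookup stands in for the real model. The point is only that training and serving call the same preprocessing function, so the two cannot drift apart.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


def preprocess(text: str) -> List[str]:
    """Toy preprocessing (assumption): lowercase and split on whitespace."""
    return text.lower().split()


@dataclass
class ExportedModel:
    """Hypothetical export that bundles preprocessing with the learned state."""
    preprocess_fn: Callable[[str], List[str]]
    vocab: Dict[str, int]

    def serve(self, raw_text: str) -> List[int]:
        # Serving accepts raw text and applies the same function used in training.
        tokens = self.preprocess_fn(raw_text)
        return [self.vocab.get(tok, 0) for tok in tokens]  # 0 marks an unknown token


def train(examples: List[str]) -> ExportedModel:
    # Build the vocabulary from text that went through preprocess().
    vocab: Dict[str, int] = {}
    for text in examples:
        for tok in preprocess(text):
            vocab.setdefault(tok, len(vocab) + 1)
    return ExportedModel(preprocess_fn=preprocess, vocab=vocab)


model = train(["Chick-fil-A Lunch", "Office Supplies"])
ids = model.serve("OFFICE supplies")  # casing differs from training data
```

Because lowercasing lives inside the exported object, a request with different casing still maps to the identifiers learned during training. If serving reimplemented the lowercasing step separately and got it wrong, the same request would silently map to unknown tokens, which is the skew the case study describes.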

The preprocessing code shown in the article demonstrates:

By incorporating tokenization directly into the exported model using TensorFlow Text, they achieve a cleaner deployment architecture where the model accepts raw text inputs rather than requiring a separate tokenization service.

Training Infrastructure

Model training is orchestrated through TensorFlow Extended (TFX) running on Google Cloud’s Vertex AI platform. This setup provides:

The article also mentions converting HuggingFace T5 models to TensorFlow ops. This is notable because it enables deployment on TensorFlow Serving without requiring a Python layer, a decision that likely improves inference performance and simplifies deployment.

Model Serving Architecture

The serving signature shown in the code demonstrates how the trained model is packaged for production use. The model includes:

This all-in-one approach simplifies the inference pipeline and reduces the risk of inconsistencies between training and serving environments.

Safety and Quality Measures

Hallucination Concerns

The article is refreshingly candid about the hallucination problem in generative models. It gives a vivid example in which the model got stuck generating “fil-a” repeatedly when processing a Chick-fil-A transaction and never produced a stop token. This failure mode is characteristic of autoregressive text generation, where each token is conditioned on the tokens already generated, so an early error can compound.
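A repetition failure like the “fil-a” loop can be caught with a simple post-generation check. The sketch below is an assumption about how such a check might look, not the detection logic Digits describes; the function names and the thresholds `max_ngram` and `max_repeats` are hypothetical.

```python
from typing import List


def longest_consecutive_repeat(tokens: List[str], n: int) -> int:
    """Return the largest number of back-to-back repeats of any n-gram."""
    best = 0
    for start in range(len(tokens) - n + 1):
        gram = tokens[start:start + n]
        count = 1
        pos = start + n
        # Count how many times the same n-gram follows itself immediately.
        while tokens[pos:pos + n] == gram:
            count += 1
            pos += n
        best = max(best, count)
    return best


def looks_degenerate(text: str, max_ngram: int = 3, max_repeats: int = 3) -> bool:
    """Flag text in which a short n-gram repeats more than max_repeats times in a row."""
    tokens = text.lower().split()
    return any(
        longest_consecutive_repeat(tokens, n) > max_repeats
        for n in range(1, max_ngram + 1)
    )


stuck = looks_degenerate("chick fil-a fil-a fil-a fil-a fil-a")
normal = looks_degenerate("What was this Chick-fil-A purchase for?")
```

A check like this only covers one failure mode. A cap on output length would be a natural companion, since a model that never emits a stop token also tends to produce overly long text.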

Multi-Layer Safety System

Digits implements at least three layers of protection before generated content reaches end users:

This layered approach reflects a mature understanding that generative AI outputs cannot be trusted blindly, especially in professional contexts where reputation matters.

Evaluation Framework

Custom TFX Evaluation Component

Digits developed a custom TFX component for model evaluation that runs as part of every training pipeline. This component:

Basing the deployment decision on quantitative metrics rather than human judgment is an interesting approach that can help ensure consistency and reduce bias in release decisions.
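A metric-based release gate of this kind can be sketched in a few lines. This is an illustrative assumption, not the custom TFX component from the article: the function `should_deploy`, the `tolerance` parameter, and the metric names are hypothetical, and every metric is assumed to be higher-is-better.

```python
from typing import Dict


def should_deploy(
    candidate: Dict[str, float],
    baseline: Dict[str, float],
    tolerance: float = 0.0,
) -> bool:
    """Approve the candidate only if no baseline metric regresses beyond tolerance."""
    for name, baseline_value in baseline.items():
        # A metric missing from the candidate counts as a regression.
        candidate_value = candidate.get(name, float("-inf"))
        if candidate_value < baseline_value - tolerance:
            return False
    return True


# Hypothetical metric values for the production model and two candidates.
production = {"semantic_similarity": 0.82, "wording_diversity": 0.40}
better = {"semantic_similarity": 0.85, "wording_diversity": 0.41}
worse = {"semantic_similarity": 0.70, "wording_diversity": 0.55}

deploy_better = should_deploy(better, production)
deploy_worse = should_deploy(worse, production)
```

Requiring every metric to hold, rather than averaging them, prevents a large gain on one metric from masking a regression on another.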

Evaluation Metrics

The evaluation framework uses a thoughtfully designed set of complementary metrics:

The tension between Levenshtein distance and semantic similarity is particularly clever—they want models that express the same meaning in diverse ways, not models that simply memorize training examples.
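The two metrics named above can be written out to make the tension concrete. The sketch below implements the standard Levenshtein edit distance and cosine similarity in plain Python. How Digits computes semantic similarity is not specified here, so the embedding vectors are made-up placeholders; in practice they would come from a sentence encoder.

```python
import math
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


reference = "What was this purchase for?"
generated = "Could you tell me the purpose of this expense?"
edit_distance = levenshtein(reference, generated)

# Placeholder embeddings (assumption): real ones would come from a text encoder.
reference_vec = [0.9, 0.1, 0.3]
generated_vec = [0.8, 0.2, 0.35]
similarity = cosine_similarity(reference_vec, generated_vec)
```

A desirable output scores high on both: a large edit distance shows the wording differs from the reference, while a cosine similarity close to 1 shows the meaning is preserved. A model that copies the reference verbatim would score perfectly on similarity but have an edit distance of zero.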

Evaluation Dataset

They maintain a curated evaluation dataset with human-written reference questions for each transaction type. This allows for consistent comparison across model versions, though the article doesn’t specify the size or diversity of this evaluation set.

Privacy Considerations

The article emphasizes that Digits fine-tunes models in-house and never shares customer data without consent. This is an important consideration for financial services applications where transaction data is highly sensitive. By performing fine-tuning internally rather than using external APIs, they maintain tighter control over data handling.

Limitations and Considerations

While the article presents a well-engineered system, there are some aspects worth considering:

Conclusion

This case study demonstrates a practical, production-focused approach to deploying generative AI in a domain-specific business context. The emphasis on safety measures, evaluation rigor, and infrastructure best practices reflects lessons learned from deploying ML systems at scale. The use of established tools (TFX, TensorFlow Serving, Vertex AI) rather than custom solutions suggests a pragmatic engineering culture focused on reliability over novelty.
