Digits implemented a production system for generating contextual questions for accountants using fine-tuned T5 models. The system helps accountants interact with clients by automatically generating relevant questions about transactions. They addressed key challenges like hallucination and privacy through multiple validation checks, in-house fine-tuning, and comprehensive evaluation metrics. The solution successfully deployed using TensorFlow Extended on Google Cloud Vertex AI with careful attention to training-serving skew and model performance monitoring.
Digits is an accounting automation company that provides AI-powered bookkeeping and financial management tools for small businesses and accounting firms. This case study, published in March 2023, details how they implemented generative machine learning to assist accountants in their day-to-day client communications. The specific use case focuses on automatically generating contextual questions about financial transactions that accountants can send to their clients for clarification.
The core problem being addressed is the repetitive nature of accountant-client interactions around transaction categorization and verification. Accountants frequently need to ask clients about ambiguous transactions, and manually typing these questions for every transaction creates significant time overhead. Digits aimed to reduce this tedium by generating suggested questions that accountants can either send with a single click or edit before sending.
It’s worth noting that this article comes from Digits’ own engineering blog and serves a dual purpose of technical education and marketing. While the technical details appear genuine and substantive, readers should be aware that the narrative naturally emphasizes the positive aspects of their implementation.
Digits uses models from the T5 (Text-to-Text Transfer Transformer) family as their base model. The T5 architecture, pre-trained by Google Brain, follows the encoder-decoder transformer pattern that has become foundational for generative text tasks. Rather than training from scratch—which would require massive computational resources (the article references that OpenAI’s GPT-3 3B model required 50 petaflop/s-days of compute)—Digits fine-tunes these pre-trained models for their domain-specific accounting use case.
The fine-tuning approach allows them to maintain full control over the training data used for domain adaptation while leveraging the linguistic capabilities learned during pre-training. The team acknowledges a key limitation here: they don’t have visibility into the original pre-training data used by large model providers, which introduces potential implicit biases.
The training data is structured around two key inputs:
This persona-based approach is particularly interesting from a product perspective, as it allows accountants to maintain authentic communication styles with different clients while still benefiting from automation.
Digits uses TensorFlow Transform for data preprocessing, which runs on Google Cloud Dataflow for scalability. A key architectural decision highlighted in the case study is the export of the preprocessing graph alongside the model. This is a best practice in MLOps that helps avoid training-serving skew—a common problem where the data processing applied during training differs from what’s applied during inference.
The preprocessing code shown in the article demonstrates:
By incorporating tokenization directly into the exported model using TensorFlow Text, they achieve a cleaner deployment architecture where the model accepts raw text inputs rather than requiring a separate tokenization service.
Model training is orchestrated through TensorFlow Extended (TFX) running on Google Cloud’s Vertex AI platform. This setup provides:
While the article mentions converting HuggingFace T5 models to TensorFlow ops, this is notable because it enables deployment on TensorFlow Serving without requiring a Python layer—a decision that likely improves inference performance and simplifies deployment.
The serving signature shown in the code demonstrates how the trained model is packaged for production use. The model includes:
This all-in-one approach simplifies the inference pipeline and reduces the risk of inconsistencies between training and serving environments.
The article is refreshingly candid about the hallucination problem in generative models. They provide a vivid example where the model got stuck generating “fil-a” repeatedly when processing a Chick-fil-A transaction, failing to produce a stop token. This kind of failure mode is characteristic of autoregressive text generation where token-by-token generation can compound errors.
Digits implements at least three layers of protection before generated content reaches end users:
Toxicity Screening: Every generated message is automatically screened for toxicity, insults, and obscenity. Any detection triggers immediate discarding of the text and alerts to the ML team for investigation.
Hallucination Pattern Matching: Generated suggestions are validated against known patterns of hallucinations. This catches outputs that may not be toxic but are confusing or out of context.
Human Review: A human accountant always reviews and confirms any suggested question before it is sent to a client. This human-in-the-loop approach is essential for high-stakes business communications.
This layered approach reflects a mature understanding that generative AI outputs cannot be trusted blindly, especially in professional contexts where reputation matters.
Digits developed a custom TFX component for model evaluation that runs as part of every training pipeline. This component:
The removal of humans from the deployment decision process (based on quantitative metrics) is an interesting approach that can help ensure consistency and reduce bias in release decisions.
The evaluation framework uses a thoughtfully designed set of complementary metrics:
Levenshtein Distance: Measures character-level differences between expected and generated questions. Interestingly, they want this to be high—indicating linguistic diversity rather than rote memorization of training examples.
Semantic Similarity: Measures whether the generated question captures the same meaning as the human-curated reference, even if expressed differently. They want this to be high.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation): A standard metric for text generation that measures overlap between generated and reference texts.
Must-Have and Optional Token Scoring: A custom metric that rewards the model for including contextually relevant terms (e.g., mentioning “travel” when processing an airline transaction).
The tension between Levenshtein distance and semantic similarity is particularly clever—they want models that express the same meaning in diverse ways, not models that simply memorize training examples.
They maintain a curated evaluation dataset with human-written reference questions for each transaction type. This allows for consistent comparison across model versions, though the article doesn’t specify the size or diversity of this evaluation set.
The article emphasizes that Digits fine-tunes models in-house and never shares customer data without consent. This is an important consideration for financial services applications where transaction data is highly sensitive. By performing fine-tuning internally rather than using external APIs, they maintain tighter control over data handling.
While the article presents a well-engineered system, there are some aspects worth considering:
This case study demonstrates a practical, production-focused approach to deploying generative AI in a domain-specific business context. The emphasis on safety measures, evaluation rigor, and infrastructure best practices reflects lessons learned from deploying ML systems at scale. The use of established tools (TFX, TensorFlow Serving, Vertex AI) rather than custom solutions suggests a pragmatic engineering culture focused on reliability over novelty.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
Climate tech startups are leveraging Amazon SageMaker HyperPod to build specialized foundation models that address critical environmental challenges including weather prediction, sustainable material discovery, ecosystem monitoring, and geological modeling. Companies like Orbital Materials and Hum.AI are training custom models from scratch on massive environmental datasets, achieving significant breakthroughs such as tenfold performance improvements in carbon capture materials and the ability to see underwater from satellite imagery. These startups are moving beyond traditional LLM fine-tuning to create domain-specific models with billions of parameters that process multimodal environmental data including satellite imagery, sensor networks, and atmospheric measurements at scale.
Predibase, a fine-tuning and model serving platform, announced its acquisition by Rubrik, a data security and governance company, with the goal of combining Predibase's generative AI capabilities with Rubrik's secure data infrastructure. The integration aims to address the critical challenge that over 50% of AI pilots never reach production due to issues with security, model quality, latency, and cost. By combining Predibase's post-training and inference capabilities with Rubrik's data security posture management, the merged platform seeks to provide an end-to-end solution that enables enterprises to deploy generative AI applications securely and efficiently at scale.