## Overview
Numbers Station is a company focused on bringing foundation model technology into the modern data stack to accelerate time to insight for organizations. This case study, presented as a lightning talk in collaboration with the Stanford AI Lab, explores how large language models and other foundation models can be applied to automate data engineering tasks that traditionally require significant manual effort. The presentation offers a balanced view of both the capabilities and the operational challenges of deploying these models in production data environments.
The modern data stack encompasses a set of tools used to process, store, and analyze data—from data origination in apps like Salesforce or HubSpot, through extraction and loading into data warehouses like Snowflake, transformation with tools like dbt, and visualization with Tableau or Power BI. Despite the maturity of these tools, there remains substantial manual work throughout the data pipeline, which Numbers Station aims to address with foundation model automation.
## Foundation Models and Emergent Capabilities
The talk begins with a recap of foundation models—very large neural networks trained on massive amounts of unlabeled data using self-supervised learning techniques. The key innovation highlighted is the emergence of in-context learning capabilities at scale, where models can generalize to downstream tasks through carefully crafted prompts without requiring task-specific fine-tuning. This represents a paradigm shift from traditional AI, enabling rapid prototyping of AI applications by users who may not be AI experts themselves.
The underlying architecture is an auto-regressive language model trained to predict the next word in a sequence. By casting any task as a generation task through prompt crafting, the same model can be reused across many different applications, and this flexibility is central to Numbers Station's approach to data engineering automation.
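As a minimal illustration of this reuse, the sketch below casts two unrelated tasks as generation with the same model. The `call_llm` helper is a hypothetical placeholder for whichever hosted or open-source model an organization actually uses; it is not an API from the talk.

```python
# Hypothetical placeholder for an LLM call; swap in a hosted API or local model.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model or provider here")

def classify_sentiment(review: str) -> str:
    # Classification cast as generation: the model completes the label.
    prompt = f"Review: {review}\nSentiment (positive or negative):"
    return call_llm(prompt)

def summarize(document: str) -> str:
    # Summarization cast as generation with a different prompt, same model.
    prompt = f"Summarize in one sentence:\n{document}\nSummary:"
    return call_llm(prompt)
```

The same pattern extends directly to the data tasks discussed next.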
## Key Applications in Production
### SQL Code Generation
One of the primary applications discussed is generating SQL queries from natural language requests. In traditional enterprise settings, business users must submit requests to data engineering teams for ad-hoc queries, which involves multiple iterations and significant delays. Foundation models can reduce this back-and-forth by directly translating natural language questions into SQL.
However, the presentation is careful to note that while this works well for simple queries, there are significant caveats for complex queries that require domain-specific knowledge. For instance, when a table has multiple date columns, the model may not inherently know which one to use for a particular business question without additional context. This points to an important production consideration: foundation models alone may not be sufficient for enterprise-grade SQL generation without supplementary domain knowledge integration.
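A hedged sketch of what NL-to-SQL prompting with injected context might look like is shown below. The table schema, column comments, and business hint are invented for illustration, and the sketch reuses the `call_llm` placeholder from the earlier example.

```python
# Illustrative schema with two date columns -- the ambiguity the talk warns about.
SCHEMA = """
Table orders(
  order_id    INT,
  customer_id INT,
  created_at  DATE,    -- when the order row was inserted
  shipped_at  DATE,    -- when the order left the warehouse
  amount      NUMERIC
)
"""

# Business context that a generic model cannot know on its own.
CONTEXT = "For revenue questions, use shipped_at as the business date."

def nl_to_sql(question: str) -> str:
    prompt = f"{SCHEMA}\n{CONTEXT}\nWrite a SQL query that answers: {question}\nSQL:"
    return call_llm(prompt)  # call_llm: placeholder defined in the earlier sketch
```

Without the `CONTEXT` line, the model has no principled way to choose between `created_at` and `shipped_at`, which is exactly the caveat raised in the talk.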
### Data Cleaning
Data cleaning—fixing typos, correcting missing values, and standardizing formats—is traditionally handled through extensive SQL rule development. This process is time-consuming and fragile, as rules often break when encountering edge cases not anticipated during development.
Foundation models offer an alternative approach using in-context learning. By creating a prompt with a few examples of correct transformations, the model can generalize these patterns across entire datasets. The model derives patterns automatically from the provided examples, potentially eliminating much of the manual rule-crafting process.
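A minimal sketch of this few-shot pattern, with made-up state-name values and the same `call_llm` placeholder, might look like:

```python
# A handful of worked examples; the model infers the normalization pattern from them.
EXAMPLES = [
    ("calfornia", "California"),
    ("new yrok", "New York"),
    ("TX", "Texas"),
]

def clean_value(dirty: str) -> str:
    shots = "\n".join(f"Input: {d}\nOutput: {c}" for d, c in EXAMPLES)
    prompt = f"Standardize US state names.\n{shots}\nInput: {dirty}\nOutput:"
    return call_llm(prompt)  # call_llm: placeholder from the earlier sketch

cleaned = [clean_value(v) for v in ["flordia", "WA", "Ohio"]]
```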
The presentation acknowledges scalability issues with this approach when applied to large datasets, which are addressed in the production challenges section below.
### Data Linkage
Data linkage involves finding connections between different data sources that lack common identifiers—for example, linking customer records between Salesforce and HubSpot when there's no shared ID for joins. Traditional approaches require engineers to develop complex matching rules, which can be brittle in production.
With foundation models, the approach involves feeding both records to the model and asking in natural language whether they represent the same entity. The presentation notes that the best production solution often combines rules with foundation model inference: use rules for the 80% of cases that are straightforward, then call the foundation model for complex edge cases that would otherwise require extensive rule engineering.
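A sketch of this hybrid pattern is below; the record fields, rules, and yes/no parsing are illustrative rather than the product's actual matching logic.

```python
from typing import Optional

def rule_match(a: dict, b: dict) -> Optional[bool]:
    """Cheap deterministic rules; return None when the pair is ambiguous."""
    if a.get("email") and a.get("email") == b.get("email"):
        return True
    if a["name"].lower() == b["name"].lower() and a["zip"] == b["zip"]:
        return True
    if a["name"].lower() != b["name"].lower() and a["zip"] != b["zip"]:
        return False
    return None  # unclear -> escalate to the model

def llm_match(a: dict, b: dict) -> bool:
    prompt = (f"Record A: {a}\nRecord B: {b}\n"
              "Do these records refer to the same customer? Answer yes or no:")
    return call_llm(prompt).strip().lower().startswith("yes")

def is_same_entity(a: dict, b: dict) -> bool:
    verdict = rule_match(a, b)          # handles the straightforward majority
    return llm_match(a, b) if verdict is None else verdict
```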
## Production Challenges and Solutions
### Scalability
Foundation models are extremely large and can be expensive and slow to run at scale. The presentation distinguishes between two usage patterns with different scalability requirements:
- **Human-in-the-loop applications** (like SQL co-pilot): Here, latency matters more than throughput, and the scale challenge is less severe.
- **Batch data processing** (like cleaning or linking millions of rows): Running foundation models over entire databases is prohibitively expensive and slow compared to rule-based solutions.
The primary solution discussed is **model distillation**, where a large foundation model is used for prototyping, then its knowledge is transferred to a smaller model through fine-tuning. This distilled model can achieve comparable performance with significantly reduced computational requirements. The presentation claims this approach can effectively "bridge the gap" between large and small models with good prototyping and fine-tuning practices.
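The pattern can be sketched as follows. The "teacher" is stubbed out, and the "student" is a deliberately tiny scikit-learn classifier chosen only to make the flow concrete; in practice the student would typically be a smaller language model fine-tuned on the teacher's outputs.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def teacher_label(pair: str) -> str:
    # Stand-in for an expensive foundation model call, run once over unlabeled data.
    return "match" if "acme" in pair.lower() else "no_match"

record_pairs = [
    "ACME Corp, 94107 | Acme Corporation, 94107",
    "Globex Inc, 10001 | Initech LLC, 73301",
    "acme corp | ACME Corporation",
    "Umbrella Co | Stark Industries",
]
labels = [teacher_label(p) for p in record_pairs]   # teacher labels the training set

# Cheap student model used for high-volume production inference.
student = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
student.fit(record_pairs, labels)
print(student.predict(["ACME corp, SF | Acme Corporation, SF"]))
```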
Another strategy is to use foundation models selectively—only invoking the model when truly necessary. For tasks simple enough to be solved with rules, the model can be used to automatically derive those rules from data rather than making predictions directly. This approach is described as "always better than handcrafting rules" while avoiding the computational overhead of model inference at scale.
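As a toy illustration of rule derivation (the phone-number formats and regex below are invented, and `call_llm` is the placeholder from the earlier sketches):

```python
import re

samples = ["(415) 555-0132", "415.555.0198", "415-555-0177"]
prompt = ("Samples:\n" + "\n".join(samples) +
          "\nWrite one Python regex with three groups (area code, prefix, line) "
          "that matches all of these. Return only the regex.")

# pattern = call_llm(prompt)  -- one model call; an illustrative output is hardcoded
# below so the rest of the sketch runs, and it should be validated before use.
pattern = r"\(?(\d{3})\)?[ .\-]?(\d{3})[ .\-]?(\d{4})"

column = samples * 1000                      # stands in for a large table column
normalized = [re.sub(pattern, r"\1-\2-\3", v) for v in column]  # no model calls at scale
```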
### Prompt Brittleness
A significant operational challenge is the sensitivity of foundation models to prompt formatting. The same logical prompt expressed differently can yield different predictions, which is problematic for data applications where users expect deterministic outputs. The presentation cites an example in which manually selected demonstrations versus randomly selected demonstrations produce a "huge performance gap."
To address this, Numbers Station developed techniques published in academic venues (referenced as the "AMA" paper, presumably the "Ask Me Anything" prompting work). The core idea is to apply multiple prompts to the same input and aggregate the predictions into a final result. This ensemble approach reduces variance and improves reliability compared to single-prompt methods.
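A simplified version of this idea is sketched below using a plain majority vote; the published approach aggregates predictions with weak-supervision-style modeling rather than simple voting, so treat this only as the shape of the technique.

```python
from collections import Counter

# Several phrasings of the same question; each is a complete prompt template.
PROMPTS = [
    "Record A: {a}\nRecord B: {b}\nSame customer? Answer yes or no:",
    "Are these two entries about the same customer?\nA: {a}\nB: {b}\nyes or no:",
    "A: {a}\nB: {b}\nDo A and B refer to one customer? yes or no:",
]

def ensemble_predict(a: str, b: str) -> str:
    answers = [call_llm(p.format(a=a, b=b)).strip().lower() for p in PROMPTS]
    return Counter(answers).most_common(1)[0][0]   # aggregate across prompts
```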
Additional techniques mentioned include:
- Decomposing prompts using chains (similar to chain-of-thought or decomposition strategies)
- Smart sampling of demonstrations for in-context examples
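A hedged sketch of demonstration selection (nearest labeled examples by character-level TF-IDF similarity; embedding-based retrieval is an equally common choice) could look like:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Labeled pool to draw in-context demonstrations from (values are made up).
POOL = [
    ("calfornia", "California"),
    ("new yrok", "New York"),
    ("TX", "Texas"),
    ("wash.", "Washington"),
]

def select_demos(query: str, k: int = 2):
    inputs = [d for d, _ in POOL]
    vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3)).fit(inputs + [query])
    sims = cosine_similarity(vec.transform([query]), vec.transform(inputs))[0]
    top = sims.argsort()[::-1][:k]               # indices of the most similar inputs
    return [POOL[i] for i in top]

demos = select_demos("flordia")  # feed these into the few-shot cleaning prompt above
```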
### Domain-Specific Knowledge
Foundation models are trained on public data and lack organizational knowledge critical for enterprise tasks. The example given is generating a query for "active customers" when no explicit `is_active` column exists—the model needs to understand the organization's definition of customer activity.
Two solution categories are presented:
**Training-time solutions**: Continual pre-training of open-source models on organizational documents, logs, and metadata. This makes models "aware of domain knowledge" by baking internal information into the model weights during training.
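A minimal continual pre-training sketch, assuming Hugging Face transformers/datasets, a small open-source causal LM as a stand-in, and a hypothetical `internal_docs/` folder of exported text:

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "gpt2"  # stand-in for whichever open-source model is actually used
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Organizational text: documentation, query logs, table and column descriptions, etc.
dataset = load_dataset("text", data_files={"train": "internal_docs/*.txt"})
tokenized = dataset["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="domain-ckpt", num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # causal LM loss
)
trainer.train()
```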
**Inference-time solutions**: Augmenting the foundation model with external memory accessed through:
- Knowledge graphs
- Semantic layers
- Search indices over internal documents
This is essentially a retrieval-augmented generation (RAG) approach, where relevant context is retrieved and provided to the model at inference time to supplement its base knowledge.
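A small sketch of this inference-time pattern, using a TF-IDF index over a few invented internal definitions as a stand-in for a production search index or semantic layer:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented internal knowledge snippets; in production these would come from
# documentation, a semantic layer, or a knowledge graph.
DOCS = [
    "An active customer is any customer with at least one order in the last 90 days.",
    "Revenue is recognized on shipped_at, not created_at.",
    "Churned customers are flagged in the churn_events table.",
]
vectorizer = TfidfVectorizer().fit(DOCS)
doc_matrix = vectorizer.transform(DOCS)

def answer_with_context(question: str) -> str:
    sims = cosine_similarity(vectorizer.transform([question]), doc_matrix)[0]
    context = DOCS[sims.argmax()]                   # top-1 retrieval for the sketch
    prompt = f"Context: {context}\nQuestion: {question}\nSQL:"
    return call_llm(prompt)                         # placeholder from earlier sketches
```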
## LLMOps Considerations
This case study provides several important lessons for LLMOps practitioners:
The hybrid approach of combining rules with model inference is particularly noteworthy. Rather than treating foundation models as a complete replacement for traditional systems, the optimal production architecture often involves using models selectively—either to handle edge cases that rules cannot address or to generate rules automatically. This reduces both computational costs and the risk of model errors propagating through the data pipeline.
The emphasis on prompt engineering techniques like demonstration selection and multi-prompt aggregation highlights that production LLM systems require careful attention to input design, not just model selection. The brittleness of prompts means that seemingly minor formatting changes can significantly impact output quality.
The distillation approach offers a practical path from prototype to production. Large models can be used for initial development and to generate training data, while smaller distilled models handle production inference workloads. This addresses both cost and latency concerns that would otherwise make foundation model deployment impractical for high-volume data applications.
The domain knowledge integration strategies—whether through continual pre-training or RAG—are essential for enterprise deployments where generic models lack necessary business context. The choice between training-time and inference-time solutions likely depends on how dynamic the organizational knowledge is and the resources available for model customization.
It's worth noting that while the presentation highlights significant potential, the specific quantitative results and production deployments are not detailed. The techniques discussed appear to be primarily research contributions from the collaboration with Stanford AI Lab, though Numbers Station is described as building products that incorporate this technology. Organizations considering similar approaches should validate performance claims in their specific contexts.
## Collaboration and Research Foundation
The work presented is done in collaboration with the Stanford AI Lab, suggesting a research-oriented approach to these production challenges. Multiple papers are referenced (though not cited by name in the transcript), indicating that the techniques have undergone academic peer review. This collaboration between industry application and academic research is a valuable model for developing robust LLMOps practices.