ANNA, a UK business banking provider, implemented LLMs to automate transaction categorization for tax and accounting purposes across diverse business types. Their approach combines traditional ML with LLMs, with the LLMs handling context-aware categorization that accounts for business-specific nuances. Through strategic optimizations, including offline predictions, better use of the context window, and prompt caching, they reduced their LLM costs by 75% while maintaining high accuracy in their AI accountant system.
ANNA is a UK-based business banking provider that offers a chat-based banking application primarily serving sole traders, small businesses, freelancers, and startups. What sets them apart is their integrated AI accountant capability that can handle various tax-related tasks including corporation tax, VAT, and self-assessment filings. Their implementation of LLMs in production focuses on three main areas: a customer-facing chatbot, an AI accountant system, and financial crime prevention through account summary generation.
The case study primarily focuses on their AI accountant system, particularly the transaction categorization component, which represented about two-thirds of their LLM-related costs. This system demonstrates a sophisticated approach to combining traditional machine learning with LLMs in a production environment.
### Technical Implementation and Architecture
The system uses a hybrid approach combining traditional ML models (XGBoost) with LLMs. The key architectural decisions include:
* Using simpler ML models and rule-based systems for straightforward transactions
* Leveraging LLMs for complex contextual categorization where business context matters
* Processing transactions in batches after the financial year ends, rather than categorizing in real time
* Structuring the prompt in three parts (sketched after this list):
- System prompt (category rulebook), roughly 50,000 tokens
- Company context, 500-1,000 tokens
- Transaction batch, up to 8,000 tokens
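A minimal sketch of how those three components might be assembled into a single request, using placeholder text and a hypothetical helper (the real rulebook, context fields, and batch schema are not published):

```python
import json

RULEBOOK = "<category rulebook text>"   # static system prompt, roughly 50,000 tokens

def build_prompt(company_context: str, transactions: list[dict]) -> tuple[str, str]:
    """Return (system_prompt, user_message) for one categorization request."""
    system_prompt = RULEBOOK             # identical for every company, which later makes it cacheable
    user_message = (
        f"Company context:\n{company_context}\n\n"      # 500-1,000 tokens of business specifics
        "Categorise each transaction for tax purposes and reply as a JSON list:\n"
        + json.dumps(transactions, ensure_ascii=False)  # transaction batch, up to ~8,000 tokens
    )
    return system_prompt, user_message
```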
They initially used Anthropic's Claude 3.5 model through Vertex AI, chosen after extensive evaluation against other models, including OpenAI's offerings. The choice was driven in part by Claude's more transparent, customer-friendly explanations.
### Cost Optimization Strategies
The team implemented several cost optimization strategies, which come together in the code sketch that follows this list:
1. **Offline Predictions**
* Achieved a 50% cost reduction by accepting a delay of up to 24 hours in processing
* Maintained prediction quality while significantly reducing costs
* Worked within the 29-day time-to-live (TTL) on stored predictions
2. **Context Window Optimization**
* Leveraged Claude 3.7's improved output token limit (16x increase)
* Increased batch sizes from 15 to 145 transactions per request
* Reduced total API calls needed for processing
* Added 22 percentage points of cost savings
3. **Prompt Caching**
* Implemented caching for static components (especially the 50,000-token rulebook)
* Achieved an additional 3% cost reduction
* Balanced caching benefits with provider-specific implementation details
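Taken together, these strategies map naturally onto a single offline batch request. The sketch below is illustrative only: it calls Anthropic's Message Batches API and prompt-caching syntax directly, whereas ANNA ran Claude through Vertex AI, and the model name, chunk size, and data shapes are assumptions rather than published details.

```python
import json
import anthropic

client = anthropic.Anthropic()  # assumes an ANTHROPIC_API_KEY; ANNA actually accessed Claude via Vertex AI

RULEBOOK = "<category rulebook text>"  # the ~50,000-token static rulebook, worth caching

def batch_request(custom_id: str, company_context: str, transactions: list[dict]) -> dict:
    """Build one request for an offline batch, covering roughly 120 transactions."""
    return {
        "custom_id": custom_id,
        "params": {
            "model": "claude-3-7-sonnet-latest",         # larger output limit enables bigger batches
            "max_tokens": 16_000,
            "system": [{
                "type": "text",
                "text": RULEBOOK,
                "cache_control": {"type": "ephemeral"},  # prompt caching for the static rulebook
            }],
            "messages": [{
                "role": "user",
                "content": (
                    f"Company context:\n{company_context}\n\n"
                    "Categorise each transaction for tax purposes and reply as a JSON list:\n"
                    + json.dumps(transactions)
                ),
            }],
        },
    }

# Offline predictions: submit after the financial year ends, collect results within ~24 hours.
transactions = [
    {"id": "tx-001", "date": "2025-01-14", "amount": -42.50, "description": "GITHUB.COM"},
    {"id": "tx-002", "date": "2025-01-15", "amount": -9.99, "description": "CANVA SUBSCRIPTION"},
]
chunks = [transactions[i:i + 120] for i in range(0, len(transactions), 120)]
batch = client.messages.batches.create(
    requests=[
        batch_request(f"company-42-chunk-{n}", "<company context>", chunk)
        for n, chunk in enumerate(chunks)
    ]
)
print(batch.id, batch.processing_status)
```

Collecting results promptly after the job ends also keeps the workflow inside the 29-day TTL on stored predictions noted above.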
### Challenges and Lessons Learned
The implementation revealed several important insights:
* Larger batch sizes (>100 transactions) led to increased hallucinations and irrelevant predictions
* The team found a sweet spot of around 120 transactions per batch to balance efficiency and accuracy
* Prompt caching combined with offline predictions showed variable efficiency
* Regular manual inspection of inputs proved crucial for identifying optimization opportunities
* JSON handling errors caused unexpected token consumption (addressed in the sketch below)
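A lightweight guard against two of the issues above, hallucinated or irrelevant rows in large batches and malformed JSON, is to validate every response against the transaction IDs that were actually sent. The helper below is a hypothetical sketch rather than ANNA's code:

```python
import json

def parse_and_validate(response_text: str, sent_ids: set[str]) -> dict[str, str]:
    """Map transaction id -> predicted category, discarding anything the model invented."""
    try:
        rows = json.loads(response_text)
    except json.JSONDecodeError:
        # Malformed JSON: re-queue the whole batch rather than guess at its contents.
        raise ValueError("Batch response was not valid JSON; retry the batch")

    predictions = {}
    for row in rows:
        tx_id = row.get("id")
        if tx_id not in sent_ids or "category" not in row:
            continue                       # hallucinated, irrelevant, or incomplete row: drop it
        predictions[tx_id] = row["category"]

    missing = sent_ids - set(predictions)
    if missing:
        # Transactions the model skipped go back into the retry queue.
        print(f"{len(missing)} transactions missing from the response; retrying those")
    return predictions
```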
### Quality Assurance and Testing
The team enforced rigorous quality control through:
* Maintaining test sets annotated by accountants
* Regular evaluation of model outputs, especially customer-facing explanations
* Monitoring and handling of failed predictions through retry mechanisms
* Continuous re-evaluation of metrics after each optimization change (a minimal example follows this list)
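A minimal version of the regression check that keeps such changes honest, assuming a hypothetical format in which both predictions and accountant labels are keyed by transaction ID:

```python
def categorization_accuracy(predictions: dict[str, str], annotated: dict[str, str]) -> float:
    """Share of accountant-annotated transactions that the model categorized identically."""
    matched = sum(1 for tx_id, label in annotated.items() if predictions.get(tx_id) == label)
    return matched / len(annotated)

# Re-run after every prompt, batch-size, or model change (values here are illustrative):
accountant_labels = {"tx-001": "Software", "tx-002": "Software"}
model_predictions = {"tx-001": "Software", "tx-002": "Marketing"}
print(categorization_accuracy(model_predictions, accountant_labels))  # 0.5
```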
### Alternative Approaches Considered
The team evaluated several alternative approaches:
* RAG (Retrieval-Augmented Generation) was considered but rejected due to query generation challenges
* Self-hosted LLM solutions were evaluated but deemed not cost-effective at their current scale
* Fine-tuning was considered but rejected due to rapid model improvements making it potentially inefficient
### Results and Impact
The final implementation achieved:
* 75% total cost reduction through combined optimizations
* Maintained high accuracy in categorization
* Successfully handled complex business-specific categorization rules
* Automated approximately 80% of customer requests
This case study demonstrates how careful architecture decisions, combined with systematic optimization strategies, can dramatically reduce LLM costs while maintaining high service quality in a production environment. It also highlights the importance of continuous evaluation and adaptation as LLM capabilities and pricing models evolve.