ZenML

Multi-Model LLM Orchestration with Rate Limit Management

Bito 2023

Bito, an AI coding assistant startup, faced challenges with API rate limits while scaling their LLM-powered service. They developed a sophisticated load balancing system across multiple LLM providers (OpenAI, Anthropic, Azure) and accounts to handle rate limits and ensure high availability. Their solution includes intelligent model selection based on context size, cost, and performance requirements, while maintaining strict guardrails through prompt engineering.

Industry: Tech

Overview

Bito is an AI coding assistant startup founded by Anas (CTO) and his co-founder Amar, both experienced tech entrepreneurs with backgrounds in large-scale systems. The company pivoted from a developer collaboration tool to an AI-powered coding assistant after recognizing the potential of GPT-3.5’s reasoning capabilities. Their product operates as IDE extensions for Visual Studio Code and JetBrains, providing features like code explanation, code generation, test case generation, and AI-powered code understanding.

This case study, shared through a podcast interview with Anas, offers a transparent look at the operational challenges of running LLM-powered applications at scale, particularly around rate limiting, multi-model orchestration, prompt management, and evaluation.

The Scale Problem: Rate Limits and Multi-Model Architecture

One of the most significant operational challenges Bito faced was hitting TPM (tokens per minute) and RPM (requests per minute) limits across LLM providers. Starting with a single OpenAI account, they quickly encountered rate limiting issues as usage grew. Even after requesting limit increases from OpenAI, they found themselves at maximum allocations that still couldn’t handle their traffic.

Their solution was to build a custom multiplexer and load balancer that routes requests across multiple providers and accounts.

The load balancer makes routing decisions based on several factors, including context size, cost, performance requirements, and the remaining rate-limit headroom on each account.
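
Bito's multiplexer itself isn't public. As a minimal sketch of the rate-limit side of such a router, the code below tracks per-account TPM/RPM budgets in rolling one-minute windows and picks the first endpoint with headroom; all names (`Endpoint`, `pick_endpoint`) are illustrative assumptions, not Bito's implementation:

```python
import time
from dataclasses import dataclass, field


@dataclass
class Endpoint:
    """One provider/account pair with its own TPM/RPM budget."""
    name: str
    tpm_limit: int  # tokens per minute
    rpm_limit: int  # requests per minute
    window_start: float = field(default_factory=time.monotonic)
    tokens_used: int = 0
    requests_used: int = 0

    def _roll_window(self) -> None:
        # Reset counters once the one-minute window has elapsed.
        now = time.monotonic()
        if now - self.window_start >= 60:
            self.window_start = now
            self.tokens_used = 0
            self.requests_used = 0

    def has_capacity(self, tokens: int) -> bool:
        self._roll_window()
        return (self.tokens_used + tokens <= self.tpm_limit
                and self.requests_used + 1 <= self.rpm_limit)

    def record(self, tokens: int) -> None:
        self.tokens_used += tokens
        self.requests_used += 1


def pick_endpoint(endpoints, tokens):
    """Return the first endpoint that can absorb the request, else None."""
    for ep in endpoints:
        if ep.has_capacity(tokens):
            ep.record(tokens)
            return ep
    return None
```

A real router would also weight endpoints by cost and latency, which is where the model-selection factors above come in.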

Graceful Degradation Strategy

Bito implemented a sophisticated degradation strategy for when primary options are unavailable: if the preferred model is rate-limited, slow, or down, requests fall back to alternative models and providers rather than failing outright.

Anas emphasized that model availability is unpredictable: latency can vary from 1 second to more than 15 seconds depending on platform load, and occasional timeouts occur. Having multiple fallback options ensures users get answers even when individual providers experience issues.
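
The fallback idea can be sketched in a few lines. This is a hedged illustration, not Bito's code: `call_model` stands in for a real provider SDK call, and the chain ordering is hypothetical.

```python
def call_with_fallback(prompt, chain, call_model):
    """Try each (provider, model) pair in order; return the first success.

    `chain` is an ordered preference list such as
    [("openai", "gpt-4"), ("azure", "gpt-4"), ("anthropic", "claude-2")].
    `call_model(provider, model, prompt)` is expected to raise on
    timeouts, rate-limit errors, or provider outages.
    """
    last_err = None
    for provider, model in chain:
        try:
            return call_model(provider, model, prompt)
        except Exception as err:  # timeout, 429, outage: degrade to next option
            last_err = err
            continue
    raise RuntimeError("all fallback options exhausted") from last_err
```

In production this loop would sit behind the load balancer, so a request only degrades after every account for the preferred model is exhausted.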

Prompt Engineering Across Multiple Models

A key operational insight from Bito’s experience is that prompts are not portable across models. What works on GPT-3.5 may not work on GPT-4, and neither will work identically on Anthropic’s models. This leads to maintaining what Anas calls a “prompt repository” with model-specific variations.

In Bito's experience, each model has its own behavioral quirks, so every prompt variant has to be discovered and validated through testing rather than assumed to transfer.

This creates significant operational overhead when adding new models: each addition requires developing and testing new prompts, understanding model-specific behaviors, and maintaining separate prompt versions. Anas’s advice for teams not yet at scale: stick with one model until you absolutely must expand.
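
A "prompt repository" can be as simple as a lookup keyed by task and model. The structure below is an assumption for illustration; the template texts are invented placeholders, not Bito's prompts:

```python
# Hypothetical prompt repository: one template per (task, model) pair,
# reflecting the observation that prompts are not portable across models.
PROMPTS = {
    ("explain_code", "gpt-3.5-turbo"):
        "Explain the following code step by step:\n{code}",
    ("explain_code", "gpt-4"):
        "You are a senior engineer. Explain what this code does and why:\n{code}",
    ("explain_code", "claude-2"):
        "Please explain this code, citing only what is present in it:\n{code}",
}


def get_prompt(task, model, **kwargs):
    """Fetch the model-specific variant and fill in its placeholders."""
    template = PROMPTS.get((task, model))
    if template is None:
        raise KeyError(f"no prompt registered for {task!r} on {model!r}")
    return template.format(**kwargs)
```

Keeping the variants in one versioned structure makes the per-model testing burden explicit: adding a model means adding (and evaluating) a new row for every task.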

Testing and Evaluation Approach

Bito’s evaluation strategy relies heavily on human feedback and manually curated test cases.

Anas was candid that hallucinations remain a challenge and their system isn’t “100% there.” He views hallucinations as a “necessary evil” in some contexts: for generative use cases like API design, creative suggestions may be valuable, but for answering questions about existing code, hallucinations are problematic.

The team explored open-source evaluation tools but found gaps in the verification piece: how do you automatically verify an answer when you don’t know what’s correct? This is especially challenging with technical questions where even human reviewers may lack domain expertise (e.g., Rust questions being reviewed by Java developers).
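
One pragmatic shape for this kind of evaluation splits the work the way the interview describes: shallow automatic checks on curated cases, with everything that passes still routed to human review because automatic verification is unreliable. The harness below is a sketch under those assumptions, not Bito's tooling:

```python
def run_eval(cases, generate):
    """Run curated test cases through a model.

    Each case is a dict with a 'prompt' and a 'must_contain' list of
    keywords a correct answer should mention. Keyword checks catch
    shallow regressions; passing answers are queued for human review,
    since passing does not prove correctness.
    """
    failures, needs_review = [], []
    for case in cases:
        answer = generate(case["prompt"])
        missing = [kw for kw in case["must_contain"] if kw not in answer]
        if missing:
            failures.append({"case": case, "answer": answer, "missing": missing})
        else:
            needs_review.append({"case": case, "answer": answer})
    return failures, needs_review
```

The `needs_review` queue is the honest part of the design: keyword matching can only falsify, never verify.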

Guardrails Through Prompt Design

Rather than using external guardrail tools, Bito implements guardrails primarily through prompt engineering.

This hybrid approach, combining deterministic tools with LLM interpretation, was highlighted as a best practice for reducing hallucinations in scenarios where ground truth exists.
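
To make the hybrid idea concrete, here is a hedged sketch of a prompt-level guardrail: deterministic facts are extracted with Python's standard `ast` module (a stand-in for whatever parser or compiler a real pipeline uses), and the prompt confines the model to those facts. The wording and helper name are assumptions, not Bito's prompts:

```python
import ast


def guarded_prompt(question, code):
    """Build a prompt that pairs deterministic facts with strict instructions.

    Facts come from parsing the code, so they cannot be hallucinated;
    the instructions tell the model to stay within them.
    """
    tree = ast.parse(code)
    functions = [n.name for n in ast.walk(tree)
                 if isinstance(n, ast.FunctionDef)]
    facts = ("Functions actually defined in this file: "
             + (", ".join(functions) or "none"))
    return (
        "Answer ONLY using the code and facts below. "
        "If the answer is not present, say 'I don't know'. "
        "Do not invent functions or APIs.\n\n"
        f"FACTS:\n{facts}\n\nCODE:\n{code}\n\nQUESTION:\n{question}"
    )
```

The deterministic layer supplies ground truth; the LLM is only asked to interpret it, which is where this pattern earns its hallucination reduction.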

Local-First Vector Database for Code Understanding

Bito’s “AI that understands your code” feature indexes user codebases for retrieval-augmented generation. A key design constraint was privacy—they wanted user code to never leave the user’s machine.

Their current implementation keeps the index in files on the user’s machine rather than in a hosted vector database, so code and embeddings never leave the device.

The index files can grow to 3x the size of the original code, creating storage considerations. They’re evaluating purpose-built vector databases that can be installed locally as libraries for future scaling.

For enterprise/server-side deployments, they’re considering hosted vector databases (Pinecone, etc.) that can scale horizontally while supporting local hosting for security-conscious customers.
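
The local-first retrieval pattern can be sketched without any external service. The toy index below uses bag-of-words vectors and cosine similarity purely so the example stays self-contained and offline; a real system would substitute a code-aware embedding model and an on-disk index (which is exactly where the 3x storage growth comes from):

```python
import math
from collections import Counter


def embed(text):
    # Toy stand-in for a real embedding model: bag-of-words counts.
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class LocalIndex:
    """Everything lives in process memory or local files: code never
    leaves the machine."""

    def __init__(self):
        self.chunks = []  # list of (text, vector) pairs

    def add(self, chunk):
        self.chunks.append((chunk, embed(chunk)))

    def search(self, query, k=3):
        qv = embed(query)
        ranked = sorted(self.chunks, key=lambda c: cosine(qv, c[1]),
                        reverse=True)
        return [text for text, _ in ranked[:k]]
```

Swapping `LocalIndex` for a hosted store (Pinecone and similar) is the server-side variant mentioned above; the retrieval interface stays the same.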

Build vs. Buy: The GPU Question

Anas provided a thoughtful analysis of why Bito uses API providers rather than hosting their own models.

A rough cost estimate: a good GPU machine on AWS (4 GPU cards, 64 GB RAM) runs approximately $340/day, before considering clustering, high availability, and operational overhead.

Anas noted this calculus changes with scale and requirements. Enterprise customers with strict security requirements may demand on-premises deployment from day one, inverting the build-vs-buy decision.
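
The $340/day figure translates directly into the numbers that drive the comparison. The arithmetic below uses only that figure from the interview; the two-machine high-availability assumption is ours, added for illustration:

```python
# Back-of-envelope self-hosting cost from the $340/day figure
# (single 4-GPU AWS machine, before clustering and ops overhead).
daily_cost = 340
monthly = daily_cost * 30   # $10,200/month for one machine
yearly = daily_cost * 365   # $124,100/year for one machine

# Assumption: high availability means at least two machines.
ha_yearly = 2 * yearly      # $248,200/year before any ops headcount
print(monthly, yearly, ha_yearly)
```

Against per-token API pricing, that fixed cost only pays off at sustained high utilization, which is the crux of Anas's argument.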

Advice for Developers Using AI Coding Assistants

Drawing from extensive experience building and using AI coding tools, Anas shared practical debugging tips.

Key Takeaways for LLMOps Practitioners
