ZenML's architecture has always prioritized simplicity and ease of deployment: a FastAPI server handling pipeline orchestration, backed by a SQL database for persistence. This design works well for most use cases, but as our users began running increasingly complex pipelines—particularly those with high parallelism and rich metadata—we identified several performance bottlenecks that needed systematic addressing.
Some enterprise customers reported API response times occasionally exceeding 30 seconds during peak loads, particularly when running pipelines with complex step dependencies and extensive metadata. These timeouts were triggering HTTP client failures and, in some cases, causing parallel pipeline steps to fail during execution.
Our v0.83.0 release addresses these performance issues through systematic database query optimization and FastAPI threading improvements. This post details our step-by-step investigation process, the specific bottlenecks we discovered, and the solutions we implemented to achieve significant performance improvements.
We write this in hopes that other engineering teams facing similar scaling challenges can learn from our systematic approach to performance optimization. The techniques we used—realistic load testing, systematic instrumentation, iterative problem-solving—are broadly applicable beyond ZenML to any system dealing with database bottlenecks and concurrent request handling.
Stage 1: The "Too Simple" Problem
Our performance investigation began with what seemed like a straightforward test: run 100 parallel pipeline steps and measure the results. We crafted a simple test pipeline where each step would perform minimal operations:
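In spirit, it looked something like the sketch below (illustrative code rather than our exact test suite): each step accepts an index, does no real work, and returns it.

```python
from zenml import pipeline, step


@step
def noop_step(index: int) -> int:
    # Each step does almost nothing: no heavy artifacts, no extra metadata.
    return index


@pipeline
def simple_parallel_pipeline(num_steps: int = 100):
    # Fan out independent steps so the orchestrator can run them in parallel.
    for i in range(num_steps):
        noop_step(index=i)


if __name__ == "__main__":
    simple_parallel_pipeline()
```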
The results? Everything worked perfectly. But our users were still reporting problems.
The disconnect was stark: our synthetic tests passed while real-world usage failed. We realized our test pipeline was too simple; it didn't reflect the complexity of actual ML workflows. Production pipelines involve significantly more:
- complex dependencies between steps
- non-trivial artifacts passed between steps
- rich metadata attached to runs, steps, and artifacts
- frequent API interactions with the ZenML server

When we enhanced our test pipeline to include this kind of realistic complexity, performance issues became immediately apparent. Under load, some configurations started hitting the same timeouts and failures our users had reported, even at moderate parallelism (10+ concurrent steps), particularly when combined with rich metadata and frequent API interactions.
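The enhanced steps looked roughly like the following sketch. The specific operations, metadata keys, and client calls are assumptions modeled on the factors above (and assume recent ZenML APIs such as `log_metadata`), not the actual test code:

```python
from typing import Annotated, Dict, Tuple

import numpy as np
from zenml import get_step_context, log_metadata, step
from zenml.client import Client


@step
def realistic_step(index: int) -> Tuple[
    Annotated[np.ndarray, "features"],
    Annotated[Dict[str, int], "report"],
]:
    # Produce non-trivial artifacts instead of a bare integer.
    features = np.random.rand(1000, 50)
    report = {"index": index, "rows": int(features.shape[0])}

    # Attach rich metadata to the step, as real training steps tend to do.
    log_metadata(metadata={"rows": int(features.shape[0]), "cols": int(features.shape[1])})

    # Simulate the frequent API interactions we saw in production, e.g. steps
    # looking up their own run through the client.
    run_name = get_step_context().pipeline_run.name
    Client().get_pipeline_run(run_name)

    return features, report
```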
Stage 2: Enhanced Logging and Problem Identification
With realistic test conditions reproducing the issues, we needed better visibility into what was happening. We instrumented the server with detailed logging to capture performance metrics at both the REST API and database levels:
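As a rough illustration of this kind of instrumentation (a sketch, not ZenML's actual logging code), a FastAPI middleware can time every endpoint call while SQLAlchemy event hooks time every SQL statement:

```python
import logging
import time

from fastapi import FastAPI, Request
from sqlalchemy import event
from sqlalchemy.engine import Engine

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("perf")
app = FastAPI()


@app.middleware("http")
async def log_request_timing(request: Request, call_next):
    # REST API level: log how long each endpoint call takes end to end.
    start = time.perf_counter()
    response = await call_next(request)
    duration = time.perf_counter() - start
    logger.info("%s %s took %.3fs", request.method, request.url.path, duration)
    return response


@event.listens_for(Engine, "before_cursor_execute")
def before_cursor_execute(conn, cursor, statement, parameters, context, executemany):
    conn.info.setdefault("query_start_time", []).append(time.perf_counter())


@event.listens_for(Engine, "after_cursor_execute")
def after_cursor_execute(conn, cursor, statement, parameters, context, executemany):
    # Database level: log the duration of every SQL statement, surfacing slow ones.
    duration = time.perf_counter() - conn.info["query_start_time"].pop()
    if duration > 0.5:
        logger.warning("Slow query (%.3fs): %s", duration, statement[:200])
```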
The enhanced logging revealed the smoking gun: database queries were the primary bottleneck. We discovered that the expensive `get_run` operations were being called unnecessarily for authentication purposes, even when not explicitly requested by the client. Pipeline run fetching had become prohibitively expensive because it involved multiple SQLAlchemy queries to build complete objects with steps, artifacts, metadata, and relationships.


Stage 3: Database Query Optimization
Armed with specific data about the bottlenecks, we implemented comprehensive database optimizations.
Response Structure Refactoring
We analyzed which attributes were actually needed for different use cases and restructured our API responses:
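Conceptually, the restructuring separates a small, always-present body from optional, heavyweight sections that are only populated on request. The model and field names below are illustrative, not ZenML's actual response classes:

```python
from typing import Optional
from uuid import UUID

from pydantic import BaseModel


class PipelineRunBody(BaseModel):
    """Lightweight attributes that every caller needs."""

    id: UUID
    name: str
    status: str


class PipelineRunResources(BaseModel):
    """Heavyweight relationships, only loaded when explicitly requested."""

    steps: Optional[list] = None
    artifacts: Optional[list] = None
    metadata: Optional[dict] = None


class PipelineRunResponse(BaseModel):
    body: PipelineRunBody
    # Omitted by default; populated only when the client asks for it,
    # e.g. via a hydration flag on the request.
    resources: Optional[PipelineRunResources] = None
```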
We eliminated N+1 query patterns and implemented intelligent joins:
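The general pattern, shown here on simplified SQLModel models rather than ZenML's actual schemas, is to eager-load relationships so that a page of runs resolves in a constant number of queries instead of one extra query per run:

```python
from typing import List, Optional

from sqlalchemy.orm import selectinload
from sqlmodel import Field, Relationship, Session, SQLModel, create_engine, select


class Run(SQLModel, table=True):
    id: Optional[int] = Field(default=None, primary_key=True)
    name: str
    steps: List["Step"] = Relationship(back_populates="run")


class Step(SQLModel, table=True):
    id: Optional[int] = Field(default=None, primary_key=True)
    run_id: int = Field(foreign_key="run.id")
    run: Run = Relationship(back_populates="steps")


engine = create_engine("sqlite://")
SQLModel.metadata.create_all(engine)

with Session(engine) as session:
    # Before (conceptually): iterate over runs and touch `run.steps` inside the
    # loop, triggering one extra SELECT per run -- the classic N+1 pattern.
    # After: eager-load the relationship so all steps for this page of runs are
    # fetched together.
    statement = select(Run).options(selectinload(Run.steps))
    runs = session.exec(statement).all()
```

With `selectinload`, SQLAlchemy issues a single additional `SELECT ... WHERE run_id IN (...)` for the whole result set instead of one query per run.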
Specialized Endpoints
We created targeted endpoints for specific use cases:
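For example, a client that only needs to poll a run's status should not pay for full hydration. A hypothetical endpoint along these lines returns just the status string:

```python
from uuid import UUID

from fastapi import FastAPI

app = FastAPI()


def fetch_run_status(run_id: UUID) -> str:
    # Placeholder for a single-column SELECT against the pipeline run table.
    return "running"


@app.get("/runs/{run_id}/status")
def get_run_status(run_id: UUID) -> str:
    # Returns only the status instead of hydrating the full run with
    # steps, artifacts, metadata, and relationships.
    return fetch_run_status(run_id)
```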
Initial Results
These optimizations brought significant improvements, but we weren't done yet. We were now able to handle more complex workloads, but still hit issues at higher parallelism levels.
Stage 4: The FastAPI Threading Discovery
While our database optimizations helped, we still saw unexpected behavior under high load. To isolate the remaining issues, we created a controlled experiment: a single server pod with one FastAPI thread, making 10 concurrent `get_run` calls to fetch the same pipeline run (measured baseline: ~2.5 seconds per query).
The expected behavior would be linear scaling: each subsequent call waits for the previous one to complete, so with a ~2.5 second baseline the tenth call should finish after roughly 25 seconds. The actual results showed a different, noticeably worse pattern.
This was a revelation about FastAPI's internal behavior. When using synchronous endpoints, FastAPI executes the handler function in a worker thread, but also queues response serialization in the same threadpool:
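A simplified sketch of the pattern (endpoint and helper names are ours, not ZenML's actual route code):

```python
import time

from fastapi import FastAPI

app = FastAPI()


def load_run_from_database(run_id: str) -> dict:
    time.sleep(2.5)  # stand-in for the ~2.5s get_run query from the experiment
    return {"id": run_id, "status": "completed"}


@app.get("/runs/{run_id}")
def get_run(run_id: str) -> dict:  # a plain `def`, i.e. a synchronous endpoint
    # 1. FastAPI sees a synchronous handler and executes it in a worker
    #    thread, occupying one threadpool slot for the full ~2.5 seconds.
    run = load_run_from_database(run_id)
    # 2. Because the route is not a coroutine, the response validation and
    #    serialization for the returned object is also dispatched to the same
    #    threadpool, so each request queues a second task behind all the
    #    pending handler tasks.
    return run
```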
With limited worker threads and many queued requests, response serialization tasks accumulate behind the handler tasks, creating a bottleneck.
FastAPI Threading Fix
The solution was to convert synchronous endpoints to async endpoints that manually dispatch to the threadpool:
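A sketch of the converted endpoint, assuming the same blocking helper as in the previous example; `run_in_threadpool` comes from Starlette and is re-exported by `fastapi.concurrency`:

```python
import time

from fastapi import FastAPI
from fastapi.concurrency import run_in_threadpool

app = FastAPI()


def load_run_from_database(run_id: str) -> dict:
    time.sleep(2.5)  # same blocking database work as before
    return {"id": run_id, "status": "completed"}


@app.get("/runs/{run_id}")
async def get_run(run_id: str) -> dict:  # now an `async def` endpoint
    # The blocking database work is still offloaded to a worker thread...
    run = await run_in_threadpool(load_run_from_database, run_id)
    # ...but because the route itself is a coroutine, FastAPI serializes the
    # response on the event loop instead of queueing another threadpool task.
    return run
```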
This ensures response serialization happens on the event loop rather than competing for worker threads. After the fix, the same controlled experiment showed perfect linear scaling.
Stage 5: Comprehensive Model Optimizations
With both database queries and FastAPI threading optimized, we implemented the final round of model-level improvements. These focused on eliminating remaining inefficiencies:
Step Run Response Improvements
Artifact Producer Query Optimization
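The artifact producer optimization concerns how the server resolves which step run produced a given artifact. As a hedged illustration of the general idea (the actual ZenML queries are more involved than this toy version), a single batched lookup replaces a per-artifact query loop:

```python
from typing import Dict, List
from uuid import UUID, uuid4

QUERY_COUNT = 0  # crude stand-in for "number of SQL round trips"


def db_query(artifact_ids: List[UUID]) -> Dict[UUID, UUID]:
    # Pretend this is SELECT artifact_id, step_run_id FROM output_links
    # WHERE artifact_id IN (...).
    global QUERY_COUNT
    QUERY_COUNT += 1
    return {aid: uuid4() for aid in artifact_ids}


def producers_one_by_one(artifact_ids: List[UUID]) -> Dict[UUID, UUID]:
    # Before: one query per artifact, i.e. len(artifact_ids) round trips.
    result: Dict[UUID, UUID] = {}
    for aid in artifact_ids:
        result.update(db_query([aid]))
    return result


def producers_batched(artifact_ids: List[UUID]) -> Dict[UUID, UUID]:
    # After: all producers resolved in a single round trip.
    return db_query(artifact_ids)
```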
Stage 6: Retry Storm Prevention
Our final optimization addressed an unexpected amplification effect. When ZenML clients experience timeouts, they retry requests up to 10 times. Under heavy server load, these retries can amplify the problem:
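To make the amplification concrete, here is a back-of-the-envelope calculation using the numbers from the scenarios above (an illustration, not ZenML's client code):

```python
# Worst-case request amplification under load.
parallel_steps = 100   # steps hitting the API at roughly the same time
max_retries = 10       # client-side retry budget per failed request

worst_case_requests = parallel_steps * (1 + max_retries)
print(worst_case_requests)  # 1100 requests aimed at an already overloaded server
```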
We implemented server-side request queue monitoring to proactively reject requests that would likely timeout:
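A hypothetical load-shedding middleware sketches the idea; ZenML's actual implementation, thresholds, and status codes may differ:

```python
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()

MAX_IN_FLIGHT = 50  # hypothetical threshold; tune per deployment
_in_flight = 0


@app.middleware("http")
async def shed_load(request: Request, call_next):
    global _in_flight
    if _in_flight >= MAX_IN_FLIGHT:
        # Fail fast instead of letting the request sit in the queue until the
        # client times out and retries anyway.
        return JSONResponse(
            {"detail": "Server overloaded, please retry later"}, status_code=429
        )
    _in_flight += 1
    try:
        return await call_next(request)
    finally:
        _in_flight -= 1
```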
Performance Results
The combined optimizations produced dramatic improvements across all measured metrics:
Database Query Performance: Post-optimization, our worst-performing database operations completed in under 10 seconds, compared to previous peaks exceeding 40 seconds.

Throughput Improvements: Our performance testing framework now successfully runs 100+ parallel pipeline steps with complex metadata, compared to previous configurations that struggled with high-parallelism workloads under similar conditions. Our worst API call duration under load dropped below 20 seconds, compared to previous values exceeding 80 seconds.

Resource Efficiency: The optimizations also improved resource utilization, allowing the same workloads to run effectively with fewer server replicas. Our autoscaling configurations can now handle peak loads with reduced infrastructure requirements.
The Math: Moving from struggling at 10+ parallel steps to smoothly handling 100+ parallel steps is roughly a 10x gain in parallel capacity; combined with the roughly 20x database performance improvements, that multiplies out to an overall ~200x performance improvement for complex pipeline workloads.
Technical Insights
Iterative Problem-Solving Approach
Our step-by-step methodology proved crucial. Each stage built on the previous discoveries:
- Realistic testing exposed the problems
- Enhanced logging identified specific bottlenecks
- Database optimizations addressed the primary issues
- Controlled experiments revealed secondary bottlenecks
- Comprehensive optimizations eliminated remaining inefficiencies
- Monitoring prevented amplification effects
Framework Behavior Understanding
Understanding FastAPI's threading implementation details was crucial for optimization. Similar performance characteristics likely exist in other async frameworks, making this analysis broadly applicable.
Response Design Impact
API response structure has direct performance implications. Separating heavyweight attributes into optional sections (`resources`) dramatically reduces default response times while maintaining flexibility.
Multi-Layer Performance Issues
Database optimization, threading behavior, and client retry logic all contributed to overall performance characteristics. Addressing these issues required coordinated changes across multiple system layers.
Conclusion
The optimizations implemented in ZenML v0.83.0 address the core performance bottlenecks we identified through systematic testing and analysis. The database query improvements, FastAPI threading optimizations, and retry logic enhancements work together to provide a 200x improvement in throughput for complex, parallel pipeline workloads.
Our iterative performance testing framework has become an integral part of our development process, enabling us to proactively identify performance regressions and validate optimizations under realistic load conditions.
These improvements provide substantial headroom for larger-scale ML workloads while maintaining ZenML's ease of deployment and operation. For users running complex pipelines with high parallelism, extensive metadata, or frequent API interactions, these optimizations should significantly improve reliability and reduce timeout-related failures.
Get Started
Ready to experience the performance improvements? Upgrade to ZenML v0.83.0 today:
… alongside updating your server image.
The performance improvements are immediately available—no configuration changes required. Your existing pipelines will run faster, and you'll have the headroom to tackle much larger workloads.
Want to see the technical details? Check out our performance testing documentation and the optimization pull requests that made this possible.
The ZenML team is constantly working to make MLOps more scalable and reliable. Follow our GitHub repository for the latest updates, and join our Slack community to discuss performance optimization strategies with other ML engineers.