Company: Anthropic
Title: Infrastructure Noise in Agentic Coding Evaluations
Industry: Research & Academia
Year: 2026

Summary (short):
Anthropic discovered that infrastructure configuration alone can produce differences in agentic coding benchmark scores that exceed the typical margins between top models on leaderboards. Through systematic experiments running Terminal-Bench 2.0 across six resource configurations on Google Kubernetes Engine, they found a 6 percentage point gap between the most- and least-resourced setups. The research revealed that while moderate resource headroom (up to 3x specifications) primarily improves infrastructure stability by preventing spurious failures, more generous allocations actively help agents solve problems they couldn't solve before. These findings challenge the notion that small leaderboard differences represent pure model capability measurements and led to recommendations for specifying both guaranteed allocations and hard kill thresholds, calibrating resource bands empirically, and treating resource configuration as a first-class experimental variable in LLMOps practices.
## Overview

Anthropic's research into infrastructure noise in agentic coding evaluations is a critical examination of how production infrastructure choices affect the measurement of large language model capabilities. The company runs Terminal-Bench 2.0 evaluations on Google Kubernetes Engine clusters and discovered significant discrepancies between their scores and official leaderboard results. This case study is particularly valuable for the LLMOps community because it shows how seemingly minor infrastructure decisions can shift model capability measurements by several percentage points, often exceeding the competitive margins between top models on public leaderboards.

The investigation began when Anthropic noticed that their internal Terminal-Bench 2.0 scores didn't align with published leaderboard results and that their infrastructure error rates were surprisingly high, with around 6% of tasks failing due to pod errors unrelated to model capability. This discovery led to a systematic investigation of how resource configuration affects evaluation outcomes, ultimately revealing that infrastructure is not merely a passive substrate for model execution but an active component that can fundamentally change what a benchmark actually measures.

## The Nature of Agentic Coding Evaluations

Traditional static benchmarks score model outputs directly; the runtime environment plays no role in determining results. Agentic coding evaluations are fundamentally different: models receive full environments in which they write programs, run tests, install dependencies, and iterate over multiple turns. The runtime becomes an integral part of the problem-solving process rather than a passive container. Two agents with different resource budgets and time limits aren't actually taking the same test, even if they're evaluated on identical task definitions.

Anthropic runs these evaluations on production infrastructure using Google Kubernetes Engine clusters, which enforce resource constraints through container runtime parameters. The evaluation framework treats Terminal-Bench's per-task resource specifications as both guaranteed allocations and hard limits, meaning containers are guaranteed the specified resources but terminated immediately upon exceeding them. Container runtimes enforce resources via two separate parameters: a guaranteed allocation (resources reserved upfront) and a hard limit (the threshold at which the container is killed). When these are set to the same value, there is zero headroom for transient spikes; a momentary memory fluctuation can OOM-kill a container that would otherwise have succeeded.
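To make the two parameters concrete, the sketch below (not Anthropic's actual harness) shows how they map onto container specs using the Kubernetes Python client. The CPU and memory figures and the image name are illustrative placeholders, not Terminal-Bench values. Setting `requests` equal to `limits` reproduces the zero-headroom configuration described above, while a higher `limits` value leaves room for transient spikes.

```python
from kubernetes import client

# Illustrative per-task spec (placeholder values, not Terminal-Bench's).
TASK_CPU = "2"
TASK_MEM = "4Gi"

# Strict enforcement: guaranteed allocation == kill threshold.
# Any transient spike above 4Gi gets the container OOM-killed.
strict = client.V1ResourceRequirements(
    requests={"cpu": TASK_CPU, "memory": TASK_MEM},
    limits={"cpu": TASK_CPU, "memory": TASK_MEM},
)

# With headroom: the same guarantee, but the kill threshold sits higher,
# so momentary fluctuations no longer terminate an otherwise-healthy run.
with_headroom = client.V1ResourceRequirements(
    requests={"cpu": TASK_CPU, "memory": TASK_MEM},
    limits={"cpu": "6", "memory": "12Gi"},  # e.g. a 3x ceiling
)

task_container = client.V1Container(
    name="terminal-bench-task",
    image="example-task-image:latest",  # hypothetical image name
    resources=with_headroom,
)
```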
## Experimental Methodology

To quantify the effect of infrastructure configuration, Anthropic ran Terminal-Bench 2.0 across six resource configurations, ranging from strict enforcement of per-task specifications (1x, where the specs act as both floor and ceiling) to completely uncapped resources. All other variables remained constant: the same Claude model, the same evaluation harness, the same task set. This controlled design allowed them to isolate the effect of resource configuration on evaluation outcomes.

The experiments revealed that success rates increased monotonically with resource headroom, driven primarily by infrastructure error rates dropping at each step, from 5.8% at strict enforcement to 0.5% when uncapped. The drop between strict enforcement and 3x headroom (5.8% to 2.1%) was statistically significant at p < 0.001. The improvement came from fewer containers being killed for exceeding their allocations during normal operation.
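Comparing failure rates between two configurations is a standard two-proportion test. The sketch below shows one way such a check could be computed; the run counts are hypothetical placeholders (the write-up reports the rates but not the number of task attempts behind them), so the printed p-value only illustrates the method rather than reproducing Anthropic's analysis.

```python
import math

def two_proportion_z_test(failures_a: int, n_a: int,
                          failures_b: int, n_b: int) -> tuple[float, float]:
    """Two-sided two-proportion z-test on failure rates (normal approximation)."""
    p_a, p_b = failures_a / n_a, failures_b / n_b
    p_pool = (failures_a + failures_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value

# Hypothetical run counts -- the source gives the rates (5.8% vs 2.1%)
# but not how many task attempts each configuration comprised.
N_RUNS = 1000
z, p = two_proportion_z_test(
    failures_a=round(0.058 * N_RUNS), n_a=N_RUNS,   # strict 1x enforcement
    failures_b=round(0.021 * N_RUNS), n_b=N_RUNS,   # 3x headroom
)
print(f"z = {z:.2f}, p = {p:.2g}")
```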
## Critical Threshold Discovery

A particularly important finding emerged around the 3x resource allocation mark. From 1x through 3x, success scores fluctuated within the margins of noise (p = 0.40), and most tasks that crashed at 1x failed regardless of resource availability: the agent would explore, hit a resource wall, and get preempted, but it was never on a path to the correct solution. Starting around 3x, however, the trend changed markedly: success rates climbed faster than infrastructure errors declined. Between 3x and uncapped resources, infrastructure errors dropped an additional 1.6 percentage points while success jumped almost 4 percentage points. The extra resources enabled agents to try approaches that only work with generous allocations, such as pulling in large dependencies, spawning expensive subprocesses, and running memory-intensive test suites. At uncapped resources, the total improvement over 1x reached 6 percentage points (p < 0.01), a substantial and statistically significant effect.

## What Infrastructure Configuration Actually Measures

Anthropic's analysis revealed that resource configuration has two distinct effects depending on the allocation level. Up to roughly 3x the Terminal-Bench specifications, additional resources fix infrastructure reliability problems, specifically the transient resource spikes that cause spurious failures. The sandboxing provider used by the Terminal-Bench maintainers implicitly provides this headroom behind the scenes, making the evaluation more stable without making it easier. Above the 3x mark, however, additional resources start actively helping agents solve problems they couldn't solve before, demonstrating that resource limits change what the evaluation measures. Tight limits inadvertently reward very efficient strategies, while generous limits are more forgiving and reward agents that can better exploit all available resources. An agent that writes lean, efficient code quickly will perform well under tight constraints, while an agent that brute-forces solutions with heavyweight tools will excel under generous ones. Both are legitimate capabilities to test, but collapsing them into a single score without specifying the resource configuration makes the differences, and their real-world generalizability, difficult to interpret.

## Specific Task Examples

The case study provides concrete examples of how resource configuration affects specific tasks. On the `bn-fit-modify` task (a Terminal-Bench task requiring Bayesian network fitting), some models' first move is to install the standard Python data science stack: pandas, networkx, scikit-learn, and all their dependencies. Under generous limits, this approach works fine. Under tight constraints, the pod runs out of memory during installation, before the agent writes a single line of solution code. A leaner strategy exists (implementing the mathematics from scratch using only the standard library), and some models default to it while others don't. Different models have different default approaches, and the resource configuration determines which of those approaches happen to succeed. Tasks like `rstan-to-pystan` and `compile-compcert` showed significant success rate improvements when given memory headroom, demonstrating that resource availability can be the determining factor in whether certain solution strategies are viable. As a result, benchmark rankings may partially reflect which models happen to default to strategies compatible with the evaluator's infrastructure choices rather than purely measuring underlying capability.

## Generalization to Other Benchmarks

Anthropic replicated their core findings across different Claude model variants; the direction of the effect remained consistent while the magnitude varied. To test whether the pattern extends beyond Terminal-Bench, they ran a crossover experiment on SWE-bench, varying total available RAM up to 5x the baseline across 227 problems with 10 samples each. The same effect held, though the magnitude was smaller: scores increased monotonically with RAM but were only 1.54 percentage points higher at 5x than at 1x. Since SWE-bench tasks are less resource-intensive than Terminal-Bench tasks, a smaller effect is expected, but the result demonstrates that resource allocation isn't neutral across different agentic coding benchmarks.

## Additional Confounding Factors

Resource allocation isn't the only hidden variable affecting evaluation results. Time limits also play a role in certain configurations. In principle, every element of the evaluation setup can influence final scores: cluster health, hardware specifications, concurrency levels, and even egress bandwidth. Agentic evaluations are end-to-end system tests by construction, and any component of that system can act as a confounder. Anthropic observed anecdotally that pass rates fluctuate with the time of day, likely because API latency varies with traffic patterns and incidents. While they haven't formally quantified this effect, it illustrates that the boundary between "model capability" and "infrastructure behavior" is blurrier than a single benchmark score suggests. Model providers can shield their evaluation infrastructure from this variance by dedicating hardware, but external evaluators can't easily do the same, creating potential asymmetries in how different organizations experience benchmark results.

## Recommendations for LLMOps Practitioners

Based on their findings, Anthropic offers several concrete recommendations for running agentic coding evaluations in production. Given how container runtimes actually enforce resources (via a guaranteed allocation and a separate hard kill threshold), they recommend that evaluations specify both parameters per task rather than a single pinned value. A single exact specification sets the guaranteed allocation equal to the kill threshold, leaving zero margin for the transient memory spikes that can destabilize evaluations. Separating the two parameters gives containers enough breathing room to avoid spurious OOM kills while still enforcing a hard ceiling that prevents score inflation. The band between the guaranteed allocation and the kill threshold should be calibrated so that scores at the floor and at the ceiling fall within noise of each other. For Terminal-Bench 2.0 specifically, a 3x ceiling over the per-task specs cut infrastructure error rates by roughly two-thirds (5.8% to 2.1%, p < 0.001) while keeping the score lift modest and well within noise (p = 0.40). This represents a reasonable tradeoff: the infrastructure confounder is largely neutralized without removing meaningful resource pressure. The exact multiplier will vary by benchmark and task distribution and should be reported, but the empirical calibration principle is general.
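As an illustration of the request/kill-threshold recommendation, the following sketch derives both parameters from a per-task spec with a configurable headroom multiplier. The function name, the spec format, and the 3x default are assumptions for demonstration; the multiplier is exactly the quantity Anthropic suggests calibrating empirically per benchmark.

```python
# Hypothetical helper: turn a per-task resource spec into the two values a
# container runtime actually needs -- a guaranteed allocation (request) and a
# hard kill threshold (limit). The 3x default mirrors the calibration that
# worked for Terminal-Bench 2.0 in this study; other benchmarks may need a
# different multiplier, found by checking that scores at the floor and at the
# ceiling stay within noise of each other.

def resource_band(spec_mib: int, spec_millicpu: int, headroom: float = 3.0) -> dict:
    """Build a Kubernetes-style resources block with separate request and limit."""
    return {
        "requests": {
            "memory": f"{spec_mib}Mi",
            "cpu": f"{spec_millicpu}m",
        },
        "limits": {
            "memory": f"{int(spec_mib * headroom)}Mi",
            "cpu": f"{int(spec_millicpu * headroom)}m",
        },
    }

# Example with an illustrative task spec of 2 GiB RAM and 1 CPU:
print(resource_band(spec_mib=2048, spec_millicpu=1000))
# -> requests: 2048Mi / 1000m, limits: 6144Mi / 3000m
```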
## Production Implications

These findings have significant practical consequences for how organizations use LLM evaluations in production decision-making. Benchmark scores are increasingly used as inputs to model selection and deployment decisions, but this increased reliance hasn't always been accompanied by corresponding rigor in how benchmarks are run or reported. As things currently stand, a 2-point lead on a leaderboard might reflect a genuine capability difference, or it might reflect that one evaluation ran on more powerful hardware, or even at a luckier time of day, or some combination of these factors. Without published or standardized setup configurations, it's difficult to determine which from the outside.

For AI labs like Anthropic, the implication is that resource configuration for agentic evaluations should be treated as a first-class experimental variable, documented and controlled with the same rigor as prompt format or sampling temperature. For benchmark maintainers, publishing recommended resource specifications (as Terminal-Bench 2.0 does) helps, and specifying the enforcement methodology would close the gap Anthropic identified. For anyone consuming benchmark results, the core takeaway is that small score differences on agentic evaluations carry more uncertainty than the precision of the reported numbers suggests, especially since some confounders are simply too hard to control for.

## Practical Guidance for Interpreting Leaderboards

Until resource methodology is standardized across the industry, Anthropic's data suggests that leaderboard differences below 3 percentage points deserve skepticism until the evaluation configuration is documented and matched. The observed spread across the moderate range of resource configurations in Terminal-Bench is just below 2 percentage points. Naive binomial confidence intervals already span 1-2 percentage points, and the infrastructure confounders documented here stack on top of that rather than sitting within it. At the extremes of the allocation range, the spread reaches 6 percentage points, more than enough to completely reverse rankings between closely matched models.
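The statistical-noise point can be made concrete with a quick calculation. The sketch below uses illustrative pass rates and attempt counts, not Anthropic's actual figures: it computes the naive binomial confidence-interval half-width for a given number of task attempts, and how many independent attempts it would take before that half-width shrinks to a single percentage point. Infrastructure confounders add to this statistical floor rather than being absorbed by it.

```python
import math

def binomial_ci_halfwidth(p: float, n: int, z: float = 1.96) -> float:
    """Normal-approximation 95% CI half-width for a pass rate p over n attempts."""
    return z * math.sqrt(p * (1 - p) / n)

def attempts_for_halfwidth(p: float, halfwidth: float, z: float = 1.96) -> int:
    """Independent attempts needed before the naive CI half-width reaches the target."""
    return math.ceil(z ** 2 * p * (1 - p) / halfwidth ** 2)

# Illustrative numbers only: a pass rate near 50% maximizes the variance term.
print(binomial_ci_halfwidth(p=0.5, n=1000))           # ~0.031 -> about +/-3 points
print(attempts_for_halfwidth(p=0.5, halfwidth=0.01))  # 9604 attempts for +/-1 point
```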
## Broader LLMOps Context

This case study exemplifies several critical LLMOps concerns. First, it demonstrates how production infrastructure choices interact with model behavior in non-trivial ways: the evaluation infrastructure isn't simply executing the model's outputs but actively shaping which solutions are viable. Second, it highlights the importance of reproducibility and standardization in production LLM systems; without standardized infrastructure configurations, different deployments of the same model may exhibit substantially different capabilities. Third, the research underscores the challenge of separating model capability from system capability in agentic applications. When LLMs operate in environments where they can take actions, install dependencies, and iterate over multiple turns, their effective capability depends heavily on the resources and constraints of that environment. This has implications beyond benchmarking: production deployments of agentic systems will similarly exhibit behavior that depends on infrastructure choices, and organizations need to account for these factors when planning deployments and setting expectations.

The case study also illustrates best practices for rigorous evaluation in LLMOps contexts. Anthropic's approach of systematically varying infrastructure parameters while holding other factors constant, using statistical significance testing, and replicating findings across different benchmarks and models represents the kind of empirical rigor needed to understand production LLM behavior. Their transparency in documenting not just successes but also challenges, such as the initial mismatch with leaderboard scores, provides valuable lessons for the broader community.

Finally, this research highlights an often-overlooked aspect of LLMOps: the infrastructure and operational choices made by evaluation providers and benchmark maintainers become part of the specification of what's being measured. When organizations use these benchmarks to make deployment decisions, they're implicitly assuming that benchmark conditions will generalize to their production environment, an assumption that may not hold if infrastructure configurations differ significantly. This suggests that organizations deploying agentic LLM systems should conduct their own infrastructure sensitivity analyses, similar to Anthropic's, to understand how their specific deployment constraints will affect realized capabilities.
