Bismuth, a startup focused on software agents, developed SM-100, a comprehensive benchmark to evaluate AI agents' capabilities in software maintenance tasks, particularly bug detection and fixing. The benchmark revealed significant limitations in existing popular agents: the best existing agent found only 7 of the 100 benchmark bugs, and several exhibited false positive rates above 90%. While agents perform well on feature development benchmarks like SWE-bench, they struggle with real-world maintenance tasks that require deep system understanding, cross-file reasoning, and holistic code evaluation. Bismuth's own agent achieved better performance (10 of 100 bugs found vs. 7 for the next best), demonstrating that targeted improvements in model architecture, prompting strategies, and navigation techniques can enhance bug detection in production software maintenance scenarios.
Bismuth is a startup that has been developing software agents for over a year, founded by Ian (CEO, with a background in data engineering, ML, and search from Zillow) and Nick (CTO, with experience in software security at Google). The company presented their work at the AI Engineer World's Fair, introducing SM-100, a novel benchmark designed to evaluate how well LLM-based coding agents perform at finding and fixing bugs, a critical but underexplored aspect of the software development lifecycle.
The core thesis of their presentation is that while existing benchmarks (like HumanEval, Polyglot Benchmark, and LiveCodeBench) measure code generation capabilities effectively, they fail to capture the reality that feature development is only one part of what developers do. Maintenance tasks—bug fixes, dependency upgrades, migrations—represent a substantial portion of real-world engineering work, yet no comprehensive benchmarks existed to evaluate agent performance in this domain.
Bismuth built SM-100 by painstakingly gathering 100 bugs from 84 public repositories. These bugs have all been remediated in the wild, representing real-world issues encountered and fixed by developers across various experience levels. The benchmark was designed with several key principles in mind.
First, they focused on objective bugs only—explicit security issues or logical issues that could cause data loss or system crashes. They explicitly excluded feature requests, optimizations, style/formatting issues, and design decisions. This approach reduces ambiguity and makes the benchmark reproducible, as subjective issues like code style are still debated among human developers today.
The benchmark is multi-language, covering Python, TypeScript, JavaScript, and Go. Python and TypeScript/JavaScript were chosen because they are popular languages where LLMs supposedly perform best, while Go was included as a control representing lower-level systems engineering languages where performance might differ.
Each bug was annotated with rich metadata including severity, context (where defined and called), domain knowledge requirements, difficulty (how long even an expert would take to find it), and the implication category (data loss, crash, security exploit, etc.). This classification system allows researchers to understand what level of bugs AI agents can find regularly, rather than just whether they occasionally get lucky on complex issues.
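The annotation scheme described above can be sketched as a simple record type. The field names below are illustrative, not Bismuth's actual schema:

```python
from dataclasses import dataclass
from enum import Enum


class Implication(Enum):
    # Implication categories mentioned in the talk
    DATA_LOSS = "data loss"
    CRASH = "crash"
    SECURITY_EXPLOIT = "security exploit"


@dataclass
class BugAnnotation:
    """Illustrative metadata record for one SM-100 bug (hypothetical field names)."""
    repo: str                     # public repository the bug came from
    severity: int                 # e.g. 1 (low) to 5 (critical)
    defined_in: str               # where the buggy code is defined
    called_from: list[str]        # call sites that exercise the bug
    domain_knowledge: bool        # does finding it require domain expertise?
    expert_minutes_to_find: int   # difficulty: time even an expert would need
    implication: Implication      # what goes wrong if the bug ships


bug = BugAnnotation(
    repo="example/repo",
    severity=4,
    defined_in="forms/state.py",
    called_from=["ui/submit.py"],
    domain_knowledge=False,
    expert_minutes_to_find=10,
    implication=Implication.DATA_LOSS,
)
```

Structured metadata like this is what lets the benchmark report not just *whether* an agent found bugs, but *which kinds* it finds reliably.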
The benchmark produces four key metrics for each system evaluated, among them needle-in-haystack bug discovery (finding a bug without being told where to look) and the true positive rate of reported issues.
For the needle in haystack evaluation, Bismuth developed a clever methodology to avoid biasing the results while keeping evaluation tractable. Rather than asking agents to scan entire repositories (which would take too long), they broke repositories into subsystems containing interrelated files. They then filtered to subsystems containing files modified in the bug-fixing commit, providing the agent with a reduced but complete view of a relevant code section. This approach avoids hinting at the actual bug while still scoping the search appropriately.
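The subsystem-filtering step can be sketched in a few lines. This is a simplified illustration of the approach described above, not Bismuth's actual harness; `relevant_subsystems` is a hypothetical helper:

```python
def relevant_subsystems(subsystems, fixed_files):
    """Keep only subsystems that contain at least one file modified
    by the bug-fixing commit, without revealing which file that is."""
    fixed = set(fixed_files)
    return [s for s in subsystems if fixed & set(s)]


# Each subsystem is a group of interrelated files from the repository.
subsystems = [
    ["auth/login.py", "auth/session.py"],
    ["billing/invoice.py", "billing/tax.py"],
]
# The fix commit touched auth/session.py, so only the auth subsystem is
# handed to the agent -- the whole subsystem, not just the buggy file.
print(relevant_subsystems(subsystems, ["auth/session.py"]))
```

Because the agent receives the entire subsystem rather than the modified file alone, the search space is narrowed without hinting at the bug's location.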
The results paint a sobering picture of current agent capabilities. Bismuth’s own agent led the pack on needle in haystack detection, finding 10 of the 100 bugs, with the next best solution finding only 7. While these numbers highlight significant room for improvement across the industry, they also demonstrate an unsaturated benchmark that can drive future progress.
On true positive rates, the variance was striking. Claude Code achieved 16%, Bismuth 25%, and Codex (presumably OpenAI's) led with 45%. However, other popular agents such as Devin, Cursor Agent, and Cosine reported between 900 and 1,300 items with only 3-10% true positive rates, essentially flooding developers with false positives.
Basic agents (simple agentic loops with shell tools, search/replace, and a report mechanism) performed quite poorly. DeepSeek R1 achieved only a 1% true positive rate and Llama 4 Maverick 2%, while Sonnet 4 in a loop found 6 needles with a 3% true positive rate and O3 found 2 with a 6% rate. The authors noted that a basic implementation, giving an agent shell tools and putting it in a loop, might technically "work" and even find some bugs, but with a 97% false positive rate, such agents are not practically useful.
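The practical cost of these rates is easy to quantify. A minimal sketch (illustrative arithmetic only, using the report volumes and rates cited above):

```python
def triage_load(reported_items, true_positive_rate):
    """Estimate how many real bugs vs. false positives a developer
    must sift through, given an agent's report volume and TP rate."""
    true_positives = round(reported_items * true_positive_rate)
    false_positives = reported_items - true_positives
    return true_positives, false_positives


# An agent reporting 1,000 items at a 3% true positive rate:
tp, fp = triage_load(1000, 0.03)
print(tp, fp)  # 30 real bugs hidden among 970 false reports
```

At those volumes, the triage work of discarding false positives can easily exceed the effort saved by the bugs actually found.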
One particularly concerning finding: an agent produced 70 bug reports for a single issue. As the presenters noted, no engineer will realistically sift through 70 potential bugs to find the one that matters. This highlights that raw capability isn’t enough—agents must also demonstrate precision and signal quality.
The Bismuth team identified several key patterns in why current agents struggle:
Narrow Thinking: Even thinking models exhibit surprisingly narrow reasoning. They explore a limited number of potential avenues at a single time, missing bugs that human developers would catch immediately while confirming bugs that humans would immediately discard. This narrowness manifests even when using models specifically designed for extended reasoning.
Lack of Holistic Evaluation: On a per-run basis, the total number of bugs found remains roughly consistent, but the specific bugs change between runs. This suggests agents aren’t holistically inventorying everything in a file—they appear to have different biases causing them to look at code one way versus another in different runs.
Shallow Depth: Even when agents do focus on something, they don’t go deep enough. The combination of narrow scope and shallow depth means complex bugs requiring understanding of system architecture and cross-file relationships are frequently missed.
Gap Between Generation and Maintenance: The most striking finding is the disconnect between performance on existing benchmarks and SM-100. Despite agents scoring 60-80% on SWE-bench (a popular coding benchmark), they still struggle significantly with SM-100. The implication is clear: current agents can create software upfront but will struggle to manage and fix software after deployment.
The presentation highlighted a concrete example: a state management bug where isDirty was never set to false, preventing form clearing after submission. Only two agents (Bismuth and Codex) found this issue. While not a critical security vulnerability, this type of bug has real user experience consequences and is exactly the kind of issue a human developer would catch immediately.
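The shape of that bug is easy to reproduce. The Python analogue below is hypothetical code (the original was presumably frontend TypeScript), but it shows the missing reset and the one-line fix:

```python
class Form:
    """Minimal analogue of the state-management bug from the talk."""

    def __init__(self):
        self.fields = {}
        self.is_dirty = False  # mirrors the isDirty flag

    def edit(self, name, value):
        self.fields[name] = value
        self.is_dirty = True

    def submit(self):
        saved = dict(self.fields)  # pretend this persists the data
        self.fields.clear()
        # BUG: is_dirty is never reset here, so the UI still treats the
        # form as dirty and refuses to clear it after submission.
        # FIX: self.is_dirty = False
        return saved


form = Form()
form.edit("email", "a@b.c")
form.submit()
print(form.is_dirty)  # still True -- the flag leaks past submission
```

Spotting this requires connecting where the flag is set, where it should be cleared, and how the UI consumes it, which is exactly the cross-file reasoning most agents lacked.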
The team referenced a recent case where O3 was able to find a zero-day exploit, but importantly noted it took approximately 100 runs over the same context to achieve this. While this demonstrates possibility, it’s far from practical reliability for everyday use.
Bismuth itself runs primarily on Anthropic models (particularly through Vertex) and was able to outperform Claude Code in multiple categories while building on the same base model. This suggests that agent architecture, system prompting, information presentation, and navigation strategy matter significantly beyond just the underlying model capabilities.
The presentation noted that Baseten provided credits and compute for running the benchmark across both DeepSeek R1 and Llama 4 Maverick, indicating the significant computational resources required for comprehensive agent evaluation.
This case study carries several important implications for production LLM deployments:
The evaluation highlighted that the most frequently used agents today have a high risk of introducing bugs—a concerning finding for organizations relying on these tools for production code. However, newer agents (including Bismuth, Claude Code, and Codex) are showing improved reasoning capabilities and tighter, more focused output.
For organizations deploying LLM-based coding tools, the findings suggest that feature development assistance may be reliable enough for production use, but maintenance tasks—bug detection, code review, and remediation—require much more careful human oversight. The high false positive rates from many agents mean that treating their output as trustworthy without review would be counterproductive.
The multi-language analysis is also relevant for LLMOps: organizations should expect different performance characteristics across languages. The inclusion of Go as a “control” for systems engineering languages suggests that organizations working in less common languages should be especially cautious about agent reliability.
Finally, the benchmark itself represents valuable infrastructure for the LLMOps space. Having reproducible, objective metrics for maintenance-oriented tasks enables organizations to make informed decisions about which tools to deploy and provides a roadmap for improvement across the industry.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
This case study explores the evolution of LLM-based systems in production through discussions with Raven Kumar from Google DeepMind about building products like Notebook LM, Project Mariner, and working with the Gemini and Gemma model families. The conversation covers the rapid progression from simple function calling to complex agentic systems capable of multi-step reasoning, the critical importance of evaluation harnesses as competitive advantages, and practical considerations around context engineering, tool orchestration, and model selection. Key insights include how model improvements are causing teams to repeatedly rebuild agent architectures, the importance of shipping products quickly to learn from real users, and strategies for evaluating increasingly complex multi-modal agentic systems across different scales from edge devices to cloud-based deployments.
Cosine, a company building enterprise coding agents, faced the challenge of deploying high-performance AI systems in highly constrained environments including on-premise and air-gapped deployments where large frontier models were not viable. They developed a multi-agent architecture using specialized orchestrator and worker models, leveraging model distillation, supervised fine-tuning, preference optimization, and reinforcement fine-tuning to create smaller models that could match or exceed the performance of much larger models. The result was a 31% performance increase on the SWE-bench Freelancer benchmark, 3X latency improvement, 60% reduction in GPU footprint, and 20% fewer errors in generated code, all while operating on as few as 4 H100 GPUs and maintaining full deployment flexibility across cloud, VPC, and on-premise environments.