## Overview
This case study presents research conducted by Nicholas Arcolano, head of research at Jellyfish, examining the real-world adoption and impact of AI coding tools and autonomous agents in production software development environments. Jellyfish provides analytics and insights for software engineering leaders, and it leveraged that unique position to analyze an extensive dataset comprising 20 million pull requests from approximately 200,000 developers across roughly 1,000 companies, collected from June 2024 through early 2025.
The study addresses critical questions facing organizations undergoing AI transformation: what good adoption looks like, expected productivity gains, side effects of transformation, and what to do when AI tools don't deliver as advertised. Importantly, this is a data-driven analysis rather than anecdotal evidence or vendor claims, providing a more balanced view of actual production usage of AI-assisted coding tools.
## Data Sources and Methodology
Jellyfish's analytical approach combines multiple data sources to create a comprehensive view of AI tool usage in production environments. The platform ingests data from several key systems, including usage and interaction data from AI coding tools (specifically GitHub Copilot, Cursor, and Claude Code), interactions with autonomous coding agents (Devin and Codex), and PR review bots. This is combined with source control platform data (primarily GitHub) to understand the actual codebase changes, and with task management platforms (Linear and Jira) to understand the goals and context of the work being performed.
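To make the multi-source approach concrete, here is a minimal pandas sketch of joining AI tool usage, pull request activity, and ticket context into a single analysis table. All table and column names (`ai_usage`, `pull_requests`, `tickets`, `used_ai_tool`, and so on) are hypothetical stand-ins for illustration, not Jellyfish's actual schema.

```python
import pandas as pd

# Hypothetical, simplified event tables; real ingestion from Copilot, Cursor,
# Claude Code, GitHub, and Jira/Linear APIs is far richer than this sketch.
ai_usage = pd.DataFrame({
    "developer": ["ana", "ben", "ana"],
    "week": ["2024-06-03", "2024-06-03", "2024-06-10"],
    "used_ai_tool": [True, False, True],
})
pull_requests = pd.DataFrame({
    "developer": ["ana", "ben", "ana"],
    "week": ["2024-06-03", "2024-06-03", "2024-06-10"],
    "prs_merged": [3, 2, 4],
    "ticket_id": ["JIRA-101", "JIRA-102", "JIRA-103"],
})
tickets = pd.DataFrame({
    "ticket_id": ["JIRA-101", "JIRA-102", "JIRA-103"],
    "ticket_type": ["feature", "bug", "bug"],
})

# Join tool usage, source-control activity, and task context into one view,
# so AI impact can be analyzed against what the work was actually for.
merged = (
    pull_requests
    .merge(ai_usage, on=["developer", "week"], how="left")
    .merge(tickets, on="ticket_id", how="left")
)
print(merged)
```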
This multi-source approach is particularly relevant to LLMOps because it demonstrates how understanding AI tool effectiveness in production requires holistic observability across the entire development workflow, not just measuring model performance or individual tool metrics in isolation. The methodology essentially treats the entire software development process as a production system where AI tools are being deployed at scale.
## Adoption Patterns and Metrics
The study tracked adoption using two primary metrics. First, they looked at the percentage of code generated by AI, finding that only about 2% of companies were generating 50% or more of their code with AI in June 2024, but this grew steadily to nearly half of companies by early 2025. However, Arcolano notes this is not necessarily the most useful metric.
More significantly, they developed an "AI adoption rate" metric for developers, defined as the fraction of the time a developer uses AI tools when coding. A 100% adoption rate means a developer uses AI tools every time they code, and a company's adoption rate is the average across all of its developers. This metric proved to be the most strongly correlated with positive productivity outcomes. The median company adoption rate was around 22% in summer 2024, growing steadily to close to 90% by early 2025, a dramatic acceleration in adoption.
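As an illustration, the adoption rate can be computed along the following lines. This is a minimal sketch assuming a simplified per-session coding log; the `used_ai_tool` flag and the notion of a discrete coding session are hypothetical simplifications of whatever instrumentation Jellyfish actually uses.

```python
import pandas as pd

# Hypothetical coding-session log: one row per coding session, with a flag
# for whether an AI tool was used during that session.
sessions = pd.DataFrame({
    "developer": ["ana", "ana", "ben", "ben", "cam"],
    "used_ai_tool": [True, True, True, False, False],
})

# Per-developer adoption rate: fraction of coding sessions done with AI tools.
per_dev = sessions.groupby("developer")["used_ai_tool"].mean()

# Company adoption rate: average of the developer-level rates.
company_rate = per_dev.mean()
print(per_dev)       # ana 1.0, ben 0.5, cam 0.0
print(company_rate)  # 0.5
```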
The study reveals interesting distribution patterns, with the 25th, 50th, and 75th percentiles all showing steady upward trends. This suggests that adoption is happening broadly across different types of organizations, not just among early adopters. The fact that median adoption reached 90% is particularly striking, indicating that at the median company, developers are using AI tools in the vast majority of their coding activities.
Regarding autonomous coding agents specifically, the findings are much more sobering and represent an important reality check for the industry. Only about 44% of companies in the dataset had done anything with autonomous agents at all in the three months preceding the study. Moreover, the vast majority of this usage was characterized as trialing and experimentation rather than full-scale production deployment. Ultimately, work done by autonomous agents amounted to less than 2% of the millions of PRs merged during the timeframe studied. This is a critical LLMOps insight: while there is significant hype around fully autonomous agents, the actual production deployment at scale remains in very early stages. The interactive AI coding tools (copilots and assistants) are seeing real production adoption, while autonomous agents remain largely experimental.
## Productivity Impacts
The study examined productivity through multiple lenses, starting with PR throughput (pull requests merged per engineer per week). While acknowledging this metric varies based on factors like work scoping and architecture, tracking changes in PR throughput within organizations provides meaningful signal. The analysis revealed a clear correlation between AI adoption rate and PR throughput, with an average trend showing approximately 2x improvement as companies move from 0% to 100% AI adoption.
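A minimal sketch of how PR throughput might be computed from a merged-PR log is shown below. The table and the choice to count distinct weekly authors as the engineer denominator are assumptions for illustration, not the study's exact definition.

```python
import pandas as pd

# Hypothetical merged-PR log; the real dataset covers roughly 20M PRs.
prs = pd.DataFrame({
    "company": ["acme"] * 5,
    "week": ["2024-06-03"] * 3 + ["2024-06-10"] * 2,
    "author": ["ana", "ben", "ana", "ben", "cam"],
})

# PR throughput: merged PRs per engineer per week, where the engineer count
# is approximated by the number of distinct authors active that week.
throughput = (
    prs.groupby(["company", "week"])
    .agg(prs_merged=("author", "size"), engineers=("author", "nunique"))
    .assign(prs_per_engineer=lambda d: d["prs_merged"] / d["engineers"])
)
print(throughput)
```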
The visualization methodology is notable for LLMOps practitioners: each data point represents a snapshot of a company on a given week, with the x-axis showing AI adoption rate and y-axis showing average PRs per engineer. This time-series approach across multiple organizations provides stronger evidence than simple before/after comparisons within single organizations, as it controls for various confounding factors through aggregate analysis.
Cycle time (defined as time from first commit in a PR until merge) also showed improvements, with a 24% decrease on average as adoption increased from 0% to 100%. Interestingly, the cycle time distribution revealed distinct horizontal bands in the data—a lower cluster for tasks taking less than a day, a middle band for tasks taking about two days, and a long tail of longer-duration tasks. This distribution pattern itself is valuable for understanding how software development work is naturally structured, and the fact that AI tools can compress these timescales across all bands suggests genuine impact rather than just affecting certain types of tasks.
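The cycle time calculation itself is straightforward; the sketch below computes it from hypothetical PR timestamps and buckets PRs into duration bands. The specific band boundaries are illustrative assumptions, chosen only to mirror the sub-day, roughly two-day, and long-tail clusters described above.

```python
import pandas as pd

# Hypothetical PR timestamps; cycle time runs from first commit to merge.
prs = pd.DataFrame({
    "pr_id": [1, 2, 3],
    "first_commit_at": pd.to_datetime(
        ["2024-06-03 09:00", "2024-06-03 14:00", "2024-06-04 10:00"]),
    "merged_at": pd.to_datetime(
        ["2024-06-03 17:00", "2024-06-05 11:00", "2024-06-10 16:00"]),
})

prs["cycle_time_hours"] = (
    prs["merged_at"] - prs["first_commit_at"]
).dt.total_seconds() / 3600

# Bucketing by duration makes the banded distribution visible: sub-day PRs,
# multi-day PRs, and a long tail. Boundaries here are illustrative only.
bins = [0, 24, 72, float("inf")]
labels = ["< 1 day", "1-3 days", "longer tail"]
prs["band"] = pd.cut(prs["cycle_time_hours"], bins=bins, labels=labels)
print(prs[["pr_id", "cycle_time_hours", "band"]])
```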
It's important to note the balanced assessment here: while these gains are substantial, they're not the 10x improvements sometimes claimed in marketing materials. The 2x average improvement is significant and valuable, but organizations should calibrate expectations accordingly. The fact that both throughput and cycle time improved simultaneously suggests these are real efficiency gains rather than just rushing work through the system faster.
## Code Quality and Side Effects
A critical concern with accelerated development using AI tools is whether quality suffers. The study examined multiple quality indicators including bug ticket creation rates, PR revert rates (code that had to be rolled back), and bug resolution rates. The findings here are somewhat reassuring but warrant careful interpretation: no statistically significant relationship was found between AI adoption rates and bug creation or revert rates.
Interestingly, bug resolution rates actually increased with AI adoption. Digging deeper into this finding, the researchers discovered that teams are disproportionately using AI to tackle bug tickets in their backlog. This makes intuitive sense from an LLMOps perspective—bug fixes are often well-scoped, verifiable tasks with clear success criteria, making them suitable targets for AI coding assistance. The ability to verify correctness (did the bug get fixed?) provides a natural quality gate that may not exist for all development tasks.
However, Arcolano's answer to whether big quality effects are showing up yet is "not really," coupled with appropriate caution that this could change, particularly as usage of asynchronous autonomous agents grows. This represents responsible data interpretation: the absence of evidence for quality problems is not definitive evidence that quality problems won't emerge as usage patterns evolve.
The study also found that PRs are getting 18% larger on average in terms of net lines of code added as teams fully adopt AI coding tools. Importantly, this size increase is driven more by additions than deletions, suggesting net new code rather than just rewrites. Additionally, the average number of files touched per PR remains about the same, indicating that the code is becoming more thorough or verbose within the same scope rather than sprawling across more of the codebase. This is a subtle but important distinction for understanding how AI tools are changing development patterns.
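A small sketch of the PR size comparison might look like the following; the per-PR diff statistics and the `high_ai_adoption` flag are hypothetical, and the real analysis aggregates at the company level rather than labeling individual PRs.

```python
import pandas as pd

# Hypothetical per-PR diff stats; the study looks at net lines added
# (additions minus deletions) and the number of files touched.
prs = pd.DataFrame({
    "pr_id": [1, 2, 3, 4],
    "high_ai_adoption": [False, False, True, True],
    "lines_added": [120, 80, 160, 110],
    "lines_deleted": [40, 30, 45, 35],
    "files_touched": [5, 4, 5, 4],
})

prs["net_lines"] = prs["lines_added"] - prs["lines_deleted"]

# Compare average PR size and breadth between low- and high-adoption PRs:
# larger net additions with a similar number of files touched would match
# the pattern described in the study.
summary = prs.groupby("high_ai_adoption")[["net_lines", "files_touched"]].mean()
print(summary)
```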
## Architecture Impact: A Critical LLMOps Insight
Perhaps the most valuable finding for LLMOps practitioners is the dramatic impact of code architecture on AI tool effectiveness. The study introduced a metric called "active repos per engineer"—how many distinct repositories a typical engineer pushes code to in a given week. This metric is scale-independent (normalizing by engineer count removes correlation with company size) and serves as a proxy for whether organizations use centralized architectures (monorepos, monolithic services) versus distributed architectures (polyrepos, microservices).
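The metric is simple to compute from push events, as in the hedged sketch below; the push log and its columns are hypothetical stand-ins for the underlying source control data.

```python
import pandas as pd

# Hypothetical weekly push log: which engineer pushed to which repo.
pushes = pd.DataFrame({
    "week": ["2024-06-03"] * 6,
    "engineer": ["ana", "ana", "ben", "ben", "ben", "cam"],
    "repo": ["api", "web", "api", "infra", "billing", "api"],
})

# Active repos per engineer: distinct repos each engineer pushed to that
# week, averaged across engineers. Normalizing per engineer keeps the
# metric roughly independent of company size.
per_engineer = pushes.groupby(["week", "engineer"])["repo"].nunique()
active_repos_per_engineer = per_engineer.groupby("week").mean()
print(active_repos_per_engineer)  # 2024-06-03: (2 + 3 + 1) / 3 = 2.0
```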
The researchers segmented companies into four regimes: centralized, balanced, distributed, and highly distributed. When they re-ran the PR throughput analysis separately for each regime, dramatically different patterns emerged. Centralized and balanced architectures showed approximately 4x gains in PR throughput with full AI adoption—double the overall average. Distributed architectures tracked closer to the 2x average trend. Most strikingly, highly distributed architectures showed essentially no correlation between AI adoption and PR throughput, with the weak trend that did exist actually being slightly negative.
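A rough sketch of the segmentation and per-regime trend analysis follows. The synthetic data, the regime thresholds on active repos per engineer, and the simple linear fit are all assumptions for illustration; they are not Jellyfish's actual cut points or statistical methodology.

```python
import numpy as np
import pandas as pd

# Synthetic company-week snapshots: adoption rate, PR throughput, and an
# architecture regime derived from active repos per engineer. Thresholds
# and the generated data are illustrative only.
rng = np.random.default_rng(0)
snapshots = pd.DataFrame({
    "adoption_rate": rng.uniform(0, 1, 400),
    "active_repos_per_engineer": rng.uniform(0.5, 6, 400),
})
snapshots["regime"] = pd.cut(
    snapshots["active_repos_per_engineer"],
    bins=[0, 1.5, 2.5, 4, float("inf")],
    labels=["centralized", "balanced", "distributed", "highly distributed"],
)
# Synthetic throughput so the fit runs end to end.
snapshots["prs_per_engineer"] = (
    2 + 2 * snapshots["adoption_rate"] + rng.normal(0, 0.5, 400)
)

# Fit a simple linear trend of throughput vs. adoption within each regime;
# the slope estimates the gain from 0% to 100% adoption for that regime.
for regime, group in snapshots.groupby("regime", observed=True):
    slope, intercept = np.polyfit(
        group["adoption_rate"], group["prs_per_engineer"], 1
    )
    print(f"{regime}: +{slope:.2f} PRs/engineer from 0% to 100% adoption")
```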
This finding has profound implications for LLMOps and explains why some organizations may not see expected benefits despite high adoption. The root cause appears to be context limitations. Most current AI coding tools are designed to work with one repository at a time, and combining context across repositories is challenging both for humans and AI agents. Moreover, relationships between repos and the systems they compose are often not formally documented—they exist primarily in the heads of senior engineers and are not accessible to coding tools and agents.
Arcolano notes an interesting tension here: many voices in the industry advocate that microservices and distributed architectures are the "right way" for AI-native development. He speculates that with improved context engineering and mature autonomous agents, the relationship might flip and highly distributed architectures could become most productive. But the current reality shows the opposite—highly distributed architectures are struggling to realize AI productivity gains.
This also explains why absolute PR counts are poor metrics across organizations. Highly distributed architectures naturally require more PRs to accomplish the same functional outcomes due to cross-repo coordination and migrations. This is why tracking change in PR throughput within organizations (or properly segmenting by architecture) is essential rather than comparing absolute numbers.
## LLMOps Implications and Considerations
This study provides several critical insights for LLMOps practitioners deploying AI coding tools in production:
**Context is paramount for production AI systems.** The architecture findings underscore that AI tool effectiveness is deeply dependent on how well the tools can access and reason about relevant context. This isn't just a coding-specific problem—it generalizes to any production LLM system where context spans multiple sources or systems. Organizations need to invest in "context engineering" as a first-class discipline, ensuring that AI tools can access the information they need to be effective.
**Adoption patterns matter as much as technology choices.** The strong correlation between the adoption rate metric and productivity gains suggests that successful AI transformation is as much about organizational change management as it is about tool selection. Simply providing access to AI tools isn't sufficient; teams need to actually use them consistently to see benefits. This implies that LLMOps should include monitoring adoption patterns and identifying barriers to usage.
**Interactive assistance is currently more production-ready than full autonomy.** The stark contrast between 90% adoption of interactive tools and <2% actual production usage of autonomous agents is important for setting realistic expectations. Organizations should focus on getting value from assistant-level AI tools before betting heavily on autonomous agents.
**Measuring production AI effectiveness requires multi-dimensional metrics.** The study's approach of combining throughput, cycle time, quality metrics, and architectural factors demonstrates that no single metric tells the complete story. LLMOps platforms need to provide holistic observability across the development workflow, not just model-centric metrics.
**Your mileage will vary based on system architecture.** The 2x average improvement masks substantial variation—from 4x improvements in well-suited architectures to essentially zero improvement in highly distributed architectures. Organizations should assess their specific context before setting expectations.
It's worth noting that this study comes from a vendor (Jellyfish) selling analytics tools, so there's inherent incentive to emphasize the importance of measurement and analytics. However, the methodology appears sound with a genuinely large dataset, and the findings include results that cut against simple narratives (like autonomous agents not yet working at scale, or highly distributed architectures struggling). The balanced presentation of both positive results and limitations increases credibility.
The temporal aspect is also important—this data spans June 2024 to early 2025, a period of extremely rapid evolution in AI coding tools. Some findings may already be outdated as tools improve, particularly around context handling and autonomous agent capabilities. This underscores the importance of continuous measurement and evaluation in LLMOps contexts rather than one-time assessments.
For organizations implementing AI coding tools at scale, this study suggests prioritizing three things: driving high adoption rates among developers (targeting 90%+ usage when coding), investing in context engineering appropriate to your architecture (especially critical for distributed architectures), and establishing comprehensive measurement across productivity, quality, and adoption dimensions to understand what's actually working in your specific context.