## Overview
This research paper from Microsoft and GitHub presents findings from an extensive qualitative study examining the real-world challenges faced by software engineers building product copilots—AI-powered features that use large language models to assist users through natural language interactions. The study involved semi-structured interviews with 26 professional software engineers actively engaged in building copilot products across various companies, supplemented by structured brainstorming sessions. The paper was published in December 2023 and represents an important early examination of LLMOps challenges as the industry raced to integrate AI capabilities into existing products.
The researchers found that virtually every large technology company was attempting to add copilot capabilities to their software products, with examples ranging from Salesforce's Einstein Copilot to Microsoft 365 Copilot and GitHub Copilot. However, for most software engineers, this represented their first encounter with integrating AI-powered technology at scale, and existing software engineering processes and tools had not caught up with the unique challenges involved. The study systematically documents pain points at every step of the engineering process and explores how these challenges strained existing development practices.
## Prompt Engineering Challenges
Prompt engineering emerged as fundamentally different from traditional software engineering, with participants describing it as "more of an art than a science." Engineers were caught off guard by the unpredictable and fragile nature of large language models, requiring extensive "behavior control and steering through prompting." While these models unlocked new capabilities—described as "superpowers" by participants—the process of creating effective prompts proved extremely time-consuming and resource-intensive.
The typical workflow involved starting with ad hoc experimentation in playgrounds provided by OpenAI or similar services. Engineers described a transient and ephemeral process of "just playing around with prompts" and trying "not to break things." One participant characterized the early days as "we just wrote a bunch of crap to see if it worked." However, this trial-and-error approach quickly became problematic as engineers had to "accommodate for all these corner cases" and manage "all the differences in physical and contextual attributes that need to flow smoothly into a prompt." The experimental nature of prompt development was identified as "the most time-consuming" aspect when proper tools weren't available.
A major challenge emerged around wrangling consistent output from models. Engineers initially attempted to force structured outputs by providing JSON schemas, but discovered "a million ways you can effect it," ranging from simple formatting issues like "it's stuck with the quoted string" to more complex problems where models would "make up objects that didn't conform to that JSON schema" or "hallucinate stop tokens." Through iteration, engineers learned that working with the model's natural tendencies proved more effective than fighting against them. For instance, when requesting file structures, engineers found that parsing ASCII tree representations (which models naturally generate) yielded higher reliability than attempting to force array-of-objects formats.
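Below is a minimal sketch of that lesson: instead of forcing an array-of-objects JSON format, parse the ASCII tree that models tend to produce naturally. The helper name, regex, and example tree are assumptions for illustration, not code from the paper.

```python
import re

def parse_ascii_tree(tree_text: str) -> list[str]:
    """Convert an ASCII tree (as models tend to emit it) into flat paths.

    Depth is inferred from the indentation prefix made of box-drawing
    characters and spaces; branch markers like '├── ' mark one extra level.
    """
    paths: list[str] = []
    stack: list[str] = []  # one path component per depth level
    for line in tree_text.splitlines():
        match = re.match(r"^((?:[│ ]{4})*)([├└]──\s)?(.+)$", line)
        if not match or not match.group(3).strip():
            continue
        prefix, branch, name = match.groups()
        depth = len(prefix) // 4 + (1 if branch else 0)
        stack = stack[:depth] + [name.strip()]
        paths.append("/".join(part.rstrip("/") for part in stack))
    return paths

example = """\
project/
├── src/
│   ├── app.py
│   └── utils.py
└── README.md"""

print(parse_ascii_tree(example))
# ['project', 'project/src', 'project/src/app.py',
#  'project/src/utils.py', 'project/README.md']
```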
Context management presented another significant challenge. Users often provide referential phrases like "refactor this code" or "add borders to the table," requiring the copilot to understand the user's current task and environment. Engineers struggled with "squishing more information about the data frame into a smaller string" while staying within token limits. They had to constantly make decisions about what to "selectively truncate because it won't all fit into the prompt," particularly when conversation history grew long. The difficulty in testing the impact of different prompt components on overall performance compounded these challenges.
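A minimal sketch of this kind of selective truncation, assuming a chat-style message list and a rough characters-per-token heuristic in place of a real tokenizer; the function names and budget value are illustrative.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic (~4 characters per token); a production system would
    # use the model's actual tokenizer instead.
    return len(text) // 4 + 1

def fit_history(system_prompt: str, history: list[dict], task_context: str,
                budget: int = 4000) -> list[dict]:
    """Keep the system prompt and task context, then add only the most
    recent conversation turns that still fit within the token budget."""
    fixed = estimate_tokens(system_prompt) + estimate_tokens(task_context)
    remaining = budget - fixed
    kept: list[dict] = []
    for turn in reversed(history):          # walk newest turns first
        cost = estimate_tokens(turn["content"])
        if cost > remaining:
            break                           # selectively truncate older turns
        kept.append(turn)
        remaining -= cost
    kept.reverse()
    system = {"role": "system", "content": system_prompt + "\n\n" + task_context}
    return [system] + kept
```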
As prompts matured, engineers realized that monolithic prompts were problematic and needed to break them down into reusable components including examples, instructions, rules, and templates. This led to "a library of prompts and things like that" that could be dynamically populated before final execution. However, this componentization introduced new challenges around version control, tracking, and debugging. Engineers found it difficult to "inspect that final prompt" and had to resort to "going through the logs and mapping the actual prompt back to the original template and each dynamic step made." There was no systematic way to continuously validate prompt performance over time or assess the impact of tweaks to prompts or model changes.
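A sketch of what such a prompt component library might look like, with the assembled prompt logged so it can be inspected and traced back to a template name and version; the class and field names are hypothetical, not the participants' actual tooling.

```python
from dataclasses import dataclass, field
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("prompts")

@dataclass
class PromptTemplate:
    """A versioned prompt assembled from reusable components."""
    name: str
    version: str
    instructions: str
    rules: list[str] = field(default_factory=list)
    examples: list[str] = field(default_factory=list)

    def render(self, **variables: str) -> str:
        parts = [self.instructions.format(**variables)]
        if self.rules:
            parts.append("Rules:\n" + "\n".join(f"- {r}" for r in self.rules))
        if self.examples:
            parts.append("Examples:\n" + "\n\n".join(self.examples))
        prompt = "\n\n".join(parts)
        # Log the fully assembled prompt so it can be inspected later and
        # mapped back to the originating template name and version.
        log.info("prompt=%s version=%s\n%s", self.name, self.version, prompt)
        return prompt

explain = PromptTemplate(
    name="explain_code",
    version="1.3.0",
    instructions="Explain what the following {language} code does:\n{code}",
    rules=["Answer in plain English.", "Do not invent APIs absent from the code."],
)
final_prompt = explain.render(language="Python", code="print('hi')")
```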
## Orchestration and Workflow Complexity
Building functional copilots required extensive orchestration beyond simple prompt-response patterns. Many engineers started with single-turn interactions where the user provides a query and receives a response, but this quickly evolved into more complex workflows. A common pattern involved intent detection as the first step, where the user's query would be analyzed to determine "what kind of intent does the user have for this specific query out of intents that we predefine and provide." Once intent was detected, the query would be routed to the appropriate "skill" capable of handling that type of request, such as "adding a test or generating documentation."
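A simplified sketch of this intent-detection-and-routing pattern, assuming a `classify` callable that wraps a model call and placeholder skill functions; the intent names echo the examples above, and everything else is illustrative.

```python
from typing import Callable

# Placeholder skills; real ones would call the model with skill-specific
# prompts plus the relevant editor or workspace context.
def add_test(query: str) -> str:
    return f"[add_test skill handled: {query}]"

def generate_docs(query: str) -> str:
    return f"[generate_docs skill handled: {query}]"

def general_chat(query: str) -> str:
    return f"[general_chat fallback handled: {query}]"

SKILLS: dict[str, Callable[[str], str]] = {
    "add_test": add_test,
    "generate_docs": generate_docs,
    "general_chat": general_chat,
}

INTENT_PROMPT = (
    "Classify the user's request into exactly one of these intents: "
    "{intents}. Respond with the intent name only.\n\nRequest: {query}"
)

def route(query: str, classify: Callable[[str], str]) -> str:
    """Detect intent with one model call, then dispatch to the matching skill."""
    prompt = INTENT_PROMPT.format(intents=", ".join(SKILLS), query=query)
    intent = classify(prompt).strip().lower()
    # Fall back to general chat if the model returns something unexpected.
    return SKILLS.get(intent, general_chat)(query)
```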
After receiving model responses, additional processing was necessary to interpret and apply the results. For code generation scenarios, engineers needed to determine "whether we need to update the current selection or just insert something below." However, commanding capabilities were often limited due to safety concerns. While it seemed logical to progress "from copilot chat saying here's how you would set this up to actually setting that up for the user," engineers recognized that "it's dangerous to let copilot chat just do stuff for you without your intervention" since "this content is AI generated and you should review all of it."
Intent-routing architectures proved problematic for multi-turn conversations or simple follow-up questions. The automatic population of prompts with skill-specific instructions and context disrupted natural conversational flow. Some engineers explored more advanced "agent-based" approaches where the LLM acts as an autonomous agent in an environment, performing internal observations and reasoning. One participant described a planning system that allowed engineers to build "semantic functions that could be woven together by a simple plan language." However, agent-based approaches came with significant tradeoffs—while "more powerful," the behavior proved "really hard to manage and steer."
A persistent problem with agent-based systems was the tendency for models to "get stuck in loops or to go really far off track." Engineers found that models had difficulty accurately recognizing task completion, often thinking "it's done, but it's not done." User experience sessions revealed instances where models "completely lost the script" and "gone off the rails" after misinterpreting user intent. These experiences highlighted the need for better visibility into internal reasoning states, improved tracking of multi-step tasks, and stronger guardrails on agent behavior.
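A minimal sketch of two such guardrails, a hard step budget and repeated-action detection, assuming a `step_fn` callable that hides the actual model interaction; this is illustrative, not any participant's implementation.

```python
from typing import Callable

def run_agent(task: str,
              step_fn: Callable[[str, list[str]], tuple[str, bool]],
              max_steps: int = 8) -> str | None:
    """Drive an agent loop with simple guardrails: a cap on steps and
    detection of repeated actions (the model 'getting stuck in loops').
    `step_fn` takes the task and action history and returns (action, is_done);
    how it calls the model and its tools is left abstract here."""
    history: list[str] = []
    for _ in range(max_steps):
        action, is_done = step_fn(task, history)
        if action in history[-3:]:
            # The model is repeating itself; bail out instead of spinning.
            return None
        history.append(action)
        if is_done:
            return action
    # Hit the step budget without the model declaring completion.
    return None
```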
## Testing and Benchmarking Struggles
Software engineers naturally attempted to apply classical software engineering methods like unit testing to LLM-based systems, but quickly encountered fundamental incompatibilities. The core problem was that generative models produce different responses each time, making traditional assertions impossible—"it was like every test case was a flaky test." To cope, engineers developed creative workarounds such as running "each test 10 times" and only considering it passing if "7 of the 10 instances passed." The experimental mindset extended to test inputs as well, since "if you do it for one scenario no guarantee it will work for another scenario."
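A sketch of that workaround as a reusable helper, assuming `generate` wraps the model call and `check` encodes the assertion; the 10-run/7-pass thresholds mirror the quote above.

```python
from typing import Callable

def passes_often_enough(generate: Callable[[], str],
                        check: Callable[[str], bool],
                        runs: int = 10, required: int = 7) -> bool:
    """Re-run a non-deterministic generation several times and treat the test
    as passing only if enough individual runs satisfy the check."""
    successes = sum(1 for _ in range(runs) if check(generate()))
    return successes >= required

# Example usage with a stand-in for the model call:
# assert passes_often_enough(
#     generate=lambda: call_model("Write a haiku about testing"),  # hypothetical
#     check=lambda text: len(text.splitlines()) == 3,
# )
```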
Engineers maintained manually curated spreadsheets containing hundreds of "input/output examples" with multiple output responses per input. However, these examples required manual updates whenever prompts or models changed, creating a significant maintenance burden. Some engineers adopted metamorphic testing approaches, focusing on "pass/fail criteria and structure more than the contents," such as checking whether "code has been truncated" rather than validating exact output content.
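A sketch of such structural, metamorphic-style checks, here specialized to Python output; the specific heuristics (balanced brackets, successful parse) are assumptions for illustration.

```python
import ast

def looks_untruncated(code: str) -> bool:
    """Structural check in the spirit of metamorphic testing: validate the
    shape of the output ('has the code been truncated?') rather than its
    exact contents."""
    stripped = code.rstrip()
    if not stripped:
        return False
    # Unbalanced brackets are a strong signal the generation was cut off.
    if stripped.count("(") != stripped.count(")"):
        return False
    if stripped.count("{") != stripped.count("}"):
        return False
    # For Python output, a parse failure usually means an incomplete snippet.
    try:
        ast.parse(stripped)
    except SyntaxError:
        return False
    return True
```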
Benchmarking proved even more challenging. Engineers needed benchmarks to perform regression testing and evaluate performance differences between models or agent designs, but faced two fundamental problems: no suitable benchmarks existed for their specific use cases, and no clear metrics existed to determine "good enough" or "better" performance. For qualitative outputs, the solution often involved "humans in the loop saying yes or no," but as one engineer noted, "the hardest parts are testing and benchmarks."
Building manually labeled datasets was described as "mind numbingly boring and time-consuming" work that companies often outsourced. One participant's team labeled "about 10k responses" but acknowledged "more is always better," with decisions ultimately coming down to available budget. The costs of running test inputs through LLMs created additional constraints—while individual tests might "cost 1-2 cents to run," costs quickly accumulated with large test suites. One engineer was asked to stop automated testing efforts due to costs, resorting instead to manually running small test sets only after large changes. Another had to suspend testing entirely when it interfered with production endpoint performance.
Determining acceptable performance thresholds remained unclear. As one participant asked, "Where is that line that clarifies we're achieving the correct result without overspending resources and capital to attain perfection?" Engineers developed pragmatic approaches like simple grading schemes with "A, B, etc." grades, acknowledging that "grading introduces its own biases, but by averaging, we can somewhat mitigate that." However, these approaches lacked the rigor and standardization that engineers desired.
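A sketch of the simple grading scheme described above, averaging letter grades from several graders per response to soften individual bias; the grade-to-point mapping is an assumption.

```python
GRADE_POINTS = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}

def average_grade(grades_per_response: dict[str, list[str]]) -> dict[str, float]:
    """Average letter grades from several graders for each response."""
    return {
        response_id: sum(GRADE_POINTS[g] for g in grades) / len(grades)
        for response_id, grades in grades_per_response.items()
    }

scores = average_grade({"resp-001": ["A", "B", "A"], "resp-002": ["C", "B", "D"]})
# {'resp-001': ~3.67, 'resp-002': 2.0}
```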
## Learning and Knowledge Evolution
The learning challenges faced by participants were amplified compared to typical software engineering domains due to the nascent and rapidly evolving nature of LLM technology. Many engineers had to start "from scratch," "stumbling around trying to figure out" approaches without established paths. As one participant emphasized, "This is brand new to us. We are learning as we go. There is no specific path to do the right way!"
Engineers leveraged emerging communities of practice forming around social media, particularly hashtags and subreddits dedicated to LLMs. They found value in seeing "a bunch of examples of people's prompts" and "comparing and contrasting with what they've done, showing results on their projects, and then showing what tools they've used." Some engineers even used the models themselves as learning aids, describing a "meta" approach where they would "feed all of the code and talk to GPT-4 to ask questions" to minimize the learning curve.
However, uncertainty about future directions and unstable knowledge created unique challenges. The ecosystem was "evolving quickly and moving so fast," making investments in comprehensive documentation or guidebooks seem premature. Engineers questioned the longevity of skills they were developing, wondering "how long prompting will stay" as a relevant capability. The "lack of authoritative information on best practices" and a sense that "it's too early to make any decisions" created anxiety. There was also concern about job relevance, with "angst in the community as some particular job function may no longer be relevant."
For some engineers, building copilots required fundamental mindset shifts. One participant articulated this transformation: "For someone coming into it, they have to come into it with an open mind, in a way, they kind of need to throw away everything that they've learned and rethink it. You cannot expect deterministic responses, and that's terrifying to a lot of people. There is no 100% right answer. You might change a single word in a prompt, and the entire experience could be wrong. The idea of testing is not what you thought it was." Despite these challenges, there was overwhelming desire for best practices to be defined so engineers could focus on "the idea and get it in front of a customer."
## Safety, Privacy, and Compliance
Software systems incorporating AI decision-making can exhibit bias and discrimination, but LLMs introduced additional vectors of harm. Ensuring user safety and installing "guardrails" represented significant priorities for engineers. One participant working on Windows-based systems expressed concern about "putting power into the hands of AI" given that "Windows runs in nuclear power plants." Common tactics included detecting off-topic requests, though conversations could easily drift—for example, when collecting feedback with questions like "would you recommend this to a friend," users might respond with "no one would ask me about this, I don't have friends," requiring careful steering to avoid inappropriate follow-ups.
Some organizations mandated that copilots call managed endpoints with content filtering on all requests. However, these measures weren't always sufficient, leading engineers to implement rule-based classifiers and manual guard lists to prevent "certain vocab or phrases we are not displaying to our customers." Privacy and security requirements added another layer of complexity, with engineers needing to ensure that "output of the model must not contain identifiers that is easily retrievable in the context of our overall system." Third-party model hosting policies created additional complications, with one participant noting that partnering with OpenAI to host an internal model was necessary because "they can actually ingest any conversation to use as a training data that it's like a huge compliance risk for us."
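A sketch of a rule-based guard list applied before displaying output, layered on top of (not replacing) a managed content-filtering endpoint; the patterns and fallback message are purely illustrative.

```python
import re

# Hypothetical manually curated guard list of vocab and phrases that should
# never be displayed to customers, maintained by the product team.
BLOCKED_PATTERNS = [
    r"\b(?:ssn|social security number)\b",
    r"\binternal-only\b",
]

def passes_guard_list(model_output: str) -> bool:
    """Rule-based post-filter run after the managed content filter."""
    lowered = model_output.lower()
    return not any(re.search(pattern, lowered) for pattern in BLOCKED_PATTERNS)

def render_response(model_output: str) -> str:
    if not passes_guard_list(model_output):
        return "Sorry, I can't share that. Could you rephrase your request?"
    return model_output
```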
Telemetry presented a catch-22 situation. While "telemetry is ideal way to understand how users are interacting with copilots," privacy constraints severely limited its utility. Engineers often could only see "what runs in the back end, like what skills get used" but not the actual user prompts, leading to insights like "the explain skill is most used but not what the user asked to explain." This limitation meant that "telemetry will not be sufficient; we need a better idea to see what's being generated."
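A sketch of telemetry under those privacy constraints: record which skill ran and the outcome, hash the session identifier, and never log the prompt or the generated response; the field names and event shape are assumptions.

```python
import hashlib
import json
import time

def log_copilot_event(skill: str, outcome: str, session_id: str) -> str:
    """Emit a telemetry event that captures skill usage without capturing
    the user's prompt or the model's response."""
    event = {
        "ts": time.time(),
        "skill": skill,        # e.g. "explain" - but not what the user asked to explain
        "outcome": outcome,    # "success" / "error" / "rejected"
        "session": hashlib.sha256(session_id.encode()).hexdigest()[:16],
    }
    return json.dumps(event)   # in practice, sent to a telemetry pipeline
```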
Responsible AI assessments represented a new and resource-intensive process for most engineers. One participant described starting with an "impact assessment" that required "reading dozens of pages to understand the safety standards and know if your system meets those standards," consuming "1-2 days on just focus on that." Initial meetings with AI assessment coaches lasted "3.5 hours of lots of discussion," resulting in "a bunch of work items, lots of required documentation, with more work to go." Compared to typical security or privacy reviews taking 1-2 days, the responsible AI process required two full weeks. For one team, a major outcome was the need to generate automated benchmarks ensuring content filters flagged harmful content across "hundreds of subcategories" including hate, self-harm, and violence—work that became a shipping blocker.
## Developer Experience and Tooling Gaps
The overall developer experience for building copilots was characterized by fragmentation and inadequate tooling. When evaluating tools or libraries, engineers valued rich ecosystems with "clear-cut examples" showing "the breadth of what's possible." Langchain emerged as a popular choice for prototyping due to its "basic building blocks and most rich ecosystem." However, it proved inadequate for production systems, with engineers finding that "if you want to get deeper" beyond prototypes, more systematic design was necessary. Most interviewed engineers ultimately chose not to use Langchain for actual products, with one expressing fatigue at "learning and comparing tools" and preferring to "focus on the customer problem."
Getting started with new projects presented significant challenges due to lack of integration between tools. As one engineer described, "There's no consistent easy way to have everything up and running in one shot. You kind of have to do things piece-wise and stick things together." Even basic tasks like calling different completion endpoints required accounting for "behavioral discrepancies among proxies or different model hosts." Engineers desired "a whole design or software engineering workflow where we can start breaking up the individual components rather than just jumping in," including the ability to have "validation baked in, separately defining the preconditions and postconditions of a prompt."
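A sketch of that "validation baked in" idea, wrapping a model call with separately defined preconditions and postconditions; the contract style and names are assumptions, not an existing library API.

```python
from typing import Callable

class PromptContractError(Exception):
    pass

def with_contract(call_model: Callable[[str], str],
                  preconditions: list[Callable[[str], bool]],
                  postconditions: list[Callable[[str], bool]]) -> Callable[[str], str]:
    """Wrap a model call so that prompt preconditions and output
    postconditions are checked on every invocation."""
    def guarded(prompt: str) -> str:
        if not all(check(prompt) for check in preconditions):
            raise PromptContractError("precondition failed for prompt")
        output = call_model(prompt)
        if not all(check(output) for check in postconditions):
            raise PromptContractError("postcondition failed for model output")
        return output
    return guarded

# Example: prompt must fit the context window, output must be non-empty.
# guarded_call = with_contract(call_model,                      # hypothetical
#                              [lambda p: len(p) < 12_000],
#                              [lambda o: bool(o.strip())])
```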
Across the interviews, engineers used a "constellation of tools" to piece together solutions, but there was "no one opinionated workflow" that integrated prompt engineering, orchestration, testing, benchmarking, and performance monitoring. This fragmentation created significant friction and slowed development cycles.
## Proposed Solutions and Tool Design Opportunities
Through brainstorming sessions, engineers and researchers identified several opportunities for improved tooling and processes. For prompt engineering, suggestions included building prompt linters to validate prompts against team-defined best practices, such as avoiding hard-coded language-specific instructions when supporting multiple programming languages. Techniques inspired by delta-debugging could systematically explore eliminating portions of prompts to identify the most impactful components, enabling prompt compression and optimization. One creative approach involved using GPT-4 itself as a "rubberduck" for prompt writing, with engineers running prompts through the model to detect ambiguous scenarios before deployment.
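A sketch of a greedy, delta-debugging-inspired pruner that drops prompt sections whose removal does not hurt an evaluation score; the `score` function (for example, a pass rate on a small benchmark) is assumed, and real delta debugging would explore removals more systematically.

```python
from typing import Callable

def prune_prompt_sections(sections: list[str],
                          score: Callable[[str], float],
                          tolerance: float = 0.01) -> list[str]:
    """Try removing one section at a time; keep the removal whenever the
    evaluation score does not drop by more than `tolerance` relative to the
    full prompt's baseline."""
    kept = list(sections)
    baseline = score("\n\n".join(kept))
    changed = True
    while changed:
        changed = False
        for i in range(len(kept)):
            candidate = kept[:i] + kept[i + 1:]
            if not candidate:
                continue
            if score("\n\n".join(candidate)) >= baseline - tolerance:
                kept = candidate   # this section was not pulling its weight
                changed = True
                break
    return kept
```

Each call to `score` re-runs the benchmark, so in practice the evaluation set would need to be small or cached to keep costs manageable, echoing the cost constraints described in the testing section.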
For orchestration and lifecycle management, engineers desired better mechanisms for context sharing and commanding. They recognized that users expected copilots to both see actions being performed and execute available commands, but considerable engineering effort and safety concerns needed addressing before open-ended access could be provided. Automated benchmark creation through systems that capture direct feedback from crowdsourced evaluators or end-users was highly desired, with engineers preferring straightforward percentage evaluations with actionable insights over complex machine learning metrics like BLEU scores.
Visibility and awareness tools were considered critical, including mechanisms to alert stakeholders of drastic cost changes and rigorous regression testing capabilities given that "small changes in prompts can have large and cascading effects on performance." Engineers wanted clear insights into the behaviors of systems built with frameworks like Langchain or Semantic Kernel, particularly the various transformations that occur to prompts through multiple layers of abstraction.
The ultimate vision expressed by participants was for a unified "one-stop shop" that would streamline development of intelligent applications. Current solutions like Langchain fell short in providing comprehensive workflow integration. Engineers advocated for templates designed for common application patterns (like Q&A systems) that would come bundled with essential configurations including hosting setups, prompts, vector databases, and tests. Tools to guide selection of appropriate tool suites from the vast options available would also prove valuable.
## Critical Assessment
While this research provides valuable insights into real-world LLMOps challenges, it's important to note several limitations. The study captures experiences from a specific time period (late 2023) when LLM tooling and best practices were particularly immature. Many identified pain points may have been partially addressed by subsequent tool development, though the fundamental challenges around non-determinism, testing, and orchestration likely persist. The participant pool, while diverse, may not fully represent the experiences of smaller organizations or those with more extensive ML/AI backgrounds.
The paper effectively documents problems but provides limited concrete solutions or validated approaches. The brainstorming sessions generated ideas for tools and techniques, but these remained conceptual rather than implemented and evaluated. Additionally, the focus on copilot-style conversational interfaces may not fully capture the breadth of LLM integration patterns used in production systems.
The research also reflects a particular moment in the industry's learning curve. Some challenges described—such as the difficulty with JSON output formatting—have been partially addressed through improved model capabilities and structured output features. However, the higher-level challenges around testing adequacy, cost management, responsible AI compliance, and orchestration complexity remain highly relevant to contemporary LLMOps practice.
Despite these limitations, the study provides invaluable documentation of the engineering challenges that arise when moving LLMs from experimental prototypes to production systems. It highlights the gap between traditional software engineering practices and the requirements of AI-powered applications, emphasizing the need for new tools, processes, and mental models tailored to the unique characteristics of large language models.