Company
Luna
Title
Building Production-Ready AI Analytics with LLMs: Lessons from Jira Integration
Industry
Tech
Year
2025
Summary (short)
Luna developed an AI-powered Jira analytics system using GPT-4 and Claude 3.7 to extract actionable insights from complex project management data, helping engineering and product teams track progress, identify risks, and predict delays. Through iterative development, they identified seven critical lessons for building reliable LLM applications in production, including the importance of data quality over prompt engineering, explicit temporal context handling, optimal temperature settings for structured outputs, chain-of-thought reasoning for accuracy, focused constraints to reduce errors, leveraging reasoning models effectively, and addressing the "yes-man" effect where models become overly agreeable rather than critically analytical.
## Company and Use Case Overview

Luna, a project management AI company, set out to build an AI-powered Jira analytics system that helps engineering and product teams extract actionable insights from complex project management data. The system was designed to track progress, identify risks, and predict potential delays more effectively than traditional manual analysis. The development team used advanced Large Language Models, including GPT-4 and Claude 3.7, to process and analyze Jira data, transforming raw project information into meaningful insights for decision-making.

The initiative represents a comprehensive real-world application of LLMs in production environments, where the stakes are high and reliability is crucial. Luna's experience building this system revealed significant challenges and insights that extend beyond typical academic or experimental LLM applications, providing valuable lessons for practitioners working with LLMs in production settings.

## Technical Implementation and LLMOps Challenges

The development process exposed several critical areas where LLM performance in production differs significantly from controlled testing scenarios. One of the most significant discoveries was that approximately 80% of problematic LLM outputs were not due to prompt engineering issues but rather stemmed from poor data quality and consistency. This finding challenges the common assumption that LLM problems are primarily solved through better prompting techniques.

The team encountered substantial data quality issues that directly impacted LLM performance. Inconsistent calculations across different Jira projects, missing contextual information that seemed obvious to human users, and conflicting information between data points all contributed to unreliable AI outputs. For instance, the same engineering metrics, such as sprint velocity or cycle time, were calculated differently across data sources, leading to confusion in the LLM's analysis. Status values like "Done" carried different meanings in different workflows, creating ambiguity that humans could easily resolve but that thoroughly confused the language models.

This experience highlights a fundamental principle in LLMOps: the quality of input data is often more critical than sophisticated prompt engineering. When presented with conflicting or ambiguous data, LLMs don't hallucinate randomly but instead improvise plausible interpretations of flawed input. By implementing rigorous data standardization and quality checks before feeding information to the models, the team achieved far more reliable and accurate outputs without resorting to overly complex prompt engineering.
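The write-up doesn't include Luna's pipeline code, but a minimal sketch of what such a standardization layer could look like is shown below. The field names, status mapping, and velocity formula are illustrative assumptions rather than the team's actual implementation; the point is that ambiguity gets resolved in deterministic code before the model ever sees the data.

```python
# Canonical status vocabulary: workflow-specific labels are mapped once, up front,
# instead of letting the model guess what "Done" means in each project.
# (Hypothetical mapping and field names, for illustration only.)
STATUS_MAP = {
    "Done": "completed",
    "Closed": "completed",
    "Resolved": "completed",
    "In Review": "in_progress",
    "In Progress": "in_progress",
    "To Do": "not_started",
    "Backlog": "not_started",
}

def normalize_issue(raw: dict) -> dict:
    """Return a cleaned issue record; unmapped statuses fail loudly instead of being improvised."""
    status = STATUS_MAP.get(raw["status"])
    if status is None:
        raise ValueError(f"Unmapped status {raw['status']!r}: fix the mapping, not the prompt")
    return {
        "key": raw["key"],
        "status": status,
        "story_points": raw.get("story_points") or 0,
    }

def sprint_velocity(issues: list[dict]) -> float:
    """One shared definition of velocity, applied identically to every project."""
    return sum(issue["story_points"] for issue in issues if issue["status"] == "completed")
```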
## Temporal Context and Time-Based Analysis

One of the most surprising technical challenges was the models' inability to handle temporal context. Despite their advanced capabilities, both GPT-4 and Claude 3.7 demonstrated what the team termed "time blindness": they lacked any inherent understanding of the current date and struggled to interpret relative time references like "last week" or "next sprint" without explicit guidance. This became particularly problematic when analyzing sprint deadlines, issue aging, and time-based risk factors within the Jira analytics tool. Working with dates alone, the models would fail to identify overdue tasks or miscalculate the time remaining until deadlines; a date like "April 10" could not be placed in the past or future without explicit reference to the current date.

The solution was a comprehensive pre-processing layer that handles all time calculations before data is passed to the LLM. Instead of providing raw dates, the system now calculates and supplies both absolute and relative temporal context. A task created on March 20 with a deadline of April 10 is presented as "Task created on March 20, 2025 (6 days ago), deadline April 10, 2025 (15 days remaining from today, March 26, 2025)." This approach eliminated an entire category of errors and significantly improved the accuracy of time-based analysis.

## Parameter Optimization and Model Configuration

The team's exploration of temperature settings revealed nuanced insights about balancing structure and analytical judgment in LLM outputs. They initially assumed that temperature 0 (minimum randomness) would yield the most consistent results for structured analytics reports, but found this approach counterproductive for nuanced analysis. At temperature 0, the models exhibited excessive rigidity, treating minor potential issues with the same alarming severity as major blockers. They became context-blind, missing important nuances or mitigating factors that called for proportional responses. They made overly deterministic declarations even when data was incomplete or ambiguous, and they struggled with edge cases, sometimes "panicking" over normal conditions such as zero velocity at the start of a sprint.

Through experimentation, the team found that temperature settings between 0.2 and 0.3 struck the best balance for tasks requiring both structure and analytical judgment. This configuration maintained consistent formatting and output structure while allowing appropriate nuance in assessments, recommendations, and severity scoring. The models handled edge cases and ambiguity more gracefully and produced more natural, human-like analysis that users found more trustworthy and actionable.

## Chain-of-Thought Reasoning and Accuracy Improvements

Chain-of-thought prompting yielded remarkable improvements in accuracy and reliability. Even with state-of-the-art models, the team noticed occasional lapses in reasoning or simple computational errors that affected output quality. The fix was to require models to explain their reasoning step-by-step before providing final conclusions.

This produced several significant improvements. Models became more discerning about what truly constituted a high-priority risk versus a minor observation. Simple mathematical errors, such as summing story points or calculating percentages, decreased significantly. The tendency to overreact to single data points or minor signals was reduced, leading to more balanced analysis. And when errors did occur, the explicit reasoning steps made it easy to identify exactly where the reasoning went wrong, improving debugging and providing explainable AI functionality.

The team implemented specific instructions requiring self-reflection: "Before providing your final analysis/summary/risk score, explain your reasoning step-by-step. Show how you evaluated the input data, what calculations you performed, and how you reached your conclusions." While this approach consumes additional tokens and increases costs, the return on investment in accuracy, explainability, and trustworthiness proved significant for the AI Jira analytics use case.
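Pulling these three lessons together, a simplified call might look like the sketch below. It assumes the OpenAI Python SDK with an API key in the environment; the model name, prompt wording, and helper functions are illustrative, not taken from Luna's system. All date arithmetic happens before the model sees the data, the temperature sits in the 0.2 to 0.3 band, and the self-reflection instruction quoted above is appended to every request.

```python
from datetime import date

from openai import OpenAI  # assumes the OpenAI Python SDK and OPENAI_API_KEY in the environment

client = OpenAI()

# The self-reflection instruction described in the case study.
REASONING_INSTRUCTION = (
    "Before providing your final analysis/summary/risk score, explain your reasoning "
    "step-by-step. Show how you evaluated the input data, what calculations you "
    "performed, and how you reached your conclusions."
)

def with_relative_context(label: str, d: date, today: date) -> str:
    """Render a date together with its relation to 'today' so the model never has to infer it."""
    delta = (d - today).days
    if delta > 0:
        relative = f"{delta} days remaining from today, {today:%B %d, %Y}"
    elif delta < 0:
        relative = f"{-delta} days ago"
    else:
        relative = f"today, {today:%B %d, %Y}"
    return f"{label} {d:%B %d, %Y} ({relative})"

def analyze_task(task_key: str, created: date, deadline: date, today: date) -> str:
    # All temporal arithmetic is pre-computed here, before the prompt is built.
    temporal_context = (
        f"{with_relative_context('Task created on', created, today)}, "
        f"{with_relative_context('deadline', deadline, today)}."
    )
    prompt = (
        f"Analyze the delivery risk for Jira issue {task_key}.\n"
        f"{temporal_context}\n\n"
        f"{REASONING_INSTRUCTION}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",      # illustrative model name
        temperature=0.3,     # within the 0.2-0.3 range the team found most balanced
        messages=[
            {"role": "system", "content": "You are a careful Jira delivery-risk analyst."},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content

# Reproduces the article's example: created March 20, deadline April 10, today March 26, 2025.
print(analyze_task("PROJ-123", date(2025, 3, 20), date(2025, 4, 10), date(2025, 3, 26)))
```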
## Output Constraint and Focus Optimization

An important discovery was that allowing LLMs to generate excessive output could actually introduce more errors and inconsistencies; constraining output scope led to better focus and improved performance. By asking for less information, offloading basic tasks that could be handled outside the LLM, breaking complex requests into separate calls, and setting lower maximum token limits, the team achieved more focused and accurate results.

This approach works because LLMs have limits on how much information they can process effectively at once, and fewer tasks let them dedicate more capacity to getting each one right. Like humans, LLMs don't multitask as well as might be expected, and giving them fewer simultaneous tasks usually leads to better quality on each specific task. The team noted that creative tasks are different: for brainstorming or exploring ideas, generating more content can be useful, but for analytical tasks requiring precision, constraint improves performance.

## Reasoning Model Utilization

The team's experience with newer reasoning models such as Claude 3.7 and OpenAI's o-series revealed that these models represent a fundamental shift in how LLMs approach problems. They solve problems step-by-step in a visible, logical chain rather than jumping straight to an answer. However, they don't use that reasoning capability fully unless explicitly prompted to do so.

The key insight was that reasoning models need explicit requests for visible reasoning to reach their full potential; without this, they are more likely to miss guidelines or make mistakes. The team implemented prompts requesting explicit reasoning within thinking blocks, which improved accuracy significantly. The tradeoffs are increased latency, since reasoning takes time, and higher cost, since it consumes more tokens, but accuracy improved substantially on complex analytical tasks like risk assessment.

## Addressing Model Bias and the "Yes-Man" Effect

One of the most subtle but critical challenges identified was the tendency of LLMs to be overly agreeable, potentially reinforcing user assumptions or biases rather than providing objective, critical analysis. This "yes-man" effect manifested in several ways: models would mirror implicit assumptions in prompts rather than challenging them based on data, agree with flawed premises or leading questions instead of identifying underlying problems, omit crucial caveats or risks unless explicitly prompted, and deliver potentially inaccurate information with unnervingly high confidence.

Interestingly, the team observed differences between models in this regard. Claude models sometimes provided more pushback, flagging gaps, surfacing potential blockers unprompted, and offering more critical assessments than the other models tested. To counteract the effect, the team adopted several strategies: explicitly instructing models to be critical by asking what is missing, which assumptions might be flawed, and what the biggest risks are; framing questions neutrally to avoid leading the model; considering model diversity for tasks requiring strong analytical distance; and always verifying rather than trusting confident-sounding responses.
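Luna's actual prompts aren't published, so the wording below is purely illustrative. It contrasts a leading question with the kind of neutral, explicitly critical framing the team describes: asking for missing information, flawed assumptions, and risks rather than for confirmation.

```python
# Leading framing: invites the model to agree that everything is fine.
LEADING_PROMPT = "Our sprint looks on track, right? Confirm that we'll hit the release date."

# Neutral, critical framing: asks the model to look for what's missing and what could go wrong.
CRITICAL_PROMPT_TEMPLATE = """Review the sprint data below and give an objective assessment.

Explicitly address:
1. What important information is missing from this data?
2. Which assumptions behind the current plan might be flawed?
3. What are the biggest risks to the release date, even if they seem unlikely?

Do not conclude that the sprint is on track unless the data clearly supports it.

Sprint data:
{sprint_data}
"""

def build_risk_review_prompt(sprint_data: str) -> str:
    """Fill the neutral template with standardized sprint data prepared by the earlier pipeline."""
    return CRITICAL_PROMPT_TEMPLATE.format(sprint_data=sprint_data)
```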
## Production Deployment Considerations

The case study also surfaces broader considerations for deploying LLM-based analytics systems in production. The team emphasized rigorous testing across different data conditions, comprehensive monitoring of output quality, feedback loops for continuous improvement, and human oversight for critical decisions. The experience demonstrates that successful LLMOps requires more than technical knowledge of APIs and prompts: it demands practical, hands-on experience and a deep understanding of each model's characteristics, strengths, and limitations, especially when dealing with domain-specific data like Jira project management information.
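The article stays at the level of practice rather than code, but a minimal, illustrative output-quality gate consistent with these practices might look like the sketch below; the response schema, score range, and escalation path are assumptions, not Luna's implementation. Structured model output is validated before it reaches users, and anything that fails the checks is logged and routed to human review.

```python
import json
import logging

logger = logging.getLogger("jira_analytics")

# Hypothetical response schema: the model is asked to return these fields as JSON.
REQUIRED_FIELDS = {"summary", "risk_score", "reasoning"}

def validate_analysis(raw_response: str) -> dict | None:
    """Parse and sanity-check a model response; return None to signal that human review is needed."""
    try:
        parsed = json.loads(raw_response)
    except json.JSONDecodeError:
        logger.warning("Model returned non-JSON output; escalating to human review")
        return None

    if not isinstance(parsed, dict):
        logger.warning("Expected a JSON object; escalating to human review")
        return None

    missing = REQUIRED_FIELDS - parsed.keys()
    score = parsed.get("risk_score")
    if missing or not isinstance(score, (int, float)) or not 0 <= score <= 10:
        logger.warning("Analysis failed validation (missing fields: %s); escalating", sorted(missing))
        return None

    return parsed
```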
The lessons learned have broader applicability beyond Jira analytics, offering valuable insights for any organization looking to build reliable AI systems on Large Language Models in production. The iterative nature of the development process, with continuous refinement based on real-world performance, exemplifies LLMOps best practice: ongoing monitoring, evaluation, and improvement are essential for maintaining system reliability and effectiveness. The team's approach of systematically identifying and addressing each category of issue provides a model for other organizations facing similar challenges in deploying LLM-based systems in production.