Company: PyCon
Title: Automating Community Conference Operations with AI Coding Agents
Industry: Tech
Year: 2025
Summary (short):
A volunteer-run conference organization (PyData/PyConDE) with events serving up to 1,500 attendees faced significant operational overhead in managing tickets, marketing, video production, and community engagement. Over a three-month period, the team experimented with various AI coding agents (Claude, Gemini, Qwen Coder Plus, Codex) to automate tasks including LinkedIn scraping for social media content, automated video cutting using computer vision, ticket management integration, and multi-step workflow automation. The results were mixed: while AI agents proved valuable for well-documented API integration, boilerplate code generation, and specific automation tasks like screenshot capture and video processing, they struggled with multi-step procedural workflows, data normalization, and maintaining code quality without close human oversight. The team concluded that AI agents work best when kept on a "short leash" with narrow use cases, frequent commits, and human validation: the agents delivered real time savings for generalist tasks, but required careful expectation management and fell well short of the "10x productivity" improvements often claimed.
## Overview and Context

This case study presents a practical, experience-based exploration of using AI coding agents in production for automating community conference operations. The speaker, part of the volunteer team running PyConDE and PyData conferences (events with up to 1,500 attendees), describes a three-month experiment with various AI coding agents to address operational challenges. The presentation offers a refreshingly honest and balanced assessment of what worked, what didn't, and the practical realities of deploying LLM-based tools in production scenarios.

The context is important: this is a volunteer-run operation where time is precious and must not be wasted on bureaucratic tasks like manual data entry or copy-pasting between Excel spreadsheets. The team needed to focus on high-value activities like improving talk quality and attendee experience rather than administrative overhead. This created strong motivation to explore automation through AI agents, particularly given the team's technical expertise in AI and their ability to navigate the rapidly evolving landscape of LLM tools.

## The Problem Space

The conference operations required automation across multiple domains:

**Ticketing and Access Management:** The team needed to issue and manage speaker tickets, organizer tickets, and grant tickets, with constant changes as people canceled or joined. This involved coordination with ticketing systems and ensuring proper access for approximately 1,500 attendees.

**Marketing and Reporting:** Sponsor relationships required marketing reports about attendee demographics and employee participation. The team also needed to monitor and collect social media content, particularly LinkedIn posts praising the conference (they collected over 250 such posts), without manually taking screenshots and organizing them.
**Content Production and Promotion:** For 120+ conference talks, the team needed to create promotional media, update materials when speakers changed talk titles, and manage all associated online presence. This content needed to be distributed across multiple channels.

**Video Production Pipeline:** Perhaps most ambitiously, the team wanted to automate video processing—cutting recordings from livestreams into individual talk videos, generating descriptions, notifying speakers when videos were published, and auto-posting to platforms. They achieved "rough cuts" of videos from the previous day available almost immediately, which represents significant operational value.

**Community Management:** Onboarding 1,500 people to Discord and managing ongoing community operations required automation to be feasible at volunteer scale.

## Tools and Technologies Deployed

The team experimented with a diverse set of AI coding agents and supporting tools over the three-month period:

**Primary LLM Models:**

- **Claude (Claude Code):** Consistently described as the best performer, particularly for code generation and API integration tasks
- **Gemini:** Mixed results; the speaker noted it was "fast but didn't really work well for me many times"
- **Qwen Coder Plus:** Accessed via Alibaba Singapore, described as "also good" but with expensive API calls
- **Qwen Coder (local):** A quantized model that could run locally on Apple Silicon MacBooks
- **Codex (OpenAI):** Used but not extensively detailed in the presentation

**Supporting Tools and Frameworks:**

- **Roo:** A Visual Studio Code framework providing different AI agent roles (architect, coder, debugger), though the speaker noted these were essentially different system prompts
- **Code Flow:** Attempted but abandoned, as the speaker couldn't discern any clear value over working with Claude directly
- **GitHub Copilot:** Explicitly canceled as a "waste of time," with the speaker noting that many companies with Copilot adoption report it's "not that useful"
- **Various MCP (Model Context Protocol) servers:** The speaker notes there's enthusiasm around MCP but questions whether anyone actually tests how well these servers perform

**Integration Points:**

- LinkedIn API and web scraping tools
- Pretix (a ticketing system with a well-documented REST API)
- YouTube for video publishing
- Discord for community management
- The Rich library for fancy command-line interfaces
- Computer vision libraries for video processing

## What Worked Well

The speaker's assessment reveals specific scenarios where AI agents delivered clear value:

**Well-Documented API Integration:** Tasks involving APIs with comprehensive documentation went smoothly. The Pretix ticketing system's REST API integration was cited as an example where the agent "was actually really good at that." The agents could navigate documentation and generate working integration code efficiently.

**LinkedIn Automation:** One of the standout successes was automating LinkedIn scraping. Despite LinkedIn being "pretty hard to scrape," the agent proved surprisingly capable at "mimicking the user interaction and taking the screenshots." This automated the collection of 250+ LinkedIn posts about the conference, replacing what would have been tedious manual work. Additionally, posting to LinkedIn with pictures and links in comments (using multiple API versions) worked well despite the API complexity.

**Video Processing with Computer Vision:** Automated video cutting using computer vision to identify break slides in the recording worked effectively. The speaker, not a computer vision expert, could leverage the agent to generate boilerplate code for a processing pipeline that identified transitions between talks and automatically created individual video segments. This enabled rapid turnaround, with rough cuts available within a day.
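The core of a break-slide cut detector is small: compare each sampled frame against a reference image of the break slide and cut where matching runs end. The sketch below is illustrative, not the team's actual pipeline; it assumes frames are already decoded (e.g. one per second via OpenCV or ffmpeg) to arrays of the same resolution as the reference slide, and the function names and threshold are made up.

```python
# Hypothetical sketch of break-slide detection for cutting a livestream
# recording into per-talk segments. Frames and the reference slide are
# same-shape uint8 arrays; names and the threshold are illustrative.
import numpy as np

def matches_break_slide(frame: np.ndarray, slide: np.ndarray,
                        threshold: float = 10.0) -> bool:
    """A frame 'is' the break slide if mean absolute pixel difference is tiny."""
    diff = np.abs(frame.astype(np.int16) - slide.astype(np.int16))
    return float(diff.mean()) < threshold

def talk_boundaries(is_break: list[bool], fps: float = 1.0) -> list[tuple[float, float]]:
    """Turn per-sample break flags into (start, end) times of talk segments."""
    segments, start = [], None
    for i, brk in enumerate(is_break):
        if not brk and start is None:
            start = i / fps                    # talk begins when the slide disappears
        elif brk and start is not None:
            segments.append((start, i / fps))  # talk ends at the next break slide
            start = None
    if start is not None:                      # recording ended mid-talk
        segments.append((start, len(is_break) / fps))
    return segments
```

The resulting (start, end) pairs map directly onto ffmpeg cut commands, which is enough for the "rough cuts" described above.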
**Boilerplate Code Generation:** When the problem was well-defined and the agent could draw on common patterns, boilerplate generation was efficient. This included procedural code following "do this, then that, then that" patterns where the structure was clear.

**Translation and Specialized Tasks:** Converting an English ebook to German audio worked well, particularly when fine-tuning translation with custom dictionaries for pronunciation. This demonstrated the value for specialized but well-bounded tasks where the agent could handle domain-specific adjustments (like foreign word pronunciation in German).

**Adding Features to Existing Applications:** Extending the video application with additional features reportedly worked well, suggesting agents can be effective at incremental development when context is constrained.

**Refactoring with Guidance:** When explicitly asked to "refactor this to idiomatic Python," agents could provide value, though this required going "back and forth with different jobs" and using multiple models (sub-agents).

## What Didn't Work Well

The presentation offers equally valuable insights into failure modes and limitations:

**Multi-Step Procedural Workflows:** Surprisingly, tasks that seemed straightforward for humans—like the four-step video release process (get videos, get metadata, publish to YouTube, perform follow-up actions)—"did not work well unexpectedly." The agents struggled to maintain consistent processing pipelines and track state across multiple distinct steps.

**Data Normalization:** This was "really really bad" according to the speaker. The task of normalizing employer names from attendee data (where different people write company names differently) should theoretically be easy for language models, but the agents failed at this basic data-cleaning task.
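For contrast, much of the normalization task can be expressed as a few deterministic rules. The baseline below is hypothetical hand-written code, not the team's; the suffix list and rules are illustrative:

```python
# Hypothetical baseline for collapsing free-text employer names, so that
# variants like "Example GmbH" and "EXAMPLE, Inc." map to the same key.
# Suffix list and rules are illustrative, not the team's actual code.
import re

LEGAL_SUFFIXES = r"\b(gmbh|ag|inc|ltd|llc|se|kg|ug)\b\.?"

def normalize_employer(name: str) -> str:
    """Canonicalize an employer name for grouping/counting."""
    n = name.strip().lower()
    n = re.sub(LEGAL_SUFFIXES, "", n)   # drop legal-form suffixes
    n = re.sub(r"[^\w\s]", " ", n)      # punctuation to spaces
    n = re.sub(r"\s+", " ", n).strip()  # collapse whitespace
    return n
```

A rule set like this handles the bulk of variants; an LLM would only be needed for the genuinely ambiguous remainder, which is arguably how the task should have been decomposed in the first place.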
**Data Analytics Without Guardrails:** The speaker expresses "nightmares when people buy software with AI agents" for data analytics, because while they produce "fancy dashboards," the underlying data quality is questionable. An example showed a visualization of German attendee locations that looked impressive but only represented one-third of the data, with no indication whether it was a representative sample.

**CI/CD Pipeline Configuration:** Setting up security scans in Azure pipelines, despite abundant documentation and examples available for training, "didn't really work well unexpectedly."

**Exotic or Uncommon API Patterns:** When APIs deviated from common patterns or had unusual designs, agent performance degraded significantly compared to well-documented standard REST APIs.

**Process State Management:** The simple pattern of tracking processing steps (translated to German: done; converted to speech: done; if failed: mark as failed) that appears in every basic data pipeline "did not work at all." Agents seemed unaware this was an important pattern.

**Code Without Configuration Separation:** Agents consistently tried to solve problems directly in code rather than using proper configuration patterns, always mixing configuration with implementation.

**Avoiding Redundancy:** Agents would reproduce the same code multiple times when asked similar questions three times, rather than abstracting common functionality. They "always add, they add to the problem" and "basically never delete" code unless explicitly asked to refactor.

**Context Awareness:** Agents struggled to maintain appropriate context even when asked to focus on specific problems. Kitchen renovation analogies illustrated how agents couldn't effectively scope work to just the relevant components.
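The state-management pattern described above (mark each step done or failed per item, skip finished work on rerun) is small when written by hand. A minimal sketch, with illustrative step names and file path:

```python
# Minimal checkpointed pipeline of the kind described above. Step names
# and the state file path are illustrative, not the team's actual code.
import json
from pathlib import Path

STEPS = ["translate", "text_to_speech", "publish"]
STATE_FILE = Path("pipeline_state.json")  # hypothetical location

def run_pipeline(items, impl, state=None):
    """impl maps step name -> callable(item); returns the updated state dict."""
    if state is None:
        state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    for item in items:
        record = state.setdefault(item, {})
        for step in STEPS:
            if record.get(step) == "done":
                continue                 # finished on a previous run
            try:
                impl[step](item)
                record[step] = "done"
            except Exception:
                record[step] = "failed"
                break                    # skip later steps for this item
    return state

def save_state(state):
    STATE_FILE.write_text(json.dumps(state, indent=2))
```

Because only steps marked "done" are skipped, a rerun automatically retries failed steps once the underlying problem is fixed, which is exactly the resumability the agents never produced unprompted.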
## Critical Insights on LLM Behavior in Production

The speaker offers several keen observations about LLM behavior that affect production deployments:

**The "Augmented Arrogance" Problem:** A recurring theme is that models exhibit what the speaker terms "augmented arrogance"—always claiming solutions are "amazing" and "awesome" even when they're not. Attempts to prompt the model to ask for help when stuck or admit uncertainty failed. The models are designed to always appear confident and provide positive feedback ("really cool solution," "that's a really good point"), which the speaker argues is actually harmful for learning and development. The inability to get constructive criticism from the model represents a significant limitation.

**The Parrot Mental Model:** The speaker advocates thinking of LLMs as "fancy parrots" even when deployed in agentic frameworks. This mental model helps set appropriate expectations—they can repeat patterns they've seen but lack genuine oversight or understanding of program architecture. "Many large language models? Many parrots." This framework helps explain both the successes (pattern matching on well-documented APIs) and the failures (novel multi-step coordination).

**Model Updates and Alchemy:** The production environment is inherently unstable because most models are accessed via API behind proprietary walls. "Magic formulas" that work today might fail tomorrow after model updates. The speaker acknowledges this returns developers to an "alchemist" mode but notes that even Newton was an alchemist while advancing physics—it's workable but frustrating.

**Quality Degradation Concerns:** The speaker suspects model quality may be degrading over time, hypothesizing this is a business strategy: "First you try to get the market, you do offerings, but inference at that level is expensive, and one way to not burn that much money is to decrease the quality." This represents a serious concern for production systems dependent on external model providers.
**The Lazy Thinking Trap:** Agents can make developers "lazy thinkers": instead of reading and understanding the generated code, developers keep trying to persuade the system with more prompts. This is exacerbated by the sheer volume of code agents produce—reading it all becomes impractical, creating an incentive to just prompt more.

**The "Best Intern" Analogy:** Agents behave like "the over-delivering super intern which is probably a little bit over-motivated"—they add features nobody asked for (like fancy Rich command-line interfaces with many configuration options that don't actually work) and need close supervision to stay on track.

## Best Practices and Operational Recommendations

Based on their three-month experience, the team developed concrete operational practices:

**Keep Agents on a Short Leash:** The most emphasized recommendation is maintaining tight control. Narrow down use cases, establish clear guardrails, and avoid letting agents run freely on open-ended tasks.

**Commit Often:** Frequent git commits (or instructing the agent to commit often) are essential because agents can go off-track, and you need easy rollback points. This prevents wasting time debugging dead-end approaches.

**Use Sub-Agents:** The team found value in orchestrating multiple models for different tasks. While Claude Code was best as the primary coding agent, calling Gemini via API as a sub-agent for specific tasks was cost-effective and contributed meaningfully despite Gemini's limitations as a primary agent.

**Focus Over Parallelization:** The speaker's advice to their "younger self" is to focus on one project rather than spinning up multiple parallel efforts. The agent frees up mental capacity to work at a higher level (thinking like a project owner about next steps and test strategies), but this requires focus to be effective.

**Read More Than You Prompt:** Counter-intuitively, success with agents requires more reading of generated code than writing of prompts. Developers must understand what was produced to maintain quality and catch problems.

**Choose Necessary Tools Only:** Avoid the temptation to add every new MCP server or tool "because it might help." Critically evaluate what actually provides value rather than following hype.

**Manage Your Own Expectations:** The speaker emphasizes "you are your own worst enemy, you have to manage your own expectations." The ease of prompting creates false confidence, and developers must consciously resist being seduced by initially impressive-looking results.

## Economic Considerations

The case provides useful cost data for production deployments:

**Total Spend:** Approximately €1,000 on subscriptions over the three-month period, across multiple team members experimenting with various tools.

**Value Proposition:** The speaker considers this reasonable given the time savings: "what's a thousand euros if you think how much time we save?"

**Subscription vs. API Costs:** The Claude Max subscription was recommended as "still the best package," offering the most value. API costs were described as "pretty expensive" and could "really add up." The pricing complexity (per-token with caching discounts) makes costs unpredictable: "How should I know which tokens are cached?"

**Creative Cost Management:** The speaker half-jokingly suggests that buying two Claude Max subscriptions might be more economical than paying for API usage in some scenarios, illustrating the pricing complexity.

**Canceled Subscriptions:** GitHub Copilot was explicitly canceled as not worth the cost, and various other experimental subscriptions were discontinued after evaluation.

## Broader Implications and Future Outlook

The speaker offers a thoughtful perspective on the broader trajectory:

**Limited Improvement Expected:** There's skepticism about major near-term improvements; the speaker argues that, based on the past year, more data will not improve the models, and larger model sizes won't help either.
Specialized coding models might help, but the fundamental limitations may persist.

**Re-enabling Generalists:** On the positive side, AI agents enable generalists to work effectively across domains. The speaker draws an analogy to the 1980s/90s "webmasters" who handled databases, websites, and everything in between. While not expert in UI coding, the speaker can now "navigate boilerplate and get stuff done which would take a lot of time" by reading CSS and HTML well enough to guide agents.

**Disrupting Simple SaaS:** The speaker predicts AI agents may replace many simple application providers and add-on software: "They're easy to build and very easily adapted and cheaper to adapt to you than to buy the new agentic software or hire someone to fix your SAP."

**Democratization:** This represents "democratizing coding to many people in many aspects," enabling volunteers and small teams to accomplish what previously required significant development resources.

**Autocomplete Still King:** Conversations with open-source maintainers at EuroSciPy revealed that many still find autocomplete the most helpful AI feature—supporting boring, repetitive typing rather than generating entire implementations.

## Survey Results Context

The presentation included live audience polling that provides useful context:

- **50% of attendees** used AI coding agents as part of their workflow
- **Primary use cases** (in order): writing boilerplate, debugging, refactoring code, documentation
- **Satisfaction levels** were mixed—many said "it brings value to the table but it's not the game changer" portrayed in news coverage
- Very few respondents reported "10x productivity" gains, contradicting common marketing claims

This data reinforces the speaker's balanced assessment and suggests the experience described is representative of the broader community.
## Production Deployment Lessons

Several key lessons emerge for actually deploying LLM-based tools in production:

**Match Tool to Task Complexity:** Well-bounded tasks with clear patterns and good documentation are ideal for AI agents. Complex multi-step workflows with state management and unusual requirements remain challenging.

**Human-in-the-Loop is Essential:** The "PO hat" (project owner) that developers must wear when working with agents isn't just prompt engineering—it's active code review, architectural guidance, and quality control. Production deployment requires this oversight layer.

**Infrastructure Matters:** The team's success with video processing relied on having proper infrastructure (computer vision pipelines, cloud processing). Agents work best when integrated into well-designed systems rather than building everything from scratch.

**Documentation Quality Amplifies Agent Performance:** The stark contrast between well-documented APIs (Pretix) and complex or poorly documented ones (LinkedIn, Azure CI/CD) shows that agent performance correlates strongly with training-data quality and documentation availability.

**Iteration and Learning:** The three-month experimentation period, with multiple tools, cancellations, and refinements, represents the reality of finding what works. Organizations should expect similar learning curves rather than immediate productivity gains.

This case study provides a grounded, practical view of LLMOps in production from a team actually shipping working systems. The balanced assessment of successes and failures, combined with concrete cost data and operational practices, offers valuable guidance for organizations considering similar deployments. The emphasis on managing expectations, maintaining human oversight, and carefully scoping use cases represents mature thinking about LLM deployment that contrasts sharply with vendor marketing narratives.
