## Overview
This case study captures a panel discussion organized by Schmidt Futures, featuring Harrison Chase (founder of LangChain) and three organizations actively deploying LLMs in educational contexts. The discussion, moderated by Kumar (a Vice President at Schmidt Futures who previously served as President Obama's EdTech lead), provides valuable insights into how LLMs are being operationalized in production educational technology environments. The panel represents a cross-section of the edtech ecosystem: a nonprofit direct-to-classroom tool (Podzi), an academic research project (ITEL at Vanderbilt), and a data science competition platform (The Learning Agency Lab).
Harrison Chase provided context on LangChain as a developer toolkit available in Python and TypeScript, emphasizing its role in connecting LLMs to external data sources and computation. He highlighted that while language models are powerful, they have significant shortcomings including lack of access to user-specific data, inability to connect to up-to-date information, and challenges with personalization. LangChain provides tooling to address these gaps through chaining, reflection techniques, and integration capabilities.
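To make the "connecting LLMs to external data" idea concrete, here is a minimal retrieval-augmented chain sketched in the classic (circa-2023) LangChain Python API. Module paths and constructors have shifted across LangChain versions, the document texts are invented, and an `OPENAI_API_KEY` plus the `faiss-cpu` package are assumed; treat it as an illustration of the pattern, not a drop-in snippet.

```python
# Minimal sketch: ground a chat model in external documents via a vector store.
# Classic (circa-2023) LangChain API; module paths vary across versions.
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA

# External data the base model has never seen (e.g., a course syllabus).
docs = [
    "Unit 3 covers solving two-step linear equations.",
    "The midterm on linear equations is scheduled for week 6.",
]

# Embed the documents and index them in an in-memory vector store.
vectorstore = FAISS.from_texts(docs, OpenAIEmbeddings())

# Chain the chat model to a retriever so answers are grounded in the docs.
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0),
    retriever=vectorstore.as_retriever(),
)

print(qa.run("When is the midterm, and what does it cover?"))
```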
## Podzi: Production LLM Deployment for Personalized Learning
Podzi, founded by Joshua Ling (a former middle school math teacher turned software developer), is a nonprofit edtech organization that has been experimenting with LLMs in production for approximately six months at the time of the discussion. Their platform provides personalized spaced repetition review for students, inspired by a 2014 paper showing that personalized spaced-out review can improve learning outcomes by 16.5 percent over a semester.
**Current Production Features:**
Podzi uses OpenAI's GPT-3.5 Turbo chat model for several production features. The first is question generation from background text, where teachers can paste educational content and the system generates five questions for selection. Each selected question then triggers an additional API call to generate corresponding answers. This represents a practical implementation of LLMs in a teacher-facing workflow where human oversight is maintained.
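A rough sketch of that two-step workflow is shown below, using the pre-1.0 `openai` Python client that was current at the time. The prompts and function names are illustrative assumptions, not Podzi's actual implementation.

```python
# Sketch of the two-call workflow: generate candidate questions from pasted
# background text, then one follow-up call per teacher-selected question to
# produce its answer. Pre-1.0 `openai` client; prompts are illustrative only.
import openai

def chat(prompt: str) -> str:
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3,
    )
    return resp["choices"][0]["message"]["content"]

def generate_questions(background_text: str) -> str:
    # First call: five candidate questions for the teacher to choose from.
    return chat(
        "Write five review questions based on this passage, one per line:\n\n"
        + background_text
    )

def generate_answer(background_text: str, question: str) -> str:
    # Second call, triggered per selected question.
    return chat(
        f"Using only this passage:\n{background_text}\n\n"
        f"Write a concise, correct answer to: {question}"
    )
```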
The second feature, still in alpha and not yet released to production, is a tutoring chatbot designed similarly to Khan Academy's Khanmigo. This chatbot allows students to ask clarifying questions after answering problems, using Socratic dialogue to address misconceptions.
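The Socratic constraint is essentially a prompting pattern. A minimal sketch follows, again with the pre-1.0 `openai` client; the system prompt is an assumption for illustration and does not reflect Podzi's or Khanmigo's actual prompts.

```python
# Minimal sketch of a Socratic follow-up chatbot: the system prompt instructs
# the model to probe misconceptions with questions rather than give answers.
import openai

SYSTEM = (
    "You are a Socratic math tutor. Never state the final answer. "
    "Ask one short guiding question at a time to surface the student's misconception."
)

def tutor_turn(history: list[dict], student_message: str) -> str:
    history.append({"role": "user", "content": student_message})
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "system", "content": SYSTEM}] + history,
    )
    reply = resp["choices"][0]["message"]["content"]
    history.append({"role": "assistant", "content": reply})
    return reply
```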
**Technical Challenges and LangChain Integration:**
Podzi encountered significant challenges when attempting to use LLMs for math problem variation generation. The core use case involves automatically generating variations of math problems so students don't simply memorize specific answers. For example, a problem like "5x + 7 = 2x - 40" might need multiple variations with different numbers while maintaining the same mathematical structure.
The team discovered that base LLMs are unreliable at solving math problems accurately. They experimented with chain-of-thought prompting by adding "think through this step by step" to prompts, which improved accuracy for some problems but failed with larger numbers. This led them to explore LangChain's tool integration capabilities, specifically:
- Basic calculator tools
- Program Aided Language Models (PAL), which represent math problems as computer programs and execute them through a code interpreter
- Wolfram Alpha integration for more complex mathematical operations
The team is iterating within LangFlow (a visual, drag-and-drop builder for LangChain pipelines) to find consistently accurate solutions for answer generation.
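The three tool-calling approaches listed above can be sketched in the classic (circa-2023) LangChain API as follows. Class locations and constructors have since moved between versions, the Wolfram Alpha tool requires a `WOLFRAM_ALPHA_APPID` environment variable, and the example equation is borrowed from the discussion; this is an illustration of the pattern, not Podzi's code.

```python
# Sketch of calculator, PAL, and Wolfram Alpha tool use in classic LangChain.
from langchain.chat_models import ChatOpenAI
from langchain.chains import LLMMathChain, PALChain
from langchain.agents import initialize_agent, load_tools, AgentType

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

# 1. Calculator: the LLM emits an arithmetic expression, Python evaluates it.
calc = LLMMathChain.from_llm(llm)
print(calc.run("What is -47 divided by 3?"))  # the arithmetic from solving 5x + 7 = 2x - 40

# 2. PAL: the LLM writes a short program; a code interpreter produces the answer.
pal = PALChain.from_math_prompt(llm)
print(pal.run("Solve for x: 5x + 7 = 2x - 40"))

# 3. Agent with Wolfram Alpha for heavier symbolic work (needs WOLFRAM_ALPHA_APPID).
tools = load_tools(["llm-math", "wolfram-alpha"], llm=llm)
agent = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION)
print(agent.run("Generate a variation of 5x + 7 = 2x - 40 with different integers and solve it."))
```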
**Teacher-in-the-Loop Architecture:**
A notable aspect of Podzi's approach is their deliberate decision to keep teachers involved in the personalization loop rather than allowing fully autonomous student-LLM interactions. The chatbot is not yet released because they're developing features for teachers to monitor conversations and intervene when necessary. This represents a thoughtful production architecture that acknowledges LLM limitations around accuracy and hallucination while still leveraging the technology's capabilities.
The team envisions using LangChain to enable SQL queries against their student response database, allowing LLMs to serve as an intelligent layer that can direct teachers to the most critical areas for intervention—identifying common misconceptions, flagging struggling students, and generating actionable reports.
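A hedged sketch of that envisioned pattern is below: the LLM translates a teacher's natural-language question into SQL over the student-response database. It uses the classic LangChain `SQLDatabaseChain` (later moved to `langchain_experimental`); the SQLite path and implied table layout are invented for illustration.

```python
# Sketch: natural-language questions over a student-response database.
from langchain.chat_models import ChatOpenAI
from langchain.sql_database import SQLDatabase
from langchain.chains import SQLDatabaseChain

db = SQLDatabase.from_uri("sqlite:///student_responses.db")  # hypothetical database
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

chain = SQLDatabaseChain.from_llm(llm, db)
print(chain.run(
    "Which three questions had the lowest correct-answer rate this week, "
    "and which students missed all of them?"
))
```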
## ITEL: Academic Research Deployment with LangChain
Wesley Morris, a second-year PhD student at Vanderbilt University, presented ITEL (Intelligent Textbooks for Enhanced Lifelong Learning), a project from the LearnLab, part of AILO, an NSF-funded institute for adult learning. ITEL represents a research-focused deployment scenario with different constraints than commercial products.
**System Architecture:**
ITEL is designed as a domain-agnostic framework for generating intelligent textbooks. Developers provide informative text in MDX format, and the system automatically generates an interactive textbook. The primary user-facing feature at the time of the discussion was summary writing evaluation at the end of each textbook section.
The system uses two Longformer models (a specific transformer architecture suited for long documents) trained to evaluate summaries on two dimensions: content (how well the summary reflects source material) and wording (how well the summary paraphrases rather than plagiarizes). The evaluation pipeline includes a junk filter, plagiarism detection, and quality scoring.
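For readers unfamiliar with the setup, the sketch below shows how one of the two dimensions (content) might be scored with a fine-tuned Longformer regressor via Hugging Face `transformers`. The checkpoint name is hypothetical, and the junk filter and plagiarism detection steps are omitted; this is not ITEL's actual code.

```python
# Sketch: score a summary against its source text with a Longformer regressor.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

CHECKPOINT = "my-org/longformer-summary-content"  # hypothetical fine-tuned model
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=1)
model.eval()

def score_content(source_text: str, summary: str) -> float:
    # Longformer handles long inputs, so summary and source fit in one pass.
    inputs = tokenizer(summary, source_text, truncation=True,
                       max_length=4096, return_tensors="pt")
    with torch.no_grad():
        return model(**inputs).logits.squeeze().item()
```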
The platform collects multiple data streams for research purposes: user behavior data, keystroke logging during summary writing, and attention-to-text metrics measuring time spent on each textbook subsection.
**LangChain Integration Journey:**
Wesley provided a candid assessment of the development experience before and after LangChain adoption. Prior to LangChain, question generation required complex regular expression parsing to handle LLM output—described as a "nightmare of regular expressions." LangChain addressed this through several features:
*Prompt Templates:* Simplified the process of feeding prompts to the LLM with consistent formatting.
*Response Schemas:* A critical feature for structured output. For question generation, the team defined schemas for recall, summarization, and inference questions, each with fields for the question text, correct answer, and incorrect answer. This structured approach enables automatic conversion to DataFrames for downstream processing and model training.
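A minimal sketch of this structured-output pattern in classic LangChain follows. The schema fields paraphrase the description above rather than copy the ITEL team's definitions, and the passage text is invented.

```python
# Sketch: structured question generation with response schemas and a prompt template.
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.output_parsers import ResponseSchema, StructuredOutputParser

schemas = [
    ResponseSchema(name="question", description="A recall question about the passage"),
    ResponseSchema(name="correct_answer", description="The correct answer"),
    ResponseSchema(name="incorrect_answer", description="A plausible wrong answer"),
]
parser = StructuredOutputParser.from_response_schemas(schemas)

prompt = PromptTemplate(
    template=("Write one recall question for this passage.\n"
              "{format_instructions}\n\nPassage:\n{passage}"),
    input_variables=["passage"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
output = llm.predict(prompt.format(
    passage="Photosynthesis converts light energy into chemical energy."
))
row = parser.parse(output)  # plain dict, easy to append to a DataFrame
```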
*Vector Stores for Embeddings:* Used for similarity analysis between student summaries and textbook subsections. The implementation uses a lightweight embedding model (MiniLM) for performance. The similarity scores enable two use cases: providing students feedback about which sections they may have missed in their summary, and research analysis correlating time-on-text with summary content.
The embedding-based similarity approach is particularly elegant: the system can identify which specific subsection a student's summary is most similar to, then cross-reference this with how long the student spent reading that section. This enables both automated feedback ("you may want to revisit section 3.2") and research insights into reading behavior and learning outcomes.
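The sketch below illustrates the subsection-similarity idea with MiniLM embeddings in a LangChain vector store. The subsection texts, IDs, and the simple "flag everything but the best match" heuristic are assumptions for illustration; ITEL's actual feedback logic is not described at that level of detail.

```python
# Sketch: embed textbook subsections with MiniLM, then find which subsection a
# student summary most resembles (and which ones it may have missed).
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

subsections = {
    "3.1": "Working memory holds a small amount of information for a short time...",
    "3.2": "Spaced practice strengthens long-term retention of material...",
}
store = FAISS.from_texts(
    list(subsections.values()),
    embeddings,
    metadatas=[{"id": key} for key in subsections],
)

summary = "The chapter says reviewing material over time helps you remember it."
hits = store.similarity_search_with_score(summary, k=len(subsections))
covered = {doc.metadata["id"] for doc, _ in hits[:1]}  # best-matched subsection
missing = set(subsections) - covered                   # candidates for feedback
print("Consider revisiting:", sorted(missing))
```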
**Adult Learner Considerations:**
A key architectural difference from Podzi is that ITEL assumes minimal instructor engagement since it targets adult learners. This means the system must be more autonomous—"plug and play"—which raises the stakes for accuracy and creates interest in reinforcement learning approaches where instructors can provide spot-check feedback (thumbs up/down on evaluation results) to continuously improve the model.
## The Learning Agency Lab: Dataset and Competition Platform
Perpetual Baffour from The Learning Agency Lab presented a different approach to LLM deployment: rather than building end-user products, the Lab creates open datasets and runs data science competitions to crowdsource LLM-based solutions for educational challenges.
**Competition Methodology:**
The Lab partners with education-focused organizations (nonprofits, public agencies) to identify specific research questions or data assets that could benefit from AI/ML/NLP solutions. They translate educational problems into machine learning tasks (classification, recommendation, automatic grading) and then run Kaggle competitions to crowdsource solutions.
**Feedback Prize Case Study:**
The flagship project, Feedback Prize, addressed the problem of students receiving insufficient feedback on their writing, with data showing few students graduate high school as proficient writers. The competition series had three phases:
- Building LLM solutions to identify different argumentative components in student essays
- Evaluating the quality of these argumentative components
- Focusing on language proficiency for English language learners
The competitions generated over 100,000 solutions, with winning solutions achieving human-level accuracy. Notably, winning solutions were primarily ensembles of transformer models (the neural architecture underlying modern large language models).
The resulting models can segment essays into argumentative components, label them across seven categories (lead, position/thesis, claims, evidence, etc.), and evaluate the quality of each component (effective, adequate, ineffective).
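The task described is typically framed as token classification: each word in an essay gets a discourse label, and an ensemble averages the members' predicted probabilities. The sketch below shows that framing with `transformers`; the checkpoint names are hypothetical (assumed to share one tokenizer so token positions align), and winning Kaggle solutions are considerably more involved.

```python
# Sketch: ensemble token classification for essay discourse elements.
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Hypothetical fine-tuned checkpoints sharing a tokenizer.
CHECKPOINTS = ["my-org/longformer-feedback-a", "my-org/longformer-feedback-b"]

def ensemble_token_labels(essay: str) -> np.ndarray:
    tokenizer = AutoTokenizer.from_pretrained(CHECKPOINTS[0])
    inputs = tokenizer(essay, truncation=True, return_tensors="pt")
    member_probs = []
    for name in CHECKPOINTS:
        model = AutoModelForTokenClassification.from_pretrained(name)
        model.eval()
        with torch.no_grad():
            logits = model(**inputs).logits        # (1, seq_len, num_labels)
        member_probs.append(torch.softmax(logits, dim=-1).numpy())
    # Average class probabilities across ensemble members, then take the
    # argmax to get one discourse label (lead, claim, evidence, ...) per token.
    return np.mean(member_probs, axis=0).argmax(-1)
```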
**Production Considerations:**
The Lab emphasizes three key metrics for production-ready models: accuracy, efficiency, and fairness. They run "efficiency tracks" in competitions that reward simple, fast models that retain high accuracy, recognizing that classroom deployment requires computational efficiency.
Fairness is particularly emphasized given education's role in serving diverse student populations. The Lab takes deliberate steps to minimize bias in datasets through diverse data collection and competition designs that address algorithmic bias. For example, they partnered with Learning Equality's Kolibri platform to source multilingual learning materials for recommendation engine competitions.
The Lab positions itself as a bridge between competition-winning models and production deployment, consulting with product developers on user-centric approaches for integrating LLM solutions.
## Cross-Cutting Technical Themes
**Personalization Architecture:**
Harrison Chase prompted extensive discussion on personalization—how to tailor educational experiences to individual students while maintaining accuracy. The consensus approach emphasized maintaining human oversight. Podzi keeps teachers in the loop for intervention; ITEL collects data to enable future adaptive features; The Learning Agency Lab builds recommendation engines that could personalize content delivery.
The technical approaches mentioned for personalization include:
- Storing and querying user history (past interactions, performance data)
- Using SQL queries against databases to surface relevant student information
- Agent-based architectures that could simulate student behavior for instructor practice (see the sketch after this list)
- Multi-agent systems where teacher and student agents interact for training purposes
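The simulated-student idea can be prototyped with little more than a persona prompt and a chat loop. The sketch below uses the pre-1.0 `openai` client; the persona, the seeded misconception, and the interactive loop are all invented for illustration and were not described in this form by any panelist.

```python
# Sketch: an LLM plays a student with a seeded misconception while a trainee
# teacher practices explaining it away.
import openai

STUDENT_PERSONA = (
    "You are a 7th grader solving 5x + 7 = 2x - 40. You believe you can drop "
    "the x terms and just subtract 7 from -40. Stay in character, make "
    "realistic mistakes, and only change your mind if the teacher's "
    "explanation genuinely addresses your misconception."
)

def simulate_session(turns: int = 3) -> None:
    history = [{"role": "system", "content": STUDENT_PERSONA}]
    for _ in range(turns):
        reply = openai.ChatCompletion.create(
            model="gpt-3.5-turbo", messages=history
        )["choices"][0]["message"]["content"]
        print("STUDENT:", reply)
        teacher = input("TEACHER: ")  # trainee types their explanation
        history += [
            {"role": "assistant", "content": reply},
            {"role": "user", "content": teacher},
        ]
```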
**Chat Interface Limitations:**
The panel discussed chat-based interfaces critically. Joshua from Podzi noted that overly open chat interfaces lead to off-topic conversations (students asking "how was your day?"). The preferred approach is structured interactions where LLMs generate reports or summaries, with chat available for follow-up questions. This represents a more controlled UX pattern that constrains LLM behavior while preserving interactivity.
An alternative use case mentioned was "co-pilot for the teacher"—using LLMs for sentiment analysis, feedback generation, and optimizing teacher time allocation rather than direct student interaction.
**Simulation for Training:**
The panel briefly discussed using LLMs to simulate students for teacher training, allowing instructors to practice explaining misconceptions in low-stakes environments before facing real students. This represents an interesting application of multi-agent systems for educational purposes, though the panel acknowledged this is largely unexplored territory.
## Critical Assessment
The discussion provides valuable insights but should be viewed with appropriate caveats. All presenters are early in their LLM deployment journeys (weeks to months of experience), and several features described are still in alpha or not yet released to production. The technical challenges around math accuracy, hallucination, and maintaining educational quality are acknowledged but not yet solved.
The emphasis on human-in-the-loop architectures is pragmatic given current LLM limitations, but it also means these systems may not yet deliver on the full promise of personalized, scalable educational intervention. The production deployments described are relatively narrow in scope—question generation, summary evaluation, writing feedback—rather than full tutoring systems.
That said, the panel demonstrates thoughtful approaches to operationalizing LLMs in high-stakes educational contexts, with appropriate attention to accuracy, fairness, and user experience. The combination of academic research (ITEL), applied nonprofit work (Podzi), and dataset/competition infrastructure (Learning Agency Lab) represents a healthy ecosystem for advancing LLMs in education responsibly.