Translating AI Evaluations for Education
Written by Ariel Colon and Vivian Chong
At apgard, we’re building evaluation frameworks for youth-centered AI experiences. As AI tools are increasingly adopted in classrooms and other learning settings, it’s not enough for these systems to function and comply with content policies; to be truly safe for children, they must also teach and adapt appropriately, aligning with the educational and developmental needs of every user.
Recently, we partnered with an EdTech company developing an AI-powered immersive educational game for K-12 students. Across product iterations, the team manually checked whether the game worked, but lacked a comprehensive view of the app’s performance and often discovered edge cases only after deployment.
This captures the core of AI reliability: evaluation is not just about technical performance. It’s also about understanding the context in which the AI is deployed. For K-12 education, that means earning trust and aligning with pedagogical and developmental principles.
Where We Started
During discovery, we first mapped the foundation: which models were being used, what behaviors they produced, and how the product team currently tested quality. Since their process included manual playthroughs, there were no aggregated metrics or systematic evaluations.
So, we began from scratch and designed custom evaluation flows for this specific use case. Our goal was to move from ad hoc checks to a repeatable, interpretable evaluation framework that aligns stakeholders and flags when their AI system over- or underperforms for particular grade levels.
Throughout this process, we anchored our work around a few guiding questions:
What are we optimizing for? (e.g. engagement, learning, fairness)
How can we translate those goals into measurable, transparent criteria?
How do we interpret those metrics in ways that improve the AI system toward those optimization goals?
Designing for Engagement
Together with the client, we identified our primary goal: to improve student engagement with the EdTech application’s curriculum. But “engagement” is not a single metric; it’s a balance of curiosity, comprehension, and comfort.
To ground our approach, we referenced traditional pedagogical frameworks and how they were being applied within AI systems. We then focused on a particular principle, ‘managing cognitive load’. Effective learning in games happens when the material is challenging enough to promote growth, yet not so difficult that it leads to frustration or disengagement. Translating this into evaluation terms, we initially focused on four measurable qualities: verbosity, readability, tone, and coherence.
We designed the AI to produce dialogue and explanations that were readable at the appropriate grade level, balanced in length (neither too brief nor too dense), and structured to sustain a natural flow of interaction within the game.
Unlike a traditional tutor, this AI-powered game isn’t teaching students from a particular textbook. Its job is to make the game easy to play, personalize content to match students’ education levels, and sustain participation through adaptive conversation.
We also considered the range of student personas that may interact with the AI-powered game. How does the game handle frustrated students who get stuck in a loop, students attempting to bypass challenges, or other edge cases? Evaluating these scenarios is essential to understanding real-world engagement.
Turning Goals into Criteria
To move from intuition to evidence, we needed to operationalize what student engagement means within the context of an AI platform. For our first evaluation run, we created a matrix pairing measurable traits with simulated personas across middle and high school grade levels.
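The client’s actual matrix isn’t reproduced here, but a minimal sketch of the structure, using hypothetical persona names and the traits introduced above, might look like this:

```python
# Hypothetical sketch of an evaluation matrix: every simulated persona is
# paired with the same set of measurable traits. Persona names and grade
# levels are illustrative, not the client's actual test set.
TRAITS = ["verbosity", "readability", "tone", "coherence"]

PERSONAS = [
    {"name": "engaged 7th grader", "grade": 7},
    {"name": "frustrated 9th grader stuck in a loop", "grade": 9},
    {"name": "11th grader trying to bypass challenges", "grade": 11},
]

# Each (persona, trait) cell is filled with scores from an evaluation run.
evaluation_matrix = {
    (persona["name"], trait): None
    for persona in PERSONAS
    for trait in TRAITS
}
```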
By breaking “engagement” down into these layers, AI EdTech platforms, educators, and caregivers can interpret it within a shared framework.
We also embedded content safety into the evaluation process by considering sensitive themes when selecting which episodes to test. In this context, we focused on episodes that incorporated themes of rebellion, moral dilemmas (e.g. labor rights, racism, and propaganda), and marginalized identities.
Measuring What Matters
Selecting the right measurement tools is critical to performing any meaningful and effective evaluation.
We measured the verbosity of the AI outputs using sentence and word counts, then mapped these to average reading speeds per grade level to estimate the time each round of the game would take, so verbosity could be weighed against gameplay flow.
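As a rough sketch of this step, assuming illustrative words-per-minute reading speeds rather than the client’s actual mapping:

```python
# Sketch of the verbosity check: count sentences and words, then estimate
# reading time from grade-level reading speed. The words-per-minute values
# below are illustrative assumptions, not the client's calibrated mapping.
import re

AVERAGE_WPM_BY_GRADE = {7: 150, 8: 170, 9: 190, 10: 210}  # assumed values

def verbosity_metrics(text: str, grade: int) -> dict:
    word_count = len(text.split())
    sentence_count = max(1, len(re.findall(r"[.!?]+", text)))
    wpm = AVERAGE_WPM_BY_GRADE.get(grade, 180)
    return {
        "word_count": word_count,
        "sentence_count": sentence_count,
        "estimated_reading_seconds": 60 * word_count / wpm,
    }

# A ~200-word turn at an assumed 9th-grade reading speed (~190 wpm) comes
# out to roughly one minute, in line with the acceptance criteria below.
```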
For readability, we applied metrics like Flesch–Kincaid Grade Level and SMOG Index, which together capture sentence complexity and vocabulary difficulty. These indicators helped ensure the AI’s responses were grade-appropriate.
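Both metrics are available off the shelf; for example, a minimal sketch using the open-source textstat package might look like this:

```python
# Sketch of the readability check using the open-source `textstat` package
# (pip install textstat). The thresholds mirror the 9th-grade acceptance
# criteria listed later in this post.
import textstat

def readability_metrics(text: str) -> dict:
    return {
        # Flesch-Kincaid Grade Level maps roughly 1-to-1 to U.S. grade levels.
        "flesch_kincaid_grade": textstat.flesch_kincaid_grade(text),
        # SMOG estimates the years of education needed to understand the text.
        "smog_index": textstat.smog_index(text),
    }

def is_grade_appropriate(text: str, min_grade: float = 8, max_grade: float = 10,
                         max_smog: float = 13) -> bool:
    scores = readability_metrics(text)
    return (min_grade <= scores["flesch_kincaid_grade"] <= max_grade
            and scores["smog_index"] < max_smog)
```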
We leveraged an LLM as a “judge” to categorize tone (supportive, vague, or critical) and to score coherence (i.e. semantic relevance) across consecutive turns.
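The exact judge configuration isn’t shown here, but a minimal sketch of the tone-classification step, with a hypothetical call_llm helper standing in for whichever model API is used, could look like this:

```python
# Sketch of an LLM-as-judge tone check. `call_llm` is a hypothetical helper
# standing in for whichever model API the evaluation pipeline uses; the
# labels mirror the tone categories described above.
TONE_LABELS = {"supportive", "vague", "critical"}

TONE_JUDGE_PROMPT = """You are evaluating one line of dialogue from an
educational game aimed at a grade {grade} student.
Classify the tone of the dialogue as exactly one of: supportive, vague, critical.

Dialogue:
{dialogue}

Answer with only the label."""

def judge_tone(dialogue: str, grade: int, call_llm) -> str:
    response = call_llm(TONE_JUDGE_PROMPT.format(grade=grade, dialogue=dialogue))
    label = response.strip().lower()
    # Fall back to "vague" when the judge returns an unexpected label.
    return label if label in TONE_LABELS else "vague"
```

Coherence can be handled with the same judge pattern, asking for a 0-to-1 semantic-relevance score between consecutive turns.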
The illustration below demonstrates how we incorporated these evaluations for the immersive AI EdTech history game, simulating a student role-playing an opposing perspective on the U.S. 19th Amendment in 1919. Content warning: sensitive language below.
Note: The above scores are based on evaluating the EdTech platform’s AI outputs, which are hidden from view to protect our client. We evaluated these outputs at a 9th-grade level, with the following acceptance criteria:
Verbosity: Turns lasting one minute or less (~200 words or less)
Readability: Flesch-Kincaid score 8–10, mapping 1-to-1 with U.S. reading grade level, and SMOG index <13, indicating a high school level of vocabulary difficulty
Tone: Overall supportive of the student
Coherence: 0.70 or above, indicating the degree of semantic relevance between consecutive text outputs
These thresholds defined pass (green) and fail (red) for each criterion.
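Putting the pieces together, a sketch of how these acceptance criteria might be encoded as a simple pass/fail check (the metric names are assumed; the thresholds come from the list above):

```python
# Sketch of a pass/fail check encoding the 9th-grade acceptance criteria above.
# Metric names are assumed inputs from the earlier measurement steps; the
# threshold values come directly from the list in this post.
def evaluate_turn(metrics: dict) -> dict:
    return {
        # Verbosity: turns lasting one minute or less (~200 words or less)
        "verbosity": metrics["word_count"] <= 200,
        # Readability: Flesch-Kincaid 8-10 and SMOG index below 13
        "readability": (8 <= metrics["flesch_kincaid_grade"] <= 10
                        and metrics["smog_index"] < 13),
        # Tone: overall supportive of the student
        "tone": metrics["tone_label"] == "supportive",
        # Coherence: semantic relevance of 0.70 or above across consecutive turns
        "coherence": metrics["coherence_score"] >= 0.70,
    }

# Each True corresponds to a green (pass) cell and each False to a red (fail)
# cell in the evaluation report.
```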
A Key Learning
Throughout this process, we’ve observed that while text structure tends to be consistent, large language models often default to more advanced vocabulary and longer outputs even when instructed otherwise. This subtle linguistic inflation increases cognitive load and can reduce engagement. The challenge lies in balancing sentence structure, word choice, sentiment, and natural tone so that content feels engaging, grade-appropriate, and authentic. Subject matter experts such as educators, curriculum designers, and child development specialists can help define this balance throughout the development process.
Looking Ahead
Translating AI evaluations into educational contexts requires looking beyond standard natural language processing (NLP) metrics. It involves integrating pedagogical principles, content safety considerations, and engagement measures into the evaluation loop.
In our next round of evaluations, we will incorporate additional pedagogical principles, such as adapting to learner profiles and broadening the emotional context of student personas. We will also move from single-turn to multi-turn testing, capturing session-level engagement through average session length, completion rates, and sustained participation. Beyond measuring engagement, we aim to assess how the AI system adapts to support the user’s development of specific skills and competencies over time, linking user experience to meaningful learning outcomes. This next phase will help us understand how AI supports not just momentary interactions but sustained learning and growth, more closely simulating real-world classroom experiences.

If you’re interested in partnering with us for future AI EdTech evaluations, we’d love to hear from you!

