You’ve watched generative AI rush into classrooms since late 2022, when ChatGPT vaulted from research to mainstream and reshaped how novices approach code. Educators have wrestled with a split picture: meta-analyses and systematic reviews often find small-to-moderate gains when AI is scaffolded into instruction, especially for debugging and comprehension; yet instructors also report over-reliance, uneven accuracy, and skill atrophy when tools are used as solvers rather than tutors. Into that debate comes a new, open-access study from the University of Tartu with a clear headline finding: students who reported using AI chatbots more frequently tended to earn lower grades in a first-year object-oriented programming course.
The study at a glance
Researchers Marina Lepp and Joosep Kaimre surveyed 231 of 323 enrolled students midway through a 16-week Java OOP course. The survey captured how often and why students used AI assistants, alongside attitudes about helpfulness. The team linked responses to course assessments (two programming tests, a final exam, and total points) and ran Spearman correlations after confirming scores weren’t normally distributed. In plain terms: they asked who uses chatbots, for what, and whether those habits align with stronger—or weaker—marks.
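To make the method concrete, here is a minimal sketch of Spearman's rank correlation in Java, the course language. This is not the authors' analysis code (they presumably used a statistics package); the class name, the toy arrays, and the 0–4 frequency scale are all hypothetical, and ties are handled with average ranks, the standard convention.

```java
import java.util.Arrays;
import java.util.Comparator;

public class SpearmanSketch {

    /** Average ranks (1-based); tied values share the mean of their positions. */
    static double[] averageRanks(double[] v) {
        Integer[] idx = new Integer[v.length];
        for (int i = 0; i < v.length; i++) idx[i] = i;
        Arrays.sort(idx, Comparator.comparingDouble(i -> v[i]));
        double[] ranks = new double[v.length];
        int i = 0;
        while (i < v.length) {
            int j = i;
            while (j + 1 < v.length && v[idx[j + 1]] == v[idx[i]]) j++;
            double avg = (i + j) / 2.0 + 1;              // mean of positions i..j, 1-based
            for (int k = i; k <= j; k++) ranks[idx[k]] = avg;
            i = j + 1;
        }
        return ranks;
    }

    /** Spearman's rho = Pearson correlation computed on the ranks. */
    static double spearman(double[] x, double[] y) {
        double[] rx = averageRanks(x), ry = averageRanks(y);
        double mx = Arrays.stream(rx).average().orElse(0);
        double my = Arrays.stream(ry).average().orElse(0);
        double cov = 0, vx = 0, vy = 0;
        for (int i = 0; i < rx.length; i++) {
            cov += (rx[i] - mx) * (ry[i] - my);
            vx  += (rx[i] - mx) * (rx[i] - mx);
            vy  += (ry[i] - my) * (ry[i] - my);
        }
        return cov / Math.sqrt(vx * vy);
    }

    public static void main(String[] args) {
        // Hypothetical data: self-reported AI-use frequency (0 = never .. 4 = weekly)
        // and a test score; a negative rho means heavier use pairs with lower scores.
        double[] aiUse = {0, 1, 3, 2, 4, 1, 0, 2};
        double[] score = {88, 75, 62, 70, 55, 80, 91, 68};
        System.out.printf("Spearman rho = %.3f%n", spearman(aiUse, score));
    }
}
```

Spearman's ρ is simply Pearson correlation computed on ranks, which is why it suits the skewed, non-normal score distributions the authors report.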
Most students had tried AI at least once (~80%), but habitual use was rare: only ~4–5% reported weekly use by mid-semester. Students primarily used chatbots to debug code, understand examples, and clarify concepts; they used them far less for quizzes or pure theory. Helpfulness ratings were generally positive. However—and this is the crux—reported frequency of AI use correlated negatively with performance: moderate for the first programming test (ρ≈−0.315) and weaker but still significant for the second test, exam, and total points. Notably, perceived helpfulness didn’t correlate with grades.
Why “more AI” mapped to “lower scores”
Correlation is not causation; the authors are careful on that point. Yet the pattern is consistent with two mutually compatible explanations you’ve likely seen firsthand. First, students who already struggle reach more often for external aids—AI included—which makes frequency a proxy for difficulty rather than a driver of poor performance. Second, when students lean on chatbots for solutions—not just explanations—they may bypass the friction that builds mental models in OOP, which shows up later on timed, unaided assessments. The university’s press release and subsequent coverage frame it similarly: unguided or heavy reliance can hinder learning, even as students feel less stuck.
Importantly, the result doesn’t invalidate earlier work showing benefits in structured settings. Reviews and meta-analyses find that when AI is embedded as a tutor that prompts, explains, and nudges, students often gain in debugging speed and comprehension; some controlled studies even show achievement gains. The tension you see here mirrors what we noted above: context and pedagogy matter more than the tool itself.
What students actually liked—and what tripped them up
When chatbots worked, they worked for familiar reasons: instant availability, faster paths to pinpointing bugs, and clear, iterative explanations that “talk through” code. Students also reported creative but pragmatic uses such as generating test data and translating familiar Python idioms into Java as a bridge—use cases you might consider legitimizing with guardrails. Conversely, errors, over-engineered suggestions, and misread prompts were common pain points, which likely reinforced uneven trust and the “Google vs. GPT” trade-off.
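To illustrate the last of those use cases, the kind of “Python idiom to Java” translation students described might look like the sketch below. It is a generic example rather than code from the study, and the class and method names are invented.

```java
import java.util.List;
import java.util.stream.Collectors;

public class IdiomBridge {
    // Python idiom a first-year might already know:
    //   squares = [x * x for x in numbers if x % 2 == 0]
    // The Java translation a chatbot typically suggests uses the Stream API:
    static List<Integer> evenSquares(List<Integer> numbers) {
        return numbers.stream()
                .filter(x -> x % 2 == 0)   // keep even values (the "if" clause)
                .map(x -> x * x)           // square them (the expression part)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(evenSquares(List.of(1, 2, 3, 4, 5, 6)));  // [4, 16, 36]
    }
}
```

Used this way, the chatbot acts as a bridge from syntax the student already understands to syntax they are learning, which is closer to the tutor role than the solver role.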
What this means for your classroom or program
If you’re teaching or designing curriculum, the signal is to treat AI as a structured support, not a primary solver. Concretely, align policy and practice with three moves:
- Channel toward explanation and debugging. Require students to annotate when, how, and why they used AI, and to contrast its output with their own approach. This keeps the locus of reasoning with the learner while preserving AI’s speed advantage. (As discussed above, benefits show up when AI behaves like a tutor.)
- Assess for transfer under non-AI conditions. Weight checkpoints that demand reasoning without tools, such as short code traces, concept questions, or constrained IDE environments (a minimal trace example follows this list), so that habits built with AI must generalize. The Tartu findings suggest performance gaps widen when unaided understanding is tested.
- Target support to heavy users. Frequent AI users reported feeling less struggle and more motivation, yet they scored lower. Proactive coaching—on prompt design for explanations, on reading errors, and on stepping down to simpler baselines—can redirect effort into durable skills.
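For the “short code trace” idea above, an unaided checkpoint item could be as small as the sketch below. It is a generic illustration, not an item from the Tartu course, and it targets reference aliasing, a point OOP novices who outsource their reasoning often miss.

```java
// Unaided checkpoint: without running the code, what does main print, and why?
class Counter {
    private int value;

    Counter(int start) { value = start; }

    Counter bump() {        // returns the same object, so calls can be chained
        value++;
        return this;
    }

    int get() { return value; }
}

public class TraceQuestion {
    public static void main(String[] args) {
        Counter a = new Counter(1);
        Counter b = a.bump().bump();   // b is an alias for a, not a copy
        a.bump();
        System.out.println(a.get() + " " + b.get());  // prints "4 4": one shared object
    }
}
```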
As ever, scope the caveats. This is one university, one course, one academic year, and mid-semester self-report. Nevertheless, paired with prior syntheses, it’s a timely corrective to the assumption that “more AI” must mean “better outcomes.” Used well, these systems can accelerate feedback and comprehension; used as solution engines, they may do the opposite—especially early in a CS pathway.