The Promise and Limitations of AI in Education: A Nuanced Look at Emerging Research
Published on AEI Ideas
As AI continues to advance, its potential applications in education have become a subject of considerable interest and debate. Recent studies illuminate AI’s promise and limitations across different facets of teaching and learning, offering an emerging, nuanced picture of how large language models (LLMs) could transform educational practice.
AI as a Grader: Efficiency with Caveats. One of the most promising areas for AI in education is automating the grading process, particularly for short-answer questions. The study “Can Large Language Models Make the Grade?” examined how well GPT-4 and GPT-3.5 graded student responses across subjects and grade levels. Remarkably, GPT-4’s grading reached 0.70 on the study’s agreement measure, nearly as high as the 0.75 human graders achieved. Moreover, the AI completed the task in a fraction of the time humans required: roughly two hours for some 1,700 responses, compared with an estimated 11-plus hours for human graders. This efficiency gain, coupled with consistent performance across subjects and difficulty levels, suggests that AI could be a valuable tool for formative assessment, freeing up teachers’ time for other activities. However, the study also highlighted the need for further research into the datasets used and the types of questions AI models can effectively grade.
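To make this workflow concrete, here is a minimal sketch of LLM-based short-answer marking and of how model-human agreement might be scored. It is our illustration, not the study’s pipeline: the prompt wording, the binary rubric, and the use of Cohen’s kappa as the agreement statistic are assumptions on our part.

```python
from openai import OpenAI
from sklearn.metrics import cohen_kappa_score

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def grade_response(question: str, rubric: str, answer: str) -> int:
    """Ask the model for a binary grade on one short-answer response."""
    prompt = (
        f"Question: {question}\n"
        f"Rubric: {rubric}\n"
        f"Student answer: {answer}\n"
        "Reply with a single digit: 1 if the answer satisfies the rubric, else 0."
    )
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep grading as repeatable as the API allows
    )
    return int(reply.choices[0].message.content.strip()[0])

# Model-human agreement can then be summarized with a kappa-style statistic
# on the same 0-to-1 scale the study reports.
human_grades = [1, 0, 1, 1, 0]  # toy labels for illustration only
model_grades = [1, 0, 1, 0, 0]  # toy labels for illustration only
print(cohen_kappa_score(human_grades, model_grades))
```

Setting temperature to 0 is a common choice for grading tasks, since it reduces run-to-run variation in the model’s verdicts.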
The Double-Edged Sword of AI Tutoring. AI’s potential as a tutoring tool was explored in the study “Generative AI Can Harm Learning,” a randomized controlled trial involving nearly 1,000 students. The findings were mixed. Access to GPT-4-based tutoring tools significantly improved students’ performance during practice sessions (with the specialized GPT Tutor producing a 127 percent improvement), but it also led to a concerning outcome: Students who had relied on AI assistance performed 17 percent worse on subsequent exams, when AI was not available, than those who had never used it. This suggests that while AI can enhance learning in the short term, it may also encourage dependency, diminishing students’ ability to perform independently. It is important to note the study’s limitations, however: it examined a single high school in Turkey, with only four tutoring sessions covering just 15 percent of the curriculum.
Reasoning Abilities: AI’s Strengths and Weaknesses. One of the more interesting debates is whether LLMs genuinely reason or simply predict the next word in a sentence. Recent studies are contributing to our understanding of LLMs’ complex reasoning capabilities, revealing impressive strengths and notable limitations. The study “Physics of Language Models” demonstrated that models like GPT-2 and Llama can develop genuine reasoning skills for solving grade-school math problems rather than merely memorizing patterns. These models exhibit human-like abilities such as planning ahead, solving problems step by step, and generating efficient solutions.
Building on these findings, the study “Inductive or Deductive? Rethinking the Fundamental Reasoning Abilities of LLMs” provides a more nuanced picture of LLMs’ cognitive capabilities. Using a novel framework called SolverLearner, which has the model induce a general rule from input-output examples while leaving the rule’s execution to external code, the study found that advanced models like GPT-4 handled inductive reasoning tasks effectively, often achieving near-perfect performance. Their deductive reasoning proved markedly weaker, however, particularly in counterfactual scenarios.
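To illustrate the division of labor that, on our reading, SolverLearner enforces, here is a toy sketch: the model is asked only to induce a rule from example pairs, and ordinary Python code, not the LLM, applies that rule to held-out inputs. The prompt wording, the doubling rule, and the bare-code parsing shortcut are illustrative assumptions, not the paper’s actual setup.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

examples = [(2, 4), (5, 10), (7, 14)]  # toy pairs shown to the model
held_out = [(3, 6), (11, 22)]          # pairs used only to check the rule

prompt = (
    "Given these input-output pairs, write a Python function f(x) mapping "
    "each input to its output. Reply with bare Python code only, no prose.\n"
    + "\n".join(f"f({x}) -> {y}" for x, y in examples)
)
reply = client.chat.completions.create(
    model="gpt-4", messages=[{"role": "user", "content": prompt}]
)

# Sketch-level simplification: trusts that the reply is bare code. A robust
# version would strip markdown fences and sandbox the execution.
namespace: dict = {}
exec(reply.choices[0].message.content, namespace)
f = namespace["f"]

# Python applies the induced rule, so a failure here points to faulty
# induction rather than faulty step-by-step execution by the model.
print(all(f(x) == y for x, y in held_out))
```

Keeping execution outside the model is the design point: it separates whether the model inferred the right rule from whether it can mechanically apply one.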
These studies highlight the importance of model architecture in determining reasoning capabilities and suggest that traditional evaluation methods may underestimate LLMs’ true inductive reasoning potential.
Predicting Social Science Experiments with AI. Finally, a fascinating study examined GPT-4’s ability to predict the results of social science experiments. Analyzing 476 treatment effects from 70 pre-registered survey experiments, researchers found that LLM-derived predictions strongly correlated with actual effects, even for unpublished studies. LLMs matched or surpassed human forecasters’ accuracy and showed consistency across demographic subgroups. In nine additional “mega-studies” testing various interventions, LLM predictions were less accurate but still comparable to expert forecasts, especially for text-based survey experiments.
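The core of the evaluation is simple to state: elicit an effect-size forecast from the model for each experiment, then correlate the forecasts with the observed effects. A toy sketch with made-up numbers, not the study’s data or code, might look like this.

```python
from scipy.stats import pearsonr

# Hypothetical effect sizes for illustration; the study analyzed 476 real
# treatment effects drawn from 70 pre-registered survey experiments.
observed_effects = [0.12, -0.05, 0.30, 0.08, 0.21]
llm_predictions  = [0.10, -0.02, 0.25, 0.11, 0.18]

# A strong positive correlation is the pattern the researchers report.
r, p = pearsonr(observed_effects, llm_predictions)
print(f"r = {r:.2f}, p = {p:.3f}")
```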
The findings suggest that LLMs could augment social science research by enabling rapid, low-cost pilot studies, generating effect size estimates for power analyses, and assessing past work’s reliability. However, the study also flagged misuse risks: the same forecasting ability could help identify which harmful messages, such as anti-vaccination appeals, would be most effective. For education research, LLMs may help prioritize promising interventions and design more effective experiments, though limitations and biases must be carefully considered.
As the field rapidly evolves, the strongly held views on AI in education—both enthusiastic and skeptical—must be tempered and informed by ongoing research. The field would greatly benefit from a coordinated research effort—similar to those led by the National Science Foundation, Department of Energy, Department of Commerce, and Advanced Research Projects Agency for Health—to systematically explore AI’s implications in various educational contexts. This research should focus not only on AI’s technical capabilities but also on its pedagogical impacts, ethical considerations, and long-term effects on learning outcomes.