Why AI Struggles with Basic Math (and How That’s Changing)

Published at AEI Ideas.

Large language models (LLMs) have ushered in a new era of artificial intelligence (AI), demonstrating remarkable capabilities in language generation, translation, and reasoning. Yet LLMs often stumble over basic math, a serious limitation in settings where math is essential, including education. These limitations are being addressed, however, through improvements in the models themselves along with better prompting strategies.

The limitations of LLMs in mathematics have been widely recognized. For instance, Paul von Hippel, an associate dean at the University of Texas, pointed out ChatGPT's inadequacies in teaching geometry, and The Wall Street Journal highlighted similar struggles by Khan Academy's AI tutor, Khanmigo. These difficulties stem from the fact that LLMs are built for linguistic rather than mathematical intelligence, a gap widened by the scarcity of complex math in their training data, which limits their grasp of advanced mathematical concepts.

Nonetheless, performance varies widely among models. GPT-4, for instance, scored in the 89th percentile on the SAT math exam, while Google's PaLM 2 surpassed GPT-4 on math assessments that included more than 20,000 school-level problems and word puzzles.

There is also a growing number of specialized math models aimed at improving mathematical capabilities. LLEMMA, for example, can solve math problems without additional fine-tuning, and it can leverage computational tools, such as the Python interpreter and formal theorem provers, to produce mathematical proofs.

Google DeepMind's AlphaGeometry achieved expert-level performance in geometric problem-solving. Benchmarked against 30 geometry problems from the International Mathematical Olympiad (IMO), AlphaGeometry solved 25 within the competition's strict time constraints. For comparison, the average human gold medalist solves 25.9 of these problems, and the previous state-of-the-art geometric reasoning program solved only 10 of the 30.

Google DeepMind also developed FunSearch, which pairs an LLM with an automated evaluator. FunSearch created code that produced a correct and previously unknown solution to the cap set problem, which involves determining the maximum number of dots that can be plotted on graph paper without any three of them forming a straight line. This niche yet important problem has confounded mathematicians, who disagree about how best to approach it.
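
To make the cap set problem concrete: on an n-dimensional grid whose coordinates take only the values 0, 1, and 2, three distinct points lie on a (wraparound) line exactly when every coordinate of their sum is divisible by three. The brute-force verifier below is a sketch for building intuition, not anything FunSearch produced; FunSearch's evolved programs construct large cap sets rather than merely check them.

```python
# Brute-force check that a set of grid points is a "cap set": no three
# distinct points lie on a line. In the n-dimensional grid over {0, 1, 2},
# three distinct points are collinear exactly when each coordinate of
# their sum is divisible by 3.
from itertools import combinations

def is_cap_set(points: list[tuple[int, ...]]) -> bool:
    for a, b, c in combinations(points, 3):
        if all((x + y + z) % 3 == 0 for x, y, z in zip(a, b, c)):
            return False  # a, b, and c form a line
    return True

# The four corners of the 3x3 grid avoid every line...
print(is_cap_set([(0, 0), (0, 2), (2, 0), (2, 2)]))  # True
# ...but (0,0), (1,1), (2,2) is a line, so this set fails.
print(is_cap_set([(0, 0), (1, 1), (2, 2)]))          # False
```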

While the underlying models will improve over time, there are strategies that can be used now. Better, more sophisticated prompts often improve accuracy. Researchers recently took the steps students use to solve arithmetic problems and built them into a chain-of-thought prompting technique, designing the prompt to cross-check intermediate steps and to solve the same problem using multiple approaches. The technique achieved 92.5 percent accuracy on the MultiArith dataset (a collection of math word problems designed to test a model's ability to perform multi-step arithmetic reasoning), compared with 78.7 percent for previous state-of-the-art systems.
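
A minimal sketch of what such a prompt can look like in code follows, using OpenAI's Python client. The instruction wording and the model name are illustrative assumptions, not the researchers' published prompt.

```python
# Sketch of chain-of-thought prompting for arithmetic word problems.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

COT_INSTRUCTIONS = (
    "Solve the problem step by step. State the relevant numbers and "
    "what is being asked, write out every intermediate calculation, "
    "cross-check each intermediate result, then solve the problem a "
    "second way and confirm the two answers match. "
    "Finish with a line of the form: Answer: <number>"
)

def solve_with_cot(problem: str, model: str = "gpt-4o") -> str:
    """Ask the model to reason step by step before committing to an answer."""
    response = client.chat.completions.create(
        model=model,  # illustrative model name
        messages=[
            {"role": "system", "content": COT_INSTRUCTIONS},
            {"role": "user", "content": problem},
        ],
        temperature=0,  # keep the arithmetic as deterministic as possible
    )
    return response.choices[0].message.content

print(solve_with_cot(
    "A baker made 48 muffins, sold 19, and then baked 24 more. "
    "How many muffins does the baker have now?"
))
```

The cross-checking and solve-it-twice instructions are the heart of the technique: arithmetic slips in one solution path tend to surface when the second path disagrees.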

Incorporating the Wolfram GPT makes ChatGPT smarter by giving it access to powerful computation, accurate math, curated knowledge, real-time data, and visualization. Using OpenAI's Code Interpreter (now called Advanced Data Analysis) also produces more accurate calculations, in part because it writes a small Python program to perform the actual math. Researchers tested OpenAI's GPT-4 Code Interpreter on the challenging MATH benchmark and achieved a new state-of-the-art accuracy of 69.7 percent, far surpassing GPT-4's 42.2 percent.
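
The pattern behind Code Interpreter's advantage can be sketched directly: instead of asking the model for a number, ask it for a program, and let the Python interpreter compute the number. In the sketch below, the model name is an illustrative assumption, and running model-generated code with exec() is for demonstration only; a real system would sandbox it.

```python
# Sketch of the "write code instead of doing arithmetic" pattern.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def solve_by_code(problem: str, model: str = "gpt-4o") -> str:
    """Have the model write a Python program; run it to get the answer."""
    prompt = (
        "Write a short Python program that computes the answer to the "
        "problem below and stores it in a variable named `answer`. "
        "Return only the code, with no explanation and no code fences.\n\n"
        + problem
    )
    code = client.chat.completions.create(
        model=model,  # illustrative model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    ).choices[0].message.content

    namespace: dict = {}
    exec(code, namespace)  # demonstration only: sandbox untrusted code
    return str(namespace["answer"])

# The arithmetic is done by the interpreter, not by next-token prediction.
print(solve_by_code("What is 12.5 percent of 3,847,212?"))
```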

Despite initial hurdles in mathematical capabilities, the trajectory of LLMs is undeniably upward, fueled by continuous advancements and innovative solutions. Specialized models and integration with computational tools point to a future in which LLMs not only comprehend and generate language with unprecedented sophistication but also navigate complex mathematics with ease. As these models evolve, their potential to revolutionize fields including education, science, technology, engineering, and public policy becomes increasingly apparent.