GSM-Symbolic

  • GSM-Symbolic

    “GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models”, from Apple Machine Learning Research:

    We investigate the fragility of mathematical reasoning in these models and show that their performance significantly deteriorates as the number of clauses in a question increases. We hypothesize that this decline is because current LLMs cannot perform genuine logical reasoning; they replicate reasoning steps from their training data. Adding a single clause that seems relevant to the question causes significant performance drops (up to 65%) across all state-of-the-art models, even though the clause doesn’t contribute to the reasoning chain needed for the final answer.
    Even better: “the performance of all models declines when only the numerical values in the question are altered” seems to suggest that great performance on benchmarks like GSM8K just means that the LLMs have been trained on the answers… (a rough sketch of that value-perturbation setup follows after this entry).

    (tags: training benchmarks ai llms gsm-symbolic reasoning ml apple papers gsm8k)
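
For concreteness, here is a minimal sketch of the kind of perturbation the paper describes: one hand-written template (loosely based on the paper's kiwi example), fresh numbers sampled per variant, and an optional distractor clause that never changes the answer. This is not the paper's actual harness; the template, the clause, and the `ask_model` stand-in are hypothetical placeholders for whatever model client you'd actually use.

```python
# Sketch of GSM-Symbolic-style perturbation: re-sample the numbers in a
# templated word problem, optionally append a "no-op" clause that is
# irrelevant to the answer, and compare model accuracy across variants.
import random

# Hypothetical template, loosely based on the paper's kiwi example; the
# real benchmark builds symbolic templates from GSM8K questions.
TEMPLATE = (
    "Oliver picks {k} kiwis on Friday and {m} kiwis on Saturday. "
    "{noop}How many kiwis does Oliver have?"
)
NOOP_CLAUSE = "Five of the kiwis on Saturday are a bit smaller than average. "


def make_variant(rng, with_noop=False):
    """Instantiate the template with fresh numbers; the ground truth
    tracks the sampled values, and the no-op clause never affects it."""
    k, m = rng.randint(10, 99), rng.randint(10, 99)
    question = TEMPLATE.format(k=k, m=m, noop=NOOP_CLAUSE if with_noop else "")
    return question, k + m


def ask_model(question):
    """Hypothetical stand-in: wire this up to your own LLM client and
    parse its final numeric answer."""
    raise NotImplementedError


def accuracy(n_variants=50, with_noop=False, seed=0):
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_variants):
        question, answer = make_variant(rng, with_noop)
        if ask_model(question) == answer:
            correct += 1
    return correct / n_variants
```

Comparing accuracy on the fixed original question, on value-perturbed variants (`with_noop=False`), and on distractor variants (`with_noop=True`) is the gist of the GSM-Symbolic and GSM-NoOp experiments the quote is summarizing.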