Putnam-AXIOM: A Functional and Static Benchmark for Measuring Higher Level Mathematical Reasoning
Detailed Summary:
**Context and Motivation:**
Large language models (LLMs) have made significant strides in mathematical reasoning, sometimes achieving near-human or superhuman performance on benchmarks like MATH or GSM8K. However, these benchmarks face two critical issues: saturation (models scoring too high, making differentiation difficult) and data contamination (problems being present in training data, leading to inflated results). To address these, we introduced Putnam-AXIOM, a new benchmark using advanced problems from the William Lowell Putnam Mathematical Competition. Our goal is to evaluate LLMs' true reasoning capabilities on complex problems while mitigating data contamination through functional variations.
**Main Contributions:**
**Putnam-AXIOM Original Dataset:**
The dataset comprises 236 problems from Putnam competitions (1985–2023), spanning 11 mathematical domains (e.g., geometry, algebra, analysis, combinatorics). Problems are meticulously formatted in LaTeX, including complex equations and vector graphics (Asymptote), to preserve their mathematical rigor. To facilitate automated evaluation and eliminate format-induced biases, final answers are consistently enclosed in \(\boxed{}\).
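The boxed-answer convention makes automated grading largely a matter of string extraction. The following is a minimal sketch of how a final answer could be pulled out of a model response. The function names `extract_boxed` and `is_correct` are hypothetical, and the exact-match comparison is an assumption: the benchmark's actual equivalence check is not described in this summary.

```python
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last boxed expression in `text`, or None."""
    marker = r"\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1  # we are already inside the opening brace of the box
    chars = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
        i += 1
    return None  # braces never balanced


def is_correct(response: str, reference: str) -> bool:
    """Assumed grading rule: exact match after removing all whitespace."""
    answer = extract_boxed(response)
    if answer is None:
        return False
    return "".join(answer.split()) == "".join(reference.split())
```

Counting braces rather than using a simple regular expression lets nested LaTeX such as `\frac{1}{2}` survive extraction intact. A real grader would likely also need to treat mathematically equivalent forms as equal, which this sketch does not attempt.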
**Putnam-AXIOM Variation Dataset:**
To counter data contamination, we created 52 functional variations of problems from the original dataset. These variations involve modifying variables, constants, or problem formulations while preserving the original difficulty. For instance, a problem might see a variable change from \(x\) to \(w\) or a numerical constant from 2011 to 4680. This forces models to engage in genuine reasoning rather than relying on memorization.
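A functional variation can be thought of as a problem template paired with a function that recomputes the reference answer. The sketch below illustrates the idea on a deliberately simple toy problem (a sum of consecutive integers), not on an actual Putnam problem. The class `FunctionalProblem`, the `<VAR>` and `<CONST>` placeholders, and the toy problem itself are all assumptions made for illustration; only the example substitutions (\(x\) to \(w\), 2011 to 4680) come from the text above.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class FunctionalProblem:
    """Hypothetical container: a statement template plus an answer function."""
    template: str                     # uses <VAR> and <CONST> placeholders
    answer_fn: Callable[[int], int]   # recomputes the answer from the constant

    def instantiate(self, var: str, const: int) -> Tuple[str, str]:
        statement = (
            self.template.replace("<VAR>", var).replace("<CONST>", str(const))
        )
        return statement, str(self.answer_fn(const))


# Toy problem (not from the benchmark): sum of the first <CONST> integers.
toy = FunctionalProblem(
    template=r"Evaluate \(\sum_{<VAR>=1}^{<CONST>} <VAR>\).",
    answer_fn=lambda n: n * (n + 1) // 2,
)

original = toy.instantiate("x", 2011)
variation = toy.instantiate("w", 4680)
print(original)
print(variation)
```

Because the answer is recomputed from the new constant, a model that has memorized the original answer gains nothing on the variation, while the reasoning required stays the same.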
**Model Evaluation:**
We evaluated several state-of-the-art LLMs, both proprietary and open-source, on both the Original and Variation datasets. Models were required to output answers in the \(\boxed{}\) format, allowing for precise comparison with reference solutions. Performance was measured by average accuracy and 95% confidence intervals.
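As an illustration of the metric, the sketch below computes an accuracy and a 95% confidence interval from a count of correct answers. It assumes the normal-approximation (Wald) interval for a binomial proportion; this summary does not say which interval the evaluation actually used, and the function name `accuracy_with_ci` is hypothetical.

```python
import math
from typing import Tuple


def accuracy_with_ci(correct: int, total: int, z: float = 1.96) -> Tuple[float, float, float]:
    """Return (accuracy, lower, upper) using a normal-approximation interval.

    z = 1.96 corresponds to a two-sided 95% interval. This is an assumed
    method, not necessarily the one used to produce the reported intervals.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    half_width = z * math.sqrt(p * (1.0 - p) / total)
    # Clamp to [0, 1] because the approximation can overshoot near the ends.
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


# 99 correct out of 236 problems corresponds to an accuracy of about 41.95%.
acc, low, high = accuracy_with_ci(99, 236)
print(f"accuracy = {acc:.2%}, 95% CI = [{low:.2%}, {high:.2%}]")
```

With only 236 problems, the interval is roughly six percentage points wide on either side at this accuracy level, so small differences between models should be read with caution.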
**Key Results:**
**Benchmark Difficulty:**
Putnam-AXIOM proved to be exceptionally challenging for all evaluated models:
* OpenAI o1-preview, the top performer, achieved only 41.95% accuracy.
* GPT-4o scored 17.80%.
* Math-specialized models like Qwen2-Math-7B and NuminaMath-7B achieved merely 5.51% and 4.66%, respectively.
These results confirm Putnam-AXIOM is substantially more difficult than existing benchmarks like MATH or GSM8K.
**Impact of Functional Variations:**
Model performance declined on the functional variations relative to the original problems:
* OpenAI o1-preview dropped from 41.95% to 33.96% (7.99 percentage points).
* GPT-4o fell from 17.80% to 9.43% (8.37 percentage points).
These findings suggest that models rely in part on memorization of original problems and struggle to generalize to structurally similar, yet parametrically different, variations.
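The two reported drops are similar in absolute terms but very different relative to each model's starting accuracy. A small sketch of that arithmetic, using only the percentages quoted above (the helper name `accuracy_drop` is hypothetical):

```python
from typing import Tuple


def accuracy_drop(original_pct: float, variation_pct: float) -> Tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop in percent)."""
    absolute = original_pct - variation_pct
    relative = 100.0 * absolute / original_pct
    return absolute, relative


reported = {
    "OpenAI o1-preview": (41.95, 33.96),
    "GPT-4o": (17.80, 9.43),
}

for model, (orig, var) in reported.items():
    absolute, relative = accuracy_drop(orig, var)
    print(f"{model}: -{absolute:.2f} points ({relative:.1f}% relative)")
```

On these figures, o1-preview loses about a fifth of its original accuracy, while GPT-4o loses nearly half.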
**Error Analysis:**
Even the leading models frequently exhibited a lack of mathematical rigor:
* They often presented unsupported claims or skipped crucial logical steps.
* Common errors included calculation mistakes and the inclusion of irrelevant information.
Open-source models, such as NuminaMath-7B, were particularly prone to comprehension and reasoning errors.
**Examples of Problems and Solutions:**
**Original Problem:**
Determine the sum of the first positive integers with a specific property.
Final answer: \(2k^2 - 4k + 3\).
**Modified Problem (Variation):**
Variables and constants are altered, while the core problem structure remains consistent.
Final answer: a new expression based on these modifications.
**Model Response Analysis:**
Models like GPT-4o and OpenAI o1-preview sometimes produced the correct final answer but often lacked rigorous justification in their solution steps.
**Conclusion:**
Putnam-AXIOM is a novel and exceptionally challenging benchmark designed to rigorously evaluate the advanced reasoning capabilities of LLMs. It addresses the limitations of current benchmarks by introducing functional variations to prevent data contamination and demanding mathematically rigorous answers. Our results demonstrate that even the most advanced models still have considerable room for improvement in solving complex mathematical problems with true rigor. We hope this benchmark will catalyze further research into artificial reasoning and foster the development of models truly capable of high-level mathematical problem-solving.

By Romain Peter