DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving

Bibliographic Details
Published in:	arXiv.org (Dec 23, 2024), p. n/a
Main author:	Tong, Yuxuan
Other authors:	Zhang, Xiwen; Wang, Rui; Wu, Ruidong; He, Junxian
Published:	Cornell University Library, arXiv.org
Subjects:	Problem solving; Datasets; Tuning; Rejection; Large language models; Mathematical problems; Proprietary; Query processing; Synthetic data; Reasoning
Online access:	Citation/Abstract; Full text outside of ProQuest

MARC


LEADER	00000nab a2200000uu 4500
001	3126995661
003	UK-CbPIL
022			\|a 2331-8422
035			\|a 3126995661
045	0		\|b d20241223
100	1		\|a Tong, Yuxuan
245	1		\|a DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving
260			\|b Cornell University Library, arXiv.org \|c Dec 23, 2024
513			\|a Working Paper
520	3		\|a Solving mathematical problems requires advanced reasoning abilities and presents notable challenges for large language models. Previous works usually synthesize data from proprietary models to augment existing datasets, followed by instruction tuning to achieve top-tier results. However, our analysis of these datasets reveals severe biases towards easy queries, with frequent failures to generate any correct response for the most challenging queries. Hypothesizing that difficult queries are crucial to learn complex reasoning, we propose Difficulty-Aware Rejection Tuning (DART), a method that allocates difficult queries more trials during the synthesis phase, enabling more extensive training on difficult samples. Utilizing DART, we have created new datasets for mathematical problem-solving that focus more on difficult queries and are substantially smaller than previous ones. Remarkably, our synthesis process solely relies on a 7B-sized open-weight model, without reliance on the commonly used proprietary GPT-4. We fine-tune various base models on our datasets ranging from 7B to 70B in size, resulting in a series of strong models called DART-MATH. In comprehensive in-domain and out-of-domain evaluation on 6 mathematical benchmarks, DART-MATH outperforms vanilla rejection tuning significantly, being superior or comparable to previous arts, despite using much smaller datasets and no proprietary models. Furthermore, our results position our synthetic datasets as the most effective and cost-efficient publicly available resources for advancing mathematical problem-solving.
653			\|a Problem solving
653			\|a Datasets
653			\|a Tuning
653			\|a Rejection
653			\|a Large language models
653			\|a Mathematical problems
653			\|a Proprietary
653			\|a Query processing
653			\|a Synthetic data
653			\|a Reasoning
700	1		\|a Zhang, Xiwen
700	1		\|a Wang, Rui
700	1		\|a Wu, Ruidong
700	1		\|a He, Junxian
773	0		\|t arXiv.org \|g (Dec 23, 2024), p. n/a
786	0		\|d ProQuest \|t Engineering Database
856	4	1	\|3 Citation/Abstract \|u https://www.proquest.com/docview/3126995661/abstract/embedded/ZKJTFFSVAI7CB62C?source=fedsrch
856	4	0	\|3 Full text outside of ProQuest \|u http://arxiv.org/abs/2407.13690
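
Illustrative note on the abstract (field 520): the abstract says DART "allocates difficult queries more trials during the synthesis phase". The sketch below shows one way such an allocation could work, assuming each query is sampled until a target number of correct responses is kept or a trial cap is reached. This stopping rule is an assumption, not a detail stated in this record, and `pass_rate` is a hypothetical stand-in for a real model plus an answer checker.

```python
import random


def fixed_budget(pass_rate, n_trials, rng):
    """Vanilla rejection sampling: every query gets the same number of trials.

    pass_rate is a hypothetical per-query probability that a sampled
    response is correct. Returns (kept_responses, trials_used).
    """
    kept = sum(1 for _ in range(n_trials) if rng.random() < pass_rate)
    return kept, n_trials


def sample_until_k_correct(pass_rate, k, max_trials, rng):
    """Assumed difficulty-aware variant: keep sampling until k correct
    responses are collected or max_trials is reached, so harder queries
    (lower pass_rate) consume more trials.
    Returns (kept_responses, trials_used).
    """
    kept = 0
    trials = 0
    while kept < k and trials < max_trials:
        trials += 1
        if rng.random() < pass_rate:
            kept += 1
    return kept, trials


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical pass rates for an easy and a hard query.
    for name, p in {"easy": 0.9, "hard": 0.05}.items():
        print(name, fixed_budget(p, 8, rng), sample_until_k_correct(p, 4, 10_000, rng))
```

Under a fixed budget, a hard query often ends up with few or no kept responses. Under the assumed stopping rule, it ends up with the same number of kept responses as an easy query, at the cost of more trials.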