| English abstract |
In this study, we assess ChatGPT, OpenAI’s latest conversational chatbot and large language model (LLM), on its performance in elementary-grade arithmetic and logic problems. Despite its impressive coherence in natural language processing and its ability to follow instructions, our findings indicate that ChatGPT still has room for improvement in mathematical tasks. To evaluate its performance, we used six math and logic datasets (SingleEq, AddSub, SVAMP, MultiArith, Simple Arithmetic and counting, and Arithmetic (word variation)) and found that ChatGPT performed better than previous models such as InstructGPT and Minerva. However, our arithmetic dataset, which includes two- to seven-digit equations, revealed that ChatGPT’s accuracy in solving addition problems decreased from 100% to 64%, with simple arithmetic errors, such as failing to carry in addition, being a common issue. Additionally, the model struggled with basic multi-step word problems. To address this, we propose a novel benchmark for evaluating LLMs’ mathematical abilities. Further research is needed for LLMs to reach a level of mathematical reasoning comparable to their natural language processing abilities. Overall, our study highlights the need for continued improvement in LLMs’ mathematical abilities to make them more effective in real-world applications. |