We challenge the reasoning and understanding abilities of LLMs by proposing a FaLlacy Understanding Benchmark (FLUB) containing cunning texts that are easy for humans to understand but difficult for models to grasp. We collect real cunning texts as our raw data from “Ruozhiba”, a well-known Chinese online forum. Figure 1(a) shows running examples from FLUB.
Based on our constructed FLUB and its annotation information (as shown in Figure 1(b)), we design three tasks of increasing difficulty to test whether LLMs can understand fallacies and resolve cunning texts. Specifically, (1) Answer Selection: the model is asked to select the correct answer from the four candidates provided by FLUB for each input text. (2) Cunning Type Classification: given a cunning text as input, the model is expected to directly identify its fallacy type as defined in our scheme. (3) Fallacy Explanation: given a cunning text, the model should generate a correct explanation of the fallacy the text contains, as a human would, without falling into its trap.
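To make the three task formats concrete, the following is a minimal Python sketch of how a FLUB sample and a Task 1 query could be represented. The field and function names (`FLUBSample`, `answer_selection_prompt`, and so on) are illustrative assumptions for exposition, not the released data schema.

```python
from dataclasses import dataclass

@dataclass
class FLUBSample:
    """One annotated FLUB sample (field names are hypothetical)."""
    text: str            # the cunning input text
    options: list[str]   # four candidate answers (Task 1: Answer Selection)
    correct_option: int  # index of the correct answer among the four
    cunning_type: str    # one of the 8 defined types (Task 2: Classification)
    explanation: str     # reference explanation (Task 3: Fallacy Explanation)

def answer_selection_prompt(sample: FLUBSample) -> str:
    """Build a Task 1 prompt: choose the correct answer from four options."""
    labeled = "\n".join(
        f"({chr(65 + i)}) {opt}" for i, opt in enumerate(sample.options)
    )
    return (
        f"Text: {sample.text}\n"
        f"Which of the following best explains the text?\n"
        f"{labeled}\n"
        f"Answer with a single letter."
    )
```

Under this sketch, Tasks 2 and 3 would query the model for `cunning_type` and `explanation`, respectively, and score the outputs against the annotations.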
We observe that most of the collected cunning texts can be grouped into recurring types (e.g., paradox, word game). We therefore define 8 cunning types, with corresponding examples shown in Figure 2. In total, FLUB comprises 834 samples spanning the 8 cunning types.
The main results are presented below, from which we draw several interesting discoveries and insights: