When LLMs Meet Cunning Texts:
A Fallacy Understanding Benchmark for
Large Language Models

Yinghui Li1, Qingyu Zhou, Yuanzhen Luo, Shirong Ma1,
Yangning Li1, Hai-Tao Zheng1, Xuming Hu2, Philip S. Yu3
1Tsinghua University
2The Hong Kong University of Science and Technology (Guangzhou)
3University of Illinois Chicago
liyinghu20@mails.tsinghua.edu.cn

Introduction

We challenge the reasoning and understanding abilities of LLMs by proposing a FaLlacy Understanding Benchmark (FLUB), which contains cunning texts that are easy for humans to understand but difficult for models to grasp. We collect real cunning texts as our raw data from "Ruozhiba", a famous Chinese online forum. Figure 1(a) shows running examples from FLUB.

Figure 1: (a) Running examples from FLUB; (b) annotation information of FLUB. (figure1_flub_example.png)

Based on our constructed FLUB and its annotation information (shown in Figure 1(b)), we design three tasks of increasing difficulty to test whether LLMs can understand and resolve the fallacies in cunning texts. Specifically: (1) Answer Selection: given an input text, the model must select the correct answer from the four candidates provided by FLUB. (2) Cunning Type Classification: given a cunning text, the model must identify its fallacy type as defined in our scheme. (3) Fallacy Explanation: given a cunning text, the model should generate a correct explanation of the fallacy it contains, as a human would, without falling into the text's trap.
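For concreteness, the sketch below shows one plausible way to pose these three tasks to an LLM. Here `query_llm` is a hypothetical stand-in for any chat-completion call, and the `sample` field names are illustrative assumptions rather than FLUB's actual schema.

```python
from typing import Callable, Dict, Sequence

def build_prompts(sample: Dict, type_labels: Sequence[str]) -> Dict[str, str]:
    """Construct one prompt per FLUB task for a single benchmark sample."""
    text = sample["cunning_text"]
    options = "\n".join(
        f"({label}) {cand}" for label, cand in zip("ABCD", sample["candidates"])
    )
    return {
        # Task 1: Answer Selection -- pick the correct option among four.
        "answer_selection": (
            f"Text: {text}\n{options}\n"
            "Which option correctly explains the text? Answer with A, B, C, or D."
        ),
        # Task 2: Cunning Type Classification -- one of the 8 defined types.
        "type_classification": (
            f"Text: {text}\nClassify the cunning type of this text as one of: "
            + ", ".join(type_labels) + "."
        ),
        # Task 3: Fallacy Explanation -- free-form explanation of the fallacy.
        "fallacy_explanation": (
            f"Text: {text}\nExplain the fallacy contained in this text."
        ),
    }

def evaluate_sample(
    sample: Dict, type_labels: Sequence[str], query_llm: Callable[[str], str]
) -> Dict[str, str]:
    """Run all three tasks on one sample and collect the raw model outputs."""
    return {
        task: query_llm(prompt)
        for task, prompt in build_prompts(sample, type_labels).items()
    }
```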

Cunning Texts in FLUB

We observe that most of the collected cunning texts fall into one of a small set of recognizable types (e.g., paradox, word game). We therefore define 8 cunning types and provide a corresponding example for each, as shown in Figure 2. In total, FLUB comprises 834 samples spanning these 8 types.

Figure 2: The 8 cunning types defined in FLUB, with corresponding examples. (figure2_cunning_type.png)
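To make the annotation information concrete, here is a hypothetical sample record in the shape assumed by the prompting sketch above; the field names and placeholder values are illustrative assumptions, not the released dataset's actual schema.

```python
# Hypothetical FLUB-style record (illustrative schema, not the official one).
sample = {
    "id": 0,
    "cunning_text": "<a cunning text collected from Ruozhiba>",
    "cunning_type": "word game",  # one of the 8 defined cunning types
    "candidates": [               # four answer options for Answer Selection
        "<option A>", "<option B>", "<option C>", "<option D>",
    ],
    "answer": "A",                # gold option label
    "explanation": "<human-written explanation of the fallacy>",
}
```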

Experiments

The main results are presented in Table 1 and Table 2, from which we draw several interesting findings and insights:

Table 1: Main results. (table1_result.png)
Table 2: Main results. (table2_result.png)