We challenge the reasoning and understanding abilities of LLMs by proposing a FaLlacy Understanding Benchmark (FLUB) containing cunning texts that are easy for humans to understand but difficult for models to grasp. We collect real cunning texts as our raw data from “Ruozhiba”, a well-known Chinese online forum. Figure 1(a) shows running examples from FLUB.
Based on our constructed FLUB and its annotation information (as shown in Figure 1(b)), we design three tasks of increasing difficulty to test whether LLMs can understand fallacies and resolve cunning texts. Specifically, (1) Answer Selection: the model is asked to select the correct answer from the four candidates provided by FLUB for each input text. (2) Cunning Type Classification: given a cunning text as input, the model is expected to directly identify its fallacy type as defined in our scheme. (3) Fallacy Explanation: given a cunning text, the model should generate a correct explanation of the fallacy the text contains, as a human would, without falling into its trap.
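To make the three task formats concrete, the following is a minimal Python sketch of how a FLUB sample and a Task 1 query could be represented. The field and function names (`FLUBSample`, `answer_selection_prompt`, and so on) are illustrative assumptions for exposition, not the released data schema.

```python
from dataclasses import dataclass

@dataclass
class FLUBSample:
    """One annotated FLUB sample (field names are hypothetical)."""
    text: str            # the cunning input text
    options: list[str]   # four candidate answers (Task 1: Answer Selection)
    correct_option: int  # index of the correct answer among the four
    cunning_type: str    # one of the 8 defined types (Task 2: Classification)
    explanation: str     # reference explanation (Task 3: Fallacy Explanation)

def answer_selection_prompt(sample: FLUBSample) -> str:
    """Build a Task 1 prompt: choose the correct answer from four options."""
    labeled = "\n".join(
        f"({chr(65 + i)}) {opt}" for i, opt in enumerate(sample.options)
    )
    return (
        f"Text: {sample.text}\n"
        f"Which of the following best explains the text?\n"
        f"{labeled}\n"
        f"Answer with a single letter."
    )
```

Under this sketch, Tasks 2 and 3 would query the model for `cunning_type` and `explanation`, respectively, and score the outputs against the annotations.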
We observe that most of the collected cunning texts can be grouped into recurring types (e.g., paradox, word game). We therefore define 8 cunning types, with corresponding examples shown in Figure 2. In total, FLUB comprises 834 samples spanning the 8 cunning types.
The main results are presented below, from which we draw several interesting discoveries and insights: