Publications / Preprints
* denotes equal contribution
Please see my Google Scholar or Semantic Scholar profiles for the most up-to-date list of publications.
Preprints
2024
- SeaKR: Self-aware Knowledge Retrieval for Adaptive Retrieval Augmented Generation
Zijun Yao, Weijian Qi, Liangming Pan, Shulin Cao, Linmei Hu, Weichuan Liu, Lei Hou, and Juanzi Li
arXiv preprint, 2024
This paper introduces Self-aware Knowledge Retrieval (SeaKR), a novel adaptive RAG model that extracts the self-aware uncertainty of LLMs from their internal states. SeaKR activates retrieval when the LLM exhibits high self-aware uncertainty during generation. To effectively integrate retrieved knowledge snippets, SeaKR re-ranks them based on the LLM's self-aware uncertainty, preserving the snippet that reduces its uncertainty the most. To facilitate solving complex tasks that require multiple retrievals, SeaKR uses this self-aware uncertainty to choose among different reasoning strategies. Our experiments on both complex and simple Question Answering datasets show that SeaKR outperforms existing adaptive RAG methods.
@misc{yao2024seakrselfawareknowledgeretrieval, title = {SeaKR: Self-aware Knowledge Retrieval for Adaptive Retrieval Augmented Generation}, author = {Yao, Zijun and Qi, Weijian and Pan, Liangming and Cao, Shulin and Hu, Linmei and Liu, Weichuan and Hou, Lei and Li, Juanzi}, year = {2024}, eprint = {2406.19215}, archiveprefix = {arXiv}, primaryclass = {cs.CL}, url = {https://arxiv.org/abs/2406.19215} }
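A minimal Python sketch of the uncertainty-gated retrieval loop described above, under stated assumptions: generate_with_uncertainty and search are hypothetical stubs standing in for the LLM (SeaKR derives its uncertainty score from the model's internal states) and the retriever. This illustrates the control flow only, not the authors' implementation.

from typing import List, Tuple

def generate_with_uncertainty(prompt: str) -> Tuple[str, float]:
    # Hypothetical stub: a real system would decode with an LLM and derive
    # a self-aware uncertainty score from its internal states.
    return "draft answer", 0.8

def search(query: str, k: int = 3) -> List[str]:
    # Hypothetical retriever stub returning top-k knowledge snippets.
    return [f"snippet {i} for: {query}" for i in range(k)]

def seakr_answer(question: str, threshold: float = 0.5) -> str:
    # Adaptive RAG: retrieve only when the model is uncertain, and keep the
    # snippet whose inclusion lowers that uncertainty the most.
    answer, uncertainty = generate_with_uncertainty(question)
    if uncertainty <= threshold:
        return answer  # confident enough: answer without retrieval
    best_answer, best_u = answer, uncertainty
    for snippet in search(question):
        cand, u = generate_with_uncertainty(f"{snippet}\n\nQuestion: {question}")
        if u < best_u:  # re-rank snippets by uncertainty reduction
            best_answer, best_u = cand, u
    return best_answer

print(seakr_answer("Who discovered penicillin?"))

The gate keeps confident answers retrieval-free, while the re-ranking step retains only the snippet that most reduces the model's uncertainty.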
- Aristotle: Mastering Logical Reasoning with A Logic-Complete Decompose-Search-Resolve Framework
Jundong Xu, Hao Fei, Meng Luo, Qian Liu, Liangming Pan, William Yang Wang, Preslav Nakov, Mong-Li Lee, and Wynne Hsu
arXiv preprint, 2024
In the context of large language models (LLMs), current advanced reasoning methods have made impressive strides in various reasoning tasks. However, when it comes to logical reasoning tasks, major challenges remain in both efficacy and efficiency. This is rooted in the fact that these systems fail to fully leverage the inherent structure of logical tasks throughout reasoning processes such as decomposition, search, and resolution. To address this, we propose a logic-complete reasoning framework, Aristotle, with three key components: Logical Decomposer, Logical Search Router, and Logical Resolver. In our framework, symbolic expressions and logical rules are comprehensively integrated into the entire reasoning process, significantly alleviating the bottlenecks of logical reasoning, i.e., reducing sub-task complexity, minimizing search errors, and resolving logical contradictions. The experimental results on several datasets demonstrate that Aristotle consistently outperforms state-of-the-art reasoning frameworks in both accuracy and efficiency, particularly excelling in complex logical reasoning scenarios.
@misc{xu2024aristotlemasteringlogicalreasoning, title = {Aristotle: Mastering Logical Reasoning with A Logic-Complete Decompose-Search-Resolve Framework}, author = {Xu, Jundong and Fei, Hao and Luo, Meng and Liu, Qian and Pan, Liangming and Wang, William Yang and Nakov, Preslav and Lee, Mong-Li and Hsu, Wynne}, year = {2024}, eprint = {2412.16953}, archiveprefix = {arXiv}, primaryclass = {cs.CL}, url = {https://arxiv.org/abs/2412.16953} }
- AntiLeak-Bench: Preventing Data Contamination by Automatically Constructing Benchmarks with Updated Real-World Knowledge
Xiaobao Wu, Liangming Pan, Yuxi Xie, Ruiwen Zhou, Shuai Zhao, Yubo Ma, Mingzhe Du, Rui Mao, Anh Tuan Luu, and William Yang Wang
arXiv preprint, 2024
Data contamination hinders fair LLM evaluation by introducing test data into newer models' training sets. Existing studies solve this challenge by updating benchmarks with newly collected data. However, they fail to guarantee contamination-free evaluation as the newly collected data may contain pre-existing knowledge, and their benchmark updates rely on intensive human labor. To address these issues, in this paper we propose AntiLeak-Bench, an automated anti-leakage benchmarking framework. Instead of simply using newly collected data, we construct samples with explicitly new knowledge absent from LLMs' training sets, which thus ensures strictly contamination-free evaluation. We further design a fully automated workflow to build and update our benchmark without human labor. This significantly reduces the cost of benchmark maintenance to accommodate emerging LLMs. Through extensive experiments, we highlight that data contamination likely exists before LLMs' cutoff time and demonstrate that AntiLeak-Bench effectively overcomes this challenge.
@misc{wu2024antileakbenchpreventingdatacontamination, title = {AntiLeak-Bench: Preventing Data Contamination by Automatically Constructing Benchmarks with Updated Real-World Knowledge}, author = {Wu, Xiaobao and Pan, Liangming and Xie, Yuxi and Zhou, Ruiwen and Zhao, Shuai and Ma, Yubo and Du, Mingzhe and Mao, Rui and Luu, Anh Tuan and Wang, William Yang}, year = {2024}, eprint = {2412.13670}, archiveprefix = {arXiv}, primaryclass = {cs.CL}, url = {https://arxiv.org/abs/2412.13670} }
- RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios
Ruiwen Zhou, Wenyue Hua, Liangming Pan, Sitao Cheng, Xiaobao Wu, En Yu, and William Yang Wang
arXiv preprint, 2024
This paper introduces RuleArena, a novel and challenging benchmark designed to evaluate the ability of large language models (LLMs) to follow complex, real-world rules in reasoning. Covering three practical domains – airline baggage fees, NBA transactions, and tax regulations – RuleArena assesses LLMs’ proficiency in handling intricate natural language instructions that demand long-context understanding, logical reasoning, and accurate mathematical computation. Two key attributes distinguish RuleArena from traditional rule-based reasoning benchmarks: (1) it extends beyond standard first-order logic representations, and (2) it is grounded in authentic, practical scenarios, providing insights into the suitability and reliability of LLMs for real-world applications. Our findings reveal several notable limitations in LLMs: (1) they struggle to identify and apply the appropriate rules, frequently becoming confused by similar but distinct regulations, (2) they cannot consistently perform accurate mathematical computations, even when they correctly identify the relevant rules, and (3) in general, they perform poorly in the benchmark. These results highlight significant challenges in advancing LLMs’ rule-guided reasoning capabilities in real-life applications.
@misc{zhou2024rulearenabenchmarkruleguidedreasoning, title = {RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios}, author = {Zhou, Ruiwen and Hua, Wenyue and Pan, Liangming and Cheng, Sitao and Wu, Xiaobao and Yu, En and Wang, William Yang}, year = {2024}, eprint = {2412.08972}, archiveprefix = {arXiv}, primaryclass = {cs.CL}, url = {https://arxiv.org/abs/2412.08972} }
- COrAL: Order-Agnostic Language Modeling for Efficient Iterative Refinement
Yuxi Xie, Anirudh Goyal, Xiaobao Wu, Xunjian Yin, Xiao Xu, Min-Yen Kan, Liangming Pan, and William Yang Wang
arXiv preprint, 2024
Iterative refinement has emerged as an effective paradigm for enhancing the capabilities of large language models (LLMs) on complex tasks. However, existing approaches typically implement iterative refinement at the application or prompting level, relying on autoregressive (AR) modeling. The sequential token generation in AR models can lead to high inference latency. To overcome these challenges, we propose Context-Wise Order-Agnostic Language Modeling (COrAL), which incorporates iterative refinement directly into the LLM architecture while maintaining computational efficiency. Our approach models multiple token dependencies within manageable context windows, enabling the model to perform iterative refinement internally during the generation process. Leveraging the order-agnostic nature of COrAL, we introduce sliding blockwise order-agnostic decoding, which performs multi-token forward prediction and backward reconstruction within context windows. This allows the model to iteratively refine its outputs in parallel in the sliding block, effectively capturing diverse dependencies without the high inference cost of sequential generation. Empirical evaluations on reasoning tasks demonstrate that COrAL improves performance and inference speed, respectively, achieving absolute accuracy gains of 4.6% on GSM8K and 4.0% on LogiQA, along with inference speedups of up to 3.9× over next-token baselines. Preliminary results on code generation indicate a drop in pass rates due to inconsistencies in order-agnostic outputs, highlighting the inherent quality–speed trade-off.
@misc{xie2024coralorderagnosticlanguagemodeling, title = {COrAL: Order-Agnostic Language Modeling for Efficient Iterative Refinement}, author = {Xie, Yuxi and Goyal, Anirudh and Wu, Xiaobao and Yin, Xunjian and Xu, Xiao and Kan, Min-Yen and Pan, Liangming and Wang, William Yang}, year = {2024}, eprint = {2410.09675}, archiveprefix = {arXiv}, primaryclass = {cs.CL}, url = {https://arxiv.org/abs/2410.09675} }
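To make the decoding schedule concrete, here is a toy Python sketch of sliding blockwise order-agnostic decoding; propose and score are hypothetical stand-ins for the model's order-agnostic heads, and the draft-then-sweep schedule is a simplification of the paper's procedure rather than the released implementation.

from typing import List

def propose(tokens: List[str], i: int) -> str:
    # Hypothetical proposal head: draft a token for slot i given the rest
    # of the sequence (an order-agnostic model may condition on both sides).
    return f"<tok{i}>"

def score(tokens: List[str], i: int, token: str) -> float:
    # Hypothetical order-agnostic scorer for `token` at position i.
    return 1.0

def sliding_block_decode(prompt: List[str], length: int, block: int = 4,
                         sweeps: int = 2) -> List[str]:
    # Draft `block` tokens ahead in parallel (forward prediction), then run
    # refinement sweeps over the window (backward reconstruction).
    seq = list(prompt)
    target = len(prompt) + length
    while len(seq) < target:
        start = len(seq)
        seq += [propose(seq, start + j) for j in range(min(block, target - start))]
        for _ in range(sweeps):
            for i in range(start, len(seq)):
                cand = propose(seq, i)
                if score(seq, i, cand) > score(seq, i, seq[i]):
                    seq[i] = cand  # accept the higher-scoring token
    return seq

print(sliding_block_decode(["Question:", "2+2=?"], length=8))

Because each block is drafted and refined in parallel, latency scales with the number of blocks rather than the number of tokens, which is the source of the reported speedups.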
- Understanding the Interplay between Parametric and Contextual Knowledge for Large Language Models
Sitao Cheng, Liangming Pan, Xunjian Yin, Xinyi Wang, and William Yang Wang
arXiv preprint, 2024
Large language models (LLMs) encode vast amounts of knowledge during pre-training (parametric knowledge, or PK) and can further be enhanced by incorporating contextual knowledge (CK). Can LLMs effectively integrate their internal PK with external CK to solve complex problems? In this paper, we investigate the dynamic interaction between PK and CK, categorizing their relationships into four types: Supportive, Complementary, Conflicting, and Irrelevant. To support this investigation, we introduce ECHOQA, a benchmark spanning scientific, factual, and commonsense knowledge. Our results show that LLMs tend to suppress their PK when contextual information is available, even when it is complementary or irrelevant. While tailored instructions can encourage LLMs to rely more on their PK, they still struggle to fully leverage it. These findings reveal a key vulnerability in LLMs, raising concerns about their reliability in knowledge-intensive tasks.
@misc{cheng2024understandinginterplayparametriccontextual, title = {Understanding the Interplay between Parametric and Contextual Knowledge for Large Language Models}, author = {Cheng, Sitao and Pan, Liangming and Yin, Xunjian and Wang, Xinyi and Wang, William Yang}, year = {2024}, eprint = {2410.08414}, archiveprefix = {arXiv}, primaryclass = {cs.CL}, url = {https://arxiv.org/abs/2410.08414} }
- Gödel Agent: A Self-Referential Agent Framework for Recursive Self-Improvement
Xunjian Yin, Xinyi Wang, Liangming Pan, Xiaojun Wan, and William Yang Wang
arXiv preprint, 2024
The rapid advancement of large language models (LLMs) has significantly enhanced the capabilities of AI-driven agents across various tasks. However, existing agentic systems, whether based on fixed pipeline algorithms or pre-defined meta-learning frameworks, cannot search the whole agent design space due to the restriction of human-designed components, and thus might miss the globally optimal agent design. In this paper, we introduce Gödel Agent, a self-evolving framework inspired by the Gödel machine, enabling agents to recursively improve themselves without relying on predefined routines or fixed optimization algorithms. Gödel Agent leverages LLMs to dynamically modify its own logic and behavior, guided solely by high-level objectives through prompting. Experimental results on mathematical reasoning and complex agent tasks demonstrate that our implementation of Gödel Agent achieves continuous self-improvement, surpassing manually crafted agents in performance, efficiency, and generalizability.
@misc{yin2024godelagentselfreferentialagent, title = {G\"odel Agent: A Self-Referential Agent Framework for Recursive Self-Improvement}, author = {Yin, Xunjian and Wang, Xinyi and Pan, Liangming and Wan, Xiaojun and Wang, William Yang}, year = {2024}, eprint = {2410.04444}, archiveprefix = {arXiv}, primaryclass = {cs.AI}, url = {https://arxiv.org/abs/2410.04444} }
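The loop the abstract describes, an agent that reads and rewrites its own policy code guided only by a high-level objective, can be sketched in a few lines of Python. llm_rewrite and evaluate are hypothetical stubs, and the keep-if-better acceptance rule is a simplification inspired by the Gödel-machine idea, not the paper's exact mechanism.

from typing import Callable

def llm_rewrite(source: str, objective: str) -> str:
    # Hypothetical stub: a real system would prompt an LLM to return a
    # modified version of the agent's own policy code.
    return source

def evaluate(policy: Callable[[str], str]) -> float:
    # Hypothetical benchmark stub: score the policy, higher is better.
    return 0.0

INITIAL_POLICY = "def policy(task: str) -> str:\n    return 'answer: ' + task\n"

def self_improve(source: str, objective: str, rounds: int = 3) -> str:
    # The agent inspects and rewrites its own policy code, adopting a
    # rewrite only if it scores strictly higher.
    ns: dict = {}
    exec(source, ns)
    best_score = evaluate(ns["policy"])
    for _ in range(rounds):
        candidate = llm_rewrite(source, objective)
        cand_ns: dict = {}
        try:
            exec(candidate, cand_ns)  # load the rewritten policy
            cand_score = evaluate(cand_ns["policy"])
        except Exception:
            continue  # reject rewrites that crash
        if cand_score > best_score:
            source, best_score = candidate, cand_score
    return source

print(self_improve(INITIAL_POLICY, "maximize benchmark accuracy"))

Because every rewrite is re-evaluated before being adopted, broken self-modifications are simply discarded.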
Conference & Journal Papers
2025
- Improving Causal Reasoning in Large Language Models: A Survey
Longxuan Yu, Delin Chen, Siheng Xiong, Qingyang Wu, Qingzhen Liu, Dawei Li, Zhikai Chen, Xiaoze Liu, and Liangming Pan
In Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2025
Causal reasoning (CR) is a crucial aspect of intelligence, essential for problem-solving, decision-making, and understanding the world. While large language models (LLMs) can generate rationales for their outputs, their ability to reliably perform causal reasoning remains uncertain, often falling short in tasks requiring a deep understanding of causality. In this survey, we provide a comprehensive review of research aimed at enhancing LLMs for causal reasoning. We categorize existing methods based on the role of LLMs: either as reasoning engines or as helpers providing knowledge or data to traditional CR methods, followed by a detailed discussion of the methodologies in each category. We then evaluate the performance of LLMs on various causal reasoning tasks, providing key findings and in-depth analysis. Finally, we provide insights from current studies and highlight promising directions for future research. We aim for this work to serve as a comprehensive resource, fostering further advancements in causal reasoning with LLMs.
@inproceedings{yu2024improvingcausalreasoningsurvey, title = {Improving Causal Reasoning in Large Language Models: A Survey}, author = {Yu, Longxuan and Chen, Delin and Xiong, Siheng and Wu, Qingyang and Liu, Qingzhen and Li, Dawei and Chen, Zhikai and Liu, Xiaoze and Pan, Liangming}, year = {2025}, booktitle = {Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)}, address = {Albuquerque, USA}, url = {https://arxiv.org/abs/2410.16676} }
- DistiLRR: Transferring Code Repair for Low-Resource Programming Languages
Kyle Wong, Alfonso Amayuelas, Liangming Pan, and William Yang Wang
In Findings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2025
Large language models (LLMs) have shown remarkable performance on code generation tasks. A recent application of LLMs for code generation is iterative code repair, where a model fixes an incorrect program by reasoning about errors and generating a new program. However, code repair is primarily studied on high-resource languages like Python, and the framework's efficacy is under-explored on low-resource languages. To apply code repair to low-resource languages, we propose Distilling Low-Resource Repairs (DistiLRR), an approach that transfers the reasoning and code generation ability from a teacher model to a student model. Our results show that DistiLRR consistently outperforms baselines on low-resource languages, but has similar performance on high-resource languages. To investigate this behavior, we perform a further analysis and find that the correlation between rationale quality and code correctness is weaker than previously perceived. We hypothesize this weakness is magnified in low-resource settings, where base models lack deep knowledge of a programming language, leading to wavering benefits of code repair between high-resource and low-resource languages.
@inproceedings{wong2024distilrrtransferringcoderepair, title = {DistiLRR: Transferring Code Repair for Low-Resource Programming Languages}, author = {Wong, Kyle and Amayuelas, Alfonso and Pan, Liangming and Wang, William Yang}, year = {2025}, booktitle = {Findings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)}, address = {Albuquerque, USA}, url = {https://arxiv.org/abs/2406.14867} }
- Combating Multimodal LLM Hallucination via Bottom-Up Holistic Reasoning
Shengqiong Wu, Hao Fei, Liangming Pan, William Yang Wang, Shuicheng Yan, and Tat-Seng Chua
In AAAI Conference on Artificial Intelligence (AAAI), 2025
Recent advancements in multimodal large language models (MLLMs) have shown unprecedented capabilities in advancing various vision-language tasks. However, MLLMs face significant challenges with hallucinations, i.e., misleading outputs that do not align with the input data. While existing efforts have been devoted to combating MLLM hallucinations, several pivotal challenges remain unsolved. First, while current approaches aggressively focus on addressing errors at the perception level, another important type at the cognition level, which requires factual commonsense, can be overlooked. In addition, existing methods might fall short in finding a more effective way to represent visual input, which remains a key bottleneck that triggers visual hallucinations. Moreover, MLLMs can frequently be misled by faulty textual inputs and produce hallucinations, yet this type of issue has long been overlooked by existing studies. Inspired by human intuition in handling hallucinations, this paper introduces a novel bottom-up reasoning framework. Our framework systematically addresses potential issues in both visual and textual inputs by verifying and integrating perception-level information with cognition-level commonsense knowledge, ensuring more reliable outputs. Extensive experiments demonstrate significant improvements on multiple hallucination benchmarks after integrating MLLMs with the proposed framework. In-depth analyses reveal the great potential of our methods in addressing perception- and cognition-level hallucinations.
@inproceedings{wu2025combating, author = {Wu, Shengqiong and Fei, Hao and Pan, Liangming and Wang, William Yang and Yan, Shuicheng and Chua, Tat{-}Seng}, title = {Combating Multimodal {LLM} Hallucination via Bottom-Up Holistic Reasoning}, booktitle = {AAAI Conference on Artificial Intelligence (AAAI)}, year = {2025}, url = {https://arxiv.org/abs/2412.11124} }
2024
- TART: An Open-Source Tool-Augmented Framework for Explainable Table-based Reasoning
Xinyuan Lu, Liangming Pan*, Yubo Ma, Preslav Nakov, and Min-Yen Kan
In NeurIPS Workshop on Table Representation Learning (TRL@NeurIPS), 2024
Best Paper Runner-Up
Current Large Language Models (LLMs) exhibit limited ability to understand table structures and to apply precise numerical reasoning, which is crucial for tasks such as table question answering (TQA) and table-based fact verification (TFV). To address these challenges, we introduce our Tool-Augmented Reasoning framework for Tables (TART), which integrates LLMs with specialized tools. TART contains three key components: a table formatter to ensure accurate data representation, a tool maker to develop specific computational tools, and an explanation generator to maintain explainability. We also present the TOOLTAB dataset, a new benchmark designed specifically for training LLMs in table-tool integration. Our experiments indicate that TART achieves substantial improvements over existing methods (e.g., Chain-of-Thought) by improving both the precision of data processing and the clarity of the reasoning process. Notably, TART paired with CodeLlama achieves 90.0% of the accuracy of the closed-source LLM GPT-3.5-turbo, highlighting its robustness in diverse real-world scenarios.
@inproceedings{lu2024tartopensourcetoolaugmentedframework, title = {TART: An Open-Source Tool-Augmented Framework for Explainable Table-based Reasoning}, author = {Lu, Xinyuan and Pan, Liangming and Ma, Yubo and Nakov, Preslav and Kan, Min-Yen}, year = {2024}, booktitle = {NeurIPS Workshop on Table Representation Learning (TRL@NeurIPS)}, address = {Vancouver, Canada}, url = {https://arxiv.org/abs/2409.11724} }
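A minimal sketch of TART's three-stage pipeline (formatter, tool, explanation), assuming a tiny CSV table and a hand-written tool in place of an LLM-synthesized one: column_sum is a hypothetical example of what the tool maker might produce.

import csv
import io
from typing import Dict, List

def format_table(raw_csv: str) -> List[Dict[str, str]]:
    # Table formatter: parse the raw table into structured rows.
    return list(csv.DictReader(io.StringIO(raw_csv)))

def column_sum(rows: List[Dict[str, str]], col: str) -> float:
    # Hypothetical computational tool of the kind a tool maker might emit.
    return sum(float(r[col]) for r in rows)

def answer_with_explanation(raw_csv: str, col: str) -> str:
    # Format the table, invoke a tool for the numeric step, and generate a
    # human-readable explanation of what was computed.
    rows = format_table(raw_csv)
    total = column_sum(rows, col)
    return (f"Summed column '{col}' over {len(rows)} rows "
            f"with tool column_sum, giving {total}.")

table = "year,revenue\n2022,10.5\n2023,12.0\n"
print(answer_with_explanation(table, "revenue"))

Delegating the arithmetic to a tool sidesteps the numerical-reasoning weakness the abstract identifies, while the explanation string preserves an auditable reasoning trace.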
- MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations
Yubo Ma, Yuhang Zang, Liangyu Chen, Meiqi Chen, Yizhu Jiao, Xinze Li, Xinyuan Lu, Ziyu Liu, Yan Ma, Xiaoyi Dong, Pan Zhang, Liangming Pan, Yu-Gang Jiang, Jiaqi Wang, Yixin Cao, and Aixin Sun
In Annual Conference on Neural Information Processing Systems (NeurIPS) (Datasets and Benchmarks Track), 2024
Spotlight Paper
Understanding documents with rich layouts and multi-modal components is a long-standing and practical task. Recent Large Vision-Language Models (LVLMs) have made remarkable strides in various tasks, particularly in single-page document understanding (DU). However, their abilities on long-context DU remain an open problem. This work presents MMLongBench-Doc, a long-context, multi-modal benchmark comprising 1,062 expert-annotated questions. Distinct from previous datasets, it is constructed upon 130 lengthy PDF-formatted documents with an average of 49.4 pages and 20,971 textual tokens. Towards comprehensive evaluation, answers to these questions rely on pieces of evidence from (1) different sources (text, image, chart, table, and layout structure) and (2) various locations (i.e., page numbers). Moreover, 33.2% of the questions are cross-page questions requiring evidence across multiple pages. 22.8% of the questions are designed to be unanswerable for detecting potential hallucinations. Experiments on 14 LVLMs demonstrate that long-context DU greatly challenges current models. Notably, the best-performing model, GPT-4o, achieves an F1 score of only 42.7%, while the second-best, GPT-4V, scores 31.4%. Furthermore, 12 LVLMs (all except GPT-4o and GPT-4V) even present worse performance than their LLM counterparts which are fed with lossy-parsed OCR documents. These results validate the necessity of future research toward more capable long-context LVLMs.
@inproceedings{ma-etal-2024-mmlongbench-doc, title = {MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations}, author = {Ma, Yubo and Zang, Yuhang and Chen, Liangyu and Chen, Meiqi and Jiao, Yizhu and Li, Xinze and Lu, Xinyuan and Liu, Ziyu and Ma, Yan and Dong, Xiaoyi and Zhang, Pan and Pan, Liangming and Jiang, Yu-Gang and Wang, Jiaqi and Cao, Yixin and Sun, Aixin}, year = {2024}, booktitle = {Annual Conference on Neural Information Processing Systems (NeurIPS) (Dataset and Benchmark Track)}, address = {Vancouver, Canada}, url = {https://arxiv.org/abs/2407.01523} }
- AKEW: Assessing Knowledge Editing in the Wild
Xiaobao Wu, Liangming Pan, William Yang Wang, and Anh Tuan Luu
In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Knowledge editing aims to inject knowledge updates into language models to keep them correct and up-to-date. However, its current evaluation strategies are notably impractical: they solely update with well-curated structured facts (triplets with subjects, relations, and objects), whereas real-world knowledge updates commonly emerge in unstructured texts like news articles. In this paper, we propose a new benchmark, Unstructured Knowledge Editing (UKE). It evaluates editing performance directly using unstructured texts as knowledge updates, termed unstructured facts. Hence UKE avoids the laborious construction of structured facts and enables efficient and responsive knowledge editing, becoming a more practical benchmark. We conduct extensive experiments on newly built datasets and demonstrate that UKE poses a significant challenge to state-of-the-art knowledge editing methods, resulting in their critical performance declines. We further show that this challenge persists even if we extract triplets as structured facts. Our analysis discloses key insights to motivate future research in UKE for more practical knowledge editing.
@inproceedings{wu-etal-2024-akew, title = {AKEW: Assessing Knowledge Editing in the Wild}, author = {Wu, Xiaobao and Pan, Liangming and Wang, William Yang and Luu, Anh Tuan}, year = {2024}, booktitle = {Conference on Empirical Methods in Natural Language Processing (EMNLP)}, address = {Miami, USA}, publisher = {Association for Computational Linguistics}, url = {https://arxiv.org/abs/2402.18909} }
- SciAgent: Tool-augmented Language Models for Scientific Reasoning
Yubo Ma, Zhibin Gou, Junheng Hao, Ruochen Xu, Shuohang Wang, Liangming Pan, Yujiu Yang, Yixin Cao, Aixin Sun, Hany Awadalla, and Weizhu Chen
In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Scientific reasoning poses a significant challenge for even the most advanced Large Language Models (LLMs). To make this task more practical and solvable for LLMs, we introduce a new task setting named tool-augmented scientific reasoning. This setting supplements LLMs with scalable toolsets, and shifts the focus from pursuing an omniscient problem solver to a proficient tool-user. To facilitate research on this setting, we construct a tool-augmented training corpus named MathFunc, which encompasses over 30,000 samples and roughly 6,000 tools. Building on MathFunc, we develop SciAgent to retrieve, understand and, if necessary, use tools for scientific problem solving. Additionally, we craft a benchmark, SciToolBench, spanning five scientific domains to evaluate LLMs' abilities with tool assistance. Extensive experiments on SciToolBench confirm the effectiveness of SciAgent. Notably, SciAgent-Mistral-7B surpasses other LLMs of the same size by more than 13% in absolute accuracy. Furthermore, SciAgent-DeepMath-7B substantially outperforms ChatGPT.
@inproceedings{ma-etal-2024-sciagent, title = {SciAgent: Tool-augmented Language Models for Scientific Reasoning}, author = {Ma, Yubo and Gou, Zhibin and Hao, Junheng and Xu, Ruochen and Wang, Shuohang and Pan, Liangming and Yang, Yujiu and Cao, Yixin and Sun, Aixin and Awadalla, Hany and Chen, Weizhu}, year = {2024}, booktitle = {Conference on Empirical Methods in Natural Language Processing (EMNLP)}, address = {Miami, USA}, publisher = {Association for Computational Linguistics}, url = {https://arxiv.org/abs/2402.11451} }
- A Survey on Detection of LLMs-Generated Content
Xianjun Yang, Liangming Pan, Xuandong Zhao, Haifeng Chen, Linda Petzold, William Yang Wang, and Wei Cheng
In Findings of Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
The burgeoning capabilities of advanced large language models (LLMs) such as ChatGPT have led to an increase in synthetic content generation with implications across a variety of sectors, including media, cybersecurity, public discourse, and education. As such, the ability to detect LLMs-generated content has become of paramount importance. We aim to provide a detailed overview of existing detection strategies and benchmarks, scrutinizing their differences and identifying key challenges and prospects in the field, advocating for more adaptable and robust models to enhance detection accuracy. We also posit the necessity for a multi-faceted approach to defend against various attacks to counter the rapidly advancing capabilities of LLMs. To the best of our knowledge, this work is the first comprehensive survey on detection in the era of LLMs. We hope it will provide a broad understanding of the current landscape of LLMs-generated content detection, offering a guiding reference for researchers and practitioners striving to uphold the integrity of digital information in an era increasingly dominated by synthetic content.
@inproceedings{yang-etal-2024-survey, title = {A Survey on Detection of LLMs-Generated Content}, author = {Yang, Xianjun and Pan, Liangming and Zhao, Xuandong and Chen, Haifeng and Petzold, Linda and Wang, William Yang and Cheng, Wei}, year = {2024}, booktitle = {Findings of Conference on Empirical Methods in Natural Language Processing (EMNLP)}, address = {Miami, USA}, publisher = {Association for Computational Linguistics}, url = {https://arxiv.org/abs/2310.15654} }
- MultiAgent Collaboration Attack: Investigating Adversarial Attacks in Large Language Model Collaborations via Debate
Alfonso Amayuelas, Xianjun Yang, Antonis Antoniades, Wenyue Hua, Liangming Pan, and William Wang
In Findings of Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Large Language Models (LLMs) have shown exceptional results on current benchmarks when working individually. The advancement in their capabilities, along with a reduction in parameter size and inference times, has facilitated the use of these models as agents, enabling interactions among multiple models to execute complex tasks. Such collaborations offer several advantages, including the use of specialized models (e.g. coding), improved confidence through multiple computations, and enhanced divergent thinking, leading to more diverse outputs. Thus, the collaborative use of language models is expected to grow significantly in the coming years. In this work, we evaluate the behavior of a network of models collaborating through debate under the influence of an adversary. We introduce pertinent metrics to assess the adversary’s effectiveness, focusing on system accuracy and model agreement. Our findings highlight the importance of a model’s persuasive ability in influencing others. Additionally, we explore inference-time methods to generate more compelling arguments and evaluate the potential of prompt-based mitigation as a defensive strategy.
@inproceedings{amayuelas-etal-2024-multiagent, title = {MultiAgent Collaboration Attack: Investigating Adversarial Attacks in Large Language Model Collaborations via Debate}, author = {Amayuelas, Alfonso and Yang, Xianjun and Antoniades, Antonis and Hua, Wenyue and Pan, Liangming and Wang, William}, year = {2024}, booktitle = {Findings of Conference on Empirical Methods in Natural Language Processing (EMNLP)}, address = {Miami, USA}, publisher = {Association for Computational Linguistics}, url = {https://arxiv.org/abs/2406.14711} }
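A toy Python sketch of the evaluated setting: a debate among stub agents with one adversary, plus the agreement metric. The honest and adversary functions are hypothetical placeholders; in the paper the agents are LLMs exchanging natural-language arguments.

from collections import Counter
from typing import Callable, List

Agent = Callable[[str, List[str]], str]

def honest(question: str, peers: List[str]) -> str:
    return "4"  # stub: a real agent would query an LLM with peers' arguments

def adversary(question: str, peers: List[str]) -> str:
    return "5"  # stub adversary: persistently argues for a wrong answer

def debate(agents: List[Agent], question: str, rounds: int = 3) -> List[str]:
    # Each round, every agent answers after seeing peers' previous answers.
    answers = [a(question, []) for a in agents]
    for _ in range(rounds - 1):
        answers = [a(question, answers) for a in agents]
    return answers

def agreement(answers: List[str]) -> float:
    # Fraction of agents voting with the majority: one of the two axes the
    # paper measures, alongside system accuracy.
    return Counter(answers).most_common(1)[0][1] / len(answers)

final = debate([honest, honest, adversary], "What is 2 + 2?")
print(final, "agreement:", agreement(final))

In the paper's setting, an adversary succeeds when its persuasive arguments pull the majority answer away from the truth, which these two metrics are designed to expose.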
- Factcheck-Bench: Fine-Grained Evaluation Benchmark for Automatic Fact-checkers
Yuxia Wang, Revanth Gangi Reddy, Zain Muhammad Mujahid, Arnav Arora, Aleksandr Rubashevskii, Jiahui Geng, Osama Mohammed Afzal, Liangming Pan, Nadav Borenstein, Aditya Pillai, Isabelle Augenstein, Iryna Gurevych, and Preslav Nakov
In Findings of Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
The increased use of large language models (LLMs) across a variety of real-world applications calls for mechanisms to verify the factual accuracy of their outputs. In this work, we present a holistic end-to-end solution for annotating the factuality of LLM-generated responses, which encompasses a multi-stage annotation scheme designed to yield detailed labels concerning the verifiability and factual inconsistencies found in LLM outputs. We design and build an annotation tool to speed up the labelling procedure and ease the workload of raters. It allows flexible incorporation of automatic results in any stage, e.g., automatically-retrieved evidence. We further construct an open-domain document-level factuality benchmark in three-level granularity: claim, sentence, and document. Preliminary experiments show that FacTool, FactScore, and Perplexity.ai struggle to identify false claims, with the best achieving an F1 of only 0.53.
@inproceedings{wang-etal-2024-factcheck, title = {Factcheck-Bench: Fine-Grained Evaluation Benchmark for Automatic Fact-checkers}, author = {Wang, Yuxia and Reddy, Revanth Gangi and Mujahid, Zain Muhammad and Arora, Arnav and Rubashevskii, Aleksandr and Geng, Jiahui and Afzal, Osama Mohammed and Pan, Liangming and Borenstein, Nadav and Pillai, Aditya and Augenstein, Isabelle and Gurevych, Iryna and Nakov, Preslav}, year = {2024}, booktitle = {Findings of Conference on Empirical Methods in Natural Language Processing (EMNLP)}, address = {Miami, USA}, publisher = {Association for Computational Linguistics}, url = {https://arxiv.org/abs/2311.09000} }
- Automatically Correcting Large Language Models: Surveying the Landscape of Diverse Automated Correction Strategies
Liangming Pan, Michael Saxon, Wenda Xu, Deepak Nathani, Xinyi Wang, and William Yang Wang
Transactions of the Association for Computational Linguistics (TACL), 2024
Oral Presentation at ACL 2024
While large language models (LLMs) have shown remarkable effectiveness in various NLP tasks, they are still prone to issues such as hallucination, unfaithful reasoning, and toxicity. A promising approach to rectify these flaws is correcting LLMs with feedback, where the LLM itself is prompted or guided with feedback to fix problems in its own output. Techniques leveraging automated feedback—either produced by the LLM itself (self-correction) or some external system—are of particular interest as they make LLM-based solutions more practical and deployable with minimal human intervention. This paper provides an exhaustive review of the recent advances in correcting LLMs with automated feedback, categorizing them into training-time, generation-time, and post-hoc approaches. We also identify potential challenges and future directions in this emerging field.
@article{pan-etal-2024-automatically, title = {Automatically Correcting Large Language Models: Surveying the Landscape of Diverse Automated Correction Strategies}, author = {Pan, Liangming and Saxon, Michael and Xu, Wenda and Nathani, Deepak and Wang, Xinyi and Wang, William Yang}, journal = {Transactions of the Association for Computational Linguistics (TACL)}, volume = {12}, year = {2024}, address = {Cambridge, MA}, publisher = {MIT Press}, url = {https://aclanthology.org/2024.tacl-1.27}, pages = {484--506} }
- Faithful Logical Reasoning via Symbolic Chain-of-Thought
Jundong Xu, Hao Fei, Liangming Pan, Qian Liu, Mong-Li Lee, and Wynne Hsu
In Annual Meeting of the Association for Computational Linguistics (ACL), 2024
While the recent Chain-of-Thought (CoT) technique enhances the reasoning ability of large language models (LLMs) with the theory of mind, it might still struggle in handling logical reasoning that relies heavily on symbolic expressions and rigid deduction rules. To strengthen the logical reasoning capability of LLMs, we propose a novel Symbolic Chain-of-Thought, namely SymbCoT, a fully LLM-based framework that integrates symbolic expressions and logic rules with CoT prompting. Technically, building upon an LLM, SymbCoT 1) first translates the natural language context into the symbolic format, 2) then derives a step-by-step plan to solve the problem with symbolic logical rules, and 3) employs a verifier to check the translation and reasoning chain. Via thorough evaluations on 5 standard datasets with both First-Order Logic and Constraint Optimization symbolic expressions, SymbCoT consistently shows striking improvements over the CoT method, meanwhile refreshing the current state-of-the-art performance. We further demonstrate that our system advances in more faithful, flexible, and explainable logical reasoning. To our knowledge, this is the first work to combine symbolic expressions and rules into CoT for logical reasoning with LLMs.
@inproceedings{xu-etal-2024-faithful, title = {Faithful Logical Reasoning via Symbolic Chain-of-Thought}, author = {Xu, Jundong and Fei, Hao and Pan, Liangming and Liu, Qian and Lee, Mong{-}Li and Hsu, Wynne}, booktitle = {Annual Meeting of the Association for Computational Linguistics (ACL)}, year = {2024}, address = {Thailand}, publisher = {Association for Computational Linguistics}, url = {https://arxiv.org/abs/2405.18357} }
- Pride and Prejudice: LLM Amplifies Self-Bias in Self-Refinement
Wenda Xu, Guanglei Zhu, Xuandong Zhao, Liangming Pan, Lei Li, and William Yang Wang
In Annual Meeting of the Association for Computational Linguistics (ACL), 2024
ACL Oral Presentation
Recent studies show that large language models (LLMs) improve their performance through self-feedback on certain tasks while degrading on others. We find that this discrepancy is due to LLMs' bias in evaluating their own output. In this paper, we formally define LLMs' self-bias (the tendency to favor their own generation) using two statistics. We analyze six LLMs (GPT-4, GPT-3.5, Gemini, LLaMA2, Mixtral, and DeepSeek) on translation, constrained text generation, and mathematical reasoning tasks. We find that self-bias is prevalent in all examined LLMs across multiple languages and tasks. Our analysis reveals that while the self-refine pipeline improves the fluency and understandability of model outputs, it further amplifies self-bias. To mitigate such biases, we discover that larger model size and external feedback with accurate assessment can significantly reduce bias in the self-refine pipeline, leading to actual performance improvement in downstream tasks.
@inproceedings{xu-etal-2024-pride, title = {Pride and Prejudice: LLM Amplifies Self-Bias in Self-Refinement}, author = {Xu, Wenda and Zhu, Guanglei and Zhao, Xuandong and Pan, Liangming and Li, Lei and Wang, William Yang}, booktitle = {Annual Meeting of the Association for Computational Linguistics (ACL)}, year = {2024}, address = {Thailand}, publisher = {Association for Computational Linguistics}, url = {https://arxiv.org/abs/2402.11436} }
- Knowledge of Knowledge: Exploring Known-Unknowns Uncertainty with Large Language Models
Alfonso Amayuelas, Kyle Wong, Liangming Pan, Wenhu Chen, and William Yang Wang
In Findings of Annual Meeting of the Association for Computational Linguistics (ACL), 2024
This paper investigates the capabilities of Large Language Models (LLMs) in the context of understanding their knowledge and uncertainty over questions. Specifically, we focus on addressing known-unknown questions, characterized by high uncertainty due to the absence of definitive answers. To facilitate our study, we collect a new dataset with Known-Unknown Questions (KUQ) and establish a categorization framework to clarify the origins of uncertainty in such queries. Subsequently, we examine the performance of open-source LLMs, fine-tuned using this dataset, in distinguishing between known and unknown queries within open-ended question-answering scenarios. The fine-tuned models demonstrated a significant improvement, achieving a considerable increase in F1-score relative to their pre-fine-tuning state. Through a comprehensive analysis, we reveal insights into the models’ improved uncertainty articulation and their consequent efficacy in multi-agent debates. These findings help us understand how LLMs can be trained to identify and express uncertainty, improving our knowledge of how they understand and express complex or unclear information.
@inproceedings{alfonso-etal-2024-knowledge, title = {Knowledge of Knowledge: Exploring Known-Unknowns Uncertainty with Large Language Models}, author = {Amayuelas, Alfonso and Wong, Kyle and Pan, Liangming and Chen, Wenhu and Wang, William Yang}, booktitle = {Findings of Annual Meeting of the Association for Computational Linguistics (ACL)}, year = {2024}, address = {Thailand}, publisher = {Association for Computational Linguistics}, url = {https://arxiv.org/abs/2305.13712} }
- The Knowledge Alignment Problem: Bridging Human and External Knowledge for Large Language Models
Shuo Zhang, Liangming Pan, Junzhou Zhao, and William Yang Wang
In Findings of Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Large language models often necessitate grounding on external knowledge to generate faithful and reliable answers. Yet even with the correct groundings in the reference, they can ignore them and rely on wrong groundings or their inherent biases to hallucinate when users, being largely unaware of the specifics of the stored information, pose questions that might not directly correlate with the retrieved groundings. In this work, we formulate this knowledge alignment problem and introduce MixAlign, a framework that interacts with both the human user and the knowledge base to obtain and integrate clarifications on how the user question relates to the stored information. MixAlign employs a language model to achieve automatic knowledge alignment and, if necessary, further enhances this alignment through human user clarifications. Experimental results highlight the crucial role of knowledge alignment in boosting model performance and mitigating hallucination, with improvements noted up to 22.2% and 27.1% respectively. We also demonstrate the effectiveness of MixAlign in improving knowledge alignment by producing high-quality, user-centered clarifications.
@inproceedings{zhang-etal-2024-knowledge, title = {The Knowledge Alignment Problem: Bridging Human and External Knowledge for Large Language Models}, author = {Zhang, Shuo and Pan, Liangming and Zhao, Junzhou and Wang, William Yang}, booktitle = {Findings of Annual Meeting of the Association for Computational Linguistics (ACL)}, year = {2024}, address = {Thailand}, publisher = {Association for Computational Linguistics}, url = {https://arxiv.org/abs/2305.13669} }
- Towards Verifiable Generation: A Benchmark for Knowledge-aware Language Model Attribution
Xinze Li, Yixin Cao, Liangming Pan, Yubo Ma, and Aixin Sun
In Findings of Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Despite achieving great success, Large Language Models (LLMs) usually suffer from unreliable hallucinations. Although language attribution can be a potential solution, there are no suitable benchmarks and evaluation metrics to attribute LLMs to structured knowledge. In this paper, we define a new task of Knowledge-aware Language Model Attribution (KaLMA) that improves upon three core concerns with conventional attributed LMs. First, we extend the attribution source from unstructured texts to Knowledge Graphs (KGs), whose rich structures benefit both attribution performance and working scenarios. Second, we propose a new “Conscious Incompetence” setting that accounts for an incomplete knowledge repository, where the model identifies the need for supporting knowledge beyond the provided KG. Third, we propose a comprehensive automatic evaluation metric encompassing text quality, citation quality, and text-citation alignment. To implement the above innovations, we build a dataset in the biography domain, BioKaLMA, via an evolutionary question generation strategy, to control question complexity and the knowledge necessary for the answer. For evaluation, we develop a baseline solution and demonstrate the room for improvement in LLMs' citation generation, emphasizing the importance of incorporating the “Conscious Incompetence” setting and the critical role of retrieval accuracy.
@inproceedings{li-etal-2024-towards, title = {Towards Verifiable Generation: A Benchmark for Knowledge-aware Language Model Attribution}, author = {Li, Xinze and Cao, Yixin and Pan, Liangming and Ma, Yubo and Sun, Aixin}, booktitle = {Findings of Annual Meeting of the Association for Computational Linguistics (ACL)}, year = {2024}, address = {Thailand}, publisher = {Association for Computational Linguistics}, url = {https://arxiv.org/abs/2310.05634} }
- Modeling Dynamic Topics in Chain-Free Fashion by Evolution-Tracking Contrastive Learning and Unassociated Word Exclusion
Xiaobao Wu, Xinshuai Dong, Liangming Pan, Thong Nguyen, and Anh Tuan Luu
In Findings of Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Dynamic topic models track the evolution of topics in sequential documents, which have derived various applications like trend analysis and opinion mining. However, existing models suffer from repetitive topic and unassociated topic issues, failing to reveal the evolution and hindering further applications. To address these issues, we break the tradition of simply chaining topics in existing work and propose a novel chain-free neural dynamic topic model. We introduce a new evolution-tracking contrastive learning method that builds the similarity relations among dynamic topics. This not only tracks topic evolution but also maintains topic diversity, mitigating the repetitive topic issue. To avoid unassociated topics, we further present an unassociated word exclusion method that consistently excludes unassociated words from discovered topics. Extensive experiments demonstrate our model significantly outperforms state-of-the-art baselines, tracking topic evolution with high-quality topics, showing better performance on downstream tasks, and remaining robust to the hyperparameter for evolution intensities.
@inproceedings{wu-etal-2024-modeling, title = {Modeling Dynamic Topics in Chain-Free Fashion by Evolution-Tracking Contrastive Learning and Unassociated Word Exclusion}, author = {Wu, Xiaobao and Dong, Xinshuai and Pan, Liangming and Nguyen, Thong and Luu, Anh Tuan}, booktitle = {Findings of Annual Meeting of the Association for Computational Linguistics (ACL)}, year = {2024}, address = {Thailand}, publisher = {Association for Computational Linguistics}, url = {https://arxiv.org/abs/2405.17957} }
- Understanding Reasoning Ability of Language Models From the Perspective of Reasoning Paths Aggregation
Xinyi Wang, Alfonso Amayuelas, Kexun Zhang, Liangming Pan, Wenhu Chen, and William Yang Wang
In International Conference on Machine Learning (ICML), 2024
Pre-trained language models (LMs) are able to perform complex reasoning without explicit fine-tuning. To understand how pre-training with a next-token prediction objective contributes to the emergence of such reasoning capability, we propose that we can view an LM as deriving new conclusions by aggregating indirect reasoning paths seen at pre-training time. We found this perspective effective in two important cases of reasoning: logic reasoning with knowledge graphs (KGs) and chain-of-thought (CoT) reasoning. More specifically, we formalize the reasoning paths as random walk paths on the knowledge/reasoning graphs. Analyses of learned LM distributions suggest that a weighted sum of relevant random walk path probabilities is a reasonable way to explain how LMs reason. Experiments and analysis on multiple KG and CoT datasets reveal the effect of training on random walk paths and suggest that augmenting unlabeled random walk reasoning paths can improve real-world multi-step reasoning performance.
@inproceedings{wang-etal-2024-understanding, title = {Understanding Reasoning Ability of Language Models From the Perspective of Reasoning Paths Aggregation}, author = {Wang, Xinyi and Amayuelas, Alfonso and Zhang, Kexun and Pan, Liangming and Chen, Wenhu and Wang, William Yang}, booktitle = {International Conference on Machine Learning (ICML)}, year = {2024}, address = {Austria}, url = {https://arxiv.org/abs/2402.03268} }
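The paper's view of an LM as aggregating indirect reasoning paths can be illustrated on a toy knowledge graph: the sketch below Monte-Carlo-estimates the random-walk probability mass connecting a premise entity to a conclusion entity. The triples and the estimator are illustrative assumptions, not the paper's formal analysis.

import random

# Toy knowledge graph as (head, relation, tail) triples.
TRIPLES = [
    ("socrates", "is_a", "human"),
    ("human", "subclass_of", "mortal_being"),
    ("plato", "is_a", "human"),
]

def neighbors(node: str):
    return [(r, t) for h, r, t in TRIPLES if h == node]

def random_walk(start: str, length: int = 2):
    # Sample one reasoning path as a random walk over the graph.
    path, node = [start], start
    for _ in range(length):
        nbrs = neighbors(node)
        if not nbrs:
            break
        rel, node = random.choice(nbrs)
        path += [rel, node]
    return path

def path_mass(start: str, target: str, n_walks: int = 10000) -> float:
    # Monte-Carlo estimate of the random-walk probability mass connecting
    # `start` to `target`, i.e., aggregated indirect reasoning paths.
    hits = sum(target in random_walk(start) for _ in range(n_walks))
    return hits / n_walks

print("mass(socrates -> mortal_being):", path_mass("socrates", "mortal_being"))

A weighted sum of such path probabilities is, on the paper's account, a reasonable explanation of how an LM derives conclusions it never saw stated directly.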
- Tweets to Citations: Unveiling the Impact of Social Media Influencers on AI Research Visibility
Iain Xie Weissburg, Mehir Arora, Xinyi Wang, Liangming Pan, and William Yang Wang
In International Conference on Machine Learning (ICML), 2024
As the number of accepted papers at AI and ML conferences reaches into the thousands, it has become unclear how researchers access and read research publications. In this paper, we investigate the role of social media influencers in enhancing the visibility of machine learning research, particularly the citation counts of papers they share. We have compiled a comprehensive dataset of over 8,000 papers, spanning tweets from December 2018 to October 2023, alongside controls precisely matched by 9 key covariates. Our statistical and causal inference analysis reveals a significant increase in citations for papers endorsed by these influencers, with median citation counts 2-3 times higher than those of the control group. Additionally, the study delves into the geographic, gender, and institutional diversity of highlighted authors. Given these findings, we advocate for a responsible approach to curation, encouraging influencers to uphold the journalistic standard that includes showcasing diverse research topics, authors, and institutions.
@inproceedings{weissburg-etal-2024-tweets, title = {Tweets to Citations: Unveiling the Impact of Social Media Influencers on AI Research Visibility}, author = {Weissburg, Iain Xie and Arora, Mehir and Wang, Xinyi and Pan, Liangming and Wang, William Yang}, booktitle = {International Conference on Machine Learning (ICML)}, year = {2024}, address = {Austria}, url = {https://arxiv.org/abs/2401.13782} }
- A Survey on Data Selection for Language Models
Alon Albalak, Yanai Elazar, Sang Michael Xie, Shayne Longpre, Nathan Lambert, Xinyi Wang, Niklas Muennighoff, Bairu Hou, Liangming Pan, Haewon Jeong, Colin Raffel, Shiyu Chang, Tatsunori Hashimoto, and William Yang Wang
Transactions on Machine Learning Research (TMLR), 2024
A major factor in the recent success of large language models is the use of enormous and ever-growing text datasets for unsupervised pre-training. However, naively training a model on all available data may not be optimal (or feasible), as the quality of available text data can vary. Filtering out data can also decrease the carbon footprint and financial costs of training models by reducing the amount of training required. Data selection methods aim to determine which candidate data points to include in the training dataset and how to appropriately sample from the selected data points. The promise of improved data selection methods has caused the volume of research in the area to rapidly expand. However, because deep learning is mostly driven by empirical evidence and experimentation on large-scale data is expensive, few organizations have the resources for extensive data selection research. Consequently, knowledge of effective data selection practices has become concentrated within a few organizations, many of which do not openly share their findings and methodologies. To narrow this gap in knowledge, we present a comprehensive review of existing literature on data selection methods and related research areas, providing a taxonomy of existing approaches. By describing the current landscape of research, this work aims to accelerate progress in data selection by establishing an entry point for new and established researchers. Additionally, throughout this review we draw attention to noticeable holes in the literature and conclude the paper by proposing promising avenues for future research.
@article{DBLP:journals/corr/abs-2402-16827, author = {Albalak, Alon and Elazar, Yanai and Xie, Sang Michael and Longpre, Shayne and Lambert, Nathan and Wang, Xinyi and Muennighoff, Niklas and Hou, Bairu and Pan, Liangming and Jeong, Haewon and Raffel, Colin and Chang, Shiyu and Hashimoto, Tatsunori and Wang, William Yang}, title = {A Survey on Data Selection for Language Models}, journal = {Transactions on Machine Learning Research (TMLR)}, year = {2024}, url = {https://doi.org/10.48550/arXiv.2402.16827} }
2023
- SCITAB: A Challenging Benchmark for Compositional Reasoning and Claim Verification on Scientific Tables
Xinyuan Lu*, Liangming Pan*, Qian Liu, Preslav Nakov, and Min-Yen Kan
In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Current scientific fact-checking benchmarks exhibit several shortcomings, such as biases arising from crowd-sourced claims and an over-reliance on text-based evidence. We present SCITAB, a challenging evaluation dataset consisting of 1.2K expert-verified scientific claims that 1) originate from authentic scientific publications and 2) require compositional reasoning for verification. The claims are paired with evidence-containing scientific tables annotated with labels. Through extensive evaluations, we demonstrate that SCITAB poses a significant challenge to state-of-the-art models, including table-based pretraining models and large language models. All models except GPT-4 achieved performance barely above random guessing. Popular prompting techniques, such as Chain-of-Thought, do not achieve much performance gains on SCITAB. Our analysis uncovers several unique challenges posed by SCITAB, including table grounding, claim ambiguity, and compositional reasoning. Our codes and data are publicly available at https://github.com/XinyuanLu00/SciTab.
@inproceedings{lu-etal-2023-scitab, title = {{SCITAB}: A Challenging Benchmark for Compositional Reasoning and Claim Verification on Scientific Tables}, author = {Lu, Xinyuan and Pan, Liangming and Liu, Qian and Nakov, Preslav and Kan, Min-Yen}, booktitle = {Conference on Empirical Methods in Natural Language Processing (EMNLP)}, year = {2023}, address = {Singapore}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2023.emnlp-main.483}, pages = {7787--7813} }
- MAF: Multi-Aspect Feedback for Improving Reasoning in Large Language Models
Deepak Nathani, David Wang, Liangming Pan, and William Wang
In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Language Models (LMs) have shown impressive performance in various natural language tasks. However, when it comes to natural language reasoning, LMs still face challenges such as hallucination, generating incorrect intermediate reasoning steps, and making mathematical errors. Recent research has focused on enhancing LMs through self-improvement using feedback. Nevertheless, existing approaches relying on a single generic feedback source fail to address the diverse error types found in LM-generated reasoning chains. In this work, we propose Multi-Aspect Feedback, an iterative refinement framework that integrates multiple feedback modules, including frozen LMs and external tools, each focusing on a specific error category. Our experimental results demonstrate the efficacy of our approach in addressing several errors in the LM-generated reasoning chain and thus improving the overall performance of an LM in several reasoning tasks. We see an improvement of up to 20% in Mathematical Reasoning and up to 18% in Logical Entailment.
@inproceedings{nathani-etal-2023-maf, title = {{MAF}: Multi-Aspect Feedback for Improving Reasoning in Large Language Models}, author = {Nathani, Deepak and Wang, David and Pan, Liangming and Wang, William}, booktitle = {Conference on Empirical Methods in Natural Language Processing (EMNLP)}, year = {2023}, address = {Singapore}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2023.emnlp-main.407}, pages = {6591--6616} }
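A minimal sketch of the iterative multi-aspect feedback loop, assuming two hypothetical aspect-specific checkers and a placeholder refine step in place of the frozen-LM and tool-based feedback modules the paper integrates.

from typing import Callable, List, Tuple

Checker = Callable[[str], Tuple[bool, str]]

def math_checker(answer: str) -> Tuple[bool, str]:
    # Hypothetical aspect module targeting arithmetic errors.
    return ("= 7" not in answer), "3 + 4 should equal 7"

def logic_checker(answer: str) -> Tuple[bool, str]:
    # Hypothetical aspect module targeting logical errors (finds none here).
    return False, ""

def refine(answer: str, feedback: List[str]) -> str:
    # Placeholder for an LLM refinement call conditioned on the feedback.
    return answer.replace("= 8", "= 7")

def maf_loop(answer: str, checkers: List[Checker], max_iters: int = 3) -> str:
    # Each iteration, collect feedback from every aspect-specific module and
    # refine the answer; stop once no module reports an error.
    for _ in range(max_iters):
        feedback = [msg for check in checkers
                    for err, msg in [check(answer)] if err]
        if not feedback:
            break
        answer = refine(answer, feedback)
    return answer

print(maf_loop("3 + 4 = 8", [math_checker, logic_checker]))

Separating feedback by error category is the key design choice: each module can specialize, unlike a single generic critic.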
- INSTRUCTSCORE: Towards Explainable Text Generation Evaluation with Automatic Feedback
Wenda Xu, Danqing Wang, Liangming Pan, Zhenqiao Song, Markus Freitag, William Wang, and Lei Li
In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
EMNLP Oral Presentation
Automatically evaluating the quality of language generation is critical. Although recent learned metrics show high correlation with human judgement, these metrics do not provide explicit explanation of their verdict, nor associate the scores with defects in the generated text. To address this limitation, we present INSTRUCTSCORE, a fine-grained explainable evaluation metric for text generation. By harnessing both explicit human instruction and the implicit knowledge of GPT-4, we fine-tune a text evaluation metric based on LLaMA, producing both a score for generated text and a human readable diagnostic report. We evaluate INSTRUCTSCORE on a variety of generation tasks, including translation, captioning, data-to-text, and commonsense generation. Experiments show that our 7B model surpasses all other unsupervised metrics, including those based on 175B GPT-3 and GPT-4. Surprisingly, our INSTRUCTSCORE, even without direct supervision from human-rated data, achieves performance levels on par with state-of-the-art metrics like COMET22, which were fine-tuned on human ratings.
@inproceedings{xu-etal-2023-instructscore, title = {{INSTRUCTSCORE}: Towards Explainable Text Generation Evaluation with Automatic Feedback}, author = {Xu, Wenda and Wang, Danqing and Pan, Liangming and Song, Zhenqiao and Freitag, Markus and Wang, William and Li, Lei}, booktitle = {Conference on Empirical Methods in Natural Language Processing (EMNLP)}, year = {2023}, address = {Singapore}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2023.emnlp-main.365}, pages = {5967--5994} }
- Doolittle: Benchmarks and Corpora for Academic Writing Formalization
Shizhe Diao, Yongyu Lei, Liangming Pan, Tianqing Fang, Wangchunshu Zhou, Sedrick Keh, Min-Yen Kan, and Tong Zhang
In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Improving the quality of academic writing is a meaningful but challenging task. Conventional methods of language refinement focus on narrow, specific linguistic features within isolated sentences, such as grammatical errors and improper word use. We propose a more general task, Academic Writing Formalization (AWF), to improve the overall quality of formal academic writing at the paragraph level. We formulate this language refinement task as a formal text style transfer task which transfers informal-academic text to formal-academic and contribute a large-scale non-parallel dataset, Doolittle, for this purpose. Concurrently, we apply a method named metric-oriented reinforcement learning (MORL) to two large language models (LLMs) where we incorporate different levels of automatic feedback into the training process. Our experiments reveal that existing text transfer models and grammatical error correction models address certain aspects of AWF but still have a significant performance gap compared to human performance. Meanwhile, language models fine-tuned with our MORL method exhibit considerably improved performance, rivaling the latest chatbot ChatGPT, but still have a non-negligible gap compared to the ground truth formal-academic texts in Doolittle.
@inproceedings{diao-etal-2023-doolittle, title = {Doolittle: Benchmarks and Corpora for Academic Writing Formalization}, author = {Diao, Shizhe and Lei, Yongyu and Pan, Liangming and Fang, Tianqing and Zhou, Wangchunshu and Keh, Sedrick and Kan, Min-Yen and Zhang, Tong}, booktitle = {Conference on Empirical Methods in Natural Language Processing (EMNLP)}, year = {2023}, address = {Singapore}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2023.emnlp-main.809}, pages = {13093--13111} }
- Logic-LM: Empowering Large Language Models with Symbolic Solvers for Faithful Logical Reasoning. Liangming Pan, Alon Albalak, Xinyi Wang, and William Wang. In Findings of Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023.
Large Language Models (LLMs) have shown human-like reasoning abilities but still struggle with complex logical problems. This paper introduces a novel framework, Logic-LM, which integrates LLMs with symbolic solvers to improve logical problem-solving. Our method first utilizes LLMs to translate a natural language problem into a symbolic formulation. Afterward, a deterministic symbolic solver performs inference on the formulated problem. We also introduce a self-refinement module, which utilizes the symbolic solver’s error messages to revise symbolic formalizations. We demonstrate Logic-LM’s effectiveness on five logical reasoning datasets: ProofWriter, PrOntoQA, FOLIO, LogicalDeduction, and AR-LSAT. On average, Logic-LM achieves a significant performance boost of 39.2% over using LLM alone with standard prompting and 18.4% over LLM with chain-of-thought prompting. Our findings suggest that Logic-LM, by combining LLMs with symbolic logic, offers a promising avenue for faithful logical reasoning.
@inproceedings{pan-etal-2023-logic, title = {Logic-{LM}: Empowering Large Language Models with Symbolic Solvers for Faithful Logical Reasoning}, author = {Pan, Liangming and Albalak, Alon and Wang, Xinyi and Wang, William}, booktitle = {Findings of Conference on Empirical Methods in Natural Language Processing (EMNLP)}, year = {2023}, address = {Singapore}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2023.findings-emnlp.248}, pages = {3806--3824} }
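To make the pipeline above concrete, here is a minimal sketch of the formulate-solve-refine loop that the abstract describes. It is an illustration under stated assumptions, not the released implementation: `call_llm` and `symbolic_solver` are hypothetical callables standing in for an LLM API and a deterministic solver backend (e.g., a Prover9- or Z3-style prover).

```python
# Minimal sketch of a Logic-LM-style loop. `call_llm(prompt) -> str` and
# `symbolic_solver(formulation) -> (ok, answer_or_error)` are assumed
# interfaces, not the authors' code.

def logic_lm(problem, call_llm, symbolic_solver, max_refinements=3):
    # 1) Translate the natural-language problem into a symbolic formulation.
    formulation = call_llm(f"Translate into a symbolic formulation:\n{problem}")
    for _ in range(max_refinements):
        # 2) A deterministic solver performs the actual inference.
        ok, payload = symbolic_solver(formulation)
        if ok:
            return payload  # solver-derived answer
        # 3) Self-refinement: revise the formulation using the solver's error.
        formulation = call_llm(
            "Revise the symbolic formulation given this solver error.\n"
            f"Formulation:\n{formulation}\nError:\n{payload}"
        )
    # Fallback if no parsable formulation is found within the budget.
    return call_llm(f"Answer directly:\n{problem}")
```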
- On the Risk of Misinformation Pollution with Large Language Models. Yikang Pan*, Liangming Pan*, Wenhu Chen, Preslav Nakov, Min-Yen Kan, and William Wang. In Findings of Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023.
We investigate the potential misuse of modern Large Language Models (LLMs) for generating credible-sounding misinformation and its subsequent impact on information-intensive applications, particularly Open-Domain Question Answering (ODQA) systems. We establish a threat model and simulate potential misuse scenarios, both unintentional and intentional, to assess the extent to which LLMs can be utilized to produce misinformation. Our study reveals that LLMs can act as effective misinformation generators, leading to a significant degradation (up to 87%) in the performance of ODQA systems. Moreover, we uncover disparities in the attributes associated with persuading humans and machines, presenting an obstacle to current human-centric approaches to combat misinformation. To mitigate the harm caused by LLM-generated misinformation, we propose three defense strategies: misinformation detection, vigilant prompting, and reader ensemble. These approaches have demonstrated promising results, albeit with certain associated costs. Lastly, we discuss the practicality of utilizing LLMs as automatic misinformation generators and provide relevant resources and code to facilitate future research in this area.
@inproceedings{pan-etal-2023-risk, title = {On the Risk of Misinformation Pollution with Large Language Models}, author = {Pan, Yikang and Pan, Liangming and Chen, Wenhu and Nakov, Preslav and Kan, Min-Yen and Wang, William}, booktitle = {Findings of Conference on Empirical Methods in Natural Language Processing (EMNLP)}, year = {2023}, address = {Singapore}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2023.findings-emnlp.97}, pages = {1389--1403} }
- QACheck: A Demonstration System for Question-Guided Multi-Hop Fact-Checking. Liangming Pan, Xinyuan Lu, Min-Yen Kan, and Preslav Nakov. In Conference on Empirical Methods in Natural Language Processing: System Demonstrations (EMNLP Demo), 2023.
Fact-checking real-world claims often requires intricate, multi-step reasoning due to the absence of direct evidence to support or refute them. However, existing fact-checking systems often lack transparency in their decision-making, making it challenging for users to comprehend their reasoning process. To address this, we propose the Question-guided Multi-hop Fact-Checking (QACheck) system, which guides the model’s reasoning process by asking a series of questions critical for verifying a claim. QACheck has five key modules: a claim verifier, a question generator, a question-answering module, a QA validator, and a reasoner. Users can input a claim into QACheck, which then predicts its veracity and provides a comprehensive report detailing its reasoning process, guided by a sequence of (question, answer) pairs. QACheck also provides the source of evidence supporting each question, fostering a transparent, explainable, and user-friendly fact-checking process.
@inproceedings{pan-etal-2023-qacheck, title = {{QAC}heck: A Demonstration System for Question-Guided Multi-Hop Fact-Checking}, author = {Pan, Liangming and Lu, Xinyuan and Kan, Min-Yen and Nakov, Preslav}, booktitle = {Conference on Empirical Methods in Natural Language Processing: System Demonstrations (EMNLP Demo)}, year = {2023}, address = {Singapore}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2023.emnlp-demo.23}, pages = {264--273} }
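The five modules above compose naturally into an iterative loop. The sketch below is a hypothetical reconstruction of that control flow; `modules` is an assumed dictionary holding the five components, not the demonstration system's actual API.

```python
# Hypothetical sketch of a QACheck-style loop; each entry of `modules` is a
# placeholder for one of the five components named in the abstract.

def qacheck(claim, modules, max_steps=5):
    qa_history = []  # accumulated (question, answer, evidence) triples
    for _ in range(max_steps):
        if modules["claim_verifier"](claim, qa_history):  # enough info to decide?
            break
        question = modules["question_generator"](claim, qa_history)
        answer, evidence = modules["question_answerer"](question)
        if modules["qa_validator"](claim, question, answer):  # keep useful pairs only
            qa_history.append((question, answer, evidence))
    # The reasoner produces the final verdict plus the (question, answer) report.
    return modules["reasoner"](claim, qa_history), qa_history
```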
- Fact-Checking Complex Claims with Program-Guided Reasoning. Liangming Pan, Xiaobao Wu, Xinyuan Lu, Anh Tuan Luu, William Yang Wang, Min-Yen Kan, and Preslav Nakov. In Annual Meeting of the Association for Computational Linguistics (ACL), 2023.
Fact-checking real-world claims often requires collecting multiple pieces of evidence and applying complex multi-step reasoning. In this paper, we present Program-Guided Fact-Checking (ProgramFC), a novel fact-checking model that decomposes complex claims into simpler sub-tasks that can be solved using a shared library of specialized functions. We first leverage the in-context learning ability of large language models to generate reasoning programs to guide the verification process. Afterward, we execute the program by delegating each sub-task to the corresponding sub-task handler. This process makes our model both explanatory and data-efficient, providing clear explanations of its reasoning process while requiring minimal training data. We evaluate ProgramFC on two challenging fact-checking datasets and show that it outperforms seven fact-checking baselines across different settings of evidence availability, with explicit output programs that benefit human debugging. Our code and data are publicly available at https://github.com/mbzuai-nlp/ProgramFC.
@inproceedings{pan-etal-2023-fact, title = {Fact-Checking Complex Claims with Program-Guided Reasoning}, author = {Pan, Liangming and Wu, Xiaobao and Lu, Xinyuan and Luu, Anh Tuan and Wang, William Yang and Kan, Min-Yen and Nakov, Preslav}, booktitle = {Annual Meeting of the Association for Computational Linguistics (ACL)}, year = {2023}, address = {Toronto, Canada}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2023.acl-long.386}, pages = {6981--7004} }
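As a rough illustration of program-guided decomposition, the sketch below executes a generated reasoning program by dispatching each step to a registered sub-task handler. `generate_program` and the handler interfaces are assumptions made for illustration; the actual function library lives in the released code linked above.

```python
# Illustrative sketch of program-guided fact-checking. `generate_program`
# stands in for the LLM that emits a reasoning program (a list of
# (function_name, argument) steps); `handlers` maps function names such as
# "Question" or "Verify" to sub-task solvers. All interfaces are assumed.

def program_guided_verify(claim, generate_program, handlers):
    results = []  # outputs of earlier steps, visible to later ones
    for func_name, argument in generate_program(claim):
        # Delegate each sub-task to its corresponding handler.
        results.append(handlers[func_name](argument, results))
    return bool(results[-1])  # the final step yields the predicted veracity
```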
- Modeling What-to-ask and How-to-ask for Answer-unaware Conversational Question Generation. Xuan Long Do, Bowei Zou, Shafiq Joty, Tran Tai, Liangming Pan, Nancy Chen, and Ai Ti Aw. In Annual Meeting of the Association for Computational Linguistics (ACL), 2023.
Conversational Question Generation (CQG) is a critical task for machines to assist humans in fulfilling their information needs through conversations. The task is generally cast into two settings: answer-aware and answer-unaware. While the former facilitates the models by exposing the expected answer, the latter is more realistic and has recently received growing attention. What-to-ask and how-to-ask are the two main challenges in the answer-unaware setting. To address the first challenge, existing methods mainly select sequential sentences in the context as rationales. We argue that conversations generated with such naive heuristics may not be natural enough: in reality, interlocutors often discuss relevant content that is not necessarily sequential in the context. Additionally, previous methods decide the type of question to be generated (boolean/span-based) implicitly. Modeling the question type explicitly is crucial because the answer, which would hint to the model whether to generate a boolean or span-based question, is unavailable. To this end, we present SG-CQG, a two-stage CQG framework. In the what-to-ask stage, a sentence is selected as the rationale from a semantic graph that we construct, and an answer span is extracted from it. In the how-to-ask stage, a classifier determines the target answer type of the question via two explicit control signals before generating and filtering. In addition, we propose Conv-Distinct, a novel evaluation metric for CQG that evaluates the diversity of the conversation generated from a context. Compared with existing answer-unaware CQG models, the proposed SG-CQG achieves state-of-the-art performance.
@inproceedings{do-etal-2023-modeling, title = {Modeling What-to-ask and How-to-ask for Answer-unaware Conversational Question Generation}, author = {Do, Xuan Long and Zou, Bowei and Joty, Shafiq and Tai, Tran and Pan, Liangming and Chen, Nancy and Aw, Ai Ti}, booktitle = {Annual Meeting of the Association for Computational Linguistics (ACL)}, year = {2023}, address = {Toronto, Canada}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2023.acl-long.603}, pages = {10785--10803} }
- Attacking Open-domain Question Answering by Injecting Misinformation. Liangming Pan, Wenhu Chen, Min-Yen Kan, and William Yang Wang. In International Joint Conference on Natural Language Processing and Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (IJCNLP-AACL), 2023. AACL / IJCNLP Oral Presentation; Area Chair Award (Question Answering Track).
With the rise of false, inaccurate, and misleading information in propaganda, news, and social media, real-world Question Answering (QA) systems face the challenge of synthesizing and reasoning over misinformation-polluted contexts to derive correct answers. This urgency gives rise to the need to make QA systems robust to misinformation, a topic previously unexplored. We study the risk misinformation poses to QA models by investigating the sensitivity of open-domain QA models to corpus pollution with misinformation documents. We curate both human-written and model-generated false documents that we inject into the evidence corpus of QA models and assess the impact on their performance. Experiments show that QA models are vulnerable to even small amounts of evidence contamination brought by misinformation, with large absolute performance drops across all models. Misinformation attacks pose a greater threat when fake documents are produced at scale by neural models or when the attacker targets specific questions of interest. To defend against such threats, we discuss the necessity of building a misinformation-aware QA system that integrates question answering and misinformation detection in a joint fashion.
@inproceedings{pan-etal-2023-attacking, title = {Attacking Open-domain Question Answering by Injecting Misinformation}, author = {Pan, Liangming and Chen, Wenhu and Kan, Min-Yen and Wang, William Yang}, booktitle = {International Joint Conference on Natural Language Processing and Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (IJCNLP-AACL)}, year = {2023}, address = {Nusa Dua, Bali}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2023.ijcnlp-main.35}, pages = {525--539} }
- Investigating Zero- and Few-shot Generalization in Fact Verification. Liangming Pan, Yunxiang Zhang, and Min-Yen Kan. In International Joint Conference on Natural Language Processing and Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (IJCNLP-AACL), 2023. AACL / IJCNLP Oral Presentation.
In this paper, we explore zero- and few-shot generalization for fact verification (FV), which aims to generalize the FV model trained on well-resourced domains (e.g., Wikipedia) to low-resourced domains that lack human annotations. To this end, we first construct a benchmark dataset collection which contains 11 FV datasets representing 6 domains. We conduct an empirical analysis of generalization across these FV datasets, finding that current models generalize poorly. Our analysis reveals that several factors affect generalization, including dataset size, length of evidence, and the type of claims. Finally, we show that two directions of work improve generalization: 1) incorporating domain knowledge via pretraining on specialized domains, and 2) automatically generating training data via claim generation.
@inproceedings{pan-etal-2023-investigating, title = {Investigating Zero- and Few-shot Generalization in Fact Verification}, author = {Pan, Liangming and Zhang, Yunxiang and Kan, Min-Yen}, booktitle = {International Joint Conference on Natural Language Processing and Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (IJCNLP-AACL)}, year = {2023}, address = {Nusa Dua, Bali}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2023.ijcnlp-main.34}, pages = {511--524} }
- FollowupQG: Towards Information-Seeking Follow-up Question Generation. Yan Meng, Liangming Pan, Yixin Cao, and Min-Yen Kan. In International Joint Conference on Natural Language Processing and Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (IJCNLP-AACL), 2023. AACL / IJCNLP Oral Presentation.
Humans ask follow-up questions driven by curiosity, which reflects a creative human cognitive process. We introduce the task of real-world information-seeking follow-up question generation (FQG), which aims to generate follow-up questions seeking a more in-depth understanding of an initial question and answer. We construct FOLLOWUPQG, a dataset of over 3K real-world (initial question, answer, follow-up question) tuples collected from a Reddit forum providing layman-friendly explanations for open-ended questions. In contrast to existing datasets, questions in FOLLOWUPQG use more diverse pragmatic strategies to seek information, and they also show higher-order cognitive skills (such as applying and relating). We evaluate current question generation models on their efficacy for generating follow-up questions, exploring how to generate specific types of follow-up questions based on step-by-step demonstrations. Our results validate FOLLOWUPQG as a challenging benchmark, as model-generated questions are adequate but far from human-raised questions in terms of informativeness and complexity.
@inproceedings{meng-etal-2023-followupqg, title = {{F}ollowup{QG}: Towards Information-Seeking Follow-up Question Generation}, author = {Meng, Yan and Pan, Liangming and Cao, Yixin and Kan, Min-Yen}, booktitle = {International Joint Conference on Natural Language Processing and Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (IJCNLP-AACL)}, year = {2023}, address = {Nusa Dua, Bali}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2023.ijcnlp-main.17}, pages = {252--271} }
- Efficient Online Data Mixing For Language Model Pre-Training. Alon Albalak, Liangming Pan, Colin Raffel, and William Yang Wang. In NeurIPS Workshop on Robustness of Few-shot and Zero-shot Learning in Foundation Models (R0-FoMo@NeurIPS), 2023. Spotlight Paper.
The data used to pretrain large language models has a decisive impact on a model’s downstream performance, which has led to a large body of work on data selection methods that aim to automatically determine the most suitable data to use for pretraining. Existing data selection methods suffer from slow and computationally expensive processes, a problem amplified by the increasing size of models and of pretraining datasets. Data mixing, on the other hand, reduces the complexity of data selection by grouping data points together and determining sampling probabilities across entire groups. However, data mixing proportions are typically fixed before training and therefore cannot adapt to changing training dynamics. To address these limitations, we develop an efficient algorithm for Online Data Mixing (ODM) that combines elements from both data selection and data mixing. Based on multi-armed bandit algorithms, our online approach optimizes the data mixing proportions during training. Remarkably, our method trains a model that reaches the final perplexity of the next best method with 19% fewer training iterations, and improves performance on the 5-shot MMLU benchmark by 1.9% relative accuracy, while adding negligible wall-clock time during pretraining.
@inproceedings{albalak-etal-2023-efficient, title = {Efficient Online Data Mixing For Language Model Pre-Training}, author = {Albalak, Alon and Pan, Liangming and Raffel, Colin and Wang, William Yang}, booktitle = {NeurIPS Workshop on Robustness of Few-shot and Zero-shot Learning in Foundation Models (R0-FoMo@NeurIPS)}, year = {2023}, address = {New Orleans, USA}, url = {https://arxiv.org/abs/2312.02406} }
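Since the abstract frames online data mixing as a multi-armed bandit problem, an EXP3-style sketch conveys the core update; this is a simplified illustration (the reward definition, exploration term, and scaling are assumptions), not the paper's exact algorithm.

```python
import numpy as np

# EXP3-style online data mixing (simplified sketch). `batch_reward(k)` is an
# assumed callback returning a bounded reward for training on a batch drawn
# from domain k (e.g., a normalized per-batch loss).

def online_data_mixing(num_domains, num_steps, batch_reward, lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    log_w = np.zeros(num_domains)
    for _ in range(num_steps):
        probs = np.exp(log_w - log_w.max())
        probs /= probs.sum()
        k = rng.choice(num_domains, p=probs)   # sample a domain to train on
        r = batch_reward(k)
        log_w[k] += lr * r / probs[k]          # importance-weighted update
    probs = np.exp(log_w - log_w.max())
    return probs / probs.sum()                 # final mixing proportions
```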
- Hashtag-Guided Low-Resource Tweet Classification. Shizhe Diao, Sedrick Scott Keh, Liangming Pan, Zhiliang Tian, Yan Song, and Tong Zhang. In International World Wide Web Conference (WWW), 2023.
Social media classification tasks (e.g., tweet sentiment analysis, tweet stance detection) are challenging because social media posts are typically short, informal, and ambiguous. Thus, training on tweets is challenging and demands large-scale human-annotated labels, which are time-consuming and costly to obtain. In this paper, we find that providing hashtags for social media tweets can help alleviate this issue, because hashtags can enrich short and ambiguous tweets with various kinds of information, such as topic, sentiment, and stance. This motivates us to propose a novel Hashtag-guided Tweet Classification model (HashTation), which automatically generates meaningful hashtags for the input tweet to provide useful auxiliary signals for tweet classification. To generate high-quality and insightful hashtags, our hashtag generation model retrieves and encodes post-level and entity-level information from across the whole corpus. Experiments show that HashTation achieves significant improvements on seven low-resource tweet classification tasks, in which only a limited amount of training data is provided, showing that automatically enriching tweets with model-generated hashtags can significantly reduce the demand for large-scale human-labeled data. Further analysis demonstrates that HashTation is able to generate high-quality hashtags that are consistent with the tweets and their labels.
@inproceedings{DBLP:conf/www/DiaoKPT0023, author = {Diao, Shizhe and Keh, Sedrick Scott and Pan, Liangming and Tian, Zhiliang and Song, Yan and Zhang, Tong}, title = {Hashtag-Guided Low-Resource Tweet Classification}, booktitle = {International World Wide Web Conference (WWW)}, pages = {1415--1426}, publisher = {{ACM}}, year = {2023}, url = {https://doi.org/10.1145/3543507.3583194} }
- InfoCTM: A Mutual Information Maximization Perspective of Cross-Lingual Topic Modeling. Xiaobao Wu, Xinshuai Dong, Thong Nguyen, Chaoqun Liu, Liangming Pan, and Anh Tuan Luu. In AAAI Conference on Artificial Intelligence (AAAI), 2023.
@inproceedings{DBLP:conf/aaai/WuDNLPL23, author = {Wu, Xiaobao and Dong, Xinshuai and Nguyen, Thong and Liu, Chaoqun and Pan, Liangming and Luu, Anh Tuan}, title = {InfoCTM: {A} Mutual Information Maximization Perspective of Cross-Lingual Topic Modeling}, booktitle = {AAAI Conference on Artificial Intelligence (AAAI)}, pages = {13763--13771}, publisher = {{AAAI} Press}, year = {2023}, url = {https://doi.org/10.1609/aaai.v37i11.26612} }
2022
- KQA Pro: A Dataset with Explicit Compositional Programs for Complex Question Answering over Knowledge Base. Shulin Cao, Jiaxin Shi, Liangming Pan, Lunyiu Nie, Yutong Xiang, Lei Hou, Juanzi Li, Bin He, and Hanwang Zhang. In Annual Meeting of the Association for Computational Linguistics (ACL), 2022. ACL Oral Presentation.
Complex question answering over knowledge base (Complex KBQA) is challenging because it requires various compositional reasoning capabilities, such as multi-hop inference, attribute comparison, and set operations. Existing benchmarks have shortcomings that limit the development of Complex KBQA: 1) they only provide QA pairs without explicit reasoning processes; 2) their questions are poor in diversity or scale. To this end, we introduce KQA Pro, a dataset for Complex KBQA including around 120K diverse natural language questions. We introduce a compositional and interpretable programming language, KoPL, to represent the reasoning process of complex questions. For each question, we provide the corresponding KoPL program and SPARQL query, so that KQA Pro can serve both KBQA and semantic parsing tasks. Experimental results show that state-of-the-art KBQA methods do not achieve results on KQA Pro as promising as those on current datasets, which suggests that KQA Pro is challenging and that Complex KBQA requires further research effort. We also treat KQA Pro as a diagnostic dataset for testing multiple reasoning skills, conduct a thorough evaluation of existing models, and discuss further directions for Complex KBQA.
@inproceedings{cao-etal-2022-kqa, title = {{KQA} Pro: A Dataset with Explicit Compositional Programs for Complex Question Answering over Knowledge Base}, author = {Cao, Shulin and Shi, Jiaxin and Pan, Liangming and Nie, Lunyiu and Xiang, Yutong and Hou, Lei and Li, Juanzi and He, Bin and Zhang, Hanwang}, booktitle = {Annual Meeting of the Association for Computational Linguistics (ACL)}, year = {2022}, address = {Dublin, Ireland}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2022.acl-long.422}, pages = {6101--6119} }
- Interpreting the Robustness of Neural NLP Models to Textual Perturbations. Yunxiang Zhang, Liangming Pan, Samson Tan, and Min-Yen Kan. In Findings of Annual Meeting of the Association for Computational Linguistics (ACL), 2022.
Modern Natural Language Processing (NLP) models are known to be sensitive to input perturbations and their performance can decrease when applied to real-world, noisy data. However, it is still unclear why models are less robust to some perturbations than others. In this work, we test the hypothesis that the extent to which a model is affected by an unseen textual perturbation (robustness) can be explained by the learnability of the perturbation (defined as how well the model learns to identify the perturbation with a small amount of evidence). We further give a causal justification for the learnability metric. We conduct extensive experiments with four prominent NLP models — TextRNN, BERT, RoBERTa and XLNet — over eight types of textual perturbations on three datasets. We show that a model which is better at identifying a perturbation (higher learnability) becomes worse at ignoring such a perturbation at test time (lower robustness), providing empirical support for our hypothesis.
@inproceedings{zhang-etal-2022-interpreting, title = {Interpreting the Robustness of Neural {NLP} Models to Textual Perturbations}, author = {Zhang, Yunxiang and Pan, Liangming and Tan, Samson and Kan, Min-Yen}, booktitle = {Findings of Annual Meeting of the Association for Computational Linguistics (ACL)}, year = {2022}, address = {Dublin, Ireland}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2022.findings-acl.315}, pages = {3993--4007} }
- KHANQ: A Dataset for Generating Deep Questions in Education. Huanli Gong, Liangming Pan, and Hengchang Hu. In International Conference on Computational Linguistics (COLING), 2022.
Designing in-depth educational questions is a time-consuming and cognitively demanding task. Therefore, it is intriguing to study how to build Question Generation (QG) models to automate the question creation process. However, existing QG datasets are not suitable for educational question generation because the questions are not real questions asked by humans during learning and can be solved by simply searching for information. To bridge this gap, we present KHANQ, a challenging dataset for educational question generation, containing 1,034 high-quality learner-generated questions seeking an in-depth understanding of the taught online courses in Khan Academy. Each data sample is carefully paraphrased and annotated as a triple of 1) Context: an independent paragraph on which the question is based; 2) Prompt: a text prompt for the question (e.g., the learner’s background knowledge); 3) Question: a deep question based on Context and coherent with Prompt. By conducting a human evaluation on the aspects of appropriateness, coverage, coherence, and complexity, we show that state-of-the-art QG models which perform well on shallow question generation datasets have difficulty in generating useful educational questions. This makes KHANQ a challenging testbed for educational question generation.
@inproceedings{gong-etal-2022-khanq, title = {{KHANQ}: A Dataset for Generating Deep Questions in Education}, author = {Gong, Huanli and Pan, Liangming and Hu, Hengchang}, booktitle = {International Conference on Computational Linguistics (COLING)}, year = {2022}, address = {Gyeongju, Republic of Korea}, publisher = {International Committee on Computational Linguistics}, url = {https://aclanthology.org/2022.coling-1.518}, pages = {5925--5938} }
- CoHS-CQG: Context and History Selection for Conversational Question Generation. Xuan Long Do, Bowei Zou, Liangming Pan, Nancy F. Chen, Shafiq Joty, and Ai Ti Aw. In International Conference on Computational Linguistics (COLING), 2022.
Conversational question generation (CQG) serves as a vital task for machines to assist humans, for example in interactive reading comprehension, through conversations. Compared to traditional single-turn question generation (SQG), CQG is more challenging in the sense that the generated question is required not only to be meaningful, but also to align with the provided conversation. Previous studies mainly focus on how to model the flow and alignment of the conversation, but do not thoroughly study which parts of the context and history are necessary for the model. We believe that shortening the context and history is crucial, as it helps the model optimize for the conversational alignment property. To this end, we propose CoHS-CQG, a two-stage CQG framework that adopts a novel CoHS module to shorten the context and history of the input. In particular, it selects the top-p sentences and history turns by calculating their relevance scores. Our model achieves state-of-the-art performance on CoQA in both the answer-aware and answer-unaware settings.
@inproceedings{do-etal-2022-cohs, title = {{C}o{HS}-{CQG}: Context and History Selection for Conversational Question Generation}, author = {Do, Xuan Long and Zou, Bowei and Pan, Liangming and Chen, Nancy F. and Joty, Shafiq and Aw, Ai Ti}, booktitle = {International Conference on Computational Linguistics (COLING)}, year = {2022}, address = {Gyeongju, Republic of Korea}, publisher = {International Committee on Computational Linguistics}, url = {https://aclanthology.org/2022.coling-1.48}, pages = {580--591} }
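The top-p selection step described above can be pictured as follows: score every candidate context sentence or history turn, then keep the smallest high-scoring set whose normalized scores accumulate past p. The `relevance` callable and the exact normalization are assumptions made for illustration, not the paper's implementation.

```python
# Sketch of a CoHS-style top-p shortening step. `relevance(candidate)` is an
# assumed scoring function (e.g., embedding similarity to the current turn).

def select_top_p(candidates, relevance, p=0.9):
    scores = [relevance(c) for c in candidates]
    total = sum(scores)
    # Rank candidates by relevance, then keep the smallest prefix whose
    # normalized scores accumulate past the threshold p.
    order = sorted(range(len(candidates)), key=lambda i: -scores[i])
    kept, cum = [], 0.0
    for i in order:
        kept.append(candidates[i])
        cum += scores[i] / total
        if cum >= p:
            break
    return kept
```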
- Automatic True/False Question Generation for Educational Purpose. Bowei Zou, Pengfei Li, Liangming Pan, and Ai Ti Aw. In NAACL Workshop on Innovative Use of NLP for Building Educational Applications (BEA@NAACL), 2022.
In the field of teaching, true/false questioning is an important educational method for assessing students' general understanding of learning materials. Manually creating such questions requires extensive human effort and expert knowledge. Question Generation (QG) techniques offer the possibility of automatically generating a large number of questions. However, there is limited work on automatic true/false question generation due to the lack of training data and the difficulty of finding question-worthy content. In this paper, we propose an unsupervised True/False Question Generation approach (TF-QG) that automatically generates true/false questions from a given passage for reading comprehension tests. TF-QG consists of a template-based framework that tests specific knowledge in the passage by leveraging various NLP techniques, and a generative framework that generates more flexible and complicated questions using a novel masking-and-infilling strategy. Human evaluation shows that our approach can generate high-quality and valuable true/false questions. In addition, simulated testing on the generated questions challenges state-of-the-art inference models from NLI, QA, and fact verification tasks.
@inproceedings{zou-etal-2022-automatic, title = {Automatic True/False Question Generation for Educational Purpose}, author = {Zou, Bowei and Li, Pengfei and Pan, Liangming and Aw, Ai Ti}, booktitle = {NAACL Workshop on Innovative Use of NLP for Building Educational Applications (BEA@NAACL)}, year = {2022}, address = {Seattle, Washington}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2022.bea-1.10}, pages = {61--70} }
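The masking-and-infilling strategy can be approximated with an off-the-shelf fill-mask model: mask an information-bearing span of a true sentence and accept a fill that differs from the original, yielding a candidate false statement. The sketch below uses the Hugging Face `fill-mask` pipeline as a stand-in; the paper's actual generator and span-selection logic are not reproduced here.

```python
from transformers import pipeline

# Rough sketch of masking-and-infilling for false-statement generation,
# using an off-the-shelf masked LM as a stand-in for the paper's generator.
fill = pipeline("fill-mask", model="roberta-base")

def make_false_statement(sentence, span):
    # Mask the chosen information-bearing span (assumed single-token here).
    masked = sentence.replace(span, fill.tokenizer.mask_token, 1)
    for candidate in fill(masked):
        token = candidate["token_str"].strip()
        if token.lower() != span.lower():  # a different fill => likely false
            return masked.replace(fill.tokenizer.mask_token, token)
    return None  # no usable replacement among the top predictions
```

For example, `make_false_statement("The capital of France is Paris.", "Paris")` would typically return a statement naming a different city, which can then serve as a "false" test item.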
- Ingredient-enriched Recipe Generation from Cooking Videos. Jianlong Wu, Liangming Pan, Jingjing Chen, and Yu-Gang Jiang. In International Conference on Multimedia Retrieval (ICMR), 2022.
Cooking video captioning aims to generate the text instructions that describe the cooking procedures presented in a video. Current approaches tend to use larger neural models or more robust feature extractors to increase the expressive power of features, ignoring the strong correlation between consecutive cooking steps in the video. However, it is intuitive that previous cooking steps can provide clues for the next one; in particular, consecutive cooking steps tend to share the same ingredients, so accurate ingredient recognition can introduce more fine-grained information into captioning. To improve procedural captioning of cooking videos, this paper proposes a framework with an ingredient recognition module that uses a copy mechanism to fuse the predicted ingredient information into the generated sentence. Moreover, we integrate the visual information of the previous step into the generation of the current step, so that the visual information of the two steps jointly assists the generation process. Extensive experiments verify the effectiveness of the proposed framework, which achieves promising performance on both the YouCookII and Cooking-COIN datasets.
@inproceedings{DBLP:conf/mir/WuPCJ22, author = {Wu, Jianlong and Pan, Liangming and Chen, Jingjing and Jiang, Yu{-}Gang}, title = {Ingredient-enriched Recipe Generation from Cooking Videos}, booktitle = {International Conference on Multimedia Retrieval (ICMR)}, pages = {249--257}, publisher = {{ACM}}, year = {2022}, url = {https://doi.org/10.1145/3512527.3531388} }
- Modeling and Leveraging Prerequisite Context in Recommendation. Hengchang Hu, Liangming Pan, Yiding Ran, and Min-Yen Kan. In RecSys Workshop on Context-Aware Recommender Systems (CARS@RecSys), 2022.
Prerequisites can play a crucial role in users' decision-making, yet recommendation systems have not fully utilized such contextual background knowledge. Traditional recommendation systems (RS) mostly enrich user-item interactions with context consisting of static user profiles and item descriptions, ignoring the contextual logic and constraints that underlie them. For example, an RS may recommend an item on the condition that the user has interacted with another item as its prerequisite. Modeling prerequisite context from conceptual side information can overcome this weakness. We propose Prerequisite Driven Recommendation (PDR), a generic context-aware framework in which prerequisite context is explicitly modeled to facilitate recommendation. We first design a Prerequisite Knowledge Linking (PKL) algorithm to curate datasets facilitating PDR research. Employing it, we build a 75k+ high-quality prerequisite concept dataset spanning three domains. We then contribute PDRS, a neural instantiation of PDR. By jointly optimizing the prerequisite learning and recommendation tasks through multi-layer perceptrons, we find that PDRS consistently outperforms baseline models in all three domains, by an average margin of 7.41%. Importantly, PDRS performs especially well in cold-start scenarios, with improvements of up to 17.65%.
@inproceedings{he-etal-2022-modeling, title = {Modeling and Leveraging Prerequisite Context in Recommendation}, author = {Hu, Hengchang and Pan, Liangming and Ran, Yiding and Kan, Min{-}Yen}, booktitle = {RecSys Workshop on Context-Aware Recommender Systems (CARS@RecSys)}, year = {2022}, address = {Seattle, WA, USA}, url = {https://arxiv.org/abs/2209.11471} }
2021
- Zero-shot Fact Verification by Claim Generation. Liangming Pan, Wenhu Chen, Wenhan Xiong, Min-Yen Kan, and William Yang Wang. In Annual Meeting of the Association for Computational Linguistics (ACL), 2021. ACL Oral Presentation.
Neural models for automated fact verification have achieved promising results thanks to the availability of large, human-annotated datasets. However, for each new domain that requires fact verification, creating a dataset by manually writing claims and linking them to their supporting evidence is expensive. We develop QACG, a framework for training a robust fact verification model using automatically generated claims that can be supported, refuted, or unverifiable based on evidence from Wikipedia. QACG generates question-answer pairs from the evidence and then converts them into different types of claims. Experiments on the FEVER dataset show that our QACG framework significantly reduces the demand for human-annotated training data. In a zero-shot scenario, QACG improves a RoBERTa model's F1 from 50% to 77%, equivalent in performance to 2K+ manually curated examples. Our QACG code is publicly available.
@inproceedings{pan-etal-2021-zero, title = {Zero-shot Fact Verification by Claim Generation}, author = {Pan, Liangming and Chen, Wenhu and Xiong, Wenhan and Kan, Min-Yen and Wang, William Yang}, booktitle = {Annual Meeting of the Association for Computational Linguistics (ACL)}, year = {2021}, address = {Online}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2021.acl-short.61}, pages = {476--483} }
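The QACG recipe (generate question-answer pairs from evidence, then rewrite them as labeled claims) can be sketched as follows. All callables are assumed components, and the handling of refuted and NEI claims below is a simplification of the paper's three claim types.

```python
# Sketch of claim generation for zero-shot fact verification. The three
# callables are assumed components: a QA-pair generator over the evidence,
# a QA-to-claim rewriter, and an answer corruptor for refuted claims.

def generate_training_claims(evidence, generate_qa_pairs, qa_to_claim, corrupt_answer):
    examples = []
    for question, answer in generate_qa_pairs(evidence):
        # A faithful (question, answer) pair rewrites into a supported claim.
        examples.append((qa_to_claim(question, answer), evidence, "SUPPORTED"))
        # Swapping in a plausible-but-wrong answer yields a refuted claim.
        wrong = corrupt_answer(question, answer)
        examples.append((qa_to_claim(question, wrong), evidence, "REFUTED"))
    # NEI-style claims (not shown) can pair generated claims with unrelated evidence.
    return examples
```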
- Unsupervised Multi-hop Question Answering by Question Generation. Liangming Pan, Wenhu Chen, Wenhan Xiong, Min-Yen Kan, and William Yang Wang. In Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2021. NAACL Oral Presentation.
Obtaining training data for multi-hop question answering (QA) is time-consuming and resource-intensive. We explore the possibility of training a well-performing multi-hop QA model without referencing any human-labeled multi-hop question-answer pairs, i.e., unsupervised multi-hop QA. We propose MQA-QG, an unsupervised framework that can generate human-like multi-hop training data from both homogeneous and heterogeneous data sources. MQA-QG generates questions by first selecting or generating relevant information from each data source and then integrating the multiple pieces of information to form a multi-hop question. Using only generated training data, we can train a competent multi-hop QA model that achieves 61% and 83% of the supervised learning performance on the HybridQA and HotpotQA datasets, respectively. We also show that pretraining the QA system with the generated data greatly reduces the demand for human-annotated training data.
@inproceedings{pan-etal-2021-unsupervised, title = {Unsupervised Multi-hop Question Answering by Question Generation}, author = {Pan, Liangming and Chen, Wenhu and Xiong, Wenhan and Kan, Min-Yen and Wang, William Yang}, booktitle = {Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)}, year = {2021}, address = {Online}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2021.naacl-main.469}, pages = {5866--5880} }
- A Hybrid Approach for Detecting Prerequisite Relations in Multi-Modal Food Recipes. Liangming Pan, Jingjing Chen, Shaoteng Liu, Chong-Wah Ngo, Min-Yen Kan, and Tat-Seng Chua. IEEE Transactions on Multimedia (TMM), 2021.
Modeling the structure of culinary recipes is the core of recipe representation learning. Current approaches mostly focus on extracting the workflow graph from recipes based on text descriptions. Process images, which constitute an important part of cooking recipes, have rarely been investigated in recipe structure modeling. We study this recipe structure problem from a multi-modal learning perspective by proposing a prerequisite tree to represent recipes with cooking images at step-level granularity. We propose a simple yet effective two-stage framework to automatically construct the prerequisite tree for a recipe by (1) using a trained classifier, which fuses multi-modal features as input, to detect pairwise prerequisite relations, and then (2) applying different strategies (greedy method, maximum weight, and beam search) to build the tree structure. Experiments on the MM-ReS dataset demonstrate the advantages of introducing process images for recipe structure modeling. Moreover, compared with neural methods that require large amounts of training data, we show that our two-stage pipeline can achieve promising results using only 400 labeled prerequisite trees as training data.
@article{DBLP:journals/tmm/PanCLNKC21, author = {Pan, Liangming and Chen, Jingjing and Liu, Shaoteng and Ngo, Chong{-}Wah and Kan, Min{-}Yen and Chua, Tat{-}Seng}, title = {A Hybrid Approach for Detecting Prerequisite Relations in Multi-Modal Food Recipes}, journal = {IEEE Transactions on Multimedia (TMM)}, volume = {23}, pages = {4491--4501}, year = {2021}, url = {https://doi.org/10.1109/TMM.2020.3042706} }
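Of the three tree-construction strategies mentioned above, the greedy one is the easiest to sketch: attach each step to the earlier step with the highest predicted prerequisite probability. `pairwise_score` is an assumed stand-in for the trained multi-modal classifier, so this is an illustration rather than the paper's code.

```python
# Greedy tree construction from pairwise prerequisite scores (sketch).
# `pairwise_score(a, b)` stands in for the trained classifier's predicted
# probability that step `a` is a prerequisite of step `b`.

def build_prerequisite_tree(steps, pairwise_score):
    parents = {}
    for j in range(1, len(steps)):
        # Attach step j to its highest-scoring earlier step.
        parents[j] = max(range(j), key=lambda i: pairwise_score(steps[i], steps[j]))
    return parents  # maps step index -> parent index; step 0 is the root
```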
2020
- Semantic Graphs for Generating Deep Questions. Liangming Pan, Yuxi Xie, Yansong Feng, Tat-Seng Chua, and Min-Yen Kan. In Annual Meeting of the Association for Computational Linguistics (ACL), 2020. ACL Oral Presentation.
This paper proposes the problem of Deep Question Generation (DQG), which aims to generate complex questions that require reasoning over multiple pieces of information in the input passage. In order to capture the global structure of the document and facilitate reasoning, we propose a novel framework that first constructs a semantic-level graph for the input document and then encodes the semantic graph with an attention-based GGNN (Att-GGNN). Afterward, we fuse the document-level and graph-level representations to perform joint training of content selection and question decoding. On the HotpotQA deep-question-centric dataset, our model greatly improves performance on questions requiring reasoning over multiple facts, achieving state-of-the-art performance.
@inproceedings{pan-etal-2020-semantic, title = {Semantic Graphs for Generating Deep Questions}, author = {Pan, Liangming and Xie, Yuxi and Feng, Yansong and Chua, Tat-Seng and Kan, Min-Yen}, booktitle = {Annual Meeting of the Association for Computational Linguistics (ACL)}, year = {2020}, address = {Online}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2020.acl-main.135}, pages = {1463--1475} }
- Expertise Style Transfer: A New Task Towards Better Communication between Experts and Laymen. Yixin Cao, Ruihao Shui, Liangming Pan, Min-Yen Kan, Zhiyuan Liu, and Tat-Seng Chua. In Annual Meeting of the Association for Computational Linguistics (ACL), 2020. ACL Oral Presentation.
The curse of knowledge can impede communication between experts and laymen. We propose a new task of expertise style transfer and contribute a manually annotated dataset with the goal of alleviating such cognitive biases. Solving this task not only simplifies the professional language, but also improves the accuracy and expertise level of laymen descriptions using simple words. This is a challenging task, unaddressed in previous work, as it requires the models to have expert intelligence in order to modify text with a deep understanding of domain knowledge and structures. We establish the benchmark performance of five state-of-the-art models for style transfer and text simplification. The results demonstrate a significant gap between machine and human performance. We also discuss the challenges of automatic evaluation, to provide insights into future research directions.
@inproceedings{cao-etal-2020-expertise, title = {Expertise Style Transfer: A New Task Towards Better Communication between Experts and Laymen}, author = {Cao, Yixin and Shui, Ruihao and Pan, Liangming and Kan, Min-Yen and Liu, Zhiyuan and Chua, Tat-Seng}, booktitle = {Annual Meeting of the Association for Computational Linguistics (ACL)}, year = {2020}, address = {Online}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2020.acl-main.100}, pages = {1061--1071} }
- Exploring and Evaluating Attributes, Values, and Structures for Entity Alignment. Zhiyuan Liu, Yixin Cao, Liangming Pan, Juanzi Li, Zhiyuan Liu, and Tat-Seng Chua. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020.
Entity alignment (EA) aims at building a unified Knowledge Graph (KG) of rich content by linking the equivalent entities from various KGs. GNN-based EA methods present promising performance by modeling the KG structure defined by relation triples. However, attribute triples can also provide a crucial alignment signal but have not yet been well explored. In this paper, we propose to utilize an attributed value encoder and partition the KG into subgraphs to model the various types of attribute triples efficiently. Besides, the performance of current EA methods is overestimated because of the name bias of existing EA datasets. To make an objective evaluation, we propose a hard experimental setting where we select equivalent entity pairs with very different names as the test set. Under both the regular and hard settings, our method achieves significant improvements (5.10% on average Hits@1 in DBP15k) over 12 baselines on cross-lingual and monolingual datasets. Ablation studies on different subgraphs and a case study about attribute types further demonstrate the effectiveness of our method.
@inproceedings{liu-etal-2020-exploring, title = {Exploring and Evaluating Attributes, Values, and Structures for Entity Alignment}, author = {Liu, Zhiyuan and Cao, Yixin and Pan, Liangming and Li, Juanzi and Liu, Zhiyuan and Chua, Tat-Seng}, booktitle = {Conference on Empirical Methods in Natural Language Processing (EMNLP)}, year = {2020}, address = {Online}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2020.emnlp-main.515}, pages = {6355--6364} }
- Exploring Question-Specific Rewards for Generating Deep Questions. Yuxi Xie, Liangming Pan*, Dongzhe Wang, Min-Yen Kan, and Yansong Feng. In International Conference on Computational Linguistics (COLING), 2020.
Recent question generation (QG) approaches often utilize the sequence-to-sequence framework (Seq2Seq) to optimize the log-likelihood of ground-truth questions using teacher forcing. However, this training objective is inconsistent with actual question quality, which is often reflected by certain global properties, such as whether the question can be answered by the document. As such, we directly optimize for QG-specific objectives via reinforcement learning to improve question quality. We design three different rewards that target the fluency, relevance, and answerability of the generated questions. We conduct both automatic and human evaluations, in addition to thorough analysis, to explore the effect of each QG-specific reward. We find that optimizing for question-specific rewards generally leads to better performance in automatic evaluation metrics. However, only the rewards that correlate well with human judgement (e.g., relevance) lead to real improvements in question quality. Optimizing for the others, especially answerability, introduces incorrect bias to the model, resulting in poorer question quality.
@inproceedings{xie-etal-2020-exploring, title = {Exploring Question-Specific Rewards for Generating Deep Questions}, author = {Xie, Yuxi and Pan, Liangming and Wang, Dongzhe and Kan, Min-Yen and Feng, Yansong}, booktitle = {International Conference on Computational Linguistics (COLING)}, year = {2020}, address = {Barcelona, Spain (Online)}, publisher = {International Committee on Computational Linguistics}, url = {https://aclanthology.org/2020.coling-main.228}, pages = {2534--2546} }
- Multi-modal Cooking Workflow Construction for Food Recipes. Liangming Pan, Jingjing Chen, Jianlong Wu, Shaoteng Liu, Chong-Wah Ngo, Min-Yen Kan, Yu-Gang Jiang, and Tat-Seng Chua. In ACM International Conference on Multimedia (ACM MM), 2020. ACM MM Oral Presentation.
Understanding a food recipe requires anticipating the implicit causal effects of cooking actions, so that the recipe can be converted into a graph describing its temporal workflow. This is a non-trivial task that involves common-sense reasoning. However, existing efforts rely on hand-crafted features to extract the workflow graph from recipes due to the lack of large-scale labeled datasets. Moreover, they fail to utilize the cooking images, which constitute an important part of food recipes. In this paper, we build MM-ReS, the first large-scale dataset for cooking workflow construction, consisting of 9,850 recipes with human-labeled workflow graphs. Cooking steps are multi-modal, featuring both text instructions and cooking images. We then propose a neural encoder-decoder model that utilizes both visual and textual information to construct the cooking workflow, achieving an over 20% performance gain over existing hand-crafted baselines.
@inproceedings{DBLP:conf/mm/PanCWLNKJC20, author = {Pan, Liangming and Chen, Jingjing and Wu, Jianlong and Liu, Shaoteng and Ngo, Chong{-}Wah and Kan, Min{-}Yen and Jiang, Yu{-}Gang and Chua, Tat{-}Seng}, title = {Multi-modal Cooking Workflow Construction for Food Recipes}, booktitle = {ACM International Conference on Multimedia (ACM MM)}, pages = {1132--1141}, publisher = {{ACM}}, year = {2020}, url = {https://doi.org/10.1145/3394171.3413765} }
- Hyperbolic Visual Embedding Learning for Zero-Shot Recognition. Shaoteng Liu, Jingjing Chen, Liangming Pan, Chong-Wah Ngo, Tat-Seng Chua, and Yu-Gang Jiang. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
This paper proposes a Hyperbolic Visual Embedding Learning Network for zero-shot recognition. The network learns image embeddings in hyperbolic space, which is capable of preserving the hierarchical structure of semantic classes in low dimensions. Compared with existing zero-shot learning approaches, the network is more robust because the embedding features in hyperbolic space better represent the class hierarchy and thereby avoid being misled by unrelated siblings. Our network outperforms existing baselines under hierarchical evaluation in an extremely challenging setting, i.e., learning from only 1,000 categories to recognize 20,841 unseen categories. Under flat evaluation, it performs competitively with state-of-the-art methods while using embedding dimensions five times lower.
@inproceedings{DBLP:conf/cvpr/LiuCPNCJ20, author = {Liu, Shaoteng and Chen, Jingjing and Pan, Liangming and Ngo, Chong{-}Wah and Chua, Tat{-}Seng and Jiang, Yu{-}Gang}, title = {Hyperbolic Visual Embedding Learning for Zero-Shot Recognition}, booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, pages = {9270--9278}, publisher = {Computer Vision Foundation / {IEEE}}, year = {2020}, url = {https://openaccess.thecvf.com/content_CVPR_2020/html/Liu_Hyperbolic_Visual_Embedding_Learning_for_Zero-Shot_Recognition_CVPR_2020_paper.html} }
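The key property here is the hyperbolic metric itself: distances near the boundary of the Poincaré ball grow rapidly, which lets low-dimensional embeddings encode tree-like class hierarchies. Below is the standard Poincaré-ball distance, shown for reference under the assumption that embeddings live in the Poincaré ball; it is not the paper's training code.

```python
import numpy as np

# Standard distance on the Poincaré ball, the metric commonly underlying
# hyperbolic embedding methods. Inputs must lie strictly inside the unit ball.

def poincare_distance(x, y):
    sq_diff = np.sum((x - y) ** 2)
    denom = (1.0 - np.sum(x ** 2)) * (1.0 - np.sum(y ** 2))
    return float(np.arccosh(1.0 + 2.0 * sq_diff / denom))

# Points nearer the boundary are "farther apart" than their Euclidean
# distance suggests, which is what preserves hierarchy in low dimensions.
print(poincare_distance(np.array([0.0, 0.0]), np.array([0.5, 0.0])))  # ~1.099
print(poincare_distance(np.array([0.4, 0.0]), np.array([0.9, 0.0])))  # larger (~2.08)
```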
- Zero-Shot Ingredient Recognition by Multi-Relational Graph Convolutional Network. Jingjing Chen, Liangming Pan, Zhipeng Wei, Xiang Wang, Chong-Wah Ngo, and Tat-Seng Chua. In AAAI Conference on Artificial Intelligence (AAAI), 2020.
Recognizing the ingredients of a given dish image is at the core of automatic dietary assessment, attracting increasing attention from both industry and academia. Nevertheless, the task is challenging due to the difficulty of collecting and labeling sufficient training data. On one hand, there are hundreds of thousands of food ingredients in the world, ranging from common to rare, and collecting training samples for all ingredient categories is difficult. On the other hand, as ingredient appearances exhibit huge visual variance during food preparation, robust recognition requires training samples collected under different cooking and cutting methods. Since obtaining sufficient fully annotated training data is not easy, a more practical way of scaling up recognition is to develop models capable of recognizing unseen ingredients. Therefore, in this paper, we target the problem of ingredient recognition with zero training samples. More specifically, we introduce a multi-relational GCN (graph convolutional network) that integrates ingredient hierarchy, attributes, and co-occurrence for zero-shot ingredient recognition. Extensive experiments on both Chinese and Japanese food datasets demonstrate the superior performance of the multi-relational GCN and shed light on zero-shot ingredient recognition.
@inproceedings{DBLP:conf/aaai/ChenPWWNC20, author = {Chen, Jingjing and Pan, Liangming and Wei, Zhipeng and Wang, Xiang and Ngo, Chong{-}Wah and Chua, Tat{-}Seng}, title = {Zero-Shot Ingredient Recognition by Multi-Relational Graph Convolutional Network}, booktitle = {AAAI Conference on Artificial Intelligence (AAAI)}, pages = {10542--10550}, publisher = {{AAAI} Press}, year = {2020}, url = {https://doi.org/10.1609/aaai.v34i07.6626} }
2017
- Prerequisite Relation Learning for Concepts in MOOCs. Liangming Pan, Chengjiang Li, Juanzi Li, and Jie Tang. In Annual Meeting of the Association for Computational Linguistics (ACL), 2017.
What prerequisite knowledge should students master before moving forward to learn subsequent courseware? We study the extent to which the prerequisite relations between knowledge concepts in Massive Open Online Courses (MOOCs) can be inferred automatically; in particular, what kinds of information can be leveraged to uncover potential prerequisite relations between knowledge concepts. We first propose a representation-learning-based method for learning latent representations of course concepts, and then investigate how different features capture the prerequisite relations between concepts. Our experiments on three datasets from Coursera show that the proposed method achieves significant improvements (+5.9–48.0% by F1-score) compared with existing methods.
@inproceedings{pan-etal-2017-prerequisite, title = {Prerequisite Relation Learning for Concepts in {MOOC}s}, author = {Pan, Liangming and Li, Chengjiang and Li, Juanzi and Tang, Jie}, booktitle = {Annual Meeting of the Association for Computational Linguistics (ACL)}, year = {2017}, address = {Vancouver, Canada}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/P17-1133}, pages = {1447--1456} }
- Course Concept Extraction in MOOCs via Embedding-Based Graph Propagation. Liangming Pan, Xiaochen Wang, Chengjiang Li, Juanzi Li, and Jie Tang. In International Joint Conference on Natural Language Processing (IJCNLP), 2017.
Massive Open Online Courses (MOOCs), offering a new way to study online, are revolutionizing education. One challenging issue in MOOCs is how to design effective and fine-grained course concepts such that students with different backgrounds can grasp the essence of the course. In this paper, we conduct a systematic investigation of the problem of course concept extraction for MOOCs. We propose to learn latent representations for candidate concepts via an embedding-based method. Moreover, we develop a graph-based propagation algorithm to rank the candidate concepts based on the learned representations. We evaluate the proposed method using different courses from XuetangX and Coursera. Experimental results show that our method significantly outperforms all the alternative methods (+0.013–0.318 in terms of R-precision; p << 0.01, t-test).
@inproceedings{pan-etal-2017-course, title = {Course Concept Extraction in {MOOC}s via Embedding-Based Graph Propagation}, author = {Pan, Liangming and Wang, Xiaochen and Li, Chengjiang and Li, Juanzi and Tang, Jie}, booktitle = {International Joint Conference on Natural Language Processing (IJCNLP)}, year = {2017}, address = {Taipei, Taiwan}, publisher = {Asian Federation of Natural Language Processing}, url = {https://aclanthology.org/I17-1088}, pages = {875--884} }
PhD thesis
2022
- Towards Generating Deep Questions from Text. National University of Singapore, 2022.