Thi Nhat, Anh Nguyen, Mingwei Shen, Karen Hovsepian
The paper proposes a model for large-scale unsupervised abstractive summarization of customer reviews in e-commerce. The model addresses the challenge of reducing generic and uninformative content and producing useful information related to specific product aspects by modeling reviews in the context of topical classes of interest. The proposed model can generate class-specific summaries from multiple reviews of each product without ground-truth summaries, using only class probabilities or labels. The model combines a generative variational autoencoder with a class-correlation gating mechanism and a hierarchical structure. Human evaluation shows that the generated summaries are relevant, fluent, and representative, and evaluation using a reference dataset shows that the model outperforms state-of-the-art abstractive and extractive baselines.
Masaru Isonuma, Junichiro Mori, Danushka Bollegala, Ichiro Sakata
The paper presents a new method for summarizing opinionated texts using a recursive Gaussian mixture model. The model generates sentences with tree-structured topic guidance, where the root sentence conveys generic content, and the leaf sentences describe specific topics. Experimental results show that the generated topic sentences are more informative and cover more input contents than those generated by recent unsupervised summarization models. The paper also demonstrates that the variance of latent Gaussians represents the granularity of sentences, similar to Gaussian word embedding.
Arthur Bražinskas, Mirella Lapata, Ivan Titov
The paper discusses the task of opinion summarization, which involves automatically creating summaries that reflect subjective information expressed in multiple documents, such as product reviews. While previous work has focused on selecting fragments from input reviews to produce a summary, the authors propose a generative model that can produce abstractive summaries by generating novel sentences. They consider the unsupervised setting, where no summaries are used in training, and define a hierarchical variational autoencoder model that can control the "amount of novelty" in new reviews. At test time, the model produces summaries that reflect consensus opinions by forcing the novelty to be minimal. Experiments on Amazon and Yelp datasets show that setting the review's latent code to its mean allows the model to produce fluent and coherent summaries.
Juan Ramirez-Orta, Evangelos Milios
The paper proposes a simple and fast method for summarizing documents of any size using sentence embeddings produced by deep language models. The method is based on graph centrality and can satisfy any length constraint on the summaries it produces. It performs competitively with more sophisticated supervised methods and can serve as a proxy for abstractive summarization techniques.
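The centrality idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `embed` function is a bag-of-words stand-in for the deep sentence embeddings the paper uses, degree centrality over cosine similarity is one assumed choice of centrality measure, and the names `embed`, `cosine`, and `summarize` are hypothetical.

```python
import math
from collections import Counter


def embed(sentence):
    """Stand-in embedding (assumption): a bag-of-words count vector."""
    return Counter(sentence.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def summarize(sentences, max_sentences):
    """Keep the max_sentences most central sentences, in document order."""
    vectors = [embed(s) for s in sentences]
    # Degree centrality: total similarity of a sentence to all the others.
    centrality = [
        sum(cosine(vectors[i], vectors[j]) for j in range(len(vectors)) if j != i)
        for i in range(len(vectors))
    ]
    ranked = sorted(range(len(sentences)), key=lambda i: centrality[i], reverse=True)
    chosen = sorted(ranked[:max_sentences])  # restore original order
    return [sentences[i] for i in chosen]


if __name__ == "__main__":
    doc = [
        "the cat sat on the mat",
        "the cat lay on the mat",
        "stock prices fell sharply today",
    ]
    print(summarize(doc, 2))
```

The length constraint is enforced here by a sentence budget; a word or character budget could be applied in the same loop.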
Jhen-Yi Wu, Ying-Jia Lin, Hung-Yu Kao
The paper discusses the importance of content frequency in abstractive summarization and proposes a two-stage training framework for the model to learn the frequency of each semantic unit in the source text. The model is trained in an unsupervised manner and identifies sentences with high-frequency semantic units during inference to generate summaries. The model outperforms other unsupervised methods on the CNN/Daily Mail summarization task and achieves competitive ROUGE scores with fewer parameters than pre-trained models. It can be trained under low-resource language settings and is a potential solution for real-world applications where pre-trained models are not applicable.
Peter West, Ari Holtzman, Jan Buys, Yejin Choi
The paper proposes a new approach to unsupervised sentence summarization using the Information Bottleneck principle. The approach seeks a compressed sentence that can best predict the next sentence, using an iterative algorithm that gradually searches shorter subsequences of the given sentence. The method can efficiently perform extractive sentence summarization over a large corpus using only pretrained language models with no direct supervision. The paper also presents a new approach to self-supervised abstractive summarization, where a transformer-based language model is trained on the output summaries of the unsupervised method. Empirical results show that the extractive method outperforms other unsupervised models on multiple automatic metrics, and the self-supervised abstractive model outperforms unsupervised baselines by human evaluation along multiple attributes.
Jiawei Zhou, Alexander M. Rush
The paper proposes an unsupervised method for sentence summarization using language modeling. The approach uses two language models, one generic and one specific to the target domain, and employs a product-of-experts criterion to maintain contextual matching and output fluency. The experiments show promising results for both abstractive and extractive summarization without the need for paired data.
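A product-of-experts combination of two models can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: the function name `product_of_experts`, the `weight` parameter, and the toy probabilities are all hypothetical, and each expert is reduced to a table of log-probabilities over the same candidate words.

```python
import math


def product_of_experts(logp_a, logp_b, weight=1.0):
    """Combine two experts' log-probabilities over shared candidates.

    Returns a distribution proportional to p_a(w) * p_b(w) ** weight.
    """
    scores = {w: logp_a[w] + weight * logp_b[w] for w in logp_a if w in logp_b}
    log_z = math.log(sum(math.exp(s) for s in scores.values()))
    return {w: math.exp(s - log_z) for w, s in scores.items()}


if __name__ == "__main__":
    # Toy numbers (assumption): one expert scores contextual match,
    # the other scores fluency in the target domain.
    generic = {"storm": math.log(0.5), "weather": math.log(0.4), "banana": math.log(0.1)}
    domain = {"storm": math.log(0.3), "weather": math.log(0.1), "banana": math.log(0.6)}
    combined = product_of_experts(generic, domain)
    print(max(combined, key=combined.get))
```

A candidate scores highly only when both experts assign it reasonable probability, which is why a word favored by a single expert ("banana" above) does not win.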
Masaru Isonuma, Junichiro Mori, Ichiro Sakata
The paper presents a model for end-to-end abstractive summarization of product reviews without supervision. The model uses a discourse tree to represent the review, with the summary as the root and child sentences providing detailed explanations. The model recursively estimates parents from their children to learn the discourse tree and generate a concise summary. An architecture is introduced to rank the importance of each sentence on the tree and focus on the main review point. Experimental results show that the model outperforms other unsupervised approaches and achieves competitive performance with supervised models for long reviews. The induced tree demonstrates that child sentences provide additional information about their parent, and the generated summary abstracts the entire review.
Ziyi Yang, Chenguang Zhu, Robert Gmyr, Michael Zeng, Xuedong Huang, Eric Darve
The paper proposes a transformer-based unsupervised abstractive summarization system called TED that pretrains on large-scale data using the lead bias in news articles. The system is then fine-tuned on target domains through theme modeling and a denoising autoencoder to enhance the quality of generated summaries. TED outperforms all unsupervised abstractive baselines on various datasets and the summaries generated by TED are highly abstractive. Each component in the objective function of TED is highly effective.
Ryosuke Kohita, Akifumi Wachi, Yang Zhao, Ryuki Tachibana
The paper proposes a new approach for unsupervised text summarization using Q-learning with an edit-based summarization. The method combines two modules to form an Editorial Agent and Language Model converter (EALM), where the agent predicts edit actions and the LM converter generates a summary based on the action signals. Q-learning is used to train the agent to produce proper edit actions. Experimental results show that EALM performs competitively compared to previous methods, even with no validation set. The approach also allows for the use of reinforcement learning techniques in unsupervised summarization. Qualitative analysis is conducted to provide insights for future research in unsupervised summarizers.
Jingzhou Liu, Dominic J. D. Hughes, Yiming Yang
The paper discusses the limitations of supervised summarization due to the high cost and difficulty of obtaining large quantities of human-generated summaries. It proposes an unsupervised approach to extractive text summarization using an automatically constructed sentence graph to select salient sentences based on similarities and relative distances. The approach is generalized from single-document to multi-document settings by aggregating document-level graphs via proximity-based cross-document edges. In experiments on benchmark datasets, the proposed approach achieved competitive or better results than previous state-of-the-art unsupervised extractive summarization methods in both single-document and multi-document settings, and the performance is competitive to strong supervised baselines.
Nishant Yadav, Matteo Brucato, Anna Fariha, Oscar Youngquist, Julian Killingback, Alexandra Meliou, Peter J. Haas
The paper discusses the need for tailored summaries based on the user's intent and how existing methods fall short when query interpretation is subjective. While several datasets exist for summarization with objective intents, no datasets exist for subjective intents where different users will provide different summaries. The authors present SUBSUME, the first dataset for evaluation of subjective summary extraction systems, containing 2,200 triplets over 48 Wikipedia pages with ten intents of varying subjectivity. The paper explores baseline algorithms for subjective extractive summarization and shows that example-based approaches better capture subjective intents than query-based ones, motivating further research on this challenging problem.
Jennifer A Bishop, Qianqian Xie, Sophia Ananiadou
The paper proposes a hybrid, unsupervised, abstractive-extractive approach for text summarization (TS) that generates salient textual fragments representing key points in a document and selects the most important sentences using BERTScore. The approach is evaluated on documents from the biomedical and general scientific domains and compared to existing unsupervised and supervised methods. The authors show that their approach outperforms existing methods despite not needing a vast amount of labelled training data.
Puyuan Liu, Chenyang Huang, Lili Mou
The paper proposes a Non-Autoregressive Unsupervised Summarization (NAUS) approach for generating short summaries without the need for parallel data. The approach involves edit-based search and training an encoder-only non-autoregressive Transformer based on the search result. The paper also introduces a dynamic programming approach for length-control decoding, which is important for the summarization task. Experiments on two datasets show that NAUS achieves state-of-the-art performance for unsupervised summarization and improves inference efficiency. Additionally, the algorithm is able to perform explicit length-transfer summary generation.
Min Yang, Qiang Qu, Jia Zhu, Ying Shen, Zhou Zhao
The paper proposes a model called CASAS for aspect/sentiment-aware abstractive review summarization in a domain adaptation scenario. The model leverages a domain classification task to recognize the domain information of texts and transfer knowledge from source domains to target domains. The experiments conducted on Amazon reviews show that CASAS outperforms other methods in both out-of-domain and in-domain setups.
Haoran Li, Junnan Zhu, Jiajun Zhang, Chengqing Zong
The paper discusses the importance of correctness in sentence summarization and proposes a new approach that incorporates entailment knowledge into abstractive summarization models. The authors argue that a correct summary should not contain erroneous information with respect to the source sentence. They propose an entailment-aware encoder and decoder and use entailment Reward Augmented Maximum Likelihood (RAML) training. Experimental results show that their models outperform baselines in terms of informativeness and correctness.
Amir Soleimani, Vassilina Nikoulina, Benoit Favre, Salah Ait-Mokhtar
The paper explores the zero-shot setting for aspect-based scientific document summarization, which can improve document assistance systems and reader experience. However, current datasets have limited aspects, causing models to over-fit to specific domains. The authors establish baseline results for zero-shot performance and propose a self-supervised pre-training approach to enhance it. They create a biomedical aspect-based summarization dataset using PubMed structured abstracts and show promising results when pre-trained with unlabelled in-domain data.
Xinnian Liang, Jing Li, Shuangzhi Wu, Jiali Zeng, Yufan Jiang, Mu Li, Zhoujun Li
The paper proposes an efficient Coarse-to-Fine Facet-Aware Ranking (C2F-FAR) framework for unsupervised long document summarization, which is based on the semantic block. The framework addresses the problem that existing methods fail to achieve efficiency and effectiveness at the same time when the input document is extremely long. The proposed method converts one-step ranking into a hierarchical, multi-granularity two-stage ranking: the coarse-level stage splits the document into facet-aware semantic blocks and filters out insignificant blocks, and the fine-level stage selects salient sentences in each block and extracts the final summary from the selected sentences. The framework achieves new state-of-the-art unsupervised summarization results on Gov-Report and BillSum and runs 4 to 28 times faster than previous methods.
Alexander M. Rush, Sumit Chopra
The paper proposes a new approach to abstractive sentence summarization using a fully data-driven method. The method utilizes a local attention-based model that generates each word of the summary based on the input sentence. The model is simple in structure, but can be trained end-to-end and scaled to a large amount of training data. The model shows significant performance gains on the DUC-2004 shared task compared to other strong baselines.
Preksha Nema, Mitesh M. Khapra, Anirban Laha, Balaraman Ravindran
The paper proposes a model for query-based summarization that addresses the problem of repeated phrases in the summary. The model is based on the encode-attend-decode paradigm and includes a query attention model and a diversity-based attention model. The authors introduce a new query-based summarization dataset and show that their model outperforms vanilla encode-attend-decode models with a gain of 28% in ROUGE-L scores.
Sumit Chopra, Michael Auli, Alexander M. Rush
The paper discusses a new method for Abstractive Sentence Summarization, which generates a shorter version of a given sentence while preserving its meaning. The method uses a conditional recurrent neural network (RNN) with a novel convolutional attention-based encoder to ensure that the decoder focuses on the appropriate input words. The model relies on learned features and is easy to train on large data sets. The experiments show that the model outperforms the state-of-the-art method on the Gigaword corpus and performs competitively on the DUC-2004 shared task.
Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Bing Xiang
The paper discusses the use of sequence-to-sequence recurrent neural networks (RNNs) for text summarization. It also explores various techniques for improving the performance of these models, such as attention mechanisms and pointer networks. The authors present experimental results on several benchmark datasets, demonstrating the effectiveness of their approach. They also discuss potential future directions for research in this area.
Qingyu Zhou, Nan Yang, Furu Wei, Ming Zhou
The paper proposes a selective encoding model for abstractive sentence summarization, which includes a sentence encoder, a selective gate network, and an attention equipped decoder. The model uses recurrent neural networks and constructs a second level sentence representation for better performance. The model was evaluated on multiple datasets and outperformed state-of-the-art baseline models.
Piji Li, Wai Lam, Lidong Bing, Zihao Wang
The paper proposes a new framework for abstractive text summarization using a sequence-to-sequence oriented encoder-decoder model with a deep recurrent generative decoder. The model learns latent structure information from target summaries using a recurrent latent random model and neural variational inference. Abstractive summaries are generated using both generative latent variables and discriminative deterministic states. The model outperforms state-of-the-art methods on benchmark datasets in different languages.
Xinyu Hua, Lu Wang
The paper explores domain adaptation for neural abstractive summarization and investigates what information can be transferred to a new domain. The study finds that pre-training based on extractive summaries benefits the neural summarization model and that a combination of in-domain and out-of-domain setup yields better summaries when in-domain data is insufficient. The model is capable of selecting salient content even when trained on out-of-domain data, but requires in-domain data to capture the style for a target domain.
Jiwei Tan, Xiaojun Wan, Jianguo Xiao
The paper discusses the challenges of abstractive document summarization and proposes a novel graph-based attention mechanism in the sequence-to-sequence framework to address the saliency factor of summarization. The experimental results show that the proposed model achieves considerable improvement over previous neural abstractive models and is competitive with state-of-the-art extractive methods.
Yizhu Liu, Zhiyi Luo, Kenny Q. Zhu
The paper discusses the limitations of convolutional neural networks (CNNs) in generating summaries of desired lengths for different scenarios with space or length constraints. To address this problem, the authors propose an approach to constrain the summary length by extending a convolutional sequence to sequence model. The results show that this approach generates high-quality summaries with user-defined length and outperforms baselines in terms of ROUGE score, length variations, and semantic similarity.
Romain Paulus, Caiming Xiong, Richard Socher
The paper discusses the limitations of current attentional, RNN-based encoder-decoder models for abstractive summarization on longer documents and introduces a new neural network model with a novel intra-attention and a training method that combines supervised word prediction and reinforcement learning. The resulting summaries are more readable and the model achieves an improved ROUGE-1 score on the CNN/Daily Mail dataset compared to previous state-of-the-art models. Human evaluation also shows that the model produces higher quality summaries.
Ziqiang Cao, Furu Wei, Wenjie Li, Sujian Li
The paper discusses the problem of fake facts in abstractive summarization, where different parts of the source text are fused together. The authors propose a solution that leverages open information extraction and dependency parse technologies to extract actual fact descriptions from the source text, and a dual-attention sequence-to-sequence framework to generate summaries conditioned on both the source text and the extracted fact descriptions. Experiments show that their model can reduce fake summaries by 80%, while also improving informativeness.
Linqing Liu, Yao Lu, Min Yang, Qiang Qu, Jia Zhu, Hongyan Li
The paper proposes an adversarial process for abstractive text summarization, where a generative model and a discriminative model are simultaneously trained. The generator is built as an agent of reinforcement learning, while the discriminator attempts to distinguish the generated summary from the ground truth summary. The model achieves competitive ROUGE scores with state-of-the-art methods on the CNN/Daily Mail dataset and is able to generate more abstractive, readable, and diverse summaries.
Qiwei Bi, Haoyuan Li, Hanfang Yang
The paper discusses the challenge of summarization in niche domains and proposes a solution to the few-shot problem by designing auxiliary tasks to assist abstractive summarization. The authors use BART as the base sequence-to-sequence model and incorporate the main and auxiliary tasks under a multi-task framework. They also use a task-specific adapter and adaptive weight mechanism to adjust the contribution of auxiliary tasks to the main task. The experiments show the effectiveness of their method for few-shot datasets, and they propose pre-training the model on unlabeled datasets to further improve performance.
Alexander R. Fabbri, Simeng Han, Haoyuan Li, Haoran Li, Marjan Ghazvininejad, Shafiq Joty, Dragomir Radev, Yashar Mehdad
The paper discusses how models pretrained on large text corpora achieve state-of-the-art performance on English text summarization tasks, but fine-tuning them on new, niche domains is infeasible due to the requirement of hundreds of thousands of data points. The authors introduce a novel and generalizable method called WikiTransfer, which fine-tunes pretrained models for summarization in an unsupervised, dataset-specific manner using pseudo-summaries produced from generic Wikipedia data. WikiTransfer models achieve state-of-the-art, zero-shot abstractive summarization performance on the CNN-DailyMail dataset and demonstrate effectiveness on three additional diverse datasets. The authors also employ data augmentation and introduce a regularization term to improve few-shot transfer performance. The paper further studies the effect of dataset aspects on transfer performance and evaluates the quality of output summaries using both automatic and human evaluation.
Wei Li, Xinyan Xiao, Yajuan Lyu, Yuanzhuo Wang
The paper discusses the limitations of current neural sequence-to-sequence models in document summarization and proposes a solution that leverages the structural information of both documents and multi-sentence summaries to improve performance. The proposed method involves incorporating structural-compression and structural-coverage regularization to capture the information compression and coverage properties of document summarization. Experimental results show that the proposed method significantly improves the performance of document summarization and outperforms current state-of-the-art neural abstractive methods.
Angela Fan, David Grangier, Michael Auli
The paper discusses how current document summarization models do not take into account user preferences such as desired length, style, entities of interest, and how much of the document has been read. The authors propose a neural summarization model that allows users to specify these preferences, resulting in high quality summaries tailored to their needs. The system can also automatically set control variables and outperforms state of the art abstractive systems on the CNN-Dailymail dataset.
Yen-Chun Chen, Mohit Bansal
The paper proposes a summarization model that selects important sentences and rewrites them to create a concise summary. It uses a new sentence-level policy gradient method to bridge the gap between the two neural networks and achieves higher scores on all metrics, including human evaluation, on the CNN/Daily Mail dataset. The model also enables faster inference and training convergence than previous models, and it is demonstrated to perform well on the DUC2002 dataset.
Junyang Lin, Xu Sun, Shuming Ma, Qi Su
The paper proposes a new global encoding framework to improve the conventional sequence-to-sequence model in neural abstractive summarization, which often suffers from repetition and semantic irrelevance. The framework controls the information flow from the encoder to the decoder based on the global information of the source context, using a convolutional gated unit to perform global encoding and improve the representations of the source-side information. Evaluations on two datasets show that the proposed model outperforms baseline models and is capable of generating higher quality summaries with reduced repetition.
Kaiqiang Song, Lin Zhao, Fei Liu
The paper discusses the limitations of current summarization systems and proposes a new approach that incorporates source-side syntactic information to improve the quality of summaries. The approach uses structure-infused copy mechanisms to copy important words and relations from the source sentence to the summary sentence. Experimental results show that this approach is effective and outperforms state-of-the-art methods.
Wan-Ting Hsu, Chieh-Kai Lin, Ming-Ying Lee, Kerui Min, Jing Tang, Min Sun
The paper proposes a unified model that combines the strengths of extractive and abstractive summarization. The model uses sentence-level attention to modulate word-level attention, resulting in a more readable paragraph. The model also introduces a novel inconsistency loss function to penalize the inconsistency between the two levels of attention. With end-to-end training, the model achieves state-of-the-art ROUGE scores and produces the most informative and readable summaries on the CNN/Daily Mail dataset according to a human evaluation.
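To make the idea of penalizing disagreement between the two attention levels concrete, here is a minimal sketch of one possible form of such a penalty. It is an assumption for illustration and not necessarily the paper's exact loss: at each decoding step it takes the most attended words, multiplies each word weight by the weight of the sentence containing that word, and takes the negative log of the average. The names `inconsistency_loss`, `word_to_sent`, and `top_k`, and the use of a single static sentence-attention vector, are hypothetical.

```python
import math


def inconsistency_loss(word_attn, sent_attn, word_to_sent, top_k=3, eps=1e-12):
    """Average penalty over decoding steps for attention disagreement.

    word_attn: list (one entry per decoding step) of word-level weights.
    sent_attn: sentence-level weights, assumed static across steps.
    word_to_sent: index of the sentence that contains each source word.
    """
    total = 0.0
    for step_attn in word_attn:
        # Indices of the top_k most attended source words at this step.
        top = sorted(range(len(step_attn)), key=lambda m: step_attn[m], reverse=True)[:top_k]
        # High only when attended words sit in highly weighted sentences.
        agreement = sum(step_attn[m] * sent_attn[word_to_sent[m]] for m in top) / len(top)
        total += -math.log(agreement + eps)
    return total / len(word_attn)


if __name__ == "__main__":
    word_attn = [[0.7, 0.2, 0.05, 0.05]]  # one decoding step, four source words
    word_to_sent = [0, 0, 1, 1]           # words 0-1 in sentence 0, words 2-3 in sentence 1
    print(inconsistency_loss(word_attn, [0.9, 0.1], word_to_sent, top_k=2))
    print(inconsistency_loss(word_attn, [0.1, 0.9], word_to_sent, top_k=2))
```

In the example, the first call (sentence attention agrees with word attention) yields a smaller penalty than the second (sentence attention points at the other sentence).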
Min Yang, Qiang Qu, Ying Shen, Qiao Liu, Wei Zhao, Jia Zhu
The paper discusses the lack of research on end-to-end abstractive review summarization, which is important for businesses and consumers to make informed decisions. The authors propose a mutual attention mechanism that learns the representations of context, sentiment, and aspect words within reviews, acting as an encoder. The learned representations are incorporated into the decoder to generate aspect/sentiment-aware review summaries via an attention fusion network. The abstractive summarizer is jointly trained with the text categorization task, which helps learn a category-specific text encoder. The experimental results on a real-life dataset show that their model outperforms other strong competitors.
Reinald Kim Amplayo, Seung-won Hwang
The paper explores the use of linked entities to improve the performance of a neural text summarizer. The authors propose a module called Entity2Topic (E2T) that transforms a list of entities into a vector representation of the summary's topic. They use an off-the-shelf entity linking system (ELS) to extract linked entities, but resolve imperfections in the ELS by encoding entities with selective disambiguation and pooling entity vectors using firm attention. Applying E2T to a simple sequence-to-sequence model with attention mechanism results in significant improvements in the performance of the summarizer in the Gigaword and CNN datasets.
Arman Cohan, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Seokhwan Kim, Walter Chang, Nazli Goharian
The paper proposes a new model for abstractive summarization of longer-form documents, such as research papers. The model uses a hierarchical encoder to model the discourse structure of the document and an attentive discourse-aware decoder to generate the summary. Empirical results on two large-scale datasets of scientific papers show that the proposed model outperforms state-of-the-art models.
Asli Celikyilmaz, Antoine Bosselut, Xiaodong He, Yejin Choi
The paper proposes a new approach to abstractive summarization using deep communicating agents in an encoder-decoder architecture. The task of encoding a long text is divided across multiple collaborating agents, each responsible for a subsection of the input text. These encoders are connected to a single decoder, trained using reinforcement learning to generate a focused and coherent summary. Empirical results show that this approach leads to higher quality summaries compared to several strong baselines.
Chenliang Li, Weiran Xu, Sheng Gao
The paper proposes a guiding generation model that combines extractive and abstractive methods for text summarization. The model uses a Key Information Guide Network (KIGN) to encode keywords and guide the generation process, and a prediction-guide mechanism to obtain long-term value for future decoding. The model is evaluated on the CNN/Daily Mail dataset and shows significant improvements compared to previous models.
The paper discusses the effectiveness of ensemble methods for text-generation tasks, but notes that they often come with increased computational costs. The authors propose an alternative unsupervised ensemble method called post-ensemble, which selects a majority-like output in post-processing. The method is theoretically related to kernel density estimation based on the von Mises-Fisher kernel. Experimental results on a news headline-generation task show that the proposed method outperforms current ensemble methods.
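Selecting a majority-like output can be sketched as a density estimate over the candidate outputs: the output that lies closest to the others wins. The sketch below is an illustration under assumptions, not the authors' implementation. It uses a von Mises-Fisher style kernel, exp(kappa * cosine), over bag-of-words vectors that stand in for real sentence embeddings; `embed`, `cosine`, `post_ensemble`, and the value of `kappa` are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Stand-in embedding (assumption): a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def post_ensemble(outputs, kappa=5.0):
    """Return the output with the highest kernel density among all outputs."""
    vectors = [embed(o) for o in outputs]
    density = [
        sum(math.exp(kappa * cosine(vectors[i], vectors[j]))
            for j in range(len(vectors)) if j != i)
        for i in range(len(vectors))
    ]
    best = max(range(len(outputs)), key=lambda i: density[i])
    return outputs[best]


if __name__ == "__main__":
    # Hypothetical headlines produced by four separately trained models.
    candidates = [
        "stocks fall on rate fears",
        "stocks fall on rate worries",
        "stocks slide on rate fears",
        "local team wins cup",
    ]
    print(post_ensemble(candidates))
```

Because selection happens after decoding, each model runs independently and only the final comparison step is added.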
Wei Li, Xinyan Xiao, Yajuan Lyu, Yuanzhuo Wang
The paper proposes a new approach to document summarization that explicitly models and optimizes the information selection process. This is achieved through an information selection layer that includes global information filtering and local sentence selection. The approach is trained using distantly-supervised training guided by a golden summary. Experimental results show that this approach significantly improves document summarization performance and outperforms state-of-the-art neural abstractive methods.
Sebastian Gehrmann, Yuntian Deng, Alexander M. Rush
The paper proposes a technique to improve the content selection of neural network-based methods for abstractive summarization. The technique involves using a data-efficient content selector to identify phrases in the source document that should be included in the summary. This selector is used as a bottom-up attention step to constrain the model to likely phrases, resulting in improved text compression and fluent summaries. The approach is simpler and higher performing than other end-to-end content selection models, and can be trained with as little as 1,000 sentences, making it easy to transfer to a new domain. The technique was shown to significantly improve ROUGE scores for both the CNN-DM and NYT corpus.
Li Wang, Junlin Yao, Yunzhe Tao, Li Zhong, Wei Liu, Qiang Du
The paper proposes a deep learning approach to automatic summarization that incorporates topic information into the ConvS2S model and uses SCST for optimization. The approach improves coherence, diversity, and informativeness of generated summaries through a biased probability generation mechanism. Reinforcement training optimizes the model with respect to the non-differentiable metric ROUGE and avoids exposure bias during inference. The method is evaluated on three datasets and shows superior performance in abstractive summarization.
Min Yang, Qiang Qu, Wenting Tu, Ying Shen, Zhou Zhao, Xiaojun Chen
The paper discusses the challenges of generating high-quality abstractive summaries using deep neural network based methods and proposes a novel Hybrid learning model for Abstractive Text Summarization (HATS) that follows a hierarchical routine similar to human-like reading strategy. HATS consists of three major components, a knowledge-based attention network, a multitask encoder-decoder network, and a generative adversarial network, which are consistent with the different stages of the human-like reading strategy. The experimental results on two real-life datasets, CNN/Daily Mail and Gigaword, demonstrate that HATS achieves impressive results.
Eva Sharma, Luyang Huang, Zhe Hu, Lu Wang
The paper introduces SENECA, a new system for entity-driven coherent abstractive summarization that uses entity information to generate informative and coherent abstracts. The framework takes a two-step approach, with an entity-aware content selection module identifying salient sentences and an abstract generation module conducting cross-sentence information compression and abstraction. The generation module is trained with rewards to promote coherence, conciseness, and clarity, and the two modules are further connected using reinforcement learning. Automatic evaluation shows that SENECA outperforms previous state-of-the-art on ROUGE and coherence measures on New York Times and CNN/Daily Mail datasets, and human judges rate its summaries as more informative and coherent than those by popular summarization models.
Min Gui, Junfeng Tian, Rui Wang, Zhenglu Yang
The paper discusses the importance of attention in improving document summarization models. The authors propose an attention refinement unit that uses both local and global variance loss to supervise the attention model at each decoding step and optimize the attention distributions from a global perspective. The effectiveness of the proposed methods is verified through experiments on the CNN/Daily Mail dataset.
Xiangyu Duan, Hongfei Yu, Mingming Yin, Min Zhang, Weihua Luo, Yue Zhang
The paper proposes a contrastive attention mechanism for abstractive sentence summarization, which includes both conventional attention that focuses on relevant parts of the source sentence and opponent attention that focuses on irrelevant or less relevant parts. The mechanism is trained in an opposite way to encourage the contribution from conventional attention and discourage the contribution from opponent attention. Experiments show that the proposed mechanism is more focused on relevant parts and greatly improves the state-of-the-art performance on the task. The code is available on GitHub.
Yufei Tian, Jianfei Yu, Jing Jiang
The paper discusses a two-stage reinforcement learning approach for abstractive review summarization. The approach first predicts the output word type and then generates the final word distribution based on the predicted word type. In experiments on two Amazon product review datasets, the method outperforms several strong baselines on ROUGE scores.
Wang Wenbo, Gao Yang, Zhou Yuxiang
The paper proposes a concept pointer network for improving abstractive summarization by generating new conceptual words to express concrete details. The network uses knowledge-based, context-aware conceptualizations to derive an extended set of candidate concepts and points to the most appropriate choice using both the concept set and original source text. The training model is optimized using a novel method of distantly-supervised learning guided by reference summaries and testing set. The proposed approach provides statistically significant improvements over several state-of-the-art models on both the DUC2004 and Gigaword datasets, and a human evaluation supports the quality of the summaries produced within this framework.
Shen Gao, Xiuying Chen, Piji Li, Zhangming Chan, Dongyan Zhao, Rui Yan
The paper introduces a model called Prototype Editing based Summary Generator (PESG) that utilizes prototype document-summary pairs to generate better summaries that conform to a particular style with patterns. The model addresses two challenges: incorporating learned patterns from the prototype while avoiding copying irrelevant facts, and generating new summaries based on the summary pattern or extracted facts. A fact checker is used to estimate mutual information between the input document and generated summary, resulting in state-of-the-art performance in both automatic metrics and human evaluations.
Sanghwan Bae, Taeuk Kim, Jihoon Kim, Sang-goo Lee
The paper proposes a new approach to combining extractive and abstractive summarization using Sentence Rewriting models. The existing models in this framework rely on suboptimal labels, causing a mismatch between the training objective and the evaluation metric. The authors present a novel training signal that directly maximizes summary-level ROUGE scores through reinforcement learning, and they incorporate BERT into their model. They show that their proposed model and training procedure obtain new state-of-the-art performance on both the CNN/Daily Mail and New York Times datasets and generalize better on the DUC-2002 test set.
Siyao Li, Deren Lei, Pengda Qin, William Yang Wang
The paper discusses the limitations of using conventional reward measures for deep reinforcement learning in abstractive summarization tasks, which can result in repetitive and incoherent sentences. Instead, the authors propose using distributional semantics to measure the matching degrees, allowing for sentence-level evaluation and the generation of semantically-correct phrases. The proposed distributional semantics reward (DSR) is shown to have superior performance in capturing the lexical and compositional diversity of natural language, based on human judgments on Gigaword and CNN/Daily Mail datasets.
Yongjian You, Weijia Jia, Tianyi Liu, Wenmian Yang
The paper proposes a Transformer-based encoder-decoder framework with two novel extensions for abstractive document summarization. The first extension is a focus-attention mechanism that models a Gaussian focal bias on attention scores to enhance the perception of local context, contributing to producing salient and informative summaries. The second extension is an independent saliency-selection network that manages the information flow from encoder to decoder, effectively reducing the influences of secondary information on the generated summaries. Experimental results on the CNN/Daily Mail benchmark show that the proposed model outperforms other state-of-the-art baselines on the ROUGE metrics.
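The focus-attention idea can be illustrated with a minimal sketch: a Gaussian penalty centered on a focal position is added to the raw attention scores before the softmax, so that nearby tokens receive more weight. The summary does not say how the focal position and width are obtained, so `center` and `sigma` below are hypothetical inputs, and the function names are illustrative rather than the authors' implementation.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def focus_attention(scores, center, sigma):
    # Assumption: the focal bias is -(j - center)^2 / (2 * sigma^2),
    # added to the raw score of source position j before normalization.
    biased = [s - (j - center) ** 2 / (2 * sigma ** 2)
              for j, s in enumerate(scores)]
    return softmax(biased)

# With uniform raw scores, the weights are shaped only by the Gaussian bias.
uniform_scores = [0.0] * 5
weights = focus_attention(uniform_scores, center=2, sigma=1.0)
```

A smaller `sigma` concentrates the weights more tightly around `center`, which is one way to read "enhancing the perception of local context".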
Lei Li, Wei Liu, Marina Litvak, Natalia Vanetik, Zuying Huang
The paper discusses the limitations of existing Seq2Seq models for abstractive summarization and introduces a new model called DivCNN Seq2Seq that uses Determinantal Point Process methods to produce an attention distribution that considers both quality and diversity. The new model achieves a higher level of comprehensiveness compared to existing models and strong baselines without breaking the end-to-end architecture. Code and datasets for reproducing the results are available online.
Byeongchang Kim, Hyunwoo Kim, Gunhee Kim
The paper discusses a method for summarizing Reddit posts using multi-level memory networks. The authors propose a model that can capture the important information in a post and generate a summary that accurately reflects the content. The model uses both word-level and sentence-level representations to capture the meaning of the post and the relationships between different parts of the text. The authors evaluate their model on a dataset of TIFU (Today I Fucked Up) posts from Reddit and show that it outperforms several baseline methods in terms of ROUGE scores.
Panagiotis Kouris, Georgios Alexandridis, Andreas Stafylopatis
The paper presents a new method for improving abstractive text summarization using deep learning and semantic data transformations. The method involves using a theoretical model for semantic-based text generalization along with a deep encoder-decoder architecture to produce a summary in generalized form. The summary is then transformed into a human-readable form while retaining important information and addressing the problem of out-of-vocabulary or rare words. The approach is evaluated on two datasets with positive results.
Logan Lebanoff, Kaiqiang Song, Franck Dernoncourt, Doo Soon Kim, Seokhwan Kim, Walter Chang, Fei Liu
The paper discusses the challenge of summarizing text by both compressing single sentences and fusing pairs, as sentence selection methods only work with single sentences and not combinations of them. The authors propose a framework that ranks sentence singletons and pairs together in a unified space, modeling human methodology by selecting either a single sentence or a pair of sentences and compressing or fusing them to produce a summary sentence. The framework was tested on both single and multidocument summarization datasets, with findings reported on sentence selection and abstraction.
Kai Wang, Xiaojun Quan, Rui Wang
The paper proposes a new model called Bi-directional Selective Encoding with Template (BiSET) for summarizing articles. The model uses templates discovered from training data to select key information from source articles and guide the summarization process. The experiments conducted on a standard summarization dataset show that the BiSET model significantly improves the summarization performance and achieves a new state of the art.
Tatsuya Ishigaki, Hen-Hsen Huang, Hiroya Takamura, Hsin-Hsi Chen, Manabu Okumura
The paper discusses the query-biased summarization task and how conventional approaches have achieved better performance by including overlapping words between the source and the query in the summary. However, RNN-based approaches do not explicitly model this phenomenon. The paper proposes an RNN-based query-biased summarizer that primarily includes overlapping words in the summary using a copying mechanism. Experimental results show that this strategy works well for neural query-biased summarizers.
Philippe Laban, Andrew Hsi
The paper presents a new approach to unsupervised abstractive summarization that maximizes coverage and fluency while adhering to a length constraint. The method includes key terms from the original document and uses a coverage model to fill them in the generated summary. The unsupervised training procedure uses both coverage and fluency models to generate and score summaries. The method outperforms previous unsupervised methods by more than 2 R-1 points and approaches results of competitive supervised methods. The model attains higher levels of abstraction with shorter copied passages and learns to compress and merge sentences without supervision.
Sajad Sotudeh, Nazli Goharian, Ross W. Filice
The paper discusses the limitations of the seq2seq network in identifying key regions of the source for text summarization. The authors propose a solution by augmenting salient ontological terms into the summarizer for clinical abstractive summarization. Their experiments on two clinical data sets show that their model significantly improves state-of-the-art results in terms of ROUGE metrics, which is important in the healthcare domain where any improvement can impact patients’ welfare.
Luyang Huang, Lingfei Wu, Lu Wang, John M. Fabrizi, Joseph P. Ganim
The paper discusses the limitations of current sequence-to-sequence models for abstractive summarization and proposes a new framework called ASGARD, which uses dual encoders and a reward system based on a multiple choice cloze test to better capture entity interactions and generate more informative summaries. The authors show that their models produce significantly higher ROUGE scores and are rated as more informative and containing fewer errors by human judges compared to other systems.
Zhenwen Li, Wenhao Wu, Sujian Li
The paper proposes a new method for abstractive summarization using elementary discourse units (EDUs) instead of sentences. The method includes an EDU selection model to group informative EDUs and an EDU fusion model to combine them into sentences. A reinforcement learning mechanism is used to improve the summarization performance. The model was tested on CNN/Daily Mail and showed promising results.
Kaiqiang Song, Bingqing Wang, Zhe Feng, Liu Ren, Fei Liu
The paper discusses the challenge of creating abstracts that accurately summarize the original text without changing its meaning. It explores the use of neural summarization models to generate summaries with varying degrees of copying, from purely extractive to highly generative. The authors present a method that allows for control over copying during both training and decoding stages, and demonstrate its effectiveness through extensive experiments. The paper also reveals interesting and unobvious findings about the process of summarization.
Haoran Li, Junnan Zhu, Jiajun Zhang, Chengqing Zong, Xiaodong He
This paper proposes an abstractive sentence summarization method that applies guidance signals of keywords to both the encoder and the decoder in the sequence-to-sequence model. A multi-task learning framework is adopted to jointly learn to extract keywords and generate a summary for the input sentence. The authors apply keyword-guided selective encoding strategies to filter source information by investigating the interactions between the input sentence and the keywords. They extend the pointer-generator network with a dual-attention and a dual-copy mechanism, which can integrate the semantics of the input sentence and the keywords, and copy words from both the input sentence and the keywords. The authors demonstrate that multi-task learning and keyword-oriented guidance facilitate the sentence summarization task, achieving better performance than competitive models on the English Gigaword sentence summarization dataset.
Logan Lebanoff, Franck Dernoncourt, Doo Soon Kim, Walter Chang, Fei Liu
The paper presents an empirical study supporting the use of a cascade architecture for neural text summarization. The study shows that a pipeline architecture, which separately identifies important content pieces and stitches them together, performs comparably or better than end-to-end systems that perform content selection and surface realization jointly. The paper also discusses the challenges of evaluating summarization systems and suggests future research directions.
Hanqi Jin, Tianming Wang, Xiaojun Wan
The paper proposes a new approach to neural abstractive summarization that incorporates semantic dependency graphs to improve semantic relevance and reduce content deviation in generated summaries. The proposed model, SemSUM, leverages the information of original input texts and corresponding semantic dependency graphs to guide the summarization process. The model was evaluated on three datasets and showed significant improvements in automatic evaluation ROUGE metrics.
Kaiqiang Song, Logan Lebanoff, Qipeng Guo, Xipeng Qiu, Xiangyang Xue, Chen Li, Dong Yu, Fei Liu
The paper proposes a solution to the problem of ungrammatical and inaccurate sentences produced by abstractive summarization systems. The proposed method involves generating a sentence and its syntactic dependency parse simultaneously to encourage grammatical sentences and maintain the original meaning. The paper presents a novel neural architecture for abstractive summarization that combines a sequential decoder with a tree-based decoder and a human evaluation protocol to assess the accuracy of the summary. The method is evaluated on various datasets and shows competitive results against strong baselines.
Song Xu, Haoran Li, Peng Yuan, Youzheng Wu, Xiaodong He, Bowen Zhou
The paper proposes a Transformer-based model to improve the copy mechanism in abstractive summarization. The model identifies the importance of each source word using degree centrality with a directed graph built by the self-attention layer. The centrality of each source word is used to guide the copy process explicitly, resulting in better performance than baseline methods on the CNN/Daily Mail and Gigaword datasets.
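As a rough sketch of the centrality step, a self-attention matrix can be read as a directed graph and each source word scored by how many other words attend to it. The thresholding rule, the `threshold` value, and the use of in-degree are assumptions made for illustration; the summary only states that degree centrality is computed on a directed graph built by the self-attention layer.

```python
def degree_centrality(attention, threshold=0.25):
    # attention[i][j] is the weight token i places on token j.
    # Assumption: an edge i -> j exists when that weight reaches `threshold`;
    # the centrality of token j is its in-degree, ignoring self-loops.
    n = len(attention)
    return [
        sum(1 for i in range(n) if i != j and attention[i][j] >= threshold)
        for j in range(n)
    ]

# Toy 3-token attention matrix (each row sums to 1).
attention = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.7, 0.2],
]
centrality = degree_centrality(attention)
```

In this toy example the second token collects the most incoming edges, so a copy mechanism guided by these scores would favor it.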
Chenguang Zhu, Ruochen Xu, Michael Zeng, Xuedong Huang
The paper discusses the challenge of summarizing meeting transcripts and proposes a novel abstractive summary network that adapts to the meeting scenario. The network includes a hierarchical structure to accommodate long transcripts and a role vector to depict the difference among speakers. The model is pre-trained on large-scale news summary data due to the inadequacy of meeting summary data. The empirical results show that the proposed model outperforms previous approaches in both automatic metrics and human evaluation, with an increase in ROUGE-1 score from 34.66% to 46.28% on the ICSI dataset.
Zhengjue Wang, Zhibin Duan, Hao Zhang, Chaojie Wang, Long Tian, Bo Chen, Mingyuan Zhou
The paper discusses the use of topic models to improve the performance of Transformer-based models in abstractive document summarization. The proposed model, called topic assistant (TA), includes three modules and is compatible with various Transformer-based models. TA is user-friendly and only introduces a small number of extra parameters. Experimental results on three datasets demonstrate that TA is able to improve the performance of several Transformer-based models.
Zheng Zhao, Shay B. Cohen, Bonnie Webber
The paper discusses the issue of hallucination in abstractive summaries and proposes a solution using the HERMAN system. HERMAN verifies specific entities in summaries and up-ranks those whose quantity terms are supported by the original text. Experimental results show higher precision and F1 scores for up-ranked summaries without a loss in recall, and human evaluation shows a preference for up-ranked summaries.
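The up-ranking idea can be sketched as re-ordering candidate summaries by how many of their quantity terms are supported by the source. HERMAN itself is a learned verification model; the exact string match on numbers below is a deliberately simple stand-in, and the function names are hypothetical.

```python
import re

# Matches integers and simple decimals such as "120" or "3.5".
NUMBER = re.compile(r"\d+(?:\.\d+)?")

def supported_fraction(summary, source):
    # Fraction of numbers in the summary that also appear in the source.
    quantities = NUMBER.findall(summary)
    if not quantities:
        return 1.0  # nothing to verify
    source_numbers = set(NUMBER.findall(source))
    return sum(q in source_numbers for q in quantities) / len(quantities)

def uprank(candidates, source):
    # Stable sort: better-supported candidates first, ties keep input order.
    return sorted(candidates,
                  key=lambda c: supported_fraction(c, source),
                  reverse=True)

source = "The company hired 120 engineers in 2019 and opened 3 offices."
candidates = [
    "The company hired 200 engineers in 2019.",
    "The company hired 120 engineers in 2019.",
]
ranked = uprank(candidates, source)
```

Here the candidate containing the unsupported figure "200" is moved below the one whose numbers all occur in the source.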
Khalil Mrini, Can Liu, Markus Dreyer
This paper discusses the problem of generating abstractive summaries focused on a particular topic. The authors propose a deep reinforcement learning approach that uses a negative example baseline to improve the model's ability to identify what it should not focus on. They adapt existing datasets for this task and show that their approach outperforms a self-critical baseline in various evaluation metrics.
Yichen Jiang, Asli Celikyilmaz, Paul Smolensky, Paul Soulos, Sudha Rao, Hamid Palangi, Roland Fernandez, Caitlin Smith, Mohit Bansal, Jianfeng Gao
The paper discusses the task of abstractive summarization, which involves generating a concise summary of input documents. The authors adapt the TP-TRANSFORMER architecture, which enriches the original Transformer with the Tensor Product Representation (TPR), for this task. The model encodes two separate representations for each token to represent the syntactic structure and semantic content separately, and then binds them into the TPR as the layer output. The authors argue that this structured intermediate representation enables the model to better control the contents and structures when generating the summary. The TP-TRANSFORMER outperforms the Transformer and the original TP-TRANSFORMER significantly on several abstractive summarization datasets based on both automatic and human evaluations. The authors also demonstrate the emergent structural information in the role vectors and improved syntactic interpretability in the TPR layer outputs.
Yue Dong, Shuohang Wang, Zhe Gan, Yu Cheng, Jackie Chi Kit Cheung, Jingjing Liu
The paper discusses a problem with system-generated abstractive summaries, which often contain factual inconsistencies. To address this issue, the authors propose SpanFact, a suite of two factual correction models that use knowledge from question answering models to correct errors in system-generated summaries. The models use single-mask or multi-mask strategies to replace entities and ensure semantic consistency with the source text while retaining the syntactic structure of the summaries. Experiments show that SpanFact significantly improves the factual consistency of system-generated summaries without sacrificing summary quality.
Meng Cao, Yue Dong, Jiapeng Wu, Jackie Chi Kit Cheung
The paper discusses the challenge of ensuring factual consistency in abstractive summarization systems and proposes a post-editing corrector module to address this issue. The module is pre-trained on artificial examples created by applying heuristic transformations on reference summaries. Experimental results show that the model is able to correct factual errors in summaries generated by other neural summarization models and outperforms previous models on factual consistency evaluation on the CNN/DailyMail dataset. However, the paper also notes that transferring from artificial error correction to downstream settings is still challenging.
Yanyan Zou, Xingxing Zhang, Wei Lu, Furu Wei, Ming Zhou
The paper discusses the challenge of training large SEQ2SEQ based summarization models on limited supervised summarization data and presents three sequence-to-sequence pre-training objectives that allow for pre-training a SEQ2SEQ based abstractive summarization model on unlabeled text. These objectives include sentence reordering, next sentence generation, and masked document generation, which have close relations with the abstractive document summarization task. Experiments on two benchmark summarization datasets show that all three objectives can improve performance upon baselines. The method achieves comparable results to models pre-trained on large-scale data with only 19GB text for pre-training, demonstrating its effectiveness. Code and models are publicly available.
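The three objectives can be sketched as functions that turn an unlabeled document into (input, target) pairs. The split point, mask span, and `[MASK]` token below are hypothetical choices made for illustration; the summary does not specify how the paper sets them.

```python
import random

def sentence_reordering(sentences, rng):
    # Input: shuffled sentences. Target: the original order.
    shuffled = sentences[:]
    rng.shuffle(shuffled)
    return " ".join(shuffled), " ".join(sentences)

def next_sentence_generation(sentences, split):
    # Input: the first `split` sentences. Target: the remaining sentences.
    return " ".join(sentences[:split]), " ".join(sentences[split:])

def masked_document_generation(sentences, start, length, mask="[MASK]"):
    # Input: the document with a token span masked. Target: the full document.
    tokens = " ".join(sentences).split()
    span = tokens[start:start + length]
    masked = tokens[:start] + [mask] * len(span) + tokens[start + length:]
    return " ".join(masked), " ".join(tokens)

doc = ["A storm hit the coast.", "Thousands lost power.", "Repairs began Monday."]
sr_input, sr_target = sentence_reordering(doc, random.Random(0))
nsg_input, nsg_target = next_sentence_generation(doc, split=1)
mdg_input, mdg_target = masked_document_generation(doc, start=2, length=3)
```

Each pair can be fed to the same sequence-to-sequence model, which is what lets these objectives pre-train an abstractive summarizer on unlabeled text.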
Changmeng Zheng, Yi Cai, Guanjie Zhang, Qing Li
The paper proposes a controllable abstractive sentence summarization model that generates summaries with guiding entities. The model ensures that entities appear in final output summaries and can generate more novel entities. The proposed model is evaluated using fine-grained informativeness metrics in the relevance, extraness, and omission perspectives. Experimental results show that the model outperforms the state-of-the-art methods in both automatic evaluation scores and informativeness metrics.
Wenhao Wu, Wei Li, Xinyan Xiao, Jiachen Liu, Ziqiang, Sujian Li, Hua Wu, Haifeng Wang
The paper proposes a new framework called BASS for abstractive summarization of long or multi-document text, which is challenging for the Seq2Seq architecture due to its inability to analyze long-distance relations in text. BASS utilizes a unified Semantic graph to aggregate co-referent phrases and convey rich relations between them. A graph-based encoder-decoder model is also proposed to improve document representation and summary generation by leveraging the graph structure. Several graph augmentation methods are designed to encode both explicit and implicit relations in the text, while the graph propagation attention mechanism is developed in the decoder to select salient content for the summary. Empirical results show that BASS brings substantial improvements for both long-document and multi-document summarization tasks.
Chenguang Zhu, Ziyi Yang, Robert Gmyr, Michael Zeng, Xuedong Huang
The paper proposes leveraging the lead bias in news articles to pre-train abstractive news summarization models on large-scale unlabeled news corpora. The authors collect a massive news corpus and conduct data cleaning and filtering via statistical analysis. They apply self-supervised pre-training on this dataset to existing generation models BART and T5 for domain adaptation. The approach dramatically improves the summarization quality and achieves state-of-the-art results for zero-shot news summarization without any fine-tuning. The model is deployed in Microsoft News and provides public APIs as well as a demo website for multi-lingual news summarization.
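One way to picture lead-bias pre-training is to treat the opening sentences of an article as a pseudo-summary and the remainder as the input. The choice of three lead sentences and the rule for skipping short articles are assumptions made for this sketch; the summary does not give these details, and the paper's own cleaning and filtering steps are not reproduced here.

```python
def lead_bias_example(sentences, lead=3):
    # Assumption: the first `lead` sentences act as the target summary and
    # the rest of the article is the model input.
    if len(sentences) <= lead:
        return None  # too short to yield a non-empty input
    return " ".join(sentences[lead:]), " ".join(sentences[:lead])

article = ["a.", "b.", "c.", "d.", "e."]
pair = lead_bias_example(article)
```

Applied to a large unlabeled news corpus, a function like this would yield self-supervised training pairs without any human-written summaries.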
Lihan Wang, Min Yang, Chengming Li, Ying Shen, Ruifeng Xu
The paper proposes a new approach to text summarization using hierarchical multi-scale abstraction modeling and dynamic memory. The system is designed to extract important information from large amounts of text and generate a concise summary. The approach is evaluated on several datasets and shows promising results compared to other state-of-the-art methods.
Dan Su, Tiezheng Yu, Pascale Fung
The paper proposes a new model called QFS-BART for generating summaries that are both coherent and answer-related to a given query. Unlike previous QFS models, QFS-BART considers the explicit answer relevance of the source documents given the query via a question answering model. The model also takes advantage of large pre-trained models for improved summarization performance. Empirical results on the Debatepedia dataset show that QFS-BART achieves state-of-the-art performance.
Feng Nan, Cicero Nogueira dos Santos, Henghui Zhu, Patrick Ng, Kathleen McKeown, Ramesh Nallapati, Dejiao Zhang, Zhiguo Wang, Andrew O. Arnold, Bing Xiang
The paper addresses the problem of factual inconsistency in abstractive summarization models. The authors propose an efficient automatic evaluation metric to measure factual consistency and a novel learning algorithm that maximizes the proposed metric during model training. Through extensive experiments, the authors confirm that their method is effective in improving factual consistency and overall quality of the summaries, as judged by both automatic metrics and human evaluation.
Shweta Yadav, Deepak Gupta, Asma Ben Abacha, Dina Demner-Fushman
The paper discusses the need for reliable and accurate question answering systems for online consumer health questions. It introduces a reinforcement learning-based framework for abstractive question summarization, which proposes two novel rewards obtained from downstream tasks to regularize the question generation model. The proposed method achieves higher performance over state-of-the-art models and generates more diverse and semantically valid questions with fewer factual inconsistencies. The source code is available on GitHub.
Yixin Liu, Pengfei Liu
The paper introduces a new framework called SIMCLS for abstractive summarization, which improves the performance of existing top-performing models by a large margin. The framework formulates text generation as a reference-free evaluation problem assisted by contrastive learning. The experimental results show that SIMCLS achieves an absolute ROUGE-1 improvement of 2.51 over BART and 2.50 over PEGASUS on the CNN/DailyMail dataset, driving the state-of-the-art performance to a new level. The code and results have been open-sourced, and the proposed models have been deployed on the EXPLAINABOARD platform so that researchers can examine the systems in a more fine-grained way.
Andreas Marfurt, James Henderson
The paper proposes a new model called the sentence planner model to generate more abstractive summaries. The model includes a hierarchical decoder that generates a representation for the next summary sentence and conditions the word generator on this representation. The generated summaries are more abstractive and achieve high ROUGE scores when compared to human reference summaries. The effectiveness of the design decisions is verified through extensive evaluations.
Ahmed Magooda, Mohamed Elaraby, Diane Litman
The paper investigates the use of multitask learning for abstractive summarization with limited training data. Four different tasks, including extractive summarization, language modeling, concept detection, and paraphrase detection, are incorporated individually and in combination to improve abstractive summarization. The results show that multitask learning can enhance the performance of abstractive summarization, and certain tasks, such as paraphrase detection, consistently benefit the task.
Shashi Narayan, Yao Zhao, Joshua Maynez, Vitaly Nikolaev, Ryan McDonald
The paper introduces a mechanism to improve the generation of abstractive summaries by learning an intermediate plan that grounds the summary generation. This is achieved by prepending target summaries with entity chains, which are ordered sequences of entities mentioned in the summary. Transformer-based sequence-to-sequence models are then trained to generate the entity chain and continue generating the summary based on the entity chain and input. The approach was evaluated on multiple datasets and demonstrated improved entity specificity and planning in summaries, achieving state-of-the-art performance in terms of ROUGE on some datasets. The mechanism also provides a way to control hallucinations in abstractive summaries, outperforming state-of-the-art approaches for faithfulness when evaluated automatically and by humans.
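The target-side augmentation can be sketched as a string transformation: the ordered entities of each summary sentence are serialized and prepended to the summary, so the decoder first emits the plan and then the text. The marker strings and separators below are hypothetical placeholders, and entity extraction is assumed to have been done elsewhere.

```python
def with_entity_chain(summary_sentences, entities_per_sentence):
    # entities_per_sentence[k] lists the entities of sentence k, in order.
    # Assumption: " | " separates entities within a sentence and " ||| "
    # separates sentences; the bracketed markers are illustrative only.
    chain = " ||| ".join(" | ".join(ents) for ents in entities_per_sentence)
    summary = " ".join(summary_sentences)
    return f"[ENTITYCHAIN] {chain} [SUMMARY] {summary}"

target = with_entity_chain(
    ["Frozen is a Disney film.", "Elsa is the lead."],
    [["Frozen", "Disney"], ["Elsa"]],
)
```

Because the chain is generated before the summary, editing or constraining it at inference time gives a handle on which entities the summary may mention.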
Saadia Gabriel, Antoine Bosselut, Jeff Da, Ari Holtzman, Jan Buys, Kyle Lo, Asli Celikyilmaz, Yejin Choi
The paper introduces a framework called Co-opNet for generating abstractive summaries with factual consistency and narrative flow. Co-opNet is a transformer-based framework where a generator works with a discriminator architecture to compose coherent long-form summaries. The paper explores four different discriminator objectives to capture different aspects of coherence. The ability of Co-opNet to learn these objectives is measured using arXiv scientific papers, with empirical results showing improved global coherence compared to competitive baselines.
Haoran Li, Arash Einolghozati, Srinivasan Iyer, Bhargavi Paranjape, Yashar Mehdad, Sonal Gupta, Marjan Ghazvininejad
The paper proposes a new framework called EASE that combines the strengths of extractive and abstractive summarization systems to generate concise and interpretable summaries. The framework uses the Information Bottleneck principle to jointly train extraction and abstraction in an end-to-end fashion. Inspired by human summarization methods, the framework first extracts a pre-defined amount of evidence spans and then generates a summary using only the evidence. The authors show through automatic and human evaluations that the generated summaries are better than strong extractive and extractive-abstractive baselines.
Haoran Li, Song Xu, Peng Yuan, Yujia Wang, Youzheng Wu, Xiaodong He, Bowen Zhou
The paper proposes a new copying scheme called Correlational Copying Network (CoCoNet) for abstractive summarization that enhances the standard copying mechanism by keeping track of the copying history. CoCoNet takes advantage of prior copying distributions and encourages the model to copy the input word that is relevant to the previously copied one. The model is strengthened through pretraining with suitable corpora that simulate the copying behaviors. Experimental results show that CoCoNet can copy more accurately and achieves new state-of-the-art performances on summarization benchmarks, including CNN/DailyMail for news summarization and SAMSum for dialogue summarization. The code is available at https://github.com/hrlinlp/coconet.
Shuo Guan, Ping Zhu, Zhihua Wei
The paper proposes KAS, an abstractive sentence summarization method that addresses the issue of sparse knowledge structure. The proposed method utilizes topic keywords and knowledge structure to generate high-quality summaries. The results show that KAS outperforms existing methods in terms of ROUGE scores and human evaluation.
Zi-Yi Dou, Pengfei Liu, Hiroaki Hayashi, Zhengbao Jiang, Graham Neubig
The paper discusses the challenges of neural abstractive summarization models, which can produce coherent summaries but may be unfaithful and difficult to control. The authors propose a guided summarization framework (GSum) that can effectively take different types of external guidance as input and demonstrate its effectiveness in achieving state-of-the-art performance on popular summarization datasets. The authors also show how different types of guidance can generate qualitatively different summaries, providing a degree of controllability to the learned models.
Feng Nan, Ramesh Nallapati, Zhiguo Wang, Cicero Nogueira dos Santos, Henghui Zhu, Dejiao Zhang, Kathleen McKeown, Bing Xiang
The paper discusses the challenge of ensuring factual consistency in abstractive summarization, particularly in relation to entity hallucination. The authors propose new metrics to measure entity-level factual consistency and suggest filtering training data as a solution to the problem. They also propose a summary-worthy entity classification task and a joint entity and summary generation approach to further improve entity-level metrics.
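An entity-level consistency score and the data-filtering step can be sketched as follows. Entities are assumed to be already extracted (for example by a named-entity recognizer), matching is a case-insensitive exact match, and the function names are hypothetical; the paper's exact metric definitions may differ.

```python
def entity_precision(summary_entities, source_entities):
    # Fraction of summary entities that also occur in the source document.
    if not summary_entities:
        return 1.0  # no entities, so none can be hallucinated
    source = {e.lower() for e in source_entities}
    hits = sum(e.lower() in source for e in summary_entities)
    return hits / len(summary_entities)

def filter_training_pairs(pairs):
    # Keep (source_entities, summary_entities) pairs whose reference summary
    # mentions only entities found in the source.
    return [p for p in pairs if entity_precision(p[1], p[0]) == 1.0]

pairs = [
    (["Paris", "France"], ["Paris"]),
    (["Paris", "France"], ["Paris", "Berlin"]),
]
clean = filter_training_pairs(pairs)
```

Filtering in this way removes training examples whose reference summaries would teach the model to produce entities it cannot ground in the input.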
Chenguang Zhu, William Hinthorn, Ruochen Xu, Qingkai Zeng, Michael Zeng, Xuedong Huang, Meng Jiang
The paper discusses the issue of inconsistency between automatic abstractive summaries and the original text, which can distort or fabricate facts. To address this problem, the authors propose a fact-aware summarization model called FASUM, which integrates factual relations into the summary generation process using graph attention. They also introduce a factual corrector model called FC to automatically correct factual errors in existing summaries. Empirical results show that FASUM produces more factually consistent summaries compared to existing systems, and FC can improve the factual consistency of given summaries by modifying only a few keywords.
Tiezheng Yu, Zihan Liu, Pascale Fung
The paper discusses the challenges faced by state-of-the-art abstractive summarization models due to their reliance on extensive labeled data, which limits their generalization ability on domains where such data are not available. The authors present a study of domain adaptation for the abstractive summarization task in a low-resource setting, focusing on the second phase of pre-training on large-scale generative models under three different settings. The experiments show that the effectiveness of pre-training is correlated with the similarity between the pre-training data and the target domain task. The authors also find that continuing pre-training could lead to catastrophic forgetting, and a learning method with less forgetting can alleviate this issue. The results highlight the need for more advanced domain adaptation methods for the abstractive summarization task, as a huge gap still exists between the low-resource and high-resource settings.
Ye Ma, Zixun Lan, Lu Zong, Kaizhu Huang
The paper presents a new algorithm for neural abstractive summarization that addresses the local optimality problem of the original beam search. The algorithm uses a novel global protocol based on the attention distribution to generate summaries in a near-globally optimal fashion. The global attention distribution can be predicted before inference, allowing for step-wise improvements on the beam search through the global scoring mechanism. The algorithm is shown to significantly improve state-of-the-art summarization models on nine datasets and remains robust even with corrupted attention distributions. Code and examples are available.
Sangwoo Cho, Kaiqiang Song, Chen Li, Dong Yu, Hassan Foroosh, Fei Liu
The paper proposes a method to generate summary highlights that can be overlaid on original documents to help readers sift through large amounts of text. The method aims to prevent distortion of the original meaning by providing summaries in context. The method combines determinantal point processes and deep contextualized representations to identify important and non-redundant sub-sentence segments to form self-contained highlights. The paper presents extensive experiments on summarization datasets to demonstrate the flexibility and modeling power of the method. The authors conclude that highlighting is a promising avenue for future summarization research.
Kaiqiang Song, Bingqing Wang, Zhe Feng, Fei Liu
The paper proposes a new approach to generate multiple summaries with diverse content and varying lengths, and then select the best ones based on user needs. The approach involves a two-stage strategy to generate a diverse set of candidate summaries from the source text and then score and select admissible ones. The generator gives precise control over the length of the summary, and the selectors are designed to predict the optimal summary length and emphasize faithfulness to the original text. The approach achieves state-of-the-art performance on benchmark summarization datasets.
Sihao Chen, Fan Zhang, Kazoo Sone, Dan Roth
The paper discusses how current models for neural abstractive summarization often generate summaries that are not faithful to the original context. To address this issue, the authors propose a post-processing technique called contrast candidate generation and selection. They generate alternative candidate summaries where named entities and quantities are replaced with compatible semantic types from the source document, and then use a discriminative correction model to select the best candidate as the final output summary. The authors' experiments show that this method is effective in identifying and correcting extrinsic hallucinations. They also analyze the typical hallucinations produced by different types of neural summarization systems, in the hope of providing insights for future work in this direction.
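The candidate generation step described above can be illustrated with a small sketch. It assumes that entities and their coarse semantic types have already been tagged by some external tool; the function name `contrast_candidates`, the type labels, and the example sentences are hypothetical and not taken from the paper. The selection step (the discriminative correction model) is not shown.

```python
from typing import Dict, List


def contrast_candidates(
    summary: str,
    summary_entities: Dict[str, str],
    source_entities: Dict[str, str],
) -> List[str]:
    """Return the original summary plus variants with one entity swapped.

    Both dictionaries map an entity string to a coarse semantic type
    (for example "PERSON" or "DATE"). Producing them is left to an
    external tagger and is outside this sketch.
    """
    candidates = [summary]
    for entity, entity_type in summary_entities.items():
        for replacement, replacement_type in source_entities.items():
            # Only swap in source entities of the same semantic type.
            if replacement_type == entity_type and replacement != entity:
                candidates.append(summary.replace(entity, replacement))
    return candidates


candidates = contrast_candidates(
    "Alice Smith resigned on Monday.",
    {"Alice Smith": "PERSON", "Monday": "DATE"},
    {"Bob Jones": "PERSON", "Friday": "DATE", "Monday": "DATE"},
)
```

In this toy case the list holds the original summary and two variants, one per compatible replacement; a separate scorer would then pick the candidate judged most faithful to the source.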
Hanlu Wu, Tengfei Ma, Lingfei Wu, Tariro Manyumwa, Shouling Ji
The paper proposes a new method for evaluating the quality of document summarization systems without requiring human-generated reference summaries. The method uses unsupervised contrastive learning and a new metric based on BERT that covers both linguistic qualities and semantic informativeness. The model is trained with a ranking loss using different types of negative samples for each summary. The experiments on Newsroom and CNN/Daily Mail datasets show that the proposed method outperforms other metrics and is generalizable across datasets.
Matt Wilber, William Timkey, Marten van Schijndel
The paper discusses the limitations of abstractive neural summarization models despite their improved ROUGE scores. The authors conducted experiments on the pointer-generator model to understand how it controls its level of abstraction and extraction. The model utilizes syntactic boundaries to truncate sentences on an extractive-biased dataset, but when forced to generate, it only shows simple paraphrasing abilities with factual inaccuracies and hallucinations. On an abstractive-biased dataset, the model copies infrequently and shows limited abstractive abilities. The results suggest that abstractive summarization models lack the semantic understanding necessary to generate faithful and abstractive paraphrases.
Shuyang Cao, Lu Wang
The paper presents a technique called attention head masking to effectively inform content selection in Transformer-based abstractive summarization models. This technique is applied on encoder-decoder attentions to identify important content during inference. The authors demonstrate the effectiveness of this technique on three document summarization datasets, including in-domain and cross-domain settings. Their models outperform prior state-of-the-art models on CNN/Daily Mail and New York Times datasets. Additionally, the inference-time masking technique is data-efficient, requiring less than 20% of the training samples to outperform BART fine-tuned on the full CNN/DailyMail dataset.
Chulaka Gunasekara, Guy Feigenblat, Benjamin Sznajder, Ranit Aharonov, Sachindra Joshi
The paper discusses the issues with current neural abstractive summarization models and presents a framework to train these models to improve their summaries. The framework involves training a sequence-to-sequence model and then further training it in a Reinforcement Learning setting with question-answering based rewards. The experimental results show that this approach can improve the quality of the summaries generated by these models, with human evaluations showing a preference for the approach over general abstractive summarization models 30% of the time.
Mathieu Ravaut, Shafiq Joty, Nancy F. Chen
The paper discusses the limitations of using beam search to generate summaries with sequence-to-sequence neural networks, due to the large search space and exposure bias. The authors propose a solution of directly training a second-stage model to perform re-ranking on a set of summary candidates, resulting in improved performance of the base model. Their SummaReranker model achieves state-of-the-art results on several datasets, with code and checkpoints available online.
Shuyang Cao, Lu Wang
The paper discusses a new approach to generating abstractive summaries that are both faithful and factually consistent with the given articles. The approach uses a contrastive learning formulation that leverages both reference summaries and automatically generated erroneous summaries to train summarization systems that are better at distinguishing between them. The paper also describes four strategies for creating negative samples that resemble errors made commonly by two state-of-the-art models, BART and PEGASUS. Experiments on XSum and CNN/Daily Mail show that the contrastive learning framework consistently produces more factual summaries than other approaches, according to QA-based factuality evaluation. Human judges also find that the model summaries correct more errors.
Haonan Wang, Yang Gao, Yu Bai, Mirella Lapata, Heyan Huang
The paper discusses the limitations of current neural models for document summarization, which lack transparency and control. To address this issue, the authors propose a novel select-and-generate framework called ESCA that focuses on explainability. The framework reveals the latent centrality and interactions between sentences, along with scores for sentence novelty and relevance, to give users a window into the choices the model is making and an opportunity to guide those choices. A novel pair-wise matrix captures the sentence interactions, centrality, and attribute scores, and a mask with tunable attribute thresholds allows the user to control which sentences are likely to be included in the extraction. A sentence-deployed attention mechanism in the abstractor ensures the final summary emphasizes the desired content. ESCA outperformed eight state-of-the-art models on the CNN/DailyMail and NYT50 benchmark datasets in a series of experiments assessed with ROUGE metrics and two human evaluations.
Haopeng Zhang, Semih Yavuz, Wojciech Kryściński, Kazuma Hashimoto, Yingbo Zhou
The paper discusses the limitations of abstractive summarization systems that use pre-training language models, which are prone to hallucinating facts that are not faithful to the input context. To address this issue, the authors propose a method called Entity Coverage Control (ECC) that computes entity coverage precision and adds a control code to each training example to guide the model to recognize faithful contents. They also extend their method through intermediate fine-tuning on noisy data extracted from Wikipedia to enable zero-shot summarization. The proposed method leads to more faithful and salient abstractive summarization in supervised fine-tuning and zero-shot settings, as demonstrated by experimental results on three benchmark datasets of different domains and styles.
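A minimal sketch of the entity coverage idea above: precision is taken here as the fraction of summary entities that also appear in the source, and a control code is derived by bucketing that value. The string-matching test, the number of bins, and the `<ecc_k>` token format are assumptions made for illustration; the summary does not specify them.

```python
from typing import Iterable


def entity_coverage_precision(summary_entities: Iterable[str], source_text: str) -> float:
    """Fraction of summary entities that also occur in the source text."""
    entities = list(summary_entities)
    if not entities:
        # Assumption: a summary with no entities cannot hallucinate one.
        return 1.0
    source_lower = source_text.lower()
    covered = sum(1 for entity in entities if entity.lower() in source_lower)
    return covered / len(entities)


def control_code(precision: float, num_bins: int = 3) -> str:
    """Map a precision in [0, 1] to a hypothetical control token."""
    bin_index = min(int(precision * num_bins), num_bins - 1)
    return f"<ecc_{bin_index}>"


source = "Acme Corp said on Tuesday that it will open a plant in Ohio."
precision = entity_coverage_precision(["Acme Corp", "Ohio", "Texas", "Tuesday"], source)
code = control_code(precision)
```

Here three of the four summary entities occur in the source, so the example falls into the highest bucket. During training, such a token would be prepended to the input so the model can associate it with faithful content; at inference time the highest-precision code would be requested.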
Xiaochen Liu, Yang Gao, Yu Bai, Jiawei Li, Yinan Hu, Heyan Huang, Boxing Chen
The paper presents a new approach to few-shot abstractive summarization using a soft prompts architecture coupled with prompt pre-training and fine-tuning. The soft prompts consist of continuous input embeddings across an encoder and decoder, with a new inner-prompt introduced to capture document-level information. The approach uses prompt pre-training with self-supervised pseudo-data to teach the model basic summarizing capability, followed by fine-tuning with few-shot examples using lightweight soft prompts. Experimental results on the CNN/DailyMail and XSum datasets show that the method outperforms full-model tuning and Prompt Tuning, and delivers competitive results against PrefixTuning with significantly fewer parameters.
David Wan, Mohit Bansal
The paper presents FACTPEGASUS, an abstractive summarization model that focuses on factuality during pre-training and fine-tuning. The model uses a sentence selection strategy to create pseudo-summaries that are both important and factual, and introduces three complementary components for fine-tuning: a corrector to remove hallucinations, a contrastor to differentiate factual from nonfactual summaries, and a connector to improve knowledge transfer. Experiments show that FACTPEGASUS substantially improves factuality and is more factual than using the original pre-training objective in zero-shot and few-shot settings, while also retaining factual behavior more robustly than strong baselines.
Yuanjie Lyu, Chen Zhu, Tong Xu, Zikai Yin, Enhong Chen
The paper proposes a new model for abstractive summarization called Entity-Relation Pointer Generator Network (ERPGN) that formalizes the facts in the original document as a factual knowledge graph and generates a high-quality summary by directly modeling consistency between the summary and the knowledge graph. The model uses two pointer network structures to capture the facts in the original document and two semantic-level losses to measure the disagreement between the summary and the facts. The experiments show that ERPGN outperforms classic abstractive summarization models and state-of-the-art fact-aware baseline methods in terms of faithfulness.
Shengqiang Zhang, Xingxing Zhang, Hangbo Bao, Furu Wei
The paper discusses how abstractive text summarization relies on large, computationally expensive pre-trained sequence-to-sequence Transformer models, and proposes a method to distill these models into smaller ones with minimal performance loss. The method involves manipulating attention temperatures in Transformers to make pseudo labels easier to learn for student models. Experiments on three summarization datasets show that this method consistently improves vanilla pseudo-labeling based methods, and both pseudo labels and summaries produced by the student models are shorter and more abstractive. The code for the proposed method is available on GitHub.
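To make the notion of an attention temperature concrete, the sketch below divides pre-softmax attention scores by a temperature before normalizing. This is the standard temperature-scaled softmax, used here as an assumption; the summary does not say where in the teacher model the temperature is applied or which values are used, and the scores are made up.

```python
import math
from typing import List


def attention_weights(scores: List[float], temperature: float = 1.0) -> List[float]:
    """Softmax over attention scores divided by a temperature."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # Subtract the maximum for numerical stability.
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


scores = [4.0, 1.0, 0.0]
sharp = attention_weights(scores, temperature=1.0)
smooth = attention_weights(scores, temperature=2.0)
```

A temperature above 1 flattens the distribution, so `smooth` spreads more weight over the lower-scoring positions than `sharp` does.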
Kaiqiang Song, Chen Li, Xiaoyang Wang, Dong Yu, Fei Liu
The paper discusses the challenges of summarizing podcasts, including factual inconsistencies and speech disfluencies in transcripts. The authors propose a novel abstractive summarization method that grounds summary segments in specific regions of the transcript to improve summarization quality. They conducted a series of analyses on a large podcast dataset and found that their approach achieved promising results, improving both automatic and human evaluation of summarization quality.
Ye Xiong, Teeradaj Racharak, Minh Le Nguyen
The paper discusses the use of elementary discourse units (EDUs) as the textual unit of content selection for abstractive summarization. The authors propose a novel summarization model that first designs an EDU selector to choose salient content, and then the generator model rewrites the selected EDUs as the final summary. To determine the relevancy of each EDU on the entire document, the authors apply group tag embedding. Extensive experiments on the CNN/Daily Mail dataset have demonstrated the effectiveness of their model.
Yixin Liu, Pengfei Liu, Dragomir Radev, Graham Neubig
The paper proposes a new training paradigm for abstractive summarization models that assumes a non-deterministic distribution, which assigns probability mass to different candidate summaries based on their quality. This approach addresses the performance degradation issue during inference, where the model needs to compare system-generated summaries that deviate from the reference summary. The proposed method achieves a new state-of-the-art result on the CNN/DailyMail and XSum datasets, and can estimate probabilities of candidate summaries that are more correlated with their level of quality.
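One way to push a model toward scoring candidates in order of quality is a pairwise margin ranking loss. The sketch below is an illustration under that assumption, not a statement of the paper's exact objective: the function name, the margin value, and the example scores (for instance, length-normalized log-probabilities) are hypothetical.

```python
from typing import List


def candidate_ranking_loss(scores: List[float], margin: float = 0.01) -> float:
    """Pairwise margin loss over candidates sorted from best to worst quality.

    scores[i] is the model's score for the i-th best candidate, where
    quality is judged by an external metric against the reference.
    """
    loss = 0.0
    for i in range(len(scores)):
        for j in range(i + 1, len(scores)):
            # A larger gap in rank demands a larger gap in score.
            loss += max(0.0, scores[j] - scores[i] + margin * (j - i))
    return loss


well_ordered = candidate_ranking_loss([-0.5, -0.9, -1.4])
mis_ordered = candidate_ranking_loss([-1.4, -0.9, -0.5])
```

The loss is zero when better candidates already receive sufficiently higher scores and grows when the model's ordering contradicts the quality ordering.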
José Ángel González, Annie Louis, Jackie C. K. Cheung
The paper discusses the phenomenon of referring to entities in later discourse by a more general description, and how this applies to summarization. The authors categorize these instances as source-summary entity aggregations and analyze them in the CNN/DAILYMAIL corpus. They examine how well three state-of-the-art summarization systems can generate such aggregations and develop techniques to encourage them to generate more. The results show that there is significant room for improvement in producing semantically correct aggregations.
Han Guo, Ramakanth Pasunuru, Mohit Bansal
The paper proposes a method to improve abstractive summarization by using multi-task learning with the auxiliary tasks of question generation and entailment generation. The former helps the summarization model identify salient questioning-worthy details, while the latter teaches the model how to rewrite a summary that is a directed-logical subset of the input document. The paper also proposes novel multitask architectures with high-level layer-specific sharing and soft-sharing mechanisms, which result in statistically significant improvements over the state-of-the-art on various datasets. The paper presents quantitative and qualitative analysis studies of the model's learned saliency and entailment skills.
Wojciech Kryściński, Romain Paulus, Caiming Xiong, Richard Socher
The paper proposes two techniques to improve the level of abstraction in abstractive text summarization. The first technique involves decomposing the decoder into a contextual network and a pretrained language model. The second technique involves a novelty metric that encourages the generation of novel phrases. The proposed model achieves results comparable to state-of-the-art models, while achieving a significantly higher level of abstraction as measured by n-gram overlap with the source document.
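Abstraction measured by n-gram overlap can be sketched as the share of summary n-grams that never occur in the source. The whitespace tokenization, lowercasing, and use of distinct n-grams below are simplifying assumptions, and the function names are illustrative.

```python
from typing import List, Set, Tuple


def ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    """Set of distinct n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def novel_ngram_fraction(source: str, summary: str, n: int = 2) -> float:
    """Fraction of distinct summary n-grams that do not appear in the source."""
    source_ngrams = ngrams(source.lower().split(), n)
    summary_ngrams = ngrams(summary.lower().split(), n)
    if not summary_ngrams:
        return 0.0
    novel = summary_ngrams - source_ngrams
    return len(novel) / len(summary_ngrams)


source = "the cat sat on the mat"
copied = novel_ngram_fraction(source, "the cat sat")
rewritten = novel_ngram_fraction(source, "a cat rested on the mat")
```

A purely copied summary scores 0, while the rewritten one introduces three new bigrams out of five.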
Alexios Gidiotis, Grigorios Tsoumakas
The paper explores uncertainty in modern abstractive summarization models using Bayesian Deep Learning. The authors use Monte Carlo dropout to approximate Bayesian inference and perform multiple stochastic forward passes to quantify uncertainty at prediction time. This allows for filtering out generated summaries of high uncertainty and can be used for selecting samples for annotation. Bayesian inference also enables finding a summary that performs better than a deterministic one and is more robust to uncertainty. Their Variational Bayesian equivalents of BART and PEGASUS outperform their deterministic counterparts on multiple benchmark datasets.
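The mechanics of Monte Carlo dropout can be shown on a toy model: keep dropout active at prediction time, run several stochastic passes, and read the spread of the outputs as an uncertainty estimate. The linear scorer, weights, and pass count below are assumptions for illustration only; the paper applies the idea to full summarization models, where the disagreement would be measured between sampled summaries rather than scalars.

```python
import random
import statistics
from typing import List, Tuple


def dropout_forward(
    features: List[float], weights: List[float], drop_prob: float, rng: random.Random
) -> float:
    """One stochastic pass of a toy linear scorer with inverted dropout."""
    keep_prob = 1.0 - drop_prob
    total = 0.0
    for x, w in zip(features, weights):
        # Keep each input with probability keep_prob and rescale it.
        if rng.random() < keep_prob:
            total += (x / keep_prob) * w
    return total


def mc_dropout(
    features: List[float],
    weights: List[float],
    drop_prob: float,
    passes: int = 50,
    seed: int = 0,
) -> Tuple[float, float]:
    """Mean and variance of the output over several stochastic passes."""
    rng = random.Random(seed)
    outputs = [dropout_forward(features, weights, drop_prob, rng) for _ in range(passes)]
    return statistics.mean(outputs), statistics.pvariance(outputs)


features = [1.0, 2.0, 3.0]
weights = [0.5, -0.25, 1.0]
mean_off, var_off = mc_dropout(features, weights, drop_prob=0.0)
mean_on, var_on = mc_dropout(features, weights, drop_prob=0.5)
```

With dropout disabled every pass is identical and the variance is zero; with dropout enabled the variance becomes positive and can serve as a threshold for filtering uncertain outputs.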
Yizhu Liu, Qi Jia, Kenny Q. Zhu
The paper proposes a new approach for length-controllable summarization models that adapts the encoding of the source based on the desired length. This is achieved through a length-aware attention mechanism (LAAM) that is trained on a summary length balanced dataset built from the original training data. The results show that this approach is effective in generating high-quality summaries with desired lengths, including those that were not seen in the original training set. Previous models tended to generate summaries as long as those in the training data, but LAAM can generate shorter summaries as well.
The paper proposes a neural network model that generates informative and concise summaries for opinionated text. The model uses an attention-based mechanism to absorb information from multiple text units and an importance-based sampling method to integrate important input. The system outperforms state-of-the-art summarization systems on newly collected datasets of movie reviews and arguments and is rated higher in human evaluation for informativeness and grammaticality.
Abigail See, Peter J. Liu, Christopher D. Manning
The paper discusses the limitations of neural sequence-to-sequence models for abstractive text summarization, which can inaccurately reproduce factual details and repeat themselves. The authors propose a new architecture that uses a hybrid pointer-generator network to accurately reproduce information while retaining the ability to generate novel words, and coverage to discourage repetition. The model is applied to the CNN/Daily Mail summarization task and outperforms the current abstractive state-of-the-art by at least 2 ROUGE points.
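The hybrid behaviour described above comes from mixing two distributions at each decoding step: with probability p_gen a word is generated from the vocabulary, and otherwise a source token is copied in proportion to its attention weight. A minimal sketch of that mixture follows; the numbers and token names are made up, and the coverage mechanism is not shown.

```python
from typing import Dict, List


def final_distribution(
    p_gen: float,
    vocab_dist: Dict[str, float],
    attention: List[float],
    source_tokens: List[str],
) -> Dict[str, float]:
    """Mix the generation and copy distributions for one decoding step."""
    # Generation part: scale the vocabulary distribution by p_gen.
    final = {word: p_gen * prob for word, prob in vocab_dist.items()}
    # Copy part: add attention mass to each source token, which may be
    # out of vocabulary and may occur more than once in the source.
    for token, weight in zip(source_tokens, attention):
        final[token] = final.get(token, 0.0) + (1.0 - p_gen) * weight
    return final


vocab_dist = {"the": 0.6, "team": 0.3, "<unk>": 0.1}
source_tokens = ["the", "Zorblax", "team"]
attention = [0.2, 0.7, 0.1]
dist = final_distribution(0.4, vocab_dist, attention, source_tokens)
```

The out-of-vocabulary token "Zorblax" receives probability only through the copy term, which is how the model can reproduce rare names and numbers from the source.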
Jiwei Tan, Jianguo Xiao
The paper discusses the challenge of extending sentence summarization models to the task of document headline generation. The proposed solution is a coarse-to-fine approach that first identifies important sentences using document summarization techniques and then uses a multi-sentence summarization model with hierarchical attention to generate headlines. The approach significantly improves the performance of neural sentence summarization models on the headline generation task, as demonstrated by experimental results on a large real dataset.
Shashi Narayan, Shay B. Cohen, Mirella Lapata
The paper introduces a new summarization task called extreme summarization, which requires an abstractive modeling approach to create a one-sentence news summary that answers the question "What is the article about?" A large dataset was collected from the BBC, and a novel abstractive model based on convolutional neural networks was proposed. The model was shown to outperform both extractive and abstractive approaches when evaluated by humans and automatically. The architecture captures long-range dependencies in a document and recognizes pertinent content.
Ramakanth Pasunuru, Mohit Bansal
The paper discusses the task of abstractive text summarization, which involves compressing a long document into a short summary while maintaining important aspects such as saliency, logical entailment, and non-redundancy. The authors propose a reinforcement learning approach with two novel reward functions, ROUGESal and Entail, in addition to a coverage-based baseline. The ROUGESal reward up-weights salient phrases/words detected via a keyphrase classifier, while the Entail reward gives high scores to logically-entailed summaries using an entailment classifier. The authors show that combining these rewards with traditional metric-based rewards leads to superior performance improvement, achieving state-of-the-art results on the CNN/Daily Mail dataset and strong improvements on the DUC-2002 dataset.
Yau-Shian Wang, Hung-Yi Lee
The paper proposes a method for achieving unpaired abstractive summarization using an auto-encoder that encodes input text into human-readable sentences. The auto-encoder consists of a generator and a reconstructor, with a discriminator used to ensure the generator output resembles human-written sentences. The generator encodes the input text into a shorter word sequence, and the reconstructor recovers the generator input from the generator output. This approach achieves abstractive summarization without the need for document-summary pairs as training data, and promising results are shown on both English and Chinese corpora.
Shuming Ma, Xu Sun, Junyang Lin, Xuancheng Ren
The paper proposes a hierarchical end-to-end model for joint learning of text summarization and sentiment classification, where the sentiment classification label is treated as a further "summarization" of the text summarization output. The model achieves better performance than strong baseline systems on both abstractive summarization and sentiment classification, as shown by experimental results on Amazon online reviews datasets. Text summarization and sentiment classification aim to capture the main ideas of the text at different levels, with text summarization describing the text within a few sentences and sentiment classification summarizing the text into an even more abstract fashion, i.e., a sentiment class.
Thomas Scialom, Sylvain Lamprier, Benjamin Piwowarski, Jacopo Staiano
The paper discusses how abstractive summarization approaches based on Reinforcement Learning (RL) can overcome the limitations of classical likelihood maximization. The most commonly used summarization metric, ROUGE, has limitations such as bias towards lexical similarity and suboptimal accounting for fluency and readability. The paper proposes alternative evaluation measures based on Question Answering, which were found to compare favorably with ROUGE and do not require reference summaries. Training an RL-based model on these metrics leads to improvements in both human and automated metrics.
The paper proposes a new method for studying content selection in topic-focused summarization called the summary cloze task. The task involves generating the next sentence of a summary based on a topic, a partial summary, and a reference document. The challenge is deciding what information in the references is relevant to the topic and partial summary and should be included in the summary. The paper reports experimental results on a dataset of nearly 500k summary cloze instances from Wikipedia using various extractive and abstractive models. The results show that the task remains a significant challenge, but the topic and partial summary help the models identify relevant content.
Yang Liu, Mirella Lapata
The paper discusses the use of Bidirectional Encoder Representations from Transformers (BERT) in text summarization and proposes a framework for both extractive and abstractive models. They introduce a document-level encoder based on BERT that can express the semantics of a document and obtain representations for its sentences. They also propose a new fine-tuning schedule for abstractive summarization that adopts different optimizers for the encoder and decoder to alleviate the mismatch between the two. The experiments on three datasets show that their model achieves state-of-the-art results in both extractive and abstractive settings.
Kushal Chawla, Balaji Vasan Srinivasan, Niyati Chhaya
The paper discusses a reinforcement learning based approach to generate formality-tailored summaries for an input article. The model can generate both formal and informal summary variants, accommodating the psycho-linguistic preferences of the intended audience. The proposed framework includes a novel input-dependent reward function that aids in training the model with stylistic feedback on sampled and ground-truth summaries. Automated and qualitative evaluations show the viability of the approach.
Yoshihiko Suhara, Xiaolan Wang, Stefanos Angelidis, Wang-Chiew Tan
The paper presents OPINIONDIGEST, an opinion summarization framework that uses an Aspect-based Sentiment Analysis model to extract opinion phrases from reviews and trains a Transformer model to reconstruct the original reviews. The framework selects the most popular opinions and uses them to generate an opinion summary. OPINIONDIGEST can also generate customized summaries by filtering opinions according to aspect and sentiment. The framework outperforms competitive baselines in automatic evaluation and produces informative summaries with promising customization capabilities, as verified by human studies.
Bowen Tan, Lianhui Qin, Eric P. Xing, Zhiting Hu
The paper discusses aspect-based abstractive summarization, which generates a summary of a document based on a specific topic of interest. Previous studies have only focused on a small set of pre-defined topics, limiting the application of the task. The authors propose a new method that allows summarization on arbitrary topics relevant to the document, using external knowledge sources such as ConceptNet and Wikipedia. Experiments show that their approach improves performance on both real and synthetic documents.
Yang Deng, Wenxuan Zhang, Wai Lam
The paper proposes a new method called Multi-hop Selective Generator (MSG) for question-driven abstractive summarization. This method incorporates multi-hop reasoning to provide justifications for the generated summaries. The proposed method outperforms state-of-the-art methods on two non-factoid QA datasets, namely WikiHow and PubMedQA. The method jointly models the relevance to the question and the interrelation among different sentences via a human-like multi-hop inference module and a gated selective pointer generator network with a multi-view coverage mechanism.
Kazuki Matsumaru, Sho Takase, Naoaki Okazaki
The paper discusses the concern about the truthfulness of generated summaries in abstractive summarization and explores improving the truthfulness in headline generation on two popular datasets. The study analyzes headlines generated by the state-of-the-art encoder-decoder model and shows that the model sometimes generates untruthful headlines due to untruthful supervision data used for training the model. To remedy this problem, the study hypothesizes that removing untruthful instances from the supervision data may help and builds a binary classifier that predicts an entailment relation between an article and its headline to filter out untruthful instances. Experimental results demonstrate that the headline generation model trained on filtered supervision data shows remarkable improvements in automatic and manual evaluations of the generated headlines.
Potsawee Manakul, Mark J. F. Gales
The paper discusses the use of transformer-based models in natural language processing tasks, specifically document summarization. While these models have achieved impressive results, they struggle with scaling as input length grows, making it difficult to train or fine-tune them for long document summarization. The paper proposes two methods, local self-attention and explicit content selection, to address long-span dependencies in abstractive summarization. The approaches are compared on various network configurations and tested on standard long-span summarization tasks, achieving state-of-the-art results on all three tasks in the ROUGE scores. The paper also notes that their approach can achieve comparable or better results than existing approaches without requiring a large-scale GPU card.
Shusheng Xu, Xingxing Zhang, Yi Wu, Furu Wei
The paper proposes a contrastive learning model for supervised abstractive text summarization, which maximizes the similarities between different views of the same mean representation during training. The model outperforms a strong sequence-to-sequence text generation model on three different summarization datasets and achieves better faithfulness ratings in human evaluation. The code is available at https://github.com/xssstory/SeqCo.
Travis R. Goodwin, Max E. Savery, Dina Demner-Fushman
The paper discusses the problem of conditional summarization, where content selection and surface realization are based on a natural language question or topic description. The authors explore the use of multi-task fine-tuning (MTFT) on twenty-one natural language tasks to enable zero-shot conditional summarization on five tasks. They present four new summarization datasets and report zero-shot performance using T5 and BART, demonstrating that MTFT can improve zero-shot summarization quality. The paper highlights the importance of specific summaries for applications such as question answering and literature discovery.
Jacob Parnell, Inigo Jauregi Unanue, Massimo Piccardi
The paper proposes two reward functions for abstractive summarization, RwBHinge and RISK, to improve upon the negative log-likelihood (NLL) baselines commonly used in training models. The experiments show that the proposed approach consistently improves performance over the NLL baselines when fine-tuning an NLL pre-trained model on nine diverse summarization datasets. The reward function used in reinforcement learning plays a key role in performance and is still partially unexplored.
Potsawee Manakul, Mark J. F. Gales
The paper discusses the challenges of using transformer models for NLP tasks, particularly in summarization, due to the computational expense of the encoder-decoder attention mechanism. The authors propose a modified architecture that selects a subset of input sentences to constrain the attention mechanism, based on the empirical observation of a sparse sentence structure in document summarization. Experiments on various summarization tasks show that the proposed approach maintains system performance while reducing computational cost.
Hou Pong Chan, Lu Wang, Irwin King
The paper discusses controllable text summarization, which allows users to control specific attributes of generated summaries. The authors propose a new training framework based on Constrained Markov Decision Process (CMDP) that includes a reward function and constraints to improve summarization control. The reward function encourages summaries to resemble human-written references, while the constraints prevent generated summaries from violating user-imposed requirements. The framework can be used to control important attributes of summarization, such as length, covered entities, and abstractiveness. Experiments show that the CMDP framework helps generate informative summaries while complying with specific attribute requirements.
Vidhisha Balachandran, Artidoro Pagnoni, Jay Yoon Lee, Dheeraj Rajagopal, Jaime Carbonell, Yulia Tsvetkov
The paper discusses the challenges faced by abstractive text summarization models, including layout bias, limited abstractiveness, and lack of transparency. The authors propose a framework based on document-level structure induction for summarization that incorporates latent and explicit dependencies across sentences in the source document into end-to-end single-document summarization models. The framework improves the coverage of content in the source documents, generates more abstractive summaries by generating more novel n-grams, and incorporates interpretable sentence-level structures, while performing on par with standard baselines. The framework was trained on the CNN/DM dataset.
Reinald Kim Amplayo, Mirella Lapata, Samuel L. Jackson
The paper proposes a new approach to opinion summarization that eliminates the need for pre-selected content and allows for the use of all input reviews. The approach involves condensing the reviews into multiple dense vectors which are then used as input to an abstractive model. The framework also includes a zero-shot customization technique that takes user preferences into account. Experimental results show that the proposed model outperforms existing methods on the Rotten Tomatoes dataset and generates more informative and customized summaries.
Tanya Goyal, Greg Durrett
The paper discusses the issue of factual errors in abstractive summarization systems and explores different data sources for training models to identify these errors. The authors found that factual errors differ significantly across datasets and that human-labeled data with fine-grained annotations is more effective for training models than synthetic data or sentence-level annotations. They also show that their best factuality detection model enables training of more factual summarization models by identifying non-factual tokens in the training data.
Wang Xu, Tiejun Zhao
The paper discusses the challenges of generating factually consistent summaries through abstractive summarization and proposes a novel framework based on conditional variational autoencoders that induces guidance information and generates summaries conditioned on that guidance at the same time. Experiments on the XSUM and CNNDM datasets show that the approach generates relevant and fluent summaries that are more faithful than existing state-of-the-art approaches according to multiple factual consistency metrics.
Arthur Bražinskas, Ramesh Nallapati, Mohit Bansal, Markus Dreyer
The paper discusses the challenges of abstractive summarization in opinion summarization due to the lack of large annotated datasets of reviews paired with reference summaries. To address this, the authors propose a few-shot method based on adapters that can easily store in-domain knowledge. Instead of fine-tuning the entire model, adapters are added and pre-trained in a task-specific way on a large corpus of unannotated customer reviews, using held-out reviews as pseudo summaries. The adapters are then fine-tuned on the small available human-annotated dataset. The authors show that this self-supervised adapter pre-training improves summary quality over standard fine-tuning. Additionally, for summary personalization, the authors condition on aspect keyword queries, automatically created from generic datasets. This results in better-organized summary content reflected in improved coherence and fewer redundancies.
Shweta Yadav, Cornelia Caragea
The paper discusses the challenges of creating large-scale datasets for abstractive document summarization in closed domains like healthcare, where human annotation requires domain expertise. The authors propose a data selection strategy that uses guided semantic-overlap and diversity-based objective functions to generate diverse and semantic questions in a low-resource setting. Their experiments on benchmark healthcare question summarization datasets show that their method achieves new state-of-the-art results and generates diverse, fluent, and informative summarized questions.
Liqiang Xiao, Lu Wang, Hao He, Yaohui Jin
The paper discusses the challenge of modeling content importance for summarization, which previous methods have struggled with due to their focus on word-level salience and lack of consideration for semantics and context. The authors propose a new approach that applies information theory to pretrained language models, allowing for a more comprehensive evaluation of importance that can be applied to different types of semantic units. Experiments on two datasets show that their method outperforms prior work in terms of F1 and ROUGE scores.
Yang Gao, Christian M. Meyer, Iryna Gurevych
The paper proposes a method for automatic document summarization that learns from users' preferences instead of using reference summaries. The method reduces sample complexity by leveraging active learning, preference learning, and reinforcement learning techniques through a new objective function. The authors conducted both simulation and real-user experiments, which showed that their method significantly advances the state of the art. The source code is available for free on GitHub.
Yichen Jiang, Mohit Bansal
The paper discusses the importance of a strong encoder in neural sequence-to-sequence summarization models and proposes a method to improve the encoder's memorization capabilities by adding an additional 'closed-book' decoder without attention and pointer mechanisms. This forces the encoder to be more selective in the information it encodes in its memory state, leading to improved performance on the CNN/Daily Mail dataset in terms of ROUGE and METEOR metrics, as well as human evaluation. The paper also presents several tests and ablations to demonstrate the effectiveness of the proposed method.
Kundan Krishna, Balaji Vasan Srinivasan
The paper proposes an attention-based RNN framework to generate multiple summaries of a single document that are tuned to different topics of interest. Existing summarization algorithms generate a single summary and cannot generate multiple summaries that are tailored to the interests of different readers. The proposed method outperforms existing baselines and suggests that generative networks can be successfully biased to look at sentences relevant to a topic and generate topic-tuned summaries.
Ziqiang Cao, Wenjie Li, Furu Wei, Sujian Li
The paper proposes a new approach to seq2seq summarization that uses existing summaries as soft templates to guide the model. The authors retrieve proper summaries as candidate templates using an IR platform and extend the seq2seq framework to conduct template reranking and template-aware summary generation. Experiments show that this approach significantly outperforms state-of-the-art methods and that the soft templates themselves are highly competitive. Importing high-quality external summaries also improves the stability and readability of generated summaries.
Junjie Li, Xuepeng Wang, Dawei Yin, Chengqing Zong
The paper proposes an Attribute-aware Sequence Network (ASN) for review summarization that takes into account users' characteristics such as gender, age, and occupation. The ASN includes three modules: an attribute encoder, an attribute-aware review encoder, and an attribute-aware summary decoder. The authors validate their model using a new dataset called TripAtt, which includes 495,440 attribute-review-summary triplets. The experiments show that ASN achieves state-of-the-art performance on review summarization in both auto-metric ROUGE and human evaluation.
Haoyu Zhang, Jingjing Cai, Jianjun Xu, Ji Wang
The paper proposes a new pretraining-based encoder-decoder framework for generating output sequences from input sequences in two stages. The encoder uses BERT to encode the input sequence into context representations, and in the first stage a Transformer-based decoder generates a draft output sequence. In the second stage, each word of the draft sequence is masked and fed to BERT, and the input sequence and draft representation generated by BERT are combined to predict the refined word for each masked position using a Transformer-based decoder. The authors describe this approach as the first to apply BERT to text generation tasks. The proposed method is evaluated on the text summarization task, achieving new state-of-the-art results on both CNN/Daily Mail and New York Times datasets.
Hayate Iso, Xiaolan Wang, Yoshihiko Suhara, Stefanos Angelidis, Wang-Chiew Tan
The paper discusses recent advances in text autoencoders and their ability to generate grammatically correct and consistent text from aggregated latent vectors. However, the commonly used simple average approach for vector aggregation can lead to overly generic summaries due to unexpected L2-norm shrinkage in the aggregated latent vectors, which the paper refers to as summary vector degeneration. To address this issue, the authors develop a framework called COOP, which searches input combinations for the latent vector aggregation using input-output word overlap. Experimental results show that COOP successfully alleviates the summary vector degeneration issue and establishes new state-of-the-art performance on two opinion summarization benchmarks. The code for COOP is available at https://github.com/megagonlabs/coop.
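The L2-norm shrinkage mentioned above can be illustrated numerically. The following is a minimal sketch, not the COOP implementation: it assumes the latent codes behave like independent standard-normal vectors (an assumption made only for illustration; the dimensions and names are hypothetical) and compares the norm of their simple average with the average norm of the inputs.

```python
import math
import random


def l2_norm(v):
    """Euclidean length of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def mean_vector(vectors):
    """Element-wise average of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


random.seed(0)
dim, num_reviews = 64, 8  # hypothetical sizes, chosen for illustration

# Assumption: each review's latent code is an independent N(0, I) sample.
latents = [[random.gauss(0.0, 1.0) for _ in range(dim)]
           for _ in range(num_reviews)]

avg_input_norm = sum(l2_norm(z) for z in latents) / num_reviews
norm_of_average = l2_norm(mean_vector(latents))

print(f"average norm of inputs: {avg_input_norm:.2f}")
print(f"norm of averaged code:  {norm_of_average:.2f}")
```

By the triangle inequality the averaged vector can never be longer than the average input norm, and for roughly independent directions it is much shorter, which is consistent with the shrinkage the summary describes.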
Arthur Bražinskas, Mirella Lapata, Ivan Titov
The paper discusses the challenges of opinion summarization and proposes a new approach that involves jointly learning to select informative subsets of reviews and summarizing the opinions expressed in these subsets. The authors collected a large dataset of summaries paired with user reviews for over 31,000 products, but the large number of reviews per product made summarization impractical. The authors use amortized variational inference and policy gradient methods for joint training and demonstrate the importance of selecting informative reviews resulting in improved quality of summaries and reduced hallucinations.
Hou Pong Chan, Wang Chen, Irwin King
The paper proposes a dual-view model that jointly improves review summarization and sentiment classification tasks. The model uses an encoder to learn a context representation for the review and a summary decoder to generate a review summary. Two sentiment classifiers are used to predict sentiment labels for the review and generated summary. An inconsistency loss is introduced during training to penalize disagreement between the two classifiers and help the decoder generate a summary with a consistent sentiment tendency. Experiment results on four real-world datasets demonstrate the effectiveness of the proposed model.
Xiyan Fu, Jun Wang, Jinghan Zhang, Jinmao Wei, Zhenglu Yang
The paper introduces a new approach, VHTM, that combines summarization with topic inference and merges topics into multiple granularity levels. This is in contrast to previous work that relied on pre-trained single-grained topic models. The approach is validated through comprehensive experiments, which demonstrate its superior performance compared to baselines.
Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, Paul Christiano
The paper discusses how language models are limited by the data and metrics used for a particular task, such as summarization models being trained to predict human reference summaries and evaluated using ROUGE. The authors propose training a model to optimize for human preferences, using a large dataset of human comparisons between summaries and reinforcement learning. They apply their method to a version of the TL;DR dataset of Reddit posts and find that their models significantly outperform both human reference summaries and larger models fine-tuned with supervised learning alone. The authors also conduct extensive analyses to understand their human feedback dataset and fine-tuned models and establish that their reward model generalizes to new datasets and results in better summaries than optimizing ROUGE according to humans. The paper aims to motivate machine learning researchers to pay closer attention to how their training loss affects the model behavior they actually want.
Isabel Cachola, Kyle Lo, Arman Cohan, Daniel S. Weld
The paper introduces TLDR generation, a new extreme summarization technique for scientific papers that involves compressing the source material and requires expert knowledge of the domain-specific language. To facilitate research on this task, the authors introduce SCITLDR, a dataset of 5.4K TLDRs over 3.2K papers that includes both author-written and expert-derived summaries. The authors propose CATTS, a learning strategy that uses titles as an auxiliary training signal to generate TLDRs. CATTS outperforms strong baselines under both automated metrics and human evaluations. The data and code for this research are publicly available at https://github.com/allenai/scitldr.
Sascha Rothe, Shashi Narayan
The paper discusses the effectiveness of using pre-trained checkpoints for Sequence Generation. The authors developed a Transformer-based sequence-to-sequence model that is compatible with pre-trained BERT, GPT-2, and RoBERTa checkpoints. They conducted an empirical study and found that initializing their model with these checkpoints resulted in new state-of-the-art results on Machine Translation, Text Summarization, Sentence Splitting, and Sentence Fusion. This demonstrates the potential of pre-training for Sequence Generation tasks.
Logan Lebanoff, Franck Dernoncourt, Doo Soon Kim, Lidan Wang, Walter Chang, Fei Liu
This paper explores the ability of Transformers to fuse sentences and proposes algorithms to enhance their ability to perform sentence fusion by leveraging the knowledge of points of correspondence between sentences. The authors conducted extensive experiments to investigate the effects of different design choices on Transformer's performance and found that modeling points of correspondence between sentences is crucial for effective sentence fusion. The ability to fuse sentences is important for summarization systems to produce succinct abstracts, but current summarizers can fail on fusing sentences, leading to few summary sentences or incorrect fusions that fail to retain the original meaning.
Rahul Aralikatte, Shashi Narayan, Joshua Maynez, Sascha Rothe, Ryan McDonald
The paper introduces a new method called Focus Attention Mechanism to help seq2seq decoders generate summaries that are similar or topical to the input document. They also propose a Focus Sampling method to enable the generation of diverse summaries. The evaluation on the BBC extreme summarization task shows that models augmented with Focus Attention generate summaries that are closer to the target and more faithful to their input documents, outperforming their vanilla counterparts on ROUGE and multiple faithfulness measures. The paper also demonstrates that Focus Sampling is more effective in generating diverse and faithful summaries than other decoding methods.
Shuyang Cao, Lu Wang
The paper discusses the importance of document structure for efficient information consumption, but notes that it is difficult to encode this structure into modern Transformer architecture. The authors present HIBRIDS, a model that incorporates hierarchical biases to better incorporate document structure into attention scores. They also introduce a new task, hierarchical question-summary generation, which involves summarizing content into a hierarchy of questions and summaries. The authors annotate a new dataset with over 6,000 question-summary hierarchies labeled on long government reports and show that their model produces better hierarchies than comparisons on both hierarchy quality and content coverage. The model also improves the generation of long-form summaries from government reports and Wikipedia articles, as measured by ROUGE scores.
Ziqiang Cao, Furu Wei, Sujian Li, Wenjie Li, Ming Zhou, Houfeng Wang
The paper introduces the concept of summary prior, which determines how much of a sentence should be included in a summary without considering its context. The authors propose a new summary system called PriorSum, which uses convolutional neural networks to capture summary prior features from length-variable phrases. The learned prior features are combined with document-dependent features for sentence ranking. Experiments on the DUC generic summarization benchmarks show that PriorSum outperforms existing methods and can identify different aspects supporting the summary prior.
Hongyan Xu, Hongtao Liu, Pengfei Jiao, Wenjun Wang
The paper proposes a novel transformer-based reasoning framework for personalized review summarization in E-commerce platforms. The quality of generated summaries is highly related to the characteristics of users and products, including their historical summaries. However, most previous works ignore the interaction between the input review and corresponding historical summaries. The proposed approach involves inter- and intra-attention in the encoder to learn the personalized representation of the input review and a memory-decoder attention module in the decoder to retrieve more useful information for the final summary generation. The approach outperforms many competitive baseline methods in generating more reasonable summaries for recommendation.
Yixin Liu, Zi-Yi Dou, Pengfei Liu
The paper presents a new framework called Refactor for text summarization and summary combination. The authors highlight the limitations of previous methods and perform a comprehensive evaluation involving twenty-two base systems, four datasets, and three different application scenarios. The Refactor model achieves new state-of-the-art results on the CNN/DailyMail dataset and addresses the limitations of traditional methods. The authors open-source all the code and provide a convenient interface for other researchers to use as an off-the-shelf tool to achieve further performance improvements.
Yang Liu, Sheng Shen, Mirella Lapata
The paper proposes a new method called self-knowledge distillation for text summarization that can improve the training process by using guidance from a teacher model and multiple noise signals to better model uncertainty. The proposed method achieves state-of-the-art results on three benchmarks for both pretrained and nonpretrained summarizers.
Hong Wang, Xin Wang, Wenhan Xiong, Mo Yu, Xiaoxiao Guo, Shiyu Chang, William Yang Wang
The paper proposes a new approach to improve extractive summarization by introducing three pre-training tasks that capture document-level context in a self-supervised manner. The proposed method is validated through experiments on the CNN/DM dataset, and the results show that a simple model with pre-training outperforms previous state-of-the-art models.
Luyang Huang, Shuyang Cao, Nikolaus Parulian, Heng Ji, Lu Wang
HEPOS is a new efficient encoder-decoder attention model that effectively identifies important information from a source document for summarization. The authors conducted a study of existing efficient self-attentions and combined them with HEPOS to process ten times more tokens than existing models that use full attentions. They also presented a new dataset, GOVREPORT, with longer documents and summaries, and showed that their models produced significantly higher ROUGE scores than competitive comparisons, including new state-of-the-art results on PubMed. Human evaluation also showed that their models generated more informative summaries with fewer unfaithful errors.
Reinald Kim Amplayo, Stefanos Angelidis, Mirella Lapata
This paper proposes a new approach for generating customized summaries based on aspect queries, such as describing the location and room of a hotel. The authors create a synthetic training dataset enriched with aspect controllers and fine-tune a pretrained model to generate aspect-specific summaries. Experiments show that their model outperforms previous state-of-the-art methods and can generate personalized summaries by controlling the number of aspects discussed.
Duy-Hung Nguyen, Nguyen Viet Dung Nghiem, Bao-Sinh Nguyen, Dung Tien Le, Shahab Sabahi, Minh-Tien Nguyen, Hung Le
The paper discusses the importance of incorporating human preferences in summarization models to align with human interests. It proposes a new framework for training summarization models with preference feedback in an interactive manner, leveraging offline data and a novel reward model to improve performance and sample efficiency. The experiments conducted on three datasets confirm the benefits of the proposed framework in active, few-shot, and online settings of preference learning.
Yanjun Gao, Timothy Miller, Dongfang Xu, Matthew M. Churpek, Majid Afshar
The paper proposes a new NLP task of generating a list of problems in a patient's daily care plan using input from provider's progress notes during hospitalization. The study investigates the performance of T5 and BART, two state-of-the-art seq2seq transformer architectures, in solving this problem. The evaluation methods include ROUGE, BERTScore, cosine similarity on sentence embedding, and F-score on medical concepts. The results show that T5 with domain adaptive pre-training achieves significant performance gains compared to a rule-based system and general domain pre-trained language models, indicating a promising direction for tackling the problem summarization task. The study provides a corpus built on top of progress notes from publicly available electronic health record progress notes in the Medical Information Mart for Intensive Care (MIMIC)-III.
Yumo Xu, Mirella Lapata
The paper discusses the development of neural models for creating generic summaries for single or multiple documents, driven by the availability of large-scale datasets. However, for query-focused summarization (QFS), labeled training data is not easily accessible. The authors propose a unified modeling framework for any type of summarization, assuming that all summaries are a response to a query, which is observed in QFS and latent in generic summarization. They model queries as discrete latent variables over document tokens and learn representations compatible with observed and unobserved query verbalizations. The framework formulates summarization as a generative process and optimizes a latent query model and a conditional language model. Despite learning from generic summarization data only, their approach outperforms strong comparison systems across benchmarks, query types, document settings, and target domains.
Mittul Singh, Arunav Mishra, Youssef Oualil, Klaus Berberich, Dietrich Klakow
The paper discusses the use of long-span language models (LMs) in unsupervised query-focused extractive summarization systems. The authors propose the use of Across Sentence Boundary LSTM-based LMs (ASBLSTM and biASBLSTM) that are specifically designed for this task. They conducted experiments on a real-world corpus with 100 Wikipedia event descriptions as queries and found that using the long-span models in an integer linear programming (ILP) formulation of MMR criterion was the most effective approach compared to several state-of-the-art baseline methods from the literature.
Ryuji Kano, Yasuhide Miura, Tomoki Taniguchi, Tomoko Ohkuma
The paper proposes an unsupervised extractive neural summarization model called Implicit Quote Extractor for conversational texts. The model aims to extract quoted sentences as summaries, even if they are not explicitly shown in replies. The training task of the model is to predict whether a reply candidate is a true reply to a post, and to do so, the model learns to extract sentences that replies frequently refer to. The model is evaluated on two email datasets and one social media dataset, and the results confirm that it is useful for extractive summarization. The paper also discusses whether quote extraction is an important factor for summarization and whether the model can capture salient sentences that conventional methods cannot.
Shusheng Xu, Xingxing Zhang, Yi Wu, Furu Wei, Ming Zhou
The paper discusses a new method for unsupervised extractive document summarization, which involves selecting important sentences from a document without using labeled summaries during training. The authors propose using transformer attentions to rank sentences, and pre-train a hierarchical transformer model using unlabeled documents only. They then use sentence-level self-attentions and pre-training objectives to rank sentences. Experiments on CNN/DailyMail and New York Times datasets show that their model achieves state-of-the-art performance on unsupervised summarization, and is less dependent on sentence positions. When combined with a recent unsupervised model explicitly modeling sentence positions, the results are even better.
Xinnian Liang, Shuangzhi Wu, Mu Li, Zhoujun Li
The paper discusses the problem of facet bias in unsupervised extractive summarization, where existing graph-based methods tend to select sentences within the same facet. To address this, the authors propose a facet-aware centrality-based ranking model that introduces a sentence-document weight to pay more attention to different facets. The method is evaluated on 8 benchmark datasets and consistently outperforms strong baselines, especially in long and multi-document scenarios. The performance gains are attributed to alleviating the facet bias problem.
The paper proposes a new approach to unsupervised extractive summarization using pointwise mutual information (PMI) between sentences to measure relevance and redundancy. The method involves a greedy sentence selection algorithm to maximize relevance and minimize redundancy of extracted sentences. The authors show that their method outperforms similarity-based methods on datasets in various domains, including news, medical journal articles, and personal anecdotes.
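A greedy selection loop of the kind described above might look like the sketch below. It is not the authors' implementation: the paper scores sentences with PMI, which requires a language model, so here the scoring functions are passed in as parameters, and `toy_relevance`, `toy_redundancy`, and the weight `lam` are hypothetical word-overlap stand-ins used only to make the example runnable.

```python
def greedy_select(sentences, relevance, redundancy, k, lam=1.0):
    """Pick up to k sentences, each time taking the one with the best
    relevance minus (lam times) its redundancy with sentences already chosen."""
    selected = []                      # indices of chosen sentences, in order
    candidates = list(range(len(sentences)))
    while candidates and len(selected) < k:
        def gain(i):
            penalty = sum(redundancy(sentences[i], sentences[j])
                          for j in selected)
            return relevance(sentences[i]) - lam * penalty
        best = max(candidates, key=gain)   # ties go to the earliest sentence
        selected.append(best)
        candidates.remove(best)
    return [sentences[i] for i in selected]


# Hypothetical stand-ins for PMI-based scores (plain word overlap).
DOC_KEYWORDS = {"battery", "screen", "price"}


def toy_relevance(sentence):
    return len(set(sentence.split()) & DOC_KEYWORDS)


def toy_redundancy(a, b):
    return len(set(a.split()) & set(b.split()))


sentences = [
    "battery lasts long",
    "battery lasts very long",
    "screen is bright",
]
summary = greedy_select(sentences, toy_relevance, toy_redundancy, k=2)
print(summary)
```

With these toy scores the second pick skips the near-duplicate battery sentence in favor of the screen sentence, which is the relevance-versus-redundancy trade-off the summary refers to.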
Somnath Basu Roy Chowdhury, Chao Zhao, Snigdha Chaturvedi
The paper presents a new method called Semantic Autoencoder (SemAE) for extractive opinion summarization in an unsupervised manner. SemAE uses dictionary learning to capture semantic information from reviews and learns a latent representation of each sentence over semantic units. The extractive summarization algorithm leverages these representations to identify representative opinions among hundreds of reviews. SemAE can also perform controllable summarization to generate aspect-specific summaries. The authors report strong performance on SPACE and AMAZON datasets and provide their code publicly.
Stefanos Angelidis, Mirella Lapata
The paper presents a neural framework for summarizing opinions from online product reviews. The framework is knowledge-lean and only requires light supervision in the form of product domain labels and user-provided ratings. The method combines two weakly supervised components to identify salient opinions and form extractive summaries from multiple reviews. The authors introduce an opinion summarization dataset that includes a training set of product reviews from six diverse domains and human-annotated development and test sets with gold standard aspect annotations, salience labels, and opinion summaries. Automatic evaluation shows significant improvements over baselines, and a large-scale study indicates that the opinion summaries generated by the framework are preferred by human judges according to multiple criteria.
Yue Dong, Andrei Mircea, Jackie C. K. Cheung
The paper proposes an unsupervised graph-based ranking model for summarizing long scientific documents. The method uses a two-level hierarchical graph representation of the document and asymmetrical positional cues to determine sentence importance. The approach outperforms strong unsupervised baselines in automatic metrics and human evaluation on the PubMed and arXiv datasets. It also achieves performance comparable to many state-of-the-art supervised approaches. The results suggest that patterns in the discourse structure are a strong signal for determining importance in scientific articles.
Gabriel Shenouda, Christophe Rodrigues, Aurélien Bossard
The paper introduces a new method called SummVD for automatic unsupervised extractive summarization. It uses singular value decomposition and word clustering to reduce the dimensionality of word embeddings and propose a representation of words on a small number of dimensions, each representing a hidden topic. This makes SummVD an efficient method for text summarization, outperforming recent extractive approaches. It requires low resources in terms of data and computing power, making it suitable for use in live summarization systems.
Tuba Gokhan, Phillip Smith, Mark Lee
The paper presents a new method for unsupervised extractive document summarization called Graph-Based Unsupervised Summarization (GUSUM). The method uses sentence embeddings and features to modify traditional graph ranking algorithms and compute sentence centrality. The approach aims to include the most important sentences while excluding those with similar meanings in the summary. The method is evaluated on several datasets and achieves high performance when evaluated both automatically and by humans.
Hao Zheng, Mirella Lapata
The paper discusses the development of an unsupervised approach for single document summarization, which utilizes a modified graph-based ranking algorithm. The algorithm incorporates BERT, a neural representation learning model, to capture sentential meaning, and builds graphs with directed edges to consider the relative position of nodes in a document. The approach was tested on three news summarization datasets and outperformed strong baselines by a significant margin. The authors argue that this approach is more realistic than relying on large-scale and high-quality training data for different types of summaries, domains, or languages.
Raphael Schumann, Lili Mou, Yao Lu, Olga Vechtomova, Katja Markert
The paper discusses the process of automatic sentence summarization, which involves creating a shorter version of a sentence while retaining its most important information. The authors propose an unsupervised objective function that considers language fluency and semantic similarity metrics to find a high-scoring summary through discrete optimization. Their method achieves a new state-of-the-art for unsupervised sentence summarization according to ROUGE scores. The authors also highlight the sensitivity of the commonly reported ROUGE F1 metric to summary length and suggest that future evaluation should group summarization systems by output length brackets.
Han Xu, Eric Martin, Ashesh Mahidadia
The paper presents a statistical framework for summarizing scientific papers by extracting information-rich citation sentences that capture the main contributions of the paper. The framework involves two stages, where salient keywords are automatically discovered in the first stage and citation sentences that best capture the paper's main contributions are identified in the second stage. The approach outperforms current state-of-the-art systems in scientific paper summarization using methods rooted in quantitative statistics and information theory.
Daraksha Parveen, Michael Strube
The paper proposes a graph-based method for extractive single-document summarization that considers importance, non-redundancy, and local coherence simultaneously. The method uses a bipartite graph consisting of sentence and entity nodes to rank sentences based on importance and ensure non-redundancy and local coherence of the summary. The method is applied to scientific articles from the journal PLOS Medicine and achieves better results than other systems on this data. The method also achieves state-of-the-art results on DUC 2002 data, and incorporating the local coherence measure always achieves the best results. Human judgments are used to evaluate the coherence of the summaries.
Daraksha Parveen, Hans-Martin Ramsl, Michael Strube
The paper presents an approach for extractive single-document summarization using a weighted graphical representation of documents obtained by topic modeling. The approach optimizes importance, coherence, and non-redundancy simultaneously using ILP. The system's performance is compared with state-of-the-art results on scientific articles from PLOS Medicine and on DUC 2002 data using ROUGE scores. Human judges evaluate the coherence of summaries generated by the system in comparison to two baselines, and the approach obtains competitive performance.
Zhongyu Wei, Wei Gao
The paper explores using tweets linking to news for generating extractive summaries of documents. By regarding every tweet as a vote for candidate sentences, they use unsupervised summarization models to rank candidate extracts via random walk on a heterogeneous graph. They can use the linking tweets to opportunistically "supervise" the summarization with no need for reference summaries. The influence of the volume and latency of tweets on the quality of output summaries is analyzed. Compared to truly supervised summarizers unaware of tweets, their method achieves significantly better results with a reasonably small tradeoff on latency. Compared to the same using tweets as auxiliary features, their method is comparable while needing fewer tweets and much shorter time to achieve significant outperformance.
Pengjie Ren, Furu Wei, Zhumin Chen, Jun Ma, Ming Zhou
The paper proposes a new approach to extractive summarization that models sentence importance and redundancy simultaneously by evaluating the relative importance of a sentence given a set of selected sentences. The proposed method uses a new framework to conduct regression with respect to the relative gain of a sentence calculated by the ROUGE metric and incorporates additional features derived from sentence relations. Experiments on multi-document summarization datasets show that the proposed method outperforms state-of-the-art extractive summarization approaches.
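As a rough illustration of the relative-gain idea, the sketch below computes how much a candidate sentence adds to ROUGE-1 recall given the sentences already selected. It is a simplification under stated assumptions: the paper trains a regressor to predict this gain, whereas here the gain is computed directly against a reference, and the names `rouge1_recall` and `relative_gain` as well as the toy sentences are hypothetical.

```python
from collections import Counter
from typing import List


def rouge1_recall(summary_tokens: List[str], reference_tokens: List[str]) -> float:
    """Clipped unigram recall of the summary against the reference."""
    ref_counts = Counter(reference_tokens)
    if not ref_counts:
        return 0.0
    sum_counts = Counter(summary_tokens)
    # Each reference token can be matched at most as often as it occurs.
    overlap = sum(min(count, sum_counts[tok]) for tok, count in ref_counts.items())
    return overlap / sum(ref_counts.values())


def relative_gain(candidate: List[str],
                  selected: List[List[str]],
                  reference: List[str]) -> float:
    """Increase in ROUGE-1 recall from adding `candidate` to `selected`."""
    selected_tokens = [tok for sent in selected for tok in sent]
    before = rouge1_recall(selected_tokens, reference)
    after = rouge1_recall(selected_tokens + candidate, reference)
    return after - before


# Toy data (assumed for illustration only).
reference = "the cat sat on the mat".split()
s1 = "the cat sat".split()
s2 = "a cat sat down".split()   # redundant with s1
s3 = "on the mat".split()       # covers what s1 misses

gain_redundant = relative_gain(s2, [s1], reference)
gain_novel = relative_gain(s3, [s1], reference)
```

Scoring a sentence relative to what is already selected is what lets importance and redundancy be handled together: `s2` is relevant on its own but adds nothing once `s1` is in the summary, so its gain is zero.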
Antoine J.-P. Tixier, Polykarpos Meladianos, Michalis Vazirgiannis
The paper presents an unsupervised text summarization system that uses a submodularity framework to generate summaries in a greedy way while maintaining high performance. The system includes a novel coverage reward term that assigns scores to words based on the graph-of-words representation of text and the k-core decomposition algorithm. The system was evaluated on three datasets and achieved state-of-the-art performance, particularly in the meeting domain.
Tsutomu Hirao, Masaaki Nishino, Jun Suzuki, Masaaki Nagata
The paper proposes an Integer Linear Programming formulation to obtain extractive oracle summaries in terms of ROUGE-n and an algorithm that enumerates all of the oracle summaries for a set of reference summaries to evaluate system summaries. The experimental results show that there is room for improvement in extractive summarization and that F-measures derived from the enumerated oracle summaries have stronger correlations with human judgment than those derived from single oracle summaries.
Sansiri Tarnpradab, Fei Liu, Kien A. Hua
This paper discusses the task of forum thread summarization, which has not been extensively studied. The authors propose a model that uses hierarchical attention networks and neural attention mechanisms to build sentence and thread representations for summarization. The results show that their approach outperforms other methods and that removing redundancies is important for achieving the best results.
Masaru Isonuma, Toru Fujino, Junichiro Mori, Yutaka Matsuo, Ichiro Sakata
The paper proposes a framework for automatic document summarization that extracts sentences using externally related information. The focus is on single document summarization using small amounts of reference summaries, and the framework uses multitask learning with curriculum learning for sentence extraction and document classification. The proposed method is evaluated on financial report and news corpus datasets, and the results show comparable performance to state-of-the-art systems.
Ed Collins, Isabelle Augenstein, Sebastian Riedel
The paper discusses the challenges of summarizing large, complex scientific publications using neural approaches, which require large datasets. The authors introduce a new dataset for summarization of computer science publications and develop models using both neural sentence encoding and traditional summarization features. They find that models that encode sentences and their local and global context perform best, outperforming established baseline methods.
Ramesh Nallapati, Feifei Zhai, Bowen Zhou
The paper presents SummaRuNNer, a Recurrent Neural Network (RNN) based model for extractive summarization of documents. The model achieves performance better than or comparable to state-of-the-art and is very interpretable, allowing visualization of its predictions broken up by abstract features such as information content, salience, and novelty. The paper also introduces abstractive training of the extractive model, which can train on human-generated reference summaries alone, eliminating the need for sentence-level extraction.
Abhishek Kumar Singh, Manish Gupta, Vasudeva Varma
The paper discusses the problem of extractive text summarization and the limitations of conventional approaches that rely on manually compiled features. The authors propose a data-driven system called Hybrid MemNet, which uses an end-to-end deep network to learn a continuous unified representation of a document and generate its summary. The system captures both local and global sentential information and identifies summary-worthy sentences. Experimental results on two corpora show significant performance gains compared to state-of-the-art baselines.
Aishwarya Jadhav, Vaibhav Rajan
SWAP-NET is a new neural sequence-to-sequence model for extractive summarization that identifies both salient sentences and key words in an input document, and then combines them to form the extractive summary. The model uses a new two-level pointer network based architecture that models the interaction of key words and salient sentences. Experiments on large scale benchmark corpora demonstrate that SWAP-NET outperforms state-of-the-art extractive summarizers.
Kristjan Arumae, Fei Liu
The paper proposes a new training method for extractive summarization using Cloze-style comprehension questions instead of human abstracts, which are often inaccurate due to difficulty aligning them with source documents. The method encourages system summaries to preserve important source content and share common words with the abstracts, and uses reinforcement learning with a question-focused reward function to promote concise, fluent, and informative summaries. Experiments show that the proposed method is effective and outperforms state-of-the-art systems on standard summarization datasets.
Chong Feng, Fei Cai, Honghui Chen, Maarten de Rijke
The paper proposes an attentive encoder-based summarization (AES) model for generating article summaries that considers both the global information of a document and the relationships of sentences in the document. The model uses both unidirectional and bidirectional recurrent neural networks (RNNs) to construct encoders, resulting in unidirectional attentive encoder-based summarization (Uni-AES) and bidirectional attentive encoder-based summarization (Bi-AES). The experimental results show that Bi-AES outperforms Uni-AES and achieves substantial improvements over a relevant baseline.
Shashi Narayan, Shay B. Cohen, Mirella Lapata
This paper proposes a new algorithm for single document summarization, which is the task of creating a shorter version of a document while retaining its main information. The algorithm is based on a sentence ranking task and uses a reinforcement learning objective to optimize the ROUGE evaluation metric. The authors trained a neural summarization model using this algorithm on the CNN and DailyMail datasets and found that it outperformed existing extractive and abstractive systems in both automatic and human evaluations.
Jonathan Pilault, Raymond Li, Sandeep Subramanian, Christopher Pal
The paper presents a method for producing abstractive summaries of long documents using neural abstractive summarization. The method involves performing a simple extractive step before generating a summary, which is then used to condition the transformer language model on relevant information. The approach produces more abstractive summaries compared to prior work that employs a copy mechanism, while still achieving higher ROUGE scores. The authors provide extensive comparisons with strong baseline methods and multiple variants of their approach, using four different summarization tasks and datasets. They find that transformer-based methods produce summaries with fewer n-gram copies, leading to n-gram copying statistics that are more similar to human-generated abstracts. A human evaluation shows that transformers are ranked highly for coherence and fluency, but purely extractive methods score higher for informativeness and relevance. The authors hope that their architectures and experiments may serve as strong points of comparison for future work.
Elaheh ShafieiBavani, Mohammad Ebrahimi, Raymond Wong, Fang Chen
The paper presents a new approach for evaluating the quality of summaries without the need for human model summaries. The approach uses word embeddings to develop features that reflect coverage, diversity, informativeness, and coherence of summaries. These features are then used to train a learning model for predicting summary content quality. The proposed metric was evaluated on data from query-focused and update summarization tasks in TAC 2008 and 2009, and the results show that the feature combination provides reliable estimates of summary content quality when model summaries are not available.
Yue Dong, Yikang Shen, Eric Crawford, Herke van Hoof, Jackie C.K. Cheung
The paper proposes a new method called BANDITSUM for training neural networks to perform single-document extractive summarization without heuristically-generated extractive labels. The approach treats extractive summarization as a contextual bandit problem, where the model chooses a sequence of sentences to include in the summary based on the document context. A policy gradient reinforcement learning algorithm is used to train the model to select sequences of sentences that maximize ROUGE score. The experiments show that BANDITSUM achieves ROUGE scores better than or comparable to state-of-the-art approaches and converges using fewer update steps. Additionally, BANDITSUM performs significantly better than competing approaches when good summary sentences appear late in the source document.
Ryuji Kano, Yasuhide Miura, Motoki Taniguchi, Yan-Ying Chen, Francine Chen, Tomoko Ohkuma
The paper discusses using popularity measures in social media as a way to summarize online conversations. They propose a Disjunctive model that separates the contribution of content and context in determining popularity. They evaluate their model using a dataset where the informativeness of comments is annotated and show that their model outperforms baseline models that use popularity as a measure of informativeness.
Xingxing Zhang, Mirella Lapata, Furu Wei, Ming Zhou
The paper proposes a new approach to extractive summarization that uses a latent variable model where sentences are viewed as latent variables. This approach avoids the need for heuristically created sentence-level labels, which may be suboptimal. Instead, sentences with activated variables are used to infer gold summaries, and the loss during training comes directly from these summaries. The model was tested on the CNN/Dailymail dataset and was found to outperform a strong extractive baseline trained on heuristically approximated labels and perform competitively with several recent models.
Jiaxin Shi, Chen Liang, Lei Hou, Juanzi Li, Zhiyuan Liu, Hanwang Zhang
DeepChannel is a neural model for extractive document summarization that uses a salience score to represent the importance of sentences in a document. The salience score is estimated using an attention-based deep neural network, and the model uses a contrastive training strategy to learn the salience estimation network. The most salient sentences are iteratively extracted from the document to generate a summary. The model achieves state-of-the-art ROUGE scores on the CNN/Daily Mail dataset and shows strong robustness in out-of-domain tests. It also demonstrates tremendous data efficiency, achieving a high ROUGE-1 F-1 score with only 1/100 of the training set.
Yang Gao, Christian Meyer, Mohsen Mesgar, Iryna Gurevych
The paper proposes a new approach to document summarization using Reinforcement Learning (RL) algorithms. The approach, called RELIS, learns a reward function with Learning-to-Rank (L2R) algorithms at training time and uses this reward function to train an input-specific RL policy at test time. This approach reduces training time by two orders of magnitude compared to state-of-the-art models while performing on par with them. The authors prove that RELIS guarantees to generate near-optimal summaries with appropriate L2R and RL algorithms. The approach is evaluated on extractive multi-document summarization.
The paper discusses recent neural network approaches to summarization, which are either selection-based extraction or generation-based abstraction. The authors present a neural model for single-document summarization that combines extraction and syntactic compression. The model selects sentences from the document, identifies possible compressions based on constituency parses, and scores those compressions with a neural model to produce the final summary. The authors construct oracle extractive-compressive summaries for learning and achieve strong performance on the CNN/Daily Mail and New York Times datasets, outperforming an off-the-shelf compression module. Human and manual evaluation shows that the model's output generally remains grammatical.
Zhengyuan Liu, Nancy F. Chen
The paper proposes using discourse-level segmentation to improve extractive summarization, as it can more precisely identify the core content in a document compared to using sentences as the elementary unit. The authors investigate the effectiveness of this approach using two basic neural network architectures and a deep bi-directional transformer, and achieve state-of-the-art performance when combining discourse-level segmentation with their adapted contextual representation model on the CNN/Daily Mail dataset.
Wen Xiao, Giuseppe Carenini
The paper proposes a new neural summarization model for long documents that considers both global and local context. The model outperforms previous work on two scientific paper datasets and shows that its benefits increase with longer documents. Surprisingly, the study finds that the benefits of the model come mainly from modeling the local context, even for the longest documents.
Ruipeng Jia, Yanan Cao, Haichao Shi, Fang Fang, Yanbing Liu, Jianlong Tan
DistilSum is a new approach to extractive summarization that uses a teacher mechanism and student model to produce high entropy soft targets at a high temperature. The student model is trained to match these targets and then tested with a temperature of 1 to distill for ground-truth labels. Compared to the current best extractive classifier, BERTSUMEXT, DistilSum achieves a substantial improvement in both text similarity and performance of the classifier on the CNN/DM dataset. The source code for DistilSum will be available on Github.
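To make the role of temperature concrete, the sketch below shows the generic temperature-scaled softmax used in knowledge distillation: dividing logits by a temperature above 1 flattens the distribution and raises its entropy. This is an assumed, generic formulation rather than DistilSum's exact teacher mechanism, and the function names and logits are hypothetical.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Softmax over logits divided by `temperature` (must be > 0)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: List[float]) -> float:
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical teacher logits for three candidate labels.
teacher_logits = [4.0, 1.0, 0.5]

hard_like = softmax_with_temperature(teacher_logits, temperature=1.0)
soft_targets = softmax_with_temperature(teacher_logits, temperature=5.0)
```

At temperature 1 almost all mass sits on the first entry, while at temperature 5 the other entries receive noticeable probability, which is the extra signal a student model can learn from.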
Matt Grenander, Yue Dong, Jackie C. K. Cheung, Annie Louis
The paper discusses how sentence position is a strong feature for news summarization, but recent neural systems excessively exploit this trend, which can be detrimental when summarizing documents where important content is in later parts of the article. The authors propose two techniques to make systems sensitive to the importance of content in different parts of the article: pretraining the model with randomly shuffled sentences and using an auxiliary ROUGE-based loss. These techniques significantly improve the performance of a reinforcement learning-based extractive system, with the auxiliary loss being more powerful than pretraining.
Ling Luo, Xiang Ao, Yan Song, Feiyang Pan, Min Yang, Qing He
The paper proposes a new approach to extractive text summarization for long documents by simulating the two-stage process of human summarization. The approach uses a convolutional neural network to encode the gist of paragraphs for rough reading and a decision-making policy with an adapted termination mechanism for careful reading. The problem is formulated as a contextual bandit problem and solved with policy gradient. Experiments on the CNN and DailyMail datasets show that the proposed method provides high-quality summaries with varied length and outperforms state-of-the-art extractive methods in terms of ROUGE metrics.
Kristjan Arumae, Fei Liu
The paper discusses the challenge of developing a supervised summarization system due to the lack of ground-truth data. The authors propose a novel framework that uses question-answering rewards to guide the system in producing informative and fluent summaries that perform well on question-answering tasks. The system learns from human abstracts and aims to produce summaries that can answer important questions. The results show that the proposed framework outperforms strong summarization baselines as evaluated by automatic metrics and human assessors.
Léo Bouscarrat, Antoine Bonnefoy, Thomas Peel, Cécile Pereira
The paper introduces STRASS, an extractive text summarization method that selects sentences with the closest embeddings to the document embedding. The model learns a transformation of the document embedding to maximize the similarity between the extractive summary and the ground truth summary. The training is inexpensive and can be done on CPU, and the inference time is short and linear. The paper also introduces the French CASS dataset and shows that the method performs similarly to state-of-the-art extractive methods with efficient training and inference times.
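The selection step described above can be sketched as follows: keep the sentences whose embeddings are sufficiently close to the document embedding. This is a minimal sketch under assumptions, not the authors' implementation: it uses cosine similarity with a fixed threshold, omits the learned transformation of the document embedding, and the names `cosine` and `select_close_sentences` and the 2-dimensional vectors are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_close_sentences(sentence_embeddings: List[Sequence[float]],
                           document_embedding: Sequence[float],
                           threshold: float = 0.8) -> List[int]:
    """Indices (in document order) of sentences close to the document embedding."""
    return [
        i for i, emb in enumerate(sentence_embeddings)
        if cosine(emb, document_embedding) >= threshold
    ]


# Toy embeddings (assumed for illustration only).
document_embedding = [1.0, 0.0]
sentence_embeddings = [[0.9, 0.1], [0.0, 1.0], [1.0, 1.0]]

strict = select_close_sentences(sentence_embeddings, document_embedding, threshold=0.8)
loose = select_close_sentences(sentence_embeddings, document_embedding, threshold=0.7)
```

Each sentence is scored once against a single vector, so the cost grows linearly with the number of sentences.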
Ye Liu, Jian-Guo Zhang, Yao Wan, Congying Xia, Lifang He, Philip S. Yu
The paper proposes a new approach for extractive summarization called HETFORMER, which is based on a Transformer-based pre-trained model with multi-granularity sparse attentions. The approach models different types of semantic nodes in raw text as a potential heterogeneous graph and directly learns heterogeneous relationships among nodes by Transformer. The experiments show that HETFORMER achieves state-of-the-art performance in Rouge F1 while using less memory and fewer parameters compared to existing methods that use GNNs with pre-trained models.
Ming Zhong, Pengfei Liu, Yiran Chen, Danqing Wang, Xipeng Qiu, Xuanjing Huang
The paper proposes a new approach to building neural extractive summarization systems by formulating the task as a semantic text matching problem. This paradigm shift is based on a comprehensive analysis of the gap between sentence-level and summary-level extractors. The authors demonstrate the effectiveness of the matching framework by achieving state-of-the-art results on the CNN/DailyMail dataset and five other datasets. They also release their codes, processed dataset, and generated summaries to encourage further research in this area.
Yang Deng, Wenxuan Zhang, Yaliang Li, Min Yang, Wai Lam, Ying Shen
The paper discusses the challenges of answer summarization in non-factoid question answering and proposes a unified model that integrates hierarchical and sequential context modeling for question-driven extractive answer summarization. The model uses a hierarchical compare-aggregate method to integrate the interaction between QA pairs in both word-level and sentence-level into the final question and answer representations. The question-aware sequential extractor is then used to produce a summary for the lengthy answer. The experimental results show that the proposed method achieves superior performance on WikiHowQA and PubMedQA.
Fangfang Zhang, Jin-ge Yao, Rui Yan
The paper discusses modern neural document summarization systems that aim to produce abstractive summaries. The authors conducted a study to verify the degree of abstractiveness of these systems and found that many tend to be near-extractive in practice. They also implemented a pure copy system that achieved comparable results while being more computationally efficient. The authors suggest that future efforts should focus on developing more efficient systems that can better utilize the vocabulary in the original document.
The paper analyzes current evaluation methodologies for summarization metrics and identifies concerns such as the absence of methods for testing improvements over a baseline and the omission of important components of human assessment. The authors propose an evaluation methodology that overcomes these challenges and reveals which metric variants outperform others. They also find that the machine translation metric BLEU performs similarly to ROUGE for evaluating summarization systems. The authors replicate a recent evaluation that relied on suboptimal ROUGE variants and find different conclusions about the relative performance of state-of-the-art summarization systems.
Danqing Wang, Pengfei Liu, Yining Zheng, Xipeng Qiu, Xuanjing Huang
The paper presents a new approach called HETERSUMGRAPH for extractive document summarization. It uses a graph-based neural network that includes semantic nodes of different granularity levels, which act as intermediaries between sentences and enrich cross-sentence relations. The graph structure is flexible and can be extended from a single-document setting to multi-document by introducing document nodes. The authors claim to be the first to introduce different types of nodes into graph-based neural networks for extractive document summarization and have performed a comprehensive qualitative analysis to investigate their benefits. The code for HETERSUMGRAPH will be released on Github.
Jiacheng Xu, Zhe Gan, Yu Cheng, Jingjing Liu
The paper introduces a new neural summarization model called DISCOBERT, which addresses issues with sentence-based extractive models and the limitations of BERT in capturing long-range dependencies in documents. DISCOBERT extracts sub-sentential discourse units and constructs structural discourse graphs to capture long-range dependencies, which are encoded with Graph Convolutional Networks. The proposed model outperforms state-of-the-art methods, including other BERT-base models, on popular summarization benchmarks.
Ruifeng Yuan, Zili Wang, Wenjie Li
The paper proposes a new approach to extractive summarization that focuses on fact-level semantic units rather than individual sentences. The model uses a hierarchical structure to incorporate multiple levels of textual information and is combined with BERT using a hierarchical graph mask to improve natural language understanding. The experiments on the CNN/DailyMail dataset show that the proposed model achieves state-of-the-art results.
Ruipeng Jia, Yanan Cao, Hengzhu Tang, Fang Fang, Cong Cao, Shi Wang
The paper discusses the challenges of sentence-level extractive text summarization, particularly in modeling redundancy between extracted sentences. The authors propose a new approach called HAHSum, which uses a hierarchical attentive heterogeneous graph to model different levels of information and spotlight redundancy dependencies between sentences. The approach iteratively refines sentence representations with a redundancy-aware graph and delivers label dependencies by message passing. Experiments on large-scale benchmark corpora demonstrate that HAHSum outperforms previous extractive summarizers.
Zhengyuan Liu, Ke Shi, Nancy F. Chen
The paper discusses the challenges of text summarization in the news domain, where neural models easily overfit due to the inverted pyramid writing style and the need to generate a variety of summaries for different users. The authors propose a neural framework that can flexibly control summary generation by introducing subaspect functions (importance, diversity, position) regulated by control codes. They demonstrate that extracted summaries with minimal position bias are comparable to those generated by standard models that take advantage of position preference, and that news summaries generated with a focus on diversity can be more preferred by human raters. The authors suggest that a more flexible neural summarization framework providing more control options could be desirable in tailoring to different user preferences.
Shashi Narayan, Joshua Maynez, Jakub Adamek, Daniele Pighin, Blaž Bratanic, Ryan McDonald
The paper proposes encoder-centric stepwise models for extractive summarization using structured transformers - HiBERT and Extended Transformers. The models enable stepwise summarization by injecting the previously generated summary into the structured transformer as an auxiliary sub-structure. The models are efficient in modeling the structure of long inputs and do not rely on task-specific redundancy-aware modeling, making them a general purpose extractive content planner for different tasks. The stepwise models achieve state-of-the-art performance in terms of Rouge without any redundancy aware modeling or sentence filtering in CNN/DailyMail extractive summarization and Rotowire table-to-text generation. Amongst the two structured transformers tested, stepwise Extended Transformers provides the best performance across both datasets and sets a new standard for these challenges.
Peng Cui, Le Hu, Yuanchao Liu
The paper proposes a new approach to extractive text summarization that addresses the limitations of existing models in capturing intersentence relationships and topical information. The proposed model uses a graph neural network to efficiently represent the document structure and a joint neural topic model to discover latent topics for sentence selection. The experimental results show that the proposed model outperforms existing approaches on both short and long document datasets, demonstrating its robustness in different document genres and lengths. The model's effectiveness in long document summarization is attributed to its ability to preselect salient contents using topical information.
Yash Agrawal, Vivek Anand, Manish Gupta, S Arunachalam, Vasudeva Varma
The paper discusses the importance of extractive summarization of financial reports filed by companies, which impact their stock prices. The lack of in-domain labeled summarization data is a major obstacle to train finance-specific summarization models. The paper proposes a goal-directed approach to modeling 10-K report summarization, leveraging summaries with labeled goal-related data for stock buy/sell classification. The paper also considers a multi-task learning method with an industry classification auxiliary task to provide improvements. The proposed method significantly outperforms strong baselines in intrinsic and extrinsic evaluations for stock buy/sell classification and portfolio construction tasks.
Ruipeng Jia, Yanan Cao, Fang Fang, Yuchen Zhou, Zheng Fang, Yanbing Liu, Shi Wang
The paper discusses the issue of imbalanced sentence classification in extractive summarization, which cannot be easily addressed by data sampling or augmentation algorithms. To solve this problem, the authors propose a deep differential amplifier framework that calculates and amplifies the semantic difference between each sentence and other sentences, and applies a residual unit to deepen the architecture. The model pays more attention to the pivotal information of one sentence, which is different from previous approaches that model all informative context in the source document. Experimental results show that the proposed summarizer performs competitively against state-of-the-art methods. The source code will be available on Github.
Jad Kabbara, Jackie Chi Kit Cheung
The paper discusses the limitations of extractive summarization and proposes a postediting step that focuses on the definiteness of noun phrases to improve the coherence and readability of extractive summaries. The proposed system was evaluated through human and automatic evaluation studies, which showed that the system generated improved summaries. The authors also noted that the system relied on local cues rather than pragmatic reasoning to make decisions.
Stefanos Angelidis, Reinald Kim Amplayo, Yoshihiko Suhara, Xiaolan Wang, Mirella Lapata
The paper presents the Quantized Transformer, an unsupervised system for extractive opinion summarization that uses a clustering interpretation of the quantized space and a novel extraction algorithm to discover popular opinions among hundreds of reviews. The system also enables controllable summarization without further training by utilizing properties of the quantized space to extract aspect-specific summaries. The authors also introduce SPACE, a large-scale evaluation benchmark for opinion summarizers, and demonstrate the promise of their approach through experiments and human studies.
Marina Litvak, Natalia Vanetik
The paper introduces a new approach for automated text summarization using the Minimum Description Length principle and the Krimp dataset compression algorithm. The approach represents a text as a transactional dataset and describes it using frequent sequences of words. The summary is compiled from sentences that compress the document, with the problem of summarization reduced to maximal coverage. The approach is evaluated using a greedy algorithm and the results are presented.
Xingxing Zhang, Furu Wei, Ming Zhou
The paper proposes a new model called HIBERT for document encoding in neural extractive summarization models. It pre-trains the model using unlabeled data and applies it to the summarization model, resulting in better performance compared to randomly initialized models. The proposed model achieves state-of-the-art performance on the CNN/Dailymail and New York Times datasets.
Jingun Kwon, Naoki Kobayashi, Hidetaka Kamigaito, Manabu Okumura
The paper proposes a new model called NeRoBERTa for sentence extractive summarization, which uses nested tree structures consisting of syntactic and discourse trees to improve coherence and informativeness of the summary. The model outperforms baseline models in ROUGE and achieves comparable scores to state-of-the-art models in human evaluation. The paper highlights the difficulty of using pre-trained BERT-based encoders for this task and suggests the use of nested tree structures for better performance.
Yash Gupta, Pawan Sasanka Ammanamanchi, Shikha Bordia, Arjun Manoharan, Deepak Mittal, Ramakanth Pasunuru, Manish Shrivastava, Maneesh Singh, Mohit Bansal, Preethi Jyothi
The paper explores the impact of pretraining on a BERT-based extractive summarization system for scientific documents. The authors found that an intermediate pretraining step using existing summarization datasets improved performance and achieved state-of-the-art results on a scientific summarization dataset. They also analyzed the effects of varying the size and domain of the pretraining corpus, changing the length of the input sequence, and varying target tasks. Additionally, they investigated how intermediate pretraining interacts with contextualized word embeddings trained on different domains.
Yin Jou Huang, Sadao Kurohashi
The paper proposes a model for extractive summarization that incorporates both discourse and coreference relations. The model uses a heterogeneous graph containing three types of nodes, each corresponding to text spans of different granularity. Experimental results on a benchmark summarization dataset show that the proposed method is effective.
The paper proposes a new approach to extractive summarization of long-form documents using a sliding selector network with dynamic memory. This approach addresses the issue of loss of summary-relevant contents due to the length limitation of text encoder in neural-based summarization models. The sliding window extracts summary sentences segment by segment and the memory mechanism preserves and updates history information dynamically, allowing semantic flow across different windows. Experimental results on two large-scale datasets of scientific papers show that this model outperforms previous state-of-the-art models. Qualitative and quantitative investigations are also performed to understand how the model works and where the performance gain comes from.
Ruipeng Jia, Yanan Cao, Haichao Shi, Fang Fang, Pengfei Yin, Shi Wang
The paper proposes a non-autoregressive method for extractive summarization called ThresSum, which extracts a non-fixed number of summary sentences without sorting them by predicted probabilities. Instead, ThresSum picks sentences individually from the source document when the predicted probabilities exceed a threshold. During training, the model enhances sentence representation through iterative refinement and weak supervision with soft labels generated progressively by adjusting the temperature with a knowledge distillation algorithm. ThresSum outperforms BERTSUMEXT with a substantial improvement of 0.74 ROUGE-1 score on CNN/DM dataset.
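The difference between threshold-based and rank-based selection can be sketched in a few lines. This is an illustrative sketch only: the probabilities, the 0.5 threshold, and the function names `select_by_threshold` and `select_top_k` are assumptions, and the model that produces the probabilities is not shown.

```python
from typing import List


def select_by_threshold(probs: List[float], threshold: float = 0.5) -> List[int]:
    """Pick every sentence whose probability exceeds the threshold.

    Each sentence is decided independently, so the summary length is not fixed.
    """
    return [i for i, p in enumerate(probs) if p > threshold]


def select_top_k(probs: List[float], k: int = 3) -> List[int]:
    """Conventional alternative: sort by probability and keep the k best."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return sorted(ranked[:k])  # restore document order


# Hypothetical per-sentence probabilities from an extractive classifier.
probs = [0.91, 0.12, 0.64, 0.08, 0.55, 0.49]

thresholded = select_by_threshold(probs, threshold=0.5)
top_two = select_top_k(probs, k=2)
```

With thresholding, a document with no confident sentences yields an empty selection and a document with many yields a longer summary, whereas top-k always returns exactly k sentences.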
Baoyu Jing, Zeyu You, Tao Yang, Wei Fan, Hanghang Tong
The paper proposes a new approach to extractive text summarization, which involves extracting the most representative sentences from a given document. The authors note that sentence embedding is important for creating a good summary, and that recent studies have used graph neural networks to capture inter-sentential relationships. However, these approaches do not consider multiple types of inter-sentential relationships or intra-sentential relationships. To address these issues, the authors propose a Multiplex Graph Convolutional Network (MultiGCN) to model different types of relationships among sentences and words. They then use this approach to create a Multiplex Graph Summarization (Multi-GraS) model for extractive text summarization. The authors evaluate their approach on the CNN/DailyMail benchmark dataset and demonstrate its effectiveness.
Peggy Tang, Kun Hu, Rui Yan, Lei Zhang, Junbin Gao, Zhiyong Wang
The paper is written by a group of researchers from various institutions, including the University of Sydney and Renmin University of China. The abstract does not provide a clear indication of the topic or focus of the paper, but it does list the authors' affiliations and contact information. Further analysis of the full paper would be necessary to understand its content and purpose.
MemSum is a reinforcement-learning-based extractive summarizer that considers the text content of the sentence, the global text context of the rest of the document, and the extraction history consisting of the set of sentences that have already been extracted. It obtains state-of-the-art test-set performance in summarizing long documents taken from PubMed, arXiv, and GovReport. Ablation studies demonstrate the importance of local, global, and history information. A human evaluation confirms the high quality and low redundancy of the generated summaries, stemming from MemSum’s awareness of extraction history.
Faisal Ladhak, Esin Durmus, He He, Claire Cardie, Kathleen McKeown
The paper discusses the issue of faithfulness errors in abstractive summarization systems and proposes a framework for evaluating the effectiveness of such systems. The authors generate a faithfulness-abstractiveness trade-off curve to serve as a control and show that current methods for improving faithfulness fail to consistently improve over the control at the same level of abstractiveness. They then introduce a selector to identify the most faithful and abstractive summary for a given document and demonstrate that this system can attain higher faithfulness scores in human evaluations while being more abstractive than the baseline system on two datasets. Additionally, the authors show that their system achieves a better faithfulness-abstractiveness trade-off than the control at the same level of abstractiveness.
Qian Ruan, Malte Ostendorff, Georg Rehm
The paper proposes a new approach to improve extractive summarization models by explicitly incorporating hierarchical structure information into a pre-trained, encoder-only Transformer language model. The proposed HiStruct+ model outperforms a strong baseline on three datasets, including PubMed and arXiv, and the improvement is more significant for datasets with more conspicuous hierarchical structures. The ablation study shows that the hierarchical position information is the main contributor to the model's state-of-the-art performance.
Qianren Mao, Hongdong Zhu, Junnan Liu, Cheng Ji, Hao Peng, Jianxin Li, Lihong Wang, Zheng Wang
The paper discusses the limitations of using pre-trained BERT-based encoders for extractive text summarization and proposes a new approach called MuchSUM, which is a multi-channel graph convolutional network that incorporates multiple summary-worthy features. The approach introduces three specific graph channels to encode node textual features, node centrality features, and node position features, respectively, under bipartite word-sentence heterogeneous graphs. A cross-channel convolution operation is designed to distill the common graph representations shared by different channels, and the sentence representations of each channel are fused for extractive summarization. The approach also investigates three weighted graphs in each channel to infuse edge features for graph-based summarization modeling. Experimental results demonstrate that the MuchSUM model can achieve considerable performance compared with some BERT-initialized graph-based extractive summarization systems.
Siya Qi, Lei Li, Yiyang Li, Jin Jiang, Dingxin Hu, Yuze Li, Yingqi Zhu, Yanquan Zhou, Marina Litvak, Natalia Vanetik
The paper discusses the challenges of scientific paper summarization in NLP and presents a solution called SAPGraph. The framework utilizes paper structure to generate more comprehensive and valuable summaries compared to previous works that tend to extract summaries from the head of the paper. SAPGraph is based on a structure-aware heterogeneous graph that models the document as a graph with three kinds of nodes and edges based on structure information of facets and knowledge. The paper also provides a large-scale dataset of COVID-19-related papers, CORD-SUM, for experiments.
The paper discusses the limitations of the ROUGE metric for evaluating extractive summarization tasks and proposes a new evaluation metric called Sem-nCG, which is both rank-aware and semantic-aware. The paper also demonstrates how to generate more reliable semantic-aware ground truths for evaluating extractive summarization tasks without additional human intervention. Preliminary experimental results show that the Sem-nCG metric is semantic-aware and has a higher correlation with human judgement for single document summarization when a single reference is considered.
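To make "rank-aware" concrete, here is a minimal sketch of a generic normalized cumulative gain at cutoff k. It assumes that each source sentence has already been assigned a gain (for instance, from some semantic similarity to the reference summary); the gain values and the function name `ncg_at_k` are hypothetical, and the exact formulation of Sem-nCG may differ from this sketch.

```python
from typing import Dict, List


def ncg_at_k(model_ranking: List[int], gains: Dict[int, float], k: int) -> float:
    """Gain collected by the model's top-k sentences, divided by the best
    gain any k sentences could collect (the ideal ranking)."""
    achieved = sum(gains.get(i, 0.0) for i in model_ranking[:k])
    ideal = sum(sorted(gains.values(), reverse=True)[:k])
    return achieved / ideal if ideal > 0 else 0.0


# Hypothetical gains for four source sentences (higher = closer to the reference).
gains = {0: 3.0, 1: 1.0, 2: 2.0, 3: 0.0}

# A model that ranks sentence 2 first, then 3, 0, 1.
print(ncg_at_k([2, 3, 0, 1], gains, k=2))
# A model whose top two sentences match the ideal choice.
print(ncg_at_k([0, 2, 1, 3], gains, k=2))
```

Unlike n-gram overlap, a score of this kind depends on which sentences the system ranks highest, so two systems extracting overlapping text can still receive different scores.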
Kundan Krishna, Jeffrey Bigham, Zachary C. Lipton
The paper discusses pretraining techniques in text summarization and challenges the idea that knowledge transfer is the reason for its success. The authors show that pretraining on randomly selected character n-grams can achieve similar performance to models pretrained on real corpora, which could eliminate concerns over offensive language, bias, and copyright issues. The authors also design several tasks to test the structure of pretraining tasks, but find no significant benefit, leaving the possibility of a small role for knowledge transfer.
Ella Hofmann-Coyle, Mayank Kulkarni, Lingjue Xie, Mounica Maddela, Daniel Preoţiuc-Pietro
The paper discusses entity-centric summarization, which produces a summary of a document specific to a given target entity. Extractive summaries are preferred over abstractive ones as they preserve factuality and can be used in downstream tasks. The authors explore methods to solve this task by recasting it as a sentence selection task, using methods inspired by information retrieval. They test different architecture variants and loss functions and achieve up to a 5.8 F1 improvement over past state-of-the-art and outperform the entity-centric Lead 3 heuristic by 1.1 F1. The authors also show strong results on the related task of salient sentence selection for an entity.
Qianqian Xie, Jimin Huang
The paper proposes a new model called Graph contRastivE Topic Enhanced Language model (GRETEL) that combines the graph contrastive topic model with pre-trained language models (PLMs) to improve text summarization. The graph contrastive topic model integrates the hierarchical transformer encoder and graph contrastive learning to capture and integrate global semantic information from the document context and the gold summary. GRETEL aims to extract salient sentences that are topically related to the gold summary, rather than redundant sentences that cover sub-optimal topics. Experimental results on general domain and biomedical datasets show that GRETEL outperforms state-of-the-art methods.
Tuan-Anh Phan, Nam Bui
The paper discusses the effectiveness of Graph Neural Network (GNN)-based models in Natural Language Processing (NLP) tasks, particularly in Extractive Document Summarization (EDS). However, long-form document summarization using graph-based approaches is still a challenge. The paper proposes a new model called HeterGraphLongSum, which includes three types of semantic units (word, sentence, and passage) to represent long documents in a graph structure. The model achieves promising results for the extractive long document summarization problem without relying on pre-trained language models like BERT. The source code is available on Github for further exploration.
Xuan-Dung Doan, Le-Minh Nguyen, Nam Bui
The paper discusses the use of Heterogeneous Graph Neural Networks (HeterGNN) for document summarization, specifically for long documents. The authors address the issue of lacking inter-sentence connections and propose a solution by building a graph on sentence-level nodes and combining it with HeterGNN to capture semantic information. The experiments conducted on two benchmark datasets show that this method achieves state-of-the-art results in the field of document summarization.
Jin-ge Yao, Xiaojun Wan, Jianguo Xiao
The paper presents a sparse optimization framework for extractive document summarization with a decomposable convex objective function. An efficient ADMM algorithm is derived to solve it, and an additional sentence dissimilarity term is introduced to encourage diversity in the summaries. The framework achieves significant improvement over previous related work and is generalized to compressive summarization with a block coordinate descent algorithm. The compressive summarization results are competitive against state-of-the-art results while maintaining reasonable readability, as demonstrated on DUC 2006 and DUC 2007 datasets.
Wenpeng Yin, Yulong Pei
The paper proposes a new approach to extractive document summarization, which involves selecting salient sentences from a given document. The approach, called DivSelect+CNNLM, addresses two challenges: modeling information redundancy among candidate sentences and selecting the most appropriate sentences. It introduces a novel neural network language model based on convolutional neural network (CNN) to project sentences into dense distributed representations and models sentence redundancy using cosine similarity. The selection process is formulated as an optimization problem, constructing a diversified selection process (DivSelect) to select sentences with high prestige and dissimilarity. The approach is evaluated on benchmark datasets and shows effectiveness in summarization.
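The idea of trading prestige against redundancy measured by cosine similarity can be sketched as follows. Note that this is a simple greedy heuristic swapped in for illustration, not the optimization procedure used by DivSelect; the sentence vectors, prestige scores, and the `penalty` weight are all hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two dense sentence vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def greedy_diverse_select(
    vectors: List[Sequence[float]],
    prestige: List[float],
    k: int,
    penalty: float = 1.0,
) -> List[int]:
    """Repeatedly pick the sentence with the best prestige after subtracting
    its highest similarity to any sentence already selected."""
    selected: List[int] = []
    candidates = set(range(len(vectors)))

    def score(i: int) -> float:
        redundancy = max(
            (cosine(vectors[i], vectors[j]) for j in selected), default=0.0
        )
        return prestige[i] - penalty * redundancy

    while candidates and len(selected) < k:
        best = max(sorted(candidates), key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Sentences 0 and 1 are near-duplicates; sentence 2 covers different content.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
prestige = [0.9, 0.8, 0.5]
print(greedy_diverse_select(vectors, prestige, k=2))
```

With the redundancy penalty active, the second pick is the dissimilar sentence 2 rather than the higher-prestige near-duplicate sentence 1; setting `penalty=0.0` reduces the procedure to ranking by prestige alone.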
Diogo Pernes, Afonso Mendes, André F. T. Martins
The paper discusses the weaknesses of current abstractive summarization systems, such as the omission of relevant information and the generation of factual inconsistencies. It proposes an energy-based model that learns to re-rank summaries according to recent advances in summarization metrics, which consistently improves the scores achieved by the predicted summaries. However, the paper also notes that the re-ranking approach should be used with care for highly abstractive summaries, as the available metrics are not yet sufficiently reliable for this purpose.
Akim Tsvigun, Ivan Lysenko, Danila Sedashov, Ivan Lazichny, Eldar Damirov, Vladimir Karlov, Artemy Belousov, Leonid Sanochkin, Maxim Panov, Alexander Panchenko, Mikhail Burtsev, Artem Shelmanov
The paper discusses the challenges of creating human-curated annotated datasets for abstractive text summarization (ATS) and the potential of Active Learning (AL) to reduce the amount of annotation required. However, no effective AL query strategies existed for ATS, because uncertain instances are usually noisy and selecting them can degrade model performance. The paper proposes the first effective query strategy for AL in ATS based on diversity principles, which improves the model performance in terms of ROUGE and consistency scores. The paper also analyzes the effect of self-learning and shows that it can further increase the performance of the model.
Ziqiang Cao, Wenjie Li, Sujian Li, Furu Wei, Yanran Li
The paper discusses the challenges of extractive query-focused summarization, specifically the tasks of query relevance ranking and sentence saliency ranking. Previous systems have struggled to perform both tasks effectively, but the proposed system, AttSum, tackles them jointly using distributed representations and an attention mechanism. The system is evaluated on benchmark datasets and achieves competitive performance without the use of hand-crafted features. The authors also observe that the sentences identified as relevant to the query do indeed meet the query's needs.
Charlie Egan, Advaith Siddharthan
The paper proposes an abstractive approach to summarize argumentative discussions in online communities. The approach extracts key content through 'point' extraction, where a point is a verb and its syntactic arguments. The approach uses dependency parse information and verb case frames to identify and extract valid points and generates an abstractive summary that discusses the key points being made in the debate. The approach was evaluated using a corpus of online political debates and showed significant improvements over a high-performing extractive summarizer.
Jianpeng Cheng, Mirella Lapata
The paper proposes a new approach to extractive summarization using neural networks and continuous sentence features. The approach includes a hierarchical document encoder and an attention-based extractor, allowing for different classes of summarization models. The models were trained on large scale corpora and achieved results comparable to the state of the art without any linguistic annotation.
Tatsuya Ishigaki, Hiroya Takamura, Manabu Okumura
The paper proposes the task of question summarization and analyzes question-summary pairs from a Community Question Answering site. It finds that some questions require abstractive approaches instead of extractive approaches. The authors created a dataset and trained extractive and abstractive summarization models, comparing them based on ROUGE scores and manual evaluations. The results show that an abstractive method using an encoder-decoder model with a copying mechanism performs better according to both ROUGE-2 F-measure and human judges' evaluations.
Yuxiang Wu, Baotian Hu
The paper proposes a neural coherence model to capture cross-sentence semantic and syntactic coherence patterns in order to extract more coherent summaries. The proposed model can be trained in an end-to-end fashion using unlabeled data and is used in combination with the ROUGE package to design a reinforcement learning method to train a neural extractive summarizer called the Reinforced Neural Extractive Summarization (RNES) model. The RNES model learns to optimize coherence and informative importance of the summary simultaneously and outperforms existing baselines in terms of ROUGE on the CNN/Daily Mail dataset. The qualitative evaluation shows that summaries produced by RNES are more coherent and readable.
Qingyu Zhou, Nan Yang, Furu Wei, Shaohan Huang, Ming Zhou, Tiejun Zhao
The paper presents a new approach to extractive document summarization that combines sentence scoring and selection into a single neural network framework. The approach uses a hierarchical encoder to represent the document sentences and integrates the selection strategy into the scoring model. Experiments on the CNN/Daily Mail dataset show that the proposed framework outperforms existing extractive summarization models.
Xiuying Chen, Shen Gao, Chongyang Tao, Yan Song, Dongyan Zhao, Rui Yan
ITS is a new model for extractive text summarization that iteratively polishes the document representation on multiple passes through the document, inspired by the observation that humans often need to read an article multiple times to fully understand and summarize its contents. The model also includes a selective reading mechanism that accurately determines the extent to which each sentence should be updated. Experimental results on two datasets show that ITS outperforms state-of-the-art extractive systems when evaluated by both machines and humans.
Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Łukasz Kaiser, Noam Shazeer
The paper discusses a method for generating English Wikipedia articles by summarizing source documents using extractive summarization and a neural abstractive model. The abstractive model uses a decoder-only architecture that can attend to very long sequences, allowing it to generate fluent and coherent multi-sentence paragraphs and even whole articles. The model is able to extract relevant factual information when given reference documents, as reflected in perplexity, ROUGE scores, and human evaluations.
Xiaoyu Shen, Yang Zhao, Hui Su, Dietrich Klakow
The paper discusses the limitations of Pointer Generators in modern summarization systems, which are restricted to exact word matches and result in a bias towards extractive generations. The authors propose a solution by allowing the model to "edit" pointed tokens, transforming them into a target space with a learned relation embedding. The model is shown to capture more latent alignment relations, improve word alignment accuracy, generate higher quality summaries, and bring more abstraction to the generated summaries. The proposed approach is validated on three large-scale summarization datasets.
Edward Moroshko, Guy Feigenblat, Haggai Roitman, David Konopnicki
The paper proposes a new approach to summarization called the Editorial Network, which combines extractive and abstractive methods. This approach is applied as a postprocessing step to a sequence of extracted sentences. The paper also suggests a novel soft-labeling approach for training the "editor." The effectiveness of this approach is demonstrated using the CNN/DailyMail dataset, and it is shown to outperform state-of-the-art extractive-only or abstractive-only baselines.
Elozino Egonmwan, Yllias Chali
The paper proposes a system that enhances performance on single document summarization tasks using the CNN/DailyMail and Newsroom datasets. The system follows the encoder-decoder paradigm but with a focus on the encoder. The authors introduce a framework that encodes the source text with a transformer and then a sequence-to-sequence model. They find that the transformer and seq2seq model complement each other, resulting in a richer encoded vector representation. Additionally, paying more attention to the vocabulary of target words during abstraction improves performance. The authors test this framework on both extractive and abstractive single document summarization tasks.
Afonso Mendes, Shashi Narayan, Sebastião Miranda, Zita Marinho, André F. T. Martins, Shay B. Cohen
The paper presents a new neural model for text summarization that extracts sentences from a document and compresses them to generate concise and informative summaries. The model dynamically determines the length of the output summary based on gold summaries observed during training, and does not require length constraints typical to extractive summarization. The model achieves state-of-the-art results on the CNN/DailyMail and Newsroom datasets, improving over current extractive and abstractive methods. A new dataset of oracle compressive summaries derived automatically from the CNN/DailyMail reference summaries is also made available.
Yang Liu, Ivan Titov, Mirella Lapata
The paper proposes a new approach to single-document extractive summarization, using a multi-root dependency tree to generate summaries. The model is designed to refine its structures through an iterative algorithm, and is shown to perform competitively against existing methods on two benchmark datasets. This approach differs from previous methods that rely on linguistically motivated document representations.
Lea Frermann, Alexandre Klementiev
The paper discusses aspect-based summarization, which generates a summary centered around a specific aspect of a document. The authors induce latent document structure and train their models in a scalable synthetic setup, resulting in improvements in summarization over topic-agnostic baselines. The models accurately segment documents by aspect and can produce both abstractive and extractive aspect-based summaries. The learned document structure is particularly advantageous for summarizing long documents, and the results transfer from synthetic training documents to natural news articles from CNN/Daily Mail and RCV1.
Elozino Egonmwan, Vittorio Castelli, Md Arafat Sultan
The paper explores the possibility of transferring knowledge between machine reading comprehension (MRC) and query-based text summarization. The authors use an MRC model trained on the SQuAD1.1 dataset to build an extractive query-based summarizer, which compresses the output of the MRC model using a new sentence compression technique. They also use pre-trained machine translation systems to abstract the extracted summaries. The models achieve state-of-the-art results on the CNN/Daily Mail and Debatepedia datasets, and can serve as powerful baselines for future systems. The authors hope that their results will encourage further research on transfer learning from large MRC corpora to query-based summarization.
Rajdeep Mukherjee, Hari Chandana Peruri, Uppada Vishnu, Pawan Goyal, Sourangshu Bhattacharya, Niloy Ganguly
The paper discusses the time-consuming process of manually extracting relevant aspects and opinions from large volumes of user-generated text. It proposes a solution for generating personalized aspect-based opinion summaries from online tourist reviews, allowing readers to control various attributes of the summary. The approach involves an unsupervised method to extract coherent aspects and an Integer Linear Programming (ILP) based extractive technique to select informative opinions around those aspects while respecting user-specified values. The authors evaluate and compare their summaries using crowdsourcing and ROUGE-based metrics and obtain competitive results.
Liqiang Xiao, Lu Wang, Hao He, Yaohui Jin
The paper proposes a hybrid framework for summarization called HYSUM that combines extractive and abstractive methods to generate informative and concise summaries. Existing extract-then-abstract methods suffer from information loss in the abstraction step, but HYSUM can switch between copying and rewriting sentences based on redundancy to effectively combine the advantages of both methods. The paper also proposes an end-to-end reinforcing method based on Hierarchical Reinforcement Learning to enhance cooperation between the extraction and rewriting modules. Automatic and human evaluations show that HYSUM outperforms existing models on the CNN/DailyMail corpus.
Arthur Bražinskas, Mirella Lapata, Ivan Titov
The paper discusses the task of opinion summarization, which involves creating text that reflects subjective information expressed in multiple documents, such as user reviews of a product. The lack of large datasets for training supervised models has led to the use of extractive methods that select text fragments in an unsupervised or weakly-supervised way. However, recent research has shown that abstractive summaries can also be produced in an unsupervised fashion. The paper presents a method that uses a handful of summaries to bootstrap the generation of summary text with expected properties such as writing style, informativeness, fluency, and sentiment preservation. The approach involves training a conditional Transformer language model to generate a new product review given other available reviews of the product, and fine-tuning a plug-in module that predicts property values on a handful of summaries. The approach outperforms previous extractive and abstractive methods in automatic and human evaluation on Amazon and Yelp datasets.
Edwin Simpson, Yang Gao, Iryna Gurevych
The paper proposes an interactive text ranking approach that uses Bayesian optimization to focus on high-quality candidates and integrate prior knowledge to address the lack of user or task-specific training data. The approach significantly outperforms existing interactive approaches in community question answering and extractive multidocument summarization. The ranking function learned by the method is also an effective reward function for reinforcement learning, improving the state of the art for interactive text ranking.
Shrey Desai, Jiacheng Xu, Greg Durrett
The paper proposes a new approach to compressive summarization that uses data-driven criteria of plausibility and salience to determine which spans of sentences can be deleted. A pre-trained Transformer model judges each criterion, and only deletions that are both plausible and not salient are applied. The approach achieves strong in-domain results on benchmark summarization datasets and can generalize cross-domain with fine-tuning on only 500 samples. Human evaluation shows that the plausibility model generally selects for grammatical and factual deletions.
Shahbaz Syed, Roxanne El Baff, Khalid Al-Khatib, Johannes Kiesel, Benno Stein, Martin Potthast
The paper discusses the lack of exploration in automatic summarization of argumentative texts and presents a new corpus of 1330 summaries for 266 news editorials. The summaries are evaluated based on a specific annotation scheme and aim to be thesis-indicative, persuasive, reasonable, concise, and self-contained. The corpus contains at least three high-quality summaries for about 90% of the editorials, making it useful for the development and evaluation of summarization technology for long argumentative texts. The paper also reports on an in-depth corpus analysis and the evaluation of two extractive summarization models.
Quentin Grail, Julien Perez, Eric Gaussier
The paper discusses the limitations of using current transformer-based architectures for fine-tuning large language models on downstream tasks that require reasoning with long documents. To address this issue, the authors introduce a novel hierarchical propagation layer that spreads information between multiple transformer windows. They validate the effectiveness of their approach on three extractive summarization corpora of long scientific papers and news articles and report state-of-the-art results for long document summarization and comparable results for smaller document summarization.
Tianyu Zhu, Wen Hua, Jianfeng Qu, Xiaofang Zhou
The paper proposes a new extractive summarization model called HEROES to address the deficiencies of existing models for summarizing long-form documents. The two main deficiencies are the increase in computation due to the size of the input document and the lack of exploitation of discourse structural information. HEROES consists of two modules: a content ranking module that selects important sections and sentences to create a short digest, and an extractive summarization module based on a heterogeneous graph with nodes from different discourse levels and designed edge connections to reflect the discourse hierarchy of the document. Experimental results show that HEROES outperforms various strong baselines.
Linzi Xing, Wen Xiao, Giuseppe Carenini
This paper introduces a new technique to reduce lead bias in news articles and improve the performance of neural extractive summarizers on data with different or no bias. The experiments conducted on two news corpora show that this technique effectively reduces the model's learned lead bias and improves its generality on out-of-distribution data, without any significant loss in performance on in-distribution data.
Guangsheng Bao, Yue Zhang
The paper discusses the limitations of extractive summarization and the potential benefits of abstractive rewriting. However, abstractive rewriting systems only consider extracted summaries as input, which can result in the loss of important background knowledge. To address this issue, the authors propose a contextualized rewriting approach that takes in the entire original document. They formalize this approach as a seq2seq problem with group alignments and introduce group tags to model the alignments. The system identifies extracted summaries through content-based addressing and achieves significant improvements on ROUGE scores compared to non-contextualized rewriting systems without requiring reinforcement learning.
Sharmila Reddy Nangi, Atharv Tyagi, Jay Mundra, Sagnik Mukherjee, Snehal Raj, Aparna Garimella, Niyati Chhaya
The paper proposes methods to automatically create deep learning models for extractive and abstractive text summarization tasks, which have shown state-of-the-art performances on various datasets. The methods use a combination of Neural Architecture Search and Knowledge Distillation techniques, leveraging the knowledge provided by large language models such as BERT and GPT-2 to develop smaller, customized models for any given dataset. The proposed methods achieve near state-of-the-art performances in terms of accuracy while reducing inference time and model size.
Jiaxin Ju, Ming Liu, Huan Yee Koh, Yuan Jin, Lan Du, Shirui Pan
The paper presents an unsupervised extractive approach to summarize scientific long documents using the Information Bottleneck principle. The approach involves using signals as queries to retrieve key content from the source document, followed by a pre-trained language model to conduct further sentence search and editing to return the final extracted summaries. The framework can be extended to a multi-view framework by different signals. The proposed framework was evaluated on three scientific document datasets and was found to be effective. Human evaluation suggests that the extracted summaries cover more content aspects than previous systems.
Zixing Song, Irwin King
The paper proposes a new approach to summarization that incorporates the constituent structure of the text using Graph Neural Networks. They use a hierarchical heterogeneous graph attention network over constituency-based parse trees for syntax-aware summarization, which reflects how humans construct summaries hierarchically. The model is effective for both abstractive and extractive summarization tasks on five benchmark datasets from various domains, and further performance improvement can be obtained using state-of-the-art pre-trained models.
Sajad Sotudeh, Nazli Goharian
The paper discusses the challenges of generating long/extended summaries for scientific papers, which provide more detailed information than traditional abstracts. The authors propose an extractive summarizer called TSTR that uses introductory information as pointers to salient information. The evaluations on two large-scale datasets show significant improvement in ROUGE and average ROUGE scores compared to strong baselines and state-of-the-art methods. Human evaluations also favor TSTR-generated extended summaries in terms of cohesion and completeness.
Zhihua Jiang, Junzhan Yang, Dongning Rao
The paper discusses the challenges of long-document summarization and proposes a Simple yet Effective HYbrid approach (SEHY) that selects salient sections instead of sentences for summary generation. The approach exploits discourse information and avoids full-text understanding while retaining salient information within the length limit. The paper also presents two strategies for training the extractor and evaluates the approach on a large-scale scientific paper dataset. The authors further discuss how the disciplinary class of a scientific paper affects the performance of SEHY. Experimental results show the effectiveness of the approach and interesting findings on arXiv and its subsets.
Thomas Scialom, Paul-Alexis Dray, Patrick Gallinari, Sylvain Lamprier, Benjamin Piwowarski, Jacopo Staiano, Alex Wang
The paper discusses the limitations of current metrics for evaluating summarization, such as ROUGE, and proposes a new framework called QUESTEVAL. Unlike other metrics, QUESTEVAL does not require a ground-truth reference and relies on question answering models to assess whether a summary contains all the relevant information from its source document. The paper shows that QUESTEVAL significantly improves the correlation with human judgments over four evaluation dimensions: consistency, coherence, fluency, and relevance. The authors also provide code and models for the framework.
Marcello Gecchele, Hiroaki Yamada, Takenobu Tokunaga, Yasuyo Sawaki
The paper proposes a method for computer-assisted content evaluation of summaries by establishing a correspondence between segments of the source text and its summary using "Idea Units (IUs)." The IU correspondence is based on the similarity between vector representations of IU. The proposed method is more robust against rephrased expressions than conventional ROUGE-based baselines and outperformed the baselines in recall. The proposed method has been implemented in a GUI tool called "Segment Matcher" to help teachers establish a link between corresponding IUs across the summary and source text.
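The similarity-based IU correspondence described above can be illustrated with a minimal sketch. It assumes each Idea Unit has already been mapped to a vector by some embedding model; the tiny hand-written vectors, the `match_idea_units` helper, and the 0.5 threshold are hypothetical and are not taken from the paper or from the Segment Matcher tool.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_idea_units(summary_vecs, source_vecs, threshold=0.5):
    """Link each summary IU to its most similar source IU.

    A summary IU is linked to None when no source IU reaches the
    (hypothetical) similarity threshold.
    """
    links = {}
    for s_id, s_vec in summary_vecs.items():
        best_id, best_sim = None, threshold
        for t_id, t_vec in source_vecs.items():
            sim = cosine(s_vec, t_vec)
            if sim >= best_sim:
                best_id, best_sim = t_id, sim
        links[s_id] = best_id
    return links


# Toy vectors standing in for real IU embeddings (assumption).
source_vecs = {"src1": [1.0, 0.0, 0.2], "src2": [0.0, 1.0, 0.1]}
summary_vecs = {"sum1": [0.9, 0.1, 0.1], "sum2": [0.0, 0.0, 1.0]}
links = match_idea_units(summary_vecs, source_vecs)
```

Because the match depends on vector similarity rather than shared n-grams, a rephrased summary IU can still be linked to its source IU, provided the embedding places the two close together.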
Chenxin An, Ming Zhong, Zhiyong Wu, Qin Zhu, Xuanjing Huang, Xipeng Qiu
The paper proposes a new framework called COLO for one-stage summarization that uses contrastive learning to generate summaries directly based on summary-level scores, without additional modules or parameters. The framework improves extractive and abstractive results on the CNN/DailyMail benchmark while maintaining parameter and inference efficiency. Compared to state-of-the-art multi-stage systems, COLO saves more than 100 GPU training hours and has a 3-8x speed-up ratio during inference while achieving comparable results.
Arman Cohan, Nazli Goharian
The paper proposes a new approach to summarizing scientific articles that takes into account citation-context and the document discourse model. The method overcomes the problem of inconsistency between citation summaries and the article's content by providing context for each citation. The approach leverages the inherent scientific article's discourse for producing better summaries and shows a significant improvement over existing summarization approaches in terms of ROUGE scores on a scientific summarization dataset. The method is adaptable to other domains beyond the biomedical domain used for evaluation.
Wencan Luo, Diane Litman
The paper proposes a new algorithm for summarizing student responses to reflection prompts. Unlike traditional methods, the algorithm creates summaries from extracted phrases rather than sentences, and ranks the phrases by the number of students who mention them. Experimental results show that this approach outperforms other summarization methods in terms of ROUGE scores.
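The ranking step described above, ordering phrases by the number of students who mention them, can be sketched as follows. This is a simplified illustration: phrase extraction is assumed to have happened already, and phrases are grouped by exact match after lowercasing, which is an assumption made here and not necessarily how the paper groups similar phrases. The `rank_phrases` helper and the sample data are hypothetical.

```python
from collections import defaultdict


def rank_phrases(responses):
    """Rank phrases by how many distinct students mention them.

    responses: mapping of student id -> list of phrases extracted
    from that student's response.
    """
    students_per_phrase = defaultdict(set)
    for student, phrases in responses.items():
        for phrase in phrases:
            # A student counts once per phrase, however often they repeat it.
            students_per_phrase[phrase.lower().strip()].add(student)
    # Most-mentioned first; ties broken alphabetically for determinism.
    return sorted(
        ((phrase, len(students)) for phrase, students in students_per_phrase.items()),
        key=lambda item: (-item[1], item[0]),
    )


responses = {
    "s1": ["the homework", "recursion"],
    "s2": ["recursion", "recursion"],
    "s3": ["Recursion", "the exam"],
}
ranking = rank_phrases(responses)
```

Counting distinct students rather than raw occurrences keeps one verbose response from dominating the summary.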
Philip John Gorinski, Mirella Lapata
The paper discusses the task of movie script summarization and how it can improve script browsing, provide a general idea of the plotline, and reduce reading time. The authors propose a graph-based model that selects an optimal chain of scenes by considering logical progression, diversity, and importance. Human evaluation shows that their model produces more informative summaries compared to other methods.
Arman Cohan, Nazli Goharian
The paper proposes an unsupervised model that uses distributed representation of words and domain knowledge to extract context from referenced papers to reflect their exact contributions. The model significantly outperforms the state-of-the-art and improves citation-based summarization of scientific articles. The paper highlights the importance of appropriate context for citation texts and presents a solution to address this problem.
Jeffrey Ling, Alexander M. Rush
The paper proposes a new approach to document summarization using a coarse-to-fine attention model that hierarchically reads a document. This approach selects top-level chunks of text using coarse attention and then reads the words of the chosen chunks using fine attention. Unlike standard attention models, this method scales with the number of top-level chunks and can handle longer sequences. While it may lag behind state-of-the-art baselines, the proposed method achieves the desired behavior of sparsely attending to subsets of the document for generation.
Florian Boudin, Hugo Mougard, Benoit Favre
The paper discusses the challenges of sentence selection in concept-based summarization, which is modelled as a budgeted maximum coverage problem. To find optimal solutions efficiently, low-weight concepts need to be pruned. However, reducing the number of concepts leads to lower ROUGE scores and multiple optimal solutions. The authors propose an extension to the model that provides a single optimal solution and eliminates the need for concept pruning using an approximation algorithm that achieves comparable performance to exact inference.
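To make the budgeted maximum coverage formulation concrete, here is a minimal sketch of a generic greedy heuristic: repeatedly pick the sentence with the best ratio of newly covered concept weight to length, as long as it fits the length budget. This is a textbook-style approximation chosen for illustration, not the authors' algorithm; the `greedy_summary` helper, the concept weights, and the sentence lengths are all hypothetical.

```python
def greedy_summary(sentences, concept_weights, budget):
    """Greedy heuristic for budgeted maximum coverage.

    sentences: list of (length, set_of_concepts) pairs, length > 0.
    concept_weights: mapping of concept -> weight.
    budget: maximum total length of the selected sentences.
    Returns (selected sentence indices, set of covered concepts).
    """
    selected, covered, used = [], set(), 0
    remaining = set(range(len(sentences)))
    while remaining:
        best, best_ratio = None, 0.0
        for i in sorted(remaining):
            length, concepts = sentences[i]
            if used + length > budget:
                continue  # sentence does not fit in the remaining budget
            # Each concept is counted once, however many sentences contain it.
            gain = sum(concept_weights[c] for c in concepts - covered)
            ratio = gain / length
            if ratio > best_ratio:
                best, best_ratio = i, ratio
        if best is None:
            break  # nothing fits, or nothing adds new weight
        selected.append(best)
        covered |= sentences[best][1]
        used += sentences[best][0]
        remaining.remove(best)
    return selected, covered


sentences = [(10, {"a", "b"}), (6, {"b", "c"}), (5, {"a"}), (12, {"c", "d"})]
concept_weights = {"a": 3, "b": 2, "c": 2, "d": 1}
selected, covered = greedy_summary(sentences, concept_weights, budget=16)
```

In this toy instance the heuristic picks sentences 1 and 2, covering concepts a, b, and c within the budget of 16.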
Sun Kim, Lana Yeganova, John Wilbur
The paper proposes a method for improving the search and browsing experience in PubMed by finding sub-topics or themes from a set of documents and computing representative titles for each theme. The method combines a thematic clustering algorithm and the Pool Adjacent Violators algorithm to induce significant themes. The system was tested on five disease sets from OMIM and outperformed LDA in terms of performance measures. The quality of theme titles was also evaluated by comparing them with manually created titles.
Chen Li, Zhongyu Wei, Yang Liu, Yang Jin, Fei Huang
The paper explores using public posts on social media to improve automatic summary generation for news articles. Different approaches are proposed, including using frequency information from posts to re-estimate bigram weights and re-weighting a dependency tree edge's importance for sentence compression. The experiments conducted on Facebook data show that relevant public posts can be effectively leveraged to improve news article summarization results.
Greg Durrett, Taylor Berg-Kirkpatrick, Dan Klein
The paper presents a model for single-document summarization that combines compression and anaphoricity constraints. The model selects textual units for the summary based on learned weights from a large corpus. Compression rules allow for content deletion within a sentence, and anaphoricity constraints ensure cross-sentence coherence by including pronoun antecedents or rewriting pronouns as full mentions. The final system outperforms prior work on both ROUGE and human judgments of linguistic quality.
Wencan Luo, Fei Liu, Zitao Liu, Diane Litman
The paper proposes a new approach to summarizing student course feedback using the integer linear programming (ILP) framework. This approach allows different student responses to share co-occurrence statistics and alleviates sparsity issues. The experimental results on a student feedback corpus show that this approach outperforms a range of baselines in terms of both ROUGE scores and human evaluation.
Daraksha Parveen, Mohsen Mesgar, Michael Strube
The paper introduces a new approach to automatic summarization of scientific articles that takes into account coherence. The approach uses a graph-based model and coherence patterns mined from a corpus of abstracts to generate summaries that are coherent, important, and non-redundant. The approach is optimized using Mixed Integer Programming and outperforms baseline and state-of-the-art systems in terms of coherence and relevance.
Ottokar Tilk, Tanel Alumäe
This paper discusses the challenges of improving headline quality on smaller datasets using neural headline generation models. The authors propose a new method that allows for pre-training all parameters of the model and utilizing all available text. This approach resulted in significant improvements in perplexity and ROUGE scores, with up to a 32.4% relative improvement in perplexity and 2.84 points in ROUGE.
Gyoung Ho Lee, Kong Joo Lee
The paper discusses the use of simple embedding features in a reinforcement learning approach to automatic text summarization. The authors propose a new deep learning network for estimating the Q-values used in reinforcement learning and evaluate their model using ROUGE scores on various datasets. The results show that their model is competitive with previous models.
Kundan Krishna, Aniket Murhekar, Saumitra Sharma, Balaji Vasan Srinivasan
The paper proposes a neural framework for generating summaries that are tailored to the linguistic preferences of a specific audience. Existing frameworks do not take into account such preferences, but the proposed method tunes the summary words at the time of generation to match the target vocabulary. The evaluations show that the proposed approach maintains a superior summary quality compared to a word embedding based lexical substitution algorithm. The paper demonstrates two applications of the proposed approach to generate summaries with simpler or shorter words for better readability.
Michihiro Yasunaga, Jungo Kasai, Rui Zhang, Alexander R. Fabbri, Irene Li, Dan Friedman, Dragomir R. Radev
The paper discusses the challenges of scientific article summarization and proposes solutions to these challenges. The authors develop and release a large-scale manually-annotated corpus for scientific papers on computational linguistics and propose summarization methods that integrate the authors' original highlights and the article's actual impacts on the community to create comprehensive, hybrid summaries. The authors conduct experiments to demonstrate the efficacy of their corpus in training data-driven models for scientific paper summarization and the advantage of their hybrid summaries over abstracts and traditional citation-based summaries. The large annotated corpus and hybrid methods provide a new framework for scientific paper summarization research.
Florian Böhm, Yang Gao, Christian M. Meyer, Ori Shapira, Ido Dagan, Iryna Gurevych
The paper discusses the limitations of using ROUGE scores as rewards in Reinforcement Learning (RL) based document summarisation systems, as high ROUGE scores do not necessarily correspond to high human judgement. To address this, the authors learn a reward function from human ratings on 2,500 summaries, which only takes the document and system summary as input. The learned rewards are shown to have significantly higher correlation with human ratings than previous approaches. The authors conduct human evaluation experiments and find that RL systems using their learned rewards generate summaries with higher human ratings compared to state-of-the-art supervised-learning systems and ROUGE-as-rewards RL summarisation systems. The learned reward function and source code are available at https://github.com/yg211/summary-reward-no-reference.
Kristjan Arumae, Parminder Bhatia, Fei Liu
The paper discusses the benefits of creating summary highlights at the sub-sentence level and proposes a method for generating them by annotating summary-worthy sub-sentences and teaching classifiers to do the same. The task is framed as jointly selecting important sentences and identifying a single most informative textual unit from each sentence, which reduces the complexity involved in sentence compression. The study provides new benchmarks and baselines for generating highlights at the sub-sentence level.
Hui Liu, Xiaojun Wan
The paper discusses product review summarization, which is a personalized and targeted form of text summarization that provides a brief summary of an online product review. The authors explore different ways to use user and product information to improve review summarization and demonstrate that their approaches are highly effective and outperform existing summarization methods. This technique is useful for both sellers and consumers in making purchase decisions.
Takuya Makino, Tomoya Iwakura, Hiroya Takamura, Manabu Okumura
The paper proposes a global optimization method called GOLC for neural text summarization models that increases the probabilities of generating summaries with high evaluation scores within a desired length. The method is compared to two other optimization methods on two datasets and the results show that GOLC generates fewer overlength summaries while maintaining the fastest processing speed. The importance of generating in-length summaries for post-editing is also demonstrated, with approximately 30% to 40% improved post-editing time by use of in-length summaries.
Asma Ben Abacha, Dina Demner-Fushman
The paper discusses the challenge of question understanding in question answering, particularly in the context of natural language questions that are longer than necessary and contain peripheral information. The authors study neural abstractive models for medical question summarization and introduce the MeQSum corpus of 1,000 summarized consumer health questions. They explore data augmentation methods and evaluate state-of-the-art neural abstractive models on this task. The authors show that semantic augmentation from question datasets improves performance and that pointer-generator networks outperform sequence-to-sequence attentional models, achieving a ROUGE-1 score of 44.16%. The paper also includes a detailed error analysis and suggestions for improving question summarization.
Roy Bar-Haim, Lilach Eden, Roni Friedman, Yoav Kantor, Dan Lahav, Noam Slonim
The paper proposes a method for generating concise summaries from a large collection of arguments on a given topic by representing them as a small set of key points, each scored according to its salience. The authors analyze a large dataset of crowd-contributed arguments and find that a small number of key points per topic is typically sufficient for covering the vast majority of the arguments. They also show that a domain expert can often predict these key points in advance. The paper introduces a novel large-scale dataset for the task of argument-to-key point mapping and reports promising empirical results for an extensive set of experiments with this dataset.
Wen Xiao, Giuseppe Carenini
The paper explores the problem of redundancy in neural summarization and proposes three new methods to balance non-redundancy and importance when summarizing long documents. The authors organize existing methods into categories based on when and how redundancy is considered and show that their proposed methods achieve state-of-the-art ROUGE scores while significantly reducing redundancy on two scientific paper datasets.
Chao Zhao, Snigdha Chaturvedi
The paper proposes a generative method called ASPMEM for opinion summarization from online product reviews. ASPMEM contains an array of memory cells to store aspect-related knowledge, which helps obtain a better opinion representation and infer aspect information more precisely. The method is evaluated on both aspect identification and opinion summarization tasks and outperforms state-of-the-art methods without relying on human supervision. The proposed method uses domain knowledge from external sources to automatically identify relevant aspects, eliminating the need for additional human effort.
Roy Bar-Haim, Yoav Kantor, Lilach Eden, Roni Friedman, Dan Lahav, Noam Slonim
The paper discusses the importance of not only extracting salient points when summarizing a collection of views, arguments or opinions, but also quantifying their prevalence. The traditional approach of creating textual summaries lacks this quantitative aspect. The paper proposes a method for automatic extraction of key points, which enables fully automatic analysis and achieves performance comparable to a human expert. The applicability of key point analysis goes beyond argumentation data, as demonstrated by promising results in municipal surveys and user reviews. The paper also presents an in-depth evaluation of argument-to-key point matching models, where previous results are substantially outperformed.
Hiroaki Hayashi, Prashant Budania, Peng Wang, Chris Ackerson, Raj Neervannan, Graham Neubig
The paper proposes a new dataset, WikiAsp, for multi-domain aspect-based summarization, which aims to encourage research in open-domain aspect-based summarization. The dataset is built using Wikipedia articles from 20 different domains, and several baseline models are proposed and tested. The results highlight challenges that existing summarization models face in this setting, such as handling pronouns and time-sensitive information.
Zeyu Dai, Ruihong Huang
The paper proposes a joint model for structure-based news genre classification that identifies one of four commonly used news structures and recognizes a sequence of news elements within the article that define the corresponding news structure. The joint model consistently outperforms its variants that perform two tasks independently, which supports the idea that preserving the two-way dependencies and constraints between a type of news structure and its sequence of news elements enables the model to better predict both of them. The system's predicted news structure type and news elements have improved the performance of text summarization when incorporated into a recent neural network system.
Rui Meng, Khushboo Thaker, Lei Zhang, Yue Dong, Xingdi Yuan, Tong Wang, Daqing He
The paper discusses faceted summarization, which provides multiple summaries of a long document from different perspectives, each targeting specific sections such as purpose, method, findings, and value. The lack of large-scale faceted summarization datasets has hindered research in this area, but the authors present FacetSum, a benchmark built on Emerald journal articles covering diverse domains. The study's analyses and empirical results highlight the importance of structured summaries, and the authors believe FacetSum will drive further advances in summarization research and NLP systems that can leverage structured information in both long texts and summaries.
Sheikh Muhammad Sarwar, Felipe Moraes, Jiepu Jiang, James Allan
The paper discusses a new approach to query-biased summarization (QBS) that aims to reduce user effort in finding relevant documents. The approach identifies missing information in a retrieved document and presents it in a search interface for crowd workers to judge document relevance based on snippets and missing information. The method, called DSPApprox, uses classical approaches to find terms or phrases relevant to a query. The experimental results show both benefits and limitations of the method compared with traditional ones that only show relevant snippets.
Zhongyi Yu, Zhenghao Wu, Hao Zheng, Zhe XuanYuan, Jefferson Fong, Weifeng Su
The paper discusses fixed-length summarization and the trade-off between length controllability and summary quality. The authors introduce a new length-controlling unit called LenAtten, which improves length controllability and ROUGE scores while maintaining strong generalization ability. The experimental results show that their model is significantly better than the best-performing length-controllable summarizer on the CNN/Daily Mail dataset.
Jinpeng Hu, Jianling Li, Zhihong Chen, Yaling Shen, Yan Song, Xiang Wan, Tsung-Hui Chang
The paper discusses the challenges faced by radiologists in writing impression sections of radiology reports, which summarize essential findings and are critical for communicating medical information to physicians. Automatic impression generation has emerged as an attractive research direction to facilitate this clinical practice. The paper proposes a novel method for automatic impression generation, where a word graph is constructed from the findings to record critical words and their relations, and a Word Graph guided Summarization model (WGSUM) is designed to generate impressions with the help of the word graph. Experimental results on two datasets confirm the validity and effectiveness of the proposed approach, achieving state-of-the-art results. Further experiments are conducted to analyze the impact of different graph designs on the performance of the method.
Nadav Oved, Ran Levy
The paper discusses the challenges of automatically producing concise and informative summaries for product reviews, including the tendency for summarizers to favor generic content and the potential for self-contradicting summaries due to reviewer disagreements. The authors propose the PASS system, which uses a pre-trained Transformer-based model and applies systematic perturbations to generate multiple summaries per product. The system also includes a method for ranking the summaries based on coherence. The authors compare their system to other methods and show that it produces more informative, diverse, and coherent summaries.
Chao-Chun Hsu, Chenhao Tan
The paper proposes a new approach to summarization called decision-focused summarization, which aims to summarize relevant information for a particular decision. They use a predictive model to make the decision based on the full text and then select representative sentences that lead to similar model decisions while accounting for non-redundancy. The method, called DecSum, is evaluated on a testbed where the task is to summarize restaurant reviews to predict future ratings on Yelp. DecSum outperforms other summarization methods in decision faithfulness and representativeness and enables humans to outperform random chance in predicting which restaurant will be better rated in the future.
Chenxin An, Ming Zhong, Yiran Chen, Danqing Wang, Xipeng Qiu, Xuanjing Huang
The paper proposes a new approach to scientific paper summarization that utilizes the citation network of the papers. The authors argue that previous approaches have focused too much on the content of the papers and have not taken into account the importance of the citation network. They introduce a new model called CGSUM that incorporates both the source paper and its references. They also construct a new dataset called Semantic Scholar Network (SSN) that contains 141K research papers and 661K citation relationships. The experiments show that the proposed model outperforms pretrained models even with a simple architecture and that the citation graph is crucial for generating high-quality summaries.
Xinnuo Xu, Ondřej Dušek, Shashi Narayan, Verena Rieser, Ioannis Konstas
The paper discusses the problem of 'extrinsic hallucinations' in single-document news summarization, where the summary contains facts not present in the source document. The authors propose using multiple supplementary resource documents to mitigate this problem and present a new dataset called MIRANEWS to benchmark existing summarization models. They show that more than 27% of facts mentioned in the gold summaries of MIRANEWS are better grounded on assisting documents than in the main source articles. The authors also conduct an error analysis of generated summaries from pretrained models fine-tuned on MIRANEWS, revealing that assisted summarization reduces 55% of hallucinations when compared to single-document summarization models trained on the main article only.
Jinpeng Hu, Zhuo Li, Zhihong Chen, Zhen Li, Xiang Wan, Tsung-Hui Chang
The paper proposes a unified framework for automatic impression generation in radiology reports that leverages both extra knowledge and the original findings in an integrated way. The proposed method encodes each input finding using a text encoder and constructs a graph through its entities and dependency tree. A graph encoder is then adopted to model relation information in the constructed graph. Finally, contrastive learning is introduced to emphasize key words in the findings. The experimental results on OpenI and MIMIC-CXR confirm the effectiveness of the proposed method.
Hayate Iso, Xiaolan Wang, Stefanos Angelidis, Yoshihiko Suhara
The paper proposes a new task called comparative opinion summarization, which generates two contrastive summaries and one common summary from two different sets of reviews to help users compare multiple choices. The authors develop a framework called COCOSUM, which consists of two base summarization models that jointly generate the summaries. Experimental results show that COCOSUM produces higher-quality summaries than existing opinion summarization models. The dataset and code are available for use.
Kexun Zhang, Jiaao Chen, Diyi Yang
The paper discusses the task of automatic email to-do item generation, which involves generating action mentions from emails to help people schedule their daily work. The authors propose a learning to highlight and summarize framework (LHS) to identify the most salient text and actions and generate more faithful to-do items. The LHS model outperforms baseline models and achieves state-of-the-art performance in both quantitative evaluation and human judgement. The paper also highlights specific challenges that current models face with email to-do summarization.
Naman Bansal, Mousumi Akter, Shubhra Kanti Karmaker
The paper introduces a new NLP task called Semantic Overlap Summarization (SOS), which involves generating a summary from multiple alternative narratives. The authors created a benchmark dataset by collecting alternative narrative pairs and manually creating reference summaries. They found that the popular ROUGE metric is not suitable for evaluating this task and instead used a sentence-wise annotation technique with three overlap labels. Their experiments showed that this technique yielded higher correlation with human judgment and higher inter-rater agreement compared to the ROUGE metric.
Pengshan Cai, Fei Liu, Adarsha Bajracharya, Joe Sills, Alok Kapoor, Weisong Liu, Dan Berlowitz, David Levy, Richeek Pradhan, Hong Yu
The paper discusses the problem of physicians not having enough time to write clear and informative after-visit summaries for patients, and explores the possibility of using automatic generation of summaries. The study uses a clinical dataset to examine whether automatic summaries can effectively convey the important details of clinical visits. The results suggest that generating lay language after-visit summaries is still a challenging task, but a feedback mechanism is introduced to alert physicians when automatic summaries fail to capture important details or contain potentially detrimental information. Automatic and human evaluation shows the effectiveness of this approach in providing writing feedback and supporting physicians.
Mounica Maddela, Mayank Kulkarni
The paper discusses controllable summarization, which aims to provide summaries that take into account user-specified aspects and preferences. The authors introduce a human-annotated data set (ENTSUM) for controllable summarization with a focus on named entities as the aspects to control. They conduct an extensive analysis and show that existing methods for controllable summarization fail to generate entity-centric summaries. The authors propose extensions to state-of-the-art summarization approaches that achieve substantially better results on their data set. The paper highlights the challenging nature of this task and the proposed data set.
Choongwon Park, Youngjoong Ko
The paper discusses the task of Query-Focused Summarization (QFS) and the limitations of Transformer-based summarization models in utilizing relationships between distant words and query information. To address these issues, the authors propose the QSG Transformer, a novel QFS model that leverages structure information on Query-attentive Semantic Graph (QSG). The QSG node representation is improved by a query-attentive graph attention network, which spreads the information of the query node into QSG using Personalized PageRank. The proposed method achieves superior performance over state-of-the-art models on two QFS datasets.
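The way Personalized PageRank spreads information from a query node across a graph can be sketched with a short power iteration. This is a generic illustration on a toy word graph, not the QSG Transformer itself; the `personalized_pagerank` helper, the restart probability `alpha`, the iteration count, and the graph are assumptions made for the example.

```python
def personalized_pagerank(adjacency, source, alpha=0.15, iterations=50):
    """Power iteration for Personalized PageRank with restart at `source`.

    adjacency: mapping of node -> list of neighbour nodes; every node
    must appear as a key.
    alpha: probability of jumping back to the source at each step.
    """
    nodes = list(adjacency)
    scores = {n: 1.0 if n == source else 0.0 for n in nodes}
    for _ in range(iterations):
        # Restart mass goes to the source node only.
        nxt = {n: (alpha if n == source else 0.0) for n in nodes}
        for n in nodes:
            neighbours = adjacency[n]
            if not neighbours:
                # Dangling node: send its mass back to the source.
                nxt[source] += (1 - alpha) * scores[n]
                continue
            share = (1 - alpha) * scores[n] / len(neighbours)
            for m in neighbours:
                nxt[m] += share
        scores = nxt
    return scores


# Toy undirected chain: query - w1 - w2 - w3 (hypothetical graph).
adjacency = {
    "query": ["w1"],
    "w1": ["query", "w2"],
    "w2": ["w1", "w3"],
    "w3": ["w2"],
}
scores = personalized_pagerank(adjacency, source="query")
```

Nodes closer to the query receive more mass than distant ones, which is the property that lets query information decay smoothly with graph distance.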
Yue Dong, John Wieting, Pat Verga
The paper discusses how existing abstractive summarization systems generate text that is not directly inferable from the source alone, resulting in content hallucinations. These hallucinations are sometimes factual but unfaithful to the source. The paper suggests that these factual hallucinations occur due to the prevalence of factual yet unfaithful entities in summarization datasets. The authors find that these entities are examples of additional world knowledge being used to connect entities and concepts. They demonstrate that connecting entities to an external knowledge base can improve the factuality of summaries without making them more extractive.
Ryuji Kano, Takumi Takahashi, Toru Nishino, Motoki Taniguchi, Tomoki Taniguchi, Tomoko Ohkuma
The paper proposes a curriculum learning method for training summarization models on noisy data. The authors use sequence-to-sequence models and propose a model that can quantify noise from a single noisy corpus. They conduct experiments on three summarization models and show that their method improves performance. They also analyze how different curricula affect the performance of pretrained and non-pretrained summarization models. Human evaluation results also show that their method improves the performance of summarization models.
Max Grusky, Mor Naaman, Yoav Artzi
The paper introduces NEWSROOM, a dataset of 1.3 million articles and summaries from 38 major news publications, extracted from search and social media metadata between 1998 and 2017. The summaries demonstrate a high diversity of summarization styles, combining abstractive and extractive strategies. The authors analyze the extraction strategies used in NEWSROOM summaries and compare them to other datasets to evaluate its diversity and difficulty. They also train existing methods on the data to evaluate its utility and challenges. The dataset is available online at summari.es.
Ting-Yao Hsu, Yoshi Suhara, Xiaolan Wang
The paper proposes a new task of summarizing Community-based Question Answering (CQA) pairs to help users quickly digest key information. The authors design a multi-stage data annotation process and create a benchmark dataset, COQASUM, based on the Amazon QA corpus. They compare extractive and abstractive summarization methods and establish a strong baseline approach called DedupLED. The experiment confirms two key challenges for the CQA summarization task: sentence-type transfer and duplication removal. The data and code are publicly available.
Rishi Bommasani, Claire Cardie
The paper discusses the importance of high quality data for building statistical models in natural language processing (NLP), and the need to evaluate data quality during dataset construction or post hoc. It highlights that popular summarization datasets are often drawn from natural sources without quality assurance guarantees, and that data quality has gone largely unquestioned in recent summarization research. The authors introduce 5 intrinsic metrics and apply them to 10 popular datasets, finding that data usage in recent summarization research is sometimes inconsistent with the underlying properties of the datasets employed. They also discover that their metrics can serve as inexpensive heuristics for detecting low quality examples.
Mehwish Fatima, Michael Strube
The paper discusses the challenge of cross-lingual summarization and the lack of available resources for this task. To address this issue, the authors present a new dataset for monolingual and cross-lingual summarization in the English-German pair. They collected high-quality cross-lingual data from Spektrum der Wissenschaft and complemented it with a similar dataset from the Wikipedia Science Portal. The authors also conducted experiments with various summarization models and found that the proposed dataset is useful for monolingual and cross-lingual summarization.
Jia Jin Koay, Alexander Roustai, Xiaojin Dai, Dillon Burns, Alec Kerrigan, Fei Liu
The paper discusses the importance of meetings in organizations and the need for a meeting summarization system to help users quickly search and sift through large meeting collections. The authors analyze the impact of domain terminology, or jargon terms, on the performance of meeting summarization and find that it can have a substantial impact. They create gold-standard annotations for jargon terms on a sizable meeting corpus and publicly release all domain terminology to advance research in meeting summarization.
Amr Keleg, Matthias Lindemann, Danyang Liu, Wanqiu Long, Bonnie L. Webber
The paper discusses the impact of non-summary texts, specifically straplines, on the quality of news article summarization. The authors identify straplines as a common form of non-summary text that is often included in scraped corpora used for news summarization. They present a rule-based strapline detection method that achieves good performance and show that removing straplines and noise from the training data of a news summarizer results in higher quality summaries, with improvements as high as 7 points ROUGE score.
Sajad Sotudeh, Hanieh Deilamsalehy, Franck Dernoncourt, Nazli Goharian
The paper discusses the importance of training data in developing summarization systems and introduces a new large-scale summarization dataset called TLDR9+, containing over 9 million training instances extracted from the Reddit discussion forum. The dataset is specifically gathered for extreme summarization and is more than twice as large as the previously proposed dataset. The authors also distill a more fine-grained dataset called TLDRHQ by sampling high-quality instances from TLDR9+ with the help of human annotations. The paper further evaluates different state-of-the-art summarization models on the proposed datasets.
Vivian Lai, Alison Smith-Renner, Ke Zhang, Ruijia Cheng, Wenjuan Zhang, Joel Tetreault, Alejandro Jaimes
The paper discusses the potential benefits of human-AI collaboration in text summarization through post-editing. A study with 72 participants compared post-editing of provided summaries with manual summarization in terms of summary quality, human efficiency, and user experience on formal and informal text. The results suggest that post-editing can be useful in some cases but not in others, and participants' different editing strategies and needs for assistance offer implications for future human-AI summarization systems.
Umanga Bista, Alexander Mathews, Minjeong Shin, Aditya Krishna Menon, Lexing Xie
The paper discusses extractive summarization in a comparative setting, where the objective is to select a small number of documents that represent each group and distinguish them from other groups. The authors propose a new set of objective functions that connect recent literature on document summarization, interpretable machine learning, and data subset selection. They cast the problem as a binary classification among different groups and derive objectives based on the maximum mean discrepancy and a gradient-based optimization strategy. The authors evaluate comparative summarization methods on a newly curated collection of controversial news topics over 13 months and find that gradient-based optimization outperforms discrete and baseline approaches in 15 out of 24 different automatic evaluation settings. In crowd-sourced evaluations, summaries from gradient optimization elicit 7% more accurate classification from human workers than discrete optimization. The authors suggest that their formulation of comparative summarization will be useful in comparing content sources, authors, related topics, or distinct viewpoints.
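As a rough illustration of the quantity underlying these objectives, the sketch below computes a biased estimate of the squared maximum mean discrepancy (MMD) between two sets of document vectors. The function names, the RBF kernel, and the `gamma` value are assumptions chosen for illustration; this is not the authors' implementation.

```python
import math


def rbf_kernel(u, v, gamma=1.0):
    """RBF kernel between two equal-length vectors (kernel choice is an assumption)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)


def mmd_squared(xs, ys, gamma=1.0):
    """Biased (V-statistic) estimate of squared MMD between two sets of vectors."""

    def mean_kernel(a, b):
        # Average kernel value over all pairs drawn from a and b.
        total = sum(rbf_kernel(u, v, gamma) for u in a for v in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2 * mean_kernel(xs, ys)


if __name__ == "__main__":
    group_a = [[0.0, 0.0], [0.0, 1.0]]
    group_b = [[5.0, 5.0], [5.0, 6.0]]
    print(mmd_squared(group_a, group_a))  # identical sets
    print(mmd_squared(group_a, group_b))  # well-separated sets
```

Under this reading, a candidate summary is representative when its MMD to its own group is small, and distinguishing when its MMD to the other groups is large.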
Fajri Koto, Timothy Baldwin, Jey Han Lau
The paper discusses the importance of summaries, keyphrases, and titles in capturing the content of a document. The authors introduce LipKey, a news corpus with human-written abstractive summaries, absent keyphrases, and titles. They use multi-task training and joint structured inputs to improve transformer-based summarization models by including absent keyphrases and titles as additional context to the source document.
Sangwoo Cho, Kaiqiang Song, Xiaoyang Wang, Fei Liu, Dong Yu
The paper discusses the importance of text segmentation in understanding and summarizing long documents, particularly in transcripts of audio/video recordings. The authors propose an approach that simultaneously performs summarization and segmentation to learn robust sentence representations, which is further enhanced by an optimization-based regularizer to promote selection of diverse summary sentences. The approach was evaluated on multiple datasets and found to achieve state-of-the-art performance on publicly available benchmarks, with better cross-genre transferability when equipped with text segmentation. The paper also includes analyses to quantify the impact of section segmentation on summarizing written and spoken documents of substantial length and complexity.
Frederic Kirstein, Jan Philip Wahle, Terry Ruas, Bela Gipp
The paper explores the effects of task families on abstractive text summarization, specifically analyzing the influence of multi-task learning strategies using task families for the English language. The authors group tasks into three strategies and evaluate trained models through two downstream tasks, finding that certain combinations of task families positively impact downstream performance. They also find that choice and combinations of task families influence downstream performance more than the training scheme, supporting the use of task families for abstractive text summarization. The code is publicly available.
Nachshon Cohen, Oren Kalinsky, Yftah Ziser, Alessandro Moschitti
The paper discusses the challenges of evaluating summarization output from existing datasets, which are often curated from academic documents written for experts. To address this issue, the authors present a new dataset based on article summaries from the WikiHow website, which are written in plain language and focused on how-to articles. The authors compare their dataset to existing ones and show that it makes human evaluation more manageable and effective. A human evaluation conducted on PubMed and the proposed dataset supports their findings.
Minh-Tien Nguyen, Minh-Le Nguyen
The paper introduces a new dataset called SoLSCSum for social context summarization, consisting of 157 open-domain articles and their comments from Yahoo News that were manually annotated by two annotators to extract standard summaries. The dataset has a high inter-annotator agreement and can be used to train summarization methods such as SVM. The paper also demonstrates the potential use of the dataset by training a learning-to-rank model with local and cross features, which achieved significant improvements in document summarization over state-of-the-art baselines.
Seonil Son, Junsoo Park, Jeong-in Hwang, Junghwa Lee, Hyungjong Noh, Yeonsoo Lee
The paper discusses the challenge of measuring the factual consistency of generated text in summarization models. The authors propose a reference-free metric called HaRiM, which measures hallucination risk based on token likelihoods and correlates well with human judgment on three summary-quality annotation sets. They reinterpret a previously suggested objective as a hallucination risk measurement to better estimate summary quality without requiring additional training or alignment to human judgments. The authors hope their work will facilitate progress in both automated evaluation and generation of summaries.
Esin Durmus, Mona Diab
The paper discusses the issue of neural abstractive summarization models generating content inconsistent with the source document, and the inadequacy of existing automatic metrics to capture such mistakes. The authors propose an automatic question answering (QA) based metric for evaluating the faithfulness of generated summaries, which has a higher correlation with human faithfulness scores, especially on highly abstractive summaries. The authors also find that current models exhibit a trade-off between abstractiveness and faithfulness, with outputs having less word overlap with the source document being more likely to be unfaithful.
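One common way to quantify the "word overlap" side of this trade-off is the fraction of summary n-grams that never appear in the source. The sketch below is a minimal version of that idea, assuming whitespace tokenization and lowercasing; the function names are hypothetical and this is not the authors' measurement code.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def novel_ngram_ratio(source, summary, n=2):
    """Fraction of summary n-grams absent from the source (higher = more abstractive)."""
    source_ngrams = set(ngrams(source.lower().split(), n))
    summary_ngrams = ngrams(summary.lower().split(), n)
    if not summary_ngrams:
        # Summary shorter than n tokens: no n-grams to score.
        return 0.0
    novel = sum(1 for gram in summary_ngrams if gram not in source_ngrams)
    return novel / len(summary_ngrams)


if __name__ == "__main__":
    source = "the cat sat on the mat"
    for candidate in ("the cat sat", "the cat ran", "a dog ran"):
        print(candidate, novel_ngram_ratio(source, candidate))
```

A score like this only captures abstractiveness; judging faithfulness still requires a separate signal, such as the QA-based metric the paper proposes.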
Seyed Ali Bahrainian, Sheridan Feucht, Carsten Eickhoff
The paper discusses how text summarization models are improving and how existing benchmarking corpora may not reflect the full range of summarization needs. The paper introduces a new topical summarization corpus called NEWTS, which is based on the CNN/DailyMail dataset and annotated via online crowd-sourcing. Each source article is paired with two reference summaries, each focusing on a different theme of the source document. The paper evaluates existing techniques and analyzes the effectiveness of different prompting methods.
Shiyue Zhang, Asli Celikyilmaz, Jianfeng Gao, Mohit Bansal
The paper discusses the importance of summarizing conversation threads to improve work and communication efficiency. To aid in research on thread summarization, the authors developed an abstractive Email Thread Summarization dataset and conducted a study on different summarization techniques. The study revealed challenges in current abstractive summarization models, such as understanding the sender's intent and identifying the roles of sender and receiver. The authors also found that widely used automatic evaluation metrics are weakly correlated with human judgments, emphasizing the importance of human evaluation and the development of better metrics.
Piji Li, Haisong Zhang, Xiaojiang Liu, Shuming Shi
The paper discusses the challenges of generating text in rigid formats such as lyrics, sonnets, and classical Chinese poetry, which require adherence to strict formatting and rhyming schemes while maintaining sentence integrity. The authors propose a framework called SongNet, which is a Transformer-based auto-regressive language model that uses tailor-designed symbols to improve modeling performance. The attention mechanism is also improved to capture future information on the format. The framework is pre-trained and fine-tuned, and experiments show that it generates better results than existing methods in terms of both automatic metrics and human evaluation.
Mingda Chen, Zewei Chu, Sam Wiseman, Kevin Gimpel
The paper introduces a summarization dataset called SUMMSCREEN, which consists of pairs of TV series transcripts and human-written recaps. The dataset poses a challenge for abstractive summarization due to plot details being scattered throughout the transcript and the presence of content that does not directly relate to the central plot. The paper proposes two entity-centric evaluation metrics and evaluates several methods, including neural models and those based on nearest neighbors. An oracle extractive approach outperforms all benchmarked models, indicating that neural models are unable to fully exploit the input transcripts. Human evaluation and qualitative analysis show that non-oracle models are competitive with their oracle counterparts but generate unfaithful facts, suggesting future research directions.
Junjie Li, Haoran Li, Chengqing Zong
The paper discusses personalized review summarization, which generates a condensed summary for a user's review, accounting for their preference on different aspects or writing style. The proposed model, User-aware Sequence Network (USN), considers the user's characteristics when generating summaries and contains a user-aware encoder and decoder. The user-aware encoder selects important information from a review, and the user-aware decoder incorporates user characteristics and word-using habits to generate personalized summaries. The model was validated using a new dataset and achieved state-of-the-art performance on personalized review summarization. The paper focuses on single-review summarization and leaves adapting the model to multi-review summarization scenarios for future work. In the example provided, the review describes a hotel near the airport with a clean and comfortable room and a slightly high price, and the generated summary is "very quite room in a great location."
Daniel Deutsch, Rotem Dror, Dan Roth
The paper discusses the challenges in evaluating the quality of summarization evaluation metrics and proposes methods for calculating confidence intervals and running hypothesis tests for correlations using two resampling methods, bootstrapping and permutation. The authors evaluate the proposed methods through simulation experiments and apply them to several automatic evaluation metrics across three sets of human annotations. They find that confidence intervals are wide, indicating high uncertainty in the reliability of automatic metrics. However, two recent works, QAEval and BERTScore, show statistical improvements over ROUGE in some evaluation settings.
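The two resampling ideas can be sketched in a few lines. The example below computes a percentile bootstrap confidence interval for the Pearson correlation between metric scores and human scores, and a one-sided permutation p-value for that correlation. It is a simplified illustration under stated assumptions: it resamples individual score pairs and tests a single correlation against zero, whereas the paper's procedures compare metrics and resample over systems and inputs. All function names and defaults are hypothetical.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input has zero variance."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def bootstrap_ci(metric, human, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the metric-human correlation."""
    rng = random.Random(seed)
    n = len(metric)
    stats = []
    for _ in range(n_resamples):
        # Resample (metric, human) pairs with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(pearson([metric[i] for i in idx], [human[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def permutation_pvalue(metric, human, n_permutations=2000, seed=0):
    """One-sided p-value: how often shuffled human scores correlate at least as well."""
    rng = random.Random(seed)
    observed = pearson(metric, human)
    shuffled = list(human)
    count = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if pearson(metric, shuffled) >= observed:
            count += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (count + 1) / (n_permutations + 1)


if __name__ == "__main__":
    metric_scores = [0.31, 0.42, 0.28, 0.55, 0.47, 0.39, 0.61, 0.35]
    human_scores = [2.5, 3.0, 2.0, 4.0, 3.5, 3.5, 4.5, 2.5]
    print(bootstrap_ci(metric_scores, human_scores))
    print(permutation_pvalue(metric_scores, human_scores))
```

With only a handful of scored items the interval tends to be wide, which is consistent with the paper's observation about the uncertainty of metric reliability estimates.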
Pavlos Vougiouklis, Elena Simperl
The paper discusses a method for generating natural language summaries from knowledge base triples using a pointer-generator network. The network can generate regular words and verbalize triples in multiple ways. The approach was evaluated through automatic and human evaluations on single- and open-domain summary generation tasks, and it significantly outperformed other data-driven baselines.
Abram Handler, Prem Ganeshkumar, Brendan O’Connor, Mohamed AlTantawy, Slobodan Milosevic
The paper discusses concept maps, which are visual summaries of important concepts from a dataset, displayed as vertices with edges showing natural language descriptions of relationships between concepts. While previous attempts at creating concept maps have been static, the paper presents a model that responds to queries by returning short, importance-ranked, natural language descriptions of the relationship between two requested concepts for display in a visual interface. The model is trained on a new public dataset, and the code and data are available on GitHub.
Anastassia Kornilova, Vlad Eidelman
The paper introduces BillSum, the first dataset for summarization of US Congressional and California state bills. The authors explain the challenges in processing this type of data and benchmark extractive methods that consider neural sentence representations and traditional contextual features. They also demonstrate that models built on Congressional bills can be used to summarize California bills, showing that methods developed on this dataset can transfer to states without human-written summaries.
Priyam Tejaswin, Dhruv Naik, Pengfei Liu
The paper discusses the lack of understanding of the characteristics of datasets used to train and evaluate summarization systems, and how they affect system performance and reliability of metrics. The authors manually analyze 600 samples from three popular summarization datasets and classify them into six categories based on noise types and summarization difficulty. They then analyze 27 state-of-the-art summarization models and 5 popular metrics, and report their findings on the distinct data quality and complexity distributions of datasets, the dependence of model performance and metric reliability on sample complexity, and the low scores received by faithful summaries due to poor diversity of references. The authors also release the code, annotated data, and model outputs.
Tal Baumel, Raphael Cohen, Michael Elhadad
The paper discusses Query-Focused Summarization (QFS), which summarizes a document cluster in response to a specific input query. The authors note that current state-of-the-art algorithms for QFS do not significantly improve upon generic summarization methods, which ignore query relevance, when evaluated on traditional QFS datasets. They hypothesize that this is due to the high topic concentration in these datasets. To address this, they introduce a new QFS dataset with controlled levels of topic concentration and compare algorithms on this dataset. They report strong improvement in performance for algorithms that properly model query relevance and present three new QFS algorithms that outperform state-of-the-art methods on the new dataset.
Meng Cao, Yue Dong, Jackie Chi Kit Cheung
The paper discusses how abstractive summarization systems often generate content that is not directly inferable from the source text, known as "hallucinations." However, the authors found that much of this hallucinated content is factual and can provide useful background information in a summary. They propose a novel detection approach to separate factual from non-factual hallucinations of entities, using pre-trained and finetuned masked language models. Their approach outperforms five baselines and strongly correlates with human judgments. The authors also show that their detector, when used as a reward signal in an off-line reinforcement learning algorithm, significantly improves the factuality of summaries while maintaining the level of abstractiveness.
Shuaiqi Liu, Jiannong Cao, Zhiyuan Wen
The paper discusses the challenges of summarizing numerous academic papers into a structured summary and proposes a solution called BigSurvey, which is a large-scale dataset for generating comprehensive summaries of academic papers on each topic. The authors utilize target summaries from over 7,000 survey papers and their 430,000 reference papers' abstracts as input documents. They also propose a summarization method called category-based alignment and sparse transformer (CAST), which outperforms various advanced summarization methods.
Yifan Chen, Tamara Polajnar, Colin Batchelor, Simone Teufel
The paper introduces a new task of summarizing scientific articles in the chemistry domain into one or two-sentence table of contents entries. The authors use an open access publication corpus and evaluate their approach using state-of-the-art summarization methods.
Ming Zhong, Danqing Wang, Pengfei Liu, Xipeng Qiu, Xuanjing Huang
The paper discusses the current state of summarization datasets and how different factors of datasets affect the generalization behavior of neural extractive summarization models. The authors propose several properties of datasets that matter for the generalization of summarization models and analyze how different properties of datasets influence the choices of model structure design and training methods. They demonstrate that a deep understanding of dataset characteristics can lead to significant improvements in existing models.
Xiaojun Wan, Yue Hu
The paper discusses the challenges of document summarization for blind and visually impaired people and proposes a new system called BrailleSUM. The system takes into account the length of each sentence in news articles and uses an ILP-based summarization method. Evaluation results show that BrailleSUM can produce shorter braille summaries without sacrificing content quality.
Alexander R. Fabbri, Wojciech Kryściński, Bryan McCann, Caiming Xiong, Richard Socher, Dragomir Radev
The paper addresses the lack of consensus and comprehensive studies on evaluation metrics for text summarization. The authors re-evaluate 14 automatic evaluation metrics and benchmark 23 recent summarization models using these metrics. They also assemble the largest collection of summaries generated by models trained on the CNN/DailyMail news dataset and share it in a unified format. Additionally, they implement and share a toolkit for evaluating summarization models across a broad range of automatic metrics and assemble the largest and most diverse collection of human judgments of model-generated summaries on the CNN/DailyMail dataset. The authors hope that their work will promote a more complete evaluation protocol for text summarization and advance research in developing evaluation metrics that better correlate with human judgments.
Alexander R. Fabbri, Xiaojian Wu, Srini Iyer, Haoran Li, Mona Diab
The paper discusses the challenge of answer summarization in Community Question Answering (CQA) fora such as Stack Overflow and Yahoo! Answers, where each question thread can receive a large number of answers with different perspectives. The absence of a dataset to provide supervision for producing such summaries is a major obstacle. The paper introduces a novel dataset of 4,631 CQA threads for answer summarization curated by professional linguists. The pipeline gathers annotations for all subtasks of answer summarization, including relevant answer sentence selection, grouping these sentences based on perspectives, summarizing each perspective, and producing an overall summary. The paper also introduces a novel unsupervised approach for multi-perspective data augmentation that boosts summarization performance according to automatic evaluation. Finally, the paper proposes reinforcement learning rewards to improve factual consistency and answer coverage and analyzes areas for improvement.
Ahmed Magooda, Diane Litman
This paper discusses three techniques for improving abstractive summarization models without requiring additional data. These techniques include data synthesis with paraphrasing, data augmentation with sample mixing, and curriculum learning with two new difficulty metrics. The experiments conducted show that these techniques can improve summarization performance across two models and two small datasets, both when applied in isolation and when combined.
Manik Bhandari, Pranav Gour, Atabak Ashfaq, Pengfei Liu, Graham Neubig
The paper discusses the importance of automated evaluation metrics in text summarization tasks and highlights the need to re-evaluate the current standard metric, ROUGE, which has been used for almost 20 years. The authors assess the reliability of automatic metrics using top-scoring system outputs on modern datasets and systems, both abstractive and extractive, for system-level and summary-level evaluation settings. They find that conclusions about evaluation metrics on older datasets do not necessarily hold on modern datasets and systems. The authors release a dataset of human judgments collected from 25 top-scoring neural summarization systems, which can be found on GitHub.
Soham Poddar, Azlaan Mustafa Samad, Rajdeep Mukherjee, Niloy Ganguly, Saptarshi Ghosh
The paper discusses the societal challenge of convincing people to get vaccinated against COVID-19 and the use of social media analysis to understand specific concerns people have towards vaccines. The authors have curated CAVES, a large-scale dataset of about 10k COVID-19 anti-vaccine tweets labeled into various specific anti-vaccine concerns in a multi-label setting. This is the first multi-label classification dataset that provides explanations for each label and class-wise summaries of all tweets. Preliminary experiments show that this is a challenging dataset for multi-label explainable classification and tweet summarization.
Alexander R. Fabbri, Faiaz Rahman, Imad Rizvi, Borui Wang, Haoran Li, Yashar Mehdad, Dragomir Radev
The paper discusses the lack of standardized datasets for summarizing online discussions, which has resulted in abstractive text summarization primarily focusing on news articles. To address this gap, the authors design annotation protocols to crowdsource four new datasets on diverse online conversation forms. They benchmark state-of-the-art models on these datasets and analyze characteristics associated with the data. They also evaluate these models on widely-used conversation summarization datasets to establish strong baselines in this domain. The authors incorporate argument mining through graph construction to directly model the issues, viewpoints, and assertions present in a conversation and filter noisy input, showing comparable or improved results according to automatic and human evaluations.
Ojas Ahuja, Jiacheng Xu, Akshay Gupta, Kevin Horecka, Greg Durrett
The paper discusses the limitations of generic and query-based summaries and proposes aspect-oriented summaries that focus on high-level topics discussed among similar types of documents. The authors collected a dataset of aspect-oriented summaries for articles in news sub-domains and evaluated existing techniques for generating such summaries without in-domain training data. They compared different training schemes and found that their final approach produced focused summaries that were better than those from a generic summarization system or keyword matching, and that the system was sensitive to the choice of keywords.
Yiran Chen, Pengfei Liu, Ming Zhong, Zi-Yi Dou, Danqing Wang, Xipeng Qiu, Xuanjing Huang
The paper discusses the limitations of existing evaluation methods for text summarization models, which are typically trained and evaluated on the same dataset. The authors argue that this approach can narrow our understanding of the generalization ability of different summarization systems. To address this, they perform an in-depth analysis of different datasets and investigate the performance of 11 representative summarization systems on 5 datasets from different domains under a cross-dataset setting. The study reveals the effect of model architectures and generation paradigms (i.e., abstractive and extractive) on model generalization ability and sheds light on the limitations of existing summarizers. Supplementary code can be found on their GitHub page.
Yang Liu, Chenguang Zhu, Michael Zeng
The paper introduces a new way of digesting news content by segmenting a news article into multiple sections and generating corresponding summaries for each section. The authors create a dataset called SEGNEWS, consisting of 27k news articles with sections and aligned heading-style section summaries. They propose a novel segmentation-based language generation model adapted from pretrained language models that can jointly segment a document and produce the summary for each section. Experimental results on SEGNEWS show that their model outperforms several state-of-the-art sequence-to-sequence generation models for this task.
Prasetya Ajie Utama, Joshua Bambrick, Nafise Sadat Moosavi, Iryna Gurevych
The paper discusses how neural abstractive summarization models can generate summaries that are factually inconsistent with their source documents. Previous attempts to recognize such inconsistencies using natural language inference (NLI) have been unsuccessful due to the models' inability to generalize to the task. The authors propose a data generation pipeline called Falsesum, which uses a text generation model to introduce varying types of factual inconsistencies into human-annotated summaries. The resulting dataset contains diverse yet plausible examples, and models trained on it improve performance on four benchmarks for detecting factual inconsistency in summarization.
Yang Deng, Wai Lam, Yuexiang Xie, Daoyuan Chen, Yaliang Li, Min Yang, Ying Shen
The paper discusses the issues of redundancy and lengthiness in crowdsourced answers in Community Question Answering (CQA), which limit the performance of answer selection and lead to difficulties for community users. To solve these problems, the authors propose a novel joint learning model that tackles the tasks of answer selection and answer summary generation in CQA. They design a question-driven pointer-generator network that exploits the correlation information between question-answer pairs to help attend to the essential information when generating answer summaries. They also leverage the answer summaries to alleviate noise in the original lengthy answers when ranking the relevance of question-answer pairs. The authors construct a new large-scale CQA corpus, WikiHowQA, which contains long answers for answer selection as well as reference summaries for answer summarization. The experimental results show that the joint learning method effectively addresses the answer redundancy issue in CQA and achieves state-of-the-art results on both answer selection and text summarization tasks. The proposed model is also shown to transfer well to resource-poor CQA tasks that lack reference answer summaries.
Miguel Arana-Catania, Rob Procter, Yulan He, Maria Liakata
The paper discusses the summarization of deliberative processes in non-English languages, which involves combining multiple narratives of poor grammatical quality in a single text. The authors evaluate various abstractive summarization models in combination with a machine translation model, and report promising results in terms of fluency, consistency, and relevance of the summaries produced. The approach is easy to implement for many languages by changing the translation model.
Shiyue Zhang, Mohit Bansal
The paper discusses the challenges of evaluating summarization tasks using human evaluation and automatic metrics. The authors propose a flexible family of semi-automatic to automatic summary evaluation metrics called LitePyramid, which uses a natural language inference model and a semantic role labeling model to replace manual work. LitePyramid is compared to 15 existing metrics and evaluated on three meta-evaluation datasets and a newly collected dataset. The results show that LitePyramid consistently has the best summary-level correlations and can reduce costs for future data collection.
Vivek Gupta, Prerna Bharti, Pegah Nokhiz, Harish Karnick
The paper discusses the limitations of text summarization models that are trained on news article datasets, where the summary is typically located at the beginning of the text. To address this issue, the authors created a new dataset called SUMPUBMED, which contains scientific articles from the PubMed archive. The summary in SUMPUBMED is distributed throughout the text and contains rare domain-specific scientific terms, making it challenging for seq2seq models that are trained on news articles to summarize effectively. The authors conclude that SUMPUBMED provides new opportunities for improving text summarization models and developing new evaluation metrics.
Reinald Kim Amplayo, Stefanos Angelidis, Mirella Lapata
The paper discusses the challenges of using deep learning techniques for summarizing reviews due to the lack of large-scale datasets. The authors propose a method that incorporates content planning into the summarization model, which improves the quality of the output and allows for the creation of more natural synthetic datasets. The content plans are generated from aspect and sentiment distributions induced from data without expensive annotations. The synthetic datasets are created by sampling pseudo-reviews from a Dirichlet distribution, and the model generates summaries based on input reviews and induced content plans. Experimental results show that their approach outperforms other models in generating informative, coherent, and fluent summaries that capture opinion consensus.
Menglin Xia, Ekaterina Kochmar, Ted Briscoe
The paper discusses the development of a tool for assessing learner reading comprehension through automated assessment of their summaries. The authors propose three novel approaches to assess the summaries and evaluate them on two datasets they created. The results show that their models outperform traditional approaches and produce quality assessments close to those of professional examiners.
Cai Yang, Stephen Wan
The paper discusses the LongSumm shared task, which focuses on long document summarization and has been limited by its use of a single family of metrics for evaluation. The authors replicated the evaluation using multiple test set samples and found that the use of additional metrics revealed high-quality summaries missed by the original metrics. They also suggest that SPICE could be a candidate metric for summarization evaluation in LongSumm. The relative ranking of systems changed under this more rigorous evaluation, but some key learnings from previous years still held.
Qingyu Zhou, Furu Wei, Ming Zhou
The paper discusses the effectiveness of extractive methods in automatic document summarization and proposes extracting sub-sentential units instead of full sentences. The authors show that extracting full sentences can introduce redundant and unnecessary content, and present a neural extractive model that leverages sub-sentential information. The experiments and analyses demonstrate that extracting sub-sentential units performs competitively compared to full sentence extraction. The paper provides inspiration for future research on the basic extraction units in extractive summarization.
The paper proposes a two-step method to interpret the decisions made by neural abstractive summarization models. The first step involves analyzing the model's behavior to categorize each decoder decision into one of several generation modes. The second step involves interpreting the decisions using different attribution methods to determine their importance for the generation of the next token. The paper demonstrates the method's capability to identify phrases the summarization model has memorized and determine where in the training pipeline this memorization happened, as well as study complex generation phenomena like sentence fusion on a per-instance basis.
Logan Lebanoff, John Muchovej, Franck Dernoncourt, Doo Soon Kim, Lidan Wang, Walter Chang, Fei Liu
The paper discusses the challenge of fusing sentences with disparate content to create informative and succinct summaries, which is a task that humans can easily perform but is difficult for modern abstractive summarizers. The authors propose introducing the notion of points of correspondence, which are cohesive devices that tie any two sentences together into a coherent text, and provide a dataset containing human annotations of points of correspondence between sentences. The dataset bridges the gap between coreference resolution and summarization and can serve as a basis for future work to measure the success of sentence fusion systems.
Yassine Mrabet, Dina Demner-Fushman
The paper discusses the need for evaluation measures in document summarization that can rank systems based on individual summaries rather than just an average score. It highlights the limitations of current measures like ROUGE and BLEU, which are lexical in nature and not ideal for training neural networks. The authors propose a new hybrid evaluation measure called HOLMS, which combines language models and lexical similarity measures. They demonstrate through experiments that HOLMS outperforms ROUGE and BLEU in correlation with human judgments on several extractive summarization datasets for both linguistic quality and pyramid scores.
Anna Jørgensen, Anders Søgaard
The paper discusses how summarization systems are evaluated by human annotators and raters, who are often recruited through platforms with skewed demographics. The authors argue that this can lead to bias in system development and evaluation, as summary evaluation is sensitive to protected attributes. They suggest building models that cater to all groups rather than just some.
Masato Takatsuka, Tetsunori Kobayashi, Yoshihiko Hayashi
The paper proposes a methodology for identifying inconsistency errors in summarization. A synthetic dataset is created to train a model called SumPhrase, which can detect factual errors in summarization more effectively than existing weakly supervised methods. Jointly identifying the original sentences that correspond to each error is shown to improve error detection accuracy.
Elaheh ShafieiBavani, Mohammad Ebrahimi, Raymond Wong, Fang Chen
The paper discusses the limitations of the ROUGE evaluation metric for text summarization, which only considers surface similarities between summaries and cannot accurately assess summaries with lexical variations and paraphrasing. The authors propose a graph-based approach to incorporate both lexical and semantic similarities into ROUGE. The results of experiments on TAC AESOP datasets show that this approach improves the correlation between ROUGE and human judgments.
Shahin Rahbariasl, Mark D. Smucker
The paper discusses the importance of relevance assessing in applications such as high-recall retrieval and test collection construction. The authors conducted a user study with 60 participants to investigate the impact of time limits and document size on relevance assessing. They found that using a time limit as short as 15 seconds or judging document summaries in place of full documents could significantly speed judging without significantly affecting judging quality. Participants found judging document summaries with a 60 second time limit to be the easiest and best experience. The authors suggest that high quality document summaries can provide the same speed benefits as time limits while improving the judging experience for assessors.
Rezvaneh Rezapour, Rosie Jones, Sravana Reddy, Ian Soboroff
This paper discusses the motivation behind abstractive summarization of podcasts, which is driven by the increasing popularity of podcasts and the needs of their listeners. The authors note that podcasting is a unique domain that differs from news and other media commonly studied in automatic summarization research. The study uses a collection of podcast summaries generated by different algorithms and human judgments of summary quality from the TREC 2020 Podcasts Track to explore the correlations between various automatic evaluation metrics and human judgments, as well as the linguistic aspects of summaries that lead to strong evaluations. The qualities of a good podcast summary are still unknown, and this study aims to shed light on this topic.
Taehee Jung, Dongyeop Kang, Lucas Mentch, Eduard Hovy
The paper explores the biases and sub-aspects of summarization systems, specifically position, importance, and diversity, across nine different summarization corpora. The study finds that while position exhibits substantial bias in news articles, this is not the case for academic papers and meeting minutes. Additionally, different types of summarization systems are composed of different degrees of the sub-aspects. The study provides useful lessons for developing new summarization systems and collecting new summarization datasets.
Ming Zhong, Pengfei Liu, Danqing Wang, Xipeng Qiu, Xuanjing Huang
The paper discusses the success of deep neural networks in text summarization, but notes that there is still much to be understood about why they work so well and how they can be improved. The authors explore different model architectures, transferable knowledge, and learning schemas to improve neural extractive summarization systems. They also present a new framework that achieves state-of-the-art results on CNN/DailyMail. The paper aims to provide insights for future research on extractive summarization and the source code is available on GitHub.
Oleg Vasilyev, John Bohannon
The paper challenges the commonly held belief that a summary quality measure is best judged by how closely it correlates with quality scores produced by human annotators. The authors present observations that question this view and propose an alternative criterion for selecting the best measure from a group of measures that does not rely on human scores.
Philippe Laban, Tobias Schnabel, Paul N. Bennett, Marti A. Hearst
The paper discusses the importance of factual consistency in summaries and the limitations of natural language inference (NLI) models for inconsistency detection. The authors propose a new method called SUMMACCONV, which segments documents into sentence units and aggregates scores between pairs of sentences, enabling NLI models to be successfully used for this task. They also introduce a new benchmark called SUMMAC, consisting of six large inconsistency detection datasets. On this benchmark, SUMMACCONV obtains state-of-the-art results with a balanced accuracy of 74.4%, a 5% improvement compared with prior work.
Dandan Huang, Leyang Cui, Sen Yang, Guangsheng Bao, Kun Wang, Jun Xie, Yue Zhang
The paper discusses the current state of text summarization using deep learning and highlights the gaps that still exist between automatic summarizers and human professionals. The authors use the Multidimensional Quality Metric to identify 8 major sources of errors on 10 representative summarization models. They find that extractive summarizers are generally better than abstractive ones in terms of faithfulness and factual-consistency. They also note that pre-training techniques, particularly sequence-to-sequence pre-training, are highly effective for improving text summarization, with BART being the most effective. The paper provides insights into the strengths and limitations of different summarization techniques and highlights areas for future research.
Yiran Chen, Pengfei Liu, Xipeng Qiu
The paper discusses the importance of generating summaries that are not only fluent and informative but also factually correct, and the rapid development of the field of factual evaluation. However, the meta-evaluation methodologies of factuality metrics are limited by their opacity, leading to insufficient understanding of their relative advantages and applicability. The paper presents an adversarial meta-evaluation methodology that diagnoses the strengths and weaknesses of 6 existing top-performing metrics over 24 diagnostic test datasets and searches for directions for further improvement by data augmentation. The authors propose several calls for future research and make all code, diagnostic test datasets, and trained factuality models available.
Yuning Mao, Liyuan Liu, Qi Zhu, Xiang Ren, Jiawei Han
The paper proposes a new evaluation setup for extractive summarization that focuses on assessing the information coverage in extracted summaries. This setup involves treating each sentence in the reference summary as a facet and identifying the sentences in the document that express the semantics of each facet as support sentences. The evaluation is then performed by comparing the indices of extracted sentences and support sentences of all the facets in the reference summary. The authors construct an extractive version of the CNN/Daily Mail dataset to facilitate this new evaluation setup. They demonstrate that it is more effective than commonly adopted metrics like ROUGE: it correlates better with human judgment, enables fine-grained evaluation and comparative analysis, and reveals valuable insights into state-of-the-art summarization methods.
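The index comparison described above can be illustrated with a minimal sketch. It assumes a facet counts as covered when at least one of its support sentences is among the extracted sentences, and it reports the fraction of covered facets. The function name `facet_recall` and the example indices are hypothetical and are not taken from the paper.

```python
from typing import List, Set


def facet_recall(extracted: Set[int], facet_supports: List[Set[int]]) -> float:
    """Fraction of facets that have at least one support sentence
    among the extracted sentence indices (assumed coverage rule)."""
    if not facet_supports:
        return 0.0
    covered = sum(1 for support in facet_supports if extracted & support)
    return covered / len(facet_supports)


# Hypothetical example: three facets, each with the document sentence
# indices that express it.
supports = [{0, 4}, {2}, {7, 9}]
extracted_indices = {0, 2, 5}
print(facet_recall(extracted_indices, supports))
```

Because the comparison works on sentence indices rather than on surface text, it applies only to extractive systems whose output can be mapped back to document positions.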
Ruijia Cheng, Alison Smith-Renner, Ke Zhang, Joel R. Tetreault, Alejandro Jaimes
The paper explores the role of humans in automatic text summarization systems and the design considerations for human-AI interaction in text generation tasks. The authors conducted a literature review and developed a taxonomy of five interactions in AI-assisted text generation. They designed text summarization prototypes for each interaction and interviewed 16 users to understand their expectations, experience, and needs regarding efficiency, control, and trust with AI in text summarization. The paper proposes design considerations for human-AI interaction in text summarization and broader text generation tasks.
Xinnuo Xu, Ondřej Dušek, Jingyi Li, Verena Rieser, Ioannis Konstas
The paper discusses the difficulty of evaluating abstractive summarization using standard word-overlap-based metrics, and introduces a new evaluation metric based on fact-level content weighting. The metric relates the facts of the document to the facts of the summary, and assumes that a good summary will reflect all relevant facts present in the human-generated reference summary. The authors confirm this hypothesis by showing that their weightings are highly correlated to human perception and compare favorably to a recent manual highlight-based metric.
Ping Chen, Fei Wu, Tong Wang, Wei Ding
The paper discusses the challenge of assessing the quality of Natural Language Processing and Computational Linguistics applications that generate new texts based on existing texts. Specifically, the paper focuses on the problem of pinpointing content differences between two text passages, especially for large passages such as articles and books. The authors propose a new approach that treats one text passage as a small knowledge base and asks it a large number of questions to identify all content points. By comparing the correctly answered questions from two text passages, the authors are able to compare their content precisely. An experiment using the DUC 2007 summarization corpus shows promising results.
Nikita Salkar, Thomas Trikalinos, Byron C. Wallace, Ani Nenkova
The paper analyzes self-repetition in the output of neural summarizers, measuring it as the number of repeated n-grams of length four or longer. Three popular architectures (BART, T5, and Pegasus) are analyzed, and it is found that BART is particularly prone to self-repetition. Fine-tuning on more abstractive data and data featuring formulaic language is associated with a higher rate of self-repetition. Qualitative analysis reveals that systems produce artefacts such as ads and disclaimers unrelated to the content being summarized, as well as formulaic phrases common in the fine-tuning domain. The paper suggests that their approach to corpus level analysis of self-repetition may help practitioners clean up training data for summarizers and ultimately support methods for minimizing the amount of self-repetition.
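A rough sketch of this kind of corpus-level measurement follows. It assumes whitespace tokenization and a fixed n-gram length of four, and it counts, for each summary, the distinct 4-grams that also occur in at least one other summary. The paper's exact counting procedure may differ, and the function names and example summaries are hypothetical.

```python
from collections import Counter
from typing import List, Set, Tuple


def ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    """Distinct n-grams of a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def self_repetition_counts(summaries: List[str], n: int = 4) -> List[int]:
    """For each summary, count its distinct n-grams that also appear
    in at least one other summary of the same collection."""
    per_summary = [ngrams(s.lower().split(), n) for s in summaries]
    # Number of summaries in which each n-gram occurs.
    doc_freq = Counter(g for grams in per_summary for g in grams)
    return [sum(1 for g in grams if doc_freq[g] > 1) for grams in per_summary]


# Hypothetical system outputs; the first two share a formulaic phrase.
summaries = [
    "click here to read the full story today",
    "the mayor spoke and click here to read more",
    "rain is expected across the region tomorrow",
]
print(self_repetition_counts(summaries))
```

Summaries with high counts are candidates for manual inspection, for example to spot boilerplate inherited from the fine-tuning data.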
Maxime Peyrard, Judith Eckle-Kohler
The paper introduces a new framework for evaluating extractive summarizers based on an optimization problem. It shows that every extractive summarizer can be broken down into an objective function and an optimization technique. The authors compare and evaluate several objective functions in well-known summarizers and analyze their correlation with human judgments. The comparison across two datasets provides surprising insights into the role and performance of objective functions in different summarizers.
Joshua Maynez, Shashi Narayan, Bernd Bohnet, Ryan McDonald
The paper examines the limitations of neural text generation models for abstractive document summarization and finds that these models often generate content that is unfaithful to the input document. A large scale human evaluation of several neural abstractive summarization systems was conducted to better understand the types of hallucinations they produce. The analysis shows that pretrained models are better summarizers in terms of generating faithful and factual summaries as evaluated by humans. Textual entailment measures are found to better correlate with faithfulness than standard metrics, potentially leading to better automatic evaluation metrics and training and decoding criteria.
Wojciech Kryściński, Nitish Shirish Keskar, Bryan McCann, Caiming Xiong, Richard Socher
The paper discusses the current state of text summarization, which aims to condense long documents into shorter versions while retaining important information. Despite increased interest and research, progress on benchmark datasets has stalled. The authors identify three primary issues: 1) datasets may contain noise and are underconstrained, 2) evaluation metrics do not account for important factors such as factual correctness, and 3) models overfit to biases in current datasets and lack diversity in their outputs.
Mateusz Krubiński, Pavel Pecina
The paper discusses the use of COMET, a neural-based evaluation metric for Machine Translation systems, for evaluating Text Summarization systems. Despite being trained on multilingual MT outputs, COMET performs well in monolingual settings for predicting summarization output quality. The authors introduce a variant of the model, COMES, trained on annotated summarization outputs using MT data for pre-training. The performance of COMES is examined on several datasets with human judgments for different notions of summary quality, across various domains and languages.
Daniel Deutsch, Tania Bedrax-Weiss, Dan Roth
The paper proposes a new metric, QAEval, to evaluate the content quality of a summary using question-answering (QA) instead of traditional text overlap based metrics such as ROUGE. QA-based methods directly measure a summary's information overlap with a reference, making them fundamentally different than text overlap metrics. The authors demonstrate the experimental benefits of QA-based metrics through an analysis of QAEval, which outperforms current state-of-the-art metrics on most evaluations using benchmark datasets. The authors also identify the performance bottlenecks of QAEval and estimate that its potential upper-bound performance surpasses all other automatic metrics, approaching that of the gold-standard Pyramid Method.
Ori Shapira, David Gabay, Yang Gao, Hadar Ronen, Ramakanth Pasunuru, Mohit Bansal, Yael Amsterdamer, Ido Dagan
The paper discusses the importance of manual evaluation in summary evaluation methodology and the traditional Pyramid protocol, which is reliable but expensive and requires expertise. Cheaper and less thorough manual evaluation methods have been used instead, but the authors propose a lightweight sampling-based version of the Pyramid approach that can be crowdsourced. They analyze the performance of their method and release their crowdsourced Summary-ContentUnits and crowdsourcing scripts for future evaluations.
The paper argues that establishing theoretical models of Importance will advance our understanding of summarization and improve summarization systems. The authors propose definitions of Redundancy, Relevance, and Informativeness, and show how Importance arises as a single quantity that unifies these concepts. The paper also provides intuitions to interpret the proposed quantities and experiments to demonstrate the potential of the framework to inform and guide subsequent works.
Krtin Kumar, Jackie Chi Kit Cheung
The paper examines the performance of neural abstractive summarizers in generating summary texts and their ability to understand deeper syntactic and semantic structures. The authors generate a set of contrastive summaries and test whether existing neural summarizers score them more highly than human-written summaries. They find that these systems fail to understand the source text in a majority of cases.
Raghuram Vadapalli, Litton J Kurisinkel, Manish Gupta, Vasudeva Varma
The paper introduces a new metric called Semantic Similarity for Abstractive Summarization (SSAS) that evaluates system-generated summaries at a semantic inference level. Previous approaches relied on word or syntactic sub-sequence overlap, which cannot evaluate summaries at this level. SSAS uses natural language inference and paraphrasing techniques to weigh quantities representing agreement, contradiction, topical neutrality, paraphrasing, and optionally ROUGE score between a system-generated and human-written summary.
Hardy, Shashi Narayan, Andreas Vlachos
The paper discusses the challenges of manual evaluation of system-generated summaries and proposes a novel approach called HIGHlight-based Reference-less Evaluation of Summarization (HIGHRES). This approach involves assessing summaries against the source document using manually highlighted salient content. The authors validate their approach by employing crowd-workers to augment a dataset and compare two state-of-the-art systems. They demonstrate that HIGHRES improves inter-annotator agreement and helps emphasize differences among systems that would be ignored under other evaluation approaches.
Ukyo Honda, Tsutomu Hirao, Masaaki Nagata
The paper introduces a new automatic evaluation measure for summarization called pruned Basic Elements (pBE). It addresses the weakness of the widely used BE concept, which redundantly matches basic elements. pBE prunes basic elements by disregarding frequency count and reducing semantically overlapped elements based on word similarity. The study shows that pBE outperforms ROUGE in DUC datasets and achieves the highest rank correlation coefficient in TAC 2011 AESOP task.
Yuexiang Xie, Fei Sun, Yang Deng, Yaliang Li, Bolin Ding
The paper discusses the issue of factual inconsistency in generated summaries despite significant progress in text summarization. The authors propose a novel metric to evaluate factual consistency in text summarization via counterfactual estimation, which removes the effect of language prior from the total causal effect on the generated summary. This provides a simple yet effective way to evaluate consistency without relying on other auxiliary tasks. The authors conduct experiments on three public abstractive text summarization datasets and demonstrate the advantages of the proposed metric in improving the correlation with human judgments and the convenience of usage. The source code is available at https://github.com/xieyxclack/factual_coco.
Forrest Sheng Bao, Minghui Qiu, Yinfei Yang, Cen Chen
The paper discusses the limitations of current automatic summary evaluation metrics, which focus on lexical similarity and require a reference summary. The authors propose a weakly supervised approach that does not require a reference summary, using existing summarization datasets and pairing documents with corrupted reference summaries for training. In cross-domain tests, their approach outperforms baselines and shows advantages in gauging linguistic qualities over all metrics.
Leonardo F. R. Ribeiro, Mengwen Liu, Iryna Gurevych, Markus Dreyer, Mohit Bansal
The paper discusses the limitations of current abstractive summarization approaches, which often generate summaries that are not factually consistent with the source document. The authors propose a new method called FACTGRAPH, which decomposes the document and summary into structured meaning representations (MR) to better evaluate factuality. FACTGRAPH encodes these MRs using a graph encoder and text encoder, and experiments show that it outperforms previous approaches by up to 15% in identifying factual errors and inconsistencies.
The paper proposes a new evaluation approach for automatic summarization systems based on pairwise preferences of sentences, which is simpler and cheaper to obtain than gold standard summaries. The authors show that humans can provide useful feedback using this approach, and that it outperforms the three most popular versions of ROUGE with less expensive human input. Additionally, the framework can reuse already available evaluation data to achieve even better results.
The paper discusses the issue of evaluating automatic summarization systems using human judgments. The current human judgment datasets were created during the DUC/TAC shared tasks, but modern systems are better than the best systems submitted at that time. The paper shows that evaluation metrics which behave similarly on these datasets strongly disagree in the higher-scoring range where current systems operate. This creates a problem as we cannot decide which metric to trust. The paper calls for collecting human judgments for high-scoring summaries to resolve this debate and improve summarization systems and metrics.
Yanjun Gao, Chen Sun, Rebecca J. Passonneau
The paper discusses the development of a method called Pyramid evaluation, which assesses the content of paragraph-length summaries of source texts. This method involves creating a pyramid that lists distinct units of content found in several reference summaries, weights them based on how many reference summaries they occur in, and produces three scores based on the weighted content of new summaries. The paper presents an automated version of this method that is more efficient, transparent, and complete than previous automated pyramid methods. The new method is tested on a dataset of student summaries and historical NIST data from extractive summarizers.
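The weighting step described above can be sketched as follows. The sketch assumes content units have already been identified and matched (the hard part that the paper automates), and it represents them as plain string labels. It computes a single normalized score, the observed weight divided by the best weight attainable with the same number of units, rather than the three scores mentioned above. All names and data are hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def build_pyramid(reference_units: List[Iterable[str]]) -> Counter:
    """Weight each content unit by the number of reference summaries
    that contain it."""
    weights: Counter = Counter()
    for units in reference_units:
        weights.update(set(units))  # a reference counts each unit once
    return weights


def pyramid_score(summary_units: Iterable[str], weights: Counter) -> float:
    """Observed weight divided by the maximum weight reachable with
    the same number of content units."""
    units = set(summary_units)
    if not units or not weights:
        return 0.0
    observed = sum(weights[u] for u in units)  # unseen units weigh 0
    ideal = sum(sorted(weights.values(), reverse=True)[:len(units)])
    return observed / ideal


# Hypothetical content units from three reference summaries.
references = [["a", "b", "c"], ["a", "b"], ["a", "d"]]
pyramid = build_pyramid(references)
print(pyramid_score(["a", "c"], pyramid))
```

In this example unit "a" has weight 3 and "c" has weight 1, while the best two-unit summary would reach 3 + 2, so the summary is penalized for choosing a low-weight unit over "b".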
Tanya Goyal, Jiacheng Xu, Junyi Jessy Li, Greg Durrett
The paper discusses the fine-tuning process of pre-trained language models for summarization tasks and analyzes the training dynamics for generation models. The study focuses on different datasets and summary properties, such as abstractiveness and hallucination, to understand what the model learns at different stages of its fine-tuning process. The authors find that the model learns to copy the input early in the training process consistently across all datasets studied, while factual errors are learned in the later stages, though this behavior is more varied across domains. Based on these observations, the authors explore complementary approaches for modifying training to achieve different goals, such as improving factuality or improving abstractiveness.
Chris Kedzie, Kathleen McKeown
The paper discusses experiments with deep learning models of summarization in various domains, finding that many sophisticated features do not improve performance over simpler models. This suggests that creating a summarizer for a new domain may be easier than previously thought, and questions the benefit of deep learning models for summarization in domains with massive datasets. The paper suggests that new forms of sentence representations or external knowledge sources are needed for better summarization.
Logan Lebanoff, John Muchovej, Franck Dernoncourt, Doo Soon Kim, Seokhwan Kim, Walter Chang, Fei Liu
The paper examines how abstractive summarization systems combine information from multiple sentences to form summary sentences. The researchers analyzed the outputs of five state-of-the-art summarizers and found that while the summary sentences were mostly grammatical, they often failed to remain faithful to the original article. The study highlights the need for further research in this area to improve the accuracy of abstractive summarization systems.
Xiangru Tang, Alexander R. Fabbri, Ziming Mao, Griffin Adams, Borui Wang, Haoran Li, Yashar Mehdad, Dragomir Radev
The paper discusses the issue of factual inconsistencies in current pre-trained models used for summarization and the need to evaluate the factual consistency of summaries to develop better models. The authors conducted crowdsourced evaluations using two different methods to determine the factors that affect the reliability of human evaluation. They found that the ranking-based Best-Worst Scaling method is more reliable than the rating-based Likert Scale method, which highly depends on the target dataset and evaluation design. To improve crowdsourcing reliability, they extended the Likert rating scale and presented a scoring algorithm for Best-Worst Scaling called value learning. The authors also made their crowdsourcing guidelines publicly available to facilitate future work on factual consistency in summarization.
Tsutomu Hirao, Hidetaka Kamigaito, Masaaki Nagata
The paper discusses the automation of the pyramid method, a manual evaluation framework. The authors transform human-made reference summaries into extractive reference summaries consisting of Elementary Discourse Units (EDUs) from source documents. They then weight each EDU by counting the number of extractive reference summaries that contain it. The summary is scored based on the correspondences between EDUs in the summary and those in the pyramid. The authors conducted experiments on DUC and TAC data sets and found that their methods strongly correlate with various manual evaluations.
Daniel Deutsch, Rotem Dror, Dan Roth
The paper discusses the reliability of automatic summarization evaluation metrics in replicating human judgments of summary quality. The authors identify two inconsistencies in the definition of system-level correlation and propose changes to address them. First, they suggest using the full test set instead of a subset judged by humans to calculate the system score for an automatic metric, leading to more precise estimates of system-level correlations. Second, they propose calculating correlations only on pairs of systems with small differences in automatic scores, which are commonly observed in practice. The authors demonstrate that the best estimate of the correlation of ROUGE to human judgments is near 0 in realistic scenarios, highlighting the need for more high-quality human judgments and improved automatic metrics when differences in system scores are small.
Alex Wang, Kyunghyun Cho, Mike Lewis
The paper discusses the limitations of abstractive summarization models due to frequent factual inconsistencies in their output. Existing automatic evaluation metrics are not sensitive to such errors. The authors propose QAGS, an automatic evaluation protocol that identifies factual inconsistencies in generated summaries by asking questions about the summary and its source. QAGS has higher correlations with human judgments of factual consistency than other automatic evaluation metrics and provides interpretability by indicating which tokens of a summary are inconsistent and why. The authors believe QAGS is a promising tool for automatically generating usable and factually consistent text. Code for QAGS is available on GitHub.
Liam Scanlon, Shiwei Zhang, Xiuzhen Zhang, Mark Sanderson
The paper discusses the effectiveness of extractive-abstractive hybrid summarization in generating concise summaries for long documents. Two approaches to hybrid summarization, extraction-then-abstraction and extraction-with-abstraction, are compared and evaluated through large-scale experiments. The study examines the generalization of the algorithms by testing them within and across news domains and comparing automatic assessments to human judgments. The results show that the extraction-then-abstraction approach outperforms the extraction-with-abstraction approach, especially for cross-domain headline generation.
Daniel Deutsch, Dan Roth
The paper analyzes the token alignments used by reference-based metrics such as ROUGE and BERTScore to compare summaries and argues that their scores largely cannot be interpreted as measuring information overlap. Rather, they are better estimates of the extent to which the summaries discuss the same topics. The consequence of this result is that the most frequently used summarization evaluation metrics do not align with the community’s research goal, to generate summaries with high-quality information. However, the paper concludes by demonstrating that a recently proposed metric, QAEval, which scores summaries using question-answering, appears to better capture information quality than current evaluations, highlighting a direction for future research.
Tamara Sladoljev-Agejev, Jan Šnajder
The paper discusses the automated scoring of college-level summary writing tasks in English as a second language (EL2) using the Reading-for-Understanding (RU) cognitive framework, extended with the Reading-to-Write (RW) element, and analytic scoring with six rubrics covering content and writing quality. The authors show that regression models with reference-based and linguistic features perform better than baselines across all rubrics and reveal interesting correlations between summary features and analytic rubrics, highlighting the links between the RU and RW constructs.
Julius Steen, Katja Markert
The paper discusses the importance of automatically evaluating the coherence of summaries and the challenges of doing so due to the use of disparate datasets and metrics. The authors conduct a large-scale investigation of various methods for summary coherence modeling and introduce two novel analysis measures to identify biases in coherence measures. They find that currently available automatic coherence measures are not reliable across all evaluation metrics, but large-scale language models fine-tuned on self-supervised tasks show promising results if they are trained to generalize across different summary lengths.
Ge Luo, Forrest Sheng Bao
The paper proposes a method for evaluating machine-generated summaries without a human-written reference summary. The method involves learning the preference rank of summaries using the Bradley-Terry power ranking model from inferior summaries generated by corrupting base summaries. The experiments conducted on several datasets show that the proposed method can produce scores highly correlated with human ratings.
Wang Chen, Piji Li, Irwin King
The paper proposes a training-free and reference-free summarization evaluation metric to avoid the costly and time-consuming process of collecting human-annotated references and ratings. The metric consists of a centrality-weighted relevance score and a self-referenced redundancy score. The relevance score is computed between the pseudo reference built from the source document and the given summary, and the redundancy score evaluates the redundant information in the summary. The final evaluation score is produced by combining the relevance and redundancy scores. The proposed method outperforms existing methods on both multi-document and single-document summarization evaluation. The authors make their source code available.
Julius Steen, Katja Markert
The paper discusses the importance of manual evaluation in assessing progress in automatic text summarization. The authors conducted a survey on recent summarization system papers and found little agreement on how to perform evaluation studies. They conducted two evaluation experiments on coherence and repetitiveness and compared Likert-type and ranking annotations. They found that the best choice of evaluation method can vary depending on the aspect being evaluated. The authors also found that study parameters are often not fully reported and subsequent statistical analysis ignores grouping factors. They showed that the total number of annotators can have a strong impact on study power and that current statistical analysis methods can inflate type I error rates up to eight-fold. They highlight that eliciting multiple judgments per summary leads to less powerful and reliable annotations for system comparison given a fixed study budget.
Wojciech Kryściński, Bryan McCann, Caiming Xiong, Richard Socher
The paper proposes a model-based approach for verifying factual consistency and identifying conflicts between source documents and generated summaries. The model is trained jointly for three tasks: predicting whether each summary sentence is factually consistent or not, extracting a span in the source document to support this consistency prediction, and extracting the inconsistent span from each summary sentence that is deemed inconsistent. The approach outperforms previous models and provides useful assistance in verifying factual consistency. The authors also release a dataset, code, and trained model weights for factual consistency verification.
Oleg Vasilyev, John Bohannon
The paper introduces a new reference-free summary quality evaluation measure called ESTIME, which focuses on the faithfulness of the summary. The measure counts potential inconsistencies between the summary and the source document and correlates strongly with expert scores in the SummEval dataset. The paper also presents a method of generating subtle factual errors in human summaries and shows that ESTIME is more sensitive to these errors than other common evaluation measures.
Matan Eyal, Tal Baumel, Michael Elhadad
The paper discusses recent developments in automatic summarization and headline generation, which have focused on maximizing ROUGE scores. The authors propose an alternative evaluation metric called Answering Performance for Evaluation of Summaries (APES), which uses reading comprehension to assess a summary's ability to answer questions about the source article. They compare APES to manual evaluation metrics and present a neural abstractive model that maximizes APES and increases ROUGE scores.
Alexander R. Fabbri, Chien-Sheng Wu, Wenhao Liu, Caiming Xiong
The paper discusses the importance of factual consistency in text summarization models and evaluates two types of metrics, entailment-based and question answering (QA)-based, for measuring this quality. The authors find that carefully selecting the components of a QA-based metric is critical to performance and propose an optimized metric called QAFACTEVAL, which outperforms previous QA-based and entailment-based metrics. Additionally, the authors suggest that combining both types of metrics can further improve performance.
Stratos Xenouleas, Prodromos Malakasiotis, Marianna Apidianaki, Ion Androutsopoulos
The paper introduces a new model called SUM-QE, which uses BERT to evaluate the quality of summarizations. Unlike other models, SUM-QE focuses on linguistic quality aspects that are not captured by content-based approaches. The model achieves high correlations with human ratings and outperforms simpler models. The predictions of SUM-QE can be used for system development and to inform users about the quality of automatically generated summaries and other types of text.
Pierre Jean A. Colombo, Chloé Clavel, Pablo
The paper discusses the challenges of assessing the quality of natural language generation systems through human annotation, which is expensive and time-consuming. Researchers often rely on automatic metrics, but existing string-based metrics like BLEU do not handle synonyms well. The authors introduce InfoLM, a family of untrained metrics that uses a pre-trained masked language model and information measures to address these flaws. They demonstrate that InfoLM achieves significant improvement and correlation gains in many configurations on both summarization and data2text generation through direct assessment.
Simeng Sun, Ani Nenkova
The paper discusses the limitations of using ROUGE to evaluate summarization systems and presents experiments on using distributed representations for evaluation. The results show that the max value over each dimension of the summary ELMo word embeddings and averaging the cosine similarity of all encoders yield high correlation with human ratings in both reference-based and reference-free settings. The distributed representations outperform ROUGE in recent corpora for abstractive news summarization but are less effective on older test data and systems.
Kexiang Wang, Tianyu Liu, Baobao Chang, Zhifang Sui
The paper discusses a new protocol for designing reference-based metrics for document summarization that requires the endorsement of source documents. The proposed anchored ROUGE metric fixes each summary particle on the source document, resulting in a more solid computation. Empirical results on benchmark datasets show that using the source document induces a higher correlation with human judgments for the ROUGE metric. The protocol is self-explanatory and easy to implement, and can foster various effective designs of reference-based metrics besides the anchored ROUGE.
Sen Zhang, Jianwei Niu, Chuyuan Wei
This paper proposes a framework called SumFC for assessing the factual consistency of abstractive summarization models. SumFC uses a two-stage approach to select relevant sentences and perform fine-grained consistency reasoning at the sentence level. The model is trained using data synthesis and contrastive loss to identify subtle cues. Experimental results show that SumFC outperforms previous methods and can distinguish detailed differences better.
Sascha Rothe, Joshua Maynez, Shashi Narayan
The paper compares task-agnostic pretraining objectives with task-specific pretraining objectives for summarization tasks in a controlled study. The results show that task-agnostic pretraining is sufficient for most cases, reducing the need for costly task-specific pretraining. The study also reports new state-of-the-art numbers for two summarization tasks using a T5 model with 11 billion parameters and an optimal beam search length penalty.
Simeng Sun, Ori Shapira, Ido Dagan, Ani Nenkova
The paper discusses how traditional summarization evaluations compared systems that produced summaries of the same length, but neural approaches have done away with this requirement. The paper presents experiments showing that summaries of different lengths produced by the same system have a clear non-linear pattern of quality as measured by ROUGE F1 scores. The paper proposes a new evaluation method where ROUGE scores are normalized by those of a random system producing summaries of the same length. The paper reanalyzes recently reported results and shows that some negative results are actually reports of system improvement once differences in length are taken into account. Finally, the paper presents a small-scale human evaluation showing a similar trend of perceived quality increase with summary length, calling for the need of similar normalization in reporting human scores.
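The length normalization described above can be sketched as follows. This is a minimal illustration that assumes the normalization is a simple ratio between the system's ROUGE score and the score of a random baseline at the same summary length; the function name and all numbers in the usage note are hypothetical and not taken from the paper.

```python
def normalize_by_random(system_score: float, random_score: float) -> float:
    """Divide a system's ROUGE score by a random baseline's score.

    Both scores are assumed to be measured at the same summary length.
    """
    if random_score <= 0:
        raise ValueError("random baseline score must be positive")
    return system_score / random_score


def compare_lengths(system_scores: dict, random_scores: dict) -> dict:
    """Normalize scores keyed by summary length (hypothetical helper)."""
    return {
        length: normalize_by_random(score, random_scores[length])
        for length, score in system_scores.items()
    }
```

For example, with made-up values, `compare_lengths({50: 0.30, 100: 0.36}, {50: 0.15, 100: 0.24})` ranks the 50-word summaries higher after normalization even though their raw score is lower, which is the kind of reversal the paper reports when length differences are taken into account.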
Manik Bhandari, Pranav Gour, Atabak Ashfaq, Pengfei Liu
The paper discusses the evaluation of automatic metrics in text summarization, specifically focusing on the disagreement between metrics when ranking high-scoring summaries. The authors revisit previous experiments and suggest that the narrow scoring range of summaries may be the reason for the disagreement. They also analyze three other properties that impact inter-metric agreement: Ease of Summarization, Abstractiveness, and Coverage. The authors make their analysis code and data publicly available to encourage reproducible research.
Maartje ter Hoeve, Julia Kiseleva, Maarten de Rijke
The paper discusses the gap between the current research focus in automatic summarization and users' needs, particularly university students who heavily rely on summaries. To address this, the authors propose a survey methodology that can be adjusted to investigate different user groups. They find that the current research directions do not fully align with students' needs and suggest ways to mitigate this mismatch in future research.
Ori Shapira, David Gabay, Hadar Ronen, Judit Bar-Ilan, Yael Amsterdamer, Ani Nenkova, Ido Dagan
The paper explores whether reference summaries of a single length can be used to evaluate system summaries of varying lengths. The authors conducted a case study using several variants of the ROUGE metric and found that the evaluation protocol is competitive. This paves the way for practical evaluation of varying-length summaries using existing summarization benchmarks.
Tobias Falke, Leonardo F. R. Ribeiro, Prasetya Ajie Utama, Ido Dagan, Iryna Gurevych
The paper discusses the limitations of abstractive summarization due to factual errors in generated summaries. The authors evaluate summaries produced by state-of-the-art models and find that errors occur frequently, especially with more abstractive models. They explore the use of textual entailment predictions to detect and reduce such errors by reranking alternative predicted summaries. The authors find that current entailment models do not offer the desired performance for this task and release their annotations as additional test data for future evaluations of natural language inference.
Jiacheng Xu, Shrey Desai, Greg Durrett
The paper discusses the difficulty in interpreting the behavior of seq2seq abstractive summarization models, which generate text in a free-form manner. The authors analyze summarization decoders by studying the entropy of the model's token-level predictions, finding a correlation between low prediction entropy and where the model copies tokens rather than generating novel text. The decoder's uncertainty also connects to factors like sentence position and syntactic distance between adjacent pairs of tokens, giving insight into what factors make a context particularly selective for the model's next output token. Finally, the authors study the relationship between decoder uncertainty and attention behavior to understand how attention gives rise to these observed effects in the model. The paper concludes that uncertainty is a useful perspective for analyzing summarization and text generation models more broadly.
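The token-level quantity studied above is the Shannon entropy of the decoder's next-token distribution. A minimal sketch, assuming the distribution is available as a list of probabilities (the function name is chosen for this example):

```python
import math


def prediction_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution.

    Zero-probability entries contribute nothing and are skipped to avoid
    evaluating log(0).
    """
    return -sum(p * math.log(p) for p in probs if p > 0)
```

A distribution that puts nearly all its mass on one token has entropy close to zero, the regime the paper associates with copying, while a flat distribution over many tokens has high entropy.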
Maartje ter Hoeve, Julia Kiseleva, Maarten de Rijke
The paper discusses the need to re-assess the focus and objectives of automatic text summarization and whether they align with users' desires. The authors conducted a survey among heavy users of pre-made summaries and found that the current focus of the field does not fully align with participants' wishes. They propose adopting a broader perspective on automatic summarization, expanding the types of input material that can be summarized, and defining requirements for datasets that can facilitate these research directions. They also propose including usefulness as an important aspect of summarization in the evaluation methodology and propose a methodology to evaluate the usefulness of a summary. The authors hope to unlock important research directions for future work on automatic summarization.
Daniel Deutsch, Dan Roth
The paper discusses the importance of answer verification in question answering-based summarization evaluation metrics. The authors benchmark various answer verification methods, including lexical overlap and more sophisticated text comparison methods like BERTScore and LERC. They find that LERC performs well in some settings, but overall, improved verification performance does not necessarily lead to better QA-based metric quality. The authors attribute this to dataset properties.
Zhiyuan Zeng, Jiaze Chen, Weiran Xu, Lei Li
The paper proposes a method for generating highly abstract yet factually correct summaries using an efficient weak-supervised adversarial data augmentation approach. The approach forms a factual consistency dataset and trains an evaluation model that can accurately and robustly discriminate factual consistency and trace factual errors. Experiments and analysis on public annotated summarization and factual consistency datasets demonstrate the effectiveness and reasonableness of the approach. The code for the approach is available at https://github.com/parZival27/GrAdualCC.
This paper highlights the limitations of using the ROUGE metric for evaluating summarization systems, particularly in terms of optimal solutions. The authors provide the first proof that the task of summarization is NP-hard. However, they also demonstrate that greedy algorithms perform well on three benchmark datasets. The paper also points out the difficulty in ensuring overall quality assurance, as there is no natural upper bound on the quality of summarization systems and even humans cannot achieve optimal summarization.
Yizhu Liu, Qi Jia, Kenny Q. Zhu
The paper proposes a new automatic reference-free evaluation metric for summarization that uses pretrained language models to compare the semantic distributions of the source document and the summary, and that also accounts for the summary compression ratio. The experiments show that this metric is more consistent with human evaluation in terms of coherence, consistency, relevance, and fluency.
Saadia Gabriel, Asli Celikyilmaz, Rahul Jha, Yejin Choi, Jianfeng Gao
The paper discusses the challenge of ensuring factual correctness in machine-generated text and introduces a metaevaluation framework called GO FIGURE for evaluating factuality evaluation metrics. The framework proposes five necessary conditions for evaluating factuality metrics on diagnostic factuality data across three different summarization tasks. The benchmark analysis on ten factuality metrics shows that the framework provides a robust and efficient evaluation that is extensible to multiple types of factual consistency and standard generation metrics, including QA metrics. However, the performance of QA metrics is highly dependent on the way in which questions are generated.
Artidoro Pagnoni, Vidhisha Balachandran, Yulia Tsvetkov
The paper discusses the issue of factually unreliable outputs generated by modern summarization models and the lack of common benchmarks to measure their factuality. To address this, the authors devise a typology of factual errors and collect human annotations of generated summaries from state-of-the-art summarization systems for the CNN/DM and XSum datasets. They identify the proportion of different categories of factual errors in various summarization models and benchmark factuality metrics, showing their correlation with human judgement and specific strengths and weaknesses.
Nicholas Egan, Oleg Vasilyev, John Bohannon
The paper introduces new reference-free summary evaluation metrics that use a pretrained language model to estimate the information content shared between a document and its summary. These metrics are a modern take on the Shannon Game and an extension of BLANC. The authors empirically verify that their metrics achieve state-of-the-art correlation with human judgement of the summary quality dimensions of coherence and relevance, as well as competitive correlation with human judgement of consistency and fluency.
Mousumi Akter, Naman Bansal, Shubhra Kanti Karmaker
The paper discusses the limitations of the traditional ROUGE metric for evaluating automated summarization tasks and proposes a semantic-aware nCG-based evaluation metric called Sem-nCG. The paper demonstrates how to generate more reliable semantic-aware ground truths for evaluating extractive summarization tasks without additional human intervention. The authors conducted extensive experiments using the CNN/DailyMail dataset and found that Sem-nCG is more reliable and shows higher correlation with human judgement than ROUGE. The paper suggests that ROUGE often leads to inaccurate conclusions and Sem-nCG is a better alternative for evaluating extractive summarization tasks.
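For readers unfamiliar with normalized cumulative gain (nCG), the following sketch shows the generic computation on which a metric such as Sem-nCG builds. It assumes each source sentence already has a gain value; how Sem-nCG derives semantic-aware gains is not shown, and the function name and signature are assumptions made for this example.

```python
def ncg_at_k(gains, selected, k):
    """Normalized cumulative gain of the first k selected sentences.

    gains:    gain value for every source sentence, indexed by position
    selected: sentence indices chosen by the summarizer, in rank order
    k:        cutoff
    """
    cumulative = sum(gains[i] for i in selected[:k])
    # The ideal ranking takes the k sentences with the largest gains.
    ideal = sum(sorted(gains, reverse=True)[:k])
    return cumulative / ideal if ideal > 0 else 0.0
```

A summarizer that extracts exactly the k highest-gain sentences scores 1.0, and lower-gain selections score proportionally less.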
Maxime Peyrard, Teresa Botschen, Iryna Gurevych
The paper proposes a new automatic scoring metric for evaluating summaries, based on human judgments from classical summarization datasets. The model learns the best combination of existing automatic scoring metrics that correlates with human judgments. The reliability of the new metric is tested through a manual evaluation, and the trained metric is released as an open-source tool.
Yuhao Zhang, Derek Merck, Emily Bao Tsai, Christopher D. Manning, Curtis P. Langlotz
The paper discusses the limitations of existing neural abstractive summarization models in terms of factual correctness and proposes a framework to evaluate and optimize the factual correctness of generated summaries using an information extraction module and reinforcement learning. The proposed method is applied to the summarization of radiology reports, where factual correctness is crucial, and is shown to substantially improve the quality of outputs over a competitive neural summarization system, approaching the quality of human-authored summaries.
Hwanhee Lee, Kang Min Yoo, Joonsuk Park, Hwaran Lee, Kyomin Jung
The paper discusses the challenge of determining whether a generated summary is factually consistent with the source text, despite recent advances in abstractive summarization systems. The latest approach is to train a factual consistency classifier on factually consistent and inconsistent summaries, with the former readily available as reference summaries in existing summarization datasets. However, generating factually inconsistent summaries that are closely relevant to the source text remains a challenge. The paper proposes a method of generating such summaries using source texts and reference summaries with key information masked. Experiments on seven benchmark datasets demonstrate that factual consistency classifiers trained on summaries generated using this method generally outperform existing models and show a competitive correlation with human judgments. The characteristics of the summaries generated using this method are also analyzed, and a pre-trained model and code will be released.
Zheheng Luo, Qianqian Xie, Sophia Ananiadou
The paper discusses the need for readability controllable summarization for biomedical documents, as existing summarization systems do not consider the varying levels of expertise of readers. The authors introduce a new task of generating technical summaries for experts and plain language summaries for laypeople, and construct a corpus of biomedical papers with both types of summaries. They benchmark multiple advanced summarization models and propose a novel metric to evaluate the readability discrepancy between the two types of summaries. The results show that current control techniques are not effective in generating suitable summaries for different levels of expertise.
Xiuying Chen, Mingzhe Li, Shen Gao, Rui Yan, Xin Gao, Xiangliang Zhang
The paper discusses the use of citation graphs to improve scientific paper extractive summarization. The authors propose two models: a Multi-granularity Unsupervised Summarization model (MUS) and a Graph-based Supervised Summarization model (GSS). MUS finetunes a pre-trained encoder model on the citation graph by link prediction tasks, while GSS introduces a gated sentence encoder and a graph information fusion module to polish the sentence representation. Experiments on a public benchmark dataset show that both models bring substantial improvements over the prior state-of-the-art model.
Hongli Zhan, Tiberiu Sosea, Cornelia Caragea, Junyi Jessy Li
The paper discusses the importance of understanding the triggers that lead to people's emotions during crises such as the COVID-19 pandemic. It proposes a novel approach of emotion detection and trigger summarization using social media posts, which tend to be charged with multiple emotions and scattered triggers. The authors introduce COVIDET, a dataset of ~1,900 English Reddit posts related to COVID-19, with manual annotations of perceived emotions and abstractive summaries of their triggers. The paper also presents strong baselines for jointly detecting emotions and summarizing emotion triggers. The authors conclude that COVIDET presents new challenges in emotion-specific summarization and multi-emotion detection in long social media posts.
Abhishek Agarwal, Shanshan Xu, Matthias Grabmair
The paper presents techniques for extractive summarization of legal decisions in a low-resource setting using limited expert annotated data. The models locate relevant content using a sequential model and tackle redundancy by leveraging maximal marginal relevance to compose summaries. The proposed approaches can achieve ROUGE scores vis-à-vis expert extracted summaries that match those achieved by inter-annotator comparison. The multi-task learning model variant leverages rhetorical role identification as an auxiliary task to further improve the summarizer.
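Maximal marginal relevance (MMR), which the summary above mentions for redundancy control, can be sketched as a greedy selection loop. This is the generic textbook form, not the paper's implementation: the relevance scores and the similarity matrix are assumed to be given, and the trade-off weight `lam` and its default value are illustrative.

```python
def mmr_select(relevance, similarity, k, lam=0.7):
    """Greedily pick up to k items by maximal marginal relevance.

    relevance:  relevance score per candidate sentence
    similarity: square matrix of pairwise sentence similarities
    lam:        trade-off between relevance (1.0) and novelty (0.0)
    """
    selected = []
    candidates = list(range(len(relevance)))
    while candidates and len(selected) < k:

        def mmr_score(i):
            # Redundancy is the highest similarity to anything already chosen.
            redundancy = max((similarity[i][j] for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy

        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With `lam` below 1.0, a highly relevant sentence that nearly duplicates an already selected one is passed over in favor of a less relevant but novel sentence.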
Vidhisha Balachandran, Hannaneh Hajishirzi, William W. Cohen, Yulia Tsvetkov
The paper proposes a new approach to correcting factual errors in abstractive summarization models. Instead of using heuristics to generate non-factual summaries, the authors generate hard, representative synthetic examples of non-factual summaries through infilling language models. With this data, they train a more robust fact-correction model to post-edit the summaries to improve factual consistency. The approach is shown to vastly outperform prior methods in correcting erroneous summaries on two popular summarization datasets, improving factuality scores by over ∼11 points on CNN/DM and over ∼31 points on XSum on average across multiple summarization models, while maintaining competitive summarization quality. The proposed model is called FACTEDIT.
Yixin Liu, Ansong Ni, Linyong Nan, Budhaditya Deb, Chenguang Zhu, Ahmed H. Awadallah, Dragomir Radev
The paper discusses the challenges of using neural attention models for long text summarization due to the quadratic memory complexity of the self-attention module. Instead of designing more efficient attention modules, the authors investigate if models with a restricted context can have competitive performance. They propose a locality-aware modeling strategy where the model is applied to individual pages grouped by the principle of locality during both the encoding and decoding stages. The authors empirically investigate three kinds of locality in text summarization at different levels of granularity and show that their model outperforms strong baseline models with efficient attention modules.
Naman Bansal, Mousumi Akter, Shubhra Kanti Karmaker
The paper discusses the Semantic Overlap Summarization (SOS) task, which involves summarizing common information from multiple alternate narratives. The lack of existing datasets for supervised training is a major challenge for this task. To address this, the authors propose a novel data augmentation technique to create synthetic data for training a seq-to-seq model. Through experiments using news narratives, they show that models trained using the synthetic dataset provide significant performance improvements over pre-trained summarization techniques and are close to models trained on golden training data. The proposed data augmentation technique is effective for training seq-to-seq models on the SOS task.
Chenhui Shen, Liying Cheng, Lidong Bing, Yang You, Luo Si
The paper discusses the limitations of current structure-controlling methods in controllable text generation and proposes a new method called sentence-level beam search generation (SentBS) to address these limitations. SentBS evaluates sentences throughout the generation process to select suitable ones for subsequent generations. The paper experiments with different decoding methods as subcomponents for SentBS and evaluates the results on the structure-controlled dataset MReD. The experiments show that all explored combinations for SentBS can improve the agreement between the generated text and the desired structure, with the best method reducing structural discrepancies by approximately 68%.
Naman Bansal, Mousumi Akter, Shubhra Kanti Karmaker
The paper discusses the Semantic Overlap Summarization (SOS) task, which involves generating a summary from multiple alternative narratives that convey common information. The authors focus on the automated evaluation of the SOS task using a benchmark dataset and find that the popular ROUGE metric is not suitable for this task. They propose a new evaluation metric called SEM-F1, which yields higher correlation with human judgment and inter-rater agreement compared to ROUGE. The metric is inspired by the sentence-wise annotation technique using overlap labels reported in previous work.
Hwanhee Lee, Cheoneum Park, Seunghyun Yoon, Trung Bui, Franck Dernoncourt, Juae Kim, Kyomin Jung
The paper discusses the problem of factual errors in abstractive summarization systems and proposes a solution in the form of an efficient factual error correction system called RFEC. The system is based on entity retrieval and retrieves evidence sentences from the original document to reduce the length of the text to analyze. It then detects entity-level errors in the summaries and substitutes the wrong entities with accurate ones from the evidence sentences. The experimental results show that RFEC outperforms baseline methods in correcting factual errors while also running faster.
Rajdeep Mukherjee, Abhinav Bohra, Akash Banerjee, Soumya Sharma, Manjunath Hegde, Afreen Shaikh, Shivani Shrivastava, Koustuv Dasgupta, Niloy Ganguly, Saptarshi Ghosh, Pawan Goyal
The paper discusses the lack of efficient techniques to summarize financial documents and introduces a new dataset called ECTSum, which consists of transcripts of earnings calls and expert-written bullet point summaries. The authors benchmark their dataset with state-of-the-art summarization methods and present a simple yet effective approach called ECT-BPS to generate bullet points that capture important facts discussed in the calls.
Wenhao Wu, Wei Li, Jiachen Liu, Xinyan Xiao, Ziqiang Cao, Sujian Li, Hua Wu
The paper discusses the unfaithful generation problem in current Seq2Seq summarization models, despite their ability to generate fluent and grammatical text. The authors propose a new perspective of factual robustness to measure the faithfulness of existing systems, which is the ability to correctly generate factual information over adversarial unfaithful information. They propose a novel training strategy called FRSUM, which enhances the model's factual robustness by teaching it to defend against both explicit adversarial samples and implicit factual adversarial perturbations. The evaluation results show that FRSUM consistently improves the faithfulness of various Seq2Seq models, such as T5 and BART.
Haojie Zhuang, Wei Emma Zhang, Jian Yang, Congbo Ma, Yutong Qu, Quan Z. Sheng
The paper introduces an unsupervised learning method called SCR (Summarize, Contrast and Review) for abstractive text summarization. Unlike most state-of-the-art methods that heavily rely on high-quality and large-scale parallel corpora, SCR removes the need for reference summaries. It leverages contrastive learning and is the first work to apply it for unsupervised abstractive summarization. The model is trained using true source documents as positive examples and strategically generated fake source documents as negative examples. The generated summaries are also guided to be similar to human-written texts. The extensive experiments show that SCR outperforms other unsupervised abstractive summarization baselines, demonstrating its effectiveness.
Yuning Mao, Ming Zhong, Jiawei Han
The paper proposes a new approach to automatically extract ultra-short summaries of scientific papers from their citation texts, creating a new benchmark dataset called CiteSum without human annotation. The authors conduct a comprehensive analysis of CiteSum and demonstrate the usefulness of the dataset by adapting models pre-trained on CiteSum to new tasks and domains with limited supervision. The results show that CITES outperforms most fully-supervised methods on SciTLDR for scientific extreme summarization and achieves significant gains on XSum for news extreme summarization and news headline generation.
Liam van der Poel, Ryan Cotterell, Clara Meister
The paper discusses the tendency of abstractive summarization models to output content not supported by the source document, known as hallucination. The authors identify high model uncertainty as a condition under which hallucinated content becomes more probable during generation. They propose a decoding strategy that switches to optimizing for pointwise mutual information of the source and target token when the model exhibits uncertainty, which decreases the probability of hallucinated tokens while maintaining the ROUGE and BERTScore results of top-performing decoding strategies. The experiments on the XSUM dataset support the effectiveness of their proposed method.
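The uncertainty-triggered switch to a pointwise mutual information objective can be sketched for a single decoding step as follows. This is a simplified illustration under stated assumptions: the source-conditioned and source-free next-token distributions are given as aligned lists of strictly positive probabilities, and the entropy threshold and the weight `lam` are hypothetical parameters rather than values from the paper.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def token_scores(cond_probs, uncond_probs, threshold, lam=1.0):
    """Score candidate tokens for one decoding step.

    cond_probs:   p(token | source, prefix) for each candidate
    uncond_probs: p(token | prefix) from a source-free language model
    When the conditional distribution is uncertain (entropy at or above the
    threshold), the source-free log-probability is subtracted, which yields a
    PMI-style score; otherwise the plain log-probability is used.
    """
    uncertain = entropy(cond_probs) >= threshold
    scores = []
    for p_cond, p_uncond in zip(cond_probs, uncond_probs):
        score = math.log(p_cond)
        if uncertain:
            score -= lam * math.log(p_uncond)
        scores.append(score)
    return scores
```

Under uncertainty, tokens that are likely only because of the language-model prior are penalized relative to tokens whose probability is driven by the source.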
Melanie Sclar, Peter West, Sachin Kumar, Yulia Tsvetkov, Yejin Choi
REFEREE is a new framework for sentence summarization that can be trained without the need for gold summaries. It allows for direct control of compression ratio and uses Symbolic Knowledge Distillation to distill latent knowledge from pre-trained language models. The framework proposes iterative distillation of knowledge, where student models from previous iterations serve as teacher models in the next iteration. The results show that the final student models outperform the much larger GPT3-Instruct model in terms of controllability of compression ratios without compromising the quality of summarization. The iterative distillation process also produces a high-quality dataset of sentence-summary pairs with varying degrees of compression ratios.
Jiayu Song, Iman Munire Bilal, Adam Tsakalidis, Rob Procter, Maria Liakata
The paper discusses the challenges of opinion summarization of social media posts and presents WassOS, an unsupervised abstractive summarization model that uses the Wasserstein distance. The model disentangles the distributions of documents/posts into separate semantic and syntactic spaces and obtains the summary distribution using the Wasserstein barycenter. A latent variable is then fed into a GRU decoder with a transformer layer to produce the final summary. The experiments on multiple datasets show that WassOS outperforms the state-of-the-art on ROUGE metrics and consistently produces the best summaries with respect to meaning preservation according to human evaluations.
Wojciech Kryściński, Nazneen Rajani, Divyansh Agarwal, Caiming Xiong, Dragomir Radev
The paper discusses the limitations of existing text summarization datasets and introduces BOOKSUM, a collection of datasets for long-form narrative summarization. The dataset covers literature documents and includes highly abstractive, human-written summaries on three levels of granularity. The unique challenges posed by the domain and structure of the dataset include processing long documents, non-trivial causal and temporal dependencies, and rich discourse structures. The paper also evaluates multiple extractive and abstractive summarization models as baselines for the dataset.
Chao Zhao, Faeze Brahman, Kaiqiang Song, Wenlin Yao, Dian Yu, Snigdha Chaturvedi
The paper proposes NARRASUM, a large-scale narrative summarization dataset containing 122K narrative documents and their corresponding abstractive summaries. The dataset is collected from plot descriptions of movies and TV episodes with diverse genres. The paper highlights the challenges of summarizing a narrative, which requires an understanding of event causality and character behaviors. The experiments show a large performance gap between humans and state-of-the-art summarization models on NARRASUM. The authors hope that this dataset will promote future research in summarization and broader studies of natural language understanding and generation. The dataset is available at https://github.com/zhaochaocs/narrasum.
Daniel King, Zejiang Shen, Nishant Subramani, Daniel S. Weld, Iz Beltagy, Doug Downey
The paper discusses the issue of "hallucinations" in abstractive summarization systems, where the system produces statements not supported by the source text. The authors analyze the connection between hallucinations and training data, and find that models hallucinate because they train on target summaries that are unsupported by the source. They present a new decoding method called PINOCCHIO, which improves the consistency of a transformer-based abstractive summarizer by constraining beam search to avoid hallucinations. PINOCCHIO detects likely model hallucinations based on various measures of attribution to the source text and can backtrack to find more consistent output or produce no summary at all when no consistent generation can be found. The experiments show that PINOCCHIO improves the consistency of generation by an average of 68% on two abstractive summarization datasets without hurting recall.
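The reject-and-backtrack idea can be sketched as follows. This is a simplified depth-first search, not beam search, and the three callbacks (`propose`, `is_supported`, `is_complete`) are hypothetical stand-ins for the model's next-token candidates, an attribution check against the source, and an end-of-summary test. It is not the PINOCCHIO implementation.

```python
from typing import Callable, List, Optional, Sequence


def constrained_decode(
    propose: Callable[[List[str]], Sequence[str]],
    is_supported: Callable[[List[str], str], bool],
    is_complete: Callable[[List[str]], bool],
    max_len: int = 20,
) -> Optional[List[str]]:
    """Decode token by token, skipping unsupported tokens and backtracking
    out of dead ends. Returns None when no consistent output is found."""

    def search(prefix: List[str]) -> Optional[List[str]]:
        if is_complete(prefix):
            return prefix
        if len(prefix) >= max_len:
            return None
        for token in propose(prefix):  # candidates in model-preference order
            if not is_supported(prefix, token):
                continue  # reject a likely hallucination
            result = search(prefix + [token])
            if result is not None:
                return result
        return None  # dead end: the caller tries its next candidate

    return search([])
```

Returning `None` mirrors the option of producing no summary at all when no consistent generation exists.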
Tanya Goyal, Nazneen Rajani, Wenhao Liu, Wojciech Kryściński
The paper introduces HYDRASUM, a new summarization architecture that uses multiple decoders to automatically learn contrasting summary styles without extra supervision. HYDRASUM provides a simple mechanism to obtain stylistically-diverse summaries by sampling from individual decoders or their mixtures, outperforming baseline models on three summarization datasets. A small modification to the gating strategy during training can enforce an even stricter style partitioning, allowing users to vary summary styles along multiple dimensions.
Tianshu Wang, Faisal Ladhak, Esin Durmus, He He
The paper discusses the issue of current abstractive summarization systems producing summaries that are unfaithful to the source document, which can lead to misinformation. The authors propose a back-translation-style approach to augment negative samples that mimic factual errors made by the model, in order to teach the model to distinguish between faithful and unfaithful summaries. They also incorporate textual entailment data through multitasking to further improve performance. Experiments on three datasets show that their method consistently improves faithfulness without sacrificing informativeness.
Marcio Fonseca, Yftah Ziser, Shay B. Cohen
The paper proposes a method called FACTORSUM that disentangles content selection from the budget used to cover salient content, improving the performance and applicability of abstractive summarizers. This is achieved by factorizing summarization into two steps through an energy function: (1) generation of abstractive summary views covering salient information in subsets of the input document (document views); (2) combination of these views into a final summary, following a budget and content guidance. The model achieves significantly higher ROUGE scores on multiple benchmarks for long document summarization, and is effective for domain adaptation. The performance gains are due to more flexible budget adaptation and processing of shorter contexts provided by partial document views.
Alex Wang, Richard Yuanzhe Pang, Angelica Chen, Jason Phang, Samuel R. Bowman
The paper discusses the challenges of assembling summarization datasets and proposes a new approach of hiring contractors to write original summaries from scratch. The resulting dataset, SQuALITY, consists of question-focused summaries and is shown to be challenging for state-of-the-art summarization systems. The authors also note that existing automatic evaluation metrics are weak indicators of summary quality. SQuALITY is available for use at https://github.com/nyu-mll/SQuALITY.
Griffin Adams, Han-Chin Shing, Qing Sun, Christopher Winestock, Kathleen McKeown, Noémie Elhadad
The paper proposes a new approach to improve the quality of reference summaries while retaining all data. The approach involves selectively rewriting unsupported reference sentences to better reflect source data. A synthetic dataset of positive and negative revisions is automatically generated, and models are trained to revise reference sentences with contrastive learning. The intensity of revisions is treated as a controllable attribute to balance faithfulness and abstraction. The proposed method is tested on noisy references from publicly available MIMIC-III discharge summaries for hospital-course summarization, and models trained on revised clinical references are found to be more faithful, informative, and fluent than models trained on original or filtered data.
Haopeng Zhang, Xiao Liu, Jiawei Zhang
The paper discusses the challenges of extractive summarization for long documents due to the extended structured input context and long-distance sentence dependency. It proposes HEGEL, a hypergraph neural network that captures high-order cross-sentence relations to improve summarization. HEGEL uses hypergraph transformer layers to update and learn effective sentence representations and fuses different types of sentence dependencies, including latent topics, keywords coreference, and section structure. The paper validates HEGEL through extensive experiments on two benchmark datasets, demonstrating its effectiveness and efficiency.
Shuaiqi Liu, Jiannong Cao, Ruosong Yang, Zhiyuan Wen
The paper discusses the limitations of existing document summarization methods that focus only on text and filter out non-textual content, such as tables. To address this, the authors propose FINDSum, a large-scale dataset for long text and multi-table summarization. The dataset is built on 21,125 annual reports from 3,794 companies and has two subsets for summarizing each company's results of operations and liquidity. The authors present three types of summarization methods and propose evaluation metrics to assess the usage of numerical information in produced summaries. The paper highlights the importance of jointly considering input textual and tabular data when summarizing report documents.
Alexander R. Fabbri, Prafulla Kumar Choubey, Jesse Vig, Chien-Sheng Wu, Caiming Xiong
The paper discusses the problem of factual inconsistency in summarization models and proposes a model-agnostic approach to address it through post-editing. The focus is on removing extrinsic entity errors, or entities not in the source, to improve consistency while retaining the summary's essential information and form. The proposed method uses sentence-compression data to train the post-editing model to remove errors marked with special tokens. The model improves factual consistency while maintaining ROUGE and can be applied on top of another post-editor, improving entity precision by up to a total of 38%. The paper also compares different post-editing approaches and analyzes settings where post-editors show the largest improvements.
Tomas Goldsack, Zhihao Zhang, Chenghua Lin, Carolina Scarton
The paper discusses the importance of lay summarisation in making scientific literature more accessible to non-experts. It highlights the limitations of current corpora for this task and presents two new datasets, PLOS and eLife, containing biomedical journal articles and expert-written lay summaries. The paper characterizes the lay summaries and benchmarks them using mainstream summarization approaches, demonstrating their utility and identifying key challenges. The datasets and code are available for use.
John Glover, Federico Fancellu, Vasudevan Jagannathan, Matthew R. Gormley, Thomas Schaaf
The paper discusses the use of Natural Language Inference models to score the factuality of generated summaries. Previous studies have shown that decomposing either the input document or the summary into sentences can improve factuality scoring. However, the paper systematically compares different granularities of decomposition and shows that fine-grained decomposition is not always the best strategy. The results also suggest that incorporating additional context can improve performance, but this may not apply to all datasets. The paper highlights the importance of caution in model and methodology selection for downstream tasks.
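The effect of decomposition granularity can be sketched as below. Each summary unit is scored against every document unit, the best-supported score is kept, and the results are averaged. The `overlap_score` function is only a lexical placeholder for an NLI entailment probability, and all names are hypothetical; the paper's own models and aggregation may differ.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter; a real pipeline would use a proper tokenizer."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def overlap_score(premise: str, hypothesis: str) -> float:
    """Placeholder for an NLI model: share of hypothesis tokens found in the premise."""
    premise_tokens = set(re.findall(r"\w+", premise.lower()))
    hyp_tokens = re.findall(r"\w+", hypothesis.lower())
    if not hyp_tokens:
        return 0.0
    return sum(t in premise_tokens for t in hyp_tokens) / len(hyp_tokens)


def factuality_score(
    document: str,
    summary: str,
    scorer: Callable[[str, str], float] = overlap_score,
    doc_level: str = "sentence",  # "sentence" or "document"
    sum_level: str = "sentence",  # "sentence" or "document"
) -> float:
    premises = split_sentences(document) if doc_level == "sentence" else [document]
    hypotheses = split_sentences(summary) if sum_level == "sentence" else [summary]
    if not premises or not hypotheses:
        return 0.0
    # For each summary unit keep its best-supporting document unit, then average.
    best = [max(scorer(p, h) for p in premises) for h in hypotheses]
    return sum(best) / len(best)
```

Switching `doc_level` and `sum_level` changes the score for the same document and summary, which is the kind of sensitivity the comparison above is concerned with.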
Meng Cao, Yue Dong, Jingyi He, Jackie Chi Kit Cheung
The paper proposes a new training objective for abstractive summarization that uses rejection learning to identify and reject potentially noisy tokens. Existing methods drop noisy samples or tokens from the training set, which reduces its size and creates an artificial propensity to copy words from the source. The authors also propose a regularized decoding objective that penalizes non-factual candidate summaries during inference. In evaluations against five baseline models, the method improves the factuality of generated summaries while increasing their abstractiveness.
Yifu Qiu, Shay B. Cohen
The paper proposes a new approach to summarizing scientific articles using a hierarchy-aware graph neural network (HierGNN). This approach captures the underlying structure and dependencies between sentences in the input article, which is essential for integrating and consolidating information from different parts of the text. The HierGNN model consists of three main steps: learning a hierarchical document structure, propagating sentence information over this structure, and using graph-level attention to concentrate the decoder on salient information. Experiments show that HierGNN improves upon strong sequence models such as BART, with a significant margin in average ROUGE-1/2/L for CNN/DM and XSum. Human evaluation also demonstrates that summaries produced by HierGNN are more relevant and less redundant than baselines. The model synthesizes summaries by fusing multiple source sentences, rather than compressing a single source sentence, and processes long inputs more effectively.
Mathieu Ravaut, Shafiq Joty, Nancy F. Chen
The paper discusses the limitations of current abstractive summarization methods and proposes a new paradigm called SummaFusion, which fuses multiple summary candidates to produce a novel abstractive second-stage summary. This method improves both the ROUGE scores and qualitative properties of the summaries, especially in the few-shot setup where it sets a new state-of-the-art. The code and checkpoints for SummaFusion are available on GitHub.
Yanzhu Guo, Chloé Clavel, Moussa Kamal Eddine, Michalis Vazirgiannis
The paper discusses the lack of a well-defined formulation for summarization evaluation, which has led to popular summarization datasets being constructed in a way that does not guarantee validity or factual consistency. The authors address this issue by combining factual consistency models to identify problematic instances and release a filtered summarization dataset called SummFC with improved factual consistency. They demonstrate that models trained on this dataset achieve improved performance in nearly all quality aspects and argue that it should become a valid benchmark for developing and evaluating summarization systems.
Dongmin Hyun, Xiting Wang, Chanyoung Park, Xing Xie, Hwanjo Yu
The paper discusses the development of an abstractive model for unsupervised summarization of texts, which is based on reinforcement learning and does not require human-written summaries. The model uses a Markov decision process with rewards to formulate the summarization process and a multi-summary learning mechanism to generate multiple summaries of varying lengths that enhance each other. Experimental results show that the proposed model outperforms both abstractive and extractive models and frequently generates new words not present in the input texts.
Junxian He, Wojciech Kryściński, Bryan McCann, Nazneen Rajani, Caiming Xiong
The paper introduces CTRLSUM, a framework for generating summaries that can be controlled through a set of keywords. The keywords are automatically extracted during training, and at test time, a control function maps control signals to keywords. The same trained model can be applied to control summaries on various dimensions without affecting the model training process or pretrained models. The framework is effective in entity-centric and length-controllable summarization, contribution summarization on scientific papers, invention purpose summarization on patent filings, and question-guided summarization on news articles. CTRLSUM is also comparable or better than strong pretrained systems in standard, unconstrained summarization settings.
Tanya Goyal, Junyi Jessy Li, Greg Durrett
The paper discusses the lack of appropriate evaluation frameworks for summarizing long texts, which inhibits progress in this field. The authors introduce SNAC, a narrative coherence evaluation framework for fine-grained annotations of long summaries. They develop a taxonomy of coherence errors in generated narrative summaries and collect annotations for 6.6k sentences across 150 book and movie summaries. The collected annotations allow them to benchmark past work in coherence modeling and train a strong classifier for automatically localizing coherence errors in generated summaries. The SNAC framework can support future work in long document summarization and coherence evaluation, including improved summarization modeling and posthoc summary correction.
Wenchuan Mu, Kwan Hui Lim
The paper discusses the importance of automatic scoring of summaries in guiding the development of summarizers, but notes that summary scoring has not been studied as a machine learning task to assess its accuracy and robustness. The authors perform evasion attacks to explore the robustness of summary scoring systems and find that non-summary strings can achieve competitive scores with good summarizers on popular metrics such as ROUGE, METEOR, and BERTScore. The attacks also outperform state-of-the-art summarization methods on ROUGE-1 and ROUGE-L, and score the second-highest on METEOR. The authors observe a BERTScore backdoor where a simple trigger can score higher than any automatic summarization method. The low robustness of current scoring systems at the system level is highlighted, and the authors hope that their proposed attacks will facilitate the development of summary scores.
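To see why overlap-based metrics can reward non-summary strings, consider a minimal ROUGE-1 F1 sketch. It is a simplification of the official metric (whitespace tokenization, no stemming), and the example strings are made up for illustration; it does not reproduce the attacks in the paper.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate and a reference (simplified ROUGE-1)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "the government said the new policy will cut taxes for the middle class"
non_summary = "the the the for will said"  # not a sentence, only frequent words
score = rouge1_f1(non_summary, reference)
```

Here `score` is 12/19 (about 0.63) even though the candidate is not a readable summary, because every one of its tokens matches a reference token.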
Fei Wang, Kaiqiang Song, Hongming Zhang, Lifeng Jin, Sangwoo Cho, Wenlin Yao, Xiaoyang Wang, Muhao Chen, Dong Yu
The paper proposes a new summarization approach called SEASON (SaliencE Allocation as Guidance for Abstractive SummarizatiON) that uses salience expectation to guide abstractive summarization and adapts well to articles with different levels of abstractiveness. The paper argues that extractive summaries as guidance can be too strict and lead to information loss or noisy signals. SEASON is shown to be effective and reliable in automatic and human evaluations on two benchmark datasets, and empirical results on more than one million news articles demonstrate a natural fifteen-fifty salience split for news article sentences, providing useful insights for composing news articles.
Guan-Yu Lin, Pu-Jen Cheng
The paper proposes a new method called Regularized Teacher-Forcing (R-TeaFor) to address the exposure bias problem in training sequence generation models. R-TeaFor utilizes the pairwise relationship between the original training data and the modified ones for better regularization. The experiments show that R-TeaFor outperforms previous state-of-the-art models in summarization and can be generalized to different pre-trained models.
Huan Yee Koh, Jiaxin Ju, He Zhang, Ming Liu, Shirui Pan
The paper discusses the evaluation of long document abstractive summarization systems using fine-grained human annotations. It highlights the trade-off between generating relevant summaries and factual ones, and suggests promising directions for developing factual consistency metrics. The study also reveals the limitations of factuality metrics in detecting different types of factual errors and the effectiveness of ROUGE and BARTScore in evaluating the relevancy of a summary. The authors release their annotated long document dataset to contribute to the development of metrics across a broader range of summarization settings.
Shen Gao, Haotong Zhang, Xiuying Chen, Dongyan Zhao, Rui Yan
The paper proposes a procedural text summarization task with two summarization granularities: step-view, which summarizes each step in the procedural text separately, and global-view, which gives an overall summary of all steps. To tackle this task, the authors propose an Entity-State Graph-based Summarizer (ESGS), which builds on state-of-the-art entity state tracking methods and constructs a heterogeneous graph to aggregate contextual information for each procedure. The authors also propose to use the contextualized procedure graph representation to predict the salient entity. Experiments conducted on two datasets verify the effectiveness of the proposed model.
Ruifeng Yuan, Zili Wang, Ziqiang Cao, Wenjie Li
The paper proposes a new approach called prefix-merging for few-shot learning in query-focused summarization. The approach integrates the knowledge of text summarization and question answering into a properly designed prefix and applies it to query-focused summarization. With only a small amount of trainable parameters, prefix-merging outperforms fine-tuning on query-focused summarization. The paper also discusses the influence of different prefix designs and proposes a visualized explanation for how prefix-merging works.
Yizhu Liu, Qi Jia, Kenny Q. Zhu
The paper discusses the challenges of opinion summarization of multiple reviews and proposes a new method to address the issue. The authors convert each review into a mix of structured and unstructured data, called opinion-aspect pairs (OAs) and implicit sentences (ISs), and synthesize training pairs with such mix-structured data as input and the textual summary as output. They design a summarization model with an OA encoder and an IS encoder and show that their approach outperforms previous methods on the Yelp, Amazon and RottenTomatoes datasets.
Vinayshekhar Bannihatti Kumar, Rashmi Gangadharaiah
The paper discusses the shift from extractive to abstractive methods in automatic text summarization, and how large autoregressive language models have contributed to this shift. The authors revisit extractive methods and compare their performance to state-of-the-art abstractive models, finding that abstractive methods are not completely abstract in their generated summaries. They propose an evaluation metric to measure the degree of abstractiveness of a summary compared to extractive methods. The authors conduct experiments on two summarization datasets using five techniques in extractive and abstractive summarization to confirm their findings.
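One common proxy for abstractiveness is the fraction of summary n-grams that do not occur in the source. The sketch below computes that proxy; it is an assumption for illustration and not necessarily the metric the authors propose. Tokenization is plain whitespace splitting.

```python
from typing import List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def novel_ngram_fraction(source: str, summary: str, n: int = 2) -> float:
    """Share of summary n-grams absent from the source (0.0 = fully copied)."""
    source_ngrams = set(ngrams(source.lower().split(), n))
    summary_ngrams = ngrams(summary.lower().split(), n)
    if not summary_ngrams:
        return 0.0
    novel = sum(g not in source_ngrams for g in summary_ngrams)
    return novel / len(summary_ngrams)
```

A purely extractive summary scores 0.0 under this proxy, so values near zero for an abstractive model would indicate heavy copying from the source.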
Subhajit Chaudhury, Sarathkrishna Swaminathan, Chulaka Gunasekara, Maxwell Crouse, Srinivas Ravishankar, Daiki Kimura, Keerthiram Murugesan, Ramón Fernandez Astudillo, Tahira Naseem, Pavan Kapanipathi, Alexander Gray
The paper discusses the issue of factually inconsistent summaries produced by abstractive summarization models and proposes X-FACTOR, a cross-evaluation of three high-performing fact-aware abstractive summarization methods. The authors propose a fact-aware filtering mechanism to improve the quality of training data, a corrector module to improve the factual consistency of generated summaries, and a re-ranking technique to sample summary instances and rerank them based on their factuality. The paper also provides a detailed cross-metric agreement analysis to show how tuning a model to output summaries based on a particular factuality metric influences factuality as determined by other metrics. The goal of the work is to facilitate research that improves the factuality and faithfulness of abstractive summarization models.
Raymond Li, Wen Xiao, Linzi Xing, Lanjun Wang, Gabriel Murray, Giuseppe Carenini
The paper discusses the combination of two lines of research on the multi-head self-attention mechanism of the transformer model. The first line of research aims to understand why and how transformers work, while the second proposes new attention augmentation methods to make transformers more accurate, efficient, and interpretable. The authors present a human-in-the-loop pipeline to discover task-specific attention patterns, which are then injected into smaller and original models. The benefits of this approach are demonstrated in two case studies on extractive summarization and topic segmentation, where the models show considerable improvements in accuracy and efficiency after injecting the discovered patterns into attention heads.
Andreas Marfurt, James Henderson
The paper discusses the issue of hallucinations in abstractive summarization, which are model generations that are not faithful to the source document. Current methods for detecting hallucinations are limited to certain datasets and focus on noun phrases and named entities. The authors propose a new method that detects candidate hallucinations at the token level, regardless of part of speech, using information already produced during summary generation. They evaluate their method on the CNN/DailyMail dataset and show that it achieves better precision-recall tradeoffs than existing methods. The authors also repurpose an existing factuality dataset and create their own token-level annotations. Overall, their method enables practitioners to generate summaries and identify possible hallucinations with minimal overhead.
Ming Zhong, Yang Liu, Suyu Ge, Yuning Mao, Yizhu Jiao, Xingxing Zhang, Yichong Xu, Chenguang Zhu, Michael Zeng, Jiawei Han
The paper proposes an unsupervised multi-granularity summarization framework called GRANUSUM, which can generate summaries with customizable semantic coverage. The framework uses events as the basic semantic units of the source documents and ranks them by their salience. A model is developed to summarize input documents with given events as anchors and hints, producing multi-granular summaries in an unsupervised manner. The paper also introduces a new benchmark called GranuDUC, which contains multiple summaries at different granularities for each document cluster. Experimental results show that GRANUSUM outperforms strong baselines in multi-granularity summarization and, by exploiting event information, achieves state-of-the-art performance under the conventional unsupervised abstractive setting.