Figure 1: Examples of an earthquake-related article paired with extractive summaries from the CNN/DM dataset. “Generic” represents the selection of a general purpose summarization model. “Geo(graphy)” (colored in green) and “Recovery” (colored in orange) indicate our aspects of interest for the summary. We highlight aspect-relevant phrases in the document.
Figure 1: Three out of 150 reviews for the movie “Coach Carter”, and summaries written by the editor, and generated by a model following the EXTRACT-ABSTRACT approach and the proposed CONDENSE-ABSTRACT framework. The latter produces more informative and factual summaries whilst allowing control over aspects of the generated summary (such as the acting or plot of the movie).
Figure 2: Overview of our Citation Graph-Based Model (CGSUM). A denotes the source paper (w/o abstract). B, C, D and E denote the reference papers. The body text of A and the abstracts of the reference papers are fed into the document encoder, and then used to initialize the node features in the graph encoder. A neighbor extraction method is used to extract a more relevant subgraph. While decoding, the decoder attends to both the document and the citation graph structure.
Figure 1: A small research community on the subject of Weak Galerkin Finite Element Method. Green text indicates the domain-specific terms shared in these papers, orange text denotes different ways of writing the same sentences, blue text represents the definition of Weak Galerkin Finite Element Method (does not appear in the source paper).
Figure 1: A comparison between two-stage models and COLO. The two-stage models include two training stages and a time-consuming preprocessing step, while COLO is trained in an end-to-end fashion (GPU and CPU hours for each stage are shown in Table 6). Two-stage models use offline sampling to build positive-negative pairs, while COLO builds them with online sampling, drawing these pairs directly from a changing model distribution.
Figure 2: Architecture of our extractive model. Input sequence: The ‘[doc]’ token is used to get the vector representation zX of the document X, and ‘[sep]’ is used as a separator between sentences. We omit the classifier and the BCELoss. hi is the sentence embedding of the i-th sentence in X. zCi is the feature representation of the i-th candidate.
Figure 5: t-SNE visualization of two examples from the CNN/DM test set. We divide the candidates into 3 groups based on ROUGE score: candidates ranking 1~50, candidates ranking 51~100, and candidates ranking 101~150. The red point denotes the anchor and the purple/cyan/gray points respectively denote the top 50/100/150 candidates.
Figure 3: Sentence ranking via two-step sampling. In this toy example, each sentence (s1 to s5) is assigned to its nearest code (k = 1, 2, 3), as shown by thick purple arrows. During cluster sampling, the probability of sampling a code (top right; shown as blue bars) is proportional to the number of assignments it receives. For every sampled code, we perform sentence sampling; sentences are sampled, with replacement, according to their proximity to the code’s encoding. Samples from codes 1 and 3 are shown in black and red, respectively.
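The two-step sampling described in this caption can be sketched roughly as follows. This is a minimal illustration rather than the authors' implementation: the function name `two_step_sampling`, the inverse-distance weighting used for "proximity", the restriction of sentence sampling to the sentences assigned to the sampled code, and the toy assignments and distances are all assumptions made for the example.

```python
import random
from collections import Counter

def two_step_sampling(assignments, distances, num_samples=1000, seed=0):
    """Rank sentences by how often they are drawn in two-step sampling.

    assignments: dict mapping sentence id -> id of its nearest code
    distances:   dict mapping (sentence id, code id) -> distance to that code
    """
    rng = random.Random(seed)
    # Step 1 (cluster sampling): a code's probability is proportional
    # to the number of sentences assigned to it.
    cluster_sizes = Counter(assignments.values())
    codes = sorted(cluster_sizes)
    code_weights = [cluster_sizes[code] for code in codes]
    votes = Counter()
    for _ in range(num_samples):
        code = rng.choices(codes, weights=code_weights, k=1)[0]
        # Step 2 (sentence sampling, with replacement): closer sentences
        # are more likely. Inverse distance is an assumed proximity measure.
        members = [s for s, c in assignments.items() if c == code]
        proximity = [1.0 / (distances[(s, code)] + 1e-9) for s in members]
        votes[rng.choices(members, weights=proximity, k=1)[0]] += 1
    return votes.most_common()

# Hypothetical toy data: five sentences, three codes.
assignments = {"s1": 1, "s2": 1, "s3": 1, "s4": 2, "s5": 3}
distances = {("s1", 1): 0.1, ("s2", 1): 0.5, ("s3", 1): 1.0,
             ("s4", 2): 0.3, ("s5", 3): 0.2}
ranking = two_step_sampling(assignments, distances)
```

Sentences that sit close to a popular code accumulate the most draws, so they end up at the top of the ranking.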
Figure 5: t-SNE projection of the quantized space of an eight-head QT trained on SPACE, showing all 1024 learned latent codes (best viewed in color). Darker codes correspond to lower aspect entropy, our proposed measure of high aspect-specificity. Zooming in on the aspect sub-space uncovers good aspect separation.
Figure 2: General opinion summarization with QT. All input sentences for an entity are encoded using three heads (shown in orange, blue, and green crosses). Sentence vectors are clustered under their nearest latent code (gray circles). Popular clusters (histogram) correspond to commonly occurring opinions, and are used to sample and extract the most representative sentences.
Figure 1: Block A shows the best predictions from PEGASUS and our PEGFAME (PEGASUS with FAME) model, along with the GOLD summary for an XSUM article. Block B presents diverse summaries generated from PEGASUS using top-k and nucleus sampling. Block C shows diverse summaries generated using our PEGFAME model with Focus sampling. The text in orange is not supported by the input article.
Figure 1: A unidirectional LSTM (blue, Eq. (3)) encodes the partial summary, while the multilayer perceptron network (orange, Eq. (4-5)) utilizes the text unit representation (het ), its positional embedding (gt), and the partial summary representation (st) to determine if the t-th text unit is to be included in the summary. Best viewed in color.
Figure 1: StructSum incorporates Latent Structure (LS) §2.2 and Explicit Structure (ES) §2.3 Attention to produce structure-aware representations. Here, StructSum augments the Pointer-Generator model, but the methodology that we proposed is general, and it can be applied to other encoder-decoder summarization systems.
Figure 1: A toy example of the Semantic Overlap Summarization (SOS) Task (from multiple alternative narratives). Here, an abortion issue-related event has been reported by two news media (left-wing and right-wing). “Green” text denotes the common information from both news media, while “Blue” and “Red” text denote the unique perspectives of the left and right wing. Some real examples from the benchmark dataset are provided in Table 3.
Figure 1: Effect of different properties of reference summaries. We only show correlation between BERTScore and ROUGE-2 due to limited pages. The trend is similar for all other pairs as shown in the appendix. The plots for CNNDM are more dense because of more documents in the CNNDM test set as compared to TAC. “Cov” and “Abs” stand for Coverage and Abstractiveness respectively. The trend lines in red are the 10 point and 100 point moving average for TAC and CNNDM respectively.
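The trend lines mentioned in this caption are moving averages over 10 points (TAC) and 100 points (CNNDM). A minimal sketch of such a smoother is shown below. The caption does not say whether the average is trailing or centered, nor how the points are ordered, so the simple sliding-window form and the name `moving_average` are assumptions.

```python
def moving_average(values, window):
    """Return the mean of every consecutive run of `window` values."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    running = sum(values[:window])
    averages = [running / window]
    for i in range(window, len(values)):
        # Slide the window: add the newest value, drop the oldest.
        running += values[i] - values[i - window]
        averages.append(running / window)
    return averages

# Toy input: correlations would first be sorted by the property on the x-axis.
smoothed = moving_average([1, 2, 3, 4, 5], window=3)
```

A larger window (as used for CNNDM) smooths a denser scatter plot more aggressively.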
Figure 1: p-value of Williams’ Significance Test for the hypothesis “Is the system on left (y-axis) significantly better than system on top (x-axis)”. ‘BScore’ refers to BERTScore and ‘MScore’ refers to MoverScore. A dark green value in cell (i, j) denotes metric mi has a significantly higher Pearson correlation with human scores compared to metric mj (p-value < 0.05). ‘-’ in cell (i, j) refers to the case when the Pearson correlation of mi with human scores is less than that of mj (Sec. 4.1).
Figure 1: The patterns of the main (red) and three auxiliary tasks (green). The solid line denotes the concatenation of the document and the corresponding question, which is the input of our model; the dashed line denotes the corresponding answer for each input. All tasks are in a "text-to-text" format.
Figure 1: GenCompareSum pipeline. (a) We split the document into sentences. (b) We combine these sentences into sections of several sentences. (c) We feed each section into the generative text model and generate several text fragments per section. (d) We aggregate the questions, removing redundant questions by using n-gram blocking. Where aggregation occurs, we apply a count to represent the number of textual fragments which were combined and use this as a weighting going forwards. The highest weighted textual fragments are then selected to guide the summary. (e) The similarity between each sentence from the source document and each selected textual fragment is calculated using BERTScore. (f) We create a similarity matrix from the scores calculated in the previous step. These are then summed over the textual fragments, weighted by the values calculated in step (d), to give a score per sentence. (g) The highest scoring sentences are selected to form the summary.
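Steps (f) and (g) of this pipeline amount to a weighted sum over the columns of a similarity matrix followed by a top-k selection. The sketch below illustrates only that arithmetic. The similarity values are made-up numbers standing in for BERTScore outputs, the function names are hypothetical, and returning the selected sentences in document order is an assumption not stated in the caption.

```python
def score_sentences(similarity, fragment_weights):
    """similarity[i][j] is the similarity of sentence i to fragment j;
    fragment_weights[j] is the aggregation count from step (d)."""
    return [
        sum(weight * sim for weight, sim in zip(fragment_weights, row))
        for row in similarity
    ]

def select_summary(sentences, similarity, fragment_weights, k):
    scores = score_sentences(similarity, fragment_weights)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    # Assumption: present the chosen sentences in their original order.
    return [sentences[i] for i in sorted(top)]

# Hypothetical toy data: three sentences, two selected text fragments.
sentences = ["sentence A", "sentence B", "sentence C"]
similarity = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.1]]
fragment_weights = [3, 1]
summary = select_summary(sentences, similarity, fragment_weights, k=2)
```

Here the first fragment carries three times the weight of the second, so sentences similar to it dominate the selection.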
Figure 2: Representations of the two training settings of the T5 encoder-decoder model. The left diagram shows the unsupervised pretraining task, in which a tokenized text containing masked spans is passed to the encoder and the output target of the decoder is the prediction of the masked spans. The right diagram shows the supervised downstream task, where the pre-trained model is finetuned on pairs of tokenized sequences.
Figure 6: Classification accuracies for 21 pairs of summaries. (a) Automatic classification using prototypes (by SVM) on the entire test set. The green avg SVM line is the mean accuracy of SVMs trained on the entire training set. (b) Automatic classification evaluated on 6 test articles per pair. (c) Human classification accuracy on 6 test articles per pair.
Figure 1: An illustrative example of comparative summarisation. Squares are news articles, rows denote different news outlets, and the x-axis denotes time. The shaded articles are chosen to represent AI-related news during Feb and March 2018, respectively. They aim to summarise topics in each month, and also highlight differences between the two months.
Figure 1: Illustration of the FEWSUM model that uses the leave-one-out objective. Here, prediction of the target review ri is performed by conditioning on the encoded source reviews r−i. The generator attends to the last encoder layer’s output to extract common information (in red). Additionally, the generator has partial information about ri passed by the oracle q(ri, r−i).
Figure 1: Illustration of the proposed approach. In Stage 1, all parameters of a large language model are pretrained on generic texts (we use BART). In Stage 2, we pre-train adapters (5% of the full model’s parameters) on customer reviews using held-out reviews as summaries. In Stage 3, we fine-tune the adapters on a handful of reviews-summary pairs.
Figure 2: Example of a reviewer comment rewritten as a TLDR (best viewed in color). A peer review comment often begins with a summary of the paper which annotators use to compose a TLDR. Annotators are trained to preserve the original reviewer’s wording when possible (indicated by colored spans), and to avoid using
Figure 1: An example after-visit summary generated from EHR notes associated with a patient. A novel alerting mechanism is proposed in this work to report errors found in the summary, including missing medical events and hallucinated facts. We aim to build effective detectors with self-supervision on unlabeled data for error alerting.
Figure 7: [Left] ROUGE-1 F1 improvement by incrementally applying oracle masking to the next head with most ROUGE improvement per layer on XSum and NYT. Dotted lines indicate that the newly masked heads do not have individual ROUGE improvements. [Right] ROUGE-1 recall improvement by masking all heads vs. sum of improvement by masking each head separately on XSum and NYT. Better displayed with color.
Figure 3: [Left] ROUGE-1 F1 improvement by incrementally applying oracle masking to the next head with most ROUGE improvement per layer on CNN/DM. Dotted lines indicate that the newly masked heads do not have individual ROUGE improvements. [Right] ROUGE-1 recall improvement by masking all heads vs. sum of improvement by masking each head separately on CNN/DM. Best viewed in color.
Figure 2: The distribution of entities over prior/posterior probability. Each point in the figure represents an entity (pprior(ek), ppos(ek)) and shading indicates the confidence of the classifier. (a) The distribution of entities; (b) The entity factuality classification results with KNN (k = 20) classifier. Both factual hallucinated and non-hallucinated entities are colored blue; (c) The KNN (k = 20) classification boundaries of hallucinated and non-hallucinated entities.
Figure 1: The question-summary hierarchy annotated for sentences in a reference summary paragraph. Summarization models are trained to generate the question-summary hierarchy from the document, which signifies the importance of encoding the document structure. For instance, to generate the follow-up question-summary pairs of Q1.1 and A1.1 from A1, it requires the understanding of both the content and the parent-child and sibling relations among §3, §3.1, and §3.4.
Figure 2: Multi-agent-encoder-decoder overview. Each agent a encodes a paragraph using a local encoder followed by multiple contextual layers with agent communication through concentrated messages z(k)a at each layer k. Communication is illustrated in Figure 3. The word context vectors cta are condensed into agent context c∗t . Agent specific generation probabilities, pta, enable voting for the suitable out-of-vocabulary words (e.g., ’yen’) in the final distribution.
Figure 2: Overall model architecture consisting of (M1) shared text encoder, (M2) summary decoder, and (M3) dual-view sentiment classification module. The shared text encoder converts the input review text into a memory bank. Based on the memory bank, the summary decoder generates the review summary word by word and receives a summary generation loss. The source-view (summary-view) sentiment classifier uses the memory bank (hidden states) from the encoder (decoder) to predict a sentiment label for the review (summary) and it receives a sentiment classification loss. An inconsistency loss is applied to penalize the disagreement between the source-view and summary-view sentiment classifiers.
Figure 1: Model Structure: There is one encoder, one decoder and one iterative unit (which is used to polish the document representation) in each iteration. The final labeling part generates the extraction probabilities for all sentences by combining the hidden states of the decoders in all iterations. Here we take a document consisting of three sentences as an example.
Figure 1: Ranking (descending order) of current 11 top-scoring summarization systems (abstractive models are red while extractive ones are blue). Each system is evaluated based on three diverse evaluation methods: (a) averaging each system’s in-dataset ROUGE-2 F1 scores (R2) over five datasets; (b-c) evaluating systems using our designed cross-dataset measures: stiff-R2, stable-R2 (Sec. 5). Notably, BERTmatch and BART are two state-of-the-art models for extractive and abstractive summarization respectively (highlighted by blue and red boxes).
Figure 7: Statistical significance test on the values in Tab. 8. The four positions correspond to Grammaticality, Informativeness, Relevance and Overall Quality respectively, as shown in upper left box. “=” means no statistical difference, “>” means the row performs significantly better than the column at the significance level α = 0.05, whereas “6” indicates the same at α = 0.01.
Figure 1: Overall framework of our method. w and s are the token-level and sentence-level representations. n and N (m and M) are the token number and the sentence number of the summary (pseudo reference). For multi-document summaries (i.e., K > 1), we compute relevance scores between the summary x and each document dk, and then average them as the final relevance score.
Figure 2: The gap of Spearman’s ρ between Ours(Fβ) and Ours(F1) on TAC-2011 for different |Set| and |Systems|. Positive gaps mean our Fβ can improve the performance while negative gaps indicate our Fβ degrades the performance. When changing one of them, the other is fixed. “all” means the full size is applied, i.e., 10 for |Set| and 50 for |Systems|.
Figure 3: Ablation studies for Ours(Fβ) on TAC datasets and Ours(F1) on CNNDM. “-CentralityW.” means that we remove the centrality weighting when computing relevance scores. “-HybridR.” means that we only utilize the token-level representations when calculating relevance and redundancy scores. “-Redundancy” indicates that we omit the redundancy score.
Figure 3: Two excerpts from SUMMSCREEN showing that generating summaries from TV show transcripts requires drawing information from a wide range of the input transcripts. We only show lines in the transcripts that are closely related to the shown parts of summaries. The number at the beginning of each line is the line number in the original transcript. For the first instance, we omit a few lines containing clues about the doctor taking pictures of the mansion at different times due to space constraints.
Figure 1: Excerpts from an example from SUMMSCREEN. The transcript and recap are from the TV show “The Big Bang Theory”. Generating this sentence in the recap requires discerning the characters’ feelings (clues in the transcript are underlined) about playing the board game (references are shown in red). Colored boxes indicate utterances belonging to the same conversations.
Figure 7: The interface for Writing with Model Assistance. In a Google Doc, users can see the original article at the top and can write their summary under the section “Write your summary here:”. First, the user types a sentence for their summary, then a Bot (played by a researcher who logs in with the “SumAssist Bot account”) will insert the next sentence in gray font. The Bot will also insert comments on words in the user-written sentence and suggest changes.
Figure 3: The interface for Guiding Model Output. Users can change the desired summary length and style (formal or informal) using sliders and highlight parts of the original text that they want to include in the summary. Users can press the “Generate” button to get the “AI-generated” summary based on their inputs.
Figure 2: ROUGE-1, ROUGE-2 and ROUGE-L scores for different summarization approaches. The chartreuse (yellowish green) box shows the oracle, green boxes show the proposed summarizers and blue boxes show the baselines. From left: Oracle; Citation-Context-Comm-It: Community detection on citation-context followed by iterative selection; Citation-Context-Community-Div: Community detection on citation-context followed by relevance and diversification in sentence selection; Citation-Context-Discourse-Div: Discourse model on citation-context followed by relevance and diversification; Citation-Context-Discourse-It: Discourse model on citation-context followed by iterative selection; Citation Summ.: Citation summary; MMR 0.3: Maximal marginal relevance with λ = 0.3.
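For readers unfamiliar with the MMR baseline named in this caption, the following is a minimal sketch of greedy maximal marginal relevance selection with λ = 0.3. It uses the standard formulation (λ times relevance minus (1 − λ) times the maximum similarity to already selected items). The relevance and similarity values are invented for illustration, and the caption does not specify which similarity function the baseline used.

```python
def mmr_select(relevance, similarity, k, lam=0.3):
    """Greedily pick k item indices, trading relevance against redundancy.

    relevance[i]:     relevance of sentence i to the query or document
    similarity[i][j]: similarity between sentences i and j
    """
    selected = []
    candidates = list(range(len(relevance)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            # Redundancy is the closest match among sentences chosen so far.
            redundancy = max((similarity[i][j] for j in selected), default=0.0)
            return lam * relevance[i] - (1.0 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Hypothetical data: sentences 0 and 1 are near-duplicates, 2 is different.
relevance = [0.9, 0.85, 0.3]
similarity = [
    [1.0, 0.95, 0.1],
    [0.95, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
chosen = mmr_select(relevance, similarity, k=2)
```

With a low λ such as 0.3, the redundancy penalty dominates, so the near-duplicate sentence 1 is skipped in favour of the less relevant but novel sentence 2.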
Figure 1: The blue highlighted span in the citing article (top) shows the citation text, followed by the citation marker (pink span). For this citation, the citation-context is the green highlighted span in the reference article (bottom). The text spans outside the scope of the citation text and citation-context are not highlighted.
Figure 1: Overview of our model. The word-level RNN is shown in blue and section-level RNN is shown in green. The decoder also consists of an RNN (orange) and a “predict” network for generating the summary. At each decoding time step t (here t=3 is shown), the decoder forms a context vector ct which encodes the relevant source context (c0 is initialized as a zero vector). Then the section and word attention weights are respectively computed using the green “section attention” and the blue “word attention” blocks. The context vector is used as another input to the decoder RNN and as an input to the “predict” network which outputs the next word using a joint pointer-generator network.
Figure 6: Comparison of ROUGE scores of the Features Only, SAFNet and SFNet models when trained with (bars on the left) and without (bars on the right) AbstractROUGE, evaluated on CSPubSum Test. The FNet classifier suffers a statistically significant (p=0.0279) decrease in performance without the AbstractROUGE metric.
Figure 4: Comparison of the accuracy of each model on CSPubSumExt Test and ROUGE-L score on CSPubSum Test. ROUGE Scores are given as a percentage of the Oracle Summariser score which is the highest score achievable for an extractive summariser on each of the papers. The wider bars in ascending order are the ROUGE scores. There is a statistically significant difference between the performance of the top four summarisers and the 5th highest scoring one (unpaired t-test, p=0.0139).
Figure 2: The framework of our model. There are three major components: (1) The sliding encoder generates a representation of each sentence in the current window. (2) The memory layer infuses history information into sentence representations via graph neural networks. (3) The prediction layer aggregates learned features to compute the binary sentence labels.
Figure 7: Comparison between the output of our full model (top) and the ablated model (bottom). We use underlined text to denote model-selected sentences and bold text to denote the ground truth sentences. The ablated model selects repetitive content in the 4-th window and noisy content in the 5-th window.
Figure 2: Compression model used for plausibility and salience modeling (§3.3). We extract candidate spans ci ∈ C(T ) to delete, then compute span embeddings with pre-trained encoders (only one span embedding shown here). This embedding is then used to predict whether the span should be kept or deleted.
Figure 1: Decomposing span-based compression into plausibility and salience (§2). Plausible compressions (underlined) must maintain grammaticality, thus [to the ... wineries]PP is not a candidate. Salience identifies low-priority content from the perspective of this dataset (highlighted). Constituents both underlined and highlighted are deleted.
Figure 3: The extractive model uses three separate encoders to create representations for the reference document sentences, context tokens, and topics. These are combined through an attention mechanism, encoded at a document level, and passed through a feed-forward layer to compute an extraction probability for each reference sentence.
Figure 4: Example outputs from the abstractive model that uses the context. The model often copies sequences from the references which are sometimes correct (top) or incorrect but sensible (bottom), highlighting the difficulty of automatic evaluation. (Documents shortened for space. Sentences which are underlined were selected by the extraction step.)
Figure 5: The VB+NSUBJ category selects tuples of verbs and their corresponding NSUBJ dependents in the dependency tree. In this example, 2/4 of the alignment (the solid lines) can be explained by matches between such tuples. The dashed lines cannot: the “and” alignment is not part of any tuple, and since “ran” and “sprinted” are not aligned, their corresponding tuples are not considered to be aligned, so the “Reese” match does not count toward the total.
Figure 1: Example answers selected by the three strategies. The only SCU marked by annotators for this sentence is SCU4, which does not include information about the location of the attacks. Therefore, an answer selection strategy that chooses “Baghdad” enables generating a QA pair such as QA3, which probes for information not included in the Pyramid annotation.
Figure 6: (Top) The distribution of the proportion of the QAEval-F1 score that is explained by SCU matches. (Bottom) The percentage of summaries with a score explained by a given proportion of SCU matches. We find that QAEval can be explained by SCU matches far more than ROUGE or BERTScore on average.
Figure 4: An example correct answer predicted by the model that is scored poorly by the EM or F1 QA metrics (both would assign a score of 0 or near 0). This occurs because the answer and prediction are drawn from two different summaries, and the same event is referred to in different ways in each one.
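To see why lexical QA metrics can score a correct answer at or near 0, consider the commonly used exact-match (EM) and token-level F1 definitions. The sketch below follows the familiar SQuAD-style normalization (lowercasing, stripping punctuation and English articles). It is an assumption that the metrics in the figure use exactly this normalization, and the example strings are hypothetical rather than taken from the figure.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, strip punctuation and articles, and split into tokens."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()

def exact_match(prediction, answer):
    return float(normalize(prediction) == normalize(answer))

def token_f1(prediction, answer):
    pred, gold = normalize(prediction), normalize(answer)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

# Hypothetical paraphrases of the same event: no shared content tokens.
em_paraphrase = exact_match("the bombing", "the attack")
f1_paraphrase = token_f1("the bombing", "the attack")
# Partial lexical overlap yields partial credit.
f1_partial = token_f1("the Baghdad attacks", "attacks in Baghdad")
```

Both metrics depend only on shared tokens, so two different descriptions of the same event receive no credit even when a human would judge the answer correct.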
Figure 5: The Pearson correlations between the scores of several ROUGE variants, APES, and QAEval variants on TAC’08. The results support similar findings of Eyal et al. (2019), namely, that the ROUGE metrics are highly correlated to each other but have low correlation to the QA-based metrics, suggesting the two types of metrics offer complementary signals.
Figure 3: A comparison of the correlations of QAEval-F1 on a subset of TAC’08 using expert-written and model-generated questions. Each point represents the average correlation calculated using 30 samples of {2, 4, 6, 8, 10} instances, plotted with 95% error bars. System-level correlations were calculated against the summarizers’ average responsiveness scores across the entire TAC’08 dataset. We hypothesize the model questions perform better due to their verbosity, which causes more keywords to be included in the question that the QA model can match against the summary.
Figure 4: Every token alignment used by ROUGE or BERTScore is assigned to one or more interpretable categories (defined in §5). This allows us to calculate, for this example, that matches between named-entities contribute 1/4 to the overall score, stopwords 2/4, and noun phrases 3/4 (assuming alignment weights of 1.0).
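The per-category fractions in this caption can be reproduced with a small calculation: each alignment carries a weight and a set of categories, and a category's contribution is the weight of the alignments tagged with it divided by the total weight. The four alignments below are hypothetical tags chosen only so that the result matches the 1/4, 2/4, and 3/4 quoted in the caption; the actual alignments in the figure are not recoverable from the text.

```python
def category_contributions(alignments):
    """alignments: non-empty list of (weight, set_of_categories) pairs."""
    total = sum(weight for weight, _ in alignments)
    contributions = {}
    for weight, categories in alignments:
        # An alignment may count toward several categories at once.
        for category in categories:
            contributions[category] = contributions.get(category, 0.0) + weight
    return {category: w / total for category, w in contributions.items()}

# Hypothetical tagging of four alignments, each with weight 1.0.
alignments = [
    (1.0, {"named-entity", "noun-phrase"}),
    (1.0, {"stopword", "noun-phrase"}),
    (1.0, {"stopword"}),
    (1.0, {"noun-phrase"}),
]
shares = category_contributions(alignments)
```

Because categories can overlap, the shares do not need to sum to 1.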
Figure 1: Both candidate summaries are similar to the reference, but along different dimensions: Candidate 1 contains some of the same information, whereas candidate 2’s information is different, but it at least discusses the correct topic. The goal of this work is to understand if summarization evaluation metrics’ scores should be interpreted as measures of information overlap or, less desirably, topic similarity.
Figure 6: The results of running the PERM-BOTH hypothesis test to find a significant difference between metrics’ Pearson correlations with the Bonferroni correction applied per dataset and correlation level pair instead of per metric (as in Figure 5). A blue square means the test returned a significant p-value at α = 0.05, indicating the row metric has a higher correlation than the column metric. An orange outline means the result remained significant after applying the Bonferroni correction.
Figure 5: The results of running the PERM-BOTH hypothesis test to find a significant difference between metrics’ Pearson correlations. A blue square means the test returned a significant p-value at α = 0.05, indicating the row metric has a higher correlation than the column metric. An orange outline means the result remained significant after applying the Bonferroni correction.
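The difference between a blue square and an orange outline in these figures is the Bonferroni correction, which divides the significance level by the number of tests in the family. A minimal sketch follows. The p-values are invented, and how tests are grouped into a family (per metric, or per dataset and correlation level pair) is determined by the original analysis, not by this code.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Return, for each test in one family, whether it stays significant
    after comparing its p-value against alpha / (number of tests)."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]

# Hypothetical family of four pairwise metric comparisons.
p_values = [0.001, 0.02, 0.04, 0.2]
uncorrected = [p < 0.05 for p in p_values]
corrected = bonferroni_significant(p_values)
```

A larger family makes the threshold stricter, which is why regrouping the tests can change which results keep their outline.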
Figure 3: The distribution of the proportion of ROUGE (top) and BERTScore (bottom) on TAC 2008 that can be explained by token matches that are labeled with the same SCU (Eq. 5). The averages, 25% and 15% (in red), indicate that only a small amount of their scores is between phrases that express the same information.
Figure 4: The 95% confidence intervals for ρSUM (blue) and ρSYS (orange) calculated using Kendall’s correlation coefficient on TAC’08 (left) and CNN/DM summaries (middle, Fabbri et al. (2021); right, Bhandari et al. (2020)) are rather large, reflecting the uncertainty about how well these metrics agree with human judgments of summary quality.
Figure 6: The system- and summary-level Pearson correlations as the number of available reference summaries increases. 95% confidence error bars are shown, but may be too small to see. PyrEval is missing data because the official implementation requires at least two references. Even with one reference summary, QAEval-F1 maintains a higher system-level correlation than ROUGE.
Figure 1: In the answer verification task, the metrics score how likely two phrases from different contexts have the same meaning. Here, the metrics at the bottom score the similarity between “emergency responders,” which was used to generate the question from the source text, and “paramedics,” the predicted answer from a QA model in the target text.
Figure 3: Bootstrapped estimates of the stabilities of the system rankings for automatic metrics and human annotations on SummEval (left) and REALSumm (right). The τ value quantifies how similar two system rankings would be if they were computed with two random sets of M input documents. When all Mtest test instances are used, the automatic metrics’ rankings become near constant. The error regions represent ±1 standard deviation.
Figure 4: 95% confidence intervals for rSYS calculated with the BOOT-INPUTS resampling method when the system rankings for the automatic metrics are calculated using only the judged data (orange) versus the entire test set (blue). Scoring systems with more summaries leads to better (more narrow) estimates of rSYS.
Figure 10: rSYS∆(ℓ, u) correlations for various combinations of ℓ and u (see §4.2) for ROUGE (top), BERTScore (middle), and QAEval (bottom) on SummEval (left) and REALSumm (right). The values of ℓ and u were chosen so that each value in the heatmaps evaluates on 10% more system pairs than the value to its left. For instance, the first row evaluates on 10%, 20%, . . . , 100% of the system pairs. The second row evaluates on 10%, 20%, . . . , 90% of the system pairs, never including the 10% of pairs which are closest in score. The first row of each of the heatmaps is plotted in Fig. 6. The correlations on realistic score differences between systems are in the upper left portion of the heatmaps and contain the lowest correlations overall. Evaluating on all pairs is the top-rightmost entry, and the “easiest” pairs (those separated by a large score margin) are in the bottom right.
Figure 9: The rSYS∆(ℓ, u) correlations on the SummEval (top) and REALSumm (bottom) datasets for ℓ = 0 and various values of u for ROUGE-1, ROUGE-2, and ROUGE-L. The u values were chosen to select the 10%, 20%, . . . , 100% of the pairs of systems closest in score. Each u is displayed on the top of each plot.
Figure 6: The rSYS∆(ℓ, u) correlations on the SummEval (top) and REALSumm (bottom) datasets for ℓ = 0 and various values of u (additional combinations of ℓ and u can be found in Appendix B). The u values were chosen to select the 10%, 20%, . . . , 100% of the pairs of systems closest in score. Each u is displayed on the top of each plot. For instance, 20% of the (N choose 2) system pairs on SummEval are separated by < 0.5 ROUGE-1, and the system-level correlation on those pairs is around 0.08. As more systems are used in the correlation calculation, the allowable gap in scores between system pairs increases, and the pairs are therefore likely easier to rank, resulting in higher correlations.
Figure 1: Overview pipeline of the proposed model which is executed simultaneously in two phases (a). The first phase encodes the sentences with pre-trained BERT and uses [CLS] information as the input of a graph attention layer (b). The second phase encodes the word and sentence nodes as the inputs of a heterogeneous graph layer (c). The output of the two phases is concatenated and put into an MLP layer in order to classify labels for each sentence.
Figure 1: Average of ROUGE-1,2,L F1 scores on the Daily Mail validation set within one epoch of training on the Daily Mail training set. The x-axis (multiply by 2,000) indicates the number of data examples the algorithms have seen. The supervised labels in SummaRuNNer are used to estimate the upper bound.
Figure 2: Model comparisons of the average value for ROUGE-1,2,L F1 scores (f) on Dearly and Dlate. For each model, the results were obtained by averaging f across ten trials with 100 epochs in each trial. Dearly and Dlate consist of 50 articles each, such that the good summary sentences appear early and late in the article, respectively. We observe a significant advantage of BANDITSUM compared to RNES and RNES3 (based on the sequential binary labeling setting) on Dlate.
Figure 2: Sentence positions in source document for extractive summaries generated by different models on the PubMed validation set. Documents on the x-axis are ordered by increasing article length from shortest to longest. We also see a similar trend on arXiv (the plots with more details can be found in the appendix).
Figure 4: Comparison of the flat fully-connected graph used in Erkan and Radev (2004); Mihalcea and Tarau (2004); Zheng and Lapata (2019) to the hierarchical graph used in our models (b) and (c). Although the section-section multiplication reduces the edge computation proportionally to the number of sections, we found it oversimplifies the graph by assuming independence between sentences across different sections. Our final model loosens the assumption by including section-sentence connections as shown in sub-figure (c).
Figure 1: Example of a hierarchical document graph constructed by our approach on a toy document that contains two sections {T1, T2}, each containing three sentences for a total of six sentences {s1, . . . , s6}. Each double-headed arrow represents two edges with opposite directions. The solid and dashed arrows indicate intra-section and inter-section connections respectively. When compared to the flat fully-connected graph of traditional methods, our use of hierarchy effectively reduces the number of edges from 60 to 24 in this example.
Figure 3: The attention weight changes by using the contrastive attention mechanism. (a) is the average attention weights of the third layer of the baseline Transformer, (b) is that of “Transformer+ContrastiveAttention”, and (c) is the opponent attention derived from the fifth head of the third layer.
Figure 1: The decision diagram of our human annotation process. Decision nodes are rectangular and outcome nodes are circular. We show the annotation path of two summary sentences, S1 (green arrows) and S2 (red arrows). S2 is annotated as nonsensical thus is not considered for faithfulness. S1 is annotated as unfaithful due to hallucinated content.
Figure 2: Overview of FEQA. Given a summary sentence and its corresponding source document, we first mask important text spans (e.g. noun phrases, entities) in the summary. Then, we consider each span as the “gold” answer and generate its corresponding question using a learned model. Lastly, a QA model finds answers to these questions in the documents; its performance (e.g. F1 score) against the “gold” answers from the summary is taken as the faithfulness score.
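The final scoring step of this pipeline can be sketched as follows. This is a minimal illustration, not the FEQA implementation: it assumes a SQuAD-style token-level F1 with whitespace tokenization, and it assumes the per-question scores are averaged. The names `token_f1` and `faithfulness_score` are hypothetical.

```python
from collections import Counter


def token_f1(predicted: str, gold: str) -> float:
    """Token-level F1 between a QA answer and the masked "gold" span."""
    pred_tokens = predicted.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def faithfulness_score(qa_answers, gold_answers):
    """Average F1 over all generated questions (averaging is an assumption)."""
    scores = [token_f1(p, g) for p, g in zip(qa_answers, gold_answers)]
    return sum(scores) / len(scores) if scores else 0.0
```

Here `qa_answers` would hold the answers a QA model extracts from the source document, and `gold_answers` the spans masked in the summary sentence.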
Figure 2: Compression constraints on an example sentence. (a) RST-based compression structure like that in Hirao et al. (2013), where we can delete the ELABORATION clause. (b) Two syntactic compression options from Berg-Kirkpatrick et al. (2011), namely deletion of a coordinate and deletion of a PP modifier. (c) Textual units and requirement relations (arrows) after merging all of the available compressions. (d) Process of augmenting a textual unit with syntactic compressions.
Figure 5: Counts on a 1000-document sample of how frequently both a document prefix baseline and a ROUGE oracle summary contain sentences at various indices in the document. There is a long tail of useful sentences later in the document, as seen by the fact that the oracle sentence counts drop off relatively slowly. Smart selection of content therefore has room to improve over taking a prefix of the document.
Figure 1: ILP formulation of our single-document summarization model. The basic model extracts a set of textual units with binary variables xUNIT subject to a length constraint. These textual units u are scored with weights w and features f. Next, we add constraints derived from both syntactic parses and Rhetorical Structure Theory (RST) to enforce grammaticality. Finally, we add anaphora constraints derived from coreference in order to improve summary coherence. We introduce additional binary variables xREF that control whether each pronoun is replaced with its antecedent using a candidate replacement rij. These are also scored in the objective and are incorporated into the length constraint.
Figure 3: Modifications to the ILP to capture pronoun coherence. It, which refers to Kellogg, has several possible antecedents from the standpoint of an automatic coreference system (Durrett and Klein, 2014). If the coreference system is confident about its selection (above a threshold α on the posterior probability), we allow for the model to explicitly replace the pronoun if its antecedent would be deleted (Section 2.2.1). Otherwise, we merely constrain one or more probable antecedents to be included (Section 2.2.2); even if the coreference system is incorrect, a human can often correctly interpret the pronoun with this additional context.
Figure 1: A comparison of token-wise information content within a document as estimated by GPT-2 in 4 scenarios: the document on its own, the document given the document, the document given a high quality summary, and the document given a low quality summary. Tokens with a darker background color have more information.
Figure 2: Distributions of Shannon Score and Information Difference on 100 summaries from the CNN/DailyMail dataset. Three different summaries are used: the original human written reference summary (in blue), the original summary with words scrambled (in orange), and a reference summary for a different document in the dataset (in green).
Figure 4: Example 4134 from the CNN/Daily Mail test set. Colors and underlines in the source reflect differences between baseline and our model attention weights: Red and a single underline reflects words attended by baseline model and not our model, Green and double underline reflects the opposite. Entities in bold in the target summary are answers to the example questions.
Figure 2: An illustration of our automatic dataset pipeline which mirrors the manual pipeline for data augmentation. Given a question and answers, relevant sentences are selected and clustered. Then, the cluster centroid sentence of non-singleton clusters is removed from the input to use as bullet point summaries.
Figure 4: Two alternative sentences from generated summaries, one correct and one incorrect, for the given source sentence. All tested NLI models predict very high entailment probabilities for the incorrect sentence, with only BERT estimating a slightly higher probability for the correct alternative.
Figure 1: Two news articles with color-coded encoder attention-based document segmentations, and selected words for illustration (left), the abridged news article (top right) and associated aspect-specific model summaries (bottom right). Top: Article from our synthetic corpus with aspects sport, tvshowbiz and health. The true boundaries are known, and indicated by black lines in the plot and ‖ in the article. Bottom: Article from the RCV1 corpus with document-level human-labeled aspects sports, news and tvshowbiz (gold segmentation unknown).
Figure 2: Visualization of our three aspect-aware summarization models, showing the embedded input aspect (red), word embeddings (green), latent encoder and decoder states (blue) and attention mechanisms (dotted arrows). Left: the decoder aspect attention model; Center: the encoder attention model; Right: the source-factors model.
Figure 1: Generated abstracts for a biology article (from the Bio subset of our arXiv dataset). Abstracts are ranked from most (top) to least likely (bottom) using the generator model. Abstracts with better narrative structure and domain-specific content (such as the circled abstract) are often out-ranked in terms of likelihood by abstracts with factual errors and less structure.
Figure 1: Distribution of common factual error types in sampled generated summaries (96.37% of all errors). We draw from the same error types for our controlled analysis to ensure we match the true distribution of errors. Here extrinsic entity refers to entities that did not previously appear in the source, while an intrinsic entity appeared in the source.
Figure 4: Part of an EDUA solution graph. Each vertex is a segment vector from a reference summary, indexed by Summary.ID (si), Sentence.ID (sij), Segmentation.ID (sijk), Segment.ID (sijkm). All segments of all reference summaries have a corresponding node. All edges connect segments from different summaries with similarity ≥ tedge. This schematic representation of a partial solution contains three fully connected subgraphs with attraction edges (solid lines), each representing an SCU, whose weight is the number of vertices (segments).
Figure 5: Formal specification of EDUA’s input graph G consisting of all segments from all segmentations of reference summary sentences (item 2), the objective (item 6), and three scores for defining the objective function that are assigned to candidate SCUs (item 3), sets of SCUs of the same weight (item 4), and a candidate pyramid (item 5).
Figure 6: A directed Depth First Search tree for EDUAC. Nodes are cliques representing candidate SCUs, as illustrated in Figure 4, labeled by their weights. Each DFS path is a partition over one way to segment all the input summaries and group all segments into SCUs. The solution is the path with the highest AP.
Figure 1: Alignment of a single PyrEval SCU of weight 5 to a manual SCU of weight 4 from a dataset of student summaries. The manual and automated SCUs express the same content, and their weights differ only by one. For each of five reference summaries (RSUM1-RSUM5), exact matches of words between the PyrEval and manual contributor are in bold, text in plain font (RSUM2, RSUM4) appears only in the manual version, and text in italics appears only in the PyrEval version. Paraphrases of the same content from RSUM4 were identified by human annotators (plain font) and PyrEval (italics). Also shown is a matching segment from a student summary, where the student used synonyms of some of the words in the reference summaries.
Figure 2: PyrEval preprocessors segment sentences from reference (RSUM) and evaluation (ESUM) summaries into clause-like units, then convert them to latent vectors. EDUA constructs a pyramid from RSUM vectors (lower left): the horizontal bands of the pyramid represent SCUs of decreasing weight (shaded squares). WMIN matches SCUs to ESUM segments to produce a raw score, and three normalized scores.
Figure 1: Overview of our summarization model. As shown, “bilateral” in the FINDINGS is a significant ontological term which has been encoded into the ontology vector. After refining FINDINGS word representation, the decoder computes attention weight (highest on “bilateral”) and generates it in the IMPRESSION.
Figure 2: Histograms and arrow plots showing differences between IMPRESSION of 100 manually-scored radiology reports. Although challenges remain to reach human parity for all metrics, 81% (a), 82% (b), and 80% (c) of our system-generated Impressions are as good as human-written Impressions across different metrics.
Figure 1: Excerpt from “The Silence of the Lambs”. The scene heading INT. THE PANEL TRUCK - NIGHT denotes that the action takes place inside the panel truck at night. Character cues (e.g., MAN, CATHERINE) preface the lines the actors speak. Action lines describe what the camera sees (e.g., We can’t get a good glimpse of his face, but his body. . . ).
Figure 1: Examples from the synthetic and human annotated factuality datasets. The entity-centric and generation-centric approaches produce bad summaries from processes which can label their errors. All models can be adapted to give word-level, dependency-level, or sentence-level highlights, except for Gen-C.
Figure 8: Example showing loss modification to improve abstractiveness. The table shows which tokens are retained (green checkmark) or dropped (red cross) from the loss computation at different training stages. During later stages of the training, when loss truncation is applied, copied tokens are excluded from the loss.
Figure 7: N-gram overlap of the generated summaries in CNNDM and MEDIASUM. Initializing from BART-XSUM offers no benefits over the baseline. On the other hand, loss truncation is successful at enforcing abstractiveness; generated summaries for both datasets are closer to the target abstractiveness of reference summaries.
Figure 3: Pairwise significance test outcomes for BLEU, best-performing ROUGE (rows 2-9), and ROUGE applied in Hong et al. (2014) (bottom 3 rows), with (ST1) and without (ST0) stemming, with (RS1) and without (RS0) removal of stop words, for average (A) or median (M) ROUGE precision (P), recall (R) or f-score (F), colored cells denote significant win for row i metric over column j metric with Williams test.
Figure 4: Summarization system pairwise significance test outcomes (paired t-test) for state-of-the-art (top 7 rows) and baseline systems (bottom 5 rows) of Hong et al. (2014) evaluated with best-performing ROUGE variant: average ROUGE2 precision with stemming and stop words removed, colored cells denote a significant greater mean score for row i system over column j system according to paired t-test.
Figure 1: Our proposed modification of a multi-layer transformer architecture. The input sequence is composed of K blocks of tokens. Each transformer layer is applied within the blocks, and a bidirectional GRU network propagates information in the whole document by updating the [CLS] representation of each block.
Figure 4: Density and coverage distributions across the different domains and existing datasets. NEWSROOM contains diverse summaries that exhibit a variety of summarization strategies. Each box is a normalized bivariate density plot of extractive fragment coverage (x-axis) and density (y-axis), the two measures of extraction described in Section 4.1. The top left corner of each plot shows the number of training set articles n and the median compression ratio c of the articles. For DUC and New York Times, which have no standard data splits, n is the total number of articles. Above, top left to bottom right: Plots for each publication in the NEWSROOM dataset. We omit TMZ, Economist, and ABC for presentation. Below, left to right: Plots for each summarization dataset showing increasing diversity of summaries along both dimensions of extraction in NEWSROOM.
Figure 3: Procedure to compute the set F(A,S) of extractive phrases in summary S extracted from article A. For each sequential token of the summary, si, the procedure iterates through tokens of the text, aj. If tokens si and aj match, the longest shared token sequence after si and aj is marked as the extraction starting at si.
Figure 1: The process of attention optimization (better view in color). The original attention distribution (red bar on the left) is updated by the refinement gate rt, and attention on some irrelevant parts is lowered. Then the updated attention distribution (blue bar in the middle) is further supervised by a local variance loss to obtain the final distribution (green bar on the right).
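The greedy matching procedure described in this caption can be sketched in Python. This is an illustrative reading of the caption rather than the authors' released code. The helper names are hypothetical, and the `coverage` and `density` functions assume the usual definitions (sum of fragment lengths, and sum of squared fragment lengths, each divided by the summary length).

```python
def extractive_fragments(article, summary):
    """Greedily collect the longest article spans that each summary position starts."""
    fragments = []
    i = 0
    while i < len(summary):
        best = []
        j = 0
        while j < len(article):
            if summary[i] == article[j]:
                # Extend the match as far as both token sequences agree.
                k = 0
                while (i + k < len(summary) and j + k < len(article)
                       and summary[i + k] == article[j + k]):
                    k += 1
                if k > len(best):
                    best = summary[i:i + k]
                j += k
            else:
                j += 1
        if best:
            fragments.append(best)
        # Skip past the extracted span, or past one unmatched token.
        i += max(len(best), 1)
    return fragments


def coverage(article, summary):
    """Fraction of summary tokens that belong to an extractive fragment."""
    frags = extractive_fragments(article, summary)
    return sum(len(f) for f in frags) / len(summary)


def density(article, summary):
    """Average squared fragment length per summary token."""
    frags = extractive_fragments(article, summary)
    return sum(len(f) ** 2 for f in frags) / len(summary)
```

For example, with the article tokens `the cat sat on the mat` and the summary tokens `a cat sat on a mat`, the fragments are `cat sat on` and `mat`.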
Figure 1: Overview of our multi-task model with parallel training of three tasks: abstractive summary generation (SG), question generation (QG), and entailment generation (EG). We share the ‘blue’ color representations across all the three tasks, i.e., second layer of encoder, attention parameters, and first layer of decoder.
Figure 2: The UI for content evaluation with highlight. Judges are given an article with important words highlighted using a heat map. Judges can also remove less important highlight colors by sliding the scroller at the left of the page. At the right of the page, judges give the recall and precision assessment by sliding the scroller from 1 to 100 based on the given summary quality.
Figure 2: Two-stage model diagram. The aspect classifier assigns aspect labels for each reference sentence Rij from references R with a threshold λ. Sentences are then grouped according to the assigned labels, which are fed to the summarization model. Groups about irrelevant aspects (i.e., a2) are ignored. Finally, the summarization model outputs summaries for each relevant aspect.
Figure 5: Results for the purpose factor questions. Specific purpose factor in italics. Answer type in brackets: MC = Multiple Choice, MR = Multiple Response, LS = Likert Scale. ** indicates significance (𝜒2), after Bonferroni correction, with 𝑝 ≪ 0.001, * with 𝑝 < 0.05. † indicates noteworthy results where significance was lost after correction for the number of tests. If two options are flagged, these options are not significantly different from each other, yet both were chosen significantly more often than the other options.
Figure 4: Results for the input factor questions. Specific input factor in italics. Answer type in brackets: MC = Multiple Choice, MR = Multiple Response. ** indicates significance (𝜒2), after Bonferroni correction, with 𝑝 ≪ 0.001. If two options are flagged with **, these options are not significantly different from each other, yet both have been chosen significantly more often than the other options.
Figure 1: Comparison of extractive, abstractive, and our unified summaries on a news article. The extractive model picks the most important sentences, but they are incoherent or not concise (see blue bold font). The abstractive summary is readable and concise but still loses or mistakes some facts (see red italics font). The final summary rewritten from fragments (see underline font) combines the advantages of both the extractive (importance) and abstractive (coherence; see green bold font) approaches.
Figure 5: Visualizing the consistency between sentence and word attentions on the original article. We highlight word (bold font) and sentence (underline font) attentions. We compare our methods trained with and without inconsistency loss. Inconsistent fragments (see red bold font) occur when trained without the inconsistency loss.
Figure 7: Model prediction distributions of each rating group from logistic regression (LR), deep averaging networks (DAN), and Longformer. Only the Longformer model can properly distinguish sentences in different score ranges. LR and DAN are not robust to input length shift, where models are trained with the full input of 10 reviews but are tested with sentences.
Figure 3: Wasserstein distance between model predictions of summary sentences and all sentences of the first ten reviews. Lower values indicate better representativeness. Error bars represent standard errors. DecSum (1, 1, 1) is significantly better than other approaches, including DecSum (1, 0, 1), with p-value ≤ 0.0001 with paired t-tests.
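The distance reported in this figure can be illustrated with a short sketch. It assumes the one-dimensional, first-order Wasserstein distance between two empirical samples of scalar model predictions, computed as the area between their empirical CDFs. The function name `wasserstein_1d` is hypothetical, and the paper may compute the quantity differently.

```python
from bisect import bisect_right


def wasserstein_1d(xs, ys):
    """First-order Wasserstein distance between two 1-D empirical samples."""
    xs, ys = sorted(xs), sorted(ys)
    # The CDFs only change at observed values, so integrate between them.
    points = sorted(set(xs) | set(ys))
    total = 0.0
    for left, right in zip(points, points[1:]):
        cdf_x = bisect_right(xs, left) / len(xs)
        cdf_y = bisect_right(ys, left) / len(ys)
        total += abs(cdf_x - cdf_y) * (right - left)
    return total
```

In this setting, `xs` would be the predictions on the selected summary sentences and `ys` the predictions on all sentences of the input reviews, so a smaller value means the summary covers the prediction distribution more closely.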
Figure 1: Illustration of the selected sentences by different methods on the distribution of model predictions on all individual sentences. Our method (DecSum) covers the full distribution, while PreSumm, a text-only summarization method, concentrates on the right side, and integrated gradients, a model-explanation method, misses the middle part.
Figure 2: The overall architecture of our proposed method with graph and contrastive learning. An example input and output at steps t−1 and t are shown in the figure, where the top is the backbone sequence-to-sequence paradigm with a graph to store relation information between critical words and the bottom is the contrastive learning module with specific positive and negative examples. m refers to a mask vector.
Figure 1: Illustration of the latent space Z and text space X . The de facto standard approach in unsupervised opinion summarization uses the simple average of input review vectors zreview (◦) to obtain the summary vector zavg (▴). The simply averaged vector zavg tends to be close to the center (i.e., has a small L2-norm) in the latent space, and a generated summary xavg (⬩) tends to become overly generic. Our proposed framework COOP finds a better aggregated vector to generate a more specific summary xCOOP(▪) from the latent vector zCOOP (⋆).
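The claim that a simple average drifts toward the center of the latent space can be checked numerically. The sketch below uses randomly drawn vectors as stand-ins for review encodings (an assumption, since no real encoder is involved) and compares the L2-norm of their average with the mean L2-norm of the inputs. The helper names are hypothetical.

```python
import math
import random


def l2_norm(vector):
    """Euclidean length of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in vector))


def simple_average(vectors):
    """Coordinate-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


random.seed(0)
# Stand-in "review vectors": 8 samples from a 32-dimensional standard Gaussian.
review_vectors = [[random.gauss(0.0, 1.0) for _ in range(32)] for _ in range(8)]

z_avg = simple_average(review_vectors)
mean_input_norm = sum(l2_norm(z) for z in review_vectors) / len(review_vectors)

print(f"mean input norm: {mean_input_norm:.3f}")
print(f"norm of average: {l2_norm(z_avg):.3f}")
```

By the triangle inequality the norm of the average can never exceed the mean input norm, and for vectors pointing in different directions it is typically much smaller.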
Figure 7: Example of summaries generated by BIMEANVAE with SimpleAvg and COOP for reviews about a product on the Yelp dataset. The colors denote the corresponding opinions, and struck-through reviews in gray were not selected by COOP for summary generation (Note that SimpleAvg uses all the input reviews.) Terms that are more specific to the entity are underlined.
Figure 11: Example of summaries generated by BIMEANVAE with SimpleAvg and COOP for reviews about a product on the Amazon dataset. The colors denote the corresponding opinions, and struck-through reviews in gray were not selected by COOP for summary generation (Note that SimpleAvg uses all the input reviews.) Terms that are more specific to the entity are underlined. Red and struck-through text denotes hallucinated content that has the opposite meaning compared to the input.
Figure 5: Sentence Annotation Task. By showing sentences of the same aspect category, it is easier for annotators to compare two groups of sentences (from two entities). To further facilitate the annotation process, we also provide several additional features, such as allowing workers to group sentences that contain the same token through double clicking, and to highlight sentences through hovering over the sentence label.
Figure 2: Illustration of Co-decoding: (a) For contrastive summary generation, distinctive words are emphasized by contrasting the token probability distribution of target entity against that of the counterpart entity. (b) For common summary generation, entity-pair-specific words are highlighted by aggregating token probability distributions of all base models to alleviate the overly generic summary generation issue.
Figure 7: Examples of generated summaries and induced latent discourse trees for long reviews. (a) shows a movie review. The 4th sentence mentions the whole positiveness. The 10th describes that the contents are easy to follow, while the 20th to 22nd show the detail of the contents. The 27th mentions the performance and accurate portrayal, and the 8th and 16th elaborate on the latter and the former, respectively. (b) presents a pocket knife review. The 11th, 13th, 15th, and 21st sentences concisely describe the goodness in each aspect. The 14th, 24th, and 28th elaborate on the parents.
Figure 6: Examples of generated summaries and induced latent discourse trees for negative reviews. (a) shows a board game review. The induced tree shows that the 1st and 6th sentences present additional information about the generated summary. While the 1st to 4th indicate the heaviness of the game, the 5th and 6th criticize the artwork. The 2nd, 3rd, and 4th present the additional information about the parent. (b) presents a movie review. The 1st and 2nd sentences describe the whole evaluation, while 6th and 7th strengthen the opinion. The 3rd to 5th mention the boring points in detail. Although our model catches the negativeness, the summary is redundant probably because each sentence in the body is relatively long.
Figure 1: Outline of our approach. (1) The latent distribution of review sentences is represented as a recursive GMM and trained in an autoencoding manner. Then, (2) the topic sentences are inferred by decoding each Gaussian component. An example of a restaurant review and its corresponding gold summary are displayed.
Figure 2: Illustration of word and sentence level attention in the second decoder step (Eq. 1 and Eq. 2). Purple: attention on words, Orange: attention on sentences, Unidirectional dotted arrows: attention from previous step, Bidirectional arrows: attention from previous and to present step. Best viewed in color.
Figure 1: Baseline model repeats itself twice (italic), and fails to find all salient information (highlighted in red in the original text) from the source text that is covered by our 2-decoder model. The summary generated by our 2-decoder model also recovers most of the information mentioned in the reference summary (highlighted in blue in the reference summary).
Figure 3: The graph layer consists of a graph attention mechanism and a feed-forward network. Through the graph attention, each node merges the neighbor relations. The neighbor relations are represented as triples, and the incoming relations and the outgoing relations are obtained through different mappings, which are marked in red and green color respectively in the Figure.
Figure 1: Social bias in automatic summarization: We take steps toward evaluating the impact of the gender, age, and race of the humans involved in the summarization system evaluation loop: the authors of the summaries and the human judges or raters. We observe significant group disparities, with lower performance when systems are evaluated on summaries produced by minority groups. See §3 and Table 1 for more details on the Rouge-L scores in the bar chart.
Figure 3: Intersection of averaged summary sentence overlaps across the sub-aspects. We use First for Position, ConvexFall for Diversity, and N-Nearest for Importance. The number in parentheses, called Oracle Recall, is the average ratio of oracle sentences that are NOT chosen by the union of the three sub-aspect algorithms. Other corpora are in the Appendix with their Oracle Recalls: Newsroom (54.4%), PubMed (64.0%) and MScript (99.1%).
Figure 4: PCA projection of extractive summaries chosen by multiple aspects of algorithms (CNNDM). Source and target sentences are black circles and cyan triangles, respectively. The blue, green, and red circles are summary sentences chosen by First, ConvexFall, and NN, respectively. The yellow triangles are the oracle sentences. The shaded polygon represents a ConvexHull volume of a sample source document. Best viewed in color. Please find more examples in the Appendix.
Figure 5: Sentence overlap proportion of each sub-aspect (row) with the oracle summary across corpora (column). The y-axis is the frequency of overlapped sentences with the oracle summary. The x-axis is the normalized RANK of individual sentences in the input document, where the bin size is 0.05. E.g., the first / the most diverse / the most important sentence is in the first bin. If earlier bars are frequent, the aspect is positively relevant to the corpus.
Figure 1: Sentence extractor architectures: a) RNN, b) Seq2Seq, c) Cheng & Lapata, and d) SummaRunner. The � indicates attention. Green blocks represent sentence encoder output and red blocks indicate learned “begin decoding” embeddings. Vertically stacked yellow and orange boxes indicate extractor encoder and decoder hidden states respectively. Horizontal orange and yellow blocks indicate multi-layer perceptrons. The purple blocks represent the document and summary state in the SummaRunner extractor.
Figure 2: Left scatter-plot shows two-dimensional visualization of outputs generated from 10 models on basis of multi-dimensional scaling (Cox and Cox, 2008), and right list shows their contents. Each point in plot represents sentence embedding of corresponding output, and label indicates model ID and ROUGE-1, i.e., “ID (ROUGE).” Color intensity means score of kernel density estimation of PostCosE (see right color bar), and outputs are sorted by scores. Reference and input are as follows. Each bold word in above list means co-occurrence with reference below. Reference: interpol asks world govts to make rules for global policing Input: top interpol officers on wednesday asked its members to devise rules and procedures for policing at the global level and providing legal status to red corner notices against wanted fugitives .
Figure 1: Estimator model architecture used in COMES. Source, reference and hypothesis are all independently encoded with a pre-trained encoder. Pooling layer is used to create sentence embeddings from sequences of token embeddings. In the COMES variant, the last feed-forward layer has 4 outputs, corresponding to different summary evaluation dimensions. Dashed lines are used to indicate the reference-less variant. For the full COMET description see Rei et al. (2020).
Figure 1: The network architecture with the decoder factorized into separate contextual and language models. The reference vector, composed of context vectors $c^{tmp}_t$, $c^{int}_t$, and the hidden state of the contextual model $h^{dec}_t$, is fused with the hidden state of the language model and then used to compute the distribution over the output vocabulary.
Figure 2: The Summary Loop involves three neural models: Summarizer, Coverage and Fluency. Given a document and a length constraint, the Summarizer writes a summary. Coverage receives the summary and a masked version of the document, and fills in each of the masks. Fluency assigns a writing quality score to the summary. The Summarizer model is trained, other models are pretrained and frozen.
Figure 4: Histogram and average copied span lengths for abstractive summaries. A summary is composed of novel words and word spans of various lengths copied from the document. Summary Loop summaries copy shorter spans than prior automatic systems, but do not reach abstraction levels of human-written summaries.
Figure 1: Example document with an inconsistent summary. When running each sentence pair (Di, Sj) through an NLI model, S3 is not entailed by any document sentence. However, when running the entire (document, summary) at once, the NLI model incorrectly predicts that the document highly entails the entire summary.
Figure 2: Diagram of the SUMMACZS (top) and SUMMACCONV (bottom) models. Both models utilize the same NLI Pair Matrix (middle) but differ in their processing to obtain a score. The SUMMACZS is Zero-Shot, and does not have trained parameters. SUMMACCONV uses a convolutional layer trained on a binned version of the NLI Pair Matrix.
Figure 1: Extractiveness of generated outputs versus automated metric scores for Entailment, FactCC and DAE on the Gigaword dataset. We use coverage defined in Grusky et al. (2018) to measure extractiveness, where summaries with higher coverage are more extractive. We observe that automated metrics of faithfulness are positively correlated with extractiveness.
Figure 2: Faithfulness-Abstractiveness trade-off curves. The blue dots represent the quartile models used to generate the curve. The purple dot corresponds to the baseline. DAE and Loss Truncation are depicted by the brown and orange dots respectively. The green dots correspond to our proposed systems.
Figure 4: User experience plots for task difficulty, “I found it difficult to summarize the article well”, frustration, “Performing the summarization tasks was frustrating”, and assistance utility, “The provided summaries were not useful to me when I was performing the summarization tasks” for Reddit (Left) and XSum (Right). Responses were collected using 7 point rating scales.
Figure 3: The first attention head from the l-th layer is dedicated to coreferring mentions. The head encourages tokens of the same PoC to share similar representations. Our results suggest that the attention head of the 5-th layer achieves competitive performance, while most heads perform better than the baseline. The findings are congruent with (Clark et al., 2019) that provides a detailed analysis of BERT’s attention.
Figure 1: An example of a news story in our data set. The short manual summary is marked with a red rectangle. The blue rectangle shows a post from a user. The green rectangle shows a link to a related news story. Some posts may only include comments, reactions, etc. without a link to a related news story.
Figure 1: Comparison of sentence-level attention distributions for the summaries in Table 1 on a news article. (a) is the heatmap for the gold reference summary, (b) is for the Seq2seq-baseline system, (c) is for the Point-gen-cov (See et al., 2017) system, (d) is for the Hierarchical-baseline system and (e) is for our system. Ii and Oi indicate the i-th sentence of the input and output, respectively. Obviously, the seq2seq models, including the Seq2seq-baseline model and the Point-gen-cov model, lose much salient information of the input document and focus on the same set of sentences repeatedly. The Hierarchical-baseline model fails to detect several specific sentences that are salient and relevant for each summary sentence and focuses on the same set of sentences repeatedly. On the contrary, our method with structural regularizations focuses on different sets of source sentences when generating different summary sentences and discovers more salient information from the document.
Figure 1: The framework of our model. The entailment-aware encoder is learned by jointly training summarization generation (left part of (a), which is a seq2seq model) and entailment recognition (right part of (a), in which sentence pairs in the entailment recognition corpus are encoded as u and v). The entailment-aware decoder is learned by entailment RAML training, in which the summary will be rewarded if it is entailed by the source sentence.
Figure 3: Comparisons of structural-compression and structural-coverage analysis results on random samples from CNN/Daily Mail datasets, which demonstrate that both the Seq2seq-baseline model and the Hierarchical-baseline model are not yet able to capture them properly, but our model with structural regularizations achieves similar behavior with the gold reference summary.
Figure 2: The architecture of User-aware Sequence Network (USN). USN encodes two kinds of user information, user embedding (u) and user-specific vocabulary memory (U), into its two basic modules (User-aware Encoder and User-aware Decoder). ① and ② show strategies based on user embedding, and represent User Selection strategy and User Prediction strategy, respectively. ③ and ④ indicate strategies based on user-specific vocabulary memory and represent User Memory Prediction strategy and User Memory Generation strategy, respectively.
Figure 3: The architecture of Attribute-aware Sequence Network (ASN). ASN encodes two kinds of attribute information, attribute embedding (a) and attribute-specific vocabulary memory (A), into its two basic modules (Attribute-aware Review Encoder and Attribute-aware Summary Decoder). ① and ② show strategies based on attribute embedding, and represent Attribute Selection strategy and Attribute Prediction strategy, respectively. ③ and ④ indicate strategies based on attribute-specific vocabulary memory, and represent Attribute Memory Prediction strategy and Attribute Memory Generation strategy, respectively.
Figure 2: The framework of our model with co-selective encoding. During training, a BiLSTM reads the original sentence $(x_1, x_2, \cdots, x_n)$ and the ground-truth keywords $(k_1, k_2, \cdots, k_m)$ into the first-level hidden states $h^r_i$ and $h^k_i$. A jointly trained keyword extractor takes $h^r_i$ as the input to predict whether the input word is a keyword or not. The co-selective encoding layer builds the second-level hidden states $h^{r\prime}_i$ and $h^{k\prime}_i$. Then the summary is generated via dual-attention and dual-copy for both the original sentence and the keyword sequence. During testing, the ground-truth keywords are replaced by the keywords predicted by our trained keyword extractor.
Figure 1: An example from the Gov-Report dataset to introduce the process of our method. “...” refers to the omissions of context sentences due to space limitations. Highlighted sentences refer to the final extracted summary sentences. Each arrow points to the facet description of the semantic block on its left. Bold facets represent vital facet-aware semantic blocks of the final summary.
Figure 1: Model architecture for sequence-to-sequence with coarse-to-fine attention. The left side is the encoder that reads the document, and the right side is the decoder that produces the output sequence. On the encoder side, the top-level hidden states are used for the coarse attention weights, while the word-level hidden states are used for the fine attention weights. The context vector is then produced by a weighted average of the word-level states. In HIER, we average over the coarse attention weights, thus requiring computation of all word-level hidden states. In C2F, we make a hard decision for which chunk of text to use, and so we only need to compute word-level hidden states for one chunk.
Figure 1: The architecture of the self-attention layers used in the T-DMCA model. Every attention layer takes a sequence of tokens as input and produces a sequence of similar length as the output. Left: Original self-attention as used in the transformer-decoder. Middle: Memory-compressed attention, which reduces the number of keys/values. Right: Local attention, which splits the sequence into individual smaller sub-sequences. The sub-sequences are then merged together to get the final output sequence.
Figure 1: Architecture of the original BERT model (left) and BERTSUM (right). The sequence on top is the input document, followed by the summation of three kinds of embeddings for each token. The summed vectors are used as input embeddings to several bidirectional Transformer layers, generating contextual vectors for each token. BERTSUM extends BERT by inserting multiple [CLS] symbols to learn sentence representations and using interval segmentation embeddings (illustrated in red and green color) to distinguish multiple sentences.
Figure 1: SimCLS framework for two-stage abstractive summarization, where Doc, S, Ref represent the document, generated summary and reference respectively. At the first stage, a Seq2Seq generator (BART) is used to generate candidate summaries. At the second stage, a scoring model (RoBERTa) is used to predict the performance of the candidate summaries based on the source document. The scoring model is trained with contrastive learning, where the training examples are provided by the Seq2Seq model.
Figure 3: The overall framework of SEGTRANS model. The blue circles indicate input source text, where dark blue circles indicate paragraph boundaries. The yellow circles indicate output target text, where orange circles indicate heading boundaries. Dotted red lines indicate attention heads with segmentation-aware attention mechanism and dotted blue lines indicate attention heads with original full attention mechanism.
Figure 3: Performance comparison (BART vs. BRIO-Mul) w.r.t. reference summary novelty. The x-axis represents different buckets of test examples grouped by reference summary novelty (Eq. 11). Larger x-coordinates correspond to examples of which the reference summaries have higher novelty. The left figure shows the performance improvement of our model compared with the baseline model, while the right one shows model performance.
Figure 2: Visualization of the encoder-decoder attention weights. The x-axis is the encoder input, including prompts across the encoder $P_{en}$ and the source document $X$. The y-axis is the decoder input, including prompts across the decoder $P_{de}$ and the target summary $Y$. The area in the red box represents the attention that $P_{de}$ assigns to $P_{en}$. The area in the yellow box represents the attention that $Y$ assigns to $X$. Darker colors indicate stronger associations between tokens.
Figure 1: Comparison of MLE loss (LMLE) and the contrastive loss (LCtr) in our method. MLE assumes a deterministic (one-point) distribution, in which the reference summary receives all the probability mass. Our method assumes a nondeterministic distribution in which system-generated summaries also receive probability mass according to their quality. The contrastive loss encourages the order of model-predicted probabilities of candidate summaries to be coordinated with the actual quality metric M by which the summaries will be evaluated. We assign the abstractive model a dual role – a single model could be used both as a generation model and a reference-free evaluation model.
Figure 1: One example from the segmentation-based summarization task SEGNEWS. The news article is taken from a CNN news article and we truncate the article for display. CNN editors have divided this article into several sections and written a heading for each section. The goal of this task is to automatically identify sub-topic segments of multiple paragraphs, and generate the heading-style summary for each segment. Dotted lines in the figure indicate segment boundaries. In this article, paragraphs 1,2 are annotated as the first segment, paragraphs 3,4 are annotated as the second segment, paragraphs 5,6 are annotated as the third segment, and paragraphs 7,8 are annotated as the fourth segment. To the right of the article are the heading-style summaries for segments. Since the first segment is usually an overview of the news, we do not assign a summary to it.
Figure 3: The architecture of our proposed model for abstractive summarization. Our model consists of three parts: 1. Transformer Encoder-Decoder, 2. Entity Pointer Network, 3. Relation Pointer Network. The encoder in Transformer Encoder-Decoder shares parameters with that in Relation Pointer Network.
Figure 1: A sample document with corresponding summaries generated by different abstractive summarization methods, in which extrinsic hallucinations are marked in yellow, and intrinsic hallucinations are marked in blue. Note that the results of PTGEN [25] and TCONVS2S [20] come from Maynez et al. [18].
Figure 1: (a) Attention distribution is composed of the summation of cross attention on the same-colored lines, distinguished from that of different-colored lines which always equals 1 due to softmax. (b) Local attention gradually increases as the decoding proceeds. (c) Desired situation: the growing local attention stays lower than global attention during decoding and reaches it exactly at the end.
Figure 1: Examples of ∆(y,y′) of the original MRT and ∆̃(y,y′) of GOLC where ROUGE-1 recall is calculated based on unigrams. In the two examples, a reference y is ⟨malaysia, markets, closed, for, holiday⟩ and a sampled summary y′ is ⟨markets, in, malaysia, closed, for, holiday⟩ and cb(y) = len(' '.join(y)) = 35 and cb(y′) = len(' '.join(y′)) = 38.
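The length cost and ROUGE-1 recall for the example in this caption can be computed directly. The snippet below is a minimal sketch, assuming whitespace-joined ASCII tokens; `cb` mirrors the expression in the caption, while `rouge1_recall` is a hypothetical helper written for illustration and is not the GOLC implementation.

```python
from collections import Counter

def cb(tokens):
    """Length in characters of the space-joined token sequence."""
    return len(" ".join(tokens))

def rouge1_recall(reference, candidate):
    """Clipped unigram overlap divided by the number of reference unigrams."""
    ref, cand = Counter(reference), Counter(candidate)
    overlap = sum(min(n, cand[w]) for w, n in ref.items())
    return overlap / len(reference)

y = ["malaysia", "markets", "closed", "for", "holiday"]
y_prime = ["markets", "in", "malaysia", "closed", "for", "holiday"]

print(cb(y), cb(y_prime))         # 35 38
print(rouge1_recall(y, y_prime))  # 1.0
```

The sampled summary covers every reference unigram, so its recall is perfect even though it is three characters longer than the reference.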
Figure 5: The first three figures show the ground-truth and estimated FAR scores via human-annotated FAMs and machine-created FAMs. The fourth figure shows the fitting of linear regression on the human-annotated samples (LR-Small) and the prediction on the whole test set of CNN/Daily Mail (LR-Large). Systems are sorted in an ascending order by the ground-truth FAR on the human-annotated samples.
Figure 2: Visualization of the learned node embeddings during testing at each epoch. Red nodes are words (light) and sentences (heavy) in labels of summary, while blue nodes are related to words (light) and sentences (heavy) in non-summaries. Purple nodes are words shared by sentences between summaries and non-summaries.
Figure 1: The overview architecture of the MuchSUM with three specific graph convolutional channels and a common convolutional channel shared by the three graph channels. We denote three specific channels as Node Lexical Feature Encoding Channel (A, X𝑠), Node Centrality Feature Encoding Channel (A, X𝑐) and Node Position Feature Encoding Channel (A, X𝑝). In the bipartite word-sentence heterogeneous graph, each sentence node (solid node) is connected to its contained word-related nodes (hollow nodes) and takes the weight of the relation as their edge feature. Different thicknesses of edges represent different edge weights.
Figure 1: (a) BERTSUMEXTABS model. An encoder encodes the document, and a word generator generates the next word given previous words, while paying attention to the document. (b) Sentence planner model. A shared encoder separately encodes the document and each sentence of the summary generated so far. The sentence generator takes the summary sentence embeddings and predicts the next sentence embedding, which the word generator is then conditioned on. Both generators integrate document information through attention.
Figure 1: Hallucinations in extreme document summarization: the abbreviated article, its gold summary and the abstractive model generated summaries (PTGEN, See et al. 2017; TCONVS2S, Narayan et al. 2018a; and, GPTTUNED, TRANS2S and BERTS2S, Rothe et al. 2020) for a news article from the extreme summarization dataset (Narayan et al., 2018a). The dataset and the abstractive models are described in Section 3 and 4. We also present the [ROUGE-1, ROUGE-2, ROUGE-L] F1 scores relative to the reference gold summary. Words in red correspond to hallucinated information whilst words in blue correspond to faithful information.
Figure 2: Illustration of our summarization system. The model extracts the most relevant sentences from the document by taking into account the WordEncoder representation of the current sentence $e(s_i)$, the SentEncoder representation of the previous sentence $h^s_i$, the current summary state representation $o^s_i$, and the representation of the document $e(D)$. If a sentence is selected ($z_i = 1$), its representation is fed to SentStates, and we move to the next sentence. Here, sentences $s_1$ and $s_3$ were selected. If the model is also compressing, the compressive layer selects words for the final summary (Compressive Decoder). See Figure 3 for details on the decoders.
Figure 3: Decision decoder architecture. Decoder contains an extractive level for sentences (orange box) and a compressive level for words (dashed gray box), using an LSTM to model the summary state. Red diamond shapes represent decision variables zi = 1 if p(zi ∣ pi) > 0.5 for selecting the sentence si, and zi = 0 if p(zi ∣ pi) ≤ 0.5 for skipping this sentence. The same for yij and p(yij ∣ qij) > 0.5 for deciding over words wij to keep in the summary.
Figure 3: Hierarchical encoder with hierarchical attention: the attention weights at the word level, represented by the dashed arrows, are re-scaled by the corresponding sentence-level attention weights, represented by the dotted arrows. The dashed boxes at the bottom of the top layer RNN represent sentence-level positional embeddings concatenated to the corresponding hidden states.
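The re-scaling described in this caption can be sketched as follows. This is a minimal illustration, assuming that each word weight is multiplied by the weight of the sentence containing it and that the result is renormalized to sum to 1; the function name and the toy weights are hypothetical.

```python
def rescale_word_attention(word_attn, sent_attn, sent_of_word):
    """Multiply each word weight by its sentence's weight, then renormalize."""
    scaled = [w * sent_attn[s] for w, s in zip(word_attn, sent_of_word)]
    total = sum(scaled)
    return [v / total for v in scaled]

# Four words in two sentences; the first sentence receives more attention.
word_attn = [0.25, 0.25, 0.25, 0.25]   # assumed word-level weights
sent_attn = [0.75, 0.25]               # assumed sentence-level weights
sent_of_word = [0, 0, 1, 1]            # index of the sentence holding each word

rescaled = rescale_word_attention(word_attn, sent_attn, sent_of_word)
print(rescaled)  # [0.375, 0.375, 0.125, 0.125]
```

Words in the more heavily attended sentence end up with proportionally larger weights, even though the word-level weights started out uniform.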
Figure 2: Switching generator/pointer model: When the switch shows ’G’, the traditional generator consisting of the softmax layer is used to produce a word, and when it shows ’P’, the pointer network is activated to copy the word from one of the source document positions. When the pointer is activated, the embedding from the source is used as input for the next time-step as shown by the arrow from the encoder to the decoder at the bottom.
Figure 1: SummaRuNNer: A two-layer RNN based sequence classifier: the bottom layer operates at word level within each sentence, while the top layer runs over sentences. Double-pointed arrows indicate a bi-directional RNN. The top layer with 1’s and 0’s is the sigmoid activation based classification layer that decides whether or not each sentence belongs to the summary. The decision at each sentence depends on the content richness of the sentence, its salience with respect to the document, its novelty with respect to the accumulated summary representation and other positional features.
Figure 2: Visualization of SummaRuNNer output on a representative document. Each row is a sentence in the document, while the shading-color intensity is proportional to its probability of being in the summary, as estimated by the RNN-based sequence classifier. In the columns are the normalized scores from each of the abstract features in Eqn. (6) as well as the final prediction probability (last column). Sentence 2 is estimated to be the most salient, while the longest one, sentence 4, is considered the most content-rich, and not surprisingly, the first sentence the most novel. The third sentence gets the best position based score.
Figure 5: Negative log likelihood per subword token on two q-a pairs from the QAGen model according to the summary (blue) and input document (orange). Higher means less likely. The first q-a pair (top figure) has a much higher average negative log likelihood according to the input document than according to the summary.
Figure 1: Extractive summarization model with reinforcement learning: a hierarchical encoder-decoder model ranks sentences for their extract-worthiness and a candidate summary is assembled from the top ranked sentences; the REWARD generator compares the candidate against the gold summary to give a reward which is used in the REINFORCE algorithm (Williams, 1992) to update the model.
Figure 4: Finetuning results on the XSum validation set using one of the base-sized pretrained models: PEGASUS, FROST(F), and FROST(P+F). All pretrained models were trained for 1.5m steps. See text for more details. We only report on a subset of measures, similar observations were made for other measures.
Figure 5: Finetuning results on the XSum (in blue) and CNN/DailyMail (in red) validation sets at various steps during pretraining FROST-Large. Instead of pretraining from scratch, we start with a PEGASUS-Large checkpoint, and continue pretraining for additional 1.5m steps with the planning objective. We report finetuning results for the PEGASUS finetuned baseline and our models at 0.1m, 1m, and 1.5m steps.
Figure 1: The overview of our framework, in which the backbone is in charge of generating two summaries for a document. Then the oracle selects which summary is better for a given document. The reward model afterward transforms the oracle’s preference into a discrete signal to optimize the backbone. Our framework contains two novel components: efficient sampling from offline data and the preference-guided reward model.
Figure 1: Histogram of the position of sentences selected by our method and PacSum on CNN/DM. PacSum uses position information which allows it to take advantage of the lead bias. In contrast, our method is position-agnostic but still captures the fact that earlier sentences are more important in news articles.
Figure 3: Correlation between metrics and human judgement on subsets of data. The x- and y-axes represent the human judgement and the metric scores, respectively. The red line is a linear regression fitted on full data. Each dotted line is a linear regression fitted on a model-dataset subset. Each colored point has coordinates equal to the average factuality judgement and metric score for its corresponding partition.
Figure 1: We propose a linguistically grounded typology of factual errors. We select crowd workers to annotate summaries from two datasets according to this typology achieving near perfect agreement with experts. We collect FRANK, the resulting dataset, to benchmark factuality metrics and state-of-art summarization systems.
Figure 1: Example of the QFS dataset and a constructed graph. Query nodes are denoted by blue circles, and document nodes by yellow circles. Root words in red letters indicate important words in a query and each sentence. The nodes in the purple dotted rectangle are especially important to generate a summary.
Figure 1: Comparing the uni-, bi-, and tri-gram novelty for the medium sized datasets. These datasets contain generated sequences up to 128 tokens in length. The methods are as follows: NLL (baseline), RwB-Hinge, RISK-2, and RISK-3. The unique average n-gram novelty (n-grams that do not appear in the source text) is shown to increase across the board compared to the standard NLL baseline.
Figure 2: Comparison of each method for the full-data approach over a medium size dataset (CNN/DM). The methods are as follows: NLL (baseline), RwB-Hinge, RISK-2, and RISK-3. We see that the reinforcement learning approaches have led, on average, to higher ROUGE-L scores for the longer summaries compared to the NLL baseline.
Figure 1: Illustration of the encoder and decoder attention functions combined. The two context vectors (marked “C”) are computed from attending over the encoder hidden states and decoder hidden states. Using these two contexts and the current decoder hidden state (“H”), a new word is generated and added to the output sequence.
Figure 1: The blue distribution represents the score distribution of summaries available in the human judgment datasets of TAC-2008 and TAC-2009. The red distribution is the score distribution of summaries generated by modern systems. The green distribution corresponds to the score distribution of summaries we generated in this work as described in section 3.
Figure 2: Visualization of the effectiveness of using passage nodes to enhance sentence representation. The degree of highlighting expresses the importance of the passage in the document. Underlined sentences are model-selected summaries. As a result, the selected sentences belong to passages that have high scores of α (Equation 8).
Figure 1: Our approach for abstractive summarization of a scientific article. An older version of this paper is shown as the reference document. First, a sentence pointer network extracts important sentences from the paper. Next, these sentences are provided along with the whole scientific article to be arranged in the following order: Introduction, extracted Sentences, abstract & the rest of the paper. A transformer language model is trained on articles organized in this format. During inference, the introduction and the extracted sentences are given to the language model as context to generate a summary. In domains like news and patent documents, the introduction can be replaced by the entire document.
Figure 2: The model contains three main modules: 1) Local Encoder: is composed of an Entity Encoder and a Sentence Encoder, the embeddings of entities and sentences are the initial features of graph nodes; 2) Heterogeneous Graph Encoder: an iteratively computed graph with FacetWeight; and 3) Extraction & Postprocess: ranks sentences while minimizing redundancy with Trigram Blocking.
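Trigram Blocking, named in the post-processing step of this caption, is commonly implemented by skipping any candidate sentence that shares a trigram with the sentences already selected. The sketch below shows that common formulation under simple whitespace tokenization; the function names and the example sentences are hypothetical, and the exact variant used by the model is not specified in the caption.

```python
def trigrams(sentence):
    """Set of word trigrams in a lowercased, whitespace-tokenized sentence."""
    tokens = sentence.lower().split()
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}

def select_with_trigram_blocking(ranked_sentences, max_sentences):
    """Walk sentences in rank order, skipping those that repeat a seen trigram."""
    selected, seen = [], set()
    for sent in ranked_sentences:
        if len(selected) >= max_sentences:
            break
        grams = trigrams(sent)
        if grams & seen:
            continue  # overlaps with the summary so far, so block it
        selected.append(sent)
        seen |= grams
    return selected

ranked = [
    "the cat sat on the mat",
    "yesterday the cat sat on a chair",
    "dogs bark loudly at night",
]
summary = select_with_trigram_blocking(ranked, 2)
print(summary)
```

Here the second sentence is blocked because it repeats the trigram "the cat sat", so the third sentence is selected instead.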
Figure 1: The complete pipeline of the proposed method. In the first step, we split the input text into sentences by using a regular expression handcrafted specifically for scientific documents. In the second step, we compute the sentence embeddings of the parsed sentences using SBERT. In the third step, we create a graph by comparing all the pairs of sentence embeddings obtained using cosine similarity. In the fourth step, we rank the sentences by the degree centrality in the generated graph. In the fifth and final step, we only keep a certain number of sentences or words to adjust to the length requirements of the summary.
Figure 2: The process of graph generation and ranking of the sentences. Every node in the generated complete graph represents a sentence in the document and the weight of each edge is given by the similarity between the nodes it connects. The importance of the sentence in the document is modelled as $\mathrm{rank}(s_i) = \sum_{j=1}^{n} 1 - \mathrm{sim}(e_i, e_j)$, where $e_i$ and $e_j$ are the corresponding SBERT sentence embeddings of $s_i$ and $s_j$.
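The ranking formula in this caption can be illustrated with a short sketch. It assumes that the summand is read as (1 − sim(e_i, e_j)) and that sim is cosine similarity; the toy 2-d vectors stand in for SBERT sentence embeddings and are not real model outputs.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def rank_scores(embeddings):
    """rank(s_i) = sum over j of (1 - sim(e_i, e_j)), following the caption."""
    return [
        sum(1.0 - cosine(e_i, e_j) for e_j in embeddings)
        for e_i in embeddings
    ]

# Assumed toy embeddings: sentences 0 and 1 are identical, sentence 2 is orthogonal.
embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
scores = rank_scores(embeddings)
print(scores)  # [1.0, 1.0, 2.0]
```

Under this reading, the sentence that is least similar to the others receives the largest value, so the direction in which the scores are sorted for selection matters and should be checked against the method description.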
Figure 1: Architecture of the HiStruct+ model. The model consists of a base TLM for sentence encoding and two stacked inter-sentence Transformer layers for hierarchical contextual learning with a sigmoid classifier for extractive summarization. The two blocks shaded in light-green are the HiStruct injection components.
Figure 3: Two samples for human evaluation and case analysis of the extractive summaries predicted by the HiStruct+ model and the baseline model, in comparison with the gold summary (i.e., the abstract of the paper). The first sample is selected from the arXiv dataset, while the second sample is from PubMed. Top-7 sentences with the highest predicted scores are extracted, and then combined in their original order to construct a final summary. Their linear indices within the original document are shown in the second row of each table. The texts highlighted in yellow are the key words and the main content that appear in the gold summary. The phrases highlighted in green indicate typical parts of a scientific paper such as summary and future work. Sentences are split by ’<q>’.
Figure 2: Example input sentence and the generated summary. The score of generating yi+1 (terrorism) is based on the context yc (for . . . against) as well as the input x1 . . .x18. Note that the summary generated is abstractive which makes it possible to generalize (russian defense minister to russia) and paraphrase (for combating to against), in addition to compressing (dropping the creation of), see Jing (2002) for a survey of these editing operations.
Figure 3: Orange crosses show the objective score optimized by exhaustive search minus the objective score optimized by FCHC. Blue pluses show the ROUGE-L difference between exhaustive search and FCHC. Plotted for the 1135 instances in the headline generation test set, where the source sentence has 30 words or fewer.
Figure 1: Summarizing a sentence x by hill climbing. Each row is a Boolean vector $a_t$ at search step $t$. A black cell indicates a word is selected, and vice versa. Randomly swapping two values in the Boolean vector yields a new summary that is scored by an objective function that measures language fluency and semantic similarity. If the new summary increases the objective, this summary is accepted as the current best solution. Rejected solutions are not depicted.
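The search loop described in this caption can be sketched in a few lines. This is a minimal sketch under stated assumptions: `toy_objective` is a placeholder that simply prefers longer words, standing in for the fluency and similarity objective mentioned in the caption, and the initial selection, step count, and seed are arbitrary illustrative choices.

```python
import random

def hill_climb(words, summary_len, objective, steps=200, seed=0):
    """Swap two entries of a Boolean selection vector; keep swaps that score higher."""
    rng = random.Random(seed)
    n = len(words)
    selected = [i < summary_len for i in range(n)]  # assumed initial state
    best = objective(words, selected)
    for _ in range(steps):
        i, j = rng.sample(range(n), 2)
        if selected[i] == selected[j]:
            continue  # swapping equal values would not change the summary
        candidate = selected[:]
        candidate[i], candidate[j] = candidate[j], candidate[i]
        score = objective(words, candidate)
        if score > best:  # accept only strict improvements
            selected, best = candidate, score
    return selected, best

def toy_objective(words, selected):
    """Placeholder objective (assumption): total length of the selected words."""
    return sum(len(w) for w, keep in zip(words, selected) if keep)

words = "the markets in malaysia closed for a holiday".split()
mask, score = hill_climb(words, 3, toy_objective)
summary = [w for w, keep in zip(words, mask) if keep]
print(summary, score)
```

Because a swap exchanges one selected and one unselected position, the number of selected words stays fixed, which is how the summary length is held constant throughout the search.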
Figure 4: Positional bias for different systems, calculated for the headline generation test set. The source sentence is divided into four areas: 0–25%, 25–50%, 50–75%, and 75-100% of the sentence. The y-axis shows the normalized frequency of how often a word in the summary is extracted from one of the four source sentence areas.
Figure 3: Distribution of the log probabilities of answerability – i.e. log(1 − QA( |T, q)) – for two QA models. 1) solid lines: a model trained on SQuAD-v2 without the negative sampled examples. 2) dashed lines: a model trained on SQuAD-v2 with the negative sampled examples. The evaluated samples belong to three distinct categories: 1) answerable, 2) unanswerable questions (but present in SQuAD-v2) and 3) the negatively sampled ones (as described in §5.1).
Figure 1: Illustration of the QUESTEVAL framework: the blue area corresponds to the precision-oriented framework proposed by Wang et al. (2020). The orange area corresponds to the recall-oriented SummaQA (Scialom et al., 2019). We extend it with a weighter component for an improved recall (red area). The encompassing area corresponds to our proposed unified approach, QUESTEVAL.
Figure 1: Comparison of output of 3 abstractive summarization models on a news article. The baseline model makes factual errors, a nonsensical sentence and struggles with OOV words muhammadu buhari. The pointer-generator model is accurate but repeats itself. Coverage eliminates repetition. The final summary is composed from several fragments.
Figure 3: Pointer-generator model. For each decoder timestep a generation probability pgen ∈ [0,1] is calculated, which weights the probability of generating words from the vocabulary, versus copying words from the source text. The vocabulary distribution and the attention distribution are weighted and summed to obtain the final distribution, from which we make our prediction. Note that out-of-vocabulary article words such as 2-0 are included in the final distribution. Best viewed in color.
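The weighted sum described in this caption can be written out as a short sketch. It assumes the standard pointer-generator mixture, in which the vocabulary distribution is scaled by p_gen and the attention over source positions is scaled by (1 − p_gen); the function name and the toy probabilities are hypothetical.

```python
def final_distribution(p_gen, vocab_dist, attention, source_tokens):
    """Mix generation and copy probabilities into one distribution over words."""
    # Generation part: scale every vocabulary probability by p_gen.
    final = {w: p_gen * p for w, p in vocab_dist.items()}
    # Copy part: add (1 - p_gen) * attention to each source token's entry,
    # creating entries for out-of-vocabulary source words as needed.
    for tok, a in zip(source_tokens, attention):
        final[tok] = final.get(tok, 0.0) + (1.0 - p_gen) * a
    return final

vocab_dist = {"germany": 0.5, "beat": 0.25, "argentina": 0.25}  # assumed values
source_tokens = ["germany", "won", "2-0"]
attention = [0.25, 0.25, 0.5]                                   # assumed values

dist = final_distribution(0.5, vocab_dist, attention, source_tokens)
print(dist["2-0"])         # 0.25
print(sum(dist.values()))  # 1.0
```

The out-of-vocabulary token "2-0" receives probability only through the copy term, which is why it can still appear in the final distribution.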
Figure 9: The baseline model incorrectly substitutes dutch for new zealand (perhaps reflecting the European bias of the dataset), fabricates irish, and struggles with out-of-vocabulary words saili and aucklandbased. Though it is not clear why, the phrase addition to our backline is changed to the nonsensical addition to their respective prospects. The pointer-generator model fixes these accuracy problems, and the addition of coverage fixes the repetition problem. Note that the final model skips over large passages of text to produce shorter sentences.
Figure 13: The baseline model appropriately replaces stumped with novel word mystified. However, the reference summary chooses flummoxed (also novel) so the choice of mystified is not rewarded by the ROUGE metric. The baseline model also incorrectly substitutes 600,000 for 25. In the final model’s output we observe that the generation probability is largest at the beginning of sentences (especially the first verb) and on periods.
Figure 8: The baseline model reports the wrong score 6-3, substitutes bedene for thiem and struggles with the uncommon word assimilation. The pointer-network models accurately reproduce the out-of-vocabulary words thiem and aljaz. Note that the final model produces the novel word defeated to incorporate several fragments into a single sentence.
Figure 15: The baseline model fabricates a completely false detail about a u.n. peacekeeping force that is not mentioned in the article. This is most likely inspired by a connection between U.N. peacekeeping forces and northern sinai in the training data. The pointer-generator model is more accurate, correctly reporting the reshuffle of several senior military positions.
Figure 14: The baseline model incorrectly changes thwart criminals and others contributing to nigeria’s instability to destabilize nigeria’s economy – which has a mostly opposite meaning. It also produces a nonsensical sentence. Note that our final model produces the novel word says to paraphrase told cnn ‘s christiane amanpour.
Figure 12: Baseline model replaces cecily strong with mariah carey, and produces generally nonsensical output. The baseline model may be struggling with the out-of-vocabulary word beetlejuice, or perhaps the unusual non-news format of the article. Note that the final model omits – ever so slightly – from its first sentence.
Figure 2: Baseline sequence-to-sequence model with attention. The model may attend to relevant words in the source text to generate novel words, e.g., to produce the novel word beat in the abstractive summary Germany beat Argentina 2-0 the model may attend to the words victorious and win in the source text.
Figure 11: The baseline model makes several factual inaccuracies: it claims porto beat bayern munich not vice versa, the score is changed from 7-4 to 2-0, jackson is changed to james and a heroes reception is replaced with a trophy. Our final model produces sentences that are individually accurate, but they do not make sense as a whole. Note that the final model omits the parenthesized phrase ( left ) from its second sentence.
Figure 1: Alignment visualization of our model when decoding “closes”. Posterior alignment is more accurate for model interpretation. In contrast, the prior alignment probability is spread to “announced” and “closure”, which can be manually controlled to generate desired summaries. Decoded samples are shown when aligned to “announced” and “closure” respectively. Highlighted source words are those that can be directly aligned to a target token in the gold summary.
Figure 2: Architecture of the generalized pointer. The same encoder is applied to encode the source and target. When decoding “closes”, we first find top-k source positions with the most similar encoded state. For each position, the decoding probability is computed by adding its word embedding and a predicted relation embedding.
Figure 3: Example of attention heatmap between document sentences (rows) and gold summary sentences (columns). s[0]: The illusion is the work of German body-painting artist Joerg Duesterwald, who spent hours painting his model. s[1]: Stunning set of pictures was taken in front of a rockface in a forest in Langenfeld, Germany, yesterday. Best viewed in color.
Figure 1: Overview of our approach to create self-supervised pre-training datasets from unlabelled scientific documents. The aspect-based summarization model is pre-trained on unlabelled documents, the section headings as aspects, and the following paragraphs corresponding to the aspects as aspect-based summaries.
Figure 2: Histogram of 50 most frequent aspects in the self-supervised samples (top: PubMed⋆, bottom: FacetSum⋆). PubMed⋆ has [150K,1.4K,214,33] unique aspects with frequency of higher than [1,10,100,1000] (FacetSum⋆:[96K,841,120,21]). Aspects removed from the NoOverlap datasets are highlighted in red.
Figure 3: Aspect-based summarization performance with limited supervised examples. Pre-training with in-domain and out-of-domain datasets significantly improves the low-resource training sample performance. Top: evaluation done on PubMed dataset, Bottom: evaluation is done on FacetSum dataset. ( —– BART , –•– BART + pre-trained on PubMed⋆, –×– BART + pretrained on FacetSum⋆, - - - BART fine-tuned on all samples)
Figure A.4: Effect of each variable on HaRiM. ∆ represents ps2s − plm. The last panel at the bottom right shows the effect of replacing auxiliary LM probability with empty-sourced decoder inference (HaRiMlmless). Whereas Figure 1 shows each article-summary pair as a datapoint in the plot, here we show each token of the decoded output as a datapoint.
Figure A.6: Graphical model representation of the factors that affect metric (M)-human (H) correlation. A is the graphical model that supports the use of partial correlation as argued in (Pagnoni et al., 2021). B is the graphical model that supports our argument for measuring correlation while ignoring the effect of the generation system (S), whose effect is hindered by the observed child node, text.
Figure A.1: Permutation test done for metric scores on FRANK-CNN/DM. 1 (filled grid) represents significant difference in metric performance, 0 represents negligible difference with confidence >=.95 (p <= 0.05), i.e. HaRiM is significantly more correlated to human judgements than all the other metrics except itself with a confidence of >=95%.
Figure 1: Effects of replacing the auxiliary language model (q(yi|y<i)) with an empty-sourced encoder-decoder model (p(yi|y<i; {})). Left compares the values of plm, and Right compares the HaRiM values. The values are calculated on the summary-article pairs in the FRANK benchmark. The high correlation of HaRiM suggests that the effect of replacement is minimal.
Figure A.5: Boxplot of HaRiM and log-likelihood scales, varying with the evaluating summarizer weight. base+cnn: BART-base fine-tuned on CNN/DailyMail, brio: BRIO (Meng et al., 2021), large+cnn: BART-large fine-tuned on CNN/DailyMail, large+cnn+para: further fine-tuned checkpoint of the previous model on ParaBank2 corpus as suggested in (Yuan et al., 2021).
Figure 3: System architectures for ‘Struct+2Way+Word’ (left) and ‘Struct+2Way+Relation’ (right). βt,i (left) measures the structural importance of the i-th source word; βt,i (right) measures the saliency of the dependency edge pointing to the i-th source word. gep,i is the structural embedding of the parent. In both cases δt,i replaces αt,i to become the new attention value used to estimate the context vector ct.
Figure 2: System architectures for ‘Struct+Input’ (left) and ‘Struct+Hidden’ (right). A critical question we seek to answer is whether the structural embeddings (sei ) should be supplied as input to the encoder (left) or be exempted from encoding and directly concatenated with the encoder hidden states (right).
Figure 1: An illustration of our CopyTrans architecture. The self-attention mechanism allows (i) a source word to attend to lower-level representations of all source words (including itself) to build a higher-level representation for it, and (ii) a summary word to attend to all source words, summary words prior to it, as well as the token at the current position (‘MASK’) to build a higher-level representation.
Figure 1: An illustration of the generation process. A sequence of placeholders (“[MASK]”) is placed following the source text. Our model simultaneously predicts the most probable tokens for all positions, rather than predicting only the most probable next token in an autoregressive setting. We obtain the token that has the highest probability, and use it to replace the [MASK] token of that position. Next, the model makes new predictions for all remaining positions, conditioned on the source text and all summary tokens seen thus far. Our generator produces a summary having the exact given length and with a proper endpoint.
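The iterative filling loop described in this caption can be sketched as follows. This is only an illustration under stated assumptions: `predict` is a hypothetical stand-in for the model's simultaneous prediction over all masked positions, and `toy_predict` is a toy scorer invented for the example.

```python
MASK = "[MASK]"


def fill_masks(source, length, predict):
    """Iteratively fill `length` placeholders, committing one position per step.

    `predict(source, summary)` is a hypothetical stand-in for the model: it
    returns {position: (token, probability)} for every position still masked.
    """
    summary = [MASK] * length
    while MASK in summary:
        predictions = predict(source, summary)
        # Commit only the single most confident prediction, then re-predict
        # the remaining positions conditioned on everything filled so far.
        position, (token, _) = max(predictions.items(), key=lambda kv: kv[1][1])
        summary[position] = token
    return summary


def toy_predict(source, summary):
    """Toy predictor: proposes source words in order, most confident at the start."""
    words = source.split()
    return {
        i: (words[i % len(words)], 1.0 / (i + 1))
        for i, tok in enumerate(summary)
        if tok == MASK
    }


result = fill_masks("storms hit the coast", 3, toy_predict)
```

Because the number of placeholders is fixed up front, the output always has exactly the requested length, which is the property the caption highlights.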
Figure 4: The proportion of instances containing TLDR in the TLDR9+ dataset. As seen, the number of TLDRs is increasing each year. At the time of conducting this research, the submission data dumps are partially uploaded for 2021 (until 2021-06), while there are no comments uploaded for 2021 in the Pushshift repository.
Figure 3: Detailed illustration of our summarization framework. Task-1 (t1): source sentence extraction (right-hand gray box). Task-2 (t2): introductory sentence extraction (left-hand gray box). As shown, the identified salient introductory sentences at training stages are incorporated into the representations of source sentences by the Select(·) function (orange box) with k = 3. Plus sign shows the concatenation layer. The feed-forward neural network is made of one linear layer.
Figure 4: (a) Our system’s generated summary, (b) Sentence graph visualization of our system’s generated summary. Green and gray nodes are introductory and non-introductory sentences, respectively. Edge thickness denotes the ROUGE score strength between a pair of sentences. Parts, from which sentences are sampled, are shown inside brackets. The summary is truncated due to space limitations. Ground-truth summary-worthy sentences are underlined, and colored spans show pointers from introductory to non-introductory sentences.
Figure 4: Relation of type I error rates at p < 0.05 to the total number of annotators for different designs, all with 100 documents and 3 judgements per summary. We conduct the experiment with both the t-test and approximate randomization test (ART). We show results both with averaging results per document and without any aggregation. We run 2000 trials per design. The red line marks the nominal error rate of 0.05.
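For readers unfamiliar with the approximate randomization test (ART) named in this caption, here is a minimal paired, two-sided sketch on the mean score difference. It is a generic textbook form, not the authors' experimental code: it ignores annotator structure and per-document aggregation, and the function name, seed, and toy scores are assumptions.

```python
import random


def approximate_randomization_test(scores_a, scores_b, trials=2000, seed=0):
    """Two-sided paired approximate randomization test on the mean difference."""
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must be paired")
    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(sum(scores_a) - sum(scores_b)) / n
    at_least_as_extreme = 0
    for _ in range(trials):
        diff = 0.0
        for a, b in zip(scores_a, scores_b):
            # Under the null hypothesis the two system labels are exchangeable,
            # so each pair is swapped with probability 0.5.
            if rng.random() < 0.5:
                a, b = b, a
            diff += a - b
        if abs(diff) / n >= observed:
            at_least_as_extreme += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    return (at_least_as_extreme + 1) / (trials + 1)


p_same = approximate_randomization_test([3, 4, 5, 4], [3, 4, 5, 4])
p_diff = approximate_randomization_test([1.0] * 20, [0.0] * 20)
```

Identical score lists give a p-value of 1, while two clearly separated systems give a p-value far below the nominal 0.05 threshold.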
Figure 3: Bias matrix for BAS with specific analysis for BART and Pegasus. The upper triangular matrix indicates τ+ for the given summarizer pair, the lower τ−. The area of each circle is proportional to the number of pairs in H+/H− for the cell. To read off the behaviour of the CM on a specific summarizer, we follow both the corresponding row and column. A high score in the row, combined with a low score in the corresponding cell in the column implies the CM is biased towards generations by this particular summarizer.
Figure 7: Summary quality as a function of metric optimized and amount of optimization, using best-of-N rejection sampling. We evaluate ROUGE, our main reward models, and an earlier iteration of the 1.3B model trained on approximately 75% as much data (see Table 11 for details). ROUGE appears to peak both sooner and at a substantially lower preference rate than all reward models. Details in Appendix G.3.
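Best-of-N rejection sampling, as referred to in this caption, can be sketched in a few lines: draw N candidates and keep the one the chosen metric scores highest. The sampler and reward below are toy stand-ins invented for illustration, not the policy or reward models from the paper.

```python
import random


def best_of_n(sample, reward, n, seed=0):
    """Draw n candidates from `sample` and keep the one `reward` scores highest."""
    rng = random.Random(seed)
    candidates = [sample(rng) for _ in range(n)]
    return max(candidates, key=reward)


# Toy stand-ins (assumptions): a "policy" that emits random-length summaries
# and a "reward model" that prefers summaries close to five words.
def toy_sample(rng):
    return " ".join(["word"] * rng.randint(1, 10))


def toy_reward(summary):
    return -abs(len(summary.split()) - 5)


best_1 = best_of_n(toy_sample, toy_reward, n=1)
best_64 = best_of_n(toy_sample, toy_reward, n=64)
```

Increasing N applies more optimization pressure toward whatever the scoring function rewards, which is the horizontal axis the caption calls "amount of optimization".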
Figure 1: Fraction of the time humans prefer our models’ summaries over the human-generated reference summaries on the TL;DR dataset. Since quality judgments involve an arbitrary decision about how to trade off summary length vs. coverage within the 24-48 token limit, we also provide length-controlled graphs in Appendix F; length differences explain about a third of the gap between feedback and supervised learning at 6.7B.
Figure 4: Transfer results on CNN/DM. (a) Overall summary quality on CNN/DM as a function of model size. Full results across axes shown in Appendix G.2. (b) Overall scores vs. length for the 6.7B TL;DR supervised baseline, the 6.7B TL;DR human feedback model, and T5 fine-tuned on CNN/DM summaries. At similar summary lengths, our 6.7B TL;DR human feedback model nearly matches T5 despite never being trained to summarize news articles.
Figure 5: Preference scores versus degree of reward model optimization. Optimizing against the reward model initially improves summaries, but eventually overfits, giving worse summaries. This figure uses an earlier version of our reward model (see rm3 in Appendix C.6). See Appendix H.2 for samples from the KL 250 model.
Figure 6: Reward model performance versus data size and model size. Doubling the amount of training data leads to a ~1.1% increase in reward model validation accuracy, whereas doubling the model size leads to a ~1.8% increase. The 6.7B model trained on all data begins approaching the accuracy of a single human.
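As a worked example of the two rates quoted in this caption, the sketch below adds up the per-doubling gains. It assumes the gains are in percentage points, are additive, and stay constant per doubling; the caption only reports the two rates, so any extrapolation beyond the measured range is an assumption. The function name is hypothetical.

```python
import math


def extrapolated_gain(data_factor, model_factor,
                      per_data_doubling=1.1, per_model_doubling=1.8):
    """Accuracy gain in percentage points, assuming gains add per doubling."""
    return (per_data_doubling * math.log2(data_factor)
            + per_model_doubling * math.log2(model_factor))


# Four times the data (two doublings) and twice the model size (one doubling):
# 2 * 1.1 + 1 * 1.8 percentage points under the additive assumption.
gain = extrapolated_gain(data_factor=4, model_factor=2)
```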
Figure 1: For each number of articles, we sample and compute the correlation 50 times and plot the average as well as the standard deviation. The shrinking error bars show that enough articles are provided for each system, so the number of articles is not the reason for the performance discrepancy between DUC2001 and DUC2002.
Figure 2: Process for dataset creation. In this example, two source sentences, s1 and s2, are selected, and the hypothesis sentence is generated by a sentence fusion operation. The selected sentences are considered error-originating if the hypothesis sentence poses an error. These source sentences are referred to as corresponding sentences.
Figure 6: Comparison of FactCCX and SumPhrase outputs. The sentence with a blue underline was identified as an error-corresponding sentence by SumPhrase, whereas the span with a red underline was localized by FactCCX. The dependency relation between the red phrases was determined as erroneous by SumPhrase.
Figure 1: Illustration of our approach. Left: Constructing weak supervisions using ConceptNet, including (1) extracting aspects and (2) synthesizing aspect-based summaries. Right: Augmenting aspect information, including (3) identifying aspect-related words in the document using Wikipedia and (4) feeding both the aspect and related words into the summarization model.
Figure 1: Illustration of Optimal Transport Extractive Summariser (OTExtSum): the formulation of extractive summarisation as an optimal transport (OT) problem. Optimal sentence extraction is conceptualised as obtaining the optimal extraction vector m∗, which achieves an OT plan from a document D to its optimal summary S∗ that has the minimum transportation cost. Such a cost is defined as the Wasserstein distance between the document’s semantic distribution TFD and the summary’s semantic distribution TFS and is used to measure the summary’s semantic coverage.
Figure 2: Interpretable visualisation of the OT plan from a source document to a resulting summary on the CNN/DM dataset. The higher the intensity, the more the semantic content of a particular document token is covered by a summary token. The purple line highlights the transportation from the document to the summary of the semantic content of the token “month”, which appears in both the document and the summary. The red line highlights how the semantic content of the token “sponsor”, which appears in the document but not in the summary, is transported to the tokens “tour” and “extension”, which are semantically closer and have lower transport cost, and thus achieve a minimum transportation cost in the OT plan.
Figure 1: Overview of the Falsesum generation framework. Falsesum preprocesses and formats the source document (A) and a gold summary (B) before feeding it to a fine-tuned generator model. The model produces a factually inconsistent summary, which can then be used to obtain (A,D) or (A,E) as the negative (non-entailment) NLI premise-hypothesis example pair. We also use the original (A,B) as a positive NLI example (entailment).
Figure 5: Drop of mean BLANC-tune value when parameters differ from optimal. The drop is shown as a fraction of the optimal mean BLANC value. The summaries probed are: CNN and DM (from the CNN/Daily Mail dataset), Top and Rand (first two sentences and random two sentences from random news articles). The parameters probed are: ’gap-infer 2/1’ is gap = 2 and gap mask = 1; ’gap-tune 2/1’ is gaptune = 2 and gap masktune = 1; ’p-replace 0.1’ is preplace = 0.1; ’toks-normal 4’ is Lnormal = 4; ’tune-rand’ makes token masking random rather than even at tuning.
Figure 2: Factor by which Spearman correlation of BLANC with human scores increases when only part of text is used for BLANC. The text part is selected as sentences with top BLANC values (thin lines) or as contiguous sentences having highest average BLANC (thick lines). The resulting BLANC is calculated as average over BLANC of the sentences.
Figure 4: Drop of mean BLANC-help value when parameters differ from optimal. The drop is shown as a fraction of the optimal mean BLANC value. The summaries probed are: CNN and DM (from the CNN/Daily Mail dataset), Top and Rand (first two sentences and random two sentences from random news articles). The parameters probed are: ’gap 3/1’ is gap = 3 and gap mask = 1; ’gap 3/2’ is gap = 3 and gap mask = 2; ’toks-normal 5’ is Lnormal = 5; ’toks-lead 2’ is Llead = 2; ’toks-follow 2’ is Lfollow = 2.
Figure 1: At t = 4, pgen4 weighs the probability of copying a word from V ext higher than generating a word from the fixed vocabulary V †. The decoder learns to interpret the weighted sum of hL4 and c4 in order to compute a probability distribution for the most appropriate text realisation given the context of the triples. The attention mechanism highlights f2 as the most important triple for the generation of the upcoming token. The attention scores are distributed among the entries of V ext, and accumulated into the final distribution over V . As a result, the model copies “science fiction”, one of the surface forms associated with f2.
Figure 6: Faithfulness-abstractiveness trade-off curve, shown as the dashed red line, on the Gigaword dataset. We plot each model’s average faithfulness score evaluated by AMT against its extractiveness level. Our model lies above the curve, performing better than MLE-baseline, DAE (Goyal and Durrett, 2021), and Loss Truncation (Kang and Hashimoto, 2020).
Figure 2: Example output using different strategies of corrector and contrastor. The first two rows show the original document and summary with highlighted entities and their respective labels (date, number, ent). We mark hallucinated entities in the summaries with red, factual entities in document and summary with green and underlined, and
Figure 4: Zero-shot and few-shot results. The lines represent each model’s performance when fine-tuned on 0 (zero-shot), 1, 10, 100, and 1000 examples. FACTPEGASUS consistently improves sentence error with more training data. Without the corrector and contrastor, factuality decreases with just 10 examples.
Figure 1: Illustration of FACTPEGASUS. For pre-training (a), we use the factGSG objective introduced in Section 3.1 that transforms a text document into a pseudo-summarization dataset. We select the pseudo-summary using the combination of ROUGE and FactCC. Here, sentence A is selected as the pseudo-summary, and we mask this sentence in the original text to create the pseudo-document. During fine-tuning (b), the connector (i) simulates the factGSG task by appending the same mask token used in (a) to the input document, so that we have the same setup in both training stages. Then, corrector (ii) removes hallucinations (highlighted in red) from the summary. Finally, contrastive learning in (iii) encourages the model to prefer the corrected summary over the perturbed summary.
Figure 2: Architecture of proposed model. The generator network and reconstructor network are a seq2seq hybrid pointer-generator network, but for simplicity, we omit the pointer and the attention parts.
loss varies widely from sample to sample, and thus the rewards to the generator are not stable either. Hence we add a baseline to reduce their difference. We apply self-critical sequence training (Rennie et al., 2017); the modified reward rR(x, x̂) from reconstructor R with the baseline for the generator is
Figure 1: A graphical illustration of the topic-aware convolutional architecture. Word and topic embeddings of the source sequence are encoded by the associated convolutional blocks (bottom left and bottom right). Then we jointly attend to words and topics by computing dot products of decoder representations (top left) and word/topic encoder representations. Finally, we produce the target sequence through a biased probability generation mechanism.
Figure 2: The architecture of our model. The blue bar represents the attention distribution over the inputs. The purple bar represents the concept distribution over the inputs. Note that this distribution can be sparse since not every word has an upper concept. The green bar represents the vocabulary distribution generated from the seq2seq component.
Figure 3: Graph structure of HETERDOCSUMGRAPH for multi-document summarization (corresponding to the Graph Layer part of Figure 1). Green, blue and orange boxes represent word, sentence and document nodes respectively. d1 consists of s11 and s12 while d2 contains s21 and s22. With word nodes acting as relay nodes, document-document, sentence-sentence, and sentence-document relations can be built through the common word nodes. For example, sentences s11, s12 and s21 share the same word w1, which connects them across documents.
Figure 4: The plot of the improvement of BertSUM+TA over BertSUM as a function of the document length for (a) CNN/DM and (b) Xsum, where the improvement is measured by the amount of increase in the ROUGE scores. The documents in each corpus are equally divided into 10 different groups based on their lengths. Each point of a curve indicates the average ROUGE score in its corresponding group.
Figure 1: Model Overview. The framework consists of three major modules: graph initializers, the heterogeneous graph layer and the sentence selector. Green circles and blue boxes represent word and sentence nodes respectively. Orange solid lines denote the edge feature (TF-IDF) between word and sentence nodes and the thicknesses indicate the weight. The representations of sentence nodes will be finally used for summary selection.
Figure 2: The detailed update process of word and sentence nodes in the Heterogeneous Graph Layer. Green and blue nodes are the word and sentence nodes involved in this turn. Orange edges indicate the current information flow direction. First, for sentence s1, words w1 and w3 are used to aggregate word-level information in (a). Next, w1 is updated by the new representations of s1 and s2 in (b), which are the sentences in which it occurs. See Section 3.3 for details on the notation.
Figure 1: Example contrasting the Autoencoder (AE) and Information Bottleneck (IB) approaches to summarization. While AE (top) preserves any detail that helps to reconstruct the original, such as population size in this example, IB (bottom) uses context to determine which information is relevant, which results in a more appropriate summary.
Figure 1: Model Framework. The top figure describes the framework for contrastive learning, where for each document x, we create different types of negative samples and compare them with x to get a ranking loss. The bottom figure is the evaluator which generates the final evaluation score. For short, here we use SS , SL and SLS to indicate S Score, L Score and LS Score.
Figure 1: Illustration of a unified semantic graph and its construction procedure for a document containing three sentences. In Graph Construction, underlined tokens represent phrases, and co-referent phrases are represented in the same color. In The Unified Semantic Graph, nodes of different colors indicate different types, according to Section 3.1.
Figure 2: Illustration of our graph-based summarization model. The graph node representation is initialized by merging token representations at two levels. The graph encoder models the augmented graph structure. The decoder attends to both token and node representations and utilizes the graph structure through graph-propagation attention.
Figure 4: The figure compares high-frequency semantic units and semantic units in the summary of each article in CNN/Daily Mail, which includes 287k article-summary pairs in total. The x-axis represents the ratio of high-frequency semantic units which also show up in summaries. The y-axis is the number of articles in the CNN/Daily Mail training set. The threshold of the cosine similarity is set as 0.5.
Figure 3: Comparison of the mean gold scores assigned for Q2 and Q3 to each of the 32 systems in the DUC05 dataset, and the corresponding scores predicted by SUM-QE. Scores range from 1 to 5. The systems are sorted in descending order according to the gold scores. SUM-QE makes more accurate predictions for Q2 than for Q3, but struggles to put the systems in the correct order.
Figure 2: Illustration of different flavors of the investigated neural QE methods. An encoder (E) converts the summary to a dense vector representation h. A regressor Ri predicts a quality score SQi using h. E is either a BiGRU with attention (BiGRU-ATT) or BERT (SUM-QE). R has three flavors, one single-task (a) and two multi-task (b, c).
Figure 2: Similarity matrices of two summaries for the same reading passage from the simulated learner data. Summary A is a good summary and Summary B is a bad summary. The rows of the matrix represent sentences in the summary and the columns of the matrix represent sentences in the reading passage.
Figure 1: The structure of our model. sei and sri represent the sentence embedding and sentence representation of sentence i, respectively. The binary decision of whether the sentence should be included in the summary is based on the sentence itself (A), the whole document (B) and the current topic (C). The document representation is simply the concatenation of the last hidden states of the forward and backward RNNs, while the topic segment representation is computed by applying LSTM-Minus, as shown in detail in the left panel (Detail of C).
Figure 2: The overview of the HYSUM framework. The hierarchical representation module first encodes the article sentences si into vectors hj . Then each sentence vector is turned into two versions by adding two different markers mc, mr. When the pointer network (arrows denote attention and darker color represents higher weights) selects the copy version hci of a sentence, it will be copied. Otherwise, when the rewriting version hri is selected, the sentence will be rewritten to reduce redundancy.
Figure 1: (a) The causal graph of text summarization reflects the causal relationships among the fact C, source document X, language prior K, and the model-generated summary Y. (b) According to Eq. (6), the causal effect of X on Y can be obtained by subtracting the effect of K on Y from the total effect.
Figure 3: Sentence extraction module of JECS. Words in input document sentences are encoded with BiLSTMs. Two layers of CNNs aggregate these into sentence representations hi and then the document representation vdoc. This is fed into an attentive LSTM decoder which selects sentences based on the decoder state d and the representations hi, similar to a pointer network.
Figure 1: Next token entropies computed on 10K generation steps from PEGASUSCNN/DM, PEGASUSXSUM, BARTCNN/DM and BARTXSUM respectively, broken into two cases: an Existing Bigram means the bigram just generated occurs in the input document, while a Novel Bigram is an organic model generation. These cases are associated with low entropy and high entropy actions, respectively. The x-axis shows the entropy (truncated at 5), and the y-axis shows the count of bigrams falling in each bin. The dashed lines indicate the median of each distribution.
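The quantity plotted here is the entropy of the model's next-token distribution at each generation step. A minimal sketch of that computation follows, assuming natural-log units (the caption does not state the base) and using made-up toy distributions.

```python
import math


def entropy(probabilities):
    """Shannon entropy, in nats, of a next-token probability distribution."""
    # Zero-probability tokens contribute nothing and are skipped to avoid log(0).
    return -sum(p * math.log(p) for p in probabilities if p > 0.0)


# Toy distributions (assumptions): one confident step and one uncertain step.
confident = entropy([0.97, 0.01, 0.01, 0.01])  # one token dominates
uncertain = entropy([0.25, 0.25, 0.25, 0.25])  # uniform over four tokens
```

A peaked distribution, as when the model is copying an existing bigram, gives a value near zero, while a flat distribution gives the maximum value for that vocabulary size.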
Figure 1: Illustration of DISCOBERT for text summarization. Sentence-based BERT model (baseline) selects whole sentences 1, 2 and 5. The proposed discourse-aware model DISCOBERT selects EDUs {1-1, 2-1, 5-2, 20-1, 20-3, 22-1}. The right side of the figure illustrates the two discourse graphs we use: (i) Coref(erence) Graph (with the mentions of ‘Pulitzer prizes’ highlighted as examples); and (ii) RST Graph (induced by RST discourse trees).
Figure 3: Correlating syntactic distance between neighboring tokens with the entropy change in those tokens’ generation decisions for PEGASUS summaries. The median entropy change is depicted as a dashed black line. At points of high syntactic distance, the model’s behavior is less restricted by the context, correlating with higher entropy.
Figure 4: Correlation between attention entropy and prediction entropy of PEG(ASUS) and BART on C(NN/DM) and X(Sum). We compute the mean value of the attention entropy within each bucket of prediction entropy. The uncertainty of attention strongly correlates with the entropy of the model’s prediction.
Figure 2: Prediction entropy values by relative sentence positions. For example, 0.0 indicates the first 10% of tokens in a sentence, and 0.9 is the last 10% of tokens. PEGASUSCNN/DM and BARTCNN/DM make highly uncertain decisions to start, but then entropy decreases, suggesting that these models may be copying based on a sentence prefix. Entropies on XSum are more constant across the sentence.
Figure 2: Example of discourse segmentation and RST tree conversion. The original sentence is segmented into 5 EDUs in box (a), and then parsed into an RST discourse tree in box (b). The converted dependency-based RST discourse tree is shown in box (c). Nucleus nodes including [2], [3] and [5], and Satellite nodes including [2] and [4] are denoted by solid lines and dashed lines, respectively. Relations are in italic. The EDU [2] is the head of the whole tree (span [1-5]), while the EDU [3] is the head of the span [3-5].
Figure 3: Four-way evaluation for our content attribution methods. The reported value is the NLL loss with respect to the predicted token. Lower is better for display methods and higher is better for removal methods (we “break” the model more quickly). n = 0 means the baseline when there is no token or sentence displayed in DISP or removed or masked in RM.
Figure 3: The framework of our model. Attentions marked in grey are from the naive Transformer. The reasoning unit consists of the inter-reasoning attention marked in green and the personalized intra-reasoning attention marked in yellow. Finally, the memory-decoder attention marked in red incorporates the historical reasoning memory into the decoder layer.
Figure 2: Comparison of Summarization tasks. Single-document Summarization (SDS task) focuses on generating summary S based on a single document D. Multi-document Summarization (MDS task) creates a holistic summary S covering multiple articles D. The MIRANEWS task differs by producing summary S based only on the events pertinent to the main article D, while reaching to a set of assisting documents A for complementary background.
Figure 1: An example where the summary (top section) contains information that is not explicitly included in its main document (middle section), but is covered in the related assisting document (bottom section). We highlight the information in the summary that is aligned to its corresponding main and assisting documents with yellow and pink colors, respectively.
Figure 4: An example from MIRANEWS, where the key information in the gold summary and summaries generated by systems conditioning on the main document (BART-S) or both on the main and assisting documents (rest variants) were only mentioned in the assisting documents. Facts in the gold summary supported by the as-
Figure 1: Our two-stage ablation-attribution framework. First, we compare a decoder-only language model (not fine-tuned on the summarization task, and not conditioned on the input article) and a full summarization model. They are colored in gray and orange respectively. The higher the difference, the more heavily the model depends on the input context. For those context-dependent decisions, we conduct content attribution to find the relevant supporting content with methods like Integrated Gradient or Occlusion.
Figure 3: Example of a page on newser.com: a newser.com article is a news event including editor-picked links to relevant news articles from other news websites. This example shows the webpage https://www.newser.com/story/305823/starship-prototype-lands-doesnt-explode.html. In the webpage (D1), three extra news pieces (D2, D3, D4) from nytimes, newser, and CNBC are linked. All four of these news articles report on the same event, the starship prototype landing.
Figure 1: Proposed summarization framework: generative process and neural parametrization. Shaded nodes represent observed variables, unshaded nodes indicate latent variables, arrows represent conditional dependencies between variables, and plates refer to repetitions of sampling steps. Dashed lines denote optional queries at test time. Latent queries create a query-focused view of the input document, which together with a query-agnostic view serve as input to a decoder for summary generation.
Figure 3: ROUGE-L F1 for SBERT-EX and SBERT-QB for each intent. From left to right, intents are ordered in increasing order of their subjectiveness score shown in Table 1. Pearson’s correlation between the subjectiveness score and the F1 score is −0.97 for SBERT-EX and −0.77 for SBERT-QB.
Figure 2: Question semantic volume maximization using convex hull. The and are the selected and non-selected candidates RTT questions using convex hull. The left side figure shows the toy-example of the convex hull. The right side figure shows the selected RTT question with respect to the gold question .
Figure 3: Theme modeling is essentially updating TED with a semantic classifier. The input sentence pair is first processed by adding a “class” token in the beginning and a “separation” token between the two sentences. Then the sentence pair is fed into the transformer encoder, and the first output vector is classified to “similar” or “distinct”.
Figure 1: Abstract and citations of (Bergsma and Lin 2006). The abstract emphasizes their pronoun resolution techniques and improved performance; the citation sentences reveal that their noun gender dataset is also a major contribution to the research community, but it is not covered in the abstract.
Figure 3: ROUGE-1 results of BART fine-tuning, DAPT and SDPT over different numbers of training data for email (left) and dialog (right) domains. We consider both low-resource settings (50, 100, 200 and 300 (∼2%) samples), medium-resource settings (25% and 50% samples), and high-resource settings (75% and 100% samples).
Figure 1: Illustration of the Length Attention Unit. Firstly, decoder hidden state (blue) and remaining length (yellow) are employed to compute the attention weights al. Then, the length context vector clt (green) is produced by calculating the weighted sum between attention weights and pre-defined length embeddings (purple). Better viewed in color.
Figure 2: Our proposed training strategy. Compared to existing work which relies only on a ROUGE reward rR, we add a factual correctness reward rC which is enabled by a fact extractor. The summarization model is updated via RL, using a combination of the NLL loss, a ROUGE-based loss and a factual correctness-based loss. For simplicity we only show a subset of the clinical variables in the fact vectors v and v̂.
Figure 1: The illustration of our metrics. This data example is from REALSumm (Bhandari et al., 2020) (we omit unnecessary content by ‘...’). For gold labels, ‘1’ stands for ‘present’ and ‘0’ stands for ‘not present’. Other scores are the 2-class entailment probabilities, p2c(e), from our finetuned NLI model.
Figure 2: Model Architecture. The sentences and actions are first encoded and then fed to the highlight classifiers. The hidden representations of sentences and actions, along with their probability of being highlights are then used in the cross-attention layer in the decoder. The email encoder has the same structure as BART encoder. The graph encoder utilizes graph attention networks to encode the action graph.
Figure 2: Our controllable neural model with guiding entities. The original article texts are encoded with a BiLSTM layer. We utilize a pretrained BERT named entity recognition tool to extract entities from input texts. The decoder consists of two LSTMs: LSTM-L and LSTM-R. Our model starts generating the left and right part of a summary with selected entities and can guarantee that entities appear in final output summaries.
Figure 1: Different behaviours of two decoders (SeqLab and Pointer) under different testing environments. (a) shows repetition scores of different architectures when extracting six sentences on CNN/DailyMail. (b) shows the relationship between ∆R and positional bias. The abscissa denotes the positional bias of six different datasets and ∆R denotes the average ROUGE difference between the two decoders under different encoders. (c) shows the average length of the k-th sentence extracted from different architectures.
Figure 2: Distribution of z(%) on six datasets. Because the number of candidate summaries for each document is different (short text may have relatively few candidates), we use z / number of candidate summaries as the X-axis. The Y-axis represents the proportion of the best-summaries with this rank in the test set.
Figure 4: Datasets splitting experiment. We split test sets into five parts according to z described in Section 3.2. The X-axis from left to right indicates the subsets of the test set with the value of z from small to large, and the Y-axis represents the ROUGE improvement of MATCHSUM over BERTEXT on this subset.
Figure 1: Overview of the NEUSUM model. The model extracts S5 and S1 at the first two steps. At the first step, we feed the model a zero vector 0 to represent an empty partial output summary. At the second and third steps, the representations of the previously selected sentences S5 and S1, i.e., s5 and s1, are fed into the extractor RNN. At the second step, the model only scores the first 4 sentences since the 5th one is already included in the partial output summary.
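The step-wise selection described in the caption above (score only the sentences not yet selected, then pick the best one) can be sketched as a plain greedy loop. This is a simplified stand-in and not the NEUSUM implementation: `score_fn` is a hypothetical callable that replaces the extractor RNN and scorer, and sentence indices are 0-based, so S5 and S1 correspond to indices 4 and 0.

```python
def greedy_extract(num_sentences, score_fn, num_steps):
    """Select sentences one at a time, never rescoring selected ones.

    score_fn(i, selected) returns a score for sentence i given the tuple of
    already selected indices (hypothetical interface standing in for the
    learned scorer).
    """
    selected = []
    for _ in range(num_steps):
        # Only sentences outside the partial summary are scored at this step.
        remaining = [i for i in range(num_sentences) if i not in selected]
        if not remaining:
            break
        history = tuple(selected)
        best = max(remaining, key=lambda i: score_fn(i, history))
        selected.append(best)
    return selected
```

In this sketch the empty tuple passed at the first step plays the role of the zero vector that represents an empty partial output summary.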
Figure 2: The overview of the BERT-based model for sub-sentential extraction (SSE). In this simplified example, the document has 3 sentences. The first and the third sentences have two extraction units and the second sentence has one. After encoding the document with the pre-trained BERT encoder, an average pooling layer is used to aggregate the information of each extraction unit. The final Transformer layer captures the document-level information and then the MLP predicts the extraction probability.
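The average pooling step in the caption above can be illustrated with plain lists. This is a minimal sketch, assuming each extraction unit is given as a half-open token span `[start, end)` over the encoder output; the function name and the span format are assumptions, and the token vectors here are placeholders rather than real BERT outputs.

```python
def mean_pool_units(token_vectors, unit_spans):
    """Average the token vectors inside each extraction unit.

    token_vectors: list of equal-length lists of floats, one per token.
    unit_spans: list of (start, end) half-open index pairs (assumed format).
    Returns one pooled vector per unit.
    """
    pooled = []
    for start, end in unit_spans:
        if not 0 <= start < end <= len(token_vectors):
            raise ValueError(f"invalid span: ({start}, {end})")
        span = token_vectors[start:end]
        dim = len(span[0])
        pooled.append([sum(vec[d] for vec in span) / len(span) for d in range(dim)])
    return pooled
```

Each pooled vector would then be the input for one extraction unit in the later document-level layer.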
Figure 3: The detailed update process of word, sentence, and section nodes in HEROES. The figure is a toy example consisting of 3 sections, 7 sentences, and 5 unique words, where the vertical dashed lines are section boundaries. Green, blue, and red nodes are the word, sentence, and section nodes involved in the update in this turn. Orange edges denote the direction of information flow.
Figure 2: The distribution of the overlapping ratio of nonstopping words between: (Red) the reference summary and the article; (Green) the reference summary and the article excluding the first 3 sentences, i.e. Rest; and (Blue) the leading 3 sentences, i.e. Lead-3, and Rest. The area under each curve is 1. All ratios are computed on CNN/DailyMail.
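One way to compute an overlap ratio of the kind described in the caption above is sketched below. The caption does not give the exact definition, so this assumes the ratio is the fraction of unique non-stop words in the first text that also occur in the second; the tiny stop-word list and the whitespace tokenizer are placeholders chosen for illustration only.

```python
# Hypothetical, deliberately tiny stop-word list for illustration.
STOPWORDS = frozenset({"the", "a", "an", "of", "in", "and", "to", "is"})


def content_words(text):
    """Lowercased, punctuation-stripped unique tokens that are not stop words."""
    tokens = (tok.strip(".,;:!?\"'()").lower() for tok in text.split())
    return {tok for tok in tokens if tok and tok not in STOPWORDS}


def overlap_ratio(first, second):
    """Fraction of unique content words of `first` that appear in `second`."""
    first_words = content_words(first)
    if not first_words:
        return 0.0
    return len(first_words & content_words(second)) / len(first_words)
```

Under this assumed definition the ratio is asymmetric, so the order of the two texts matters when comparing, for example, a reference summary against Lead-3 or Rest.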
Figure 1: Assume a document (x1, x2, · · · , x8) contains three sentences (i.e., SENT. 1, SENT. 2 and SENT. 3). A SEQ2SEQ Transformer model can be pre-trained with our proposed objective. It takes the transformed document (i.e., a shuffled document, the first segment of a document, or a masked document) as input and learns to recover the original document (or part of the original document) by generation. SR: Sentence Reordering; NSG: Next Sentence Generation; MDG: Masked Document Generation.
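The three input transformations named in the caption above (SR, NSG, MDG) can be sketched as functions that return a (source, target) pair. This is a rough illustration under stated assumptions: the document is a list of token lists, the split point and masked span are passed in explicitly rather than sampled, `MASK` is a hypothetical mask symbol, and the MDG target is taken to be only the masked part of the document.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol


def sentence_reordering(sentences, rng):
    """SR: the source is the shuffled document, the target is the original order."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled, list(sentences)


def next_sentence_generation(sentences, split):
    """NSG: the source is the first segment, the target is the remainder."""
    return list(sentences[:split]), list(sentences[split:])


def masked_document_generation(sentences, start, length):
    """MDG: the source has a token span masked, the target is that span."""
    tokens = [tok for sent in sentences for tok in sent]
    if start < 0 or length < 0 or start + length > len(tokens):
        raise ValueError("masked span is out of range")
    masked = tokens[:start] + [MASK] * length + tokens[start + length:]
    return masked, tokens[start:start + length]
```

For example, with an eight-token document split into three sentences, `masked_document_generation(doc, 2, 3)` replaces the third to fifth tokens with the mask symbol and asks the model to generate those three tokens.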