The gallery currently contains 1524 figures.
Click on an image to open the paper.
Figure

Figure 1: Consumer health questions and associated summaries from the gold standard. The entities in Red are the foci (main entities). The words in Blue and underlined are the triggers of the question types.

Figure

Figure 2: A summary generated by PG+M#2 method.

Figure

Figure 2: Distribution of #companies per SIC code

Figure

Figure 1: Architecture of our proposed system for summarization of 10-K documents, MHFinSum.

Figure

Figure 1: Examples of an earthquake-related article paired with extractive summaries from the CNN/DM dataset. “Generic” represents the selection of a general purpose summarization model. “Geo(graphy)” (colored in green) and “Recovery” (colored in orange) indicate our aspects of interest for the summary. We highlight aspect-relevant phrases in the document.

Figure

Figure 2: User interface for Turkers’ annotation.

Figure

Figure 1: Pipeline for Sem-nCG@k evaluation of extractive summarization task, CG@k stands for Cumulative Gain at kth position and ICG@k for Ideal CG@k.
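For readers unfamiliar with the metric, the normalization implied by this caption can be sketched as follows. This is an illustrative reconstruction, not the authors' code: it assumes per-sentence gain values are already available, and the helper names are made up for the example.

```python
def cumulative_gain(gains, k):
    """CG@k: sum of the gains of the first k ranked sentences."""
    return sum(gains[:k])

def sem_ncg_at_k(model_gains, all_gains, k):
    """Normalize CG@k of the model ranking by the Ideal CG@k (ICG@k)."""
    icg = cumulative_gain(sorted(all_gains, reverse=True), k)
    return cumulative_gain(model_gains, k) / icg if icg > 0 else 0.0

# Toy example: gains in model order vs. all available sentence gains.
print(sem_ncg_at_k([0.8, 0.3, 0.5], [0.9, 0.8, 0.5, 0.3, 0.1], k=3))
```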

Figure

Figure 2: ROUGE-2 and Sem-nCG scores for the 6 extractive summarization models on the CNN/DailyMail test dataset. The results are for the top-5 extracted sentences, reported for both the actual and perturbed outputs. The plot demonstrates the robustness of Sem-nCG against perturbation.

Figure

Figure 1: Observations on linked entities in summaries. O1: Summaries are mainly composed of entities. O2: Entities can be used to represent the topic of the summary. O3: Entity commonsense learned from a large corpus can be used.

Figure

Figure 2: Full architecture of our proposed sequence-to-sequence model with Entity2Topic (E2T) module.

Figure

Figure 3: Entity encoding submodule with selective disambiguation applied to the entity 3©. The left figure represents the full submodule while the right figure represents the two choices of disambiguating encoders.

Figure

Figure 3: Model architecture of PLANSUM. The content plan is constructed as the average of the aspect and sentiment probability distributions induced by the content plan induction model. It is then passed to the decoder, along with the aggregated token encodings to generate the summary.

Figure

Figure 2: Pseudo-summary y and input reviews X; the aspect code for summary y is room. Review sentences with the same aspect are underlined and the same aspect keywords are magnified.

Figure

Figure 1: Overview of the controller induction model. Token-level aspect predictions are aggregated into sentence-level predictions using a multiple instance pooling mechanism (described on the right). The process is repeated from sentence- to document-level predictions.

Figure

Figure 4: Yelp summaries generated by PLANSUM and its variants. Aspects also mentioned in the gold summary (not shown to save space) are in color, other aspects are italicized.

Figure

Figure 1: Yelp reviews about a local bar and corresponding summary. Aspect-specific opinions are in color (e.g., drinks, guests, staff), while less salient opinions are shown in italics.

Figure

Figure 1: Three out of 150 reviews for the movie “Coach Carter”, and summaries written by the editor, and generated by a model following the EXTRACT-ABSTRACT approach and the proposed CONDENSE-ABSTRACT framework. The latter produces more informative and factual summaries while allowing control over aspects of the generated summary (such as the acting or plot of the movie).

Figure

Figure 2: Illustration of EA and CA frameworks for opinion summarization. In the CA framework, users can obtain need-specific summaries at test time (e.g., give me a summary focusing on acting).

Figure

Figure 2: Model architecture of our content plan induction model. The dotted line indicates that a reverse gradient function is applied.

Figure

Figure 2: Overview of our Citation Graph-Based Model (CGSUM). A denotes the source paper (w/o abstract). B, C, D and E denote the reference papers. The body text of A and the abstract of reference papers are fed into the document encoder, and then used to initialize the node features in the graph encoder. Neighbor extraction method will be used to extract a more relevant subgraph. While decoding, the decoder will pay attention to both the document and the citation graph structure.

Figure

Figure 4: Relationships between the degree of source paper nodes (X-axis) and R̃ (the average of ROUGE-1, ROUGE-2 and ROUGE-L) of two models: CGSUM + 1-hop neighbors and PTGEN + COV (inductive setting).

Figure

Figure 3: Different ways of splitting training, validation, test sets from the whole graph. We omit the directionality of the edges for simplification. The green, orange, cyan nodes represent papers from the training, validation, test set.

Figure

Figure 1: A small research community on the subject of Weak Galerkin Finite Element Method. Green text indicates the domain-specific terms shared in these papers, orange text denotes different ways of writing the same sentences, blue text represents the definition of Weak Galerkin Finite Element Method (does not appear in the source paper).

Figure

Figure 1: A comparison between two-stage models and COLO. The two-stage models involve two training stages and a time-consuming preprocessing step, while COLO is trained in an end-to-end fashion (GPU and CPU hours for each stage are shown in Table 6). Two-stage models use offline sampling to build positive-negative pairs, while COLO builds positive-negative pairs with online sampling, drawing these pairs directly from a changing model distribution.

Figure

Figure 2: Architecture of our extractive model. Input sequence: the ‘[doc]’ token is used to get the vector representation zX of the document X, and ‘[sep]’ is used as a separator for sentences. We omit the classifier and the BCELoss. hi is the sentence embedding of the i-th sentence in X. zCi denotes the feature representation of the i-th candidate.

Figure

Figure 4: Test inference time with beam size for abstractive model. We use the maximum batch size allowed by GPU memory.

Figure

Figure 3: Inference speed on CNN/DM (extractive). We use the candidate size |C| as the X-axis. The Y-axis represents the number of samples processed per second. batch=MAX means we use the maximum batch size allowed by GPU memory.

Figure

Figure 5: t-SNE visualization of two examples from the CNN/DM test set. We divide the candidates into 3 groups based on ROUGE score: candidates ranking 1–50, 51–100, and 101–150. The red point denotes the anchor, and the purple/cyan/gray points denote the top 50/100/150 candidates, respectively.

Figure

Figure 1: Aspect-based opinion summarization. Opinions on image quality, sound quality, connectivity, and price of an LCD television are extracted from a set of reviews. Their polarities are then used to sort them into positive and negative, while neutral or redundant comments are discarded.

Figure

Figure 3: Human and system summaries for a product in the Televisions domain.

Figure

Figure 2: Multi-Seed Aspect Extractor (MATE).

Figure

Figure 3: Sentence ranking via two-step sampling. In this toy example, each sentence (s1 to s5) is assigned to its nearest code (k = 1, 2, 3), as shown by thick purple arrows. During cluster sampling, the probability of sampling a code (top right; shown as blue bars) is proportional to the number of assignments it receives. For every sampled code, we perform sentence sampling; sentences are sampled, with replacement, according to their proximity to the code’s encoding. Samples from codes 1 and 3 are shown in black and red, respectively.
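The two-step sampling described in this caption can be sketched roughly as below. This is not the paper's implementation; it assumes code embeddings, sentence embeddings, and code assignments are given, and the proximity-to-probability mapping (a softmax over negative distances) is one plausible choice.

```python
import numpy as np

def two_step_sample(code_embs, sent_embs, assignments, n_samples, rng=np.random):
    """Cluster sampling followed by sentence sampling, as in the caption."""
    n_codes = len(code_embs)
    # Cluster sampling: a code's probability is proportional to the number
    # of sentences assigned to it.
    counts = np.bincount(assignments, minlength=n_codes).astype(float)
    code_probs = counts / counts.sum()
    sampled = []
    for _ in range(n_samples):
        k = rng.choice(n_codes, p=code_probs)
        members = np.where(assignments == k)[0]
        # Sentence sampling (with replacement): sentences closer to the
        # code's encoding are more likely to be drawn.
        dists = np.linalg.norm(sent_embs[members] - code_embs[k], axis=1)
        probs = np.exp(-dists) / np.exp(-dists).sum()
        sampled.append(rng.choice(members, p=probs))
    return sampled
```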

Figure

Figure 5: t-SNE projection of the quantized space of an eight-head QT trained on SPACE, showing all 1024 learned latent codes (best viewed in color). Darker codes correspond to lower aspect entropy, our proposed measure of high aspect-specificity. Zooming in on the aspect sub-space uncovers good aspect separation.

Figure

Figure 1: A sentence is encoded into a 3-head representation and head vectors are quantized using a weighted average of their neighboring code embeddings. The QT model is trained by reconstructing the original sentence.

Figure

Figure 4: Aspect opinion summarization with QT. The aspect-encoding sub-space is identified using mean aspect entropy and all other sub-spaces are ignored (shown in gray). Two-step sampling is restricted only to the codes associated with the desired aspect (shown in red).

Figure

Figure 2: General opinion summarization with QT. All input sentences for an entity are encoded using three heads (shown in orange, blue, and green crosses). Sentence vectors are clustered under their nearest latent code (gray circles). Popular clusters (histogram) correspond to commonly occurring opinions, and are used to sample and extract the most representative sentences.

Figure

Figure 6: Mean aspect entropies (bars) for each of QT’s head sub-spaces and corresponding aspect ROUGE-1 scores for the summaries produced by each head (line).

Figure

Figure 2: A Transformer-based encoder-decoder architecture with FAME.

Figure

Figure 3: Top 40 sentence pieces and their logits from topic distribution tX in ROBFAME and PEGFAME for the XSUM article discussed in Figure 1.

Figure

Figure 4: ROUGE-1 F1 scores of ROBFAME and PEGFAME models with different top-k vocabularies (Eq. (5)) on the XSUM test set. Similar patterns are observed for ROUGE-2 and ROUGE-L scores.

Figure

Figure 5: A 2010 BBC article from the XSUM test set, its human-written summary, and model predictions from ROBERTAS2S and PEGASUS, with and without FAME. The text in orange is not supported by the input article.

Figure

Figure 7: FAME model predictions with Focussample,k (k = 10000). The text in orange is not supported by the input article.

Figure

Figure 6: Model predictions with focus sampling Focustop,k, a controlled generation setting. The text in orange is not supported by the input article. We note that with smaller values of k, both ROBERTAS2S-based and PEGASUS-based models tend to hallucinate more often.

Figure

Figure 1: Block A shows the best predictions from PEGASUS and our PEGFAME (PEGASUS with FAME) model, along with the GOLD summary for an XSUM article. Block B presents diverse summaries generated from PEGASUS using top-k and nucleus sampling. Block C shows diverse summaries generated using our PEGFAME model with Focus sampling. The text in orange is not supported by the input article.

Figure

Figure 1: System framework. The model uses an extractive summary as a document surrogate to answer important questions about the document. The questions are automatically derived from the human abstract.

Figure

Figure 1: A unidirectional LSTM (blue, Eq. (3)) encodes the partial summary, while the multilayer perceptron network (orange, Eq. (4-5)) utilizes the text unit representation (het ), its positional embedding (gt), and the partial summary representation (st) to determine if the t-th text unit is to be included in the summary. Best viewed in color.

Figure

Figure 1: An illustration of label smoothing. Words aligned to the abstract are colored orange; gap words are colored turquoise.

Figure

Figure 2: Summarization results using the f1_LSTM or f2_CNN encoder with word/chunk as the extraction unit.

Figure

Figure 1: The overall architecture of the extractor network.

Figure

Figure 6: Impact of the number of fine-tuning epochs of the BART-b + T-ID model on ROUGE-L performance.

Figure

Figure 1: A topical summarization example, summarizing a sample document with respect to economy and climate topics.

Figure

Figure 3: Comparison of per-topic normalized counts of NEWTS test documents versus CNN/DailyMail counts.

Figure

Figure 4: Comparison of per-topic normalized counts of the training documents of our dataset versus CNN/DailyMail.

Figure

Figure 2: The step-by-step process of building the NEWTS dataset.

Figure

Figure 1: StructSum incorporates Latent Structure (LS, §2.2) and Explicit Structure (ES, §2.3) attention to produce structure-aware representations. Here, StructSum augments the Pointer-Generator model, but the methodology we propose is general and can be applied to other encoder-decoder summarization systems.

Figure

Figure 2: Comparison of % Novel n-grams between StructSum, Pointer-Generator+Coverage and the Reference. Here, “sent” indicates full novel sentences.

Figure

Figure 3: Coverage of source sentences in the summary. Here the x-axis is the sentence position in the source article and the y-axis shows the normalized count of sentences in that position copied to the summary.

Figure

Figure 4: Examples of induced structures and generated summaries.

Figure

Figure 1: A toy example of the Semantic Overlap Summarization (SOS) task (from multiple alternative narratives). Here, an abortion-related event has been reported by two news media (left-wing and right-wing). “Green” text denotes the information common to both news media, while “Blue” and “Red” text denote the unique perspectives of the left and right wing, respectively. Some real examples from the benchmark dataset are provided in Table 3.

Figure

Figure 2: Example of three-step summarization process: selecting, grouping and rewriting.

Figure

Figure 3: Architecture of the contextualized rewriter. The group tag embeddings are tied between the encoder (left figure) and the decoder (right figure), through which the decoder can attend to the corresponding tokens in the document.

Figure

Figure 7: Example of the ability to maintain coherence.

Figure

Figure 6: Example of the ability to reduce redundancy.

Figure

Figure 1: Example showing that contextual information can benefit summary rewriting.

Figure

Figure 4: Comparison of the ability to generate non-redundant summaries.

Figure

Figure 2: Training sample generation by mutation. Mu-

Figure

Figure 1: The weakly supervised training approach in this paper and the test of a trained model.

Figure

Figure 3: Training sample generation via cross pairing.

Figure

Figure 1: Argument coverage per number of key points.

Figure

Figure 2: Precision/recall trade-off for different key point selection policies. For each method, the highest F1 score, as well as the F1 score for the chosen threshold, is specified. For the Best Match + Threshold policy, these two scores coincide.
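As a rough illustration of how such a threshold is typically chosen, the sketch below sweeps candidate thresholds and reports the one with the highest F1. It is a generic sketch, not the authors' selection policy; the array names are hypothetical.

```python
import numpy as np

def best_f1_threshold(scores, labels, thresholds):
    """Sweep candidate thresholds and report the one with the highest F1,
    mirroring the precision/recall trade-off shown in the figure.
    scores: match scores per candidate; labels: boolean relevance labels."""
    best = (-1.0, None)
    for t in thresholds:
        pred = scores >= t
        tp = np.sum(pred & labels)
        precision = tp / max(pred.sum(), 1)
        recall = tp / max(labels.sum(), 1)
        f1 = 2 * precision * recall / max(precision + recall, 1e-9)
        best = max(best, (f1, t))
    return best  # (highest F1, chosen threshold)
```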

Figure

Figure 4. DUC 2005-7 vs. QCFS Dataset Structure

Figure

Figure 7. Comparison of Retrieval-Based Algorithms' Performance on the TD-QFS Dataset

Figure

Figure 5. ROUGE-Recall results of KLSum on relevance-filtered subsets of the TD-QFS dataset compared to DUC datasets.

Figure

Figure 2. Two-stage QFS Scheme.

Figure

Figure 6. Comparison of QFS and Non-QFS Algorithms' Performance on the TD-QFS Dataset

Figure

Figure 1. ROUGE: Comparing QFS methods to generic summarization methods. Biased-LexRank is not significantly better than generic algorithms.

Figure

Figure 3. Comparing Retrieval Components on DUC 2005

Figure

Figure 10: Coverage (x-axis) vs Kendall’s τ (y-axis) for all metric pairs on the CNNDM dataset.

Figure

Figure 3: System-level Pearson correlation with humans on top-k systems (Sec. 4.2).

Figure

Figure 7: Abstractiveness (x-axis) vs Kendall’s τ (y-axis) for all metric pairs on the TAC dataset.

Figure

Figure 9: Coverage (x-axis) vs Kendall’s τ (y-axis) for all metric pairs on the TAC dataset.

Figure

Figure 4: F1-Scores with bootstrapping (Sec. 4.3).

Figure

Figure 5: Ease of Summarization (x-axis) vs Kendall’s τ (y-axis) for all metric pairs on the TAC dataset.

Figure

Figure 2: Disagreement between metrics on TAC and CNNDM.

Figure

Figure 1: Effect of different properties of reference summaries. We only show correlation between BERTScore and ROUGE-2 due to limited pages. The trend is similar for all other pairs as shown in the appendix. The plots for CNNDM are more dense because of more documents in the CNNDM test set as compared to TAC. “Cov” and “Abs” stand for Coverage and Abstractiveness respectively. The trend lines in red are the 10 point and 100 point moving average for TAC and CNNDM respectively.

Figure

Figure 8: System-level Kendall correlation with humans on top-k systems.

Figure

Figure 2: System-level Pearson correlation between metrics and human scores (Sec. 4.1).

Figure

Figure 6: Ease of Summarization (x-axis) vs Kendall’s τ (y-axis) for all metric pairs on the CNNDM dataset.

Figure

Figure 1: p-value of William’s Significance Test for the hypothesis “Is the system on the left (y-axis) significantly better than the system on top (x-axis)”. ‘BScore’ refers to BERTScore and ‘MScore’ refers to MoverScore. A dark green value in cell (i, j) denotes that metric mi has a significantly higher Pearson correlation with human scores compared to metric mj (p-value < 0.05). ‘-’ in cell (i, j) refers to the case when the Pearson correlation of mi with human scores is less than that of mj (Sec. 4.1).

Figure

Figure 6: System-level Kendall correlation between metrics and human scores.

Figure

Figure 5: Pearson correlation between metrics and human judgments across different datasets (Sec. 4.4).

Figure

Figure 8: Abstractiveness (x-axis) vs Kendall’s τ (y-axis) for all metric pairs on the CNNDM dataset.

Figure

Figure 7: Kendall correlation between metrics and human judgements across different datasets.

Figure

Figure 4: F′/N ratio between metrics on TAC and CNNDM.

Figure

Figure 3: F/N ratio between metrics on TAC and CNNDM.

Figure

Figure 1: The patterns of the main (red) and three auxiliary tasks (green). The solid line denotes the concatenation of the document and the corresponding question, which is the input of our model; the dashed line denotes the corresponding answer for each input. All tasks are in a "text-to-text" format.

Figure

Figure 2: We insert adapter TSi into the original BART encoder layers in a serial manner. Each task has a unique adapter. The red rectangle denotes the adapter for the main task.

Figure

Figure 4: We compare the performance of our multi-task model with the baseline during training. We calculate the average loss of the model at each step on the training set and validation set. The figure shows that our model can alleviate the overfitting problem.

Figure

Figure 3: How the weights of all the auxiliary tasks change during training on 100 samples. The weights of the ext and spo tasks are almost monotonically increasing. The weight of the nli task declines, except for a short period of ascent at the beginning.

Figure

Figure 1: GenCompareSum pipeline. (a) We split the document into sentences. (b) We combine these sentences into sections of several sentences. (c) We feed each section into the generative text model and generate several text fragments per section. (d) We aggregate the questions, removing redundant questions by using n-gram blocking. Where aggregation occurs, we apply a count to represent the number of textual fragments which were combined and use this as a weighting going forwards. The highest weighted textual fragments are then selected to guide the summary. (e) The similarity between each sentence from the source document and each selected textual fragment is calculated using BERTScore. (f) We create a similarity matrix from the scores calculated in the previous step. These are then summed over the textual fragments, weighted by the values calculated in step (d), to give a score per sentence. (g) The highest scoring sentences are selected to form the summary.
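Steps (e)-(g) amount to a weighted aggregation of sentence-fragment similarities. The sketch below is a loose reconstruction under the assumption that a similarity function (e.g., BERTScore F1) is supplied by the caller; all names are illustrative, not the paper's code.

```python
import numpy as np

def score_sentences(sentences, fragments, weights, similarity):
    """Steps (e)-(f): build a sentence x fragment similarity matrix, then
    weight each fragment column by its count from step (d) and sum."""
    sim = np.array([[similarity(s, f) for f in fragments] for s in sentences])
    return sim @ np.asarray(weights, dtype=float)

def select_summary(sentences, scores, n):
    """Step (g): keep the n highest-scoring sentences, in document order."""
    top = sorted(np.argsort(scores)[::-1][:n])
    return [sentences[i] for i in top]
```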

Figure

Figure 2: Representations of the two training settings of the T5 encoder-decoder model. The left diagram shows the unsupervised pretraining task, in which a tokenized text containing masked spans is passed to the encoder and the output target of the decoder is the prediction of the masked spans. The right diagram shows the supervised downstream task, where the pre-trained model is finetuned on pairs of tokenized sequences.

Figure

Figure 3: Data volume over time for each topic

Figure

Figure 6: Classification accuracies for 21 pairs of summaries. (a) Automatic classification using prototypes (by SVM) on the entire test set. The green avg SVM line is the mean accuracy of SVMs trained on the entire training set. (b) Automatic classification evaluated on 6 test articles per pair. (c) Human classification accuracy on 6 test articles per pair.

Figure

Figure 5: An example questionnaire used for crowd-sourced evaluation. It consists of: (a) instructions, (b) two groups of summaries, (c) question articles, and (d) a comment box for feedback. See §6.3.

Figure

Figure 1: An illustrative example of comparative summarisation. Squares are news articles, rows denote different news outlets, and the x-axis denotes time. The shaded articles are chosen to represent AI-related news during February and March 2018, respectively. They aim to summarise topics in each month, and also highlight differences between the two months.

Figure

Figure 4: Comparative summarisation methods evaluated using the balanced accuracy of 1-NN (left) and SVM (right) classifiers. Each row represents a dataset. Error bars show 95% confidence intervals.

Figure

Figure 1: Distributions of some metrics/rewards for summaries with different human ratings. Among the four presented metrics/rewards, only BERT+MLP+Pref (the rightmost sub-figure) does not use reference summaries.

Figure

Figure 4: Information about the length of the generated summaries for the CNN/DM dataset.

Figure

Figure 3: Information about the length of the generated summaries for the CASS dataset.

Figure

Figure 1: Training of the model. The blocks present steps of the analysis. All the elements above the blocks are inputs (document embedding, sentences embeddings, threshold, real summary embedding, trade-off).

Figure

Figure 2: Processing time of the summarization function (y-axis) by the number of lines of the text as input (x-axis). Results computed on an i7-8550U.

Figure

Figure 2: Production of latent code zN for review rN .

Figure

Figure 1: Unfolded graphical representation of the model.

Figure

Figure 1: Illustration of the FEWSUM model that uses the leave-one-out objective. Here, prediction of the target review ri is performed by conditioning on the encoded source reviews r−i. The generator attends to the last encoder layer’s output to extract common information (in red). Additionally, the generator has partial information about ri passed by the oracle q(ri, r−i).

Figure

Figure 2: Architecture of the prior score function.

Figure

Figure 1: The SELSUM model is trained to select and summarize a subset of relevant reviews r̂1:K from a full set r1:N using the approximate posterior qφ(r̂1:K |r1:N , s). To yield review subsets in test time, we fit and use a parametrized prior pψ(r̂1:K |r1:N ).

Figure

Figure 1: Illustration of the proposed approach. In Stage 1, all parameters of a large language model are pretrained on generic texts (we use BART). In Stage 2, we pre-train adapters (5% of the full model’s parameters) on customer reviews using held-out reviews as summaries. In Stage 3, we fine-tune the adapters on a handful of reviews-summary pairs.

Figure

Figure 2: Illustration of the query-based summarizer that inputs reviews and a text query consisting of aspects, such as ‘volume,’ ‘price,’ and ‘bluetooth.’ The query is automatically created from gold summaries in training and reviews in test time.

Figure

Figure 3: Two example TLDR-Auth and TLDR-PR pairs with colored spans corresponding to nuggets in Table 3

Figure

Figure 4: Training regimen for CATTS.

Figure

Figure 2: Example of a reviewer comment rewritten as a TLDR (best viewed in color). A peer review comment often begins with a summary of the paper which annotators use to compose a TLDR. Annotators are trained to preserve the original reviewer’s wording when possible (indicated by colored spans), and to avoid using

Figure

Figure 1: An example TLDR of a scientific paper. A TLDR is typically composed of salient information (indicated by colored spans) found in the abstract, intro, and conclusion sections of a paper.

Figure

Figure 4: Summaries are rated by medical practitioners along the dimensions of adequacy, faithfulness, readability and ease of revision. Their ratings are averaged for each summary.

Figure

Figure 1: An example after-visit summary generated from EHR notes associated with a patient. A novel alerting mechanism is proposed in this work to report errors found in the summary, including missing medical events and hallucinated facts. We aim to build effective detectors with self-supervision on unlabeled data for error alerting.

Figure

Figure 2: One or two event nuggets are randomly masked out from a summary sentence (a). The masked sequence (b) is fed to a denoising auto-encoder to produce a synthesized sentence that may contain hallucinated medical events (c).

Figure

Figure 3: “abdominal pain” appears in both the clinical document and after-visit summary, with the same CUI. “nausea vomiting” and “nauseous” are aligned because there is an is-a relation between the two concepts.

Figure

Figure 1: Generation of sentence and document cluster embeddings. “⊕” stands for a pooling operation, while “⊗” represents a relevance measurement function.

Figure

Figure 1: A dependency tree example. The meaning of the dependency labels can be found in De Marneffe and Manning (2008). We extract the following two fact descriptions: taiwan share prices opened lower tuesday ||| dealers said

Figure

Figure 2: Model framework

Figure

Figure 2: Jointly Rerank and Rewrite

Figure

Figure 3: Gates change during training.

Figure

Figure 1: Flow chart of the proposed method. We use a dashed line for Retrieve since there is an embedded IR system.

Figure

Figure 5: Summaries use different portions of error correction operations. Contrastive learning with SYSLOWCON (CL.SLC) and BATCH (CL.B) substitute errors with correct content more often than unlikelihood training with MASKENT and ENTAILRANK.

Figure

Figure 7: Probability distributions of generating the non-first tokens of proper nouns, numbers, nouns, and verbs, grouped by extrinsic errors, intrinsic errors, world knowledge, and other correct tokens. Non-first tokens do not exist for numbers and verbs, as they only contain single tokens.

Figure

Figure 3: Probability distributions of generating the first tokens of proper nouns and numbers, grouped by extrinsic errors, intrinsic errors, world knowledge, and other correct tokens.

Figure

Figure 6: ROUGE-1 improvement with oracle masks for each head at each layer on the analysis sets of XSum and NYT.

Figure

Figure 1: Illustration of attention head masking (m̃).

Figure

Figure 2: ROUGE-1 F1 improvement with oracle masks for each head at each layer on the analysis set of CNN/DM. Overall, top layers see greater improvement than bottom layers. Layer 1 is the bottom layer connected with the word embeddings.

Figure

Figure 4: Portions of summaries with errors. CL models consistently reduce both types of errors.

Figure

Figure 9: Percentages of COPY SALIENT, NON-COPY SALIENT, COPY CONTENT, NON-COPY CONTENT, FIRST and LAST attendees for each head at each layer on the analysis set of NYT.

Figure

Figure 11: Sample generated summaries by fine-tuned BART models. Intrinsic errors are highlighted in red

Figure

Figure 10: Guideline for our human evaluation (§ 7.2).

Figure

Figure 9: Guideline for our summary error annotation (§ 4).

Figure

Figure 7: [Left] ROUGE-1 F1 improvement by incrementally applying oracle masking to the next head with most ROUGE improvement per layer on XSum and NYT. Dotted lines indicate that the newly masked heads do not have individual ROUGE improvements. [Right] ROUGE-1 recall improvement by masking all heads vs. sum of improvement by masking each head separately on XSum and NYT. Better displayed with color.

Figure

Figure 8: Percentages of samples containing world knowledge as labeled by humans on the outputs of XSum and CNN/DM.

Figure

Figure 8: Percentages of COPY SALIENT, NON-COPY SALIENT, COPY CONTENT, NON-COPY CONTENT, FIRST and LAST attendees for each head at each layer on the analysis set of XSum.

Figure

Figure 5: Results on CNN/DM with different sizes of training data. Our method consistently improves the summarizer.

Figure

Figure 10: Percentages of NON-COPY CONTENT and LAST attendees for each head at each layer on the analysis set of CNN/DM.

Figure

Figure 1: Sample article and system summaries by different methods. Our contrastive learning model trained on low confidence system outputs correctly generates

Figure

Figure 6: Probability distributions of generating the first tokens of nouns and verbs, grouped by extrinsic errors, intrinsic errors, world knowledge, and other correct tokens. No verb is annotated as world knowledge.

Figure

Figure 3: [Left] ROUGE-1 F1 improvement by incrementally applying oracle masking to the next head with most ROUGE improvement per layer on CNN/DM. Dotted lines indicate that the newly masked heads do not have individual ROUGE improvements. [Right] ROUGE-1 recall improvement by masking all heads vs. sum of improvement by masking each head separately on CNN/DM. Best viewed in color.

Figure

Figure 4: COPY and NON-COPY SALIENT attendee word percentages on the analysis set of CNN/DM. Top layers focus on words to be “copied”, while bottom layers attend to the broader salient context.

Figure

Figure 2: Percentage of samples with intrinsic and extrinsic error spans for models fine-tuned from BART and PEGASUS on XSum and CNN/DM.

Figure

Figure 1: The factuality and ROUGE score trade-off curve on XSUM. We use different reward value rnfe for our approach and different drop rate c for the loss truncation baseline.

Figure

Figure 8: Question-summary hierarchy annotation instructions. (Page 2 / 4)

Figure

Figure 2: The distribution of entities over prior/posterior probability. Each point in the figure represents an entity (pprior(ek), ppos(ek)) and shading indicates the confidence of the classifier. (a) The distribution of entities; (b) The entity factuality classification results with KNN (k = 20) classifier. Both factual hallucinated and non-hallucinated entities are colored blue; (c) The KNN (k = 20) classification boundaries of hallucinated and non-hallucinated entities.

Figure

Figure 11: Human evaluation guidelines.

Figure

Figure 6: Visualization of hierarchical biases in HIBRIDS-ENC (left) and HIBRIDS-DEC (right) on GOVREPORT. Positive and negative values are shaded in

Figure

Figure 12: Screenshot of the human evaluation interface.

Figure

Figure 6: Entity distribution over posterior probabilities from CMLMXSUM and CMLMCNN/DM. The shading shows the classification boundaries of the classifier.

Figure

Figure 4: Results on full summary generation. In each subfigure, the left panel includes models for comparisons and the right panel shows our models. HIBRIDS on either encoder and decoder uniformly outperforms the comparisons on both datasets.

Figure

Figure 1: The question-summary hierarchy annotated for sentences in a reference summary paragraph. Summarization models are trained to generate the question-summary hierarchy from the document, which signifies the importance of encoding the document structure. For instance, to generate the follow-up question-summary pairs of Q1.1 and A1.1 from A1, it requires the understanding of both the content and the parent-child and sibling relations among §3, §3.1, and §3.4.

Figure

Figure 5: Evaluation of PEGASUSLARGE trained on datasets with different levels of noise.

Figure

Figure 2: Example path lengths and level differences (right) that encode the relative positions with regard to the document tree structure (left). Each query/key represents a block of tokens that belong to the same

Figure

Figure 8: ROC curve of entity’s posterior probability and factuality.

Figure

Figure 3: Sample output by the hierarchical encoding model (HIERENC) and HIBRIDS-ENC. Our generated structure, with the constructed follow-up questions to Q1 (highlighted in green), makes more sense than that of the comparison model HIERENC.

Figure

Figure 10: Question-summary hierarchy annotation instructions. (Page 4 / 4)

Figure

Figure 7: Posterior probabilities calculated from CLM and CMLM. Both models are trained on XSUM dataset.

Figure

Figure 5: Visualization of hierarchical biases in HIBRIDS-ENC (left) and HIBRIDS-DEC (right) on QSGen-

Figure

Figure 1: Illustration of deep communicating agents presented in this paper. Each agent a and b encodes one paragraph in multiple layers. By passing new messages through multiple layers the agents are able to coordinate and focus on the important aspects of the input text.

Figure

Figure 4: The average ROUGE-L scores for summaries that are binned by each agent’s average attention when generating the summary (see Section 5.2). When the agents contribute equally to the summary, the ROUGE-L score increases.

Figure

Figure 3: Multi-agent encoder message passing. Agents b and c transmit the last hidden state output (I) of the current layer k as a message, which is passed through an average pool (Eq. 6). The receiving agent a uses the new message z (k) a as additional input to its next layer.

Figure

Figure 2: Multi-agent encoder-decoder overview. Each agent a encodes a paragraph using a local encoder followed by multiple contextual layers with agent communication through concentrated messages z(k)a at each layer k. Communication is illustrated in Figure 3. The word context vectors cta are condensed into the agent context c∗t. Agent-specific generation probabilities, pta, enable voting for the suitable out-of-vocabulary words (e.g., ’yen’) in the final distribution.

Figure

Figure 2: Overall model architecture consisting of (M1) a shared text encoder, (M2) a summary decoder, and (M3) a dual-view sentiment classification module. The shared text encoder converts the input review text into a memory bank. Based on the memory bank, the summary decoder generates the review summary word by word and receives a summary generation loss. The source-view (summary-view) sentiment classifier uses the memory bank (hidden states) from the encoder (decoder) to predict a sentiment label for the review (summary), and it receives a sentiment classification loss. An inconsistency loss is applied to penalize disagreement between the source-view and summary-view sentiment classifiers.

Figure

Figure 1: An example of truncated review and its corresponding summary and sentiment label.

Figure

Figure 6: Sample summaries generated by our D.GPT2+CMDP model using different length bins on the DUC-2002 testing set.

Figure

Figure 1: A sample document and three summaries generated by our entity-controlled model based on DistilGPT2 (Sanh et al., 2019) and fine-tuned by our proposed method. Each summary corresponds to the requested entity inside the pair of brackets.

Figure

Figure 2: Bin % of different models with different specified length bins on the DUC-2002 dataset. Our framework improves the bin % of PG and D.GPT2 for bin 4, 7, and 10 by a wide margin.

Figure

Figure 4: Results of entity-controlled models for entities in different document sentences. Our CMDP framework consistently improves the QA′-F1 and appear % for entities at different positions.

Figure

Figure 3: Values of costs (c) and Lagrangian multipliers (λ) of PG+CMDP for length control on every checkpoint (4k iterations) during training. Each value is averaged over 4k iterations.

Figure

Figure 5: Sample summaries generated by our D.GPT2+CMDP model with abstractiveness bin 1, 2, and 3 on the Newsroom-b testing set. Extractive fragments in summaries are in blue color. Factual errors are in red color.

Figure

Figure 1: Distribution of lexical formality (Equation 6) in CNN/Daily Mail, Reddit and the complete dataset. Positive values on the X-axis indicate high formality and negative values indicate informality.

Figure

Figure 2: Distribution of formality scores of the generated summaries. Y-Axis: Fraction of datapoints; X-Axis: Intervals of formality scores.

Figure

Figure 3: RL learning curve.

Figure

Figure 2: Reinforced training of the extractor (for one extraction step) and its interaction with the abstractor. For simplicity, the critic network is not shown. Note that all d’s and st are raw sentences, not vector representations.

Figure

Figure 3: The predicted extraction probabilities for each sentence, calculated from the output of each iteration.

Figure

Figure 5: Example from the dataset showing the generated summary of our best models. The colored (marked) sentences correspond to our extractor’s sentence selection. The listed ROUGE scores are computed for this specific example.

Figure

Figure 1 Our QA-based semantic evaluation system architecture

Figure

Figure 2: Relationship between the number of iterations and ROUGE score on the DailyMail test dataset with respect to the ground truth at 75 bytes.

Figure

Figure 2 Customizing a typical QA system for our evaluation approach

Figure

Figure 4 Comparison of our evaluation scores and human evaluation scores

Figure

Figure 4: Example from the dataset showing the generated summary of our best models. The colored (marked) sentences correspond to our extractor’s sentence selection. The listed ROUGE scores are computed for this specific example.

Figure

Figure 1: Model structure: there is one encoder, one decoder and one iterative unit (used to polish the document representation) in each iteration. The final labeling part generates the extraction probabilities for all sentences by combining the hidden states of the decoders in all iterations. We take a document consisting of three sentences as an example here.

Figure

Figure 3 The QA-based Summarization Evaluation Process Using DUC 2007 corpus

Figure

Figure 6: Distribution of model summary length (test set).

Figure

Figure 1: Ranking (descending order) of the current 11 top-scoring summarization systems (abstractive models are red while extractive ones are blue). Each system is evaluated based on three diverse evaluation methods: (a) averaging each system’s in-dataset ROUGE-2 F1 scores (R2) over five datasets; (b-c) evaluating systems using our designed cross-dataset measures: stiff-R2 and stable-R2 (Sec. 5). Notably, BERTmatch and BART are two state-of-the-art models for extractive and abstractive summarization respectively (highlighted by blue and red boxes).

Figure

Figure 4: Illustration of stiffness and stableness of ROUGE-1 F1 scores for various models. Yellow bars stand for extractive models and grey bars stand for abstractive models.

Figure

Figure 3: Bar chart for Tab. 4.

Figure

Figure 3: Characteristics of the test set for each dataset (the train set possesses almost the same properties and thus is not displayed here): coverage, copy length, novelty, sentence fusion score, repetition. Here we choose 2-grams to calculate the novelty and 3-grams for the repetition.

Figure

Figure 2: Different metrics characterized by a relation chart among generated summaries (Gsum), references (Ref) and input documents (Doc).

Figure

Figure 5: Illustration of stiffness and stableness of factuality scores for various models. Yellow bars stand for extractive systems and grey bars stand for abstractive systems.

Figure

Figure 2: Distribution of the similarity scores between summary and abstract according to Eq. 1.

Figure

Figure 5: Average percentage of novel n-grams in the generated summaries with the filtered training dataset RSCSUM-80.

Figure

Figure 1: Extractive fragment coverage and density distributions across the compared datasets, where n indicates the number of documents.

Figure

Figure 7: Statistical significance test on the values in Tab. 8. The four positions correspond to Grammaticality, Informativeness, Relevance and Overall Quality respectively, as shown in upper left box. “=” means no statistical difference, “>” means the row performs significantly better than the column at the significance level α = 0.05, whereas “6” indicates the same at α = 0.01.

Figure

Figure 8: Participants’ individual mean scores, by conditions.

Figure

Figure 4: Average percentage of novel n-grams in the generated summaries.

Figure

Figure 4: The performance of Ours(Fβ) on TAC-2010 under different λ, γ, M , and encoding models. When we change one of them, the others are fixed. The Pearson’s r and Spearman’s ρ are reported.

Figure

Figure 1: Overall framework of our method. w and s are the token-level and sentence-level representations. n and N (m and M ) are the token number and the sentence number of the summary (pseudo reference). For multidocument summary (i.e., K > 1), we compute relevance scores between the summary x and each document dk, and then average them as the final relevance score.

Figure

Figure 2: The gap of Spearman’s ρ between Ours(Fβ) and Ours(F1) on TAC-2011 for different |Set| and |Systems|. Positive gaps mean our Fβ can improve the performance while negative gaps indicate our Fβ degrades the performance. When changing one of them, the other is fixed. “all” means the full size is applied, i.e., 10 for |Set| and 50 for |Systems|.

Figure

Figure 5: Distributions of the reversed rank from SUPERT and Ours(Fβ) for bad and good summaries on TAC-2011. The bar in the middle indicates the median.

Figure

Figure 1: The overall accuracy performance of six representative factuality checkers.

Figure

Figure 3: Ablation studies for Ours(Fβ) on TAC datasets and Ours(F1) on CNNDM. “-CentralityW.” means that we remove the centrality weighting when computing relevance scores. “-HybridR.” means that we only utilize the token-level representations when calculating relevance and redundancy scores. “-Redundancy” indicates that we omit the redundancy score.

Figure

Figure 2: The overall accuracy performance of six representative factuality checkers.

Figure

Figure 4: An excerpt from anonymized SUMMSCREEN that corresponds to the instance in the Figure 1 in the main text. Character names are replaced with IDs that are permuted across episodes.

Figure

Figure 2: Left: TV show genres from ForeverDreaming. Right: TV show genres from TVMegaSite.

Figure

Figure 3: Two excerpts from SUMMSCREEN showing that generating summaries from TV show transcripts requires drawing information from a wide range of the input transcripts. We only show lines in the transcripts that are closely related to the shown parts of summaries. The number at the beginning of each line is the line number in the original transcript. For the first instance, we omit a few lines containing clues about the doctor taking pictures of the mansion at different times due to space constraints.

Figure

Figure 1: Excerpts from an example from SUMMSCREEN. The transcript and recap are from the TV show “The Big Bang Theory”. Generating this sentence in the recap requires discerning the characters’ feelings (clues in the transcript are underlined) about playing the board game (references are shown in red). Colored boxes indicate utterances belonging to the same conversations.

Figure

Figure 2: A recurrent convolutional document reader with a neural sentence extractor.

Figure

Figure 1: DailyMail news article with highlights. Underlined sentences bear label 1, and 0 otherwise.

Figure

Figure 3: Neural attention mechanism for word extraction.

Figure

Figure 4: Visualization of the summaries for a DailyMail article. The top half shows the relative attention weights given by the sentence extraction model. Darkness indicates sentence importance. The lower half shows the summary generated by the word extraction.

Figure

Figure 4: The interface for Selecting or Rating Model Output. Users can choose the final product from three “AI-generated” candidate summaries.

Figure

Figure 6: The interface for Interactive Editing. Users see an “AI-generated” summary in the text box. They can use the drop-down menu to change certain words in the first sentence. They can then press “Predict” to request the model to update the rest of the summary based on those edits.

Figure

Figure 7: The interface for Writing with Model Assistance. In a Google Doc, users can see the original article at the top, and they can write their summary under the section “Write your summary here:”. First, the user types a sentence for their summary; then a Bot (played by a researcher who logs in with the “SumAssist Bot account”) will insert the next sentence in gray font. The Bot will also insert comments on words in the user-written sentence and suggest changes.

Figure

Figure 2: Illustration of participants’ perception on level of efficiency, control, and trust with each interaction. These conceptual level charts show a qualitative, rather than precise, comparison between interactions.

Figure

Figure 1: Five human-AI interactions in text generation from Study 1, illustrated as summarization tasks. Explanation of the actions and visual elements are in §2.2.

Figure

Figure 3: The interface for Guiding Model Output. Users can change the desired summary length and style (formal or informal) using sliders and highlight parts of the original text that they want to include in the summary. Users can press the “Generate” button to get the “AI-generated” summary based on their inputs.

Figure

Figure 5: The interface for Post-editing. Users see an “AI-generated” summary in the text box that they can hypothetically edit.

Figure

Figure 1: The XLNet architecture with two-stream attention mechanism is leveraged to estimate whether a segment is self-contained or not. A self-contained segment is assumed to be preceded and followed by end-of-sentence markers (eos).

Figure

Figure 4: Absolute position of the whole sentence among all segments sorted by XLNet scores of self-containedness.

Figure

Figure 3: Example of a constituent parse tree, from which tree segments are extracted.

Figure

Figure 2: DPP selects a set of summary segments (marked yellow) based on the quality and pairwise dissimilarity of segments.

Figure

Figure 1: Example sentence summaries produced on Gigaword. I is the input, G is the true headline, A is ABS+, and R is RAS-ELMAN.

Figure

Figure 3: Visualization of UMAP projections of dictionary elements. Projections form clusters, which are shown in different colors.

Figure

Figure 2: General summary generation routine. The relevance score of each sentence w.r.t. the mean representation is computed, and the top N sentences (Oe) with the highest R(·) are selected as the summary.
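A compact sketch of the routine this caption describes, assuming sentence representations are real-valued vectors and instantiating the relevance function R(·) as cosine similarity to the mean; this is an illustration, not SemAE's actual code.

```python
import numpy as np

def general_summary(sent_reprs, sentences, n):
    """Score each sentence against the mean representation and keep the
    top-N as the extractive summary (O_e in the caption)."""
    reprs = np.asarray(sent_reprs, dtype=float)
    mean = reprs.mean(axis=0)
    relevance = reprs @ mean / (
        np.linalg.norm(reprs, axis=1) * np.linalg.norm(mean) + 1e-9)
    top = np.argsort(relevance)[::-1][:n]
    return [sentences[i] for i in sorted(top)]  # keep document order
```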

Figure

Figure 1: An example workflow of SemAE. The encoder producesH = 3 representations (sh) for a review sentence s, which are used to generate latent representations over dictionary elements. The decoder reconstructs the input sentences using vectors (zh) formed using latent representations (αh).

Figure

Figure 4: Head-wise visualization of UMAP (McInnes et al., 2018) dictionary element projections.

Figure

Figure 5: UMAP (McInnes et al., 2018) projections of dictionary element over different epochs (warmup epoch #4 to epoch 10). We observe that dictionary elements gradually evolve to form clusters over the epochs.

Figure

Figure 2: ROUGE-1, ROUGE-2 and ROUGE-L scores for different summarization approaches. The chartreuse (yellowish green) box shows the oracle, green boxes show the proposed summarizers, and blue boxes show the baselines. From left: Oracle; Citation-Context-Comm-It: Community detection on citation-context followed by iterative selection; Citation-Context-Community-Div: Community detection on citation-context followed by relevance and diversification in sentence selection; Citation-Context-Discourse-Div: Discourse model on citation-context followed by relevance and diversification; Citation-Context-Discourse-It: Discourse model on citation-context followed by iterative selection; Citation Summ.: Citation summary; MMR 0.3: Maximal marginal relevance with λ = 0.3.

Figure

Figure 1: The blue highlighted span in the citing article (top) shows the citation text, followed by the citation marker (pink span). For this citation, the citation-context is the green highlighted span in the reference article (bottom). The text spans outside the scope of the citation text and citation-context are not highlighted.

Figure

Figure 3: Comparison of the effect of different citation-context extraction methods on the quality of the final summary.

Figure

Figure 1: Dot product of embeddings and its logit for a sample word and its top most similar words (top 2000 and 1000).

Figure

Figure 1: Overview of our model. The word-level RNN is shown in blue and section-level RNN is shown in green. The decoder also consists of an RNN (orange) and a “predict” network for generating the summary. At each decoding time step t (here t=3 is shown), the decoder forms a context vector ct which encodes the relevant source context (c0 is initialized as a zero vector). Then the section and word attention weights are respectively computed using the green “section attention” and the blue “word attention” blocks. The context vector is used as another input to the decoder RNN and as an input to the “predict” network which outputs the next word using a joint pointer-generator network.

Figure

Figure 2: Example of a generated summary

Figure

Figure 3: Ratio of uncommon words in the document, which cannot be found in the Top-K OpenSubtitles words, for different k values.

Figure

Figure 2: Category distribution in WikiSum.

Figure

Figure 1: Examples of how-to questions and their corresponding answer’s summarization in WikiSum.

Figure

Figure 6: Comparison of ROUGE scores of the Features Only, SAFNet and SFNet models when trained with (bars on the left) and without (bars on the right) AbstractROUGE, evaluated on CSPubSum Test. The FNet classifier suffers a statistically significant (p=0.0279) decrease in performance without the AbstractROUGE metric.

Figure

Figure 1: SAFNet Architecture

Figure

Figure 2: Comparison of the average ROUGE scores for each section and the Normalised Copy/Paste score for each section, as detailed in Section 5.1. The wider bars in ascending order are the ROUGE scores for each section, and the thinner overlaid bars are the Copy/Paste count.

Figure

Figure 4: Comparison of the accuracy of each model on CSPubSumExt Test and ROUGE-L score on CSPubSum Test. ROUGE Scores are given as a percentage of the Oracle Summariser score which is the highest score achievable for an extractive summariser on each of the papers. The wider bars in ascending order are the ROUGE scores. There is a statistically significant difference between the performance of the top four summarisers and the 5th highest scoring one (unpaired t-test, p=0.0139).

Figure

Figure 3: Comparison of the best performing model and several baselines by ROUGE-L score on CSPubSum Test.

Figure

Figure 5: Comparison of the ROUGE scores of FNet, SAFNet and SFNet when trained on CSPubSumExt Train (bars on the left) and CSPubSum Train (bars on the right).

Figure

Figure 1: Results of the correlation between metrics and human judgments on the CNN dataset. The first row reports correlation as measured by the Pearson (r) coefficient and the second row focuses on the Kendall (τ) coefficient. In this experiment, parameters are optimized for each criterion.

Figure

Figure 2: Pearson correlation at the system level between metrics when considering abstractive system outputs.

Figure

Figure 4: Impact of Calibration for summarization.

Figure

Figure 3: Score distribution of text scores when considering abstractive system outputs. Pyr. stands for Pyramid score.

Figure

Figure 3. Visualized results of sentence topical weight. The degree of highlighting represents the overall relevance of the sentence and all topics. Underlined sentences are model-selected summary. The left document is from PubMed dataset, and the right document is from CNN/DM dataset.

Figure

Figure 1. Overall architecture of our model (Topic-GraphSum). In the graph attention layer (top right), the square nodes denote the sentence representations output from the document encoder (bottom right), and the circular nodes denote the topic representations learned by NTM (left).

Figure

Figure 2. Rouge-1 and -2 results of our full model and three ablated variants on four datasets.

Figure

Figure 8: Rouge results of our full model and the ablated version on the two datasets.

Figure

Figure 1: An example where a paragraph-by-paragraph extraction will produce an incoherent summary.

Figure

Figure 6: Impact of window length (left) and slot number (right) on model performance (R-1).

Figure

Figure 5: Proportion of sentences selected by each window.

Figure

Figure 2: The framework of our model. There are three major components: (1) the sliding encoder generates a representation of each sentence in the current window; (2) the memory layer infuses history information into sentence representations via graph neural networks; (3) the prediction layer aggregates learned features to compute the binary sentence labels.

Figure

Figure 3: An illustration of the information flow in our model. Paths (a) denote the interaction between memory vectors (M) and sentence representations (S) via a GAT layer. Paths (b) denote the computation of sentence labels. Paths (c) denote the updating process of the memory module.

Figure

Figure 7: Comparison between the output of our full model (top) and the ablated model (bottom). We use underlined text to denote model-selected sentences and bold text to denote the ground-truth sentences. The ablated model selects repetitive content in the 4th window and noisy content in the 5th window.

Figure

Figure 4: Position distribution of gold sentences on two datasets.

Figure

Figure 2: The Joint Model Architecture for both Document-level and Paragraph-level News Genre Tags Prediction.

Figure

Figure 1: Four News Structures: Document-level News Structure Tags (in rectangle) and Paragraph-level News Element Tags (in circle). A News Element may include one or more consecutive paragraphs.

Figure

Figure 4: Generated Summaries for Resource-poor CQA

Figure

Figure 4: Analysis of Multi-hop Reasoning

Figure

Figure 3: Gated Selective Pointer-Generator Network.

Figure

Figure 1: An example from PubMedQA. The highlighted sentences illustrate the inference process when humans answer the given question. Italic represents direct matching sentences from the question. Underlined and

Figure

Figure 1: Hierarchical and Sequential Context Modeling for Question-driven Answer Summarization

Figure

Figure 6: Duplication Analysis in Answers

Figure

Figure 2: The overview of Multi-hop Selective Generator (MSG).

Figure

Figure 2: Case Study. Bold / underlined / shadowed sentences are selected by HSCM / CA / NeuralSum, respectively.

Figure

Figure 2: Model Accuracy in terms of Answer Length

Figure

Figure 1: The Joint Learning Framework of Answer Selection and Abstractive Summarization (ASAS).

Figure

Figure 3: Case Study. ASAS generates the answer summary highly related to the question (Underlined), while PGN may misunderstand the core idea of the answer (Wavy-lined).

Figure

Figure 5: A case study with the same legend as Figure 1. The highlighted sentences are attended by MSG (3-hop).

Figure

Figure 3: Varying the salience threshold λS ∈ [0, 1) (depicted as % confidence) and its impact on ROUGE upon deleting spans ZP ∩ ZS.

Figure

Figure 2: Compression model used for plausibility and salience modeling (§3.3). We extract candidate spans ci ∈ C(T ) to delete, then compute span embeddings with pre-trained encoders (only one span embedding shown here). This embedding is then used to predict whether the span should be kept or deleted.

Figure

Figure 1: Decomposing span-based compression into plausibility and salience (§2). Plausible compressions (underlined) must maintain grammaticality, thus [to the ... wineries]PP is not a candidate. Salience identifies low-priority content from the perspective of this dataset (highlighted). Constituents both underlined and highlighted are deleted.

Figure

Figure 3: The extractive model uses three separate encoders to create representations for the reference document sentences, context tokens, and topics. These are combined through an attention mechanism, encoded at the document level, and passed through a feed-forward layer to compute an extraction probability for each reference sentence.

Figure

Figure 2: A paragraph from the “Family and personal life” section of Barack Obama’s Wikipedia page and selected excerpts from the cited documents which provide supporting evidence.

Figure

Figure 4: Example outputs from the abstractive model that uses the context. The model often copies sequences from the references which are sometimes correct (top) or incorrect but sensible (bottom), highlighting the difficulty of automatic evaluation. (Documents shortened for space. Sentences which are underlined were selected by the extraction step.)

Figure

Figure 1: Given a topic, reference document, and a partial summary (the context), the objective of the summary cloze task is to predict the next sentence of the summary, known as the cloze.

Figure

Figure 5: The VB+NSUBJ category selects tuples of verbs and their corresponding NSUBJ dependents in the dependency tree. In this example, 2/4 of the alignment (the solid lines) can be explained by matches between such tuples. The dashed lines cannot: The “and” alignment is not part of any tuple; Since “ran” and “sprinted” are not aligned, their corresponding tuples are not considered to be aligned, so the “Reese” match does not count toward the total.

Figure

Figure 1: An illustration of the three methods for sampling matrices during bootstrapping. The dark blue color marks values selected by the sample. Only 3 system and input samples are shown here, whereas N and M are actually sampled with replacement.
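
For readers who want to reproduce this style of analysis, here is a minimal sketch of the variant that resamples both systems and inputs; all names are my own, X and Z are assumed to be (systems × inputs) score matrices, and corr_fn is any function mapping two such matrices to a correlation value.

```python
import numpy as np

def boot_both_ci(X, Z, corr_fn, n_boot=1000, alpha=0.05, seed=0):
    """Bootstrap CI obtained by resampling both systems (rows) and
    input documents (columns) with replacement."""
    rng = np.random.default_rng(seed)
    N, M = X.shape
    stats = []
    for _ in range(n_boot):
        rows = rng.integers(0, N, size=N)  # resample systems
        cols = rng.integers(0, M, size=M)  # resample input documents
        stats.append(corr_fn(X[np.ix_(rows, cols)], Z[np.ix_(rows, cols)]))
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])
```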

Figure

Figure 1: Example answers selected by the three strategies. The only SCU marked by annotators for this sentence is SCU4, which does not include information about the location of the attacks. Therefore, an answer selection strategy that chooses ‘‘Baghdad’’ enables generating a QA pair such as QA3, which probes for information not included in the Pyramid annotation.

Figure

Figure 6: (Top) The distribution of the proportion of the QAEval-F1 score that is explained by SCU matches. (Bottom) The percentage of summaries with a score explained by a given proportion of SCU matches. We find that QAEval can be explained by SCU matches far more than ROUGE or BERTScore on average.

Figure

Figure 4: An example correct answer predicted by the model that is scored poorly by the EM or F1 QA metrics (both would assign a score of 0 or near 0). This occurs because the answer and prediction are drawn from two different summaries, and the same event is referred to in different ways in each one.

Figure

Figure 2: A typical example of expert-written and model-generated questions answerable by the phrase in red. The model questions are often significantly more verbose than the expert questions, typically copying the majority of the input sentence.

Figure

Figure 5: The Pearson correlations between the scores of several ROUGE variants, APES, and QAEval variants on TAC’08. The results support similar findings of Eyal et al. (2019), namely, that the ROUGE metrics are highly correlated to each other but have low correlation to the QA-based metrics, suggesting the two types of metrics offer complementary signals.

Figure

Figure 3: A comparison of the correlations of QAEvalF1 on a subset of TAC’08 using expert-written and model-generated questions. Each point represents the average correlation calculated using 30 samples of {2, 4, 6, 8, 10} instances, plotted with 95% error bars. System-level correlations were calculated against the summarizers’ average responsiveness scores across the entire TAC’08 dataset. We hypothesize the model questions perform better due to their verbosity, which causes more keywords to be included in the question that the QA model can match against the summary.

Figure

Figure 4: Every token alignment used by ROUGE or BERTScore is assigned to one or more interpretable categories (defined in §5). This allows us to calculate, for this example, that matches between named-entities contribute 1/4 to the overall score, stopwords 2/4, and noun phrases 3/4 (assuming alignment weights of 1.0).

Figure

Figure 2: An example token alignment used by ROUGE or BERTScore. Each color represents a summary content unit (SCU) that marks informational content. Only 2/5 of the token alignments (the solid edges) can be explained by matches between phrases that express the same information (the green phrases).

Figure

Figure 1: Both candidate summaries are similar to the reference, but along different dimensions: Candidate 1 contains some of the same information, whereas candidate 2’s information is different, but it at least discusses the correct topic. The goal of this work is to understand if summarization evaluation metrics’ scores should be interpreted as measures of information overlap or, less desirably, topic similarity.

Figure

Figure 6: The results of running the PERM-BOTH hypothesis test to find a significant difference between metrics’ Pearson correlations with the Bonferroni correction applied per dataset and correlation level pair instead of per metric (as in Figure 5). A blue square means the test returned a significant p-value at α = 0.05, indicating the row metric has a higher correlation than the column metric. An orange outline means the result remained significant after applying the Bonferroni correction.

Figure

Figure 5: The results of running the PERM-BOTH hypothesis test to find a significant difference between metrics’ Pearson correlations. A blue square means the test returned a significant p-value at α = 0.05, indicating the row metric has a higher correlation than the column metric. An orange outline means the result remained significant after applying the Bonferroni correction.

Figure

Figure 3: The distribution of the proportion of ROUGE (top) and BERTScore (bottom) on TAC 2008 that can be explained by token matches that are labeled with the same SCU (Eq. 5). The averages, 25% and 15% (in red), indicate that only a small portion of their scores comes from matches between phrases that express the same information.

Figure

Figure 2: An illustration of the three permutation methods which swap system scores, document scores, or scores for individual summaries between X and Y .
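
A rough sketch of the summary-level ("both") swap variant is given below, assuming corr_fn computes a correlation (e.g., system-level) between a metric's score matrix and the human score matrix H; the names and the one-sided p-value convention are my own.

```python
import numpy as np

def perm_both_pvalue(X, Y, H, corr_fn, n_perm=1000, seed=0):
    """Swap the scores of metrics X and Y for randomly chosen individual
    summaries, recompute the difference in correlation with human scores H,
    and report how often the permuted difference exceeds the observed one."""
    rng = np.random.default_rng(seed)
    observed = corr_fn(X, H) - corr_fn(Y, H)
    exceed = 0
    for _ in range(n_perm):
        swap = rng.random(X.shape) < 0.5  # entry-wise swap mask
        Xp = np.where(swap, Y, X)
        Yp = np.where(swap, X, Y)
        if corr_fn(Xp, H) - corr_fn(Yp, H) >= observed:
            exceed += 1
    return exceed / n_perm
```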

Figure

Figure 4: The 95% confidence intervals for ρSUM (blue) and ρSYS (orange) calculated using Kendall’s correlation coefficient on TAC’08 (left) and CNN/DM summaries (middle, Fabbri et al. (2021); right, Bhandari et al. (2020)) are rather large, reflecting the uncertainty about how well these metrics agree with human judgments of summary quality.

Figure

Figure 6: The system- and summary-level Pearson correlations as the number of available reference summaries increases. 95% confidence error bars shown, but may be too small to see. PyrEval is missing data because the official implementation requires at least two references. Even with one reference summary, QAEval-F1 maintains a higher system-level correlation than ROUGE.

Figure

Figure 3: The system- and summary-level Pearson estimates of the power of the BOOT-BOTH, PERM-BOTH, and Williams hypothesis test methods calculated on the annotations from Fabbri et al. (2021). The power for BOOT-BOTH and Williams at the system-level is ≈ 0 for all values.

Figure

Figure 1: In the answer verification task, the metrics score how likely two phrases from different contexts have the same meaning. Here, the metrics at the bottom score the similarity between “emergency responders,” which was used to generate the question from the source text, and “paramedics,” the predicted answer from a QA model in the target text.

Figure

Figure 3: Bootstrapped estimates of the stabilities of the system rankings for automatic metrics and human annotations on SummEval (left) and REALSumm (right). The τ value quantifies how similar two system rankings would be if they were computed with two random sets of M input documents. When all Mtest test instances are used, the automatic metrics’ rankings become near constant. The error regions represent ±1 standard deviation.

Figure

Figure 4: 95% confidence intervals for rSYS calculated with the BOOT-INPUTS resampling method when the system rankings for the automatic metrics are calculated using only the judged data (orange) versus the entire test set (blue). Scoring systems with more summaries leads to better (more narrow) estimates of rSYS.

Figure

Figure 7: The 95% CIs calculated using the BOOT-SYSTEMS bootstrapping method with Mjud summaries in orange and Mtest in blue.

Figure

Figure 10: rSYS∆(ℓ, u) correlations for various combinations of ℓ and u (see §4.2) for ROUGE (top), BERTScore (middle), and QAEval (bottom) on SummEval (left) and REALSumm (right). The values of ℓ and u were chosen so that each value in the heatmaps evaluates on 10% more system pairs than the value to its left. For instance, the first row evaluates on 10%, 20%, . . . , 100% of the system pairs. The second row evaluates on 10%, 20%, . . . , 90% of the system pairs, never including the 10% of pairs which are closest in score. The first row of each of the heatmaps is plotted in Fig. 6. The correlations on realistic score differences between systems are in the upper left portion of the heatmaps and contain the lowest correlations overall. Evaluating on all pairs is the top-rightmost entry, and the “easiest” pairs (those separated by a large score margin) are in the bottom right.

Figure

Figure 2: The distributions of score values for three metrics on the SummEval dataset for ground-truth answer and QA model prediction pairs from QAEval with the same (blue) and different (orange) meanings.

Figure

Figure 1: The system-level correlation is calculated between the average X and Z scores on a set of summarization systems. x_i^j and z_i^j are the scores for the summary produced by system i (represented by rows) on input document j (represented by columns).
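
As a worked illustration of this definition (not the authors' code), the system-level correlation can be computed as follows, assuming X and Z are the score matrices depicted in the figure.

```python
import numpy as np
from scipy.stats import pearsonr

def system_level_correlation(X, Z):
    """X and Z are (num_systems x num_inputs) score matrices; each row holds
    one system's scores over the input documents."""
    x_sys = X.mean(axis=1)  # average X score per system
    z_sys = Z.mean(axis=1)  # average Z score per system
    return pearsonr(x_sys, z_sys)[0]
```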

Figure

Figure 9: The rSYS∆(ℓ, u) correlations on the SummEval (top) and REALSumm (bottom) datasets for ℓ = 0 and various values of u for ROUGE-1, ROUGE-2, and ROUGE-L. The u values were chosen to select the 10%, 20%, . . . , 100% of the pairs of systems closest in score. Each u is displayed on the top of each plot.

Figure

Figure 2: The bootstrapped 95% confidence intervals for the BERTScore of each system in the REALSumm dataset using Mjud judged instances in blue and Mtest instances in orange. Evaluating systems with Mtest instances leads to far better estimates of their true scores.

Figure

Figure 5: The systems (each represented by a point) on the two datasets (shown here for REALSumm) are rather diverse in quality as measured by both human judgments and automatic metrics.

Figure

Figure 8: The 95% CIs calculated using the BOOT-BOTH bootstrapping method with Mjud summaries in orange and Mtest in blue.

Figure

Figure 6: The rSYS∆(ℓ, u) correlations on the SummEval (top) and REALSumm (bottom) datasets for ℓ = 0 and various values of u (additional combinations of ℓ and u can be found in Appendix B). The u values were chosen to select the 10%, 20%, . . . , 100% of the pairs of systems closest in score. Each u is displayed on the top of each plot. For instance, 20% of the (N choose 2) system pairs on SummEval are separated by < 0.5 ROUGE-1, and the system-level correlation on those pairs is around 0.08. As more systems are used in the correlation calculation, the allowable gap in scores between system pairs increases; such pairs are therefore likely easier to rank, resulting in higher correlations.

Figure

Figure 11: See Fig. 10 for a description of the heatmaps, shown here for ROUGE-2 (top) and ROUGE-L (bottom).

Figure

Figure 1: Overview pipeline of the proposed model which is executed simultaneously in two phases (a). The first phase encodes the sentences with pre-trained BERT and uses [CLS] information as the input of a graph attention layer (b). The second phase encodes the word and sentence nodes as the inputs of a heterogeneous graph layer (c). The output of the two phases is concatenated and put into an MLP layer in order to classify labels for each sentence.

Figure

Figure 1: Average of ROUGE-1, -2, and -L F1 scores on the Daily Mail validation set within one epoch of training on the Daily Mail training set. The x-axis (multiplied by 2,000) indicates the number of data examples the algorithms have seen. The supervised labels in SummaRuNNer are used to estimate the upper bound.

Figure

Figure 2: Model comparisons of the average value of ROUGE-1, -2, and -L F1 scores (f) on D_early and D_late. For each model, the results were obtained by averaging f across ten trials with 100 epochs in each trial. D_early and D_late consist of 50 articles each, such that the good summary sentences appear early and late in the article, respectively. We observe a significant advantage of BANDITSUM compared to RNES and RNES3 (based on the sequential binary labeling setting) on D_late.

Figure

Figure 2: Model architecture (Left: QA-span fact correction model. Right: Auto-regressive fact correction model).

Figure

Figure 1: Training example created for the QA-span prediction model (upper right) and the auto-regressive fact correction model (bottom right).

Figure

Figure 2: Sentence positions in source document for extractive summaries generated by different models on the PubMed validation set. Documents on the x-axis are ordered by increasing article length from shortest to longest. We also see a similar trend on arXiv (the plots with more details can be found in the appendix).

Figure

Figure 4: Comparison of the flat fully-connected graph used in Erkan and Radev (2004); Mihalcea and Tarau (2004); Zheng and Lapata (2019) to the hierarchical graph used in our models (b) and (c). Although the section-section multiplication reduces the edge computation proportionally to the number of sections, we found it oversimplifies the graph by assuming independence between sentences across different sections. Our final model loosens the assumption by including section-sentence connections as shown in sub-figure (c).

Figure

Figure 1: Example of a hierarchical document graph constructed by our approach on a toy document that contains two sections {T1, T2}, each containing three sentences for a total of six sentences {s1, . . . , s6}. Each double-headed arrow represents two edges with opposite directions. The solid and dashed arrows indicate intra-section and inter-section connections respectively. When compared to the flat fully-connected graph of traditional methods, our use of hierarchy effectively reduces the number of edges from 60 to 24 in this example.

Figure

Figure 5: Sentence positions in source document for extractive summaries generated by different models on the arXiv validation set. Documents on the x-axis are ordered by increasing article length from shortest to longest.

Figure

Figure 3: ROUGE-L scores for (a) different positional functions (L=lead, U=undirected, B=boundary) and (b) different graph hierarchies (NS=no section, H=hierarchical). Each point corresponds to one configuration of the hyperparameter gridsearch described in Section 4.2.

Figure

Figure 4: There is a strong correlation between the guidance quality and output quality, demonstrating the controllability of our guided model.

Figure

Figure 1: Our framework generates summaries using both the source document and separate guidance signals. We use an oracle to select guidance during training and use automatically extracted or user-specified guidance at test time.

Figure

Figure 3: Our model can generate more novel words and achieves higher recall of novel words in the gold reference compared with the baseline.

Figure

Figure 5: The factCC model gives the gold reference an accuracy of about 10%.

Figure

Figure 3: The attention weight changes by using the contrastive attention mechanism. (a) is the average attention weights of the third layer of the baseline Transformer, (b) is that of “Transformer+ContrastiveAttention”, and (c) is the opponent attention derived from the fifth head of the third layer.

Figure

Figure 1: Overall networks. The left part is the original Transformer. The right part, which takes the opponent attention as its bottom layer, implements the contrastive attention mechanism.

Figure

Figure 2: Heatmaps of two sampled heads from the conventional encoder-decoder attention. (a) is of the fifth head of the third layer, and (b) is of the fifth head of the first layer.

Figure

Figure 1: The decision diagram of our human annotation process. Decision nodes are rectangular and outcome nodes are circular. We show the annotation path of two summary sentences, S1 (green arrows) and S2 (red arrows). S2 is annotated as nonsensical and is therefore not considered for faithfulness. S1 is annotated as unfaithful due to hallucinated content.

Figure

Figure 2: Overview of FEQA. Given a summary sentence and its corresponding source document, we first mask important text spans (e.g. noun phrases, entities) in the summary. Then, we consider each span as the “gold” answer and generate its corresponding question using a learned model. Lastly, a QA model finds answers to these questions in the documents; its performance (e.g. F1 score) against the “gold” answers from the summary is taken as the faithfulness score.
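
A high-level sketch of this pipeline is shown below; mask_noun_phrases, generate_question, and answer_question are hypothetical placeholders for the learned components described in the caption, and only the token-level F1 is spelled out.

```python
from collections import Counter

def token_f1(prediction, gold):
    """Standard token-level F1 between a predicted and a gold answer."""
    p, g = prediction.lower().split(), gold.lower().split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

def feqa_score(summary_sentence, document):
    """Average QA F1 over questions generated from masked summary spans.
    mask_noun_phrases / generate_question / answer_question are placeholders."""
    scores = []
    for masked_sent, gold_answer in mask_noun_phrases(summary_sentence):
        question = generate_question(masked_sent, gold_answer)
        predicted = answer_question(question, document)
        scores.append(token_f1(predicted, gold_answer))
    return sum(scores) / len(scores) if scores else 0.0
```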

Figure

Figure 2: Compression constraints on an example sentence. (a) RST-based compression structure like that in Hirao et al. (2013), where we can delete the ELABORATION clause. (b) Two syntactic compression options from Berg-Kirkpatrick et al. (2011), namely deletion of a coordinate and deletion of a PP modifier. (c) Textual units and requirement relations (arrows) after merging all of the available compressions. (d) Process of augmenting a textual unit with syntactic compressions.

Figure

Figure 4: Examples of an article kept in the NYT50 dataset (top) and an article removed because the summary is too short. The top summary has a rich structure to it, corresponding to various parts of the document (bolded) and including some text that is essentially a direct extraction.

Figure

Figure 5: Counts on a 1000-document sample of how frequently both a document prefix baseline and a ROUGE oracle summary contain sentences at various indices in the document. There is a long tail of useful sentences later in the document, as seen by the fact that the oracle sentence counts drop off relatively slowly. Smart selection of content therefore has room to improve over taking a prefix of the document.

Figure

Figure 1: ILP formulation of our single-document summarization model. The basic model extracts a set of textual units with binary variables xUNIT subject to a length constraint. These textual units u are scored with weights w and features f . Next, we add constraints derived from both syntactic parses and Rhetorical Structure Theory (RST) to enforce grammaticality. Finally, we add anaphora constraints derived from coreference in order to improve summary coherence. We introduce additional binary variables xREF that control whether each pronoun is replaced with its antecedent using a candidate replacement rij . These are also scored in the objective and are incorporated into the length constraint.
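
For intuition, here is a stripped-down sketch of the basic unit-selection ILP under a length budget, using PuLP purely as an illustrative solver; the grammaticality (requirement) and anaphora constraints of the full model are omitted, and all names are my own.

```python
from pulp import LpBinary, LpMaximize, LpProblem, LpVariable, lpSum

def select_units(scores, lengths, budget):
    """Choose binary unit variables x to maximize a linear score w·f
    subject to a total length constraint."""
    prob = LpProblem("extractive_summary", LpMaximize)
    x = [LpVariable(f"x_unit_{i}", cat=LpBinary) for i in range(len(scores))]
    prob += lpSum(s * xi for s, xi in zip(scores, x))             # objective
    prob += lpSum(l * xi for l, xi in zip(lengths, x)) <= budget  # length cap
    # A requirement relation (a clause needs its parent) would be added as:
    # prob += x[child] <= x[parent]
    prob.solve()
    return [i for i, xi in enumerate(x) if xi.value() > 0.5]
```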

Figure

Figure 3: Modifications to the ILP to capture pronoun coherence. It, which refers to Kellogg, has several possible antecedents from the standpoint of an automatic coreference system (Durrett and Klein, 2014). If the coreference system is confident about its selection (above a threshold α on the posterior probability), we allow for the model to explicitly replace the pronoun if its antecedent would be deleted (Section 2.2.1). Otherwise, we merely constrain one or more probable antecedents to be included (Section 2.2.2); even if the coreference system is incorrect, a human can often correctly interpret the pronoun with this additional context.

Figure

Figure 2: Examples of Layout vs Plain Summaries

Figure

Figure 3: Counts of participant responses when comparing Plain & Stock.

Figure

Figure 4: Counts of participant responses when comparing Layout & Stock.

Figure

Figure 1: VerbNet frame for ‘murder’

Figure

Figure 4: The average document information and document information given summary when prompting GPT-2 with different amounts of upstream sentences for the SummEval dataset.

Figure

Figure 3: The average document information and document information given summary as estimated by different sizes of GPT-2 for the SummEval dataset.

Figure

Figure 1: A comparison of token-wise information content within a document as estimated by GPT-2 in 4 scenarios: the document on its own, the document given the document, the document given a high quality summary, and the document given a low quality summary. Tokens with a darker background color have more information.
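
A minimal sketch of how such conditional information estimates can be obtained with the Hugging Face GPT-2 model is given below (model size and function names are my own); quantities such as the Shannon Score and Information Difference are, to my understanding, built from differences of these terms.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def information_content(text, conditioning=""):
    """Total negative log-likelihood (in nats) of `text` under GPT-2,
    optionally conditioned on a prefix such as a candidate summary."""
    prefix_ids = tokenizer(conditioning).input_ids if conditioning else []
    text_ids = tokenizer(text).input_ids
    ids = torch.tensor([prefix_ids + text_ids])
    with torch.no_grad():
        log_probs = torch.log_softmax(model(ids).logits, dim=-1)
    nll = 0.0
    # token at position `pos` is predicted by the logits at `pos - 1`
    for pos in range(max(len(prefix_ids), 1), len(prefix_ids) + len(text_ids)):
        nll -= log_probs[0, pos - 1, ids[0, pos]].item()
    return nll
```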

Figure

Figure 2: Distributions of Shannon Score and Information Difference on 100 summaries from the CNN/DailyMail dataset. Three different summaries are used: the original human written reference summary (in blue), the original summary with words scrambled (in orange), and a reference summary for a different document in the dataset (in green).

Figure

Figure 1: Extractive Model Architecture

Figure

Figure 2: Evaluation flow of APES.

Figure

Figure 3: Example 202 from the CNN/Daily Mail test set.

Figure

Figure 1: Example 3083 from the test set.

Figure

Figure 4: Example 4134 from the CNN/Daily Mail test set. Colors and underlines in the source reflect differences between the baseline and our model's attention weights: red and a single underline mark words attended to by the baseline model but not our model; green and a double underline mark the opposite. Entities in bold in the target summary are answers to the example questions.

Figure

Figure 2: Pairwise Kendall’s tau correlations for all automatic evaluation metrics.

Figure

Figure 1: Histogram of standard deviations of inter-annotator scores between: crowd-sourced annotations, first round expert annotations, and second round expert annotations, respectively.

Figure

Figure 1: Sample argument subgraph constructed from NYT news comments illustrating varying viewpoints. Claims “I honestly...” and “but I dont..” are entailed by premises, connected through Default Inference nodes, and opposing claims are connected through Issue nodes.

Figure

Figure 3: Example of the data collection interface used by crowd-sourced and expert annotators.

Figure

Figure 1: ROUGE-1 scores across datasets, training dataset size, data augmentation (*-a), and consistency loss (*-c) showing the generalizable and robust performance of models transferred from WikiTransfer. Standard deviation bars are also plotted.

Figure

Figure 1: An illustration of our dataset annotation pipeline. Given a question and answers to that question, professional linguists 1) select relevant sentences, 2) cluster those selected sentences, 3) summarize each cluster’s sentences, and 4) fuse clusters into a coherent, overall summary.

Figure

Figure 2: An illustration of our automatic dataset pipeline which mirrors the manual pipeline for data augmentation. Given a question and answers, relevant sentences are selected and clustered. Then, the cluster centroid sentence of non-singleton clusters is removed from the input to use as bullet point summaries.

Figure

Figure 1: Example of an incorrect summary sentence produced by PGC (see Section 4) on CNN-DM.

Figure

Figure 3: Examples of incorrect sentences produced by different summarization models on the CNN-DM test set.

Figure

Figure 2: Agreement between crowdsourced and expert annotations at increasing numbers of workers.

Figure

Figure 4: Two alternative sentences from generated summaries, one correct and one incorrect, for the given source sentence. All tested NLI models predict very high entailment probabilities for the incorrect sentence, with only BERT estimating a slightly higher probability for the correct alternative.

Figure

Figure 1: Length control vs summary length. Length control can take 10 discrete values.

Figure

Figure 2: Example of monolingual system-generated summaries.

Figure

Figure 3: Example of cross-lingual system-generated summaries.

Figure

Figure 2: Structure of the document encoder in Bi-AES.

Figure

Figure 1: Structure of Uni-AES.

Figure

Figure 3: Performance for different document lengths.

Figure

Figure 3: Models trained on synthetic data evaluated on original CNN/DM documents, of either <1000 words (short) or >2000 words (long). True uses the summary under the document’s true aspect. ‘Best’ takes the best-scoring summary under all possible input aspects.

Figure

Figure 1: Two news articles with color-coded encoder attention-based document segmentations, and selected words for illustration (left), the abridged news article (top right) and associated aspect-specific model summaries (bottom right). Top: Article from our synthetic corpus with aspects sport, tvshowbiz and health. The true boundaries are known, and indicated by black lines in the plot and ‖ in the article. Bottom: Article from the RCV1 corpus with document-level human-labeled aspects sports, news and tvshowbiz (gold segmentation unknown).

Figure

Figure 2: Visualization of our three aspect-aware summarization models, showing the embedded input aspect (red), word embeddings (green), latent encoder and decoder states (blue) and attention mechanisms (dotted arrows). Left: the decoder aspect attention model; Center: the encoder attention model; Right: the source-factors model.

Figure

Figure 1: Overview of the variational hierarchical topic-aware model.

Figure

Figure 2: Overview of the topic embedding mechanism: ϕ(rd) is the topic distribution, Mt is the topic mapping matrix, and tei is the topic embedding of word xi.

Figure

Figure 3: The similarity of topic distributions and the topic number mapping between documents and summaries generated by humans or the learning models.

Figure

Figure 4: Topic distribution visualization of some extracted words, which consist of three different topic groups and a random one.

Figure

Figure 2: Model architecture for adjacency reranking variation of Co-opNet

Figure

Figure 1: Generated abstracts for a biology article (from the Bio subset of our arXiv dataset). Abstracts are ranked from most (top) to least likely (bottom) using the generator model. Abstracts with better narrative structure and domain-specific content (such as the circled abstract) are often out-ranked in terms of likelihood by abstracts with factual errors and less structure.

Figure

Figure 2: Examples of factual errors given in annotation task.

Figure

Figure 1: Distribution of common factual error types in sampled generated summaries (96.37% of all errors). We draw from the same error types for our controlled analysis to ensure we match the true distribution of errors. Here extrinsic entity refers to entities that did not previously appear in the source, while an intrinsic entity appeared in the source.

Figure

Figure 4: Part of an EDUA solution graph. Each vertex is a segment vector from a reference summary, indexed by Summary.ID (si), Sentence.ID (sij), Segmentation.ID (sijk), Segment.ID (sijkm). All segments of all reference summaries have a corresponding node. All edges connect segments from different summaries with similarity ≥ tedge. This schematic representation of a partial solution contains three fully connected subgraphs with attraction edges (solid lines), each representing an SCU, whose weight is the number of vertices (segments).

Figure

Figure 5: Visualizations of ROUGE score with different hop numbers.

Figure

Figure 8: Contour plot for score correlations with β (X-axis) and tedge (Y-axis).

Figure

Figure 1: Overview of PESG. We divide our model into four parts: (1) Prototype Reader; (2) Fact Extraction; (3) Editing Generator; (4) Fact Checker.

Figure

Figure 2: Framework of fact extraction module.

Figure

Figure 7: Alignment of a PyrEval SCU of weight 3 to segments from student summaries on autonomous vehicle.

Figure

Figure 5: Formal specification of EDUA’s input graph G consisting of all segments from all segmentations of reference summary sentences (item 2), the objective (item 6), and three scores for defining the objective function that are assigned to candidate SCUs (item 3), sets of SCUs of the same weight (item 4), and a candidate pyramid (item 5).

Figure

Figure 1: The workflows of cross-input RL (top), input-specific RL (middle) and RELIS (bottom). The ground-truth reward can be provided by humans or automatic metrics (e.g. BLEU or ROUGE) measuring the similarity between generated output text and the reference output.

Figure

Figure 4: Visualizations of editing gate.

Figure

Figure 6: A directed Depth First Search tree for EDUAC. Nodes are cliques representing candidate SCUs, as illustrated in Figure 4, labeled by their weights. Each DFS path is a partition over one way to segment all the input summaries and group all segments into SCUs. The solution is the path with the highest AP .

Figure

Figure 1: Alignment of a single PyrEval SCU of weight 5 to a manual SCU of weight 4 from a dataset of student summaries. The manual and automated SCUs express the same content, and their weights differ only by one. For each of five reference summaries (RSUM1-RSUM5), exact matches of words between the PyrEval and manual contributor are in bold, text in plain font (RSUM2, RSUM4) appears only in the manual version, and text in italics appears only in the PyrEval version. Paraphrases of the same content from RSUM4 were identified by human annotators (plain font) and PyrEval (italics). Also shown is a matching segment from a student summary, where the student used synonyms of some of the words in the reference summaries.

Figure

Figure 3: Framework of fact checker module.

Figure

Figure 2: PyrEval preprocessors segment sentences from reference (RSUM) and evaluation (ESUM) summaries into clause-like units, then convert them to latent vectors. EDUA constructs a pyramid from RSUM vectors (lower left): the horizontal bands of the pyramid represent SCUs of decreasing weight (shaded squares). WMIN matches SCUs to ESUM segments to produce a raw score, and three normalized scores.

Figure

Figure 4: Workflow of the data augmentation method with an input reference summary and output augmented sample

Figure

Figure 3: Top: An example assessment input with all the concepts (highlighted in color boxes) identified through QuickUMLS, a state-of-the-art off-the-shelf medical concept extractor. Middle: Two example plan subsections with the annotated problems,

Figure

Figure 5: Performance drops (lighter color) and gains (darker color) over baseline (first column) on ROUGE-L Recall (top 4 rows) and Precision (bottom 4 rows). The darker the cell color is, the higher performance gain the model obtains over baseline.

Figure

Figure 6: Two cherry-picked examples from T5-DAPT-CUI output, with cyan font highlighting the correct diseases.

Figure

Figure 7: Two example reference (REF) and predicted summaries (PRED) from T5-ALL (input with objective sections).

Figure

Figure 1: When a sick patient arrives at the hospital, diagnostic evaluations are performed to assess the patient’s condition and deduce the problems causing the illness.

Figure

Figure 2: An input example of assessment and subjective sections available in the notes: Chief Complaint, Allergies, Review of systems.

Figure

Figure 8: Given an input assessment, we show the reference summary, example output from fine-tuning T5 and BART, and T5 DAPT with token masking and concept masking. The red font highlights information that is outside the input text.

Figure

Figure 4: Recall at rank threshold n for summary 4B

Figure

Figure 5: IU samples with rephrasing.

Figure

Figure 2: Averaged precision at rank threshold n

Figure

Figure 1: Extended definition of IU based on Kroll (1977). Our edits are presented in italics.

Figure

Figure 3: Averaged recall at rank threshold n

Figure

Figure 6: Screenshot of Segment Matcher

Figure

Figure 4: For all copied words, we show the distribution over the length of copied phrases they are part of. The black lines indicate the reference summaries, and the bars the summaries with and without bottom-up attention.

Figure

Figure 2: Overview of the selection and generation processes described throughout Section 4.

Figure

Figure 3: The AUC of the content selector trained on CNN-DM with different training set sizes ranging from 1,000 to 100,000 data points.

Figure

Figure 1: Example of two sentence summaries with and without bottom-up attention. The model does not allow copying of words in [gray], although it can generate words. With bottom-up attention, we see more explicit sentence compression, while without it whole sentences are copied verbatim.

Figure

Figure 1: Overview of our summarization model. As shown, “bilateral” in the FINDINGS is a significant ontological term which has been encoded into the ontology vector. After refining FINDINGS word representation, the decoder computes attention weight (highest on “bilateral”) and generates it in the IMPRESSION.

Figure

Figure 2: Histograms and arrow plots showing differences between IMPRESSION of 100 manually-scored radiology reports. Although challenges remain to reach human parity for all metrics, 81% (a), 82% (b), and 80% (c) of our system-generated Impressions are as good as human-written Impressions across different metrics.

Figure

Figure 6: Difference in ROUGE-2 between Variational models and their deterministic counterparts versus the fraction of data discarded. Positive values indicate that deterministic ROUGE-2 is lower than Variational.

Figure

Figure 5: ROUGE-L scores vs fraction of data discarded due to high BLEUVarN. The straight dashed lines indicate the performance level of the deterministic PEGASUS and BART models.

Figure

Figure 3: Difference in ROUGE-1 between Variational models and their deterministic counterparts versus the fraction of data discarded. Positive values indicate that deterministic ROUGE-1 is lower than Variational.

Figure

Figure 2: BLEUVarN curves as a function of data discarded due to low ROUGE-1 scores.

Figure

Figure 4: ROUGE-2 scores vs fraction of data discarded due to high BLEUVarN. The straight dashed lines indicate the performance level of the deterministic PEGASUS and BART models.

Figure

Figure 7: Difference in ROUGE-L between Variational models and their deterministic counterparts versus the fraction of data discarded. Positive values indicate that deterministic ROUGE-L is lower than Variational.

Figure

Figure 1: ROUGE-1 scores vs fraction of data discarded due to high BLEUVarN. The straight dashed lines indicate the performance level of the deterministic PEGASUS and BART models.

Figure

Figure 1: The complete pipeline of the proposed method.

Figure

Figure 2: Simplified encoder-decoder transformer architectures used by BART and T5.

Figure

Figure 1: Example conditional summaries for two tasks.

Figure

Figure 2: ScriptBase corpus statistics. Movies can have multiple genres, thus numbers do not add up to 1,276.

Figure

Figure 4: Example of a bipartite graph, connecting a movie’s scenes with participating characters.

Figure

Figure 1: Excerpt from “The Silence of the Lambs”. The scene heading INT. THE PANEL TRUCK - NIGHT denotes that the action takes place inside the panel truck at night. Character cues (e.g., MAN, CATHERINE) preface the lines the actors speak. Action lines describe what the camera sees (e.g., We can’t get a good glimpse of his face, but his body. . . ).

Figure

Figure 3: Example of consecutive chain (top). Squares represent scenes in a screenplay. The bottom chain would not be allowed, since the connection between s3 and s5 makes it non-consecutive.

Figure

Figure 5: Social network for “The Silence of the Lambs”; edge weights correspond to absolute number of interactions between nodes.

Figure

Figure 4: Fractions of examples in each dataset exhibiting different error types (note a single example may have multiple errors). The graphs show a significant mismatch between the error distributions of actual generation models and synthetic data corruptions.

Figure

Figure 2: Set of transformations/data corruption techniques from Kryscinski et al. (2020) used to generate training data for the entity-centric approach.

Figure

Figure 5: The dependency arc entailment (DAE) model from (Goyal and Durrett, 2020a). A pre-trained encoder is used to obtain arc representations; these are used to predict arc-level factuality decisions.

Figure

Figure 6: Performance of different train checkpoints on a held-out dataset and on the human annotated dev set for models trained on the synthetic data in the CNN/DM domain.

Figure

Figure 3: Taxonomy of error types considered in our manual annotation. On the right are example summaries with highlighted spans corresponding to the error types; the first summary is an actual BART generated summary while others are manually constructed representative examples.

Figure

Figure 1: Examples from the synthetic and human annotated factuality datasets. The entity-centric and generation-centric approaches produce bad summaries from processes which can label their errors. All models can be adapted to give word-level, dependency-level, or sentence-level highlights, except for Gen-C.

Figure

Figure 7: Screenshot of the Mechanical Turk experiment. Given an input article, the annotators were tasked with evaluating the factuality of 3 model-generated summaries on a binary scale.

Figure

Figure 1: N-gram overlap of the generated summaries with the source article at different time steps. For CNNDM and MEDIASUM, the summaries fail to achieve the target degree of abstractiveness (denoted by the dotted lines).

Figure

Figure 8: Example showing loss modification to improve abstractiveness. The table shows which tokens are retained (green checkmark) or dropped (red cross) from the loss computation at different training stages. During later stages of the training, when loss truncation is applied, copied tokens are excluded from the loss.

Figure

Figure 7: N-gram overlap of the generated summaries in CNNDM and MEDIASUM. Initializing from BART-XSUM offers no benefits over the baseline. On the other hand, loss truncation is successful at enforcing abstractiveness; generated summaries for both datasets are closer to the target abstractiveness of reference summaries.

Figure

Figure 6: Modified training under loss truncation. After K steps of standard training, loss is computed on a subset of the tokens. To encourage factuality, high-loss tokens (↑) are excluded from the final loss computation whereas tokens with low loss (↓) are excluded to encourage abstractiveness.
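
A sketch of what the factuality variant of this truncation might look like in PyTorch is shown below; drop_frac, the function name, and the schedule for switching it on are illustrative assumptions. The abstractiveness variant would instead drop the lowest-loss (copied) tokens.

```python
import torch
import torch.nn.functional as F

def truncated_token_loss(logits, labels, drop_frac=0.1, ignore_index=-100):
    """Per-token cross-entropy with the highest-loss fraction of target
    tokens excluded from the final loss (factuality-oriented variant)."""
    per_token = F.cross_entropy(
        logits.view(-1, logits.size(-1)), labels.view(-1),
        ignore_index=ignore_index, reduction="none")
    per_token = per_token[labels.view(-1) != ignore_index]
    keep = max(1, int(per_token.numel() * (1 - drop_frac)))
    kept, _ = torch.topk(per_token, keep, largest=False)  # keep lowest losses
    return kept.mean()
```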

Figure

Figure 10: Example illustrating the +factuality loss modification. The table shows which tokens are retained or dropped from the loss computation at each training stage. We can see that high loss generally corresponds to hallucinated content.

Figure

Figure 9: Factuality of output summaries for the baseline and loss truncation variants. The plot shows that compared to the standard BART baseline, token-level loss truncation improves factuality, with comparable results on abstractiveness and ROUGE.

Figure

Figure 4: Factuality Sentence Error Rate of the generated summaries at different time steps during training. The graph shows that factual error rate is roughly proportional to abstractiveness (compare plot trends with Figure 1) for CNNDM and MEDIASUM.

Figure

Figure 3: Comparison of summary-level output probabilities between high-overlap and low-overlap subsets for the BART models. For both CNNDM and MEDIASUM, high-overlap summaries are predicted with substantially higher confidence compared to low-overlap examples.

Figure

Figure 2: ROUGE scores of the generated summaries of all datasets at different training stages.

Figure

Figure 3: Pairwise significance test outcomes for BLEU, best-performing ROUGE (rows 2-9), and ROUGE as applied in Hong et al. (2014) (bottom 3 rows), with (ST1) and without (ST0) stemming, and with (RS1) and without (RS0) removal of stop words, for average (A) or median (M) ROUGE precision (P), recall (R), or f-score (F). Colored cells denote a significant win for the row i metric over the column j metric with the Williams test.

Figure

Figure 2: Combining linguistic quality and coverage scores provided by human assessors in DUC2004

Figure

Figure 1: Scatter-plot of mean linguistic quality and coverage scores for human assessments of summaries in DUC-2004

Figure

Figure 4: Summarization system pairwise significance test outcomes (paired t-test) for state-of-the-art (top 7 rows) and baseline systems (bottom 5 rows) of Hong et al. (2014), evaluated with the best-performing ROUGE variant: average ROUGE-2 precision with stemming and stop words removed. Colored cells denote a significantly greater mean score for the row i system over the column j system according to the paired t-test.

Figure

Figure 1: Our proposed modification of a multi-layer transformer architecture. The input sequence is composed of K blocks of tokens. Each transformer layer is applied within the blocks, and a bidirectional GRU network propagates information in the whole document by updating the [CLS] representation of each block.

Figure

Figure 4: Document lengths after tokenization with pretrained BERT-base tokenizer and position of the [CLS] tokens of Oracle sentences in the input documents.

Figure

Figure 3: Proportion of the extracted sentences according to their position in the input document from PubMed test dataset.

Figure

Figure 2: Average R-1 scores of extracted summaries according to the number of words in the input documents from arXiv test dataset.

Figure

Figure 1: Training curves for BanditSum based models. Average ROUGE is the average of ROUGE-1, -2 and -L F1.

Figure

Figure 1: NEWSROOM summaries showing different extraction strategies, from time.com, mashable.com, and foxsports.com. Multi-word phrases shared between article and summary are underlined. Novel words used only in the summary are italicized.

Figure

Figure 2: Example summaries for existing datasets.

Figure

Figure 4: Density and coverage distributions across the different domains and existing datasets. NEWSROOM contains diverse summaries that exhibit a variety of summarization strategies. Each box is a normalized bivariate density plot of extractive fragment coverage (x-axis) and density (y-axis), the two measures of extraction described in Section 4.1. The top left corner of each plot shows the number of training set articles n and the median compression ratio c of the articles. For DUC and New York Times, which have no standard data splits, n is the total number of articles. Above, top left to bottom right: Plots for each publication in the NEWSROOM dataset. We omit TMZ, Economist, and ABC for presentation. Below, left to right: Plots for each summarization dataset showing increasing diversity of summaries along both dimensions of extraction in NEWSROOM.

Figure

Figure 3: Procedure to compute the set F(A,S) of extractive phrases in summary S extracted from article A. For each sequential token of the summary, si, the procedure iterates through tokens of the text, aj . If tokens si and aj match, the longest shared token sequence after si and aj is marked as the extraction starting at si.
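
Below is a re-implementation sketch of this greedy procedure, together with the coverage and density measures of Figure 4 as I understand them (normalized sums of fragment lengths and squared lengths); tokenized inputs are assumed and the function names are my own.

```python
def extractive_fragments(article, summary):
    """For each summary position, greedily find the longest token sequence
    shared with the article, mark it as a fragment, and continue after it."""
    fragments, i = [], 0
    while i < len(summary):
        best = []
        j = 0
        while j < len(article):
            if summary[i] == article[j]:
                k = 0
                while (i + k < len(summary) and j + k < len(article)
                       and summary[i + k] == article[j + k]):
                    k += 1
                if k > len(best):
                    best = summary[i:i + k]
                j += k
            else:
                j += 1
        if best:
            fragments.append(best)
            i += len(best)
        else:
            i += 1
    return fragments

def coverage(article, summary):
    return sum(len(f) for f in extractive_fragments(article, summary)) / len(summary)

def density(article, summary):
    return sum(len(f) ** 2 for f in extractive_fragments(article, summary)) / len(summary)
```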

Figure

Figure 9: We designed an interactive web interface for the human evaluation experiments introduced in Section 5.4.

Figure

Figure 6: The ROUGE scores for different stopping thresholds pthres on the arXiv validating set.

Figure

Figure 3: The position distribution of extracted sentences in the PubMedtrunc dataset.

Figure

Figure 8: The ROUGE scores for different stopping thresholds pthres on the GovReport validating set.

Figure

Figure 7: The ROUGE scores for different stopping thresholds pthres on the PubMedtrunc validating set.

Figure

Figure 2: The architecture of our MemSum extractive summarizer with a multi-step episodic MDP policy. With the updating of the extraction-history embeddings h at each time step t, the scores u of remaining sentences and the stopping probability pstop are updated as well.
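
In outline, the extract-or-stop loop can be sketched as follows; policy stands in for the networks producing the sentence scores u and the stopping probability p_stop, and p_thres and max_sents are illustrative values, not the authors' settings.

```python
def extract_summary(sentences, policy, p_thres=0.6, max_sents=7):
    """Iteratively score remaining sentences given the extraction history,
    pick the highest-scoring one, and stop once p_stop exceeds a threshold."""
    selected, remaining = [], list(range(len(sentences)))
    while remaining and len(selected) < max_sents:
        scores, p_stop = policy(sentences, selected, remaining)
        if p_stop > p_thres:
            break
        best = max(remaining, key=lambda i: scores[i])
        selected.append(best)
        remaining.remove(best)
    return [sentences[i] for i in sorted(selected)]
```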

Figure

Figure 5: The ROUGE scores for different stopping thresholds pthres on the PubMed validating set.

Figure

Figure 4: The sentence scores of 50 sentences computed by MemSum at extraction steps 0 to 3. In the document, there is artificial redundancy in that the (2n)th and the (2n+1)th sentences are identical (n = 0, 1, ..., 24).

Figure

Figure 1: We modeled extractive summarization as a multi-step iterative process of scoring and selecting sentences. si represents the ith sentence in the document D.

Figure

Figure 2: A case study on the LCSTS dataset. ST is source text; Ref is reference summary; +Kw is keywords augmented; +KwKG is keywords and knowledge augmented.

Figure

Figure 3: An example of generated summaries on the LCSTS dataset. ST is source text; Ref is reference summary; +Kw is keywords topic augmented; +KwKG is keywords topic and knowledge augmented.

Figure

Figure 1: The Model Structure of KAS. The λi are soft gates for distributing copy probabilities.

Figure

Figure 2: With global variance loss, our model (green bar) can avoid repetitions and achieves a percentage of duplicates comparable to that of reference summaries.

Figure

Figure 1: The process of attention optimization (better viewed in color). The original attention distribution (red bar on the left) is updated by the refinement gate rt, and attention on some irrelevant parts is lowered. Then the updated attention distribution (blue bar in the middle) is further supervised by a local variance loss to obtain the final distribution (green bar on the right).

Figure

Figure 4: Reward calculation with Question-Answer pairs

Figure

Figure 2: The training process for the summarization framework with QA rewards

Figure

Figure 1: A document, its corresponding ground truth summary and model generated summaries.

Figure

Figure 5: The interface used for human evaluation of the summaries.

Figure

Figure 3: Model improvements after QA-based rewards on the SAMSUM data

Figure

Figure 1: Overview of our multi-task model with parallel training of three tasks: abstractive summary generation (SG), question generation (QG), and entailment generation (EG). We share the ‘blue’ color representations across all the three tasks, i.e., second layer of encoder, attention parameters, and first layer of decoder.

Figure

Figure 2: Attention probability for decoding on a DUC 2001 example, showing that the summary is more inclined toward an extractive nature. The attention corresponding to the word ‘pietersen’ present in the generated summary is shown.

Figure

Figure 3: Attention probability for decoding on a SUMPUBMED example, where the attention corresponding to the word ‘present’ in the generated summary is shown.

Figure

Figure 1: SUMPUBMED creation pipeline.

Figure

Figure 1: An example concept map browser. The system indicates that (t1)=“Slobodan Milosevic” is related to (t2)=“Kosovo Province.” The user clicks to investigate the relationship, and the system must generate a summary explaining how Milosevic is related to Kosovo.

Figure

Figure 3: Highlighted article, reference summary, and summaries generated by TCONVS2S and PTGEN. Words in red in the system summaries are highlighted in the article but do not appear in the reference.

Figure

Figure 2: The UI for content evaluation with highlights. Judges are given an article with important words highlighted using a heat map. Judges can also remove less important highlight color by sliding the scroller at the left of the page. At the right of the page, judges give the recall and precision assessments by sliding a scroller from 1 to 100 based on the quality of the given summary.

Figure

Figure 1: Highlight-based evaluation of a summary. Annotators evaluate a summary (bottom) against the highlighted source document (top), presented with a heat map marking the salient content in the document; the darker the colour, the more annotators deemed the highlighted text salient.

Figure

Figure 2: Two-stage model diagram. The aspect classifier assigns aspect labels to each reference sentence Rij from references R with a threshold λ. Sentences are then grouped according to the assigned labels and fed to the summarization model. Groups about irrelevant aspects (i.e., a2) are ignored. Finally, the summarization model outputs summaries for each relevant aspect.

Figure

Figure 3: Precision differences in varying threshold ranges.

Figure

Figure 1: In WikiAsp, given reference documents cited by a target article, a summarization model must produce targeted aspect-based summaries that correspond to sections.

Figure

Figure 1: Example of a search tree

Figure

Figure 1: Overview of our pyramid construction.

Figure

Figure 2: Examples of head-modifier-relation triples.

Figure

Figure 3: Examples of SCUs obtained from pyramids.

Figure

Figure 6: Results for the output factor questions. Specific output factor in italics. Answer type in brackets: MC = Multiple Choice, MR = Multiple Response, LS = Likert Scale. ** indicates significance (𝜒2 or Fisher’s exact test), after Bonferroni correction, with 𝑝 ≪ 0.001, * with 𝑝 < 0.05.

Figure

Figure 5: Results for the purpose factor questions. Specific purpose factor in italics. Answer type in brackets: MC = Multiple Choice, MR = Multiple Response, LS = Likert Scale. ** indicates significance (𝜒2), after Bonferroni correction, with 𝑝 ≪ 0.001, * with 𝑝 < 0.05. † indicates noteworthy results where significance was lost after correction for the number of tests. If two options are flagged, these options are not significantly different from each other, yet both were chosen significantly more often than the other options.

Figure

Figure 1: Summarization methods that are currently the standard vs. example of summarizing while taking users’ wishes and desires into account.

Figure

Figure 3: Overview of survey procedure.

Figure

Figure 2: Participant details.

Figure

Figure 7: Results for the future feature questions. Answer type in brackets. MC = Multiple Choice, MR = Multiple Response. ** indicates significance (𝜒2 or Fisher’s exact test), after Bonferroni correction, with 𝑝 ≪ 0.001.

Figure

Figure 4: Results for the input factor questions. Specific input factor in italics. Answer type in brackets: MC =Multiple Choice, MR = Multiple Response. ** indicates significance (𝜒2), after Bonferroni correction, with 𝑝 ≪ 0.001. If two options are flagged with **, these options are not significantly different from each other, yet both have been chosen significantly more often than the other options.

Figure

Figure 1: Distribution of sentence position predictions.

Figure

Figure 6: Typical Comparison. Our model attended to the most important information (blue bold font), matching well with the reference summary, while other state-of-the-art methods generate repeated or less important information (red italic font).

Figure

Figure 1: Comparison of extractive, abstractive, and our unified summaries on a news article. The extractive model picks the most important but incoherent or not concise sentences (see blue bold font). The abstractive summary is readable and concise but still loses or mistakes some facts (see red italic font). The final summary, rewritten from fragments (see underlined font), combines the advantages of both the extractive (importance) and abstractive (coherence; see green bold font) approaches.

Figure

Figure 5: Visualizing the consistency between sentence and word attentions on the original article. We highlight word (bold font) and sentence (underline font) attentions. We compare our methods trained with and without inconsistency loss. Inconsistent fragments (see red bold font) occur when trained without the inconsistency loss.

Figure

Figure 2: Our unified model combines the word-level and sentence-level attentions. Inconsistency occurs when word attention is high but sentence attention is low (see red arrow).

Figure

Figure 4: Decoding mechanism in the abstracter. In the decoder step t, our updated word attention α̂t is used to generate context vector h∗(α̂t). Hence, it updates the final word distribution Pfinal.

Figure

Figure 9: Self-reported usefulness.

Figure

Figure 2: Screenshot of the experiment interface for human evaluation. Participants are asked to predict which restaurant will be rated higher after 50 reviews based on the summaries of the first 10 reviews where these two restaurants have the same average rating in the first 10 reviews.

Figure

Figure 6: Fig. 6a shows that DecSum is the only method that enables humans to statistically outperform random chance. Fig. 6b further shows that DecSum leads to more individuals with high performance.

Figure

Figure 5: Summary quality evaluation using SUM-QE (Xenouleas et al., 2019). DecSum achieves strong textual non-redundancy, but leads to lower grammaticality and coherence.

Figure

Figure 10: SUM-QE evaluation on referential clarity and focus.

Figure

Figure 7: Model prediction distributions of each rating group from logistic regression (LR), deep averaging networks (DAN), and Longformer. Only the Longformer model can properly distinguish sentences located in different score ranges. LR and DAN are not robust to the input length shift, where models are trained with input of the full 10 reviews but are tested with individual sentences.

Figure

Figure 4: Sentence-level sentiment distribution of summaries. DecSum can select a wider range of sentences w.r.t. sentiment diversity.

Figure

Figure 8: The effect of summary length.

Figure

Figure 3: Wasserstein distance between model predictions of summary sentences and all sentences of the first ten reviews. Lower values indicate better representativeness. Error bars represent standard errors. DecSum (1, 1, 1) is significantly better than other approaches, including DecSum (1, 0, 1), with p-value ≤ 0.0001 with paired t-tests.
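
For reference, this distance can be computed with SciPy as sketched below; predict is an assumed stand-in for the decision model's per-sentence prediction, and the names are my own.

```python
from scipy.stats import wasserstein_distance

def representativeness(summary_sentences, all_sentences, predict):
    """Distance between the prediction distributions of the summary sentences
    and of all review sentences; lower means more representative."""
    return wasserstein_distance([predict(s) for s in summary_sentences],
                                [predict(s) for s in all_sentences])
```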

Figure

Figure 1: Illustration of the selected sentences by different methods on the distribution of model predictions on all individual sentences. Our method (DecSum) covers the full distribution, while PreSumm, a text-only summarization method, concentrates on the right side, and integrated gradients, a model-explanation method, misses the middle part.

Figure

Figure 3: The results of human evaluation, where forward slashes and backslashes represent BASE+GRAPH+CL versus the reference and versus BASE, respectively. Yellow, green, and blue indicate that our model loses, ties with, or wins against the competitor, respectively.

Figure

Figure 1: An example of the findings and corresponding impression, where the relation information, as well as positive and negative examples, are also shown in the figure. Note that △ represents the removed word.

Figure

Figure 4: R-1 score of generated impressions from BASE and our model on the MIMIC-CXR test set, where OURS represents BASE+GRAPH+CL.

Figure

Figure 2: The overall architecture of our proposed method with graph and contrastive learning. An example input and output at steps t−1 and t are shown in the figure, where the top is the backbone sequence-to-sequence paradigm with a graph to store relation information between critical words, and the bottom is the contrastive learning module with specific positive and negative examples. m refers to a mask vector.

Figure

Figure 5: Examples of the generated impressions from BASE and BASE+GRAPH+CL as well as reference impressions. The yellow nodes in the graph indicate that these words are contained in entities.

Figure

Figure 5: Sample summaries based on OUT-OF-DOMAIN and MIX-DOMAIN training on opinion articles.

Figure

Figure 3: Named entity distribution (left) and subjective word distribution (right) in abstracts. More PERSON entities, fewer ORGANIZATION entities, and fewer subjective words are observed in OPINION.

Figure

Figure 1: A snippet of sample news story and opinion article from The New York Times Annotated Corpus (Sandhaus, 2008).

Figure

Figure 4: BLEU (left) and ROUGE-L (right) performance for the In-domain and Mix-domain setups over different amounts of training data. As the training data increases, In-domain outperforms Mix-domain training.

Figure

Figure 2: [Left] Part-of-speech (POS) distribution for words in abstracts. [Right] Percentage of words in abstracts that are reused from the input, per POS and over all words. OPINION abstracts generally reuse fewer words.

Figure

Figure 7: Factual errors made by the BertSumExtAbs model.

Figure

Figure 4: Score Card

Figure

Figure 16: Reference contains rhetorical sentences that interest readers.

Figure

Figure 15: Noisy data in the reference.

Figure

Figure 12: The Bottom-Up model performs poorly while the other models work well.

Figure

Figure 2: Distribution of source sentences used for content generation. X-axis: sentence position in the source article. Y-axis: the negative log of sentence coverage.

Figure

Figure 4: Evaluation with QA model prediction probability and accuracy on our multiple choice cloze test, with higher numbers indicating better summaries.

Figure

Figure 10: The Pointer-Generator-with-Coverage model tends to make Addition errors in cases where the Pointer-Generator does not produce repetitions.

Figure

Figure 14: Reference contains grammatical errors.

Figure

Figure 2: Our ASGARD framework with document-level graph encoding. The summary is generated by attending to both the graph and the input document.

Figure

Figure 13: Example of positive-negative errors.

Figure

Figure 5: Error Log

Figure

Figure 1: PolyTope assigns a verdict to each error along three coordinates according to its syntactic and semantic roles.

Figure

Figure 3: Sample construction of multiple choice cloze questions and candidate answers from reference summary and salient context. Arguments and predicates in candidate answers are color-coded and italicized.

Figure

Figure 3: A case study that compares various evaluation methods with each other.

Figure

Figure 6: Distribution of automatic summarization metrics with three types of unfaithful errors. “True” indicates summaries with such type of error.

Figure

Figure 9: Factual errors made by the Bottom-Up model.

Figure

Figure 6: Scores per Segment

Figure

Figure 8: Factual errors made by the Pointer-Generator-with-Coverage model.

Figure

Figure 11: The Pointer-Generator-with-Coverage model tends to incorrectly combine information from the document, thus leading to Inacc Intrinsic errors.

Figure

Figure 1: Sample knowledge graph constructed from an article snippet. The graph localizes relevant information for entities (color coded, e.g. “John M. Fabrizi”) or events (underlined) and provides global context.

Figure

Figure 5: Sample summaries for an NYT article. Summaries by our models with the graph encoder are more informative than the variant without it.

Figure

Figure 3: Sample summaries for a government report. The model with truncated input generates unfaithful content. HEPOS attention with a Sinkhorn encoder covers more salient information.

Figure

Figure 5: Aspect-level informativeness and percentages of sentences containing unfaithful errors as labeled by both human judges on PubMed. Models with efficient attentions reduce errors for later sections in the sources, e.g., “Results” and “Conclusion”.

Figure

Figure 6: Aspect-level informativeness and percentages of sentences with unfaithful errors on GovReport.

Figure

Figure 2: System overview.

Figure

Figure 4: Summarizing articles truncated at different lengths by the best models: LSH (7168)+HEPOS on PubMed and SINKHORN (10240)+HEPOS on GovReport. Reading more consistently improves ROUGE-2.

Figure

Figure 2: Percentage of unique salient bigrams accumulated from the start to X% of the source. Key information is spread over the documents in GOVREPORT, highlighting the importance of understanding longer text.

Figure

Figure 1: A toy example of our HEPOS attention, with a stride of 2 and four attention heads. Dark colors indicate that heads 1 and 3 attend to the first and third tokens (“Job” and “home”) in the input, while heads 2 and 4 look at the second and fourth words (“in” and “care”).
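
The strided, head-wise pattern described in this caption can be sketched as a per-head attention mask. The snippet below is an illustrative reconstruction under the assumption that a head may attend to source positions whose index matches the head index modulo the stride; it is not the authors' implementation.

```python
# A toy sketch of the head-wise positional stride pattern in the figure:
# with a stride of 2, heads 1 and 3 attend to source positions 1, 3, ...
# while heads 2 and 4 attend to positions 2, 4, ...
import torch

def hepos_mask(num_heads: int, src_len: int, stride: int) -> torch.Tensor:
    """Return a (num_heads, src_len) boolean mask: True = head may attend."""
    positions = torch.arange(src_len)             # 0-based source positions
    heads = torch.arange(num_heads).unsqueeze(1)  # (num_heads, 1)
    return (positions % stride) == (heads % stride)

mask = hepos_mask(num_heads=4, src_len=4, stride=2)
print(mask.int())
# tensor([[1, 0, 1, 0],   head 1 -> "Job", "home"
#         [0, 1, 0, 1],   head 2 -> "in", "care"
#         [1, 0, 1, 0],   head 3
#         [0, 1, 0, 1]])  head 4
```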

Figure

Figure 7: Sample summaries for a government report. The model with truncated input generates unfaithful content. Our HEPOS encoder-decoder attention with Sinkhorn encoder attention covers more salient information in the “What GAO found” aspect.

Figure

Figure 1: Relations among text spans of different granularity.

Figure

Figure 3: Graph attention mechanism.

Figure

Fig. 1. Overview of a query-biased summarizer with a copying mechanism.

Figure

Figure 6: ROUGE-1 scores of COOP with approximate search in different configurations.

Figure

Figure 1: Illustration of the latent space Z and text space X . The de facto standard approach in unsupervised opinion summarization uses the simple average of input review vectors zreview (◦) to obtain the summary vector zavg (▴). The simply averaged vector zavg tends to be close to the center (i.e., has a small L2-norm) in the latent space, and a generated summary xavg (⬩) tends to become overly generic. Our proposed framework COOP finds a better aggregated vector to generate a more specific summary xCOOP(▪) from the latent vector zCOOP (⋆).

Figure

Figure 8: Illustrations of the relationships between the L2-norm of latent vectors ∥z∥ and the input review quality: (a) text length and (b) the information content.

Figure

Figure 2: Average L2-norm of simply averaged summary vectors for different numbers of input reviews.

Figure

Figure 7: Example of summaries generated by BIMEANVAE with SimpleAvg and COOP for reviews about a product on the Yelp dataset. The colors denote the corresponding opinions, and struck-through reviews in gray were not selected by COOP for summary generation (Note that SimpleAvg uses all the input reviews.) Terms that are more specific to the entity are underlined.

Figure

Figure 5: L2-norm distributions of latent vectors of input reviews and aggregated vectors.

Figure

Figure 4: COOP searches convex combinations of the latent vectors of input reviews based on the input-output word overlap between a generated summary and input reviews. × denotes the simply averaged vector.
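
A rough sketch of this search, under the assumption that candidate convex combinations are uniform averages over subsets of the input reviews; `encode`, `decode`, and `word_overlap` are hypothetical stand-ins for the VAE encoder, the decoder, and the input-output word-overlap objective.

```python
# Illustrative subset search over convex combinations of review latent vectors.
from itertools import combinations
import numpy as np

def coop_search(reviews, encode, decode, word_overlap):
    zs = [encode(r) for r in reviews]
    best_score, best_summary = float("-inf"), None
    for k in range(1, len(reviews) + 1):
        for subset in combinations(range(len(reviews)), k):
            z = np.mean([zs[i] for i in subset], axis=0)  # convex combination
            summary = decode(z)
            score = word_overlap(summary, [reviews[i] for i in subset])
            if score > best_score:
                best_score, best_summary = score, summary
    return best_summary
```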

Figure

Figure 10: The inference runtime of BIMEANVAE and Optimus with different batch sizes.

Figure

Figure 9: Approximate search performance in ROUGE-2/L scores with different batch sizes.

Figure

Figure 3: Correlation analysis of the L2-norm of latent vectors ∥z∥ and the generated text quality: (a) text length and (b) information amount.

Figure

Figure 11: Example of summaries generated by BIMEANVAE with SimpleAvg and COOP for reviews about a product on the Amazon dataset. The colors denote the corresponding opinions, and struck-through reviews in gray were not selected by COOP for summary generation (Note that SimpleAvg uses all the input reviews.) Terms that are more specific to the entity are underlined. Red and struck-through text denotes hallucinated content that has the opposite meaning compared to the input.

Figure

Figure 4: BERTScore and Intra-BERTScore for generated contrastive summaries with different hyperparameters δ. The goal is to generate high quality and distinctive summaries (upper right).

Figure

Figure 3: Encoder of the base common summarization model has type embeddings to distinguish the original entity.

Figure

Figure 1: Overview of the comparative opinion summarization task. The model takes two sets of reviews about different entities and generates two contrastive opinion summaries, which contain distinctive opinions, and one common opinion summary, which describes opinions shared between the two entities.

Figure

Figure 5: Sentence Annotation Task. By showing sentences of the same aspect category, it is easier for annotators to compare two groups of sentences (from two entities). To further facilitate the annotation process, we also provide several additional features, such as allowing workers to group sentences that contain the same token by double-clicking, and to highlight sentences by hovering over the sentence label.

Figure

Figure 2: Illustration of Co-decoding: (a) For contrastive summary generation, distinctive words are emphasized by contrasting the token probability distribution of target entity against that of the counterpart entity. (b) For common summary generation, entity-pair-specific words are highlighted by aggregating token probability distributions of all base models to alleviate the overly generic summary generation issue.

Figure

Figure 6: Summary Collection Task. We show workers two groups of sentences based on the labels we collected from the sentence annotation task. Similar features, such as allowing workers to group sentences that contain the same token by double-clicking, are also supported in this task.

Figure

Figure 1: Proposed multi-task learning framework for sentence extraction with document classification

Figure

Figure 2: Sentence extraction model using LSTM-RNN with multi-task learning

Figure

Figure 5: Visualization of DiscourseRank. The darker the highlighting, the higher the rank score. The references and generated summaries are also shown.

Figure

Figure 2: Outline of StrSum.

Figure

Figure 4: Examples of generated summaries and induced latent discourse trees.

Figure

Figure 7: Examples of generated summaries and induced latent discourse trees for long reviews. (a) shows a movie review. The 4th sentence mentions the overall positiveness. The 10th describes that the contents are easy to follow, while the 20th to 22nd show the details of the contents. The 27th mentions the performance and accurate portrayal, and the 8th and 16th elaborate on the latter and the former, respectively. (b) presents a pocket knife review. The 11th, 13th, 15th, and 21st sentences concisely describe the goodness in each aspect. The 14th, 24th, and 28th elaborate on their parents.

Figure

Figure 3: ROUGE-L F1 score on evaluation set with various numbers of sentences.

Figure

Figure 6: Examples of generated summaries and induced latent discourse trees for negative reviews. (a) shows a board game review. The induced tree shows that the 1st and 6th sentences present additional information about the generated summary. While the 1st to 4th indicate the heaviness of the game, the 5th and 6th criticize the artwork. The 2nd, 3rd, and 4th present additional information about their parent. (b) presents a movie review. The 1st and 2nd sentences describe the overall evaluation, while the 6th and 7th strengthen the opinion. The 3rd to 5th mention the boring points in detail. Although our model captures the negativeness, the summary is redundant, probably because each sentence in the body is relatively long.

Figure

Figure 1: Example of the discourse tree of a jigsaw puzzle review. StrSum induces the latent tree and generates the summary from the children of a root, while DiscourseRank supports it to focus on the main review point.

Figure

Figure 5: 2-D latent space projected by principal component analysis. Each point corresponds to the mean of the latent distribution of a topic sentence, and each circle denotes the same Mahalanobis distance from the mean.

Figure

Figure 3: Analogy with Gaussian word embedding.

Figure

Figure 6: Example of a path distribution (blue) and level distribution (red). Both the sum of a path distribution over each level and the sum of a level distribution over each path are equal to 1.

Figure

Figure 1: Outline of our approach. (1) The latent distribution of review sentences is represented as a recursive GMM and trained in an autoencoding manner. Then, (2) the topic sentences are inferred by decoding each Gaussian component. An example of a restaurant review and its corresponding gold summary are displayed.

Figure

Figure 2: Outline of our model. We set a recursive Gaussian mixture as the latent prior of review sentences and obtain the latent posteriors of topic sentences by decomposing the posteriors of review sentences.

Figure

Figure 4: Generated topic sentences of (a) an Amazon review of heeled shoes, (b) a Yelp review of a coffee shop, and (c) an Amazon review of a table and chair set. Topic sentences selected as a summary are highlighted in italics.

Figure

Figure 2: Illustration of word and sentence level attention in the second decoder step (Eq. 1 and Eq. 2). Purple: attention on words, Orange: attention on sentences, Unidirectional dotted arrows: attention from previous step, Bidirectional arrows: attention from previous and to present step. Best viewed in color.

Figure

Figure 1: SWAP-NET architecture. EW: word encoder, ES: sentence encoder, DW: word decoder, DS: sentence decoder, Q: switch. Input document has words [w1, . . . , w5] and sentences [s1, s2]. Target sequence shown: v1 = w2, v2 = s1, v3 = w5. Best viewed in color.

Figure

Figure 2: T-SNE Visualization on CNN/DM Test Set

Figure

Figure 2: T-SNE Visualization on CNN/DM Test Set

Figure

Figure 1: Overview of Hierarchical Attentive Heterogeneous Graph

Figure

Figure 1: Overview of Architecture

Figure

Figure 1: ROUGE score for documents with different length. The result is calculated on the test set of CNN/DM and the trained model is based on BERT.

Figure

Figure 3: Comparison Between the ROUGE Scores Tendencies of BERTSUMEXT and DifferSum

Figure

Figure 2: T-SNE Visualization on CNN/DM.

Figure

Figure 1: Overview of ThresSum.

Figure

Figure 2: Overview of DifferSum.

Figure

Figure 4: Solid lines represent the forward pass, and dashed lines represent the gradient flow in backpropagation. For the two ablation tests, we stop the gradient at ① and ②, respectively.

Figure

Figure 2: Our 2-decoder summarization model with a pointer decoder and a closed-book decoder, both sharing a single encoder (this is during training; next, at inference time, we only employ the memory-enhanced encoder and the pointer decoder).

Figure

Figure 3: The summary generated by our 2-decoder model covers salient information (highlighted in red) mentioned in the reference summary, which is not presented in the baseline summary.

Figure

Figure 1: Baseline model repeats itself twice (italic), and fails to find all salient information (highlighted in red in the original text) from the source text that is covered by our 2-decoder model. The summary generated by our 2-decoder model also recovers most of the information mentioned in the reference summary (highlighted in blue in the reference summary).

Figure

Figure 2: The Filler and Role Binding operation of the TP-TRANSFORMER model architecture.

Figure

Figure 1: An example document and its one line summary from XSum dataset. Document content that is composed into an abstractive summary is color-coded.

Figure

Figure 3: TP-TRANSFORMER model architecture.

Figure

Figure 3: Comparison of section-selection strategies of SEHY paired with BigBird-large on the dataset DMath.

Figure

Figure 1: The distribution of summary sentences per section type, cited from (Gidiotis and Tsoumakas, 2020).

Figure

Figure 2: Comparison of section-selection strategies of SEHY paired with BigBird-large on the dataset DCS .

Figure

Figure 4: Comparison of section-selection strategies of SEHY paired with BigBird-large on the dataset DPhy .

Figure

Figure 3: The graph layer consists of a graph attention mechanism and a feed-forward network. Through the graph attention, each node merges the neighbor relations. The neighbor relations are represented as triples, and the incoming relations and the outgoing relations are obtained through different mappings, which are marked in red and green color respectively in the Figure.

Figure

Figure 1: An example sentence annotated with a semantic dependency graph. The green color represents the dependency of the root node “hoping”. Some dependency edges are omitted for display.

Figure

Figure 2: The overview of our SemSUM model

Figure

Figure 4: Human evaluation. They are rated on a Likert scale of 1 (worst) to 5 (best).

Figure

Figure 3: Rouge-1 score vs. the number of selected sentences.

Figure

Figure 1: An example document: There are two different relationships among sentences: the semantic similarity (yellow) and the natural connection (green). Sentences 2, 3, 21 are the oracle sentences.

Figure

Figure 2: Overview of the proposed Multi-GraS, the word block and Multi-GCN.

Figure

Figure 1: Social bias in automatic summarization: We take steps toward evaluating the impact of the gender, age, and race of the humans involved in the summarization system evaluation loop: the authors of the summaries and the human judges or raters. We observe significant group disparities, with lower performance when systems are evaluated on summaries produced by minority groups. See §3 and Table 1 for more details on the Rouge-L scores in the bar chart.

Figure

Figure 3: An example taken from the COVID-19 dataset. Text in the same color indicates that the described contents are the same.

Figure

Figure 2: The sentence position distribution of the extracted summaries and the oracle summaries.

Figure

Figure 1: Our proposed multi-view information bottleneck framework. I(s;Y ) denotes the mutual information between sentence s and correlated signal Y , and NSP is short for Next Sentence Prediction task.

Figure

Figure 3: Intersection of averaged summary sentence overlaps across the sub-aspects. We use First for Position, ConvexFall for Diversity, and N-Nearest for Importance. The number in parentheses, called Oracle Recall, is the averaged ratio of oracle sentences that are NOT chosen by the union of the three sub-aspect algorithms. Other corpora are in the Appendix with their Oracle Recalls: Newsroom (54.4%), PubMed (64.0%), and MScript (99.1%).
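
For concreteness, a small sketch of how the Oracle Recall figure in parentheses could be computed, following the definition in this caption; the variable names are illustrative.

```python
# Per document: the fraction of oracle sentences missed by the union of the
# three sub-aspect selections; Oracle Recall averages this over the corpus.
def oracle_miss_ratio(oracle_ids, position_ids, diversity_ids, importance_ids):
    union = set(position_ids) | set(diversity_ids) | set(importance_ids)
    missed = [i for i in oracle_ids if i not in union]
    return len(missed) / len(oracle_ids)

def corpus_oracle_recall(docs):
    """docs: iterable of (oracle_ids, position_ids, diversity_ids, importance_ids)."""
    ratios = [oracle_miss_ratio(*doc) for doc in docs]
    return sum(ratios) / len(ratios)
```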

Figure

Figure 2: Volume maximization functions. Black dots are sentences in source document, and red dots are chosen summary sentences. The red-shaded polygons are volume space of the summary sentences.

Figure

Figure 4: PCA projection of extractive summaries chosen by multiple aspects of algorithms (CNNDM). Source and target sentences are black circles and cyan triangles, respectively. The blue, green, and red circles are summary sentences chosen by First, ConvexFall, and NN, respectively. The yellow triangles are the oracle sentences. The shaded polygon represents the ConvexHull volume of a sample source document. Best viewed in color. Please find more examples in the Appendix.

Figure

Figure 5: Sentence overlap proportion of each sub-aspect (row) with the oracle summary across corpora (column). y-axis is the frequency of overlapped sentences with the oracle summary. X-axis is the normalized RANK of individual sentences in the input document where size of bin is 0.05. E.g., the first / the most diverse / the most important sentence is in the first bin. If earlier bars are frequent, the aspect is positively relevant to the corpus.

Figure

Figure 1: A simple change to an article choice (in bold) in the extractive summary can improve its readability and coherence.

Figure

Figure 3: Preference judgement scores of the three judges A1, A2 and A3 across various summarizers and datasets.

Figure

Figure 5: Performance of the learning models in terms of (average) overlap between the models’ decisions and those of the annotators on 100 randomly sampled summaries generated by BanditSum.

Figure

Figure 4: Performance of the learning models in terms of (average) overlap (in %) between the models’ decisions and those of the annotators on 100 randomly sampled summaries generated by the different summarizers. Abbreviations: ST: Stories, AR: Articles, SM: Summaries, sub: subset.

Figure

Figure 2: Description of baseline models. (a) Concat model. (b) Text model.

Figure

Figure 1: Example of a post and a reply with a quote and a reply with no quote. Implicit quote is the part of post that reply refers to, but not explicitly shown in the reply.

Figure

Figure 2: Description of our model, Implicit Quote Extractor (IQE). The Extractor extracts sentences and uses them as summaries. k and j are indices of the extracted sentences.

Figure

Figure 3: Correlation between ROUGE-1-F score and maximum PageRank of each post on ECS and EPS datasets. X-axis shows rounded maximum PageRank, and Y-axis shows ROUGE-1-F and the error bar represents the standard error.

Figure

Figure 1: Description of the Appropriateness Estimator.

Figure

Figure 1: Sentence extractor architectures: a) RNN, b) Seq2Seq, c) Cheng & Lapata, and d) SummaRunner. Attention is indicated in the diagram. Green blocks represent sentence encoder output and red blocks indicate learned "begin decoding" embeddings. Vertically stacked yellow and orange boxes indicate extractor encoder and decoder hidden states, respectively. Horizontal orange and yellow blocks indicate multi-layer perceptrons. The purple blocks represent the document and summary state in the SummaRunner extractor.

Figure

Figure 1: A strapline (“Don’t expect ...”) that is mistaken for a summary in the Newsroom corpus.

Figure

Figure 2: Relative locations of bigrams of the gold summary in the source text across different datasets.
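
One plausible way to compute the statistic plotted here, shown as a hedged sketch; the tokenization and first-match convention are assumptions.

```python
# For each bigram of the gold summary, find where it first occurs in the
# source and record that position normalized by source length.
def bigrams(tokens):
    return list(zip(tokens, tokens[1:]))

def relative_bigram_locations(source_tokens, summary_tokens):
    src_bigrams = bigrams(source_tokens)
    locations = []
    for bg in set(bigrams(summary_tokens)):
        if bg in src_bigrams:
            idx = src_bigrams.index(bg)               # first occurrence
            locations.append(idx / max(1, len(src_bigrams)))
    return locations  # histogram these values over a corpus to get the plot
```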

Figure

Figure 1: An example post of the TIFU subreddit.

Figure

Figure 6: Examples of abstractive summaries generated by our model and baselines. In each set, we also show the source text and reference summary.

Figure

Figure 4: Comparison between (a) the gated linear unit (Gehring et al., 2017) and (b) the proposed normalized gated tanh unit.

Figure

Figure 5: Examples of abstractive summaries generated by our model and baselines. In each set, we also show the source text and gold summary.

Figure

Figure 3: Illustration of the proposed multi-level memory network (MMN) model.

Figure

Figure 3: F-measure ROUGE-1 performance (%) vs. number of models for news-headline-generation task. X-axis is log scale (21–27).

Figure

Figure 1: Flow charts of current runtime-ensemble (a) and our proposed post-ensemble (b).

Figure

Figure 2: The left scatter plot shows a two-dimensional visualization of outputs generated from 10 models on the basis of multi-dimensional scaling (Cox and Cox, 2008), and the right list shows their contents. Each point in the plot represents the sentence embedding of the corresponding output, and the label indicates the model ID and ROUGE-1, i.e., “ID (ROUGE).” Color intensity indicates the kernel density estimation score of PostCosE (see the right color bar), and outputs are sorted by score. The reference and input are as follows; each bold word in the above list co-occurs with the reference below. Reference: interpol asks world govts to make rules for global policing Input: top interpol officers on wednesday asked its members to devise rules and procedures for policing at the global level and providing legal status to red corner notices against wanted fugitives .

Figure

Figure 4: Violation penalty for compression (left) and reconstruction (right). The x-axis is step and the y-axis is each ratio. The horizontal lines in the middle are ρ and τ , and the dashed lines represent ρ(t) and τ(t). The circles represent a step where the agent breaks the constraints.

Figure

Figure 1: Overview of previous (left) and proposed (right) approaches on CR learning paradigm.

Figure

Figure 2: Algorithmic visualization of iterative action prediction

Figure

Figure 3: Deterministic compression and reconstruction with masked language model

Figure

Figure 3: US H.R.1680 (115th)

Figure

Figure 2: Example US Bill

Figure

Figure 4: US H.R.6355 (115th)

Figure

Figure 5: California Bill Summary

Figure

Figure 1: Bill Lengths

Figure

Figure 1: Experimental set-up. Left: multi-task training, Right: training with structured input.

Figure

Figure 2: Example from the LipKey dataset, with gold-standard and generated summaries.

Figure

Figure 3: Example of articles and keyphrases in the LipKey dataset. We highlight words in the article that match its absent keyphrases with different colours. Yellow means partial match, green means acronym, and blue means morphology variants. The English translation is for illustration purposes.

Figure

Figure 1: A taxonomy of concepts.

Figure

Figure 1: CWE vs Voting for different simplicity and reading ease levels. ROUGE-2 precision and recall are shown for different levels of tuning achieved.

Figure

Figure 1: Variation in the attention coverage while summarizing an article for different topics

Figure

Figure 1: Procedure to create pretraining dataset using the nonsense corpus and our proposed pretraining tasks

Figure

Figure 1: Estimator model architecture used in COMES. Source, reference and hypothesis are all independently encoded with a pre-trained encoder. Pooling layer is used to create sentence embeddings from sequences of token embeddings. In the COMES variant, the last feed-forward layer has 4 outputs, corresponding to different summary evaluation dimensions. Dashed lines are used to indicate the reference-less variant. For the full COMET description see Rei et al. (2020).

Figure

Figure 2: ROUGE and novel n-grams results on the anonymized validation set for different runs of each model type. Lines indicate the Pareto frontier for each model type.

Figure

Figure 1: The network architecture with the decoder factorized into separate contextual and language models. The reference vector, composed of the context vectors c^tmp_t and c^int_t and the hidden state of the contextual model h^dec_t, is fused with the hidden state of the language model and then used to compute the distribution over the output vocabulary.

Figure

Figure 2: Pairwise similarities between model outputs computed using ROUGE. Above diagonal: Unigram overlap (ROUGE-1). Below diagonal: 4-gram overlap (ROUGE-4). Model order (M-) follows Table 6.

Figure

Figure 1: The distribution of important sentences over the length of the article according to human annotators (blue) and its cumulative distribution (red).

Figure

Figure 1: Procedure to generate synthetic training data. S is a set of source documents, T + is a set of semantically invariant text transformations, T − is a set of semantically variant text transformations, + is a positive label, − is a negative label.

Figure

Figure 1: Distribution of Gold Summary Rank

Figure

Figure 2: Validation losses for BERTSUM, RoBERTa, and SynRoBERTa (ns = {1, 2}) . “[CLS]” and “[ROOT]” indicate the tokens of sentence representations for predicting labels.

Figure

Figure 1: Different from the previous work, DISCOBERT (Xu et al., 2020), NeRoBERTa selects sentences by considering both intra- and inter-sentence relationships as a nested tree structure.

Figure

Figure 1: Motivating example. A document from CNN.com (keywords generated by masking procedure are bolded), the masked version of the article, and generated summaries by three Summary Loop models under different length constraints.

Figure

Figure 2: The Summary Loop involves three neural models: Summarizer, Coverage and Fluency. Given a document and a length constraint, the Summarizer writes a summary. Coverage receives the summary and a masked version of the document, and fills in each of the masks. Fluency assigns a writing quality score to the summary. The Summarizer model is trained, other models are pretrained and frozen.

Figure

Figure 4: Histogram and average copied span lengths for abstractive summaries. A summary is composed of novel words and word spans of various lengths copied from the document. Summary Loop summaries copy shorter spans than prior automatic systems, but do not reach abstraction levels of human-written summaries.
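
A simplified sketch of how copied span lengths might be measured for this histogram, using greedy longest matching of the summary against the document; this is an assumed procedure, not necessarily the authors' exact one.

```python
# Greedily match the summary against the document, recording the length of
# each maximal copied span; novel words are skipped.
def copied_span_lengths(doc_tokens, summary_tokens):
    positions = {}
    for i, tok in enumerate(doc_tokens):
        positions.setdefault(tok, []).append(i)
    spans, i = [], 0
    while i < len(summary_tokens):
        best = 0
        for start in positions.get(summary_tokens[i], []):
            length = 0
            while (i + length < len(summary_tokens)
                   and start + length < len(doc_tokens)
                   and summary_tokens[i + length] == doc_tokens[start + length]):
                length += 1
            best = max(best, length)
        if best > 0:
            spans.append(best)
            i += best
        else:
            i += 1  # novel word, not part of any copied span
    return spans
```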

Figure

Figure 3: The Coverage model uses a finetuned BERT model. The summary is concatenated to the masked document as the input, and the model predicts the identity of each blank from the original document. The accuracy obtained is the raw coverage score.

Figure

Figure 1: Example document with an inconsistent summary. When running each sentence pair (Di, Sj) through an NLI model, S3 is not entailed by any document sentence. However, when running the entire (document, summary) at once, the NLI model incorrectly predicts that the document highly entails the entire summary.

Figure

Figure 2: Diagram of the SUMMAC-ZS (top) and SUMMAC-Conv (bottom) models. Both models utilize the same NLI pair matrix (middle) but differ in how they process it to obtain a score. SUMMAC-ZS is zero-shot and has no trained parameters. SUMMAC-Conv uses a convolutional layer trained on a binned version of the NLI pair matrix.
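
A compact sketch of the zero-shot aggregation over the NLI pair matrix shown in the diagram; `nli_entailment_prob` is a hypothetical wrapper around any sentence-pair NLI model, and the max-then-mean aggregation is a simplification of the full method.

```python
# Score every (document sentence, summary sentence) pair with an NLI model,
# take the maximum entailment score per summary sentence, then average.
import numpy as np

def summac_zero_shot(doc_sentences, summary_sentences, nli_entailment_prob):
    # pair_matrix[i, j] = entailment prob of summary sentence j given doc sentence i
    pair_matrix = np.array([[nli_entailment_prob(d, s) for s in summary_sentences]
                            for d in doc_sentences])
    per_summary_sentence = pair_matrix.max(axis=0)  # best supporting doc sentence
    return per_summary_sentence.mean()              # document-level consistency score
```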

Figure

Figure 3: An example from our human evaluation.

Figure

Figure 1: Extractiveness of generated outputs versus automated metric scores for Entailment, FactCC and DAE on the Gigaword dataset. We use coverage defined in Grusky et al. (2018) to measure extractiveness, where summaries with higher coverage are more extractive. We observe that automated metrics of faithfulness are positively correlated with extractiveness.

Figure

Figure 2: Faithfulness-Abstractiveness trade-off curves. The blue dots represent the quartile models used to generate the curve. The purple dot corresponds to the baseline. DAE and Loss Truncation are depicted by the brown and orange dots respectively. The green dots correspond to our proposed systems.

Figure

Figure 3: Comparison between conditions for average time to summarize (per document) for Reddit and XSum. In general, participants in XSum took longer to complete the task, likely due to unfamiliarity with the domain.

Figure

Figure 1: Sample task interface for the AI post-edit condition for XSum, showing the provided, AI-generated summary in the text box.

Figure

Figure 8: Task interface and questions on the summarization task.

Figure

Figure 7: XSum tutorial first example with good and bad explanations.

Figure

Figure 5: This figure shows the document length distribution of both datasets. The average length of the Reddit posts is 243.8 words, and the average length of the XSum articles is 222.3 words.

Figure

Figure 2: Average overall quality ratings for the summaries by type and dataset. For Reddit, the human reference was the worst (aside from the Random summary). For XSum, the AI-generated summary was the worst.

Figure

Figure 4: User experience plots for task difficulty, “I found it difficult to summarize the article well”, frustration, “Performing the summarization tasks was frustrating”, and assistance utility, “The provided summaries were not useful to me when I was performing the summarization tasks” for Reddit (Left) and XSum (Right). Responses were collected using 7 point rating scales.

Figure

Figure 6: Reddit tutorial first example with good and bad explanations.

Figure

Figure 9: Annotation interfaces.

Figure

Figure 2: System architecture. In this example, a sentence pair is chosen (red) and then merged to generate the first summary sentence. Next, a sentence singleton is selected (blue) and compressed for the second summary sentence.

Figure

Figure 4: A sentence’s position in a human summary can affect whether it is created by compression or by fusion.

Figure

Figure 2: Frequency of each merging method. Concatenation is the most common method of merging.

Figure

Figure 1: Annotation interface. A sentence from a random summarization system is shown along with four questions.

Figure

Figure 3: Position of ground-truth singletons and pairs in a document. The singletons of XSum can occur anywhere; the first and second sentence of a pair also appear far apart.

Figure

Figure 1: Portions of summary sentences generated by compression (content is drawn from 1 source sentence) and fusion (content is drawn from 2 or more source sentences). Humans often grab content from 1 or 2 document sentences when writing a summary sentence.

Figure

Figure 3: The first attention head from the l-th layer is dedicated to coreferring mentions. The head encourages tokens of the same PoC to share similar representations. Our results suggest that the attention head of the 5th layer achieves competitive performance, while most heads perform better than the baseline. The findings are congruent with Clark et al. (2019), which provides a detailed analysis of BERT’s attention.

Figure

Figure 2: Comparison of various highlighting strategies. Thresholding obtains the best performance.

Figure

Figure 2: Our TRANS-LINKING model facilitates summary generation by reducing the shifting distance, allowing the model attention to shift from “John” to the tokens “[E]” then to “loves” for predicting the next summary word.

Figure

Figure 2: Statistics of PoC occurrences and types.

Figure

Figure 1: An illustration of the annotation interface. A human annotator is asked to highlight text spans referring to the same entity, then choose one from the five pre-defined PoC types.

Figure

Figure 2: Training progress on WIKI’s training and validation data

Figure

Figure 1: Architecture of our Q-network

Figure

Figure 3: Validation Performance among Masked Ratio for Mask-and-Fill with Masked Article. We experiment with each of the five combinations of article mask ratio and summary mask ratio, and then plot the interpolated results.

Figure

Figure 1: An example of generated negative summary using masked article. Spans that are highlighted are masked when generating the negative summary. Note that red spans are factually inconsistent with the given article and blue spans are factually consistent.

Figure

Figure 4: Generated negative summaries for various masking ratios on the CNN/DM dataset. For MFMA and MF, we fix the summary masking ratio γ_S = 0.6.

Figure

Figure 7: Case study on entailment-based models. The first example comes from FactCC-Test and the second example comes from XSumHall.

Figure

Figure 5: Validation set performance versus the BERTScore between the original reference summaries and the negative summaries we generate using various combinations of article and summary masking ratios.

Figure

Figure 2: Overall flow of our proposed negative summary generation method, Mask-and-Fill with Masked Article.

Figure

Figure 1: An example of a news story in our data set. The short manual summary is marked in red rectangle. The blue rectangle shows a post from a user. In the green rectangle, it is a link of a related news story. Some posts may only include comments, reactions, etc. without the link to the related news stories.

Figure

Figure 2: Our deep recurrent generative decoder (DRGD) for latent structure modeling.

Figure

Figure 1: Headlines of the top stories from the channel “Technology” of CNN.

Figure

Figure 1: Our key information guide model. It consists of key information guide network, encoder and decoder. In the key information guide network, we encode the keywords to the key information representation k.

Figure

Figure 2: Visualization of sentence selection vectors. Ii and Oi indicate the i-th sentence of the input and output, respectively. Obviously, our model can detect more salient sentences that are included in the reference summary.

Figure

Figure 1: Comparison of sentence-level attention distributions for the summaries in Table 1 on a news article. (a) is the heatmap for the gold reference summary, (b) is for the Seq2seq-baseline system, (c) is for the Point-gen-cov (See et al., 2017) system, (d) is for the Hierarchical-baseline system and (e) is for our system. Ii and Oi indicate the i-th sentence of the input and output, respectively. Obviously, the seq2seq models, including the Seq2seq-baseline model and the Point-gen-cov model, lose much salient information of the input document and focus on the same set of sentences repeatedly. The Hierarchical-baseline model fails to detect several specific sentences that are salient and relevant for each summary sentence and focuses on the same set of sentences repeatedly. On the contrary, our method with structural regularizations focuses on different sets of source sentences when generating different summary sentences and discovers more salient information from the document.

Figure

Figure 2: Average count of novel words (words that do not appear in the article). The Seq2seq model generates more novel words, but fewer of them appear in the reference compared to our model.

Figure

Figure 3: Comparison of the output of two models on a news article. Bold words in text are the key information. (Baseline: enc-dec+attn; Our model: KIGN+prediction-guide)

Figure

Figure 1: The framework of our model. The entailment-aware encoder is learned by jointly training summarization generation (left part of (a), which is a seq2seq model) and entailment recognition (right part of (a), in which sentence pairs from the entailment recognition corpus are encoded as u and v). The entailment-aware decoder is learned via entailment RAML training, in which the summary is rewarded if it is entailed by the source sentence.

Figure

Figure 1: Our abstractive document summarization model, which mainly consists of three layers: document encoder layer (the top part), information selection layer (the middle part) and summary decoder layer (the bottom part).

Figure

Figure 2: ROUGE-1, ROUGE-2 and ROUGE-L F1 scores of the KIGN+Prediction-guide model w.r.t. different values of the hyperparameter α.

Figure

Figure 2: Our hierarchical encoder-decoder model with structural regularization for abstractive document summarization.

Figure

Figure 3: The performance of (a) summarization generation on Gigaword validation set and (b) entailment recognition on SNLI (Bowman et al., 2015) validation set with different task batch switches (α).

Figure

Figure 4: The structural regularization reduces undesirable repetitions, while summaries from the Seq2seq-baseline and the Hierarchical-baseline contain many n-gram repetitions.

Figure

Figure 3: Comparisons of structural-compression and structural-coverage analysis results on random samples from CNN/Daily Mail datasets, which demonstrate that both the Seq2seq-baseline model and the Hierarchical-baseline model are not yet able to capture them properly, but our model with structural regularizations achieves similar behavior with the gold reference summary.

Figure

Figure 4: An example of the generated review summaries from S2S+Att, PGN and USN (italic and bold denote words that do not appear in the review).

Figure

Figure 3: Effects of user-specific vocabulary size on development set of Trip.

Figure

Figure 5: Speed comparison of classical DPPs sampling (blue), FGMInference (red) and BFGMInference (gray) with a batch size of 100.

Figure

Figure 2: The architecture of the User-aware Sequence Network (USN). USN encodes two kinds of user information, the user embedding (u) and the user-specific vocabulary memory (U), into its two basic modules (User-aware Encoder and User-aware Decoder). ① and ② show strategies based on the user embedding and represent the User Selection and User Prediction strategies, respectively. ③ and ④ indicate strategies based on the user-specific vocabulary memory and represent the User Memory Prediction and User Memory Generation strategies, respectively.

Figure

Figure 4: Conditional sampling in Macro DPPs

Figure

Figure 7: The representation degeneration problem in NLG. We use tSNE (Maaten and Hinton, 2008) to reduce the dimension of the word embeddings learned by the model.

Figure

Figure 1: Degenerated attention distribution behind OTR problem. The generated summary repeats the first sentence in article. We select the first 16 words of summary and show their attention over first 50 words of article.

Figure

Figure 6: Actual attention distribution learned by vanilla model and DPPs models.

Figure

Figure 1: Personalized review summarization is motivated by the observation that different users are likely to write different summaries for the same review, according to their own experiences, thoughts, or writing styles.

Figure

Figure 4: Effects of attribute-specific vocabulary size on review summarization on the development set of TripAtt. When there is no attribute-specific vocabulary (size 0) in ASN, our model degenerates into S2SATT. The primary axis is for ROUGE-1 and ROUGE-L, and the secondary axis is for ROUGE-2.

Figure

Figure 2: Comparison of different reweighting methods on a simulated distribution. DPPs sampling reweighting approximates the original distribution better since it captures the high-attention area around position 160. It also samples fewer adjacent points around position 110.

Figure

Figure 3: The architecture of the Attribute-aware Sequence Network (ASN). ASN encodes two kinds of attribute information, the attribute embedding (a) and the attribute-specific vocabulary memory (A), into its two basic modules (Attribute-aware Review Encoder and Attribute-aware Summary Decoder). ① and ② show strategies based on the attribute embedding and represent the Attribute Selection and Attribute Prediction strategies, respectively. ③ and ④ indicate strategies based on the attribute-specific vocabulary memory and represent the Attribute Memory Prediction and Attribute Memory Generation strategies, respectively.

Figure

Figure 3: Parameter tuning of k on the metrics of Rhyme, Integrity, and Micro-Dist-2.

Figure

Figure 1: Examples of text with rigid formats. In lyrics, the syllables of the lyric words must align with the tones of the notation. In SongCi and Sonnet, there are strict rhyming schemes and the rhyming words are labeled in red color and italic font.

Figure

Figure 2: An example of a generated summary sentence that is fused by cross-sentence EDUs.

Figure

Figure 3: Results of Co-Selective models with MTL and two-stage learning (TSL) for summarization task.

Figure

Figure 2: The framework of our model with co-selective encoding. During training, a BiLSTM reads the original sentence (x1, x2, ..., xn) and the ground-truth keywords (k1, k2, ..., km) into the first-level hidden states h^r_i and h^k_i. A jointly trained keyword extractor takes h^r_i as input to predict whether each input word is a keyword or not. The co-selective encoding layer builds the second-level hidden states h^{r'}_i and h^{k'}_i. Then the summary is generated via dual-attention and dual-copy over both the original sentence and the keyword sequence. During testing, the ground-truth keywords are replaced by the keywords predicted by our trained keyword extractor.

Figure

Figure 4: Results of Co-Selective models with (w/) and without (w/o) fine-tuning (FT) for summarization task.

Figure

Figure 1: The overlapping keywords (marked in red) between the input sentence and the reference summary cover the main ideas of the input sentence. Our motivation is to generate summary guided by the keywords extracted from the input sentence.

Figure

Figure 1: Overall Architecture of Our Model

Figure

Figure 2: The framework of our proposed model.

Figure

Figure 3: R1/R2/RL vs. sparsity for token-level and sentence-level models. For the sentence-level model, we require it to extract at least three sentences.

Figure

Figure 3: Results of CoCoNet + CoCoPretrain model with different pre-training data selection strategies. “RG” is short for “ROUGE”.

Figure

Figure 2: The process of constructing the pre-training data. Given a piece of text, we divide it into an input span and an output span, and calculate their overlap score using Equation 22. The top-K scored span pairs are selected.
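
A hedged sketch of this construction; since Equation 22 is not reproduced in the caption, a simple lexical-overlap score stands in for it, while the splitting and top-K selection follow the description above.

```python
# Split each text into an input span and an output span, score the pair by a
# placeholder overlap measure, and keep the top-K scored pairs.
from itertools import islice

def token_overlap(input_span, output_span):
    a, b = set(input_span), set(output_span)
    return len(a & b) / max(1, len(b))   # placeholder for Equation 22

def build_pretraining_pairs(documents, split_point, k):
    scored = []
    for doc in documents:                # doc: list of tokens
        inp, out = doc[:split_point], doc[split_point:]
        scored.append((token_overlap(inp, out), inp, out))
    scored.sort(key=lambda x: x[0], reverse=True)
    return [(inp, out) for _, inp, out in islice(scored, k)]  # top-K pairs
```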

Figure

Figure 4: The rate of correctly copied n-grams.

Figure

Figure 2: The Extractive-Abstractive model architecture. The extractor samples the evidence from the source which is used by the abstractor.

Figure

Figure 4: Summarization outputs with their evidence (highlighted), from our systems at different sparsity levels.

Figure

Figure 1: An example of a summary and its evidence (highlighted) as generated by our framework.

Figure

Figure 6: Performance on arXiv and PubMed when we filter test-set examples by summary length.

Figure

Figure 4: FAR’s performance against different values of β on arXiv and NYT.

Figure

Figure 2: Visualization of facet bias. Nodes refer to sentence representations and star is the document representation. Black solid circles mean facets. Red dashed circle means threshold in Section 3.1. The dashed bidirection arrows denote the sentence similarities.

Figure

Figure 5: The examples come from the New York Times dataset.

Figure

Figure 1: Examples from New York Times. We show a selection of key sentences from the source document. “...” indicates context sentences omitted due to space limitations.

Figure

Figure 3: Sentence position distribution of arXiv and NYT. We use the first 40 sentences for NYT and the first 120 sentences for arXiv.

Figure

Figure 7: Sentence position distribution of 8 datasets.

Figure

Figure 5: The inference time of each system. Each time is the average of multiple runs (10 runs). “×N” means the running time is N times (rounded up) that of our method.

Figure

Figure 6: Impact of hyper-parameters λ and α.

Figure

Figure 3: A diagram for document segmentation.

Figure

Figure 2: The workflow of our proposed coarse-to-fine facet-aware ranking framework.

Figure

Figure 4: The smooth similarity curve.

Figure

Figure 1: An example from the Gov-Report dataset to introduce the process of our method. “...” refers to context sentences omitted due to space limitations. Highlighted sentences are the final extracted summary sentences. The content each arrow points to is the facet description of the semantic block on its left. Bold facets represent vital facet-aware semantic blocks of the final summary.

Figure

Figure 1: Structure of our proposed Convolutional Gated Unit. We implement 1-dimensional convolution with a structure similar to the Inception (Szegedy et al., 2015) over the outputs of the RNN encoder, where k refers to the kernel size.

Figure

Figure 2: Percentage of the duplicates at sentence level. Evaluated on the Gigaword.

Figure

Figure 1: Model architecture for sequence-to-sequence with coarse-to-fine attention. The left side is the encoder that reads the document, and the right side is the decoder that produces the output sequence. On the encoder side, the top-level hidden states are used for the coarse attention weights, while the word-level hidden states are used for the fine attention weights. The context vector is then produced by a weighted average of the word-level states. In HIER, we average over the coarse attention weights, thus requiring computation of all word-level hidden states. In C2F, we make a hard decision for which chunk of text to use, and so we only need to compute word-level hidden states for one chunk.

Figure

Figure 2: Predicted summaries for each model. The source document is truncated for clarity.

Figure

Figure 3: Sentence attention visualizations for different models. From left to right: (1) STANDARD, (2) HIER, (3) C2F, (4) C2F +MULTI2 +POS.

Figure

Figure 10: Example extractive output / abstractive input for models in the “dewey & lebeouf” example. The extractive method used is tf-idf.

Figure

Figure 5: Translation examples from the Transformer-ED, L = 500.

Figure

Figure 4: Similarity at different lengths

Figure

Figure 1: CNN seq2seq model

Figure

Figure 9: Screenshot of side-by-side human evaluation tool. Raters are asked whether they prefer model output on the left or right, given a ground truth Wikipedia text.

Figure

Figure 7: Three different samples from a T-DMCA model trained to produce an entire Wikipedia article, conditioned only on the title. Samples 1 and 3 are truncated due to space constraints.

Figure

Figure 5: Variance at different lengths

Figure

Figure 6: An example decoded from a T-DMCA model trained to produce an entire Wikipedia article, conditioned on 8192 reference document tokens.

Figure

Figure 2: Modified Decoder

Figure

Figure 4: Shows predictions for the same example from different models. Example model input can be found in Appendix A.4.

Figure

Figure 3: Shows perplexity versus L for tf-idf extraction on the combined corpus for different model architectures. For T-DMCA, E denotes the size of the mixture-of-experts layer.

Figure

Figure 3: The bucket distribution of the dataset

Figure

Figure 8: Screenshot of DUC-style linguistic quality human evaluation tool.

Figure

Figure 1: The architecture of the self-attention layers used in the T-DMCA model. Every attention layer takes a sequence of tokens as input and produces a sequence of similar length as the output. Left: Original self-attention as used in the transformer-decoder. Middle: Memory-compressed attention, which reduces the number of keys/values. Right: Local attention, which splits the sequence into individual smaller sub-sequences. The sub-sequences are then merged together to get the final output sequence.

Figure

Figure 2: ROUGE-L F1 for various extractive methods. The abstractive model contribution is shown for the best combined tf-idf -T-DMCA model.

Figure

Figure 4: Examples of generated summaries. Colored spans contain key information from the gold reference.

Figure

Figure 1: The structure of the MemAttr model.

Figure

Figure 2: The structure of our Memory Network.

Figure

Figure 1: An example of news summarization. Colored spans are salient segments selected to form a summary, and their corresponding sentences are underlined.

Figure

Figure 2: Examples of discourse-level segmentation. a) spans in blue and yellow are the EDUs with semantically fragmented information and spans in red are the inaccurate EDU splits; b) the sub-sentential segments after merging.

Figure

Figure 1: Architecture of the original BERT model (left) and BERTSUM (right). The sequence on top is the input document, followed by the summation of three kinds of embeddings for each token. The summed vectors are used as input embeddings to several bidirectional Transformer layers, generating contextual vectors for each token. BERTSUM extends BERT by inserting multiple [CLS] symbols to learn sentence representations and using interval segmentation embeddings (illustrated in red and green color) to distinguish multiple sentences.

Figure

Figure 3: Content selector designs: a) RNN architecture; b) BERT architecture.

Figure

Figure 2: Proportion of extracted sentences according to their position in the original document.

Figure

Figure 1: Dependency discourse tree for a document from the CNN/DailyMail dataset (Hermann et al., 2015). Blue nodes indicate the roots of the tree (i.e., summary sentences) and parent-child links indicate dependency relations.

Figure

Figure 6: Overview of the neural selector architecture.

Figure

Figure 7: Position distribution of generated summaries from a strong baseline model BertEXT and our conditional summarization model with position code set to 0 (3 implementations). X axis is the position ratio. Y axis is the sentence-level proportion.

Figure

Figure 1: Proposed conditional generation framework exploiting sub-aspect functions.

Figure

Figure 2: Cumulative position distribution of oracles built on ROUGE (blue) and BertScore (orange). X axis is the ratio of article length. Y axis is the cumulative percentage of summary sentences.

Figure

Figure 5: Sentence-level clustering result labeled with sub-aspect features. X axis is the cluster index. Y axis is the proportion of sub-aspect features in each cluster.

Figure

Figure 3: Sample-level distribution of sub-aspect functions of the BertScore oracle. Values are the percentage in categorized samples, which adds up to 60.03% of CNN/Daily Mail training set. The remaining 39.97% do not belong to any of these 3 sub-aspects.

Figure

Figure 9: Sub-aspect mapping of generated summary with diversity-focus code [0,1,0]. Left panel: one sentence in the summary belongs to diversity sub-aspect. Right panel: two sentences in the summary belong to diversity sub-aspect. Contour lines denote the number of generated summaries.

Figure

Figure 4: Autoencoder with adversarial training strategy for unsupervised clustering of sentence-level distribution of sub-aspect functions.

Figure

Figure 8: Sub-aspect mapping of generated summary with importance-focus code [1,0,0]. Left panel: one sentence in the summary belongs to the importance sub-aspect. Right panel: two sentences in the summary belong to the importance sub-aspect. Contour lines denote the number of generated summaries.

Figure

Figure 2: ROUGE-1 distributions of the candidates in pretraining stage training set (pre-train), fine-tuning stage training set (meta-train) and fine-tuning stage test set (meta-test) on XSum dataset.

Figure

Figure 3: The Refactor’s success rates with different bin widths. W denotes the bin width measured by ROUGE-1. R denotes the success rate of the Refactor outperforming the single best base system.

Figure

Figure 1: Illustration of two-stage learning. “Doc, Hypo, Ref” represent “input document, generated hypothesis, gold reference” respectively. “Hypo’” represents texts generated during test phase. ΘBase and ΘMeta represent learnable parameters in two stages.

Figure

Figure 1: An illustration of sparse attention patterns ((a), (b), (c)) and their combination (d) in HETFORMER.

Figure

Figure 2: Test performance with different numbers of candidate summaries on CNNDM. Origin denotes the original performance of the baseline model.

Figure

Figure 4: Fine-tuned Refactor’s selection accuracy on CNNDM with different difficulties. The X-axis is the difference of ROUGE score of BART and pre-trained Refactor outputs.

Figure

Figure 1: SimCLS framework for two-stage abstractive summarization, where Doc, S, Ref represent the document, generated summary and reference respectively. At the first stage, a Seq2Seq generator (BART) is used to generate candidate summaries. At the second stage, a scoring model (RoBERTa) is used to predict the performance of the candidate summaries based on the source document. The scoring model is trained with contrastive learning, where the training examples are provided by the Seq2Seq model.

Figure

Figure 3: Positional Bias. X-axis: the relative position of the matched sentence in source documents. Y-axis: the ratio of the matched sentences. For fair comparison, articles are first truncated to the generator’s maximum input length. Origin denotes the original performance of the baseline model.

Figure

Figure 2: Illustration of our length-control algorithm.

Figure

Figure 3: Kendall’s τ correlation of evaluation metrics with and without compression ratio.

Figure

Figure 4: Comparing our length-control NAUS and the truncated CTC beam search on the Gigaword headline generation test set.

Figure

Figure 1: The overview of our NAUS approach. In each search step, input words corresponding to grey cells are selected. Besides, the blue arrow refers to the training process, and the green arrow refers to inference.

Figure

Figure 1: The comparison between PSP and previous methods. “E” and “D” represent the encoder and the decoder, respectively.

Figure

Figure 3: The overall framework of SEGTRANS model. The blue circles indicate input source text, where dark blue circles indicate paragraph boundaries. The yellow circles indicate output target text, where orange circles indicate heading boundaries. Dotted red lines indicate attention heads with segmentation-aware attention mechanism and dotted blue lines indicate attention heads with original full attention mechanism.

Figure

Figure 1: Overview of LAAM on Transformer Seq2seq. The bold values are boosted attention scores. The shadow boxes denote the attention scores of eos.

Figure

Figure 2: Loop of candidate generation and model finetuning.

Figure

Figure 3: Performance comparison (BART v.s. BRIO-Mul) w.r.t. reference summary novelty. The x-axis represents different buckets of test examples grouped by reference summary novelty (Eq. 11). Larger x-coordinates correspond to examples of which the reference summaries have higher novelty. The left figure shows the performance improvement of our model compared with the baseline model, while the right one shows model performance.

Figure

Figure 3: Architecture and training scheme of PSP. Squares in blue and red indicate frozen and tuned parameters, respectively.

Figure

Figure 2: Architecture of semantic distribution from auto-regressive language model.

Figure

Figure 4: Reliability graphs on the CNNDM and XSum datasets. The accuracy of model’s predictions is plotted against the model’s confidence on these predictions.

Figure

Figure 2: Visualization of the encoder-decoder attention weights. The x-axis are the encoder input, including prompts across the encoder Pen and the source document X . The y-axis are the decoder input, including prompts across the decoder Pde and the target summary Y . The area in the red box represents the attentions of Pde assigning to Pen. The area in the yellow box represents the attentions of Y assigning to X . Darker color shows the more highly related associations between tokens.

Figure

Figure 1: An overview of our CAST method.

Figure

Figure 3: Performance versus the number of training samples in the setting of Group B, Table 1. Notice that NAUS is trained by pseudo-groundtruth given by unsupervised edit-based search (Schumann et al., 2020). Thus, our approach is indeed unsupervised.

Figure

Figure 4: One example news article on CNN website. It contains human-annotated segments and heading-style summaries.

Figure

Figure 1: Comparison of MLE loss (LMLE) and the contrastive loss (LCtr) in our method. MLE assumes a deterministic (one-point) distribution, in which the reference summary receives all the probability mass. Our method assumes a nondeterministic distribution in which system-generated summaries also receive probability mass according to their quality. The contrastive loss encourages the order of model-predicted probabilities of candidate summaries to be coordinated with the actual quality metric M by which the summaries will be evaluated. We assign the abstractive model a dual role – a single model could be used both as a generation model and a reference-free evaluation model.
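
A rough sketch of one pairwise max-margin formulation consistent with this description; the function name, the margin scaling lambda_m, and the exact functional form are assumptions, not necessarily the paper’s loss.

```python
import torch

def contrastive_loss(scores, metric_values, lambda_m=0.001):
    """scores: model log-probabilities for each candidate summary (tensor of shape [n]).
       metric_values: quality of each candidate under the metric M (e.g., ROUGE), same shape."""
    order = torch.argsort(metric_values, descending=True)   # best candidate first
    s = scores[order]
    loss = scores.new_zeros(())
    n = s.size(0)
    for i in range(n):
        for j in range(i + 1, n):
            # candidate i is better than j under M, so its predicted score should be higher
            loss = loss + torch.relu(s[j] - s[i] + (j - i) * lambda_m)
    return loss
```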

Figure

Figure 1: One example from the segmentation-based summarization task SEGNEWS. The news article is taken from a CNN news article and we truncate the article for display. CNN editors have divided this article into several sections and written a heading for each section. The goal of this task is to automatically identify sub-topic segments of multiple paragraphs and generate the heading-style summary for each segment. Dotted lines in the figure indicate segment boundaries. In this article, paragraphs 1,2 are annotated as the first segment, paragraphs 3,4 as the second segment, paragraphs 5,6 as the third segment, and paragraphs 7,8 as the fourth segment. To the right of the article are the heading-style summaries for the segments. Since the first segment is usually an overview of the news, we do not assign a summary to it.

Figure

Figure 2: The frequency of the non-stop words in summary appearing at different positions of the source article. The positions range from [0, 1024].

Figure

Figure 2: Variance of generated summary lengths in gold length test with soft length control.

Figure

Figure 4: Var and R-2 (Pre) of arbitrary length test with soft length control on complete test sets.

Figure

Figure 6: k-shot summarization results on XSum.

Figure

Figure 3: Var and R-2(F1) scores of gold length test with soft length control on divided test sets.

Figure

Figure 2: Coverage and density distributions of the BigSurvey.

Figure

Figure 5: Visualization of the encoder-decoder attention weights of the model with only prompts across the encoder and the decoder (left) and PSP (right). See Figure 2 for a detailed description.

Figure

Figure 1: A document with its high-quality and low-quality summaries. The heat map marks the salient content in the document. The darker the colour, the more salient the content.

Figure

Figure 4: Different inner prompts for one example source document. Different colors indicate different inner prompt embeddings. “NO. of words” means the length of the text span.

Figure

Figure 2: The overall framework of HER is formulated as a contextual bandit and can be divided into a two-stage process containing rough reading and careful reading.

Figure

Figure 5: Statistics on the number of sentences extracted by our model. Frequency is the number of documents.

Figure

Figure 3: Statistics of the selected sentences’ indices for HER, BANDITSUM (Dong et al., 2018), and HER w/o Local Net across different document lengths, reported on the test-split documents whose lengths are all less than 80.

Figure

Figure 4: A case on sentence selection of HER and HER w/o policy. The article is from the CNN dataset. The highlighted indices indicate the corresponding sentences that should be extracted as the summary.

Figure

Figure 1: An example of how human beings extract a summary. The article is from the CNN/DailyMail dataset.

Figure

Figure 1: Model architecture.

Figure

Figure 2: An example of the mixed transitive negative sampling process. The original part is in white, while the modified part is indicated as grey blocks.

Figure

Figure 3: The architecture of our proposed model for abstractive summarization. Our model consists of three parts: 1. Transformer Encoder-Decoder, 2. Entity Pointer Network, 3. Relation Pointer Network. The encoder in the Transformer Encoder-Decoder shares parameters with that in the Relation Pointer Network.

Figure

Figure 2: A document from XSum dataset and the facts in it.

Figure

Figure 5: Sample generated summaries by our models. The intrinsic hallucinations in the summary are marked blue and the key information in the document is bolded.

Figure

Figure 1: A sample document with corresponding summaries generated by different abstractive summarization methods, in which extrinsic hallucinations are marked in yellow and intrinsic hallucinations in blue. Note that the results of PTGEN [25] and TCONVS2S [20] come from Maynez et al. [18].

Figure

Figure 4: Sample generated summaries by our models. The extrinsic hallucinations in the summary are marked yellow and the key information in the document is bolded.

Figure

Figure 1: The overview of our model.

Figure

Figure 3: Predicted and ORACLE global attention in BART. Attention distributions are shown for (a) the whole source, (b) the source without the start & end tokens, and (c) the source without the start & end tokens and full stops.

Figure

Figure 1: (a) The attention distribution is the summation of cross attention over the same-colored lines, distinguished from that over different-colored lines, which always equals 1 due to softmax. (b) Local attention gradually increases as decoding proceeds. (c) Desired situation: growing local attention stays lower than global attention during decoding and exactly reaches it at the end.

Figure

Figure 4: Changes of the attention distribution when (a) one word in the reference is replaced by a similar word (s1) and a random word (s2), (b) the sentence order of the reference is shuffled, (c) a piece of factual knowledge in the reference is distorted.

Figure

Figure 2: Annotation pipeline of ENTSUM

Figure

Figure 3: Distribution of sentence positions for salient and summary sentences.

Figure

Figure 1: Example of a generic summary (blue), with three entity-centric summaries from ENTSUM focusing on the entities in bold.

Figure

Figure 1: Proposed BERT-Multitask model.

Figure

Figure 2: Specificity prediction model used.

Figure

Figure 2: Different fine-tuning conditions for T5. (- -) indicates optional additive data for Paraphrasing.

Figure

Figure 1: Mixtext model and the modified MixGen for generative tasks.

Figure

Figure 1: Examples of ∆(y, y′) of the original MRT and ∆̃(y, y′) of GOLC, where ROUGE-1 recall is calculated based on unigrams. In the two examples, the reference y is ⟨malaysia, markets, closed, for, holiday⟩, the sampled summary y′ is ⟨markets, in, malaysia, closed, for, holiday⟩, and cb(y) = len(' '.join(y)) = 35 and cb(y′) = len(' '.join(y′)) = 38.
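
The caption’s character-based length cb(·) and unigram ROUGE-1 recall can be reproduced with a few lines of Python; the helper names below are illustrative.

```python
from collections import Counter

def cb(tokens):
    # character count of the space-joined summary, as in the caption's definition
    return len(' '.join(tokens))

def rouge1_recall(reference, candidate):
    ref, cand = Counter(reference), Counter(candidate)
    overlap = sum(min(ref[w], cand[w]) for w in ref)
    return overlap / sum(ref.values())

y  = ['malaysia', 'markets', 'closed', 'for', 'holiday']
y2 = ['markets', 'in', 'malaysia', 'closed', 'for', 'holiday']
print(cb(y), cb(y2))            # 35 38, with the sequences as listed above
print(rouge1_recall(y, y2))     # 1.0 -- every reference unigram appears in y'
```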

Figure

Figure 2: Summary length distributions on CNN/Daily Mail and Mainichi. Summary length is the number of characters.

Figure

Figure 6: ROUGE-1 score relative to that of BART(1k) system evaluated on different partitions by length.

Figure

Figure 2: Self-Attention Pattern.

Figure

Figure 4: The average mean distance across multiheads for each layer. The average mean distance of the random weight model is slightly lower than DU as some inputs are shorter than 1,024.

Figure

Figure 4: ∆R1 (Y-axis) against r at inference (X-axis).

Figure

Figure 1: Overview of the combined architecture where we highlight different aspects of this work. N0 is the original document length, N is the input length to the generation system, and M is the summary length.

Figure

Figure 8: LoBART positional embedding is initialized by copying and flipping BART’s positional embedding.

Figure

Figure 9: ROUGE-1 score relative to that of BART(1k) on Spotify Podcast (Len:Avg=5,727, 90th%=11,677).

Figure

Figure 7: Example of LoBART’s encoder-decoder attention evaluated on Podcast test set.

Figure

Figure 1: The sum of attention weights against the number of retained sentences (r) evaluated on CNNDM.

Figure

Figure 5: Example of BART’s encoder-decoder attention evaluated on CNNDM test set.

Figure

Figure 5: The impact of training-time content selection methods on BART(1k) performance.

Figure

Figure 3: Operating points for B=1 and M=144. (1) Section 4 studies local attention to reduce quadratic complexity to linear. As W decreases, the gradient of linear complexity decreases. (2) Section 5 studies content selection to move an operating point to the left.

Figure

Figure 8: Example of LoBART’s encoder-decoder attention evaluated on arXiv test set.

Figure

Figure 3: Modified architecture with model-based approximator where the base model is BART/LoBART. Model-based neural approximator is shown in orange.

Figure

Figure 6: Example of BART’s encoder-decoder attention evaluated on XSum test set.

Figure

Figure 4: Comparison of summarization metrics. Support sentences are marked in the same color as their corresponding facets. SCUs have to be annotated for each extracted summary during evaluation, while facet-aware evaluation can be conducted automatically by comparing sentence indices.

Figure

Figure 5: The first three figures show the ground-truth and estimated FAR scores via human-annotated FAMs and machine-created FAMs. The fourth figure shows the fitting of linear regression on the human-annotated samples (LR-Small) and the prediction on the whole test set of CNN/Daily Mail (LR-Large). Systems are sorted in an ascending order by the ground-truth FAR on the human-annotated samples.

Figure

Figure 2: Performance of extractive methods under ROUGE, FAR, and SAR. The results under ROUGE-1/2/L often disagree with each other. UnifiedSum(E) generally performs the best in the facet-aware evaluation.

Figure

Figure 3: Comparison of extractive methods under FAR and SAR reflects their capability of extracting salient and non-redundant sentences.

Figure

Figure 1: An illustration of facet-aware evaluation. Two of three support groups of facet 1 (r1) are covered. Facet 2 (r2) cannot be covered as document sentence 4 (d4) is missing in the extracted summary. The illustration corresponds to the example in Sec. 3.1.
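
A minimal sketch of this coverage logic as read from the illustration; the function name and the toy example are illustrative, not taken from the paper.

```python
def facet_recall(facets, extracted_indices):
    """facets: list of facets, each a list of support groups (sets of document sentence indices).
       extracted_indices: set of sentence indices in the extracted summary."""
    covered = sum(
        any(group <= extracted_indices for group in support_groups)   # any support group fully present
        for support_groups in facets
    )
    return covered / len(facets)

# toy example: facet 1 has two support groups, facet 2 needs a sentence that was not extracted
facets = [[{0}, {1, 2}], [{3}]]
print(facet_recall(facets, extracted_indices={0, 1, 2}))   # 0.5 -- facet 2 is not covered
```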

Figure

Figure 2: Visualization of the learned node embeddings in testing at each epoch. Red nodes are words (light) and sentences (heavy) in summary labels, while blue nodes are words (light) and sentences (heavy) in non-summaries. Purple nodes are words shared by sentences between summaries and non-summaries.

Figure

Figure 1: The overview architecture of the MuchSUM with three specific graph convolutional channels and a common convolutional channel shared by the three graph channels. We denote the three specific channels as the Node Lexical Feature Encoding Channel (A, X_s), the Node Centrality Feature Encoding Channel (A, X_c) and the Node Position Feature Encoding Channel (A, X_p). In the bipartite word-sentence heterogeneous graph, each sentence node (solid node) is connected to its contained word-related nodes (hollow nodes) and takes the weight of the relation as their edge feature. Different thicknesses of edges represent different edge weights.

Figure

Figure 1: (a) BERTSUMEXTABS model. An encoder encodes the document, and a word generator generates the next word given previous words, while paying attention to the document. (b) Sentence planner model. A shared encoder separately encodes the document and each sentence of the summary generated so far. The sentence generator takes the summary sentence embeddings and predicts the next sentence embedding, which the word generator is then conditioned on. Both generators integrate document information through attention.

Figure

Figure 2: Scatter plots of ROUGE scores and support scores: the X-axis presents the ROUGE-1 score between system and reference headlines, and the Y-axis presents the support score (the same as in Figure 1).

Figure

Figure 4: The distribution of the support scores on JAMUL.

Figure

Figure 5: Guideline for entailment labeling

Figure

Figure 3: The distribution of the support scores on the English Gigaword dataset.

Figure

Figure 1: Histogram of support scores (recall-oriented ROUGE-1 scores between generated headlines and their source documents).

Figure

Figure 6: Examples of the improved headlines.

Figure

Figure 3: Sample of question-answer pairs generated from hallucinated summaries that are correctly answered by their source articles. Highlighted spans in the summaries are marked as extrinsic or intrinsic hallucinations by our annotators.

Figure

Figure 2: Human assessment of a system-generated summary for the article in Figure 1. The annotation user interface is shown as it was presented to raters.

Figure

Figure 1: Hallucinations in extreme document summarization: the abbreviated article, its gold summary and the abstractive model generated summaries (PTGEN, See et al. 2017; TCONVS2S, Narayan et al. 2018a; and, GPTTUNED, TRANS2S and BERTS2S, Rothe et al. 2020) for a news article from the extreme summarization dataset (Narayan et al., 2018a). The dataset and the abstractive models are described in Section 3 and 4. We also present the [ROUGE-1, ROUGE-2, ROUGE-L] F1 scores relative to the reference gold summary. Words in red correspond to hallucinated information whilst words in blue correspond to faithful information.

Figure

Figure 1: Summaries produced by our model. For illustration, the compressive summary shows the removed spans struck through.

Figure

Figure 2: Illustration of our summarization system. The model extracts the most relevant sentences from the document by taking into account the WordEncoder representation of the current sentence e(s_i), the SentEncoder representation of the previous sentence h^s_i, the current summary state representation o^s_i, and the representation of the document e(D). If a sentence is selected (z_i = 1), its representation is fed to SentStates, and we move to the next sentence. Here, sentences s_1 and s_3 were selected. If the model is also compressing, the compressive layer selects words for the final summary (Compressive Decoder). See Figure 3 for details on the decoders.

Figure

Figure 5: Example output summaries on the CNN/DailyMail dataset, gold standard summary, and corresponding questions. The questions are manually written using the GOLD summary. The same EXCONSUMM summaries are shown in Figure 1, but the strikethrough spans are now removed.

Figure

Figure 3: Decision decoder architecture. The decoder contains an extractive level for sentences (orange box) and a compressive level for words (dashed gray box), using an LSTM to model the summary state. Red diamond shapes represent decision variables: z_i = 1 if p(z_i | p_i) > 0.5, selecting the sentence s_i, and z_i = 0 if p(z_i | p_i) ≤ 0.5, skipping this sentence. The same applies to y_ij, with p(y_ij | q_ij) > 0.5 deciding which words w_ij to keep in the summary.

Figure

Figure 4: Word distribution in comparison with the human summaries for CNN dataset. Density curves show the length distributions of human authored and system produced summaries.

Figure

Figure 2: Oracle sentence distribution over a paper. X-axis: 10,000 papers sampled from FacetSum, sorted by full text length from long to short; y-axis: normalized position in a paper. We provide each sub-figure’s density histogram on their right.

Figure

Figure 1: Editorial Network (EditNet)

Figure

Figure 2: An example mixed summary (annotated with the editor’s decisions) taken from the CNN/DM dataset

Figure

Figure 2: HOLMS: structure and value range (3D Gaussian peaks and spreads are both set to 1).

Figure

Figure 1: Illustration of the area under curve representing the HOLMS value.

Figure

Figure 3: Hierarchical encoder with hierarchical attention: the attention weights at the word level, represented by the dashed arrows are re-scaled by the corresponding sentencelevel attention weights, represented by the dotted arrows. The dashed boxes at the bottom of the top layer RNN represent sentence-level positional embeddings concatenated to the corresponding hidden states.

Figure

Figure 1: Feature-rich-encoder: We use one embedding vector each for POS, NER tags and discretized TF and IDF values, which are concatenated together with word-based embeddings as input to the encoder.

Figure

Figure 2: Switching generator/pointer model: When the switch shows ’G’, the traditional generator consisting of the softmax layer is used to produce a word, and when it shows ’P’, the pointer network is activated to copy the word from one of the source document positions. When the pointer is activated, the embedding from the source is used as input for the next time-step as shown by the arrow from the encoder to the decoder at the bottom.

Figure

Figure 1: SummaRuNNer: A two-layer RNN based sequence classifier: the bottom layer operates at word level within each sentence, while the top layer runs over sentences. Double-pointed arrows indicate a bi-directional RNN. The top layer with 1’s and 0’s is the sigmoid activation based classification layer that decides whether or not each sentence belongs to the summary. The decision at each sentence depends on the content richness of the sentence, its salience with respect to the document, its novelty with respect to the accumulated summary representation and other positional features.

Figure

Figure 2: Visualization of SummaRuNNer output on a representative document. Each row is a sentence in the document, while the shading-color intensity is proportional to its probability of being in the summary, as estimated by the RNN-based sequence classifier. In the columns are the normalized scores from each of the abstract features in Eqn. (6) as well as the final prediction probability (last column). Sentence 2 is estimated to be the most salient, while the longest one, sentence 4, is considered the most content-rich, and not surprisingly, the first sentence the most novel. The third sentence gets the best position based score.

Figure

Figure 7: Human evaluation instruction screenshots.

Figure

Figure 2: QAGen model: for an input text (p), it generates a question (q) followed by an answer (a).

Figure

Figure 4: Correlation between QUALS and QAGS on XSUM (left) and CNNDM (right). The average QAGS tend to increase with the increase in QUALS. The standard deviation of the QAGS for each bin is about 0.187 for XSUM and 0.127 for CNNDM.

Figure

Figure 6: Human evaluation interface using Amazon Sagemaker Ground Truth.

Figure

Figure 5: Negative log likelihood per subword token on two q-a pairs from the QAGen model according to the summary (blue) and the input document (orange). Higher values mean less likely. The first q-a pair (top figure) has a much higher average negative log likelihood according to the input document than according to the summary.

Figure

Figure 3: Correlation between QUALS and QAGS on XSUM (left) and CNNDM (right). The average QAGS tend to increase with the increase in QUALS.

Figure

Figure 1: Comparison between QAGS (top) and QUALS (bottom) protocols. QUALS uses only one QAGen model instead of the AE, QG and QA models used in QAGS.

Figure

Figure 2: AUTOSUMM-CREATE.

Figure

Figure 1: Overview of the proposed approach.

Figure

Figure 10: Model created with NAS module in AUTOSUMM-CREATE, as visualised through tensorboard

Figure

Figure 4: Efficiency comparison for the extractive summarization models on the CNN/DM dataset

Figure

Figure 9: Variation of performance with the increase in KD proportion on CNN DM dataset

Figure

Figure 5: Cell distribution across varying layer size

Figure

Figure 7: Cross-Data experiments

Figure

Figure 6: Layer distribution for XSUM and Contract

Figure

Figure 3: AUTOSUMM-DISTILL.

Figure

Figure 8: Performance vs Training data variation

Figure

Figure 1: An abridged example from our extreme summarization dataset showing the document and its one-line summary. Document content present in the summary is color-coded.

Figure

Figure 1: Extractive summarization model with reinforcement learning: a hierarchical encoder-decoder model ranks sentences for their extract-worthiness and a candidate summary is assembled from the top ranked sentences; the REWARD generator compares the candidate against the gold summary to give a reward which is used in the REINFORCE algorithm (Williams, 1992) to update the model.

Figure

Figure 2: Summaries produced by the LEAD baseline, the abstractive system of See et al. (2017) and REFRESH for a CNN (test) article. GOLD presents the human-authored summary; the bottom block shows manually written questions using the gold summary and their answers in parentheses.

Figure

Figure 2: Topic-conditioned convolutional model for extreme summarization.

Figure

Figure 3: Length distributions in ETCSum summaries on the CNN/DailyMail test set.

Figure

Figure 2: Stepwise HiBERT (left) and ETCSum (right) models. HiBERT builds summary informed representation by jointly modeling partially generated summary and the document during document encoding, while ETCSum takes as input the document appended with the partially generated summary.

Figure

Figure 1: Pretraining and finetuning for abstractive summarization with entity chains.

Figure

Figure 10: Instructions for human evaluations for overall quality of summaries.

Figure

Figure 8: CNN/DailyMail example predictions from FROST and CTRLSum along with their entity prompts and keywords, respectively.

Figure

Figure 6: Example XSum predictions for models presented in Tables 3 and 4. We highlight entities in orange that are not faithful to the input document. Entities in green are faithful to the input document.

Figure

Figure 9: Instructions for human evaluations for faithfulness.

Figure

Figure 3: Sentence-level vs summary-level entity chains. We report summary-level ROUGE-L (RL-Sum), entity chain-level ROUGE-2 (R2-EPlan), and ENTF1 on the CNN/DailyMail validation set. Similar observations were made for other measures.

Figure

Figure 4: Finetuning results on the XSum validation set using one of the base-sized pretrained models: PEGASUS, FROST(F), and FROST(P+F). All pretrained models were trained for 1.5m steps. See the text for more details. We only report on a subset of measures; similar observations were made for other measures.

Figure

Figure 5: Finetuning results on the XSum (in blue) and CNN/DailyMail (in red) validation sets at various steps during pretraining FROST-Large. Instead of pretraining from scratch, we start with a PEGASUS-Large checkpoint, and continue pretraining for additional 1.5m steps with the planning objective. We report finetuning results for the PEGASUS finetuned baseline and our models at 0.1m, 1m, and 1.5m steps.

Figure

Figure 7: An example of generating summaries with topical and style diversity using modified entity prompts cmod on XSum.

Figure

Figure 2: An example of sentence-level and summary-level entity chains along with the reference summary.

Figure

Figure 3: Proposed model for Query based Abstractive Summarization with (i) query encoder (ii) document encoder (iii) query attention model (iv) diversity based document attention model and (v) decoder. The green and red arrows show the connections for timestep 3 of the decoder.

Figure

Figure 1: The sentence position and length of extracted summaries

Figure

Figure 3: Generation of context states and class-based representations by text representation component.

Figure

Figure 1: Graphical representation of the model

Figure

Figure 2: Generation of the latent code zi for a review ri by the encoder. Yellow boxes represent the neural networks that compute the prior and variational posterior distributions of latent codes.

Figure

Figure 4: Fewshot learning: ROUGE-1 with the mean and standard deviation over 5 runs.

Figure

Figure 1: The overview of our framework, in which the backbone is in charge of generating two summaries for a document. The oracle then selects which summary is better for a given document. The reward model afterward transforms the oracle’s preference into a discrete signal to optimize the backbone. Our framework contains two novel components: efficient sampling from offline data and the preference-guided reward model.

Figure

Figure 9: Fewshot learning: ROUGE-2 and ROUGE-L score with the mean and std over 5 runs

Figure

Figure 5: Online learning on Reddit TIFU: ROUGE-1 with the mean and standard deviation over 5 runs.

Figure

Figure 8: Active learning: ROUGE-2 and ROUGE-L score with the mean and std over 5 runs

Figure

Figure 7: Online learning on RedditTIFU dataset: ROUGE-2 and ROUGE-L score with the mean and std over 5 runs

Figure

Figure 2: Reward model: (a) Accuracy of 3 models on all sets (b) The ROUGE-1 score of our ROMSR.

Figure

Figure 6: Ablation study: (a) Performance with different k values; (b) Quality of selected samples; (c) Semantic similarity between online documents and offline documents.

Figure

Figure 3: Active learning: ROUGE-1 with the mean and standard deviation over 5 runs.

Figure

Figure 8: ROUGE-L box plot for all candidate summary sets generated with LkO input perturbation method for k = 1, .., 5.

Figure

Figure 5: Length box plot for all candidate summary sets generated with LkO input perturbation method for k = 1, .., 5.

Figure

Figure 4: Confusion matrix for the Coherence Pairwise Classifier.

Figure

Figure 6: ROUGE-1 box plot for all candidate summary sets generated with LkO input perturbation method for k = 1, .., 5.

Figure

Figure 10: Confusion matrix for the Fluency Pairwise Classifier.

Figure

Figure 2: IP-SPR-2 scores (measuring IP-Diversity) box plot, for all pairs of candidate summaries generated with LkO input perturbation method for k = 1, ..., 5.

Figure

Figure 9: SPR-L box plot for all pairs of candidate summaries generated with LkO input perturbation method for k = 1, .., 5.

Figure

Figure 7: SPR-1 box plot for all pairs of candidate summaries generated with LkO input perturbation method for k = 1, .., 5.

Figure

Figure 3: ROUGE-2 F1 scores box plot, for all candidate summary sets generated with LkO input perturbation method for k = 1, ..., 5.

Figure

Figure 1: A diagram of the PASS components, with an example for a collection of reviews of size d = 4, k = 1.
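
A minimal sketch, assuming the LkO (leave-k-out) perturbation referenced in the neighbouring figures removes every size-k subset of the d input reviews and summarizes each perturbed collection with a base summarizer (passed in here as a placeholder callback).

```python
from itertools import combinations

def leave_k_out_candidates(reviews, k, summarize):
    """reviews: list of d review strings; summarize: any text -> summary function (assumed)."""
    candidates = []
    for left_out in combinations(range(len(reviews)), k):
        kept = [r for i, r in enumerate(reviews) if i not in left_out]
        candidates.append(summarize(' '.join(kept)))
    return candidates   # d-choose-k candidate summaries, e.g. 4 candidates for d=4, k=1
```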

Figure

Figure 1: Histogram of the position of sentences selected by our method and PacSum on CNN/DM. PacSum uses position information which allows it to take advantage of the lead bias. In contrast, our method is position-agnostic but still captures the fact that earlier sentences are more important in news articles.

Figure

Figure 3: Correlation between metrics and human judgement on subsets of data. The x- and y-axes represent the human judgement and the metric scores, respectively. The red line is a linear regression fitted on the full data. Each dotted line is a linear regression fitted on a model-dataset subset. Each colored point has coordinates equal to the average factuality judgement and metric score for its corresponding partition.

Figure

Figure 10: Article web pages are provided.

Figure

Figure 2: Proportion of summaries with factual errors based on collected annotations, with breakdown of the categories of errors within. Full specification of categories of errors in Table 1.

Figure

Figure 1: We propose a linguistically grounded typology of factual errors. We select crowd workers to annotate summaries from two datasets according to this typology, achieving near-perfect agreement with experts. We collect FRANK, the resulting dataset, to benchmark factuality metrics and state-of-the-art summarization systems.

Figure

Figure 8: The sentences being annotated is highlighted in yellow. Relevant text is underlined in the article plain text.

Figure

Figure 5: Variation in partial Pearson correlation when omitting error types. Higher variation indicates greater influence of an error type in the overall correlation.

Figure

Figure 11: Entity question to ensure annotators read the text.

Figure

Figure 4: Partial Pearson correlation on different partitions of the data. Entailment metrics have highest correlation on pretrained models in the CNN/DM dataset. Their performance degrades significantly on XSum.

Figure

Figure 7: Instructions can be toggled.

Figure

Figure 9: After selecting that the sentence is not factual annotators choose the category of error.

Figure

Figure 6: Confusion matrix of different types of errors. The entry at row i, column j corresponds to the frequency of annotations that have Fi as the majority class and for which the disagreeing annotator selected Fj.

Figure

Figure 1: Example of the QFS dataset and a constructed graph. Query nodes are denoted by blue circles, and document nodes by yellow circles. Root words in red letters indicate important words in a query and each sentence. The nodes in the purple dotted rectangle are especially important to generate a summary.

Figure

Figure 2: Overview of proposed QSG Transformer.

Figure

Figure 1: Comparing the uni-, bi-, and tri-gram novelty for the medium sized datasets. These datasets contain generated sequences up to 128 tokens in length. The methods are as follows: NLL (baseline), RwB-Hinge, RISK2, and RISK-3. The unique average n-gram novelty (n-grams that do not appear in the source text) is shown to increase across the board compared to the standard NLL baseline.

Figure

Figure 2: Comparison of each method for the full-data approach over a medium size dataset (CNN/DM). The methods are as follows: NLL (baseline), RwB-Hinge, RISK-2, and RISK-3. We see that the reinforcement learning approaches have led, on average, to higher ROUGE-L scores for the longer summaries compared to the NLL baseline.

Figure

Figure 2: Abstract from PLOS Medicine, entity grid, bipartite entity graph

Figure

Figure 3: One mode projection of the bipartite graph

Figure

Figure 1: Control flow of our summarization method

Figure

Figure 1: Abstract from PLOS Medicine, topical grid, bipartite topical graph, one-mode projection

Figure

Figure 2: (i) A sample text from PLOS Medicine; (ii) entity graph; (iii) projection graph of the text.

Figure

Figure 3: (i) A projection graph; (ii) several instances of a coherence pattern in Figure 1, ii.

Figure

Figure 1: (i) A sample of mined coherence patterns from abstracts; nodes are sentences and edges are entity connections; (ii) Sentences S1, S3 and S5 constitute the pattern in an input document.

Figure

Figure 4: An illustration of mapping variables to overlay graph g with coherence pattern patu.

Figure

Figure 2: Overview of our saliency predictor model.

Figure

Figure 1: Our sequence generator with RL training.

Figure

Figure 1: Illustration of the encoder and decoder attention functions combined. The two context vectors (marked “C”) are computed from attending over the encoder hidden states and decoder hidden states. Using these two contexts and the current decoder hidden state (“H”), a new word is generated and added to the output sequence.

Figure

Figure 2: Cumulative ROUGE-1 relative improvement obtained by adding intra-attention to the ML model on the CNN/Daily Mail dataset.

Figure

Figure 1: The blue distribution represents the score distribution of summaries available in the human judgment datasets of TAC-2008 and TAC-2009. The red distribution is the score distribution of summaries generated by modern systems. The green distribution corresponds to the score distribution of summaries we generated in this work, as described in Section 3.

Figure

Figure 2: Example of PD K in comparison to the word distribution of reference summaries for one topic of TAC-2008 (D0803).

Figure

Figure 2: Percentage of disagreement between metrics for increasing scores of summary pairs (Scores have been normalized).

Figure

Figure 4: Pairwise correlation between evaluation metrics on various scoring range. The generated dataset uses the topics from TAC-2008 and TAC-2009. The human judgments are the ones available as part of TAC-2008 and TAC-2009.

Figure

Figure 3: The x-axis is the normalized average score of s, given by (1/n) Σ_i m_i(s), after the metrics have been normalized between 0 and 1. The y-axis shows F_N associated with the sampled summary s. We also report the average performance of current systems.

Figure

Figure 1: Figure 1a represents an example distribution of sources, Figure 1b an example distribution of background knowledge, and Figure 1c the resulting target distribution that summaries should approximate.

Figure

Figure 2: Visualization of the effectiveness of using passage nodes to enhance sentence representations. The degree of highlighting expresses the important role of the passage in the document. Underlined sentences are model-selected summaries. As a result, the selected sentences belong to passages that have high scores of α (Equation 8).

Figure

Figure 1: Overview of HeterGraphLongSum model. Passages of each document are defined as a set of sentences in sequence with a fixed number of sentences. In this architecture, the edges from passage to word and sentence to passage are not taken into account because of the redundancy.

Figure

Figure 2: n-gram overlaps between the abstracts generated by different models and the input article on the arXiv dataset. We show in detail which part of the input was copied for our TLM conditioned on intro + extract.

Figure

Figure 1: Our approach for abstractive summarization of a scientific article. An older version of this paper is shown as the reference document. First, a sentence pointer network extracts important sentences from the paper. Next, these sentences are provided along with the whole scientific article to be arranged in the following order: Introduction, extracted Sentences, abstract & the rest of the paper. A transformer language model is trained on articles organized in this format. During inference, the introduction and the extracted sentences are given to the language model as context to generate a summary. In domains like news and patent documents, the introduction can be replaced by the entire document.

Figure

Figure 3: t-SNE visualization of the TLM-learned word embeddings. The model appears to partition the space based on the broad paper category in which it frequently occurs.

Figure

Figure 1: Joint distribution of different classes. For a pair of classes c_i and c_j, the value in a cell is n_ij × 100 / min{|c_i|, |c_j|}, where n_ij is the number of tweets that have been labeled with both c_i and c_j.
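
The cell statistic can be reproduced directly from multi-label annotations; the sketch below is illustrative, and the function name and data layout are assumptions.

```python
def joint_distribution(tweet_labels, classes):
    """tweet_labels: one set of class labels per tweet; classes: list of class names."""
    size = {c: sum(c in labels for labels in tweet_labels) for c in classes}
    matrix = {}
    for ci in classes:
        for cj in classes:
            # number of tweets labeled with both classes
            n_ij = sum(ci in labels and cj in labels for labels in tweet_labels)
            matrix[(ci, cj)] = 100.0 * n_ij / (min(size[ci], size[cj]) or 1)
    return matrix
```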

Figure

Figure 3: Frequency distribution of tweets corresponding to different concerns over time.

Figure

Figure 6: Top 20 frequent entities in CORD-SUM vocabulary.

Figure

Figure 5: Summary sentences distributions of models.

Figure

Figure 2: The model contains three main modules: 1) Local Encoder: is composed of an Entity Encoder and a Sentence Encoder, the embeddings of entities and sentences are the initial features of graph nodes; 2) Heterogeneous Graph Encoder: an iteratively computed graph with FacetWeight; and 3) Extraction & Postprocess: ranks sentences while minimizing redundancy with Trigram Blocking.

Figure

Figure 3: Heat map of five section categories.

Figure

Figure 1: An example in our CORD-SUM dataset. Texts highlighted with different colors denote different facets of the summary.

Figure

Figure 7: Top 20 frequent entities in ArXiv vocabulary.

Figure

Figure 4: Oracle sentence distributions over a paper.

Figure

Figure 1: The complete pipeline of the proposed method. In the first step, we split the input text into sentences by using a regular expression handcrafted specifically for scientific documents. In the second step, we compute the sentence embeddings of the parsed sentences using SBERT. In the third step, we create a graph by comparing all the pairs of sentence embeddings obtained using cosine similarity. In the fourth step, we rank the sentences by the degree centrality in the generated graph. In the fifth and final step, we only keep a certain number of sentences or words to adjust to the length requirements of the summary.

Figure

Figure 2: The process of graph generation and ranking of the sentences. Every node in the generated complete graph represents a sentence in the document, and the weight of each edge is given by the similarity between the nodes it connects. The importance of a sentence in the document is modelled as rank(s_i) = Σ_{j=1}^{n} (1 − sim(e_i, e_j)), where e_i and e_j are the corresponding SBERT sentence embeddings of s_i and s_j.
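
A minimal sketch of this ranking step as stated in the caption, assuming the SBERT sentence embeddings have already been computed (e.g., with the sentence-transformers package); the function name is illustrative.

```python
import numpy as np

def rank_sentences(embeddings):
    """embeddings: array of shape (n, d), one SBERT embedding per sentence."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = e @ e.T                                  # cosine similarity between all sentence pairs
    scores = np.sum(1.0 - sim, axis=1)             # rank(s_i) = sum_j (1 - sim(e_i, e_j))
    return np.argsort(-scores)                     # sentence indices, highest-ranked first
```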

Figure

Figure 7: ROUGE-1 on CNN/DM for k sampled candidates at inference time, with k ∈ {1, . . . , 15}. SR stands for SummaReranker, BS and DBS refer to beam search and diverse beam search, respectively.

Figure

Figure 4: Example of a summary generated by SummaReranker trained for {R-1, R-2, R-L} on CNN/DM. The sentence in green is included in the SummaReranker summary, while the one in red is discarded.

Figure

Figure 2: Expert utilization for a base PEGASUS with SummaReranker optimized with {R-1, R-2, R-L, BS, BaS} on CNN/DM, with 10 experts.

Figure

Figure 8: ROUGE-1 on XSum for k sampled candidates at inference time, with k ∈ {1, . . . , 15}. SR stands for SummaReranker, BS and DBS refer to beam search and diverse beam search, respectively.

Figure

Figure 5: Human evaluation results on all three datasets. Black vertical bars are standard deviation across human raters.

Figure

Figure 3: Best summary candidate recall with 15 diverse beam search candidates for PEGASUS on all three datasets. SR denotes SummaReranker. Dotted lines are random baselines, and dashed lines correspond to the base PEGASUS.

Figure

Figure 1: SummaReranker model architecture, optimizing N metrics. The summarization metrics here (ROUGE-1, ROUGE-2, ..., BARTScore) are displayed as examples.

Figure

Figure 6: Novel n-grams with PEGASUS, across all datasets and with beam search and diverse beam search.

Figure

Figure 9: ROUGE-1 on Reddit TIFU for k sampled candidates at inference time, with k ∈ {1, . . . , 15}. SR stands for SummaReranker, BS and DBS refer to beam search and diverse beam search, respectively.

Figure

Figure 1: Experimental Upper bounds of our sentence regression framework and existing sentence regression framework.

Figure

Figure 2: Human evaluation scores of the summaries averaged across the test set for the summarization models (sorted by EGFB). See §A.1 for details.

Figure

Figure 1: Distribution of system ranks with bootstrapped sampling showing significant differences between systems

Figure

Figure 4: Pairwise Kendall’s Tau correlations for all automatically computed features.

Figure

Figure 3: Kendall’s Tau correlation between automatically computed features and experts’ evaluations of EGFB values and the 8 boolean attributes described in Section §2.3 (converted to numerical judgments).

Figure

Figure 6: Examples of graph-based meaning representations parsed from sentences of documents and generated summaries.

Figure

Figure 1: Example of (a) a document, (b) a summary, and (c) the corresponding document and (d) summary graph-based meaning representations. The summary graph does not contain the "consider" node, indicating a factual error (red dashed edge).

Figure

Figure 5: (a) AMR and (b) dependency representations for the summary “police have appealed for help in tracing a woman who has been missing for six years.”

Figure

Figure 3: Variation in partial Pearson correlation when omitting error types. Higher variation indicates greater influence of an error type in the overall correlation.

Figure

Figure 4: An example of a document, its generated summary and factuality predictions for word pairs, based on the dependency graph (DAE) versus AMR graph (FACTGRAPH-E). +/− means the predicted label for that edge.

Figure

Figure 2: Overview of FACTGRAPH. Sentence-level summary and document graphs are encoded by the graph encoder with structure-aware adapters. Text and graph encoders use the same pretrained model, and only the adapter parameters are trained.

Figure

Figure 1: Model generated and reference summaries used for human evaluation. Words in orange correspond to incorrect or repeated information.

Figure

Figure 1: Comparison of how models adapt to target lengths from zero-shot to low-resource cases. We plot the average summary lengths for different models. We report results on XSum, similar patterns were found on CNN/DailyMail and SAMSum.

Figure

Figure 1: Architecture of the HiStruct+ model. The model consists of a base TLM for sentence encoding and two stacked inter-sentence Transformer layers for hierarchical contextual learning with a sigmoid classifier for extractive summarization. The two blocks shaded in light-green are the HiStruct injection components.

Figure

Figure 3: Two samples for human evaluation and case analysis of the extractive summaries predicted by the HiStruct+ model and the baseline model, in comparison with the gold summary (i.e., the abstract of the paper). The first sample is selected from the arXiv dataset, while the second sample is from PubMed. Top-7 sentences with the highest predicted scores are extracted, and then combined in their original order to construct a final summary. Their linear indices within the original document are shown in the second row of each table. The texts highlighted in yellow are the key words and the main content that appear in the gold summary. The phrases highlighted in green indicate typical parts of a scientific paper such as summary and future work. Sentences are split by ’<q>’.

Figure

Figure 2: Proportions of the extracted sentences at each linear position. The x-axis values are linear sentence indices, the y-axis values are percentages of the extracted sentences. In this figure, only the first 25 sentence indices are included due to space limitation.

Figure

Figure 3: (a) A network diagram for the NNLM decoder with additional encoder element. (b) A network diagram for the attention-based encoder enc3.

Figure

Figure 2: Example input sentence and the generated summary. The score of generating y_{i+1} (terrorism) is based on the context y_c (for . . . against) as well as the input x_1 . . . x_18. Note that the summary generated is abstractive, which makes it possible to generalize (russian defense minister to russia) and paraphrase (for combating to against), in addition to compressing (dropping the creation of); see Jing (2002) for a survey of these editing operations.

Figure

Figure 1: Example output of the attention-based summarization (ABS) system. The heatmap represents a soft alignment between the input (right) and the generated summary (top). The columns represent the distribution over the input after generating each word.

Figure

Figure 4: Example sentence summaries produced on Gigaword. I is the input, A is ABS, and G is the true headline.

Figure

Figure 1: Task template for our user study.

Figure

Figure 2: Example QBS for topic Airport Security

Figure

Figure 1: Aggregated human preference judgements across the same 400 instances measured in Table 3. The blue bars show preferences, the red bars show no preference.

Figure

Figure 1: Stemmed word frequencies for reference summary set d30001t from duc04: averaged across all reference summaries and for single reference summaries.

Figure

Figure 3: Orange crosses show the objective score optimized by exhaustive search minus the objective score optimized by FCHC. Blue pluses show the ROUGE-L difference between exhaustive search and FCHC. Plotted for the 1135 instances in the headline generation test set, where the source sentence has 30 words or fewer.

Figure

Figure 2: ROUGE F1 scores on the test set of headline generation for Lead-N and Lead-P baselines with different number n and percentage p of leading words.

Figure

Figure 1: Summarizing a sentence x by hill climbing. Each row is a Boolean vector a_t at search step t. A black cell indicates that a word is selected, and vice versa. Randomly swapping two values in the Boolean vector yields a new summary that is scored by an objective function measuring language fluency and semantic similarity. If the new summary increases the objective, it is accepted as the current best solution. Rejected solutions are not depicted.
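
A minimal sketch of this accept-if-better search; the objective combining fluency and semantic similarity is left as a caller-supplied function, and the step budget and function names are arbitrary choices, not the paper’s exact procedure.

```python
import random

def hill_climb(words, objective, summary_len, steps=1000, seed=0):
    rng = random.Random(seed)
    mask = [True] * summary_len + [False] * (len(words) - summary_len)
    rng.shuffle(mask)                                 # random initial Boolean vector
    best = objective([w for w, m in zip(words, mask) if m])
    for _ in range(steps):
        i, j = rng.sample(range(len(words)), 2)
        if mask[i] == mask[j]:
            continue                                  # swapping equal values changes nothing
        mask[i], mask[j] = mask[j], mask[i]           # propose a swap of two entries
        score = objective([w for w, m in zip(words, mask) if m])
        if score > best:
            best = score                              # accept the improved summary
        else:
            mask[i], mask[j] = mask[j], mask[i]       # reject and revert
    return [w for w, m in zip(words, mask) if m], best
```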

Figure

Figure 4: Positional bias for different systems, calculated for the headline generation test set. The source sentence is divided into four areas: 0–25%, 25–50%, 50–75%, and 75-100% of the sentence. The y-axis shows the normalized frequency of how often a word in the summary is extracted from one of the four source sentence areas.

Figure

Figure 4: Pearson correlation with humans on SummEval w.r.t. the QG beam size.

Figure

Figure 2: Variation of the Pearson correlations between various metrics and humans, versus the number of references available. QUESTEVAL is constant, since it is independent from the references.

Figure

Figure 3: Distribution of the log probabilities of answerability – i.e. log(1 − QA( |T, q)) – for two QA models. 1) solid lines: a model trained on SQuADv2 without the negative sampled examples. 2) dashed lines: a model trained on SQuAD-v2 with the negative sampled examples. The evaluated samples belong to three distinct categories: 1) answerable, 2) unanswerable questions (but present in SQuAD-v2) and 3) the negatively sampled ones (as described in §5.1).

Figure

Figure 1: Illustration of the QUESTEVAL framework: the blue area corresponds to the precision-oriented framework proposed by Wang et al. (2020). The orange area corresponds to the recall-oriented SummaQA (Scialom et al., 2019). We extend it with a weighter component for an improved recall (red area). The encompassing area corresponds to our proposed unified approach, QUESTEVAL.

Figure

Figure 4: Coverage eliminates undesirable repetition. Summaries from our non-coverage model contain many duplicated n-grams while our coverage model produces a similar number as the reference summaries.

Figure

Figure 1: Comparison of output of 3 abstractive summarization models on a news article. The baseline model makes factual errors, a nonsensical sentence and struggles with OOV words muhammadu buhari. The pointer-generator model is accurate but repeats itself. Coverage eliminates repetition. The final summary is composed from several fragments.

Figure

Figure 3: Pointer-generator model. For each decoder timestep a generation probability pgen ∈ [0,1] is calculated, which weights the probability of generating words from the vocabulary, versus copying words from the source text. The vocabulary distribution and the attention distribution are weighted and summed to obtain the final distribution, from which we make our prediction. Note that out-of-vocabulary article words such as 2-0 are included in the final distribution. Best viewed in color.
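
The weighted sum described here can be written out in a few lines; the extended-vocabulary indexing of out-of-vocabulary source words and the function name are assumptions about the layout, a sketch rather than the paper’s implementation.

```python
import numpy as np

def final_distribution(p_gen, vocab_dist, attention, source_ids, extended_vocab_size):
    """vocab_dist: probabilities over the fixed vocabulary; attention: weights over source
       positions; source_ids: extended-vocab id of each source token (OOVs get extra ids)."""
    final = np.zeros(extended_vocab_size)
    final[:len(vocab_dist)] = p_gen * vocab_dist            # generation part
    for pos, token_id in enumerate(source_ids):
        final[token_id] += (1.0 - p_gen) * attention[pos]   # copy part, incl. OOV source words
    return final
```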

Figure

Figure 6: Although our best model is abstractive, it does not produce novel n-grams (i.e., n-grams that don’t appear in the source text) as often as the reference summaries. The baseline model produces more novel n-grams, but many of these are erroneous (see section 7.2).

Figure

Figure 9: The baseline model incorrectly substitutes dutch for new zealand (perhaps reflecting the European bias of the dataset), fabricates irish, and struggles with out-of-vocabulary words saili and aucklandbased. Though it is not clear why, the phrase addition to our backline is changed to the nonsensical addition to their respective prospects. The pointer-generator model fixes these accuracy problems, and the addition of coverage fixes the repetition problem. Note that the final model skips over large passages of text to produce shorter sentences.

Figure

Figure 13: The baseline model appropriately replaces stumped with novel word mystified. However, the reference summary chooses flummoxed (also novel) so the choice of mystified is not rewarded by the ROUGE metric. The baseline model also incorrectly substitutes 600,000 for 25. In the final model’s output we observe that the generation probability is largest at the beginning of sentences (especially the first verb) and on periods.

Figure

Figure 8: The baseline model reports the wrong score 6-3, substitutes bedene for thiem and struggles with the uncommon word assimilation. The pointer-network models accurately reproduce the outof-vocabulary words thiem and aljaz. Note that the final model produces the novel word defeated to incorporate several fragments into a single sentence.

Figure

Figure 10: In this example, both our baseline model and final model produce a completely abstractive first sentence, using a novel word beat.

Figure

Figure 15: The baseline model fabricates a completely false detail about a u.n. peacekeeping force that is not mentioned in the article. This is most likely inspired by a connection between U.N. peacekeeping forces and northern sinai in the training data. The pointer-generator model is more accurate, correctly reporting the reshuffle of several senior military positions.

Figure

Figure 14: The baseline model incorrectly changes thwart criminals and others contributing to nigeria’s instability to destabilize nigeria’s economy – which has a mostly opposite meaning. It also produces a nonsensical sentence. Note that our final model produces the novel word says to paraphrase told cnn ‘s christiane amanpour.

Figure

Figure 5: Examples of highly abstractive reference summaries (bold denotes novel words).

Figure

Figure 12: Baseline model replaces cecily strong with mariah carey, and produces generally nonsensical output. The baseline model may be struggling with the out-of-vocabulary word beetlejuice, or perhaps the unusual non-news format of the article. Note that the final model omits – ever so slightly – from its first sentence.

Figure

Figure 2: Baseline sequence-to-sequence model with attention. The model may attend to relevant words in the source text to generate novel words, e.g., to produce the novel word beat in the abstractive summary Germany beat Argentina 2-0 the model may attend to the words victorious and win in the source text.

Figure

Figure 11: The baseline model makes several factual inaccuracies: it claims porto beat bayern munich not vice versa, the score is changed from 7-4 to 2-0, jackson is changed to james and a heroes reception is replaced with a trophy. Our final model produces sentences that are individually accurate, but they do not make sense as a whole. Note that the final model omits the parenthesized phrase ( left ) from its second sentence.

Figure

Figure 7: Examples of abstractive summaries produced by our model (bold denotes novel words).

Figure

Figure 1: Exploring scaling factor β

Figure

Figure 1: The weighted centroid embedding of text T = {t1, t2, ..., tn}
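
A minimal sketch of forming such a weighted centroid; the particular weighting scheme and function name are assumptions rather than the figure’s exact definition.

```python
import numpy as np

def weighted_centroid(token_embeddings, weights):
    """token_embeddings: array (n, d), one embedding per token t_i; weights: array (n,)."""
    w = np.asarray(weights, dtype=float)
    # weighted average of the token embeddings
    return (w[:, None] * np.asarray(token_embeddings)).sum(axis=0) / w.sum()
```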

Figure

Figure 1: R-1 scores of a few systems, evaluated against the 50-word reference set of DUC 01. Systems R, S and T are from DUC 01; ICSISumm is a later competitive system (Gillick et al., 2008).

Figure

Figure 1: Average Pearson and Spearman correlations with Pyramid scores as a function of number of SCUs evaluated per topic, on the DUC ’05 and ’06 data.

Figure

Figure 2: Average Pearson and Spearman correlations with Pyramid scores as a function of number of topics used for evaluation, on the DUC ’05 and ’06 data.

Figure

Figure 5: Sample summaries for an NYT article. Our model with coherence reward overlaps the most with human summary (green is ours, blue denotes human).

Figure

Figure 2: Our proposed entity-driven abstractive summarization framework. Entity-aware content selector extracts salient sentences and abstract generator produces informative and coherent summaries. Both components are connected using reinforcement learning.

Figure

Figure 9: Sample summaries for a CNN/DM article. Our models are able to capture important information.

Figure

Figure 6: Sample summaries for an NYT article. Our models capture salient information which is missed by comparisons. Numbers are replaced with “0”.

Figure

Figure 3: Our proposed entity-aware content selector. Arrows denote attention, with darker color representing higher weights.

Figure

Figure 1: Sample summary of an article from the New York Times corpus (Sandhaus, 2008). Mentions of the same entity are colored. The underlined sentence in the article occurs at a relatively earlier position in the summary.

Figure

Figure 7: Sample summaries for an NYT article. The comparison model contains grammatical errors; our model is more coherent and contains less redundant information. Numbers are replaced with “0”.

Figure

Figure 8: Sample summaries for a CNN/DM article. Our model overlaps the most with the human summaries.

Figure

Figure 4: Accuracy of our coherence model compared to different baselines and Wu and Hu (2018) on PAIRWISE and SHUFFLE test sets.

Figure

Figure 5: Examples of summaries produced by GPG. Each two samples from CNN/DM, Gigaword and XSum (up to down). bold denotes novel words and their pointed source tokens. Bracketed numbers are the pointing probability (1− pgen) during decoding.

Figure

Figure 3: Test perplexity when increasing k

Figure

Figure 1: Alignment visualization of our model when decoding “closes”. Posterior alignment is more accurate for model interpretation. In contrast, the prior alignment probability is spared to “announced” and “closure”, which can be manually controlled to generate desired summaries. Decoded samples are shown when aligned to “announced” and “closure” respectively. Highlighted source words are those that can be directly aligned to a target token in the gold summary.

Figure

Figure 6: Examples of generated summaries. Examples are taken from CNN/DM, Gigaword and XSum (from up to down). Darker means higher pointing probability.

Figure

Figure 2: Architecture of the generalized pointer. The same encoder is applied to encode the source and target. When decoding “closes”, we first find top-k source positions with the most similar encoded state. For each position, the decoding probability is computed by adding its word embedding and a predicted relation embedding.

Figure

Figure 4: Pointing Ratio of the standard pointer generator and GPG (evaluated on the test data). GPG enables the point mode more often, but quite a few pointed tokens are edited rather than simply copied.

Figure

Figure 2: Examples of summaries generated by SummVD and different baselines exposed in §4.2 on a same article belonging to CNN/DM corpus.

Figure

Figure 1: SummVD Pipeline illustrating the sequence of operations needed to achieve an extractive summary from a given text document.

Figure

Figure 3: Average time to compute a summary, against the number of input words for SummVD and TextRank (gensim implementation). Time is in logarithmic scale.

Figure

Figure 4: Heatmap when removing the penalization term. We can see s[0] does not receive attention at all. Best viewed in color.

Figure

Figure 2: An instance of our contrastive samples. Given an annotated (D, Ŝ), we randomly discard a summary sentence ŝ_j′ and fill d_i′1 and d_i′2 to form the contrastive pair. d_i′1 has the highest ROUGE score with ŝ_j′; d_i′2 is randomly sampled.

Figure

Figure 3: Example of attention heatmap between document sentences (rows) and gold summary sentences (columns). s[0]: The illusion is the work of German body-painting artist Joerg Duesterwald, who spent hours painting his model. s[1]: Stunning set of pictures was taken in front of a rockface in a forest in Langenfeld, Germany, yesterday. Best viewed in color.

Figure

Figure 1: The Architecture of the Hybrid MemNet Model

Figure

Figure 1: Overview of our approach to create selfsupervised pre-training datasets from unlabelled scientific documents. The aspect-based summarization model is pre-trained on unlabelled documents, the section headings as aspects, and the following paragraphs corresponding to the aspects as aspect-based summaries.

Figure

Figure 2: Histogram of 50 most frequent aspects in the self-supervised samples (top: PubMed⋆, bottom: FacetSum⋆). PubMed⋆ has [150K,1.4K,214,33] unique aspects with frequency of higher than [1,10,100,1000] (FacetSum⋆:[96K,841,120,21]). Aspects removed from the NoOverlap datasets are highlighted in red.

Figure

Figure 3: Aspect-based summarization performance with limited supervised examples. Pre-training with in-domain and out-of-domain datasets significantly improves the low-resource training sample performance. Top: evaluation done on PubMed dataset, Bottom: evaluation is done on FacetSum dataset. ( —– BART , –•– BART + pre-trained on PubMed⋆, –×– BART + pretrained on FacetSum⋆, - - - BART fine-tuned on all samples)

Figure

Figure A.4: Effect of each variable on HaRiM. ∆ represents ps2s − plm. The bottom-right panel shows the effect of replacing the auxiliary LM probability with empty-sourced decoder inference (HaRiMlmless). Figure 1 plots each article-summary pair as a datapoint; here each token of the decoded output is a datapoint.

Figure

Figure A.3: Pearson’s ρ correlation between metric scores on the FRANK-BBC/XSUM split. The higher the correlation, the more similar the metrics’ behavior.

Figure

Figure A.6: Graphical model representation of the factors that affect metric (M)–human (H) correlation. A is the graphical model that supports the use of partial correlation, as argued in (Pagnoni et al., 2021). B is the graphical model reflecting our argument that the correlation should be measured while ignoring the effect of the generation system (S), whose influence is screened off by the observed child node, the text.

Figure

Figure A.1: Permutation test done for metric scores on FRANK-CNN/DM. 1 (filled grid) represents significant difference in metric performance, 0 represents negligible difference with confidence >=.95 (p <= 0.05), i.e. HaRiM is significantly more correlated to human judgements than all the other metrics except itself with a confidence of >=95%.

Figure

Figure A.7: Averaged experts’ judgements vs. averaged Turkers’ judgements on SummEval (datapoints are outputs from abstractive summarization models)

Figure

Figure 1: Effects of replacing the auxiliary language model (q(yi|y<i)) with an empty-sourced encoder-decoder model (p(yi|y<i; {})). Left compares the values of plm, and Right compares the HaRiM values. The values are calculated on the summary-article pairs in the FRANK benchmark. The high correlation of HaRiM suggests that the effect of the replacement is minimal.

Figure

Figure A.2: Pearson’s ρ correlation between metric scores on the FRANK-CNN/DM split. The higher the correlation, the more similar the metrics’ behavior. Red boxes highlight a notable observation: an unexpected behavioral similarity between metrics.

Figure

Figure A.8: Averaged experts’ judgements vs. averaged Turkers’ judgements on SummEval (datapoints are outputs from extractive summarization models)

Figure

Figure 2: Factuality label counts from FRANK benchmark. Legend shows the value of factuality annotation, varying from 0 (unfactual) to 1 (factual). The factuality labels for XSUM corpus are almost binary.

Figure

Figure A.5: Boxplot of HaRiM and log-likelihood scales, varying with the evaluated summarizer weights. base+cnn: BART-base fine-tuned on CNN/DailyMail, brio: BRIO (Meng et al., 2021), large+cnn: BART-large fine-tuned on CNN/DailyMail, large+cnn+para: further fine-tuned checkpoint of the previous model on the ParaBank2 corpus as suggested in (Yuan et al., 2021).

Figure

Figure 3: System architectures for ‘Struct+2Way+Word’ (left) and ‘Struct+2Way+Relation’ (right). βt,i (left) measures the structural importance of the i-th source word; βt,i (right) measures the saliency of the dependency edge pointing to the i-th source word. gep,i is the structural embedding of the parent. In both cases δt,i replaces αt,i to become the new attention value used to estimate the context vector ct.

Figure

Figure 2: System architectures for ‘Struct+Input’ (left) and ‘Struct+Hidden’ (right). A critical question we seek to answer is whether the structural embeddings (sei ) should be supplied as input to the encoder (left) or be exempted from encoding and directly concatenated with the encoder hidden states (right).

Figure

Figure 1: An example dependency parse tree created for the source sentence in Table 1. If important dependency edges such as “father← had” can be preserved in the summary, the system summary is likely to preserve the meaning of the original.

Figure

Figure 1: f_decoder^tree (top) consumes the partial tree representations at time t one by one to build the hidden representation h_t^T; f_decoder^seq (bottom) consumes the embeddings of summary words to build the partial summary representation h_t^y.

Figure

Figure 2: F-scores of systems on preserving relations of reference summaries (top) and source texts (bottom). We vary the threshold from 1.0 (strict match) to 0.7 in the x-axis to allow for strict and lenient matching of dependency relations.

Figure

Figure 1: An illustration of our CopyTrans architecture. The self-attention mechanism allows (i) a source word to attend to lower-level representations of all source words (including itself) to build a higher-level representation for it, and (ii) a summary word to attend to all source words, summary words prior to it, as well as the token at the current position (‘MASK’) to build a higher-level representation.

Figure

Figure 2: Effectiveness of position-aware beam search (§3.1). A larger beam tends to give better results.

Figure

Figure 1: An illustration of the generation process. A sequence of placeholders (“[MASK]”) are placed following the source text. Our model simultaneously predicts the most probable tokens for all positions, rather than predicting only the most probable next token in an autoregressive setting. We obtain the token that has the highest probability, and use it to replace the [MASK] token of that position. Next, the model makes new predictions for all remaining positions, conditioned on the source text and all summary tokens seen thus far. Our generator produces a summary having the exact given length and with a proper endpoint.
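
A schematic of the length-controlled filling loop described above; `predict_all_positions` is an assumed placeholder for the model's forward pass, so this is a sketch of the procedure rather than the authors' code.

MASK = "[MASK]"

def fill_masks(source_tokens, length, predict_all_positions):
    """Start from `length` [MASK] placeholders appended to the source. At each step the
    model scores every remaining masked position; we commit the single most probable
    (position, token) pair, then re-predict conditioned on everything decided so far.
    `predict_all_positions(source, summary)` is assumed to return
    {position: (best_token, probability)} for the still-masked positions."""
    summary = [MASK] * length
    while MASK in summary:
        preds = predict_all_positions(source_tokens, summary)
        pos = max(preds, key=lambda p: preds[p][1])    # most confident remaining position
        summary[pos] = preds[pos][0]                   # fix that token, keep the rest masked
    return summary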

Figure

Figure 2: Strong position bias can cause the abstractor to use only content at the beginning of the input to generate a summary. By exposing the chunks progressively, our approach makes use of this characteristic to consolidate information from multiple transcript chunks.

Figure

Figure 1: An example of a grounded summary where spans of summary text are tethered to the original audio. The user can tap to hear the audio clip, thus interpreting a system-generated summary in context.

Figure

Figure 6: The n-gram abstractiveness and percentage of novel n-grams metrics, across different n-gram orders, on TLDRHQ’s test set. As seen, BART generates more abstractive summaries than BERTSUMABS, narrowing the gap between BERTSUMABS and the ground-truth summaries.

Figure

Figure 3: S score inter-rater agreement for annotation without context (left), and annotation with context (right)

Figure

Figure 4: The proportion of instances containing a TLDR in the TLDR9+ dataset. As seen, the number of TLDRs is increasing each year. At the time of conducting this research, the submission data dumps were only partially uploaded for 2021 (until 2021-06), and no comment dumps for 2021 had been uploaded to the Pushshift repository.

Figure

Figure 2: The proportion of TLDRs over all posts (submissions and comments) submitted per year (Figures (c) and (d)). At the time of writing this paper, submission dumps were only partly uploaded for 2021 (until 2021-06), and no comment dumps for 2021 had been uploaded.

Figure

Figure 5: Heatmaps of TLDRHQ showing (a) the oracle sentence’s importance to its relative position; (b) percentage of novel n-grams; and (c) n-gram abstractiveness. The heat extent shows the number of the instances within the specific bin.

Figure

Figure 7: A sample from TLDRHQ test set along with the model generated summaries. Underlined text in source shows the important regions of the source for generating TLDR summary.

Figure

Figure 1: An example Reddit post with TLDR summary. As seen, the TLDR summary is extremely short, and highly abstractive.

Figure

Figure 3: Detailed illustration of our summarization framework. Task-1 (t1): source sentence extraction (right-hand gray box). Task-2 (t2): introductory sentence extraction (left-hand gray box). As shown, the identified salient introductory sentences at training stages are incorporated into the representations of source sentences by the Select(·) function (orange box) with k = 3. Plus sign shows the concatenation layer. The feed-forward neural network is made of one linear layer.

Figure

Figure 4: (a) Our system’s generated summary, (b) Sentence graph visualization of our system’s generated summary. Green and gray nodes are introductory and non-introductory sentences, respectively. Edge thickness denotes the strength of the ROUGE score between pairs of sentences. Parts from which sentences are sampled are shown inside brackets. The summary is truncated due to space limitations. Ground-truth summary-worthy sentences are underlined, and colored spans show pointers from introductory to non-introductory sentences.

Figure

Figure 1: A truncated human-written extended summary. Top box: introductory information, bottom box: non-introductory information. Colored spans are pointers from introductory sentences to associated non-introductory detailed sentences.

Figure

Figure 2: Our model uses introductory sentences as pointers to the source sentences. It then forms the final extended summary by extracting salient sentences from the source. Highlights in red show the salient parts.

Figure

Figure 3: Time spent on annotation (in minutes) vs. correlation with the full-sized score. We gather annotation times in buckets with a width of ten minutes and show the 95% confidence interval for each bucket.

Figure

Figure 4: Relation of type I error rates at p < 0.05 to the total number of annotators for different designs, all with 100 documents and 3 judgements per summary. We conduct the experiment with both the t-test and approximate randomization test (ART). We show results both with averaging results per document and without any aggregation. We run 2000 trials per design. The red line marks the nominal error rate of 0.05.

Figure

Figure 6: Reliabilities of nested vs. crossed designs for Rank and Likert for both tasks.

Figure

Figure 8: Screenshots of the Annotator Instructions.

Figure

Figure 9: Screenshots of the Annotation Interfaces.

Figure

Figure 2: Score distribution of Likert for both tasks. Each data point shows the number of times a particular score was assigned to each system.

Figure

Figure 5: Power for 100 documents and 3 judgements per summary with different number of total annotators.

Figure

Figure 7: Power for p < 0.05 of nested and crossed designs for ARTagg and regression. X-axis shows the number of judgements elicited, Y-axis the power-level.

Figure

Figure 1: Schematic representation of our study design. Rows represent annotators, columns documents. Each blue square corresponds to a judgement of the summaries of all five systems for a document. Every rectangular group of blue squares forms one block.

Figure

Figure 4: Ranking accuracy between shuffled and original summaries of different lengths (in characters). We sample 10,000 pairs and group them in buckets of 20 characters and clamp differences between -200 and 200.

Figure

Figure 6: Histograms of the lengths of summaries generated by the summarizers in SummEval and their mean lengths. Both in characters.

Figure

Figure 3: Bias matrix for BAS with specific analysis for BART and Pegasus. The upper triangular matrix indicates τ+ for the given summarizer pair, the lower τ−. The area of each circle is proportional to the number of pairs in H+/H− for the cell. To read off the behaviour of the CM on a specific summarizer, we follow both the corresponding row and column. A high score in the row, combined with a low score in the corresponding cell in the column implies the CM is biased towards generations by this particular summarizer.

Figure

Figure 1: Distribution of human coherence scores for the 17 systems in the SummEval dataset. The red dots indicate the mean score of each system.

Figure

Figure 5: Intra-system correlations of the best CMs as well as the human upper bound on the SummEval dataset. Bars indicate 95% confidence intervals determined by bootstrap resampling with 1000 samples.

Figure

Figure 2: Bias Matrices for the best CMs. We also show the bias matrix for the architecture confounder for reference. See Figure 3 for a brief tutorial to bias matrix analysis.

Figure

Figure 7: Summary quality as a function of metric optimized and amount of optimization, using best-of-N rejection sampling. We evaluate ROUGE, our main reward models, and an earlier iteration of the 1.3B model trained on approximately 75% as much data (see Table 11 for details). ROUGE appears to peak both sooner and at a substantially lower preference rate than all reward models. Details in Appendix G.3.
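
Best-of-N rejection sampling is simple to state; the sketch below assumes placeholder `sample_summary` and `reward_model` callables and is only meant to make the procedure concrete, not to reproduce the paper's setup.

def best_of_n(prompt, sample_summary, reward_model, n=16):
    """Best-of-N rejection sampling: draw N candidate summaries from the policy and
    keep the one the reward model scores highest. `sample_summary(prompt)` and
    `reward_model(prompt, summary)` are assumed interfaces for the policy and the
    learned reward model."""
    candidates = [sample_summary(prompt) for _ in range(n)]
    return max(candidates, key=lambda s: reward_model(prompt, s))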

Figure

Figure 1: Fraction of the time humans prefer our models’ summaries over the human-generated reference summaries on the TL;DR dataset. Since quality judgments involve an arbitrary decision about how to trade off summary length vs. coverage within the 24-48 token limit, we also provide length-controlled graphs in Appendix F; length differences explain about a third of the gap between feedback and supervised learning at 6.7B.

Figure

Figure 3: Evaluations of four axes of summary quality on the TL;DR dataset.

Figure

Figure 4: Transfer results on CNN/DM. (a) Overall summary quality on CNN/DM as a function of model size. Full results across axes shown in Appendix G.2. (b) Overall scores vs. length for the 6.7B TL;DR supervised baseline, the 6.7B TL;DR human feedback model, and T5 fine-tuned on CNN/DM summaries. At similar summary lengths, our 6.7B TL;DR human feedback model nearly matches T5 despite never being trained to summarize news articles.

Figure

Figure 5: Preference scores versus degree of reward model optimization. Optimizing against the reward model initially improves summaries, but eventually overfits, giving worse summaries. This figure uses an earlier version of our reward model (see rm3 in Appendix C.6). See Appendix H.2 for samples from the KL 250 model.

Figure

Figure 6: Reward model performance versus data size and model size. Doubling amount of training data leads to a ~1.1% increase in reward model validation accuracy, whereas doubling the model size leads to a ~1.8% increase. The 6.7B model trained on all data begins approaching the accuracy of a single human.

Figure

Figure 1: The framework of QFS-BART. The QA module calculates the answer relevance scores, and we incorporate the scores as explicit answer relevance attention to the encoder-decoder attention.

Figure

Figure 4: Screenshot of Content Support Task.

Figure

Figure 3: Screenshot of Best-Worst Scaling Task.

Figure

Figure 1: Overview of the OPINIONDIGEST framework.

Figure

Figure 2: Sensitivity analysis on hyper-parameters. Above row: Top-k opinion (k) vs merging threshold (θ); Bottom row: Top-k opinion (k) vs max token size (L).

Figure

Figure 5: Screenshot of Aspect-Specific Summary Task.

Figure

Figure 2: Cosine similarities between summaries generated by lead systems and reference in embedding space on the CNN/DailyMail test set.

Figure

Figure 2: Truncated articles lead to performance improvement for max, avg and InferSent representation.

Figure

Figure 1: ROUGE recall, precision and F1 scores for lead, random, textrank and Pointer-Generator on the CNN/DailyMail test set.

Figure

Figure 3: This scatter plot shows the human coverage score and embedding similarity on DUC2001. The baseline system is shortened to ‘b’.

Figure

Figure 1: For each number of articles, we sample and compute the correlation 50 times and plot the average as well as the standard deviation. The decreasing size of the error bars shows that enough articles are provided for each system, and that this is not the reason for the performance discrepancy between DUC2001 and DUC2002.

Figure

Figure 2: Process for dataset creation. In this example, two source sentences, s1 and s2, are selected, and the hypothesis sentence is generated by a sentence fusion operation. The selected sentences are considered error-originating if the hypothesis sentence poses an error. These source sentences are referred to as corresponding sentences.

Figure

Figure 6: Comparison of FactCCX and SumPhrase outputs. The sentence with a blue underline was identified as an error-corresponding sentence by SumPhrase, whereas the span with a red underline was localized by FactCCX. The dependency relation between the red phrases was determined as erroneous by SumPhrase.

Figure

Figure 4: Example of phrase-level labels. The blue and red edges denote consistent and inconsistent labels, respectively, whereas the green edge indicates that the inter-phrase relation is unlabeled.

Figure

Figure 3: Motivating example for phrase-level labeling.

Figure

Figure 1: Overview of proposal. A synthetic dataset is created and used as weak supervision to train the SumPhrase error localization model.

Figure

Figure 5: SumPhrase model. The blue and red frames display the intra-phrase and inter-phrase detection parts, respectively, and the green frame indicates the corresponding sentence localization part.

Figure

Figure 4: Attention heatmap when generating the example summary. Ii and Oi indicate the i-th sentence of the input and output, respectively.

Figure

Figure 1: The framework of summary-to-headline generation.

Figure

Figure 1: Hierarchical encoder-decoder framework and comparison of the attention mechanisms.

Figure

Figure 4: Headlines generated by lead-flat-att and summ-hieratt for two examples from the NYT test data. S1 indicates the summary extracted by the Lead method.

Figure

Figure 3: Examples of generated summaries.

Figure

Figure 3: Heat map of the distribution of summary-level attention weights when generating every word for the example in Figure 2. Darker color indicates higher weight.

Figure

Figure 2: An example of headline generation from NYT test data. G is the true headline, L is the output of lead-flat-att, O is the output of our approach summ-hieratt. S1 to S5 are the document-level summaries, and each summarization method is indicated in “[]” at the end.

Figure

Figure 2: Results of different settings of hyperparameters tested on 500 samples from the DailyMail test set.

Figure

Figure 3: Proportions of model outputs that get a human score ≥ 4. For example, around 95% of summaries by Weak-Sup (ours) are scored 4 or 5 in terms of accuracy.

Figure

Figure 2: Visualizing the ROUGE-1 results in Table 2. The green dashed line marks the performance of BART fine-tuned on the whole MA-News training set.

Figure

Figure 1: Illustration of our approach. Left: Constructing weak supervisions using ConceptNet, including (1) extracting aspects and (2) synthesizing aspect-based summaries. Right: Augmenting aspect information, including (3) identifying aspect related words in the document using Wikipedia and (4) feeding both aspect and related words into summarization model.

Figure

Figure 3: A sample summary comparison on the Multi-News dataset. OTExtSum based summary sentences are

Figure

Figure 1: Illustration of Optimal Transport Extractive Summariser (OTExtSum): the formulation of extractive summarisation as an optimal transport (OT) problem. Optimal sentence extraction is conceptualised as obtaining the optimal extraction vector m∗, which achieves an OT plan from a document D to its optimal summary S∗ that has the minimum transportation cost. Such a cost is defined as the Wasserstein distance between the document’s semantic distribution TFD and the summary’s semantic distribution TFS and is used to measure the summary’s semantic coverage.
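
To make the scoring idea concrete, here is a small numpy sketch that approximates the transport cost between a document's and a summary's term-frequency distributions with entropically regularized Sinkhorn iterations; the cost matrix built from one minus cosine similarity of token embeddings is an assumption for illustration, not necessarily the paper's exact choice.

import numpy as np

def sinkhorn_cost(tf_doc, tf_sum, cost, reg=0.1, n_iter=200):
    """Approximate optimal transport cost between two term-frequency distributions
    (nonnegative vectors summing to 1) using Sinkhorn iterations with entropic
    regularization. `cost` holds pairwise token distances."""
    K = np.exp(-cost / reg)
    u = np.ones_like(tf_doc)
    for _ in range(n_iter):
        v = tf_sum / (K.T @ u)
        u = tf_doc / (K @ v)
    plan = np.diag(u) @ K @ np.diag(v)      # approximate transport plan
    return float((plan * cost).sum())

def summary_coverage_cost(doc_tf, sum_tf, token_embs):
    """Cost matrix from token embeddings: 1 - cosine similarity (an illustrative choice)."""
    e = token_embs / np.linalg.norm(token_embs, axis=1, keepdims=True)
    cost = 1.0 - e @ e.T
    return sinkhorn_cost(doc_tf, sum_tf, cost)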

Figure

Figure 4: A sample summary comparison on the BillSum dataset. OTExtSum based summary sentences are

Figure

Figure 7: This is how our task will look to Mechanical Turk Workers.

Figure

Figure 2: Interpretable visualisation of the OT plan from a source document to a resulting summary on the CNN/DM dataset. The higher the intensity, the more the semantic content of a particular document token is covered by a summary token. The purple line highlights the transportation from the document to the summary of the semantic content of the token “month”, which appears in both the document and the summary. The red line highlights how the semantic content of the token “sponsor”, which appears in the document only but not the summary, is transported to the tokens “tour” and “extension”, which are semantically closer and have lower transport costs, and thus achieve a minimum transportation cost in the OT plan.

Figure

Figure 5: A sample summary comparison on the PubMed dataset. OTExtSum based summary sentences are

Figure

Figure 2: Score distribution of LS10 across CNN/DM and XSum. Each data point shows the number of times a score was assigned to each system.

Figure

Figure 4: Screenshot of the evaluation page for BWS annotation.

Figure

Figure 6: Screenshot of the evaluation page for Likert Scale annotation.

Figure

Figure 1: Score distribution of LS with a 5-point scale across CNN/DM and XSum. Each data point shows the number of times a score was assigned to each system.

Figure

Figure 3: Screenshot of the instruction page for BWS annotation.

Figure

Figure 5: Screenshot of the instruction page we used for Likert Scale annotation.

Figure

Figure 6: A sample summary comparison on the CNN/DM dataset. OTExtSum based summary sentences

Figure

Figure 4: XSum class-level performance, averaged across all models.

Figure

Figure 9: XSum correlations.

Figure

Figure 6: Pearson correlations for Extractive, Paraphrase and Evidence samples in Gigaword and CNN/DM.

Figure

Figure 7: Gigaword correlations.

Figure

Figure 1: Distribution of the different class of samples in all datasets.

Figure

Figure 2: Gigaword class-level performance, averaged across all models.

Figure

Figure 8: CNN/DM correlations.

Figure

Figure 3: CNN/DM class-level performance, averaged across all models.

Figure

Figure 5: Pearson correlation between different metrics for all three datasets.

Figure

Fig. 5 to perform human analysis.

Figure

Figure 2: Overview of Hard Typed Decoder (left) and Our Reinforced Hard Typed Decoder (right).

Figure

Figure 1: Current models tend to output general and less meaningful summaries.

Figure

Figure 3: Learning Curve of RHTD
Figure 4: Generated Summaries on E.g. 2
Figure 5: Generated Summaries on E.g. 3

Figure

Figure 1: A high level description of the NHG model. The model predicts the next headline word yt given the words in the document x1 . . . xN and already generated headline words y1 . . . yt−1.

Figure

Figure 2: Validation set (EN) perplexities of the NHG model with different pre-training methods.

Figure

Figure 3: Overarching system process flow

Figure

Figure 5: ROUGE-1 score comparisons for various budgets, on the 3 datasets used in this study.

Figure

Figure 1: Undirected, weighted graph-of-words example. W = 8 and the window spans across sentences. Stemmed words, weighted k-core decomposition. Numbers inside parentheses are CoreRank scores. For clarity, words other than nouns and adjectives (shown in italics) have been removed.

Figure

Figure 2: k-core decomposition of a graph and illustration of the value added by CoreRank. While the two marked nodes have the same core number (= 2), one of them has a greater CoreRank score (3+2+2=7 vs 2+2+1=5), which better reflects its more central position in the graph.
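
A toy sketch of the CoreRank idea (a node's score is the sum of its neighbours' core numbers), using networkx for the k-core decomposition; this is an unweighted illustration and does not reproduce the paper's weighted decomposition or the example graph.

import networkx as nx

def corerank(G):
    """CoreRank of a node = sum of the core numbers of its neighbours, which can
    separate nodes that share the same core number, as in the figure's example."""
    core = nx.core_number(G)   # unweighted k-core; the paper uses a weighted variant
    return {n: sum(core[m] for m in G.neighbors(n)) for n in G.nodes}

# toy usage on an arbitrary small graph-of-words
G = nx.Graph([("budget", "cuts"), ("cuts", "hurt"), ("hurt", "schools"), ("budget", "hurt")])
print(nx.core_number(G), corerank(G))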

Figure

Figure 4: Size of the DUC2001 documents in development and test sets.

Figure

Figure 3: Comparison between NLI models augmented with Falsesum and FactCC across different measures of summary extractiveness. The x-axis shows the median overlap score of each test subset.

Figure

Figure 1: Overview of the Falsesum generation framework. Falsesum preprocesses and formats the source document (A) and a gold summary (B) before feeding it to a fine-tuned generator model. The model produces a factually inconsistent summary, which can then be used to obtain (A,D) or (A,E) as the negative (non-entailment) NLI premise-hypothesis example pair. We also use the original (A,B) as a positive NLI example (entailment).

Figure

Figure 2: Input format design of Falsesum. The framework first extracts the predicate and argument spans from the source document and the gold summary. The spans are then corrupted, lemmatized, and shuffled before being inserted into the input template.

Figure

Figure 5: Drop of the mean BLANC-tune value when parameters differ from optimal. The drop is shown as a fraction of the optimal mean BLANC value. The summaries probed are: CNN and DM (from the CNN/Daily Mail dataset), Top and Rand (first two sentences and random two sentences from random news articles). The parameters probed are: ’gap-infer 2/1’ is gap = 2 and gap_mask = 1; ’gap-tune 2/1’ is gap_tune = 2 and gap_mask_tune = 1; ’p-replace 0.1’ is p_replace = 0.1; ’toks-normal 4’ is L_normal = 4; ’tune-rand’ is making token masking random rather than even at tuning.

Figure

Figure 5: Spearman correlation between SummEval experts scores and ESTIME by embeddings from different layers of the model. Thick lines: ESTIME (as defined by Equation 1 and considered throughout the paper). Thin lines: Nw as defined by Equation 3.

Figure

Figure 2: Spearman correlation between SummEval experts scores and ESTIME using embeddings taken from different layers of the model.

Figure

Figure 3: Spearman and Kendall Tau-c correlations - system level - between SummEval experts scores of consistency and ESTIME using embeddings taken from different layers of the model.

Figure

Figure 12: Kendall Tau-c correlation between SummEval experts scores and ESTIME by embeddings from different layers of the model. Thick lines: sparsity of the masking is defined by the distance 8 and the margin 50 (see Section 2), as used throughout the paper. Thin lines: distance 8, margin 100.

Figure

Figure 4: Kendall Tau-c correlation between SummEval experts scores and ESTIME by embeddings from different layers of the model. Thick lines: ESTIME (as defined by Equation 1 and considered throughout the paper). Thin lines: Nw as defined by Equation 3.

Figure

Figure 1: Factor by which Spearman correlation of BLANC with human scores increases when only part of text is used for BLANC. The text part is selected as sentences with top BLANC values (thin lines) or as contiguous sentences with highest BLANC (thick lines).

Figure

Figure 1: Kendall Tau-c correlation between SummEval experts scores and ESTIME using embeddings taken from different layers of the model.

Figure

Figure 9: Spearman correlation between SummEval experts scores and ESTIME by embeddings from different layers of the model. Thick lines: the model is bert-large-uncased-whole-word-masking. Thin lines: the model is bert-base-uncased.

Figure

Figure 7: Spearman correlation between SummEval experts scores and ESTIME by embeddings from different layers of the model. Thick lines: unnormalized embeddings (as used throughout the paper). Thin lines: normalized embeddings.

Figure

Figure 2: Factor by which Spearman correlation of BLANC with human scores increases when only part of text is used for BLANC. The text part is selected as sentences with top BLANC values (thin lines) or as contiguous sentences having highest average BLANC (thick lines). The resulting BLANC is calculated as average over BLANC of the sentences.

Figure

Figure 13: Spearman correlation between SummEval experts scores and ESTIME by embeddings from different layers of the model. Thick lines: sparsity of the masking is defined by the distance 8 and the margin 50 (see Section 2), as used throughout the paper. Thin lines: distance 8, margin 100.

Figure

Figure 11: Spearman correlation between SummEval experts scores and ESTIME by embeddings from different layers of the model. Thick lines: all tokens are used, as is done throughout the paper. Thin lines: tokens of determiners (part of speech) are not used.

Figure

Figure 8: Example of text from SummEval dataset.

Figure

Figure 8: Kendall Tau-c correlation between SummEval experts scores and ESTIME by embeddings from different layers of the model. Thick lines: the model is bert-large-uncased-whole-word-masking. Thin lines: the model is bert-base-uncased.

Figure

Figure 6: Example of a summary with a wide coverage (left) and a narrow coverage (right). Both summaries are supposed to cover the first four paragraphs of ’Harry Potter and the Sorcerer’s Stone’ by J.K. Rowling.

Figure

Figure 6: Kendall Tau-c correlation between SummEval experts scores and ESTIME by embeddings from different layers of the model. Thick lines: unnormalized embeddings (as used throughout the paper). Thin lines: normalized embeddings.

Figure

Figure 3: Factor by which Spearman correlation of BLANC with human scores increases when only part of text is used for BLANC. The text part is selected as sentences with BLANC exceeding threshold.

Figure

Figure 4: Drop of the mean BLANC-help value when parameters differ from optimal. The drop is shown as a fraction of the optimal mean BLANC value. The summaries probed are: CNN and DM (from the CNN/Daily Mail dataset), Top and Rand (first two sentences and random two sentences from random news articles). The parameters probed are: ’gap 3/1’ is gap = 3 and gap_mask = 1; ’gap 3/2’ is gap = 3 and gap_mask = 2; ’toks-normal 5’ is L_normal = 5; ’toks-lead 2’ is L_lead = 2; ’toks-follow 2’ is L_follow = 2.

Figure

Figure 10: Kendall Tau-c correlation between SummEval experts scores and ESTIME by embeddings from different layers of the model. Thick lines: all tokens are used, as is done throughout the paper. Thin lines: tokens of determiners (part of speech) are not used.

Figure

Figure 7: Example of a summary with a wide coverage (left) and a narrow coverage (right). Both summaries are supposed to cover the same text taken from CNN/Daily Mail dataset. The text is shown in Figure 8.

Figure

Figure 1: At t = 4, p_gen weighs the probability of copying a word from V_ext higher than generating a word from the fixed vocabulary V†. The decoder learns to interpret the weighted sum of h_4^L and c_4 in order to compute a probability distribution for the most appropriate text realisation given the context of the triples. The attention mechanism highlights f2 as the most important triple for the generation of the upcoming token. The attention scores are distributed among the entries of V_ext and accumulated into the final distribution over V. As a result, the model copies “science fiction”, one of the surface forms associated with f2.
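
A numpy sketch of the mixing step this caption walks through: the fixed-vocabulary distribution is scaled by p_gen and the attention mass over source positions is scattered into an extended vocabulary with weight 1 − p_gen; shapes and names are illustrative of the generic copy mechanism, not the paper's exact code.

import numpy as np

def final_distribution(p_gen, p_vocab, attention, src_ids, ext_vocab_size):
    """Mix generation and copying:
    P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum of attention on source positions where w occurs.
    `src_ids` maps each source position to an id in the extended vocabulary
    (fixed vocabulary plus per-example surface forms such as "science fiction")."""
    p_final = np.zeros(ext_vocab_size)
    p_final[: len(p_vocab)] = p_gen * p_vocab               # generate from the fixed vocabulary
    np.add.at(p_final, src_ids, (1.0 - p_gen) * attention)  # accumulate copy probability per source token
    return p_final

# toy usage: 5-word fixed vocabulary, one out-of-vocabulary source token (id 5)
p = final_distribution(p_gen=0.3, p_vocab=np.full(5, 0.2),
                       attention=np.array([0.1, 0.7, 0.2]),
                       src_ids=np.array([2, 5, 0]), ext_vocab_size=6)
print(p, p.sum())  # the mixture still sums to 1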

Figure

Figure 2: The percentage with which the unique predicates from the input triples are covered in the summaries.

Figure

Figure 1. ROUGE-1 vs. λ

Figure

Figure 3. Average braille length vs. λ

Figure

Figure 2. ROUGE-2 vs. λ

Figure

Figure 6: Faithfulness-abstractiveness trade-off curve, shown as the dashed red line, on Gigaword dataset. We plot each model’s average faithfulness score evaluated by AMT against its extractiveness level. Our model lies above the graph, performing better than MLE-baseline, DAE (Goyal and Durrett, 2021), and Loss Truncation (Kang and Hashimoto, 2020).

Figure

Figure 7: Summaries changed using the corrector. We mark hallucinated entities in the summaries with red.

Figure

Figure 2: Example output using different strategies of corrector and contrastor. The first two rows show the original document and summary with highlighted entities and their respective labels (date, number, ent). We mark hallucinated entities in the summaries with red, factual entities in document and summary with green and underlined, and

Figure

Figure 8: Example summaries from XSum and Gigaword. Nonfactual components are marked with red.

Figure

Figure 3: Relative effect using different sentence selection criteria on XSum. Adding FactCC to criteria consistently improves factuality. Full result in Table 10.

Figure

Figure 4: Zero-shot and few-shot results. The lines represent each model’s performance when fine-tuned on 0 (zero-shot), 1, 10, 100, and 1000 examples. FACTPEGASUS consistently improves sentence error with more training data. Without the corrector and contrastor, factuality decreases with just 10 examples.

Figure

Figure 1: Illustration of FACTPEGASUS. For pre-training (a), we use the factGSG objective introduced in Section 3.1 that transforms a text document into a pseudo-summarization dataset. We select the pseudo-summary using the combination of ROUGE and FactCC. Here, sentence A is selected as the pseudo-summary, and we mask this sentence in the original text to create the pseudo-document. During fine-tuning (b), the connector (i) simulates the factGSG task by appending the same mask token used in (a) to the input document, so that we have the same setup in both training stages. Then, corrector (ii) removes hallucinations (highlighted in red) from the summary. Finally, contrastive learning in (iii) encourages the model to prefer the corrected summary over the perturbed summary.

Figure

Figure 5: Factuality dynamics result. We show token error, sentence error, and FactCC as training progresses. FACTPEGASUS slows down factuality degradation for all metrics compared to BART-base.

Figure

Figure 9: Example summaries from WikiHow. The article is truncated to fit the page. Nonfactual information is marked in red.

Figure

Figure 3: Sample summaries generated by different systems on movie reviews and arguments. We only show a subset of reviews and arguments due to limited space.

Figure

Figure 4: Sampling effect on RottenTomatoes.

Figure

Figure 1: Examples for an opinion consensus of professional reviews (critics) about movie “The Martian” from www.rottentomatoes.com, and a claim about “death penalty” supported by arguments from idebate.org. Content with similar meaning is highlighted in the same color.

Figure

Figure 2: Evaluation of importance estimation by mean reciprocal rank (MRR), and normalized discounted cumulative gain at top 3 and 5 returned results (NDCG@3 and NDCG@5). Our regression model with pairwise preference-based regularizer uniformly outperforms baseline systems on both datasets.

Figure

Figure 3: When the second “arrested” appears and the sentence becomes ungrammatical, the discriminator determines that this example comes from the generator. Hence, after this time-step, it outputs low scores.

Figure

Figure 4: Real examples with methods referred in Table 1. The proposed methods generated summaries that grasped the core idea of the articles.

Figure

Figure 1: Proposed model. Given a long text, the generator produces a shorter text as a summary. The generator is learned by minimizing the reconstruction loss together with the reconstructor and by making the discriminator regard its output as human-written text.

Figure

Figure 2: Architecture of the proposed model. The generator network and reconstructor network are seq2seq hybrid pointer-generator networks, but for simplicity we omit the pointer and the attention parts. The reconstruction loss varies widely from sample to sample, and thus the rewards to the generator are not stable either; hence we add a baseline to reduce their difference. We apply self-critical sequence training (Rennie et al., 2017); the modified reward rR(x, x̂) from reconstructor R with the baseline for the generator is
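
For reference, the standard self-critical baseline of Rennie et al. (2017) subtracts the reward of the greedily decoded output from that of the sampled output; the form below is shown only as an illustration of that general scheme, not necessarily the exact reward defined in this model.

\tilde{r}_R(x, \hat{x}) \;=\; r_R(x, \hat{x}) \;-\; r_R\big(x, \hat{x}^{\text{greedy}}\big)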

Figure

Figure 1: A graphical illustration of the topic-aware convolutional architecture. Word and topic embeddings of the source sequence are encoded by the associated convolutional blocks (bottom left and bottom right). Then we jointly attend to words and topics by computing dot products of decoder representations (top left) and word/topic encoder representations. Finally, we produce the target sequence through a biased probability generation mechanism.

Figure

Figure 4: The Rouge-1 and Rouge-L scores for each pre-training method and the basic model on the development set during the training process.

Figure

Figure 4: Quality of rankings given by Fast Rerank.

Figure

Figure 3: Quality of candidate templates under different ranges.

Figure

Figure 3: ROUGE metrics on Gigaword and DUC2004 w.r.t. a different number of concept candidates. Updates were conducted by hard assignment (P̂_i^C via argmax) and by random selection (P̂_i^C chosen at random).

Figure

Figure 1: “summary1” only copies keywords from the source text, while “summary2” generates new concepts to convey the meaning.

Figure

Figure 3: This figure shows the Rouge-2 score for each pre-training method and the basic model on the development set during the training process. We put the results for the Rouge-1 and Rouge-L scores in Appendix A.2.

Figure

Figure 1: Overview of the Fast Rerank Module.

Figure

Figure 2: The structure of the proposed model: (a) the Bi-Directional Selective Encoding with Template model (BiSET) and (b) the bi-directional selective layer.

Figure

Figure 2: The structure of the Basic Model. We use an LSTM and a self-attention module to encode the sentence and document respectively. Xi represents the word embeddings for sentence i. Si and Di represent the independent and document-involved sentence embeddings for sentence i, respectively.

Figure

Figure 1: An example of the Mask pre-training task. A sentence is masked in the original paragraph, and the model is required to predict the missing sentence from the candidate sentences.

Figure

Figure 2: The architecture of our model. The blue bar represents the attention distribution over the inputs. The purple bar represents the concept distribution over the inputs. Note that this distribution can be sparse since not every word has an upper concept. The green bar represents the vocabulary distribution generated from the seq2seq component.

Figure

Figure 3: Graph structure of HETERDOCSUMGRAPH for multi-document summarization (corresponding to the Graph Layer part of Figure 1). Green, blue and orange boxes represent word, sentence and document nodes respectively. d1 consists of s11 and s12 while d2 contains s21 and s22. With word nodes acting as relay nodes, document-document, sentence-sentence, and sentence-document relations can be built through the common word nodes. For example, sentences s11, s12 and s21 share the same word w1, which connects them across documents.

Figure

Figure 8: A generated summary example of CNN/DM.

Figure

Figure 3: Annotation interface and instructions for XSUM factual consistency task.

Figure

Figure 4: The plot of the improvement of BertSUM+TA over BertSUM as a function of the document length for (a) CNN/DM and (b) Xsum, where the improvement is measured by the amount of increase in the ROUGE scores. The documents in each corpus are equally divided into 10 different groups based on their lengths. Each point of a curve indicates the average ROUGE score in its corresponding group.

Figure

Figure 3: (a) TA for multi-head attention; (b) Mask matrices in decoder SA (left) and decoder CA (right).

Figure

Figure 1: Model Overview. The framework consists of three major modules: graph initializers, the heterogeneous graph layer and the sentence selector. Green circles and blue boxes represent word and sentence nodes respectively. Orange solid lines denote the edge feature (TF-IDF) between word and sentence nodes and the thicknesses indicate the weight. The representations of sentence nodes will be finally used for summary selection.

Figure

Figure 5: Relationship between number of source documents (x-axis) and R̃ (y-axis).

Figure

Figure 1: The structure of BertSUM with TA, where the names in bold are our proposed modules in TA.

Figure

Figure 3: Exploring anchor set size k.

Figure

Figure 2: Distribution of tokens represented by ϕv in SIA, learned on CNN/DM using PFA and (5).

Figure

Figure 4: Relationships between the average degree of word nodes of the document (x-axis) and R̃, which is the mean of R-1, R-2 and R-L (lines, left y-axis), and ∆R̃, which is the difference in R̃ between HETERSUMGRAPH and Ext-BiLSTM (histograms, right y-axis).

Figure

Figure 2: Annotation interface and instructions for CNN/DM factual consistency task.

Figure

Figure 2: An example of anchor set for the bigram “great success” when top-3 results are extracted.

Figure

Figure 1: Overview of QAGS. A set of questions is generated based on the summary. The questions are then answered using both the source article and the summary. Corresponding answers are compared using a similarity function and averaged across questions to produce the final QAGS score.
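
A compact sketch of the QAGS scoring loop, with token-level F1 as the answer-similarity function and placeholder `generate_questions` / `answer` callables standing in for the QG and QA models; the interfaces are assumptions for illustration, not the authors' exact code.

def token_f1(answer_a, answer_b):
    """Token-level F1 between two answer strings (a common choice of answer similarity)."""
    a, b = answer_a.lower().split(), answer_b.lower().split()
    common = sum(min(a.count(t), b.count(t)) for t in set(a))
    if common == 0:
        return 0.0
    p, r = common / len(b), common / len(a)
    return 2 * p * r / (p + r)

def qags_score(article, summary, generate_questions, answer):
    """QAGS-style score: ask questions about the summary, answer them against both the
    summary and the article, and average the answer similarity across questions."""
    questions = generate_questions(summary)
    sims = [token_f1(answer(q, summary), answer(q, article)) for q in questions]
    return sum(sims) / len(sims) if sims else 0.0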

Figure

Figure 9: A generated summary example of NYT, where the generation of BertSUM comes from the original paper (Liu and Lapata, 2019).

Figure

Figure 1: The transition from old-fashioned to newly-introduced protocol for designing reference-based automatic evaluation metrics in the document summarization task. The curved arrows on the right show that both summaries are derived from the source document.

Figure

Figure 2: The detailed update process of word and sentence nodes in the Heterogeneous Graph Layer. Green and blue nodes are the word and sentence nodes involved in this turn. Orange edges indicate the current information flow direction. First, for sentence s1, words w1 and w3 are used to aggregate word-level information in (a). Next, w1 is updated by the new representations of s1 and s2 in (b), which are the sentences in which it occurs. See Section 3.3 for details on the notation.

Figure

Figure 5: The conditional generation results obtained by plugging TEMA into GPT-2 small.

Figure

Figure 1: The framework of the proposed method

Figure

Figure 1: Overview of our MADY model, containing a hierarchical multi-scale abstraction modeling module (HMAM) on the left and a dynamic key-value memory-augmented attention network (DMA) on the right.

Figure

Figure 2: (a) A visualization of the interaction matrix. (b) The reshaped matrix controlled by novelty. (c) The reshaped matrix controlled by relevance. For simplicity, both thresholds were set to 0.5.

Figure

Figure 2: The influence and relationship of some major factors, i.e., the coefficient β, tweet volume, and tweet latency

Figure

Figure 1: (a) Position of highlight hits in the documents; (b) Top-4 tweet hits in the documents; (c) The probability of a highlight hit vs. a tweet hit in the documents; (d) The maximum similarity between highlights and tweets per document

Figure

Figure 1: Example contrasting the Autoencoder (AE) and Information Bottleneck (IB) approaches to summarization. While AE (top) preserves any detail that helps to reconstruct the original, such as population size in this example, IB (bottom) uses context to determine which information is relevant, which results in a more appropriate summary.

Figure

Figure 3: Bar plot of per-token pgen and the entropy of the generation distribution (purple) and copy distribution (blue), plotted under the correlation contributions CC(pgen, Hgen) (purple) and CC(pgen, Hcopy) (blue) for randomly sampled CNN/DailyMail test summaries.

Figure

Figure 2: Distribution of pgen across all tokens in the test split of the CNN/DailyMail corpus. Sentence-final punctuation makes up 5% of tokens in the dataset, which accounts for 22% of pgen’s mass.

Figure

Figure 1: (Top) Correlation contributions CC(pgen, Hgen) (green) and CC(pgen, Hcopy) (purple) for a randomlysampled summary. (Bottom) Bar plot of per-token pgen (orange), and entropy of the generation distribution (green) and copy distribution (purple) for the same summary.

Figure

Figure 4: Bar plot of per-token pgen and the entropy of the generation distribution (purple) and copy distribution (blue), plotted under the correlation contributions CC(pgen, Hgen) (purple) and CC(pgen, Hcopy) (blue) for randomly sampled XSum test summaries.

Figure

Figure 1: Illustration of the neural coherence model, which is built upon ARC-II proposed by Hu et al. (2014).

Figure

Figure 1: Model Framework. The top figure describes the framework for contrastive learning, where for each document x we create different types of negative samples and compare them with x to get a ranking loss. The bottom figure is the evaluator, which generates the final evaluation score. For brevity, we use SS, SL and SLS to denote the S Score, L Score and LS Score.

Figure

Figure 3: Comparison of HT, GraphSum (GSum in the figure), and BASS under various input token lengths.

Figure

Figure 4: Illustration of novel n-grams in generated summaries from different systems.

Figure

Figure 1: Illustration of a unified semantic graph and its construction procedure for a document containing three sentences. In Graph Construction, underlined tokens represent phrases, and co-referent phrases are shown in the same color. In The Unified Semantic Graph, nodes of different colors indicate different types, according to Section 3.1.

Figure

Figure 2: Illustration of our graph-based summarization model. The graph node representations are initialized by merging token representations at two levels. The graph encoder models the augmented graph structure. The decoder attends to both token and node representations and utilizes the graph structure via graph-propagation attention.

Figure

Figure 4: The figure compares high-frequency semantic units and the semantic units in the summary of each article in CNN/Daily Mail, which includes 287k article-summary pairs in total. The x-axis represents the ratio of high-frequency semantic units that also show up in summaries. The y-axis is the number of articles in the CNN/Daily Mail training set. The threshold for cosine similarity is set to 0.5.

Figure

Figure 1: Our training and inference stages. The semantic unit embeddings with darker colors indicate that greater attention mask values are applied.

Figure

Figure 2: Our model overview. (a) The two-stage training process. (b) The inference process.

Figure

Figure 3: Comparison of the mean gold scores assigned for Q2 and Q3 to each of the 32 systems in the DUC05 dataset, and the corresponding scores predicted by SUM-QE. Scores range from 1 to 5. The systems are sorted in descending order according to the gold scores. SUM-QE makes more accurate predictions for Q2 than for Q3, but struggles to put the systems in the correct order.

Figure

Figure 1: SUM-QE rates summaries with respect to five linguistic qualities (Dang, 2006a). The datasets we use for tuning and evaluation contain human assigned scores (from 1 to 5) for each of these categories.

Figure

Figure 2: Illustration of different flavors of the investigated neural QE methods. An encoder (E) converts the summary to a dense vector representation h. A regressor Ri predicts a quality score SQi using h. E is either a BiGRU with attention (BiGRU-ATT) or BERT (SUM-QE). R has three flavors, one single-task (a) and two multi-task (b, c).

Figure

Figure 1: A histogram illustrating the score distribution in the real learner data

Figure

Figure 3: The merged LSTM model

Figure

Figure 4: Attention mechanism architecture in the attention-based LSTM model for summary assessment

Figure

Figure 5: Combining three approaches using ensemble modelling

Figure

Figure 2: Similarity matrices of two summaries for the same reading passage from the simulated learner data. Summary A is a good summary and Summary B is a bad summary. The rows of the matrix represent sentences in the summary and the columns of the matrix represent sentences in the reading passage.

Figure

Figure 2: A comparison between our model, SummaRuNNer and the Oracle when applied to documents of increasing length. Top-left: ROUGE-1 on the Pubmed dataset; top-right: ROUGE-2 on the Pubmed dataset; bottom-left: ROUGE-1 on the arXiv dataset; bottom-right: ROUGE-2 on the arXiv dataset.

Figure

Figure 1: The structure of our model. sei and sri represent the sentence embedding and sentence representation of sentence i, respectively. The binary decision of whether the sentence should be included in the summary is based on the sentence itself (A), the whole document (B) and the current topic (C). The document representation is simply the concatenation of the last hidden states of the forward and backward RNNs, while the topic segment representation is computed by applying LSTM-Minus, as shown in detail in the left panel (Detail of C).
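
The LSTM-Minus computation mentioned in the caption can be sketched as boundary subtractions over the bidirectional RNN states; the indexing and zero-vector boundary handling below follow one common convention and may differ from the paper's exact formulation.

import numpy as np

def lstm_minus_segment(h_fwd, h_bwd, i, j):
    """LSTM-Minus sketch: represent the topic segment spanning sentences i..j
    (inclusive, 0-indexed) by subtracting hidden states at the segment boundaries.
    h_fwd, h_bwd: (num_sentences, dim) hidden states of the forward/backward RNNs.
    Positions outside the document are treated as zero vectors (an assumed convention)."""
    dim = h_fwd.shape[1]
    left = h_fwd[i - 1] if i > 0 else np.zeros(dim)
    right = h_bwd[j + 1] if j + 1 < h_bwd.shape[0] else np.zeros(dim)
    fwd_seg = h_fwd[j] - left          # what the forward RNN accumulated inside the segment
    bwd_seg = h_bwd[i] - right         # what the backward RNN accumulated inside the segment
    return np.concatenate([fwd_seg, bwd_seg])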

Figure

Figure 4: Copy rate learning curve of two balancing mechanisms. Unbalanced 100: |m_r| = 10^2 |m_c|; Unbalanced 1000: |m_r| = 10^3 |m_c|.

Figure

Figure 7: The relative position distribution of different redundancy reduction methods on Pubmed(left) and arXiv(right) datasets.

Figure

Figure 4: Comparing the average ROUGE scores and average unique n-gram ratios of different models on the Pubmed dataset, conditioned on different degrees of redundancy and lengths of the document (extremely long documents, i.e., 1% of the dataset, are not shown because of space constraints).

Figure

Figure 5: Comparing the average ROUGE scores and average unique n-gram ratios of different models with different word length limits on the Pubmed dataset. See Appendices for similar results on arXiv.

Figure

Figure 3: The average ROUGE scores, average unique n-gram ratios, and average NID scores for different λ used in MMR-Select on the validation set. Recall that the higher the unique n-gram ratio and the lower the NID, the less redundancy the summary contains.
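
A small sketch of the MMR-style selection that the λ parameter controls, with placeholder `relevance` and `similarity` callables; it illustrates the salience/redundancy trade-off rather than reproducing MMR-Select exactly.

def mmr_select(sentences, relevance, similarity, budget=3, lam=0.6):
    """Greedily pick the sentence maximizing
    lam * relevance - (1 - lam) * max similarity to already-selected sentences,
    trading off salience against redundancy. `relevance(s)` and `similarity(a, b)`
    stand in for whatever scoring the model uses."""
    selected = []
    candidates = list(sentences)
    while candidates and len(selected) < budget:
        def mmr(s):
            redundancy = max((similarity(s, t) for t in selected), default=0.0)
            return lam * relevance(s) - (1.0 - lam) * redundancy
        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected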

Figure

Figure 1: Information amount evaluation with language models. Here we take the subsequence x3x4 as an example. [M] denotes a mask, and PLMs/MLMs/ALMs are three different options for language models. I(x3x4 | ·) = − log[P(x3 | ·) P(x4 | ·)], where the conditions for different models are omitted for brevity.

Figure

Figure 1: Sample summary of an article from the test set of CNN/DailyMail corpus (Hermann et al., 2015). The words used in summary are colored. In this sample, one sentence (green) is directly copied from article and two are rewritten to be concise.

Figure

Figure 3: Architecture of our hierarchical reinforcement learning and reward composition (green lines) of extraction.

Figure

Figure 6: Comparing the average ROUGE scores and average unique n-gram ratios of different models on the arXiv dataset, conditioned on different degrees of redundancy and lengths of the document.

Figure

Figure 1: The average unique n-gram ratio in the documents across different datasets. To reduce the effect of length difference, stopwords were removed.

Figure

Figure 5: Summary comparison with the baseline models. Overlapped content is colored. Our model overlaps the most with human summary and generates less redundancy.

Figure

Figure 2: The pipeline of the MMR-Select+ method, where Ŝ, Ŷ and S̄, Ȳ are the summaries and labels generated by the MMR-Select algorithm and the normal greedy algorithm, respectively. S and Y are the ground-truth summary and the oracle labels.

Figure

Figure 2: The overview of the HYSUM framework. The hierarchical representation module first encodes the article sentences s_i into vectors h_j. Then each sentence vector becomes two versions by adding two different markers m_c, m_r. When the pointer network (arrows denote attention and darker color represents higher weights) selects the copy version h_i^c of a sentence, it will be copied. Otherwise, when the rewriting version h_i^r is selected, the sentence will be rewritten to reduce redundancy.

Figure

Figure 2: Comparison of CoCo values assigned to high (top 50%) and low (bottom 50%) human judgments.

Figure

Figure 1: (a) The causal graph of text summarization reflects the causal relationships among the fact C, source document X , language prior K, and the modelgenerated summary Y . (b) According to Eq. (6), the causal effect of X on Y can be obtained by subtracting the effect of K on Y from the total effect.

Figure

Figure 2: The position distribution of extracted sentences by different models on the PubMed-Long test set.

Figure

Figure 1: The model architecture of GRETEL

Figure

Figure 2: Relative position distributions of selected sentences in the original document of two testing corpora (CNN/DM and XSum), obtained by different lead bias demoting strategies.

Figure

Figure 1: The overall architecture of our proposed lead bias demoting method.

Figure

Figure 1: Architecture of the EDU rewriter. The group tag embedding builds a connection between the encoder and decoder.

Figure

Figure 1: Pyramid scores of KPLM+TSR under different summary length limits.

Figure

Figure 3: Sentence extraction module of JECS. Words in input document sentences are encoded with BiLSTMs. Two layers of CNNs aggregate these into sentence representations hi and then the document representation vdoc. This is fed into an attentive LSTM decoder which selects sentences based on the decoder state d and the representations hi, similar to a pointer network.

Figure

Figure 1: Diagram of the proposed model. Extraction and compression are modularized but jointly trained with supervision derived from the reference summary.

Figure

Figure 5: Effect of changing the compression threshold on CNN. The y-axis shows the average of the F1 of ROUGE-1,-2 and -L. The dotted line is the extractive baseline. The model outperforms the extractive model and achieves nearly optimal performance across a range of threshold values.

Figure

Figure 2: Text compression example. In this case, “intimate”, “well-known”, “with their furry friends” and “featuring ... friends” are deletable given compression rules.

Figure

Figure 4: Text compression module. A neural classifier scores the compression option (with their furry friends) in the sentence and broader document context and decides whether or not to delete it.

Figure

Figure 2: An example of masked sentences prediction. The third sentence in the document is masked and the hierarchical encoder encodes the masked document. We then use TransDecS to predict the original sentence one token at a time.

Figure

Figure 1: Next token entropies computed on 10K generation steps from PEGASUSCNN/DM, PEGASUSXSUM, BARTCNN/DM and BARTXSUM respectively, broken into two cases: an Existing Bigram means the bigram just generated occurs in the input document, while a Novel Bigram is an organic model generation. These cases are associated with low entropy and high entropy actions, respectively. The x-axis shows the entropy (truncated at 5), and the y-axis shows the count of bigram falling in each bin. The dashed lines indicate the median of each distribution.
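
The entropy analysis in this caption can be reproduced in outline as follows; the sketch assumes access to the model's next-token probabilities and a precomputed set of source bigrams, and glosses over tokenization details.

import math

def next_token_entropy(probs):
    """Shannon entropy (nats) of the model's next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def bucket_step(prev_token, new_token, probs, source_bigrams):
    """Label a generation step as an 'existing' bigram (copy-like) or a 'novel' bigram
    (organic generation) depending on whether the just-generated bigram occurs in the
    source, and pair it with that step's entropy (mirrors the caption's analysis)."""
    label = "existing" if (prev_token, new_token) in source_bigrams else "novel"
    return label, next_token_entropy(probs)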

Figure

Figure 2: A document (left) weighted with respect to a reference summary and two system outputs (right), with Corr-F/Corr-A scores. The colour represents the sum of argument- and fact-level weights for each token (Eqs. 3 and 4). The darker the colour, the more important the fact is.

Figure

Figure 1: Illustration of DISCOBERT for text summarization. The sentence-based BERT model (baseline) selects whole sentences 1, 2 and 5. The proposed discourse-aware model DISCOBERT selects EDUs {1-1, 2-1, 5-2, 20-1, 20-3, 22-1}. The right side of the figure illustrates the two discourse graphs we use: (i) Coref(erence) Graph (with the mentions of ‘Pulitzer prizes’ highlighted as examples); and (ii) RST Graph (induced by RST discourse trees).

Figure

Figure 1: The framework of our proposed model. Based on the encoder self-attention graph, we calculate the centrality score for each source word to guide the copy module.

Figure

Figure 4: Proportion of extracted sentences by different unsupervised models against their positions.

Figure

Figure 3: Correlating syntactic distance between neighboring tokens with the entropy change in those tokens’ generation decisions for PEGASUS summaries. The median entropy change is depicted as a dashed black line. At points of high syntactic distance, the model’s behavior is less restricted by the context, correlating with higher entropy.

Figure

Figure 6: The instruction for argument-level human highlight annotation.

Figure

Figure 4: Correlation between attention entropy and prediction entropy of PEG(ASUS) and BART on C(NN/DM) and X(Sum). We compute the mean value of the attention entropy within each bucket of prediction entropy. The uncertainty of attention strongly correlates with the entropy of the model’s prediction.

Figure

Figure 2: KL divergence with ROUGE F1 on the Gigaword test set for the SAGCopy Indegree-1 model. Each point in the above plots represents a sample. The bottom plots show the average ROUGE score for different KL values.

Figure

Figure 5: The human annotation interface for fact level. Human judges are required to highlight content in the document that is supporting the fact printed in bold “The Queen has tweeted her thanks” (FACT1 of the summary in Figure 1 in the paper).

Figure

Figure 3: (Left) Model architecture of DISCOBERT. The Stacked Discourse Graph Encoders contain k stacked DGE blocks. (Right) The architecture of each Discourse Graph Encoder (DGE) block.

Figure

Figure 3: The guidance process for SAGCopy Indegree model, showing that the keyword “northern” is correctly copied for our model.

Figure

Figure 2: Prediction entropy values by relative sentence positions. For example, 0.0 indicates the first 10% of tokens in a sentence, and 0.9 is the last 10% of tokens. PEGASUSCNN/DM and BARTCNN/DM make highly uncertain decisions to start, but then entropy decreases, suggesting that these models may be copying based on a sentence prefix. Entropies on XSum are more constant across the sentence.

Figure

Figure 5: Vocabulary projected attention attending to the last input yt−2, current input yt−1, current output yt, and next output yt+1. When the prediction entropy is low, the attention mostly focuses on a few tokens, including the current input yt−1 and current output yt.

Figure

Figure 7: The human annotation interface for argument level. Human judges are required to highlight content in the document that is supporting the phrase printed in bold “on social media” (argument ARGM-LOC of FACT2 of the summary in Figure 1 in the paper).

Figure

Figure 1: List of SRL propositions and corresponding tree MR with two facts for the sentence “The queen has tweeted her thanks to people who sent her 90th birthday messages on social media”.

Figure

Figure 1: The architecture of our hierarchical encoder: the token-level Transformer encodes tokens, and the sentence-level Transformer then learns final sentence representations from the representations at <s>.

Figure

Figure 3: Another document (left) weighted with respect to a reference summary and two system outputs (right), with Corr-F/Corr-A scores (see Fig. 2 for details).

Figure

Figure 9: Human highlight annotation for the FACT1 “The Queen has tweeted her thanks” of the summary in Figure 1 in the paper.

Figure

Figure 2: Example of discourse segmentation and RST tree conversion. The original sentence is segmented into 5 EDUs in box (a), and then parsed into an RST discourse tree in box (b). The converted dependency-based RST discourse tree is shown in box (c). Nucleus nodes ([2], [3] and [5]) and Satellite nodes ([1] and [4]) are denoted by solid lines and dashed lines, respectively. Relations are in italic. The EDU [2] is the head of the whole tree (span [1-5]), while the EDU [3] is the head of the span [3-5].

Figure

Figure 4: The instruction for fact-level human highlight annotation.

Figure

Figure 3: An example of Sentence Shuffling. The sentences in the document are shuffled and passed through the hierarchical encoder; a Pointer Network with TransDecP as its decoder then predicts the positions of the original sentences in the shuffled document.

Figure

Figure 8: Human highlight annotation for the argument ARG1 of FACT1 “her thanks” of the summary in Figure 1 in the paper.

Figure

Figure 5: Visualization of attention weights between the historical summaries and the input review.

Figure

Figure 4: Ablation experiments on the Sports dataset. Figure (a) shows the results on the ROUGE-1 and ROUGE-2 metrics, and Figure (b) shows the results on the ROUGE-L metric.

Figure

Figure 3: Four-way evaluation of our content attribution methods. The reported value is the NLL loss with respect to the predicted token. Lower is better for display methods and higher is better for removal methods (we "break" the model more quickly). n = 0 is the baseline, where no token or sentence is displayed in DISP or removed or masked in RM.

Figure

Figure 3: The framework of our model. Attentions marked in grey are from the naive Transformer. Then, the reasoning unit consists of the inter-reasoning attention marked in green and the personalized intra-reasoning attention marked in yellow. Finally, the memory-decoder attention marked in red incorporates the historical reasoning memory into the decoder layer.

Figure

Figure 1: An example of product review and its corresponding summary and historical summaries of corresponding user and product. We mark the relevant historical summaries in red.

Figure

Figure 2: Comparison of Summarization tasks. Single-document Summarization (SDS task) focuses on generating summary S based on a single document D. Multi-document Summarization (MDS task) creates a holistic summary S covering multiple articles D. The MIRANEWS task differs by producing summary S based only on the events pertinent to the main article D, while drawing on a set of assisting documents A for complementary background.

Figure

Figure 1: An example where the summary (top section) contains information that is not explicitly included in its main document (middle section), but is covered in the related assisting document (bottom section). We highlight the information in the summary that is aligned to its corresponding main and assisting documents with yellow and pink colors, respectively.

Figure

Figure 4: An example from MIRANEWS, where the key information in the gold summary and summaries generated by systems conditioning on the main document (BART-S) or both on the main and assisting documents (rest variants) was only mentioned in the assisting documents. Facts in the gold summary supported by the assisting documents…

Figure

Figure 1: Our two-stage ablation-attribution framework. First, we compare a decoder-only language model (not fine-tuned on the summarization task and not conditioned on the input article) with a full summarization model; they are colored in gray and orange, respectively. The higher the difference, the more heavily the model depends on the input context. For those context-dependent decisions, we conduct content attribution to find the relevant supporting content with methods such as Integrated Gradients or Occlusion.
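
A minimal sketch of the context-dependence comparison described here, assuming access to the two next-token distributions; the L1 distance is one plausible choice of distance, not necessarily the paper's exact metric:

```python
# Minimal sketch: measure how much a generation decision depends on the input
# article by comparing the full summarization model's next-token distribution
# with that of a context-free language model. The distance choice is an
# assumption, not the paper's exact setup.
import numpy as np

def context_dependence(p_full, p_lm_no_context):
    """L1 distance between two next-token distributions
    (higher = the decision depends more on the input context)."""
    p_full = np.asarray(p_full, dtype=np.float64)
    p_lm = np.asarray(p_lm_no_context, dtype=np.float64)
    return float(np.abs(p_full - p_lm).sum())
```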

Figure

Figure 3: Example of a page on newser.com: a newser.com article is a news event including editor-picked links to relevant news articles from other news websites. This example shows the webpage https://www.newser.com/story/305823/starship-prototype-lands-doesnt-explode.html. In the webpage (D1), three extra news pieces (D2, D3, D4) from nytimes, newser, and CNBC are linked. All four news articles report on the same event of the Starship prototype landing.

Figure

Figure 6: Several generated summaries together with the corresponding review and reference summary.

Figure

Figure 2: Map of model behavior on XSum (top) and CNN/DM (bottom). The x-axis and y-axis show the distance between LM∅ and Sfull, and distance between S∅ and Sfull. The regions characterize different generation modes, defined in Section 3.

Figure

Figure 6: Ablation on the number of keyphrases in testing.

Figure

Figure 3: Distribution of the number of keyphrases.

Figure

Figure 5: Case study.

Figure

Figure 2: General framework of our model. It has three main parts: the keyphrase prediction network, the induction network, and the condition generation network.

Figure

Figure 1: Proposed summarization framework: generative process and neural parametrization. Shaded nodes represent observed variables, unshaded nodes indicate latent variables, arrows represent conditional dependencies between variables, and plates refer to repetitions of sampling steps. Dashed lines denote optional queries at test time. Latent queries create a query-focused view of the input document, which together with a query-agnostic view serve as input to a decoder for summary generation.

Figure

Figure 4: Fine-tuning the hyper-parameter λ.

Figure

Figure 1: Our framework trains guidance induction and summary generation jointly. It avoids the domain mismatch of external tools, and the guidance extraction is refined during training.

Figure

Figure 6: Screenshot of the (partial) screening test workers had to pass before participating in the HITs.

Figure

Figure 4: Snapshot of the page with instructions for users on the data collection interface.

Figure

Figure 5: Interface to construct summary used to collect data from workers on Amazon Mechanical Turk.

Figure

Figure 2: Intents used in the SUBSUME dataset.

Figure

Figure 7: Screenshot of the summary overview page.

Figure

Figure 3: ROUGE-L F1 for SBERT-EX and SBERT-QB for each intent. From left to right, intents are ordered in increasing order of their subjectiveness score shown in Table 1. The Pearson’s correlation between the subjectiveness score and the F1 score for SBERT-EX and SBERT-QB is −0.97 and −0.77 respectively.
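
A minimal sketch of how such a correlation can be computed with SciPy; the per-intent numbers below are placeholders, not the dataset's actual scores:

```python
# Minimal sketch: correlate per-intent subjectiveness scores with per-intent
# ROUGE-L F1, as in the reported -0.97 / -0.77 correlations.
from scipy.stats import pearsonr

subjectiveness = [0.1, 0.3, 0.5, 0.7, 0.9]       # hypothetical per-intent scores
rouge_l_f1     = [0.42, 0.38, 0.33, 0.29, 0.25]  # hypothetical SBERT-EX F1 values
r, p_value = pearsonr(subjectiveness, rouge_l_f1)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
```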

Figure

Figure 3: Selected RTT questions with the FQD, PRQD and QSV objective measures.

Figure

Figure 2: Question semantic volume maximization using a convex hull. Selected and non-selected candidate RTT questions are distinguished by different markers. The left-side figure shows a toy example of the convex hull. The right-side figure shows the selected RTT questions with respect to the gold question.
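
A minimal sketch of convex-hull-based selection over question embeddings; the 2-D PCA projection is an assumption added so the hull stays well defined with few candidates, and is not necessarily how the paper computes it:

```python
# Minimal sketch: select a diverse subset of candidate RTT questions as the
# vertices of the convex hull of their (dimensionality-reduced) embeddings.
import numpy as np
from sklearn.decomposition import PCA
from scipy.spatial import ConvexHull

def select_hull_questions(question_embeddings):
    """question_embeddings: (n_candidates, dim) array; returns selected indices."""
    points = PCA(n_components=2).fit_transform(np.asarray(question_embeddings))
    hull = ConvexHull(points)
    return sorted(hull.vertices.tolist())
```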

Figure

Figure 1: The highlighted text shows important key aspects of the question which need to be considered while generating the summary.

Figure

Figure 2: Learning curves of HATS in terms of ROUGE-L.

Figure

Figure 1: The overall architecture of our model.

Figure

Figure 5: Proportion of novel n-grams in summaries generated by different models on the CNN/DM test set.

Figure

Figure 4: An example of a generated summary by TED. The reference summary and parts of the input article are also included.

Figure

Figure 1: Overall structure of our model. TED first pretrains on news articles and then finetunes with theme modeling and denoising (from left to right).

Figure

Figure 2: An example of the pretraining task: predict the Lead-3 sentences (as the target summary) using the rest of the article.

Figure

Figure 3: Theme modeling essentially updates TED with a semantic classifier. The input sentence pair is first processed by adding a "class" token at the beginning and a "separation" token between the two sentences. The sentence pair is then fed into the transformer encoder, and the first output vector is classified as "similar" or "distinct".

Figure

Figure 1: An example of a long document abstractive summary from the LongSumm data set, presented using SUMMVis (Vig et al., 2021).

Figure

Figure 3: The first three images are test cases in the top quartile of both ROUGE and BERTSCORE. The last four images are complementary high-quality summaries in the top quartile suggested by SPICE and BLEU. The figures depict portions of the source document that align with the system-generated summary.

Figure

Figure 2: BART summaries in the ROUGE top quartile (left) and the SPICE top quartile (right).

Figure

Figure 1: Abstract and citations of (Bergsma and Lin 2006). The abstract emphasizes their pronoun resolution techniques and improved performance; the citation sentences reveal that their noun gender dataset is also a major contribution to the research community, but it is not covered in the abstract.

Figure

Figure 2: Overview of the dataset construction process.

Figure

Figure 4: Comparison of our hybrid summary with the abstract and pure cited text spans summary, for paper P05-1004 in the CL-SciSumm 2016 test set. Our hybrid summary covers both the authors’ original motivations (green) and the technical details influential to the research community (red).

Figure

Figure 3: Overview of our summarization models.

Figure

Figure 1: CNNLM for learning sentence representations. d = 300, l = 3; 3 left context words are used in this figure.

Figure

Figure 2: The saliency-selection network.

Figure

Figure 1: The focus-attention mechanism.

Figure

Figure 5: Semantic similarity between model outputs and reference summaries. Desired length is set as reference summary length.

Figure

Figure 2: TAPT performance over different pretraining epoch numbers in the email domain in terms of using and not using RecAdam.

Figure

Figure 2: Examining hyperparameter ℵ on the AEG dataset. ROUGE-L F1 scores and Length Variance Var of different models under different ℵ are shown (ℵ = 2, 10, 50, 250).

Figure

Figure 3: ROUGE-1 results of BART fine-tuning, DAPT and SDPT over different numbers of training data for email (left) and dialog (right) domains. We consider both low-resource settings (50, 100, 200 and 300 (∼2%) samples), medium-resource settings (25% and 50% samples), and high-resource settings (75% and 100% samples).

Figure

Figure 3: Length distribution of reference summaries on the Annotated English Gigaword dataset. Summaries with 30 to 75 characters cover the majority cases.

Figure

Figure 1: Illustration of the Length Attention Unit. Firstly, decoder hidden state (blue) and remaining length (yellow) are employed to compute the attention weights al. Then, the length context vector clt (green) is produced by calculating the weighted sum between attention weights and pre-defined length embeddings (purple). Better viewed in color.
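
A minimal PyTorch sketch of a length-attention unit along these lines; dimensions, embedding tables, and the scoring function are assumptions, not the paper's exact parameterization:

```python
# Minimal sketch: the decoder state and the remaining-length embedding attend
# over a table of predefined length embeddings to produce a length context
# vector (the weighted sum of length embeddings).
import torch
import torch.nn as nn

class LengthAttention(nn.Module):
    def __init__(self, hidden_dim, len_dim, num_length_bins):
        super().__init__()
        self.length_table = nn.Embedding(num_length_bins, len_dim)   # predefined length embeddings
        self.remaining_emb = nn.Embedding(num_length_bins, len_dim)  # remaining-length embedding
        self.score = nn.Linear(hidden_dim + 2 * len_dim, 1)

    def forward(self, decoder_state, remaining_len):
        # decoder_state: (batch, hidden_dim); remaining_len: (batch,) bin indices
        batch = decoder_state.size(0)
        table = self.length_table.weight.unsqueeze(0).expand(batch, -1, -1)   # (B, K, len_dim)
        rem = self.remaining_emb(remaining_len).unsqueeze(1).expand_as(table)
        query = decoder_state.unsqueeze(1).expand(batch, table.size(1), -1)
        logits = self.score(torch.cat([query, rem, table], dim=-1)).squeeze(-1)  # (B, K)
        weights = torch.softmax(logits, dim=-1)                                  # attention weights a^l
        context = torch.bmm(weights.unsqueeze(1), table).squeeze(1)              # length context c^l_t
        return context, weights
```

The weighted sum over the embedding table is the design choice highlighted in the caption: the length context vector changes smoothly as the remaining budget shrinks.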

Figure

Figure 4: Length distribution of reference summaries on the CNN/Daily Mail dataset. Summaries exceeding 2000 characters are ignored, since they only cover 0.009% of the dataset.

Figure

Figure 2: The framework of the summarization model.

Figure

Figure 1: A dependency tree example. We extract the following two fact descriptions: Ahmadinejad essentially called Yukiya Amano, the director general of the IAEA, a U.S. puppet ||| said the U.N. agency has no jurisdiction in Iran and Iraq

Figure

Figure 1: The overall architecture of our proposed method and the training process for an inconsistent example.

Figure

Figure 1: Latent variable extractive summarization model. sent_i is a sentence in a document and sum_sent_i is a sentence in a gold summary of the document.

Figure

Figure 1: Visualization of copy probabilities

Figure

Figure 2: Average ROUGE-L improvement on CNN/Daily Mail test set samples with different gold summary lengths.

Figure

Figure 1: Model overview. N represents the number of decoder layers and L represents the summary length.

Figure

Figure 2: The architecture of our extractive summarization model. The sentence and document level transformers can be pretrained.

Figure

Figure 1: The architecture of HIBERT during training. sent_i is a sentence in the document above, which has four sentences in total. sent_3 is masked during encoding and the decoder predicts the original sent_3.

Figure

Figure 3: Truncated examples from the test sets along with human, PG baseline and RLR+C outputs. Factual accuracy scores (s) are also shown for the model outputs. For the Stanford example, clinical observations

Figure

Figure 4: Distributions of the top 10 most frequent trigrams from model outputs on the Stanford test set.

Figure

Figure 1: A (truncated) radiology report and summaries with their ROUGE-L scores. Compared to the human summary, Summary A has high textual overlap (i.e., ROUGE-L) but makes a factual error; Summary B has a lower ROUGE-L score but is factually correct.

Figure

Figure 5: More examples from the test splits of both datasets along with human, PG baseline and RLR+C summaries. In the first example, the baseline output successfully copied content from the context, but missed important observations. In the second exam-

Figure

Figure 2: Our proposed training strategy. Compared to existing work which relies only on a ROUGE reward rR, we add a factual correctness reward rC which is enabled by a fact extractor. The summarization model is updated via RL, using a combination of the NLL loss, a ROUGE-based loss and a factual correctness-based loss. For simplicity we only show a subset of the clinical variables in the fact vectors v and v̂.
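
A minimal sketch of how such a combined objective can be formed, assuming self-critical reinforcement learning with a greedy-decoding baseline; the weighting scheme and names are illustrative rather than the paper's exact formulation:

```python
# Minimal sketch: combine an NLL term with a policy-gradient term whose reward
# mixes a ROUGE-based reward and a factual-correctness reward from a fact
# extractor. The self-critical baseline and the weights are common choices.
import torch

def combined_loss(nll, logp_sample, r_rouge, r_fact, r_rouge_base, r_fact_base,
                  lambda_nll=1.0, lambda_rl=1.0, beta=0.5):
    # Mixed reward for the sampled summary and for the greedy baseline summary.
    reward = (1 - beta) * r_rouge + beta * r_fact
    baseline = (1 - beta) * r_rouge_base + beta * r_fact_base
    # Self-critical policy-gradient loss: reward advantage times sample log-prob.
    rl_loss = -(reward - baseline) * logp_sample
    return lambda_nll * nll + lambda_rl * rl_loss
```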

Figure

Figure 1: The architecture of our hierarchical T5.

Figure

Figure 4: Lite2.xPyramid curves (for system-level correlations) and its comparison to replacing random sentences’ SCUs with STUs.

Figure

Figure 5: A part of the Amazon Mechanical Turk webpage used for human evaluation.

Figure

Figure 3: Precision-recall curves on different datasets, taking FactCC as the baseline.

Figure

Figure 2: Illustration of the consistency checking stage. In the evidence reasoning process, each sentence pair is scored by a reasoning model. The scores are then combined into a consistency score in the evidence aggregation process.

Figure

Figure 3: The Amazon Mechanical Turk user interface for collecting human labels of SCUs’ presence.

Figure

Figure 1: The illustration of our metrics. This data example is from REALSumm (Bhandari et al., 2020) (we omit unnecessary content by ‘...’). For gold labels, ‘1’ stands for ‘present’ and ‘0’ stands for ‘not present’. Other scores are the 2-class entailment probabilities, p2c(e), from our finetuned NLI model.

Figure

Figure 4: A part of the Amazon Mechanical Turk webpage used for collecting summaries.

Figure

Figure 1: Our two-stage fact consistency assessment framework.

Figure

Figure 2: Number of entities in the generated summary from BART and ECC.

Figure

Figure 1: Visualization of teacher cross attention weights when generating pseudo labels with normal (left) and smoothed (right) attention weights. This example is generated by the BART teacher trained on CNNDM (see §4.4). Its training and inference hyperparameters are described in detail in §4.2.
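
A minimal sketch of attention-weight smoothing via a softmax temperature, which is the operation this figure contrasts (the temperature value is illustrative):

```python
# Minimal sketch: smooth cross-attention by re-normalizing the attention logits
# with a temperature T > 1 before the softmax (T = 1 recovers normal attention).
import torch

def smoothed_attention(attn_logits, temperature=2.0):
    # attn_logits: (..., num_source_tokens) unnormalized attention scores
    return torch.softmax(attn_logits / temperature, dim=-1)
```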

Figure

Figure 1: Example pipeline of our method to generate to-do items from email. Highlight sentences are extracted from the email text. Highlight action nodes are extracted from the constructed action graph. The highlight actions and sentences are then utilized to generate the to-do item.

Figure

Figure 2: Model Architecture. The sentences and actions are first encoded and then fed to the highlight classifiers. The hidden representations of sentences and actions, along with their probability of being highlights are then used in the cross-attention layer in the decoder. The email encoder has the same structure as BART encoder. The graph encoder utilizes graph attention networks to encode the action graph.

Figure

Figure 3: ROUGE scores when varying the value of k in the top k sentences extracted as highlights.

Figure

Figure 4: Example 2 of the visualization of cross-attention weights when the student generates summaries with different attention temperatures.

Figure

Figure 3: Example 1 of the visualization of cross-attention weights when the student generates summaries with different attention temperatures.

Figure

Figure 2: Distributions of evident cross-attention weights (≥ 0.15) with respect to token positions when teachers generate pseudo labels with different attention temperatures.

Figure

Figure 1: Entity Coverage Control for seq2seq model.

Figure

Figure 1: Architecture of HERMAN. Note that the binary classifier for predicting whether a summary is verified (z labels) is omitted here. It simply takes the context vectors of the summary tokens and runs them through an MLP classifier.

Figure

Figure 1: PACSUM’s performance against different values of λ1 on the NYT validation set with λ2 = 1. Optimal hyper-parameters (λ1, λ2, β) are (−2, 1, 0.6).
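
A simplified sketch of position-weighted centrality in the spirit of PACSUM; the default hyper-parameters mirror the optimum reported in the caption, but the thresholding details are a simplified reading rather than a faithful reimplementation:

```python
# Minimal sketch: threshold pairwise sentence similarities (controlled by beta)
# and weight backward- and forward-looking edges by lambda1 and lambda2 to
# score each sentence.
import numpy as np

def position_weighted_centrality(sim, lambda1=-2.0, lambda2=1.0, beta=0.6):
    sim = np.asarray(sim, dtype=np.float64)      # (n, n) sentence similarity matrix
    n = sim.shape[0]
    threshold = sim.min() + beta * (sim.max() - sim.min())
    e = sim - threshold                          # negative entries act as penalties
    scores = np.zeros(n)
    for i in range(n):
        scores[i] = lambda1 * e[i, :i].sum() + lambda2 * e[i, i + 1:].sum()
    return scores
```

With a negative λ1, similarity to earlier sentences is penalized, which is what the caption's optimum (λ1 = −2) encodes for news-style lead bias.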

Figure

Figure 1: An overview of our model which generates summaries with guiding entities. Instead of generating summaries from left to right, our approach can control the process of generation and incorporate selected entities into summaries precisely.

Figure

Figure 3: The enhanced abstract generator of our LSTM-L module. To make the model generate different entities, we encode all possible entities to initialize the LSTM-L generator. This can also guide the LSTM-R to generate different entities.

Figure

Figure 2: Our controllable neural model with guiding entities. The original article texts are encoded with a BiLSTM layer. We utilize a pretrained BERT named entity recognition tool to extract entities from input texts. The decoder consists of two LSTMs: LSTM-L and LSTM-R. Our model starts generating the left and right part of a summary with selected entities and can guarantee that entities appear in final output summaries.

Figure

Figure 4: The total entity (a) and novel entity (b) proportion of our model outputs compared to different baselines on Gigaword dataset. Our model can generate summaries with significantly more entities than other methods.

Figure

Figure 2: The position and content attribution on CNN/DM, the test set is broken down based on COMPRESSION.

Figure

Figure 2: Results of different document encoders with Pointer on normal and shuffled CNN/DailyMail. ∆R denotes the decrease of performance when the sentences in document are shuffled.

Figure

Figure 1: Different behaviours of two decoders (SeqLab and Pointer) under different testing environment. (a) shows repetition scores of different architectures when extracting six sentences on CNN/DailyMail. (b) shows the relationship between ∆R and positional bias. The abscissa denotes the positional bias of six different datasets and ∆R denotes the average ROUGE difference between the two decoders under different encoders. (c) shows average length of k-th sentence extracted from different architectures.

Figure

Figure 1: The accuracy on CNN/DM dataset, test set is broken down based on P-Value and C-Value.

Figure

Figure 3: ∆(D) for different datasets.

Figure

Figure 1: MATCHSUM framework. We match the contextual representations of the document with gold summary and candidate summaries (extracted from the document). Intuitively, better candidate summaries should be semantically closer to the document, while the gold summary should be the closest.
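
A minimal sketch of the matching step: score candidates by cosine similarity between candidate and document embeddings and return the closest one (the embeddings are assumed to come from a Siamese encoder, which is not shown here):

```python
# Minimal sketch: pick the candidate summary whose embedding is semantically
# closest to the document embedding.
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, dtype=np.float64), np.asarray(b, dtype=np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def best_candidate(doc_embedding, candidate_embeddings):
    scores = [cosine(doc_embedding, c) for c in candidate_embeddings]
    return int(np.argmax(scores)), scores
```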

Figure

Figure 2: Distribution of z(%) on six datasets. Because the number of candidate summaries for each document is different (short text may have relatively few candidates), we use z / number of candidate summaries as the X-axis. The Y-axis represents the proportion of the best-summaries with this rank in the test set.

Figure

Figure 5: ψ of different datasets. Reddit is excluded because it has too few samples in the test set.

Figure

Figure 4: Dataset splitting experiment. We split the test sets into five parts according to z described in Section 3.2. The X-axis from left to right indicates the subsets of the test set with the value of z from small to large, and the Y-axis represents the ROUGE improvement of MATCHSUM over BERTEXT on each subset.

Figure

Figure 4: ROUGE-2 F1 score on different groups of input sentences in terms of their length for s2s+att baseline and our SEASS model on English Gigaword test sets.

Figure

Figure 2: Overview of the Selective Encoding for Abstractive Sentence Summarization (SEASS).

Figure

Figure 3: Precision of extracted sentence at step t of the NN-SE baseline and the NEUSUM model.

Figure

Figure 1: Overview of the NEUSUM model. The model extracts S5 and S1 at the first two steps. At the first step, we feed the model a zero vector 0 to represent empty partial output summary. At the second and third steps, the representations of previously selected sentences S5 and S1, i.e., s5 and s1, are fed into the extractor RNN. At the second step, the model only scores the first 4 sentences since the 5th one is already included in the partial output summary.

Figure

Figure 2: Position distribution of selected sentences of the NN-SE baseline, our NEUSUM model and oracle on the test set. We only draw the first 30 sentences since the average document length is 27.05.

Figure

Figure 2: Examples of alignment results generated by our unsupervised method between the abstractive summaries and corresponding source sentences in the Gigaword test set.

Figure

Figure 1: Generative process of the contextual matching model.

Figure

Figure 2: The overview of the BERT-based model for sub-sentential extraction (SSE). In this simplified example, the document has 3 sentences. The first and the third sentences have two extraction units each and the second sentence has one. After encoding the document with the pre-trained BERT encoder, an average pooling layer is used to aggregate the information of each extraction unit. The final Transformer layer captures the document-level information and then the MLP predicts the extraction probability.

Figure

Figure 1: A screenshot example of the document-summary pair in the CNN/Daily Mail dataset.

Figure

Figure 1: Hierarchical Meeting Summary Network (HMNet) model structure. [BOS] is the special start token inserted before each turn, and its encoding is used in turn-level transformer encoder. Other tokens’ encodings enter the cross-attention module in decoder.

Figure

Figure 2: Percentage of novel n-grams in the reference and the summaries generated by HMNet and UNS (Shang et al., 2018) in AMI’s test set.

Figure

Figure 4: Qualitative analysis studying how the number of sections in a document affects model performance. The mean ROUGE and the ROUGE delta are reported.

Figure

Figure 2: Boundary distance distribution for summary-worthy sentences on the arXiv and PubMed datasets.

Figure

Figure 1: The model architecture of FASUM. It has L layers of transformer blocks in both the encoder and decoder. The knowledge graph is obtained from information extraction results and it participates in the decoder’s attention.

Figure

Figure 5: ROUGE scores for BART-LB on DUC2003 and DUC2004 under different hyper-parameters: beam width and maximum length of summary. Other hyper-parameters are set according to Table 3.

Figure

Figure 3: The detailed update process of word, sentence, and section nodes in HEROES. The figure is a toy example consisting of 3 sections, 7 sentences, and 5 unique words, where the vertical dashed lines are section boundaries. Green, blue, red nodes are word, sentence, section nodes involved in the update in this turn. Orange edges denote the direction of information flow.

Figure

Figure 2: Percentage of novel n-grams for summaries in XSum test set.

Figure

Figure 3: Ratio of novel n-grams, i.e. not in the article’s leading sentence, in summaries from BART, BART-LB, BARTFinetune and the reference summary in XSum’s test set.

Figure

Figure 2: The distribution of the overlapping ratio of nonstopping words between: (Red) the reference summary and the article; (Green) the reference summary and the article excluding the first 3 sentences, i.e. Rest; and (Blue) the leading 3 sentences, i.e. Lead-3, and Rest. The area under each curve is 1. All ratios are computed on CNN/DailyMail.

Figure

Figure 1: Average overlapping ratio of words between an article sentence and the reference summary, grouped by the normalized position of the sentence (the size of bin is 0.05). The ratio is computed on the training set of corresponding datasets.
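
A minimal sketch of the overlap statistic behind this plot, using naive whitespace tokenization purely for illustration:

```python
# Minimal sketch: for each article sentence, compute the ratio of its words
# that also appear in the reference summary, then group the ratios by the
# sentence's normalized position (bin size 0.05).
from collections import defaultdict

def overlap_by_position(article_sentences, summary, bin_size=0.05):
    summary_words = set(summary.lower().split())
    bins = defaultdict(list)
    n = len(article_sentences)
    for i, sent in enumerate(article_sentences):
        words = sent.lower().split()
        if not words:
            continue
        ratio = sum(w in summary_words for w in words) / len(words)
        position = i / n                      # normalized position in [0, 1)
        bins[round(position // bin_size * bin_size, 2)].append(ratio)
    return {pos: sum(r) / len(r) for pos, r in sorted(bins.items())}
```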

Figure

Figure 1: The overall workflow of HEROES. Contents with solid (resp. dashed) borders are retained (resp. filtered) by the content ranking module. Sentences selected by the extractive summarization module are marked in colors.

Figure

Figure 4: Averaged difference in ROUGE-1 score between the summaries from BART and BART-LB, grouped by the length of reference summary. For example, the first bin refers to the 0-20% percentile of articles with the shortest reference summary in the corresponding dataset.

Figure

Figure 3: Agreement trajectories averaged over 100 runs per topic in the TAC 2009 corpus.

Figure

Figure 1: Illustration of traditional evaluation models based on reference summaries (top) and the new model (bottom) which is based on pairwise preferences.

Figure

Figure 2: Example of a pairwise preference annotation of two sentences. The first sentence is preferred over the second sentence because the first sentence contains more important information given that the information is not already known.

Figure

Figure 1: Assume a document (x1, x2, · · · , x8) contains three sentences (i.e., SENT. 1, SENT. 2 and SENT. 3). A SEQ2SEQ Transformer model can be pre-trained with our proposed objective. It takes the transformed document (i.e., a shuffled document, the first segment of a document, or a masked document) as input and learns to recover the original document (or part of the original document) by generation. SR: Sentence Reordering; NSG: Next Sentence Generation; MDG: Masked Document Generation.
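
A minimal sketch of the three document transformations (SR, NSG, MDG) as (input, target) pair builders; masking here is done at sentence granularity and the split point is fixed, both simplifications that may differ from the paper's exact setup:

```python
# Minimal sketch: build (input, target) pairs for the three pre-training
# objectives: Sentence Reordering (SR), Next Sentence Generation (NSG) and
# Masked Document Generation (MDG). Mask token and ratios are illustrative.
import random

def sentence_reordering(sentences):
    shuffled = sentences[:]
    random.shuffle(shuffled)
    return shuffled, sentences                 # recover the original order

def next_sentence_generation(sentences, split_ratio=0.5):
    cut = max(1, int(len(sentences) * split_ratio))
    return sentences[:cut], sentences[cut:]    # generate the rest of the document

def masked_document_generation(sentences, mask_prob=0.15, mask_token="[MASK]"):
    masked = [mask_token if random.random() < mask_prob else s for s in sentences]
    return masked, sentences                   # recover the original document
```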