The gallery currently contains 1524 figures.
Click on an image to open the paper.
Figure

Figure 1: Consumer health questions and associated summaries from the gold standard. The entities in Red are the foci (main entities). The words in Blue and underlined are the triggers of the question types.

Figure

Figure 2: A summary generated by PG+M#2 method.

Figure

Figure 2: Distribution of #companies per SIC code

Figure

Figure 1: Architecture of our proposed system for summarization of 10-K documents, MHFinSum.

Figure

Figure 1: Examples of an earthquake-related article paired with extractive summaries from the CNN/DM dataset. “Generic” represents the selection of a general purpose summarization model. “Geo(graphy)” (colored in green) and “Recovery” (colored in orange) indicate our aspects of interest for the summary. We highlight aspect-relevant phrases in the document.

Figure

Figure 2: User interface for Turkers’ annotation.

Figure

Figure 1: Pipeline for Sem-nCG@k evaluation of extractive summarization task, CG@k stands for Cumulative Gain at kth position and ICG@k for Ideal CG@k.
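For readers unfamiliar with the metric, the normalization implied by this caption can be sketched as follows. This is an illustrative reconstruction, not the authors' code: it assumes per-sentence gain values are already available, and the helper names are made up for the example.

```python
def cumulative_gain(gains, k):
    """CG@k: sum of the gains of the first k ranked sentences."""
    return sum(gains[:k])

def sem_ncg_at_k(model_gains, all_gains, k):
    """Normalize CG@k of the model ranking by the Ideal CG@k (ICG@k)."""
    icg = cumulative_gain(sorted(all_gains, reverse=True), k)
    return cumulative_gain(model_gains, k) / icg if icg > 0 else 0.0

# Toy example: gains in model order vs. all available sentence gains.
print(sem_ncg_at_k([0.8, 0.3, 0.5], [0.9, 0.8, 0.5, 0.3, 0.1], k=3))
```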

Figure

Figure 2: ROUGE-2 and Sem-nCG scores for the 6 extractive summarization models on the CNN/DailyMail test dataset. The results are for the top-5 extracted sentences, reported for both the actual and perturbed outputs. The plot demonstrates the robustness of Sem-nCG against perturbation.

Figure

Figure 1: Observations on linked entities in summaries. O1: Summaries are mainly composed of entities. O2: Entities can be used to represent the topic of the summary. O3: Entity commonsense learned from a large corpus can be used.

Figure

Figure 2: Full architecture of our proposed sequence-to-sequence model with Entity2Topic (E2T) module.

Figure

Figure 3: Entity encoding submodule with selective disambiguation applied to the entity 3©. The left figure represents the full submodule while the right figure represents the two choices of disambiguating encoders.

Figure

Figure 3: Model architecture of PLANSUM. The content plan is constructed as the average of the aspect and sentiment probability distributions induced by the content plan induction model. It is then passed to the decoder, along with the aggregated token encodings to generate the summary.

Figure

Figure 2: Pseudo-summary y and input reviews X; the aspect code for summary y is room. Review sentences with the same aspect are underlined and the same aspect keywords are magnified.

Figure

Figure 1: Overview of the controller induction model. Token-level aspect predictions are aggregated into sentence-level predictions using a multiple instance pooling mechanism (described on the right). The process is repeated from sentence- to document-level predictions.

Figure

Figure 4: Yelp summaries generated by PLANSUM and its variants. Aspects also mentioned in the gold summary (not shown to save space) are in color, other aspects are italicized.

Figure

Figure 1: Yelp reviews about a local bar and corresponding summary. Aspect-specific opinions are in color (e.g., drinks, guests, staff), while less salient opinions are shown in italics.

Figure

Figure 1: Three out of 150 reviews for the movie “Coach Carter”, and summaries written by the editor, and generated by a model following the EXTRACT-ABSTRACT approach and the proposed CONDENSE-ABSTRACT framework. The latter produces more informative and factual summaries while allowing control over aspects of the generated summary (such as the acting or plot of the movie).

Figure

Figure 2: Illustration of EA and CA frameworks for opinion summarization. In the CA framework, users can obtain need-specific summaries at test time (e.g., give me a summary focusing on acting).

Figure

Figure 2: Model architecture of our content plan induction model. The dotted line indicates that a reverse gradient function is applied.

Figure

Figure 2: Overview of our Citation Graph-Based Model (CGSUM). A denotes the source paper (w/o abstract). B, C, D and E denote the reference papers. The body text of A and the abstract of reference papers are fed into the document encoder, and then used to initialize the node features in the graph encoder. Neighbor extraction method will be used to extract a more relevant subgraph. While decoding, the decoder will pay attention to both the document and the citation graph structure.

Figure

Figure 4: Relationships between the degree of source paper nodes (X-axis) and R̃ (the average of ROUGE-1, ROUGE-2 and ROUGE-L) of two models: CGSUM + 1-hop neighbors and PTGEN + COV (inductive setting).

Figure

Figure 3: Different ways of splitting training, validation, test sets from the whole graph. We omit the directionality of the edges for simplification. The green, orange, cyan nodes represent papers from the training, validation, test set.

Figure

Figure 1: A small research community on the subject of Weak Galerkin Finite Element Method. Green text indicates the domain-specific terms shared in these papers, orange text denotes different ways of writing the same sentences, blue text represents the definition of Weak Galerkin Finite Element Method (does not appear in the source paper).

Figure

Figure 1: A comparison between two-stage models and COLO. The two-stage models involve two training stages and a time-consuming preprocessing step, while COLO is trained in an end-to-end fashion (GPU and CPU hours for each stage are shown in Table 6). Two-stage models use offline sampling to build positive-negative pairs, while COLO builds positive-negative pairs with online sampling, drawing these pairs directly from a changing model distribution.

Figure

Figure 2: Architecture of our extractive model. Input sequence: the ‘[doc]’ token is used to get the vector representation zX of the document X, and ‘[sep]’ is used as a separator for sentences. We omit the classifier and the BCELoss. hi is the sentence embedding of the i-th sentence in X. zCi denotes the feature representation of the i-th candidate.

Figure

Figure 4: Test inference time with beam size for abstractive model. We use the maximum batch size allowed by GPU memory.

Figure

Figure 3: Inference speed on CNN/DM (extractive). We use the candidate size |C| as the X-axis. The Y-axis represents the number of samples processed per second. batch=MAX means we use the maximum batch size allowed by GPU memory.

Figure

Figure 5: t-SNE visualization of two examples from the CNN/DM test set. We divide the candidates into 3 groups based on ROUGE score: candidates ranking 1–50, 51–100, and 101–150. The red point denotes the anchor, and the purple/cyan/gray points denote the top 50/100/150 candidates, respectively.

Figure

Figure 1: Aspect-based opinion summarization. Opinions on image quality, sound quality, connectivity, and price of an LCD television are extracted from a set of reviews. Their polarities are then used to sort them into positive and negative, while neutral or redundant comments are discarded.

Figure

Figure 3: Human and system summaries for a product in the Televisions domain.

Figure

Figure 2: Multi-Seed Aspect Extractor (MATE).

Figure

Figure 3: Sentence ranking via two-step sampling. In this toy example, each sentence (s1 to s5) is assigned to its nearest code (k = 1, 2, 3), as shown by thick purple arrows. During cluster sampling, the probability of sampling a code (top right; shown as blue bars) is proportional to the number of assignments it receives. For every sampled code, we perform sentence sampling; sentences are sampled, with replacement, according to their proximity to the code’s encoding. Samples from codes 1 and 3 are shown in black and red, respectively.
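The two-step sampling described in this caption can be sketched roughly as below. This is not the paper's implementation; it assumes code embeddings, sentence embeddings, and code assignments are given, and the proximity-to-probability mapping (a softmax over negative distances) is one plausible choice.

```python
import numpy as np

def two_step_sample(code_embs, sent_embs, assignments, n_samples, rng=np.random):
    """Cluster sampling followed by sentence sampling, as in the caption."""
    n_codes = len(code_embs)
    # Cluster sampling: a code's probability is proportional to the number
    # of sentences assigned to it.
    counts = np.bincount(assignments, minlength=n_codes).astype(float)
    code_probs = counts / counts.sum()
    sampled = []
    for _ in range(n_samples):
        k = rng.choice(n_codes, p=code_probs)
        members = np.where(assignments == k)[0]
        # Sentence sampling (with replacement): sentences closer to the
        # code's encoding are more likely to be drawn.
        dists = np.linalg.norm(sent_embs[members] - code_embs[k], axis=1)
        probs = np.exp(-dists) / np.exp(-dists).sum()
        sampled.append(rng.choice(members, p=probs))
    return sampled
```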

Figure

Figure 5: t-SNE projection of the quantized space of an eight-head QT trained on SPACE, showing all 1024 learned latent codes (best viewed in color). Darker codes correspond to lower aspect entropy, our proposed measure of high aspect-specificity. Zooming in on the aspect sub-space uncovers good aspect separation.

Figure

Figure 1: A sentence is encoded into a 3-head representation and head vectors are quantized using a weighted average of their neighboring code embeddings. The QT model is trained by reconstructing the original sentence.

Figure

Figure 4: Aspect opinion summarization with QT. The aspect-encoding sub-space is identified using mean aspect entropy and all other sub-spaces are ignored (shown in gray). Two-step sampling is restricted only to the codes associated with the desired aspect (shown in red).

Figure

Figure 2: General opinion summarization with QT. All input sentences for an entity are encoded using three heads (shown in orange, blue, and green crosses). Sentence vectors are clustered under their nearest latent code (gray circles). Popular clusters (histogram) correspond to commonly occurring opinions, and are used to sample and extract the most representative sentences.

Figure

Figure 6: Mean aspect entropies (bars) for each of QT’s head sub-spaces and corresponding aspect ROUGE-1 scores for the summaries produced by each head (line).

Figure

Figure 2: A Transformer-based encoder-decoder architecture with FAME.

Figure

Figure 3: Top 40 sentence pieces and their logits from topic distribution tX in ROBFAME and PEGFAME for the XSUM article discussed in Figure 1.

Figure

Figure 4: ROUGE-1 F1 scores of ROBFAME and PEGFAME models with different top-k vocabularies (Eq. (5)) on the XSUM test set. Similar patterns are observed for ROUGE-2 and ROUGE-L scores.

Figure

Figure 5: A 2010 BBC article from the XSUM test set, its human-written summary, and model predictions from ROBERTAS2S and PEGASUS, with and without FAME. The text in orange is not supported by the input article.

Figure

Figure 7: FAME model predictions with Focussample,k (k = 10000). The text in orange is not supported by the input article.

Figure

Figure 6: Model predictions with focus sampling Focustop,k, a controlled generation setting. The text in orange is not supported by the input article. We note that with smaller values of k, both ROBERTAS2S-based and PEGASUS-based models tend to hallucinate more often.

Figure

Figure 1: Block A shows the best predictions from PEGASUS and our PEGFAME (PEGASUS with FAME) model, along with the GOLD summary for an XSUM article. Block B presents diverse summaries generated from PEGASUS using top-k and nucleus sampling. Block C shows diverse summaries generated using our PEGFAME model with Focus sampling. The text in orange is not supported by the input article.

Figure

Figure 1: System framework. The model uses an extractive summary as a document surrogate to answer important questions about the document. The questions are automatically derived from the human abstract.

Figure

Figure 1: A unidirectional LSTM (blue, Eq. (3)) encodes the partial summary, while the multilayer perceptron network (orange, Eq. (4-5)) utilizes the text unit representation (het ), its positional embedding (gt), and the partial summary representation (st) to determine if the t-th text unit is to be included in the summary. Best viewed in color.

Figure

Figure 1: An illustration of label smoothing. Words aligned to the abstract are colored orange; gap words are colored turquoise.

Figure

Figure 2: Summarization results using the f1_LSTM or f2_CNN encoder with word/chunk as the extraction unit.

Figure

Figure 1: The overall architecture of the extractor network.

Figure

Figure 6: Impact of the number of fine-tuning epochs of the BART-b + T-ID model on ROUGE-L performance.

Figure

Figure 1: A topical summarization example, summarizing a sample document with respect to economy and climate topics.

Figure

Figure 3: Comparison of per-topic normalized counts of NEWTS test documents versus CNN/DailyMail counts.

Figure

Figure 4: Comparison of per-topic normalized counts of the training documents of our dataset versus CNN/DailyMail.

Figure

Figure 2: The step-by-step process of building the NEWTS dataset.

Figure

Figure 1: StructSum incorporates Latent Structure (LS, §2.2) and Explicit Structure (ES, §2.3) attention to produce structure-aware representations. Here, StructSum augments the Pointer-Generator model, but the methodology we propose is general and can be applied to other encoder-decoder summarization systems.

Figure

Figure 2: Comparison of % Novel n-grams between StructSum, Pointer-Generator+Coverage and the Reference. Here, “sent” indicates full novel sentences.

Figure

Figure 3: Coverage of source sentences in the summary. Here the x-axis is the sentence position in the source article and the y-axis shows the normalized count of sentences in that position copied to the summary.

Figure

Figure 4: Examples of induced structures and generated summaries.

Figure

Figure 1: A toy example of the Semantic Overlap Summarization (SOS) task (from multiple alternative narratives). Here, an abortion-related event has been reported by two news media (left-wing and right-wing). “Green” text denotes the information common to both news media, while “Blue” and “Red” text denote the unique perspectives of the left and right wing, respectively. Some real examples from the benchmark dataset are provided in Table 3.

Figure

Figure 2: Example of three-step summarization process: selecting, grouping and rewriting.

Figure

Figure 3: Architecture of the contextualized rewriter. The group tag embeddings are tied between the encoder (left figure) and the decoder (right figure), through which the decoder can attend to the corresponding tokens in the document.

Figure

Figure 7: Example of the ability to maintain coherence.

Figure

Figure 6: Example of the ability to reduce redundancy.

Figure

Figure 1: Example showing that contextual information can benefit summary rewriting.

Figure

Figure 4: Comparison of the ability to generate non-redundant summaries.

Figure

Figure 2: Training sample generation by mutation. Mu-

Figure

Figure 1: The weakly supervised training approach in this paper and the test of a trained model.

Figure

Figure 3: Training sample generation via cross pairing.

Figure

Figure 1: Argument coverage per number of key points.

Figure

Figure 2: Precision/recall trade-off for different key point selection policies. For each method, the highest F1 score, as well as the F1 score for the chosen threshold, is specified. For the Best Match + Threshold policy, these two scores coincide.
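As a rough illustration of how such a threshold is typically chosen, the sketch below sweeps candidate thresholds and reports the one with the highest F1. It is a generic sketch, not the authors' selection policy; the array names are hypothetical.

```python
import numpy as np

def best_f1_threshold(scores, labels, thresholds):
    """Sweep candidate thresholds and report the one with the highest F1,
    mirroring the precision/recall trade-off shown in the figure.
    scores: match scores per candidate; labels: boolean relevance labels."""
    best = (-1.0, None)
    for t in thresholds:
        pred = scores >= t
        tp = np.sum(pred & labels)
        precision = tp / max(pred.sum(), 1)
        recall = tp / max(labels.sum(), 1)
        f1 = 2 * precision * recall / max(precision + recall, 1e-9)
        best = max(best, (f1, t))
    return best  # (highest F1, chosen threshold)
```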

Figure

Figure 4. DUC 2005-7 vs. QCFS Dataset Structure

Figure

Figure 7. Comparison of Retrieval-Based Algorithms' Performance on the TD-QFS Dataset

Figure

Figure 5. ROUGE-Recall results of KLSum on relevance-filtered subsets of the TD-QFS dataset compared to DUC datasets.

Figure

Figure 2. Two-stage QFS Scheme.

Figure

Figure 6. Comparison of QFS and Non-QFS Algorithms' Performance on the TD-QFS Dataset

Figure

Figure 1. ROUGE: Comparing QFS methods to generic summarization methods. Biased-LexRank is not significantly better than generic algorithms.

Figure

Figure 3. Comparing Retrieval Components on DUC 2005

Figure

Figure 10: Coverage (x-axis) vs Kendall’s τ (y-axis) for all metric pairs on the CNNDM dataset.

Figure

Figure 3: System-level Pearson correlation with humans on top-k systems (Sec. 4.2).

Figure

Figure 7: Abstractiveness (x-axis) vs Kendall’s τ (y-axis) for all metric pairs on the TAC dataset.

Figure

Figure 9: Coverage (x-axis) vs Kendall’s τ (y-axis) for all metric pairs on the TAC dataset.

Figure

Figure 4: F1-Scores with bootstrapping (Sec. 4.3).

Figure

Figure 5: Ease of Summarization (x-axis) vs Kendall’s τ (y-axis) for all metric pairs on the TAC dataset.

Figure

Figure 2: Disagreement between metrics on TAC and CNNDM.

Figure

Figure 1: Effect of different properties of reference summaries. We only show correlation between BERTScore and ROUGE-2 due to limited pages. The trend is similar for all other pairs as shown in the appendix. The plots for CNNDM are more dense because of more documents in the CNNDM test set as compared to TAC. “Cov” and “Abs” stand for Coverage and Abstractiveness respectively. The trend lines in red are the 10 point and 100 point moving average for TAC and CNNDM respectively.

Figure

Figure 8: System-level Kendall correlation with humans on top-k systems.

Figure

Figure 2: System-level Pearson correlation between metrics and human scores (Sec. 4.1).

Figure

Figure 6: Ease of Summarization (x-axis) vs Kendall’s τ (y-axis) for all metric pairs on the CNNDM dataset.

Figure

Figure 1: p-value of William’s Significance Test for the hypothesis “Is the system on the left (y-axis) significantly better than the system on top (x-axis)”. ‘BScore’ refers to BERTScore and ‘MScore’ refers to MoverScore. A dark green value in cell (i, j) denotes that metric mi has a significantly higher Pearson correlation with human scores compared to metric mj (p-value < 0.05). ‘-’ in cell (i, j) refers to the case when the Pearson correlation of mi with human scores is less than that of mj (Sec. 4.1).

Figure

Figure 6: System-level Kendall correlation between metrics and human scores.

Figure

Figure 5: Pearson correlation between metrics and human judgments across different datasets (Sec. 4.4).

Figure

Figure 8: Abstractiveness (x-axis) vs Kendall’s τ (y-axis) for all metric pairs on the CNNDM dataset.

Figure

Figure 7: Kendall correlation between metrics and human judgements across different datasets.

Figure

Figure 4: F′/N ratio between metrics on TAC and CNNDM.

Figure

Figure 3: F/N ratio between metrics on TAC and CNNDM.

Figure

Figure 1: The patterns of the main (red) and three auxiliary tasks (green). The solid line denotes the concatenation of the document and the corresponding question, which is the input of our model; the dashed line denotes the corresponding answer for each input. All tasks are in a "text-to-text" format.

Figure

Figure 2: We insert adapter TSi into the original BART encoder layers in a serial manner. Each task has a unique adapter. The red rectangle denotes the adapter for the main task.

Figure

Figure 4: We compare the performance of our multi-task model with the baseline during training. We calculate the average loss of the model at each step on the training set and validation set. The figure shows that our model can alleviate the overfitting problem.

Figure

Figure 3: How the weights of all the auxiliary tasks change during training on 100 samples. The weights of the ext and spo tasks are almost monotonically increasing. The weight of the nli task declines, except for a short period of ascent at the beginning.

Figure

Figure 1: GenCompareSum pipeline. (a) We split the document into sentences. (b) We combine these sentences into sections of several sentences. (c) We feed each section into the generative text model and generate several text fragments per section. (d) We aggregate the questions, removing redundant questions by using n-gram blocking. Where aggregation occurs, we apply a count to represent the number of textual fragments which were combined and use this as a weighting going forwards. The highest weighted textual fragments are then selected to guide the summary. (e) The similarity between each sentence from the source document and each selected textual fragment is calculated using BERTScore. (f) We create a similarity matrix from the scores calculated in the previous step. These are then summed over the textual fragments, weighted by the values calculated in step (d), to give a score per sentence. (g) The highest scoring sentences are selected to form the summary.
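Steps (e)-(g) amount to a weighted aggregation of sentence-fragment similarities. The sketch below is a loose reconstruction under the assumption that a similarity function (e.g., BERTScore F1) is supplied by the caller; all names are illustrative, not the paper's code.

```python
import numpy as np

def score_sentences(sentences, fragments, weights, similarity):
    """Steps (e)-(f): build a sentence x fragment similarity matrix, then
    weight each fragment column by its count from step (d) and sum."""
    sim = np.array([[similarity(s, f) for f in fragments] for s in sentences])
    return sim @ np.asarray(weights, dtype=float)

def select_summary(sentences, scores, n):
    """Step (g): keep the n highest-scoring sentences, in document order."""
    top = sorted(np.argsort(scores)[::-1][:n])
    return [sentences[i] for i in top]
```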

Figure

Figure 2: Representations of the two training settings of the T5 encoder-decoder model. The left diagram shows the unsupervised pretraining task, in which a tokenized text containing masked spans is passed to the encoder and the output target of the decoder is the prediction of the masked spans. The right diagram shows the supervised downstream task, where the pre-trained model is finetuned on pairs of tokenized sequences.

Figure

Figure 3: Data volume over time for each topic

Figure

Figure 6: Classification accuracies for 21 pairs of summaries. (a) Automatic classification using prototypes (by SVM) on the entire test set. The green avg SVM line is the mean accuracy of SVMs trained on the entire training set. (b) Automatic classification evaluated on 6 test articles per pair. (c) Human classification accuracy on 6 test articles per pair.

Figure

Figure 5: An example questionnaire used for crowd-sourced evaluation. It consists of: (a) instructions, (b) two groups of summaries, (c) question articles, and (d) a comment box for feedback. See §6.3.

Figure

Figure 1: An illustrative example of comparative summarisation. Squares are news articles, rows denote different news outlets, and the x-axis denotes time. The shaded articles are chosen to represent AI-related news during February and March 2018, respectively. They aim to summarise topics in each month, and also highlight differences between the two months.

Figure

Figure 4: Comparative summarisation methods evaluated using the balanced accuracy of 1-NN (left) and SVM (right) classifiers. Each row represents a dataset. Error bars show 95% confidence intervals.

Figure

Figure 1: Distributions of some metrics/rewards for summaries with different human ratings. Among the four presented metrics/rewards, only BERT+MLP+Pref (the rightmost sub-figure) does not use reference summaries.

Figure

Figure 4: Information about the length of the generated summaries for the CNN/DM dataset.

Figure

Figure 3: Information about the length of the generated summaries for the CASS dataset.

Figure

Figure 1: Training of the model. The blocks present steps of the analysis. All the elements above the blocks are inputs (document embedding, sentences embeddings, threshold, real summary embedding, trade-off).

Figure

Figure 2: Processing time of the summarization function (y-axis) by the number of lines of the text as input (x-axis). Results computed on an i7-8550U.

Figure

Figure 2: Production of latent code zN for review rN .

Figure

Figure 1: Unfolded graphical representation of the model.

Figure

Figure 1: Illustration of the FEWSUM model that uses the leave-one-out objective. Here, prediction of the target review ri is performed by conditioning on the encoded source reviews r−i. The generator attends to the last encoder layer’s output to extract common information (in red). Additionally, the generator has partial information about ri passed by the oracle q(ri, r−i).

Figure

Figure 2: Architecture of the prior score function.

Figure

Figure 1: The SELSUM model is trained to select and summarize a subset of relevant reviews r̂1:K from a full set r1:N using the approximate posterior qφ(r̂1:K |r1:N , s). To yield review subsets in test time, we fit and use a parametrized prior pψ(r̂1:K |r1:N ).

Figure

Figure 1: Illustration of the proposed approach. In Stage 1, all parameters of a large language model are pretrained on generic texts (we use BART). In Stage 2, we pre-train adapters (5% of the full model’s parameters) on customer reviews using held-out reviews as summaries. In Stage 3, we fine-tune the adapters on a handful of reviews-summary pairs.

Figure

Figure 2: Illustration of the query-based summarizer that inputs reviews and a text query consisting of aspects, such as ‘volume,’ ‘price,’ and ‘bluetooth.’ The query is automatically created from gold summaries in training and reviews in test time.

Figure

Figure 3: Two example TLDR-Auth and TLDR-PR pairs with colored spans corresponding to nuggets in Table 3

Figure

Figure 4: Training regimen for CATTS.

Figure

Figure 2: Example of a reviewer comment rewritten as a TLDR (best viewed in color). A peer review comment often begins with a summary of the paper which annotators use to compose a TLDR. Annotators are trained to preserve the original reviewer’s wording when possible (indicated by colored spans), and to avoid using

Figure

Figure 1: An example TLDR of a scientific paper. A TLDR is typically composed of salient information (indicated by colored spans) found in the abstract, intro, and conclusion sections of a paper.

Figure

Figure 4: Summaries are rated by medical practitioners along the dimensions of adequacy, faithfulness, readability and ease of revision. Their ratings are averaged for each summary.

Figure

Figure 1: An example after-visit summary generated from EHR notes associated with a patient. A novel alerting mechanism is proposed in this work to report errors found in the summary, including missing medical events and hallucinated facts. We aim to build effective detectors with self-supervision on unlabeled data for error alerting.

Figure

Figure 2: One or two event nuggets are randomly masked out from a summary sentence (a). The masked sequence (b) is fed to a denoising auto-encoder to produce a synthesized sentence that may contain hallucinated medical events (c).

Figure

Figure 3: “abdominal pain” appears in both the clinical document and after-visit summary, with the same CUI. “nausea vomiting” and “nauseous” are aligned because there is an is-a relation between the two concepts.

Figure

Figure 1: Generation of sentence and document cluster embeddings. “⊕” stands for a pooling operation, while “⊗” represents a relevance measurement function.

Figure

Figure 1: A dependency tree example. The meaning of the dependency labels can be found in De Marneffe and Manning (2008). We extract the following two fact descriptions: taiwan share prices opened lower tuesday ||| dealers said

Figure

Figure 2: Model framework

Figure

Figure 2: Jointly Rerank and Rewrite

Figure

Figure 3: Gates change during training.

Figure

Figure 1: Flow chart of the proposed method. We use a dashed line for Retrieve since there is an embedded IR system.

Figure

Figure 5: Summaries use different portions of error correction operations. Contrastive learning with SYSLOWCON (CL.SLC) and BATCH (CL.B) substitute errors with correct content more often than unlikelihood training with MASKENT and ENTAILRANK.

Figure

Figure 7: Probability distributions of generating the non-first tokens of proper nouns, numbers, nouns, and verbs, grouped by extrinsic errors, intrinsic errors, world knowledge, and other correct tokens. Non-first tokens do not exist for numbers and verbs, as they only contain single tokens.

Figure

Figure 3: Probability distributions of generating the first tokens of proper nouns and numbers, grouped by extrinsic errors, intrinsic errors, world knowledge, and other correct tokens.

Figure

Figure 6: ROUGE-1 improvement with oracle masks for each head at each layer on the analysis sets of XSum and NYT.

Figure

Figure 1: Illustration of attention head masking (m̃).

Figure

Figure 2: ROUGE-1 F1 improvement with oracle masks for each head at each layer on the analysis set of CNN/DM. Overall, top layers see greater improvement than bottom layers. Layer 1 is the bottom layer connected with the word embeddings.

Figure

Figure 4: Portions of summaries with errors. CL models consistently reduce both types of errors.

Figure

Figure 9: Percentages of COPY SALIENT, NON-COPY SALIENT, COPY CONTENT, NON-COPY CONTENT, FIRST and LAST attendees for each head at each layer on the analysis set of NYT.

Figure

Figure 11: Sample generated summaries by fine-tuned BART models. Intrinsic errors are highlighted in red

Figure

Figure 10: Guideline for our human evaluation (§ 7.2).

Figure

Figure 9: Guideline for our summary error annotation (§ 4).

Figure

Figure 7: [Left] ROUGE-1 F1 improvement by incrementally applying oracle masking to the next head with most ROUGE improvement per layer on XSum and NYT. Dotted lines indicate that the newly masked heads do not have individual ROUGE improvements. [Right] ROUGE-1 recall improvement by masking all heads vs. sum of improvement by masking each head separately on XSum and NYT. Better displayed with color.

Figure

Figure 8: Percentages of samples containing world knowledge as labeled by humans on the outputs of XSum and CNN/DM.

Figure

Figure 8: Percentages of COPY SALIENT, NON-COPY SALIENT, COPY CONTENT, NON-COPY CONTENT, FIRST and LAST attendees for each head at each layer on the analysis set of XSum.

Figure

Figure 5: Results on CNN/DM with different sizes of training data. Our method consistently improves the summarizer.

Figure

Figure 10: Percentages of NON-COPY CONTENT and LAST attendees for each head at each layer on the analysis set of CNN/DM.

Figure

Figure 1: Sample article and system summaries by different methods. Our contrastive learning model trained on low confidence system outputs correctly generates

Figure

Figure 6: Probability distributions of generating the first tokens of nouns and verbs, grouped by extrinsic errors, intrinsic errors, world knowledge, and other correct tokens. No verb is annotated as world knowledge.

Figure

Figure 3: [Left] ROUGE-1 F1 improvement by incrementally applying oracle masking to the next head with most ROUGE improvement per layer on CNN/DM. Dotted lines indicate that the newly masked heads do not have individual ROUGE improvements. [Right] ROUGE-1 recall improvement by masking all heads vs. sum of improvement by masking each head separately on CNN/DM. Best viewed in color.

Figure

Figure 4: COPY and NON-COPY SALIENT attendee word percentages on the analysis set of CNN/DM. Top layers focus on words to be “copied”, while bottom layers attend to the broader salient context.

Figure

Figure 2: Percentage of samples with intrinsic and extrinsic error spans for models fine-tuned from BART and PEGASUS on XSum and CNN/DM.

Figure

Figure 1: The factuality and ROUGE score trade-off curve on XSUM. We use different reward value rnfe for our approach and different drop rate c for the loss truncation baseline.

Figure

Figure 8: Question-summary hierarchy annotation instructions. (Page 2 / 4)

Figure

Figure 2: The distribution of entities over prior/posterior probability. Each point in the figure represents an entity (pprior(ek), ppos(ek)) and shading indicates the confidence of the classifier. (a) The distribution of entities; (b) The entity factuality classification results with KNN (k = 20) classifier. Both factual hallucinated and non-hallucinated entities are colored blue; (c) The KNN (k = 20) classification boundaries of hallucinated and non-hallucinated entities.

Figure

Figure 11: Human evaluation guidelines.

Figure

Figure 6: Visualization of hierarchical biases in HIBRIDS-ENC (left) and HIBRIDS-DEC (right) on GOVREPORT. Positive and negative values are shaded in

Figure

Figure 12: Screenshot of the human evaluation interface.

Figure

Figure 6: Entity distribution over posterior probabilities from CMLMXSUM and CMLMCNN/DM. The shading shows the classification boundaries of the classifier.

Figure

Figure 4: Results on full summary generation. In each subfigure, the left panel includes models for comparisons and the right panel shows our models. HIBRIDS on either encoder and decoder uniformly outperforms the comparisons on both datasets.

Figure

Figure 1: The question-summary hierarchy annotated for sentences in a reference summary paragraph. Summarization models are trained to generate the question-summary hierarchy from the document, which signifies the importance of encoding the document structure. For instance, to generate the follow-up question-summary pairs of Q1.1 and A1.1 from A1, it requires the understanding of both the content and the parent-child and sibling relations among §3, §3.1, and §3.4.

Figure

Figure 5: Evaluation of PEGASUSLARGE trained on datasets with different levels of noise.

Figure

Figure 2: Example path lengths and level differences (right) that encode the relative positions with regard to the document tree structure (left). Each query/key represents a block of tokens that belong to the same

Figure

Figure 8: ROC curve of entity’s posterior probability and factuality.

Figure

Figure 3: Sample output by the hierarchical encoding model (HIERENC) and HIBRIDS-ENC. Our generated structure, with the constructed follow-up questions to Q1 (highlighted in green), makes more sense than that of the comparison model HIERENC.

Figure

Figure 10: Question-summary hierarchy annotation instructions. (Page 4 / 4)

Figure

Figure 7: Posterior probabilities calculated from CLM and CMLM. Both models are trained on XSUM dataset.

Figure

Figure 5: Visualization of hierarchical biases in HIBRIDS-ENC (left) and HIBRIDS-DEC (right) on QSGen-

Figure

Figure 1: Illustration of deep communicating agents presented in this paper. Each agent a and b encodes one paragraph in multiple layers. By passing new messages through multiple layers the agents are able to coordinate and focus on the important aspects of the input text.

Figure

Figure 4: The average ROUGE-L scores for summaries that are binned by each agent’s average attention when generating the summary (see Section 5.2). When the agents contribute equally to the summary, the ROUGE-L score increases.

Figure

Figure 3: Multi-agent encoder message passing. Agents b and c transmit the last hidden state output (I) of the current layer k as a message, which is passed through an average pool (Eq. 6). The receiving agent a uses the new message z (k) a as additional input to its next layer.

Figure

Figure 2: Multi-agent encoder-decoder overview. Each agent a encodes a paragraph using a local encoder followed by multiple contextual layers with agent communication through concentrated messages z(k)a at each layer k. Communication is illustrated in Figure 3. The word context vectors cta are condensed into the agent context c∗t. Agent-specific generation probabilities, pta, enable voting for the suitable out-of-vocabulary words (e.g., ’yen’) in the final distribution.

Figure

Figure 2: Overall model architecture consisting of (M1) a shared text encoder, (M2) a summary decoder, and (M3) a dual-view sentiment classification module. The shared text encoder converts the input review text into a memory bank. Based on the memory bank, the summary decoder generates the review summary word by word and receives a summary generation loss. The source-view (summary-view) sentiment classifier uses the memory bank (hidden states) from the encoder (decoder) to predict a sentiment label for the review (summary), and it receives a sentiment classification loss. An inconsistency loss is applied to penalize disagreement between the source-view and summary-view sentiment classifiers.

Figure

Figure 1: An example of truncated review and its corresponding summary and sentiment label.

Figure

Figure 6: Sample summaries generated by our D.GPT2+CMDP model using different length bins on the DUC-2002 testing set.

Figure

Figure 1: A sample document and three summaries generated by our entity-controlled model based on DistilGPT2 (Sanh et al., 2019) and fine-tuned by our proposed method. Each summary corresponds to the requested entity inside the pair of brackets.

Figure

Figure 2: Bin % of different models with different specified length bins on the DUC-2002 dataset. Our framework improves the bin % of PG and D.GPT2 for bin 4, 7, and 10 by a wide margin.

Figure

Figure 4: Results of entity-controlled models for entities in different document sentences. Our CMDP framework consistently improves the QA′-F1 and appear % for entities at different positions.

Figure

Figure 3: Values of costs (c) and Lagrangian multipliers (λ) of PG+CMDP for length control on every checkpoint (4k iterations) during training. Each value is averaged over 4k iterations.

Figure

Figure 5: Sample summaries generated by our D.GPT2+CMDP model with abstractiveness bin 1, 2, and 3 on the Newsroom-b testing set. Extractive fragments in summaries are in blue color. Factual errors are in red color.

Figure

Figure 1: Distribution of lexical formality (Equation 6) in CNN/Daily Mail, Reddit and the complete dataset. Positive values on the X-axis indicate high formality and negative values indicate informality.

Figure

Figure 2: Distribution of formality scores of the generated summaries. Y-Axis: Fraction of datapoints; X-Axis: Intervals of formality scores.

Figure

Figure 3: RL learning curve.

Figure

Figure 2: Reinforced training of the extractor (for one extraction step) and its interaction with the abstractor. For simplicity, the critic network is not shown. Note that all d’s and st are raw sentences, not vector representations.

Figure

Figure 3: The predicted extraction probabilities for each sentence, calculated from the output of each iteration.

Figure

Figure 5: Example from the dataset showing the generated summary of our best models. The colored (marked) sentences correspond to our extractor’s sentence selection. The listed ROUGE scores are computed for this specific example.

Figure

Figure 1 Our QA-based semantic evaluation system architecture

Figure

Figure 2: Relationship between the number of iterations and ROUGE score on the DailyMail test dataset with respect to the ground truth at 75 bytes.

Figure

Figure 2 Customizing a typical QA system for our evaluation approach

Figure

Figure 4 Comparison of our evaluation scores and human evaluation scores

Figure

Figure 4: Example from the dataset showing the generated summary of our best models. The colored (marked) sentences correspond to our extractor’s sentence selection. The listed ROUGE scores are computed for this specific example.

Figure

Figure 1: Model structure: there is one encoder, one decoder and one iterative unit (used to polish the document representation) in each iteration. The final labeling part generates the extraction probabilities for all sentences by combining the hidden states of the decoders in all iterations. We take a document consisting of three sentences as an example here.

Figure

Figure 3 The QA-based Summarization Evaluation Process Using DUC 2007 corpus

Figure

Figure 6: Distribution of model summary length (test set).

Figure

Figure 1: Ranking (descending order) of the current 11 top-scoring summarization systems (abstractive models are red while extractive ones are blue). Each system is evaluated based on three diverse evaluation methods: (a) averaging each system’s in-dataset ROUGE-2 F1 scores (R2) over five datasets; (b-c) evaluating systems using our designed cross-dataset measures: stiff-R2 and stable-R2 (Sec. 5). Notably, BERTmatch and BART are two state-of-the-art models for extractive and abstractive summarization respectively (highlighted by blue and red boxes).

Figure

Figure 4: Illustration of stiffness and stableness of ROUGE-1 F1 scores for various models. Yellow bars stand for extractive models and grey bars stand for abstractive models.

Figure

Figure 3: Bar chart for Tab. 4.

Figure

Figure 3: Characteristics of the test set for each dataset (the train set possesses almost the same properties and thus is not displayed here): coverage, copy length, novelty, sentence fusion score, repetition. Here we choose 2-grams to calculate the novelty and 3-grams for the repetition.

Figure

Figure 2: Different metrics characterized by a relation chart among generated summaries (Gsum), references (Ref) and input documents (Doc).

Figure

Figure 5: Illustration of stiffness and stableness of factuality scores for various models. Yellow bars stand for extractive systems and grey bars stand for abstractive systems.

Figure

Figure 2: Distribution of the similarity scores between summary and abstract according to Eq. 1.

Figure

Figure 5: Average percentage of novel n-grams in the generated summaries with the filtered training dataset RSCSUM-80.

Figure

Figure 1: Extractive fragment coverage and density distributions across the compared datasets, where n indicates the number of documents.

Figure

Figure 7: Statistical significance test on the values in Tab. 8. The four positions correspond to Grammaticality, Informativeness, Relevance and Overall Quality respectively, as shown in upper left box. “=” means no statistical difference, “>” means the row performs significantly better than the column at the significance level α = 0.05, whereas “6” indicates the same at α = 0.01.

Figure

Figure 8: Participants’ individual mean scores, by conditions.

Figure

Figure 4: Average percentage of novel n-grams in the generated summaries.

Figure

Figure 4: The performance of Ours(Fβ) on TAC-2010 under different λ, γ, M , and encoding models. When we change one of them, the others are fixed. The Pearson’s r and Spearman’s ρ are reported.

Figure

Figure 1: Overall framework of our method. w and s are the token-level and sentence-level representations. n and N (m and M ) are the token number and the sentence number of the summary (pseudo reference). For multidocument summary (i.e., K > 1), we compute relevance scores between the summary x and each document dk, and then average them as the final relevance score.

Figure

Figure 2: The gap of Spearman’s ρ between Ours(Fβ) and Ours(F1) on TAC-2011 for different |Set| and |Systems|. Positive gaps mean our Fβ can improve the performance while negative gaps indicate our Fβ degrades the performance. When changing one of them, the other is fixed. “all” means the full size is applied, i.e., 10 for |Set| and 50 for |Systems|.

Figure

Figure 5: Distributions of the reversed rank from SUPERT and Ours(Fβ) for bad and good summaries on TAC-2011. The bar in the middle indicates the median.

Figure

Figure 1: The overall accuracy performance of six representative factuality checkers.

Figure

Figure 3: Ablation studies for Ours(Fβ) on TAC datasets and Ours(F1) on CNNDM. “-CentralityW.” means that we remove the centrality weighting when computing relevance scores. “-HybridR.” means that we only utilize the token-level representations when calculating relevance and redundancy scores. “-Redundancy” indicates that we omit the redundancy score.

Figure

Figure 2: The overall accuracy performance of six representative factuality checkers.

Figure

Figure 4: An excerpt from anonymized SUMMSCREEN that corresponds to the instance in the Figure 1 in the main text. Character names are replaced with IDs that are permuted across episodes.

Figure

Figure 2: Left: TV show genres from ForeverDreaming. Right: TV show genres from TVMegaSite.

Figure

Figure 3: Two excerpts from SUMMSCREEN showing that generating summaries from TV show transcripts requires drawing information from a wide range of the input transcripts. We only show lines in the transcripts that are closely related to the shown parts of summaries. The number at the beginning of each line is the line number in the original transcript. For the first instance, we omit a few lines containing clues about the doctor taking pictures of the mansion at different times due to space constraints.

Figure

Figure 1: Excerpts from an example from SUMMSCREEN. The transcript and recap are from the TV show “The Big Bang Theory”. Generating this sentence in the recap requires discerning the characters’ feelings (clues in the transcript are underlined) about playing the board game (references are shown in red). Colored boxes indicate utterances belonging to the same conversations.

Figure

Figure 2: A recurrent convolutional document reader with a neural sentence extractor.

Figure

Figure 1: DailyMail news article with highlights. Underlined sentences bear label 1, and 0 otherwise.

Figure

Figure 3: Neural attention mechanism for word extraction.

Figure

Figure 4: Visualization of the summaries for a DailyMail article. The top half shows the relative attention weights given by the sentence extraction model. Darkness indicates sentence importance. The lower half shows the summary generated by the word extraction.

Figure

Figure 4: The interface for Selecting or Rating Model Output. Users can choose the final product from three “AI-generated” candidate summaries.

Figure

Figure 6: The interface for Interactive Editing. Users see an “AI-generated” summary in the text box. They can use the drop-down menu to change certain words in the first sentence. They can then press “Predict” to request the model to update the rest of the summary based on those edits.

Figure

Figure 7: The interface for Writing with Model Assistance. In a Google Doc, users can see the original article at the top, and they can write their summary under the section “Write your summary here:”. First, the user types a sentence for their summary; then a Bot (played by a researcher who logs in with the “SumAssist Bot account”) will insert the next sentence in gray font. The Bot will also insert comments on words in the user-written sentence and suggest changes.

Figure

Figure 2: Illustration of participants’ perception on level of efficiency, control, and trust with each interaction. These conceptual level charts show a qualitative, rather than precise, comparison between interactions.

Figure

Figure 1: Five human-AI interactions in text generation from Study 1, illustrated as summarization tasks. Explanation of the actions and visual elements are in §2.2.

Figure

Figure 3: The interface for Guiding Model Output. Users can change the desired summary length and style (formal or informal) using sliders and highlight parts of the original text that they want to include in the summary. Users can press the “Generate” button to get the “AI-generated” summary based on their inputs.

Figure

Figure 5: The interface for Post-editing. Users see an “AI-generated” summary in the text box that they can hypothetically edit.

Figure

Figure 1: The XLNet architecture with two-stream attention mechanism is leveraged to estimate whether a segment is self-contained or not. A self-contained segment is assumed to be preceded and followed by end-of-sentence markers (eos).

Figure

Figure 4: Absolute position of the whole sentence among all segments sorted by XLNet scores of self-containedness.

Figure

Figure 3: Example of a constituent parse tree, from which tree segments are extracted.

Figure

Figure 2: DPP selects a set of summary segments (marked yellow) based on the quality and pairwise dissimilarity of segments.

Figure

Figure 1: Example sentence summaries produced on Gigaword. I is the input, G is the true headline, A is ABS+, and R is RAS-ELMAN.

Figure

Figure 3: Visualization of UMAP projections of dictionary elements. Projections form clusters, which are shown in different colors.

Figure

Figure 2: General summary generation routine. The relevance score of each sentence w.r.t. the mean representation is computed, and the top N sentences (Oe) with the highest R(·) are selected as the summary.
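A compact sketch of the routine this caption describes, assuming sentence representations are real-valued vectors and instantiating the relevance function R(·) as cosine similarity to the mean; this is an illustration, not SemAE's actual code.

```python
import numpy as np

def general_summary(sent_reprs, sentences, n):
    """Score each sentence against the mean representation and keep the
    top-N as the extractive summary (O_e in the caption)."""
    reprs = np.asarray(sent_reprs, dtype=float)
    mean = reprs.mean(axis=0)
    relevance = reprs @ mean / (
        np.linalg.norm(reprs, axis=1) * np.linalg.norm(mean) + 1e-9)
    top = np.argsort(relevance)[::-1][:n]
    return [sentences[i] for i in sorted(top)]  # keep document order
```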

Figure

Figure 1: An example workflow of SemAE. The encoder producesH = 3 representations (sh) for a review sentence s, which are used to generate latent representations over dictionary elements. The decoder reconstructs the input sentences using vectors (zh) formed using latent representations (αh).

Figure

Figure 4: Head-wise visualization of UMAP (McInnes et al., 2018) dictionary element projections.

Figure

Figure 5: UMAP (McInnes et al., 2018) projections of dictionary element over different epochs (warmup epoch #4 to epoch 10). We observe that dictionary elements gradually evolve to form clusters over the epochs.

Figure

Figure 2: ROUGE-1, ROUGE-2 and ROUGE-L scores for different summarization approaches. The chartreuse (yellowish green) box shows the oracle, green boxes show the proposed summarizers, and blue boxes show the baselines. From left: Oracle; Citation-Context-Comm-It: Community detection on citation-context followed by iterative selection; Citation-Context-Community-Div: Community detection on citation-context followed by relevance and diversification in sentence selection; Citation-Context-Discourse-Div: Discourse model on citation-context followed by relevance and diversification; Citation-Context-Discourse-It: Discourse model on citation-context followed by iterative selection; Citation Summ.: Citation summary; MMR 0.3: Maximal marginal relevance with λ = 0.3.

Figure

Figure 1: The blue highlighted span in the citing article (top) shows the citation text, followed by the citation marker (pink span). For this citation, the citation-context is the green highlighted span in the reference article (bottom). The text spans outside the scope of the citation text and citation-context are not highlighted.

Figure

Figure 3: Comparison of the effect of different citation-context extraction methods on the quality of the final summary.

Figure

Figure 1: Dot product of embeddings and its logit for a sample word and its top most similar words (top 2000 and 1000).

Figure

Figure 1: Overview of our model. The word-level RNN is shown in blue and section-level RNN is shown in green. The decoder also consists of an RNN (orange) and a “predict” network for generating the summary. At each decoding time step t (here t=3 is shown), the decoder forms a context vector ct which encodes the relevant source context (c0 is initialized as a zero vector). Then the section and word attention weights are respectively computed using the green “section attention” and the blue “word attention” blocks. The context vector is used as another input to the decoder RNN and as an input to the “predict” network which outputs the next word using a joint pointer-generator network.

Figure

Figure 2: Example of a generated summary

Figure

Figure 3: Ratio of uncommon words in the document, which cannot be found in the Top-K OpenSubtitles words, for different k values.

Figure

Figure 2: Category distribution in WikiSum.

Figure

Figure 1: Examples of how-to questions and their corresponding answer’s summarization in WikiSum.

Figure

Figure 6: Comparison of ROUGE scores of the Features Only, SAFNet and SFNet models when trained with (bars on the left) and without (bars on the right) AbstractROUGE, evaluated on CSPubSum Test. The FNet classifier suffers a statistically significant (p=0.0279) decrease in performance without the AbstractROUGE metric.

Figure

Figure 1: SAFNet Architecture

Figure

Figure 2: Comparison of the average ROUGE scores for each section and the Normalised Copy/Paste score for each section, as detailed in Section 5.1. The wider bars in ascending order are the ROUGE scores for each section, and the thinner overlaid bars are the Copy/Paste count.

Figure

Figure 4: Comparison of the accuracy of each model on CSPubSumExt Test and ROUGE-L score on CSPubSum Test. ROUGE Scores are given as a percentage of the Oracle Summariser score which is the highest score achievable for an extractive summariser on each of the papers. The wider bars in ascending order are the ROUGE scores. There is a statistically significant difference between the performance of the top four summarisers and the 5th highest scoring one (unpaired t-test, p=0.0139).

Figure

Figure 3: Comparison of the best performing model and several baselines by ROUGE-L score on CSPubSum Test.

Figure

Figure 5: Comparison of the ROUGE scores of FNet, SAFNet and SFNet when trained on CSPubSumExt Train (bars on the left) and CSPubSum Train (bars on the right).

Figure

Figure 1: Results of the correlation between metrics and human judgments on the CNN dataset. The first row reports correlation as measured by the Pearson (r) coefficient and the second row focuses on the Kendall (τ) coefficient. In this experiment, parameters are optimized for each criterion.

Figure

Figure 2: Pearson correlation at the system level between metrics when considering abstractive system outputs.

Figure

Figure 4: Impact of Calibration for summarization.

Figure

Figure 3: Score distribution of text scores when considering abstractive system outputs. Pyr. stands for Pyramid score.

Figure

Figure 3. Visualized results of sentence topical weight. The degree of highlighting represents the overall relevance of the sentence and all topics. Underlined sentences are model-selected summary. The left document is from PubMed dataset, and the right document is from CNN/DM dataset.

Figure

Figure 1. Overall architecture of our model (Topic-GraphSum). In the graph attention layer (top right), the square nodes denote the sentence representations output from the document encoder (bottom right), and the circular nodes denote the topic representations learned by NTM (left).

Figure

Figure 2. Rouge-1 and -2 results of our full model and three ablated variants on four datasets.

Figure

Figure 8: Rouge results of our full model and the ablated version on the two datasets.

Figure

Figure 1: An example where a paragraph-by-paragraph extraction will produce an incoherent summary.

Figure

Figure 6: Impact of window length (left) and slot number (right) on model performance (R-1).

Figure

Figure 5: Proportion of sentences selected by each window.

Figure

Figure 2: The framework of our model. There are three major components: (1) the sliding encoder generates a representation of each sentence in the current window; (2) the memory layer infuses history information into sentence representations via graph neural networks; (3) the prediction layer aggregates learned features to compute the binary sentence labels.

Figure

Figure 3: An illustration of the information flow in our model. Paths (a) denote the interaction between memory vectors (M) and sentence representations (S) via a GAT layer. Paths (b) denote the computation of sentence labels. Paths (c) denote the updating process of the memory module.

Figure

Figure 7: Comparison between the output of our full model (top) and the ablated model (bottom). We use underlined text to denote model-selected sentences and bold text to denote the ground-truth sentences. The ablated model selects repetitive content in the 4th window and noisy content in the 5th window.

Figure

Figure 4: Position distribution of gold sentences on two datasets.

Figure

Figure 2: The Joint Model Architecture for both Document-level and Paragraph-level News Genre Tags Prediction.

Figure

Figure 1: Four News Structures: Document-level News Structure Tags (in rectangle) and Paragraph-level News Element Tags (in circle). A News Element may include one or more consecutive paragraphs.

Figure

Figure 4: Generated Summaries for Resource-poor CQA

Figure

Figure 4: Analysis of Multi-hop Reasoning

Figure

Figure 3: Gated Selective Pointer-Generator Network.

Figure

Figure 1: An example from PubMedQA. The highlighted sentences illustrate the inference process when humans answer the given question. Italic represents direct matching sentences from the question. Underlined and

Figure

Figure 1: Hierarchical and Sequential Context Modeling for Question-driven Answer Summarization

Figure

Figure 6: Duplication Analysis in Answers

Figure

Figure 2: The overview of Multi-hop Selective Generator (MSG).

Figure

Figure 2: Case Study. Bold / underlined / shadowed sentences are selected by HSCM / CA / NeuralSum, respectively.

Figure

Figure 2: Model Accuracy in terms of Answer Length

Figure

Figure 1: The Joint Learning Framework of Answer Selection and Abstractive Summarization (ASAS).

Figure

Figure 3: Case Study. ASAS generates the answer summary highly related to the question (Underlined), while PGN may misunderstand the core idea of the answer (Wavy-lined).

Figure

Figure 5: A case study with the same legend as Figure 1. The highlighted sentences are attended by MSG (3-hop).

Figure

Figure 3: Varying the salience threshold λS ∈ [0, 1) (depicted as % confidence) and its impact on ROUGE upon deleting spans ZP ∩ ZS.

Figure

Figure 2: Compression model used for plausibility and salience modeling (§3.3). We extract candidate spans ci ∈ C(T ) to delete, then compute span embeddings with pre-trained encoders (only one span embedding shown here). This embedding is then used to predict whether the span should be kept or deleted.

Figure

Figure 1: Decomposing span-based compression into plausibility and salience (§2). Plausible compressions (underlined) must maintain grammaticality, thus [to the ... wineries]PP is not a candidate. Salience identifies low-priority content from the perspective of this dataset (highlighted). Constituents both underlined and highlighted are deleted.

Figure

Figure 3: The extractive model uses three separate encoders to create representations for the reference document sentences, context tokens, and topics. These are combined through an attention mechanism, encoded at the document level, and passed through a feed-forward layer to compute an extraction probability for each reference sentence.

Figure

Figure 2: A paragraph from the “Family and personal life” section of Barack Obama’s Wikipedia page and selected excerpts from the cited documents which provide supporting evidence.

Figure

Figure 4: Example outputs from the abstractive model that uses the context. The model often copies sequences from the references which are sometimes correct (top) or incorrect but sensible (bottom), highlighting the difficulty of automatic evaluation. (Documents shortened for space. Sentences which are underlined were selected by the extraction step.)

Figure

Figure 1: Given a topic, reference document, and a partial summary (the context), the objective of the summary cloze task is to predict the next sentence of the summary, known as the cloze.

Figure

Figure 5: The VB+NSUBJ category selects tuples of verbs and their corresponding NSUBJ dependents in the dependency tree. In this example, 2/4 of the alignment (the solid lines) can be explained by matches between such tuples. The dashed lines cannot: The “and” alignment is not part of any tuple; Since “ran” and “sprinted” are not aligned, their corresponding tuples are not considered to be aligned, so the “Reese” match does not count toward the total.

Figure

Figure 1: An illustration of the three methods for sampling matrices during bootstrapping. The dark blue color marks values selected by the sample. Only 3 system and input samples are shown here, whereas N and M are actually sampled with replacement.
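
For readers who want to reproduce this style of analysis, here is a minimal sketch of the variant that resamples both systems and inputs; all names are my own, X and Z are assumed to be (systems × inputs) score matrices, and corr_fn is any function mapping two such matrices to a correlation value.

```python
import numpy as np

def boot_both_ci(X, Z, corr_fn, n_boot=1000, alpha=0.05, seed=0):
    """Bootstrap CI obtained by resampling both systems (rows) and
    input documents (columns) with replacement."""
    rng = np.random.default_rng(seed)
    N, M = X.shape
    stats = []
    for _ in range(n_boot):
        rows = rng.integers(0, N, size=N)  # resample systems
        cols = rng.integers(0, M, size=M)  # resample input documents
        stats.append(corr_fn(X[np.ix_(rows, cols)], Z[np.ix_(rows, cols)]))
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])
```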

Figure

Figure 1: Example answers selected by the three strategies. The only SCU marked by annotators for this sentence is SCU4, which does not include information about the location of the attacks. Therefore, an answer selection strategy that chooses ‘‘Baghdad’’ enables generating a QA pair such as QA3, which probes for information not included in the Pyramid annotation.

Figure

Figure 6: (Top) The distribution of the proportion of the QAEval-F1 score that is explained by SCU matches. (Bottom) The percentage of summaries with a score explained by a given proportion of SCU matches. We find that QAEval can be explained by SCU matches far more than ROUGE or BERTScore on average.

Figure

Figure 4: An example correct answer predicted by the model that is scored poorly by the EM or F1 QA metrics (both would assign a score of 0 or near 0). This occurs because the answer and prediction are drawn from two different summaries, and the same event is referred to in different ways in each one.

Figure

Figure 2: A typical example of expert-written and model-generated questions answerable by the phrase in red. The model questions are often significantly more verbose than the expert questions, typically copying the majority of the input sentence.

Figure

Figure 5: The Pearson correlations between the scores of several ROUGE variants, APES, and QAEval variants on TAC’08. The results support similar findings of Eyal et al. (2019), namely, that the ROUGE metrics are highly correlated to each other but have low correlation to the QA-based metrics, suggesting the two types of metrics offer complementary signals.

Figure

Figure 3: A comparison of the correlations of QAEvalF1 on a subset of TAC’08 using expert-written and model-generated questions. Each point represents the average correlation calculated using 30 samples of {2, 4, 6, 8, 10} instances, plotted with 95% error bars. System-level correlations were calculated against the summarizers’ average responsiveness scores across the entire TAC’08 dataset. We hypothesize the model questions perform better due to their verbosity, which causes more keywords to be included in the question that the QA model can match against the summary.

Figure

Figure 4: Every token alignment used by ROUGE or BERTScore is assigned to one or more interpretable categories (defined in §5). This allows us to calculate, for this example, that matches between named-entities contribute 1/4 to the overall score, stopwords 2/4, and noun phrases 3/4 (assuming alignment weights of 1.0).

Figure

Figure 2: An example token alignment used by ROUGE or BERTScore. Each color represents a summary content unit (SCU) that marks informational content. Only 2/5 of the token alignments (the solid edges) can be explained by matches between phrases that express the same information (the green phrases).

Figure

Figure 1: Both candidate summaries are similar to the reference, but along different dimensions: Candidate 1 contains some of the same information, whereas candidate 2’s information is different, but it at least discusses the correct topic. The goal of this work is to understand if summarization evaluation metrics’ scores should be interpreted as measures of information overlap or, less desirably, topic similarity.

Figure

Figure 6: The results of running the PERM-BOTH hypothesis test to find a significant difference between metrics’ Pearson correlations with the Bonferroni correction applied per dataset and correlation level pair instead of per metric (as in Figure 5). A blue square means the test returned a significant p-value at α = 0.05, indicating the row metric has a higher correlation than the column metric. An orange outline means the result remained significant after applying the Bonferroni correction.

Figure

Figure 5: The results of running the PERM-BOTH hypothesis test to find a significant difference between metrics’ Pearson correlations. A blue square means the test returned a significant p-value at α = 0.05, indicating the row metric has a higher correlation than the column metric. An orange outline means the result remained significant after applying the Bonferroni correction.

Figure

Figure 3: The distribution of the proportion of ROUGE (top) and BERTScore (bottom) on TAC 2008 that can be explained by token matches that are labeled with the same SCU (Eq. 5). The averages, 25% and 15% (in red), indicate that only a small portion of their scores comes from matches between phrases that express the same information.

Figure

Figure 2: An illustration of the three permutation methods which swap system scores, document scores, or scores for individual summaries between X and Y .
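
A rough sketch of the summary-level ("both") swap variant is given below, assuming corr_fn computes a correlation (e.g., system-level) between a metric's score matrix and the human score matrix H; the names and the one-sided p-value convention are my own.

```python
import numpy as np

def perm_both_pvalue(X, Y, H, corr_fn, n_perm=1000, seed=0):
    """Swap the scores of metrics X and Y for randomly chosen individual
    summaries, recompute the difference in correlation with human scores H,
    and report how often the permuted difference exceeds the observed one."""
    rng = np.random.default_rng(seed)
    observed = corr_fn(X, H) - corr_fn(Y, H)
    exceed = 0
    for _ in range(n_perm):
        swap = rng.random(X.shape) < 0.5  # entry-wise swap mask
        Xp = np.where(swap, Y, X)
        Yp = np.where(swap, X, Y)
        if corr_fn(Xp, H) - corr_fn(Yp, H) >= observed:
            exceed += 1
    return exceed / n_perm
```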

Figure

Figure 4: The 95% confidence intervals for ρSUM (blue) and ρSYS (orange) calculated using Kendall’s correlation coefficient on TAC’08 (left) and CNN/DM summaries (middle, Fabbri et al. (2021); right, Bhandari et al. (2020)) are rather large, reflecting the uncertainty about how well these metrics agree with human judgments of summary quality.

Figure

Figure 6: The system- and summary-level Pearson correlations as the number of available reference summaries increases. 95% confidence error bars shown, but may be too small to see. PyrEval is missing data because the official implementation requires at least two references. Even with one reference summary, QAEval-F1 maintains a higher system-level correlation than ROUGE.

Figure

Figure 3: The system- and summary-level Pearson estimates of the power of the BOOT-BOTH, PERM-BOTH, and Williams hypothesis test methods calculated on the annotations from Fabbri et al. (2021). The power for BOOT-BOTH and Williams at the system-level is ≈ 0 for all values.

Figure

Figure 1: In the answer verification task, the metrics score how likely two phrases from different contexts have the same meaning. Here, the metrics at the bottom score the similarity between “emergency responders,” which was used to generate the question from the source text, and “paramedics,” the predicted answer from a QA model in the target text.

Figure

Figure 3: Bootstrapped estimates of the stabilities of the system rankings for automatic metrics and human annotations on SummEval (left) and REALSumm (right). The τ value quantifies how similar two system rankings would be if they were computed with two random sets of M input documents. When all Mtest test instances are used, the automatic metrics’ rankings become near constant. The error regions represent ±1 standard deviation.

Figure

Figure 4: 95% confidence intervals for rSYS calculated with the BOOT-INPUTS resampling method when the system rankings for the automatic metrics are calculated using only the judged data (orange) versus the entire test set (blue). Scoring systems with more summaries leads to better (more narrow) estimates of rSYS.

Figure

Figure 7: The 95% CIs calculated using the BOOT-SYSTEMS bootstrapping method with Mjud summaries in orange and Mtest in blue.

Figure

Figure 10: rSYS∆(ℓ, u) correlations for various combinations of ℓ and u (see §4.2) for ROUGE (top), BERTScore (middle), and QAEval (bottom) on SummEval (left) and REALSumm (right). The values of ℓ and u were chosen so that each value in the heatmaps evaluates on 10% more system pairs than the value to its left. For instance, the first row evaluates on 10%, 20%, . . . , 100% of the system pairs. The second row evaluates on 10%, 20%, . . . , 90% of the system pairs, never including the 10% of pairs which are closest in score. The first row of each of the heatmaps is plotted in Fig. 6. The correlations on realistic score differences between systems are in the upper left portion of the heatmaps and contain the lowest correlations overall. Evaluating on all pairs is the top-rightmost entry, and the “easiest” pairs (those separated by a large score margin) are in the bottom right.

Figure

Figure 2: The distributions of score values for three metrics on the SummEval dataset for ground-truth answer and QA model prediction pairs from QAEval with the same (blue) and different (orange) meanings.

Figure

Figure 1: The system-level correlation is calculated between the average X and Z scores on a set of summarization systems. x_i^j and z_i^j are the scores for the summary produced by system i (represented by rows) on input document j (represented by columns).
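
As a worked illustration of this definition (not the authors' code), the system-level correlation can be computed as follows, assuming X and Z are the score matrices depicted in the figure.

```python
import numpy as np
from scipy.stats import pearsonr

def system_level_correlation(X, Z):
    """X and Z are (num_systems x num_inputs) score matrices; each row holds
    one system's scores over the input documents."""
    x_sys = X.mean(axis=1)  # average X score per system
    z_sys = Z.mean(axis=1)  # average Z score per system
    return pearsonr(x_sys, z_sys)[0]
```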

Figure

Figure 9: The rSYS∆(ℓ, u) correlations on the SummEval (top) and REALSumm (bottom) datasets for ℓ = 0 and various values of u for ROUGE-1, ROUGE-2, and ROUGE-L. The u values were chosen to select the 10%, 20%, . . . , 100% of the pairs of systems closest in score. Each u is displayed on the top of each plot.

Figure

Figure 2: The bootstrapped 95% confidence intervals for the BERTScore of each system in the REALSumm dataset using Mjud judged instances in blue and Mtest instances in orange. Evaluating systems with Mtest instances leads to far better estimates of their true scores.

Figure

Figure 5: The systems (each represented by a point) on the two datasets (shown here for REALSumm) are rather diverse in quality as measured by both human judgments and automatic metrics.

Figure

Figure 8: The 95% CIs calculated using the BOOT-BOTH bootstrapping method with Mjud summaries in orange and Mtest in blue.

Figure

Figure 6: The rSYS∆(ℓ, u) correlations on the SummEval (top) and REALSumm (bottom) datasets for ℓ = 0 and various values of u (additional combinations of ℓ and u can be found in Appendix B). The u values were chosen to select the 10%, 20%, . . . , 100% of the pairs of systems closest in score. Each u is displayed on the top of each plot. For instance, 20% of the (N choose 2) system pairs on SummEval are separated by < 0.5 ROUGE-1, and the system-level correlation on those pairs is around 0.08. As more systems are used in the correlation calculation, the allowable gap in scores between system pairs increases; such pairs are therefore likely easier to rank, resulting in higher correlations.

Figure

Figure 11: See Fig. 10 for a description of the heatmaps, shown here for ROUGE-2 (top) and ROUGE-L (bottom).

Figure

Figure 1: Overview pipeline of the proposed model which is executed simultaneously in two phases (a). The first phase encodes the sentences with pre-trained BERT and uses [CLS] information as the input of a graph attention layer (b). The second phase encodes the word and sentence nodes as the inputs of a heterogeneous graph layer (c). The output of the two phases is concatenated and put into an MLP layer in order to classify labels for each sentence.

Figure

Figure 1: Average of ROUGE-1, -2, and -L F1 scores on the Daily Mail validation set within one epoch of training on the Daily Mail training set. The x-axis (multiplied by 2,000) indicates the number of data examples the algorithms have seen. The supervised labels in SummaRuNNer are used to estimate the upper bound.

Figure

Figure 2: Model comparisons of the average value of ROUGE-1, -2, and -L F1 scores (f) on D_early and D_late. For each model, the results were obtained by averaging f across ten trials with 100 epochs in each trial. D_early and D_late consist of 50 articles each, such that the good summary sentences appear early and late in the article, respectively. We observe a significant advantage of BANDITSUM compared to RNES and RNES3 (based on the sequential binary labeling setting) on D_late.

Figure

Figure 2: Model architecture (Left: QA-span fact correction model. Right: Auto-regressive fact correction model).

Figure

Figure 1: Training example created for the QA-span prediction model (upper right) and the auto-regressive fact correction model (bottom right).

Figure

Figure 2: Sentence positions in source document for extractive summaries generated by different models on the PubMed validation set. Documents on the x-axis are ordered by increasing article length from shortest to longest. We also see a similar trend on arXiv (the plots with more details can be found in the appendix).

Figure

Figure 4: Comparison of the flat fully-connected graph used in Erkan and Radev (2004); Mihalcea and Tarau (2004); Zheng and Lapata (2019) to the hierarchical graph used in our models (b) and (c). Although the section-section multiplication reduces the edge computation proportionally to the number of sections, we found it oversimplifies the graph by assuming independence between sentences across different sections. Our final model loosens the assumption by including section-sentence connections as shown in sub-figure (c).

Figure

Figure 1: Example of a hierarchical document graph constructed by our approach on a toy document that contains two sections {T1, T2}, each containing three sentences for a total of six sentences {s1, . . . , s6}. Each double-headed arrow represents two edges with opposite directions. The solid and dashed arrows indicate intra-section and inter-section connections respectively. When compared to the flat fully-connected graph of traditional methods, our use of hierarchy effectively reduces the number of edges from 60 to 24 in this example.

Figure

Figure 5: Sentence positions in source document for extractive summaries generated by different models on the arXiv validation set. Documents on the x-axis are ordered by increasing article length from shortest to longest.

Figure

Figure 3: ROUGE-L scores for (a) different positional functions (L=lead, U=undirected, B=boundary) and (b) different graph hierarchies (NS=no section, H=hierarchical). Each point corresponds to one configuration of the hyperparameter gridsearch described in Section 4.2.

Figure

Figure 4: There is a strong correlation between the guidance quality and output quality, demonstrating the controllability of our guided model.

Figure

Figure 1: Our framework generates summaries using both the source document and separate guidance signals. We use an oracle to select guidance during training and use automatically extracted or user-specified guidance at test time.

Figure

Figure 3: Our model can generate more novel words and achieves higher recall of novel words in the gold reference compared with the baseline.

Figure

Figure 5: The factCC model gives the gold reference an accuracy of about 10%.

Figure

Figure 3: The attention weight changes by using the contrastive attention mechanism. (a) is the average attention weights of the third layer of the baseline Transformer, (b) is that of “Transformer+ContrastiveAttention”, and (c) is the opponent attention derived from the fifth head of the third layer.

Figure

Figure 1: Overall networks. The left part is the original Transformer. The right part, which takes the opponent attention as its bottom layer, implements the contrastive attention mechanism.

Figure

Figure 2: Heatmaps of two sampled heads from the conventional encoder-decoder attention. (a) is of the fifth head of the third layer, and (b) is of the fifth head of the first layer.

Figure

Figure 1: The decision diagram of our human annotation process. Decision nodes are rectangular and outcome nodes are circular. We show the annotation path of two summary sentences, S1 (green arrows) and S2 (red arrows). S2 is annotated as nonsensical and is therefore not considered for faithfulness. S1 is annotated as unfaithful due to hallucinated content.

Figure

Figure 2: Overview of FEQA. Given a summary sentence and its corresponding source document, we first mask important text spans (e.g. noun phrases, entities) in the summary. Then, we consider each span as the “gold” answer and generate its corresponding question using a learned model. Lastly, a QA model finds answers to these questions in the documents; its performance (e.g. F1 score) against the “gold” answers from the summary is taken as the faithfulness score.
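
A high-level sketch of this pipeline is shown below; mask_noun_phrases, generate_question, and answer_question are hypothetical placeholders for the learned components described in the caption, and only the token-level F1 is spelled out.

```python
from collections import Counter

def token_f1(prediction, gold):
    """Standard token-level F1 between a predicted and a gold answer."""
    p, g = prediction.lower().split(), gold.lower().split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

def feqa_score(summary_sentence, document):
    """Average QA F1 over questions generated from masked summary spans.
    mask_noun_phrases / generate_question / answer_question are placeholders."""
    scores = []
    for masked_sent, gold_answer in mask_noun_phrases(summary_sentence):
        question = generate_question(masked_sent, gold_answer)
        predicted = answer_question(question, document)
        scores.append(token_f1(predicted, gold_answer))
    return sum(scores) / len(scores) if scores else 0.0
```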

Figure

Figure 2: Compression constraints on an example sentence. (a) RST-based compression structure like that in Hirao et al. (2013), where we can delete the ELABORATION clause. (b) Two syntactic compression options from Berg-Kirkpatrick et al. (2011), namely deletion of a coordinate and deletion of a PP modifier. (c) Textual units and requirement relations (arrows) after merging all of the available compressions. (d) Process of augmenting a textual unit with syntactic compressions.

Figure

Figure 4: Examples of an article kept in the NYT50 dataset (top) and an article removed because the summary is too short. The top summary has a rich structure to it, corresponding to various parts of the document (bolded) and including some text that is essentially a direct extraction.

Figure

Figure 5: Counts on a 1000-document sample of how frequently both a document prefix baseline and a ROUGE oracle summary contain sentences at various indices in the document. There is a long tail of useful sentences later in the document, as seen by the fact that the oracle sentence counts drop off relatively slowly. Smart selection of content therefore has room to improve over taking a prefix of the document.

Figure

Figure 1: ILP formulation of our single-document summarization model. The basic model extracts a set of textual units with binary variables xUNIT subject to a length constraint. These textual units u are scored with weights w and features f . Next, we add constraints derived from both syntactic parses and Rhetorical Structure Theory (RST) to enforce grammaticality. Finally, we add anaphora constraints derived from coreference in order to improve summary coherence. We introduce additional binary variables xREF that control whether each pronoun is replaced with its antecedent using a candidate replacement rij . These are also scored in the objective and are incorporated into the length constraint.
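
For intuition, here is a stripped-down sketch of the basic unit-selection ILP under a length budget, using PuLP purely as an illustrative solver; the grammaticality (requirement) and anaphora constraints of the full model are omitted, and all names are my own.

```python
from pulp import LpBinary, LpMaximize, LpProblem, LpVariable, lpSum

def select_units(scores, lengths, budget):
    """Choose binary unit variables x to maximize a linear score w·f
    subject to a total length constraint."""
    prob = LpProblem("extractive_summary", LpMaximize)
    x = [LpVariable(f"x_unit_{i}", cat=LpBinary) for i in range(len(scores))]
    prob += lpSum(s * xi for s, xi in zip(scores, x))             # objective
    prob += lpSum(l * xi for l, xi in zip(lengths, x)) <= budget  # length cap
    # A requirement relation (a clause needs its parent) would be added as:
    # prob += x[child] <= x[parent]
    prob.solve()
    return [i for i, xi in enumerate(x) if xi.value() > 0.5]
```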

Figure

Figure 3: Modifications to the ILP to capture pronoun coherence. It, which refers to Kellogg, has several possible antecedents from the standpoint of an automatic coreference system (Durrett and Klein, 2014). If the coreference system is confident about its selection (above a threshold α on the posterior probability), we allow for the model to explicitly replace the pronoun if its antecedent would be deleted (Section 2.2.1). Otherwise, we merely constrain one or more probable antecedents to be included (Section 2.2.2); even if the coreference system is incorrect, a human can often correctly interpret the pronoun with this additional context.

Figure

Figure 2: Examples of Layout vs Plain Summaries

Figure

Figure 3: Counts of participant responses when comparing Plain & Stock.

Figure

Figure 4: Counts of participant responses when comparing Layout & Stock.

Figure

Figure 1: VerbNet frame for ‘murder’

Figure

Figure 4: The average document information and document information given summary when prompting GPT-2 with different amounts of upstream sentences for the SummEval dataset.

Figure

Figure 3: The average document information and document information given summary as estimated by different sizes of GPT-2 for the SummEval dataset.

Figure

Figure 1: A comparison of token-wise information content within a document as estimated by GPT-2 in 4 scenarios: the document on its own, the document given the document, the document given a high quality summary, and the document given a low quality summary. Tokens with a darker background color have more information.
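
A minimal sketch of how such conditional information estimates can be obtained with the Hugging Face GPT-2 model is given below (model size and function names are my own); quantities such as the Shannon Score and Information Difference are, to my understanding, built from differences of these terms.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def information_content(text, conditioning=""):
    """Total negative log-likelihood (in nats) of `text` under GPT-2,
    optionally conditioned on a prefix such as a candidate summary."""
    prefix_ids = tokenizer(conditioning).input_ids if conditioning else []
    text_ids = tokenizer(text).input_ids
    ids = torch.tensor([prefix_ids + text_ids])
    with torch.no_grad():
        log_probs = torch.log_softmax(model(ids).logits, dim=-1)
    nll = 0.0
    # token at position `pos` is predicted by the logits at `pos - 1`
    for pos in range(max(len(prefix_ids), 1), len(prefix_ids) + len(text_ids)):
        nll -= log_probs[0, pos - 1, ids[0, pos]].item()
    return nll
```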

Figure

Figure 2: Distributions of Shannon Score and Information Difference on 100 summaries from the CNN/DailyMail dataset. Three different summaries are used: the original human written reference summary (in blue), the original summary with words scrambled (in orange), and a reference summary for a different document in the dataset (in green).

Figure

Figure 1: Extractive Model Architecture

Figure

Figure 2: Evaluation flow of APES.

Figure

Figure 3: Example 202 from the CNN/Daily Mail test set.

Figure

Figure 1: Example 3083 from the test set.

Figure

Figure 4: Example 4134 from the CNN/Daily Mail test set. Colors and underlines in the source reflect differences between the baseline and our model's attention weights: red and a single underline mark words attended to by the baseline model but not our model; green and a double underline mark the opposite. Entities in bold in the target summary are answers to the example questions.

Figure

Figure 2: Pairwise Kendall’s tau correlations for all automatic evaluation metrics.

Figure

Figure 1: Histogram of standard deviations of inter-annotator scores between: crowd-sourced annotations, first round expert annotations, and second round expert annotations, respectively.

Figure

Figure 1: Sample argument subgraph constructed from NYT news comments illustrating varying viewpoints. Claims “I honestly...” and “but I dont..” are entailed by premises, connected through Default Inference nodes, and opposing claims are connected through Issue nodes.

Figure

Figure 3: Example of the data collection interface used by crowd-sourced and expert annotators.

Figure

Figure 1: ROUGE-1 scores across datasets, training dataset size, data augmentation (*-a), and consistency loss (*-c) showing the generalizable and robust performance of models transferred from WikiTransfer. Standard deviation bars are also plotted.

Figure

Figure 1: An illustration of our dataset annotation pipeline. Given a question and answers to that question, professional linguists 1) select relevant sentences, 2) cluster those selected sentences, 3) summarize each cluster’s sentences, and 4) fuse clusters into a coherent, overall summary.

Figure

Figure 2: An illustration of our automatic dataset pipeline which mirrors the manual pipeline for data augmentation. Given a question and answers, relevant sentences are selected and clustered. Then, the cluster centroid sentence of non-singleton clusters is removed from the input to use as bullet point summaries.

Figure

Figure 1: Example of an incorrect summary sentence produced by PGC (see Section 4) on CNN-DM.

Figure

Figure 3: Examples of incorrect sentences produced by different summarization models on the CNN-DM test set.

Figure

Figure 2: Agreement between crowdsourced and expert annotations at increasing numbers of workers.

Figure

Figure 4: Two alternative sentences from generated summaries, one correct and one incorrect, for the given source sentence. All tested NLI models predict very high entailment probabilities for the incorrect sentence, with only BERT estimating a slightly higher probability for the correct alternative.

Figure

Figure 1: Length control vs summary length. Length control can take 10 discrete values.

Figure

Figure 2: Example of monolingual system-generated summaries.

Figure

Figure 3: Example of cross-lingual system-generated summaries.

Figure

Figure 2: Structure of the document encoder in Bi-AES.

Figure

Figure 1: Structure of Uni-AES.

Figure

Figure 3: Performance for different document lengths.

Figure

Figure 3: Models trained on synthetic data evaluated on original CNN/DM documents, of either <1000 words (short) or >2000 words (long). True uses the summary under the document’s true aspect. ‘Best’ takes the best-scoring summary under all possible input aspects.

Figure

Figure 1: Two news articles with color-coded encoder attention-based document segmentations, and selected words for illustration (left), the abridged news article (top right) and associated aspect-specific model summaries (bottom right). Top: Article from our synthetic corpus with aspects sport, tvshowbiz and health. The true boundaries are known, and indicated by black lines in the plot and ‖ in the article. Bottom: Article from the RCV1 corpus with document-level human-labeled aspects sports, news and tvshowbiz (gold segmentation unknown).

Figure

Figure 2: Visualization of our three aspect-aware summarization models, showing the embedded input aspect (red), word embeddings (green), latent encoder and decoder states (blue) and attention mechanisms (dotted arrows). Left: the decoder aspect attention model; Center: the encoder attention model; Right: the source-factors model.

Figure

Figure 1: Overview of the variational hierarchical topic-aware model.

Figure

Figure 2: Overview of the topic embedding mechanism: ϕ(rd) is the topic distribution, Mt is the topic mapping matrix, and tei is the topic embedding of word xi.

Figure

Figure 3: The similarity of topic distributions and the topic number mapping between documents and summaries generated by humans or the learning models.

Figure

Figure 4: Topic distribution visualization of some extracted words, which consist of three different topic groups and a random one.

Figure

Figure 2: Model architecture for adjacency reranking variation of Co-opNet

Figure

Figure 1: Generated abstracts for a biology article (from the Bio subset of our arXiv dataset). Abstracts are ranked from most (top) to least likely (bottom) using the generator model. Abstracts with better narrative structure and domain-specific content (such as the circled abstract) are often out-ranked in terms of likelihood by abstracts with factual errors and less structure.

Figure

Figure 2: Examples of factual errors given in annotation task.

Figure

Figure 1: Distribution of common factual error types in sampled generated summaries (96.37% of all errors). We draw from the same error types for our controlled analysis to ensure we match the true distribution of errors. Here extrinsic entity refers to entities that did not previously appear in the source, while an intrinsic entity appeared in the source.

Figure

Figure 4: Part of an EDUA solution graph. Each vertex is a segment vector from a reference summary, indexed by Summary.ID (si), Sentence.ID (sij), Segmentation.ID (sijk), Segment.ID (sijkm). All segments of all reference summaries have a corresponding node. All edges connect segments from different summaries with similarity ≥ tedge. This schematic representation of a partial solution contains three fully connected subgraphs with attraction edges (solid lines), each representing an SCU, whose weight is the number of vertices (segments).

Figure

Figure 5: Visualizations of ROUGE score with different hop numbers.

Figure

Figure 8: Contour plot for score correlations with β (X-axis) and tedge (Y-axis).

Figure

Figure 1: Overview of PESG. We divide our model into four parts: (1) Prototype Reader; (2) Fact Extraction; (3) Editing Generator; (4) Fact Checker.

Figure

Figure 2: Framework of fact extraction module.

Figure

Figure 7: Alignment of a PyrEval SCU of weight 3 to segments from student summaries on autonomous vehicle.

Figure

Figure 5: Formal specification of EDUA’s input graph G consisting of all segments from all segmentations of reference summary sentences (item 2), the objective (item 6), and three scores for defining the objective function that are assigned to candidate SCUs (item 3), sets of SCUs of the same weight (item 4), and a candidate pyramid (item 5).

Figure

Figure 1: The workflows of cross-input RL (top), input-specific RL (middle) and RELIS (bottom). The ground-truth reward can be provided by humans or automatic metrics (e.g. BLEU or ROUGE) measuring the similarity between generated output text and the reference output.

Figure

Figure 4: Visualizations of editing gate.

Figure

Figure 6: A directed Depth First Search tree for EDUAC. Nodes are cliques representing candidate SCUs, as illustrated in Figure 4, labeled by their weights. Each DFS path is a partition over one way to segment all the input summaries and group all segments into SCUs. The solution is the path with the highest AP .

Figure

Figure 1: Alignment of a single PyrEval SCU of weight 5 to a manual SCU of weight 4 from a dataset of student summaries. The manual and automated SCUs express the same content, and their weights differ only by one. For each of five reference summaries (RSUM1-RSUM5), exact matches of words between the PyrEval and manual contributor are in bold, text in plain font (RSUM2, RSUM4) appears only in the manual version, and text in italics appears only in the PyrEval version. Paraphrases of the same content from RSUM4 were identified by human annotators (plain font) and PyrEval (italics). Also shown is a matching segment from a student summary, where the student used synonyms of some of the words in the reference summaries.

Figure

Figure 3: Framework of fact checker module.

Figure

Figure 2: PyrEval preprocessors segment sentences from reference (RSUM) and evaluation (ESUM) summaries into clause-like units, then convert them to latent vectors. EDUA constructs a pyramid from RSUM vectors (lower left): the horizontal bands of the pyramid represent SCUs of decreasing weight (shaded squares). WMIN matches SCUs to ESUM segments to produce a raw score, and three normalized scores.

Figure

Figure 4: Workflow of the data augmentation method with an input reference summary and output augmented sample

Figure

Figure 3: Top: An example assessment input with all the concepts (highlighted in color boxes) identified through QuickUMLS, a state-of-the-art off-the-shelf medical concept extractor. Middle: Two example plan subsections with the annotated problems,

Figure

Figure 5: Performance drops (lighter color) and gains (darker color) over baseline (first column) on ROUGE-L Recall (top 4 rows) and Precision (bottom 4 rows). The darker the cell color is, the higher performance gain the model obtains over baseline.

Figure

Figure 6: Two cherry-picked examples from T5-DAPT-CUI output, with cyan font highlighting the correct diseases.

Figure

Figure 7: Two example reference (REF) and predicted summaries (PRED) from T5-ALL (input with objective sections).

Figure

Figure 1: When a sick patient arrives at the hospital, diagnostic evaluations are performed to assess the patient’s condition and deduce the problems causing the illness.

Figure

Figure 2: An input example of assessment and subjective sections available in the notes: Chief Complaint, Allergies, Review of systems.

Figure

Figure 8: Given an input assessment, we show the reference summary, example output from fine-tuning T5 and BART, and T5 DAPT with token masking and concept masking. The red font highlights information that is outside the input text.

Figure

Figure 4: Recall at rank threshold n for summary 4B

Figure

Figure 5: IU samples with rephrasing.

Figure

Figure 2: Averaged precision at rank threshold n

Figure

Figure 1: Extended definition of IU based on Kroll (1977). Our edits are presented in italics.

Figure

Figure 3: Averaged recall at rank threshold n

Figure

Figure 6: Screenshot of Segment Matcher

Figure

Figure 4: For all copied words, we show the distribution over the length of copied phrases they are part of. The black lines indicate the reference summaries, and the bars the summaries with and without bottom-up attention.

Figure

Figure 2: Overview of the selection and generation processes described throughout Section 4.

Figure

Figure 3: The AUC of the content selector trained on CNN-DM with different training set sizes ranging from 1,000 to 100,000 data points.

Figure

Figure 1: Example of two sentence summaries with and without bottom-up attention. The model does not allow copying of words in [gray], although it can generate words. With bottom-up attention, we see more explicit sentence compression, while without it whole sentences are copied verbatim.

Figure

Figure 1: Overview of our summarization model. As shown, “bilateral” in the FINDINGS is a significant ontological term which has been encoded into the ontology vector. After refining FINDINGS word representation, the decoder computes attention weight (highest on “bilateral”) and generates it in the IMPRESSION.

Figure

Figure 2: Histograms and arrow plots showing differences between IMPRESSION of 100 manually-scored radiology reports. Although challenges remain to reach human parity for all metrics, 81% (a), 82% (b), and 80% (c) of our system-generated Impressions are as good as human-written Impressions across different metrics.

Figure

Figure 6: Difference in ROUGE-2 between Variational models and their deterministic counterparts versus the fraction of data discarded. Positive values indicate that deterministic ROUGE-2 is lower than Variational.

Figure

Figure 5: ROUGE-L scores vs fraction of data discarded due to high BLEUVarN. The straight dashed lines indicate the performance level of the deterministic PEGASUS and BART models.

Figure

Figure 3: Difference in ROUGE-1 between Variational models and their deterministic counterparts versus the fraction of data discarded. Positive values indicate that deterministic ROUGE-1 is lower than Variational.

Figure

Figure 2: BLEUVarN curves as a function of data discarded due to low ROUGE-1 scores.

Figure

Figure 4: ROUGE-2 scores vs fraction of data discarded due to high BLEUVarN. The straight dashed lines indicate the performance level of the deterministic PEGASUS and BART models.

Figure

Figure 7: Difference in ROUGE-L between Variational models and their deterministic counterparts versus the fraction of data discarded. Positive values indicate that deterministic ROUGE-L is lower than Variational.

Figure

Figure 1: ROUGE-1 scores vs fraction of data discarded due to high BLEUVarN. The straight dashed lines indicate the performance level of the deterministic PEGASUS and BART models.

Figure

Figure 1: The complete pipeline of the proposed method.

Figure

Figure 2: Simplified encoder-decoder transformer architectures used by BART and T5.

Figure

Figure 1: Example conditional summaries for two tasks.

Figure

Figure 2: ScriptBase corpus statistics. Movies can have multiple genres, thus numbers do not add up to 1,276.

Figure

Figure 4: Example of a bipartite graph, connecting a movie’s scenes with participating characters.

Figure

Figure 1: Excerpt from “The Silence of the Lambs”. The scene heading INT. THE PANEL TRUCK - NIGHT denotes that the action takes place inside the panel truck at night. Character cues (e.g., MAN, CATHERINE) preface the lines the actors speak. Action lines describe what the camera sees (e.g., We can’t get a good glimpse of his face, but his body. . . ).

Figure

Figure 3: Example of consecutive chain (top). Squares represent scenes in a screenplay. The bottom chain would not be allowed, since the connection between s3 and s5 makes it non-consecutive.

Figure

Figure 5: Social network for “The Silence of the Lambs”; edge weights correspond to absolute number of interactions between nodes.

Figure

Figure 4: Fractions of examples in each dataset exhibiting different error types (note a single example may have multiple errors). The graphs show a significant mismatch between the error distributions of actual generation models and synthetic data corruptions.

Figure

Figure 2: Set of transformations/data corruption techniques from Kryscinski et al. (2020) used to generate training data for the entity-centric approach.

Figure

Figure 5: The dependency arc entailment (DAE) model from (Goyal and Durrett, 2020a). A pre-trained encoder is used to obtain arc representations; these are used to predict arc-level factuality decisions.

Figure

Figure 6: Performance of different train checkpoints on a held-out dataset and on the human annotated dev set for models trained on the synthetic data in the CNN/DM domain.

Figure

Figure 3: Taxonomy of error types considered in our manual annotation. On the right are example summaries with highlighted spans corresponding to the error types; the first summary is an actual BART generated summary while others are manually constructed representative examples.

Figure

Figure 1: Examples from the synthetic and human annotated factuality datasets. The entity-centric and generation-centric approaches produce bad summaries from processes which can label their errors. All models can be adapted to give word-level, dependency-level, or sentence-level highlights, except for Gen-C.

Figure

Figure 7: Screenshot of the Mechanical Turk experiment. Given an input article, the annotators were tasked with evaluating the factuality of 3 model-generated summaries on a binary scale.

Figure

Figure 1: N-gram overlap of the generated summaries with the source article at different time steps. For CNNDM and MEDIASUM, the summaries fail to achieve the target degree of abstractiveness (denoted by the dotted lines).

Figure

Figure 8: Example showing loss modification to improve abstractiveness. The table shows which tokens are retained (green checkmark) or dropped (red cross) from the loss computation at different training stages. During later stages of the training, when loss truncation is applied, copied tokens are excluded from the loss.

Figure

Figure 7: N-gram overlap of the generated summaries in CNNDM and MEDIASUM. Initializing from BART-XSUM offers no benefits over the baseline. On the other hand, loss truncation is successful at enforcing abstractiveness; generated summaries for both datasets are closer to the target abstractiveness of reference summaries.

Figure

Figure 6: Modified training under loss truncation. After K steps of standard training, loss is computed on a subset of the tokens. To encourage factuality, high-loss tokens (↑) are excluded from the final loss computation whereas tokens with low loss (↓) are excluded to encourage abstractiveness.
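
A sketch of what the factuality variant of this truncation might look like in PyTorch is shown below; drop_frac, the function name, and the schedule for switching it on are illustrative assumptions. The abstractiveness variant would instead drop the lowest-loss (copied) tokens.

```python
import torch
import torch.nn.functional as F

def truncated_token_loss(logits, labels, drop_frac=0.1, ignore_index=-100):
    """Per-token cross-entropy with the highest-loss fraction of target
    tokens excluded from the final loss (factuality-oriented variant)."""
    per_token = F.cross_entropy(
        logits.view(-1, logits.size(-1)), labels.view(-1),
        ignore_index=ignore_index, reduction="none")
    per_token = per_token[labels.view(-1) != ignore_index]
    keep = max(1, int(per_token.numel() * (1 - drop_frac)))
    kept, _ = torch.topk(per_token, keep, largest=False)  # keep lowest losses
    return kept.mean()
```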

Figure

Figure 10: Example illustrating the +factuality loss modification. The table shows which tokens are retained or dropped from the loss computation at each training stage. We can see that high loss generally corresponds to hallucinated content.

Figure

Figure 9: Factuality of output summaries for the baseline and loss truncation variants. The plot shows that compared to the standard BART baseline, token-level loss truncation improves factuality, with comparable results on abstractiveness and ROUGE.

Figure

Figure 4: Factuality Sentence Error Rate of the generated summaries at different time steps during training. The graph shows that factual error rate is roughly proportional to abstractiveness (compare plot trends with Figure 1) for CNNDM and MEDIASUM.

Figure

Figure 3: Comparison of summary-level output probabilities between high-overlap and low-overlap subsets for the BART models. For both CNNDM and MEDIASUM, high-overlap summaries are predicted with substantially higher confidence compared to low-overlap examples.

Figure

Figure 2: ROUGE scores of the generated summaries of all datasets at different training stages.

Figure

Figure 3: Pairwise significance test outcomes for BLEU, best-performing ROUGE (rows 2-9), and ROUGE as applied in Hong et al. (2014) (bottom 3 rows), with (ST1) and without (ST0) stemming, and with (RS1) and without (RS0) removal of stop words, for average (A) or median (M) ROUGE precision (P), recall (R), or f-score (F). Colored cells denote a significant win for the row i metric over the column j metric with the Williams test.

Figure

Figure 2: Combining linguistic quality and coverage scores provided by human assessors in DUC2004

Figure

Figure 1: Scatter-plot of mean linguistic quality and coverage scores for human assessments of summaries in DUC-2004

Figure

Figure 4: Summarization system pairwise significance test outcomes (paired t-test) for state-of-the-art (top 7 rows) and baseline systems (bottom 5 rows) of Hong et al. (2014), evaluated with the best-performing ROUGE variant: average ROUGE-2 precision with stemming and stop words removed. Colored cells denote a significantly greater mean score for the row i system over the column j system according to the paired t-test.

Figure

Figure 1: Our proposed modification of a multi-layer transformer architecture. The input sequence is composed of K blocks of tokens. Each transformer layer is applied within the blocks, and a bidirectional GRU network propagates information in the whole document by updating the [CLS] representation of each block.

Figure

Figure 4: Document lengths after tokenization with pretrained BERT-base tokenizer and position of the [CLS] tokens of Oracle sentences in the input documents.

Figure

Figure 3: Proportion of the extracted sentences according to their position in the input document from PubMed test dataset.

Figure

Figure 2: Average R-1 scores of extracted summaries according to the number of words in the input documents from arXiv test dataset.

Figure

Figure 1: Training curves for BanditSum based models. Average ROUGE is the average of ROUGE-1, -2 and -L F1.

Figure

Figure 1: NEWSROOM summaries showing different extraction strategies, from time.com, mashable.com, and foxsports.com. Multi-word phrases shared between article and summary are underlined. Novel words used only in the summary are italicized.

Figure

Figure 2: Example summaries for existing datasets.

Figure

Figure 4: Density and coverage distributions across the different domains and existing datasets. NEWSROOM contains diverse summaries that exhibit a variety of summarization strategies. Each box is a normalized bivariate density plot of extractive fragment coverage (x-axis) and density (y-axis), the two measures of extraction described in Section 4.1. The top left corner of each plot shows the number of training set articles n and the median compression ratio c of the articles. For DUC and New York Times, which have no standard data splits, n is the total number of articles. Above, top left to bottom right: Plots for each publication in the NEWSROOM dataset. We omit TMZ, Economist, and ABC for presentation. Below, left to right: Plots for each summarization dataset showing increasing diversity of summaries along both dimensions of extraction in NEWSROOM.

Figure

Figure 3: Procedure to compute the set F(A,S) of extractive phrases in summary S extracted from article A. For each sequential token of the summary, si, the procedure iterates through tokens of the text, aj . If tokens si and aj match, the longest shared token sequence after si and aj is marked as the extraction starting at si.
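
Below is a re-implementation sketch of this greedy procedure, together with the coverage and density measures of Figure 4 as I understand them (normalized sums of fragment lengths and squared lengths); tokenized inputs are assumed and the function names are my own.

```python
def extractive_fragments(article, summary):
    """For each summary position, greedily find the longest token sequence
    shared with the article, mark it as a fragment, and continue after it."""
    fragments, i = [], 0
    while i < len(summary):
        best = []
        j = 0
        while j < len(article):
            if summary[i] == article[j]:
                k = 0
                while (i + k < len(summary) and j + k < len(article)
                       and summary[i + k] == article[j + k]):
                    k += 1
                if k > len(best):
                    best = summary[i:i + k]
                j += k
            else:
                j += 1
        if best:
            fragments.append(best)
            i += len(best)
        else:
            i += 1
    return fragments

def coverage(article, summary):
    return sum(len(f) for f in extractive_fragments(article, summary)) / len(summary)

def density(article, summary):
    return sum(len(f) ** 2 for f in extractive_fragments(article, summary)) / len(summary)
```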

Figure

Figure 9: We designed an interactive web interface for the human evaluation experiments introduced in Section 5.4.

Figure

Figure 6: The ROUGE scores for different stopping thresholds pthres on the arXiv validating set.

Figure

Figure 3: The position distribution of extracted sentences in the PubMedtrunc dataset.

Figure

Figure 8: The ROUGE scores for different stopping thresholds pthres on the GovReport validating set.

Figure

Figure 7: The ROUGE scores for different stopping thresholds pthres on the PubMedtrunc validating set.

Figure

Figure 2: The architecture of our MemSum extractive summarizer with a multi-step episodic MDP policy. With the updating of the extraction-history embeddings h at each time step t, the scores u of remaining sentences and the stopping probability pstop are updated as well.
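
In outline, the extract-or-stop loop can be sketched as follows; policy stands in for the networks producing the sentence scores u and the stopping probability p_stop, and p_thres and max_sents are illustrative values, not the authors' settings.

```python
def extract_summary(sentences, policy, p_thres=0.6, max_sents=7):
    """Iteratively score remaining sentences given the extraction history,
    pick the highest-scoring one, and stop once p_stop exceeds a threshold."""
    selected, remaining = [], list(range(len(sentences)))
    while remaining and len(selected) < max_sents:
        scores, p_stop = policy(sentences, selected, remaining)
        if p_stop > p_thres:
            break
        best = max(remaining, key=lambda i: scores[i])
        selected.append(best)
        remaining.remove(best)
    return [sentences[i] for i in sorted(selected)]
```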

Figure

Figure 5: The ROUGE scores for different stopping thresholds pthres on the PubMed validating set.

Figure

Figure 4: The sentence scores of 50 sentences computed by MemSum at extraction steps 0 to 3. In the document, there is artificial redundancy in that the (2n)th and the (2n+1)th sentences are identical (n = 0, 1, ..., 24).

Figure

Figure 1: We modeled extractive summarization as a multi-step iterative process of scoring and selecting sentences. si represents the ith sentence in the document D.

Figure

Figure 2: A case study on the LCSTS dataset. ST is source text; Ref is reference summary; +Kw is keywords augmented; +KwKG is keywords and knowledge augmented.

Figure

Figure 3: An example of generated summaries on the LCSTS dataset. ST is source text; Ref is reference summary; +Kw is keywords topic augmented; +KwKG is keywords topic and knowledge augmented.

Figure

Figure 1: The Model Structure of KAS. The λi are soft gates for distributing copy probabilities.

Figure

Figure 2: With global variance loss, our model (green bar) can avoid repetitions and achieves a percentage of duplicates comparable to that of reference summaries.

Figure

Figure 1: The process of attention optimization (better viewed in color). The original attention distribution (red bar on the left) is updated by the refinement gate rt, and attention on some irrelevant parts is lowered. Then the updated attention distribution (blue bar in the middle) is further supervised by a local variance loss to obtain the final distribution (green bar on the right).

Figure

Figure 4: Reward calculation with Question-Answer pairs

Figure

Figure 2: The training process for the summarization framework with QA rewards

Figure

Figure 1: A document, its corresponding ground truth summary and model generated summaries.

Figure

Figure 5: The interface used for human evaluation of the summaries.

Figure

Figure 3: Model improvements after QA-based rewards on the SAMSUM data

Figure

Figure 1: Overview of our multi-task model with parallel training of three tasks: abstractive summary generation (SG), question generation (QG), and entailment generation (EG). We share the ‘blue’ color representations across all the three tasks, i.e., second layer of encoder, attention parameters, and first layer of decoder.

Figure

Figure 2: Attention probability for decoding on a DUC 2001 example, showing that the summary is more inclined toward an extractive nature. The attention corresponding to the word ‘pietersen’ present in the generated summary is shown.

Figure

Figure 3: Attention probability for decoding on a SUMPUBMED example, where the attention corresponding to the word ‘present’ in the generated summary is shown.

Figure

Figure 1: SUMPUBMED creation pipeline.

Figure

Figure 1: An example concept map browser. The system indicates that (t1)=“Slobodan Milosevic” is related to (t2)=“Kosovo Province.” The user clicks to investigate the relationship, and the system must generate a summary explaining how Milosevic is related to Kosovo.

Figure

Figure 3: Highlighted article, reference summary, and summaries generated by TCONVS2S and PTGEN. Words in red in the system summaries are highlighted in the article but do not appear in the reference.

Figure

Figure 2: The UI for content evaluation with highlights. Judges are given an article with important words highlighted using a heat map. Judges can also remove less important highlight color by sliding the scroller at the left of the page. At the right of the page, judges give the recall and precision assessments by sliding a scroller from 1 to 100 based on the quality of the given summary.

Figure

Figure 1: Highlight-based evaluation of a summary. Annotators evaluate a summary (bottom) against the highlighted source document (top), presented with a heat map marking the salient content in the document; the darker the colour, the more annotators deemed the highlighted text salient.

Figure

Figure 2: Two-stage model diagram. The aspect classifier assigns aspect labels to each reference sentence Rij from references R with a threshold λ. Sentences are then grouped according to the assigned labels and fed to the summarization model. Groups about irrelevant aspects (i.e., a2) are ignored. Finally, the summarization model outputs summaries for each relevant aspect.

Figure

Figure 3: Precision differences in varying threshold ranges.

Figure

Figure 1: In WikiAsp, given reference documents cited by a target article, a summarization model must produce targeted aspect-based summaries that correspond to sections.

Figure

Figure 1: Example of a search tree

Figure

Figure 1: Overview of our pyramid construction.

Figure

Figure 2: Examples of head-modifier-relation triples.

Figure

Figure 3: Examples of SCUs obtained from pyramids.

Figure

Figure 6: Results for the output factor questions. Specific output factor in italics. Answer type in brackets: MC = Multiple Choice, MR = Multiple Response, LS = Likert Scale. ** indicates significance (𝜒2 or Fisher’s exact test), after Bonferroni correction, with 𝑝 ≪ 0.001, * with 𝑝 < 0.05.

Figure

Figure 5: Results for the purpose factor questions. Specific purpose factor in italics. Answer type in brackets: MC = Multiple Choice, MR = Multiple Response, LS = Likert Scale. ** indicates significance (𝜒2), after Bonferroni correction, with 𝑝 ≪ 0.001, * with 𝑝 < 0.05. † indicates noteworthy results where significance was lost after correction for the number of tests. If two options are flagged, these options are not significantly different from each other, yet both were chosen significantly more often than the other options.

Figure

Figure 1: Summarization methods that are currently the standard vs. example of summarizing while taking users’ wishes and desires into account.

Figure

Figure 3: Overview of survey procedure.

Figure

Figure 2: Participant details.

Figure

Figure 7: Results for the future feature questions. Answer type in brackets. MC = Multiple Choice, MR = Multiple Response. ** indicates significance (𝜒2 or Fisher’s exact test), after Bonferroni correction, with 𝑝 ≪ 0.001.

Figure

Figure 4: Results for the input factor questions. Specific input factor in italics. Answer type in brackets: MC =Multiple Choice, MR = Multiple Response. ** indicates significance (𝜒2), after Bonferroni correction, with 𝑝 ≪ 0.001. If two options are flagged with **, these options are not significantly different from each other, yet both have been chosen significantly more often than the other options.

Figure

Figure 1: Distribution of sentence position predictions.

Figure

Figure 6: Typical Comparison. Our model attended to the most important information (blue bold font), matching well with the reference summary, while other state-of-the-art methods generate repeated or less important information (red italic font).

Figure

Figure 1: Comparison of extractive, abstractive, and our unified summaries on a news article. The extractive model picks the most important but incoherent or not concise sentences (see blue bold font). The abstractive summary is readable and concise but still loses or mistakes some facts (see red italic font). The final summary, rewritten from fragments (see underlined font), combines the advantages of both the extractive (importance) and abstractive (coherence; see green bold font) approaches.

Figure

Figure 5: Visualizing the consistency between sentence and word attentions on the original article. We highlight word (bold font) and sentence (underline font) attentions. We compare our methods trained with and without inconsistency loss. Inconsistent fragments (see red bold font) occur when trained without the inconsistency loss.

Figure

Figure 2: Our unified model combines the word-level and sentence-level attentions. Inconsistency occurs when word attention is high but sentence attention is low (see red arrow).

Figure

Figure 4: Decoding mechanism in the abstracter. In the decoder step t, our updated word attention α̂t is used to generate context vector h∗(α̂t). Hence, it updates the final word distribution Pfinal.

Figure

Figure 9: Self-reported usefulness.

Figure

Figure 2: Screenshot of the experiment interface for human evaluation. Participants are asked to predict which restaurant will be rated higher after 50 reviews based on the summaries of the first 10 reviews where these two restaurants have the same average rating in the first 10 reviews.

Figure

Figure 6: Fig. 6a shows that DecSum is the only method that enables humans to statistically outperform random chance. Fig. 6b further shows that DecSum leads to more individuals with high performance.

Figure

Figure 5: Summary quality evaluation using SUM-QE (Xenouleas et al., 2019). DecSum achieves strong textual non-redundancy, but leads to lower grammaticality and coherence.

Figure

Figure 10: SUM-QE evaluation on referential clarity and focus.

Figure

Figure 7: Model prediction distributions of each rating group from logistic regression (LR), deep averaging networks (DAN), and Longformer. Only the Longformer model can properly distinguish sentences located in different score ranges. LR and DAN are not robust to the input length shift, where models are trained with input of the full 10 reviews but are tested with individual sentences.

Figure

Figure 4: Sentence-level sentiment distribution of summaries. DecSum can select a wider range of sentences w.r.t. sentiment diversity.

Figure

Figure 8: The effect of summary length.

Figure

Figure 3: Wasserstein distance between model predictions of summary sentences and all sentences of the first ten reviews. Lower values indicate better representativeness. Error bars represent standard errors. DecSum (1, 1, 1) is significantly better than other approaches, including DecSum (1, 0, 1), with p-value ≤ 0.0001 with paired t-tests.
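
For reference, this distance can be computed with SciPy as sketched below; predict is an assumed stand-in for the decision model's per-sentence prediction, and the names are my own.

```python
from scipy.stats import wasserstein_distance

def representativeness(summary_sentences, all_sentences, predict):
    """Distance between the prediction distributions of the summary sentences
    and of all review sentences; lower means more representative."""
    return wasserstein_distance([predict(s) for s in summary_sentences],
                                [predict(s) for s in all_sentences])
```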

Figure

Figure 1: Illustration of the selected sentences by different methods on the distribution of model predictions on all individual sentences. Our method (DecSum) covers the full distribution, while PreSumm, a text-only summarization method, concentrates on the right side, and integrated gradients, a model-explanation method, misses the middle part.

Figure

Figure 3: The results of human evaluation, where forward slashes and backslashes represent BASE+GRAPH+CL versus the reference and versus BASE, respectively. Yellow, green, and blue indicate that our model loses, ties with, or wins against the competitor, respectively.

Figure

Figure 1: An example of the findings and corresponding impression, where the relation information, as well as positive and negative examples, are also shown in the figure. Note that △ represents the removed word.

Figure

Figure 4: R-1 score of generated impressions from BASE and our model on the MIMIC-CXR test set, where OURS represents BASE+GRAPH+CL.

Figure

Figure 2: The overall architecture of our proposed method with graph and contrastive learning. An example input and output at steps t−1 and t are shown in the figure, where the top is the backbone sequence-to-sequence paradigm with a graph to store relation information between critical words, and the bottom is the contrastive learning module with specific positive and negative examples. m refers to a mask vector.

Figure

Figure 5: Examples of the generated impressions from BASE and BASE+GRAPH+CL as well as reference impressions. The yellow nodes in the graph indicate that these words are contained in entities.

Figure

Figure 5: Sample summaries based on OUT-OF-DOMAIN and MIX-DOMAIN training on opinion articles.

Figure

Figure 3: Named entity distribution (left) and subjective word distribution (right) in abstracts. More PERSON entities, fewer ORGANIZATION entities, and fewer subjective words are observed in OPINION.

Figure

Figure 1: A snippet of sample news story and opinion article from The New York Times Annotated Corpus (Sandhaus, 2008).

Figure

Figure 4: BLEU (left) and ROUGE-L (right) performance for the In-domain and Mix-domain setups over different amounts of training data. As the training data increases, In-domain outperforms Mix-domain training.

Figure

Figure 2: [Left] Part-of-speech (POS) distribution for words in abstracts. [Right] Percentage of words in abstracts that are reused from the input, per POS and over all words. OPINION abstracts generally reuse fewer words.

Figure

Figure 7: Factual errors made by the BertSumExtAbs model.

Figure

Figure 4: Score Card

Figure

Figure 16: Reference contains rhetorical sentences that interest readers.

Figure

Figure 15: Noisy data in the reference.

Figure

Figure 12: The Bottom-Up model performs poorly while the other models work well.

Figure

Figure 2: Distribution of source sentences used for content generation. X-axis: sentence position in the source article. Y-axis: the negative log of sentence coverage.

Figure

Figure 4: Evaluation with QA model prediction probability and accuracy on our multiple choice cloze test, with higher numbers indicating better summaries.

Figure

Figure 10: The Pointer-Generator-with-Coverage model tends to make Addition errors in cases where the Pointer-Generator does not produce repetitions.

Figure

Figure 14: Reference contains grammatical errors.

Figure

Figure 2: Our ASGARD framework with document-level graph encoding. The summary is generated by attending to both the graph and the input document.

Figure

Figure 13: Example of positive-negative errors.

Figure

Figure 5: Error Log

Figure

Figure 1: PolyTope assigns a verdict to each error along three coordinates according to its syntactic and semantic roles.

Figure

Figure 3: Sample construction of multiple choice cloze questions and candidate answers from reference summary and salient context. Arguments and predicates in candidate answers are color-coded and italicized.

Figure

Figure 3: A case study that compares various evaluation methods with each other.

Figure

Figure 6: Distribution of automatic summarization metrics with three types of unfaithful errors. “True” indicates summaries with such type of error.

Figure

Figure 9: Factual errors made by the Bottom-Up model.

Figure

Figure 6: Scores per Segment

Figure

Figure 8: Factual errors made by the Pointer-Generator-with-Coverage model.

Figure

Figure 11: The Pointer-Generator-with-Coverage model tends to incorrectly combine information from the document, thus leading to Inacc Intrinsic errors.

Figure

Figure 1: Sample knowledge graph constructed from an article snippet. The graph localizes relevant information for entities (color coded, e.g. “John M. Fabrizi”) or events (underlined) and provides global context.

Figure

Figure 5: Sample summaries for an NYT article. Summaries by our models with the graph encoder are more informative than the variant without it.

Figure

Figure 3: Sample summaries for a government report. The model with truncated input generates unfaithful content. HEPOS attention with a Sinkhorn encoder covers more salient information.

Figure

Figure 5: Aspect-level informativeness and percentages of sentences containing unfaithful errors as labeled by both human judges on PubMed. Models with efficient attentions reduce errors for later sections in the sources, e.g., “Results” and “Conclusion”.

Figure

Figure 6: Aspect-level informativeness and percentages of sentences with unfaithful errors on GovReport.

Figure

Figure 2: System overview.

Figure

Figure 4: Summarizing articles truncated at different lengths by the best models: LSH (7168)+HEPOS on PubMed and SINKHORN (10240)+HEPOS on GovReport. Reading more consistently improves ROUGE-2.

Figure

Figure 2: Percentage of unique salient bigrams accumulated from the start to X% of the source. Key information is spread over the documents in GOVREPORT, highlighting the importance of understanding longer text.

Figure

Figure 1: A toy example of our HEPOS attention, with a stride of 2 and four attention heads. Dark colors indicate that heads 1 and 3 attend to the first and third tokens (“Job” and “home”) in the input, while heads 2 and 4 look at the second and fourth words (“in” and “care”).
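
The strided, head-wise pattern described in this caption can be sketched as a per-head attention mask. The snippet below is an illustrative reconstruction under the assumption that a head may attend to source positions whose index matches the head index modulo the stride; it is not the authors' implementation.

```python
# A toy sketch of the head-wise positional stride pattern in the figure:
# with a stride of 2, heads 1 and 3 attend to source positions 1, 3, ...
# while heads 2 and 4 attend to positions 2, 4, ...
import torch

def hepos_mask(num_heads: int, src_len: int, stride: int) -> torch.Tensor:
    """Return a (num_heads, src_len) boolean mask: True = head may attend."""
    positions = torch.arange(src_len)             # 0-based source positions
    heads = torch.arange(num_heads).unsqueeze(1)  # (num_heads, 1)
    return (positions % stride) == (heads % stride)

mask = hepos_mask(num_heads=4, src_len=4, stride=2)
print(mask.int())
# tensor([[1, 0, 1, 0],   head 1 -> "Job", "home"
#         [0, 1, 0, 1],   head 2 -> "in", "care"
#         [1, 0, 1, 0],   head 3
#         [0, 1, 0, 1]])  head 4
```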

Figure

Figure 7: Sample summaries for a government report. The model with truncated input generates unfaithful content. Our HEPOS encoder-decoder attention with Sinkhorn encoder attention covers more salient information in the “What GAO found” aspect.

Figure

Figure 1: Relations among text spans of different granularity.

Figure

Figure 3: Graph attention mechanism.

Figure

Fig. 1. Overview of a query-biased summarizer with a copying mechanism.

Figure

Figure 6: ROUGE-1 scores of COOP with approximate search in different configurations.

Figure

Figure 1: Illustration of the latent space Z and text space X . The de facto standard approach in unsupervised opinion summarization uses the simple average of input review vectors zreview (◦) to obtain the summary vector zavg (▴). The simply averaged vector zavg tends to be close to the center (i.e., has a small L2-norm) in the latent space, and a generated summary xavg (⬩) tends to become overly generic. Our proposed framework COOP finds a better aggregated vector to generate a more specific summary xCOOP(▪) from the latent vector zCOOP (⋆).

Figure

Figure 8: Illustrations of the relationships between the L2-norm of latent vectors ∥z∥ and the input review quality: (a) text length and (b) the information content.

Figure

Figure 2: Average L2-norm of simply averaged summary vectors for different numbers of input reviews.

Figure

Figure 7: Example of summaries generated by BIMEANVAE with SimpleAvg and COOP for reviews about a product on the Yelp dataset. The colors denote the corresponding opinions, and struck-through reviews in gray were not selected by COOP for summary generation (Note that SimpleAvg uses all the input reviews.) Terms that are more specific to the entity are underlined.

Figure

Figure 5: L2-norm distributions of latent vectors of input reviews and aggregated vectors.

Figure

Figure 4: COOP searches convex combinations of the latent vectors of input reviews based on the input-output word overlap between a generated summary and input reviews. × denotes the simply averaged vector.
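
A rough sketch of this search, under the assumption that candidate convex combinations are uniform averages over subsets of the input reviews; `encode`, `decode`, and `word_overlap` are hypothetical stand-ins for the VAE encoder, the decoder, and the input-output word-overlap objective.

```python
# Illustrative subset search over convex combinations of review latent vectors.
from itertools import combinations
import numpy as np

def coop_search(reviews, encode, decode, word_overlap):
    zs = [encode(r) for r in reviews]
    best_score, best_summary = float("-inf"), None
    for k in range(1, len(reviews) + 1):
        for subset in combinations(range(len(reviews)), k):
            z = np.mean([zs[i] for i in subset], axis=0)  # convex combination
            summary = decode(z)
            score = word_overlap(summary, [reviews[i] for i in subset])
            if score > best_score:
                best_score, best_summary = score, summary
    return best_summary
```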

Figure

Figure 10: The inference runtime of BIMEANVAE and Optimus with different batch sizes.

Figure

Figure 9: Approximate search performance in ROUGE-2/L scores with different batch sizes.

Figure

Figure 3: Correlation analysis of the L2-norm of latent vectors ∥z∥ and the generated text quality: (a) text length and (b) information amount.

Figure

Figure 11: Example of summaries generated by BIMEANVAE with SimpleAvg and COOP for reviews about a product on the Amazon dataset. The colors denote the corresponding opinions, and struck-through reviews in gray were not selected by COOP for summary generation (Note that SimpleAvg uses all the input reviews.) Terms that are more specific to the entity are underlined. Red and struck-through text denotes hallucinated content that has the opposite meaning compared to the input.

Figure

Figure 4: BERTScore and Intra-BERTScore for generated contrastive summaries with different hyperparameters δ. The goal is to generate high quality and distinctive summaries (upper right).

Figure

Figure 3: Encoder of the base common summarization model has type embeddings to distinguish the original entity.

Figure

Figure 1: Overview of the comparative opinion summarization task. The model takes two sets of reviews about different entities and generates two contrastive opinion summaries, which contain distinctive opinions, and one common opinion summary, which describes opinions shared between the two entities.

Figure

Figure 5: Sentence Annotation Task. By showing sentences of the same aspect category, it is easier for annotators to compare two groups of sentences (from two entities). To further facilitate the annotation process, we also provide several additional features, such as allowing workers to group sentences that contain the same token by double-clicking, and to highlight sentences by hovering over the sentence label.

Figure

Figure 2: Illustration of Co-decoding: (a) For contrastive summary generation, distinctive words are emphasized by contrasting the token probability distribution of target entity against that of the counterpart entity. (b) For common summary generation, entity-pair-specific words are highlighted by aggregating token probability distributions of all base models to alleviate the overly generic summary generation issue.

Figure

Figure 6: Summary Collection Task. We show workers two groups of sentences based on the labels we collected from the sentence annotation task. Similar features, such as allowing workers to group sentences that contain the same token by double-clicking, are also supported in this task.

Figure

Figure 1: Proposed multi-task learning framework for sentence extraction with document classification

Figure

Figure 2: Sentence extraction model using LSTM-RNN with multi-task learning

Figure

Figure 5: Visualization of DiscourseRank. The darker the highlighting, the higher the rank score. The references and generated summaries are also shown.

Figure

Figure 2: Outline of StrSum.

Figure

Figure 4: Examples of generated summaries and induced latent discourse trees.

Figure

Figure 7: Examples of generated summaries and induced latent discourse trees for long reviews. (a) shows a movie review. The 4th sentence mentions the overall positiveness. The 10th describes that the contents are easy to follow, while the 20th to 22nd show the details of the contents. The 27th mentions the performance and accurate portrayal, and the 8th and 16th elaborate on the latter and the former, respectively. (b) presents a pocket knife review. The 11th, 13th, 15th, and 21st sentences concisely describe the goodness in each aspect. The 14th, 24th, and 28th elaborate on their parents.

Figure

Figure 3: ROUGE-L F1 score on evaluation set with various numbers of sentences.

Figure

Figure 6: Examples of generated summaries and induced latent discourse trees for negative reviews. (a) shows a board game review. The induced tree shows that the 1st and 6th sentences present additional information about the generated summary. While the 1st to 4th indicate the heaviness of the game, the 5th and 6th criticize the artwork. The 2nd, 3rd, and 4th present additional information about their parent. (b) presents a movie review. The 1st and 2nd sentences describe the overall evaluation, while the 6th and 7th strengthen the opinion. The 3rd to 5th mention the boring points in detail. Although our model captures the negativeness, the summary is redundant, probably because each sentence in the body is relatively long.

Figure

Figure 1: Example of the discourse tree of a jigsaw puzzle review. StrSum induces the latent tree and generates the summary from the children of a root, while DiscourseRank supports it to focus on the main review point.

Figure

Figure 5: 2-D latent space projected by principal component analysis. Each point corresponds to the mean of the latent distribution of a topic sentence, and each circle denotes the same Mahalanobis distance from the mean.

Figure

Figure 3: Analogy with Gaussian word embedding.

Figure

Figure 6: Example of a path distribution (blue) and level distribution (red). Both the sum of a path distribution over each level and the sum of a level distribution over each path are equal to 1.

Figure

Figure 1: Outline of our approach. (1) The latent distribution of review sentences is represented as a recursive GMM and trained in an autoencoding manner. Then, (2) the topic sentences are inferred by decoding each Gaussian component. An example of a restaurant review and its corresponding gold summary are displayed.

Figure

Figure 2: Outline of our model. We set a recursive Gaussian mixture as the latent prior of review sentences and obtain the latent posteriors of topic sentences by decomposing the posteriors of review sentences.

Figure

Figure 4: Generated topic sentences of (a) an Amazon review of heeled shoes, (b) a Yelp review of a coffee shop, and (c) an Amazon review of a table and chair set. Topic sentences selected as a summary are highlighted in italics.

Figure

Figure 2: Illustration of word and sentence level attention in the second decoder step (Eq. 1 and Eq. 2). Purple: attention on words, Orange: attention on sentences, Unidirectional dotted arrows: attention from previous step, Bidirectional arrows: attention from previous and to present step. Best viewed in color.

Figure

Figure 1: SWAP-NET architecture. EW: word encoder, ES: sentence encoder, DW: word decoder, DS: sentence decoder, Q: switch. Input document has words [w1, . . . , w5] and sentences [s1, s2]. Target sequence shown: v1 = w2, v2 = s1, v3 = w5. Best viewed in color.

Figure

Figure 2: T-SNE Visualization on CNN/DM Test Set

Figure

Figure 2: T-SNE Visualization on CNN/DM Test Set

Figure

Figure 1: Overview of Hierarchical Attentive Heterogeneous Graph

Figure

Figure 1: Overview of Architecture

Figure

Figure 1: ROUGE score for documents with different length. The result is calculated on the test set of CNN/DM and the trained model is based on BERT.

Figure

Figure 3: Comparison Between the ROUGE Scores Tendencies of BERTSUMEXT and DifferSum

Figure

Figure 2: T-SNE Visualization on CNN/DM.

Figure

Figure 1: Overview of ThresSum.

Figure

Figure 2: Overview of DifferSum.

Figure

Figure 4: Solid lines represent the forward pass, and dashed lines represent the gradient flow in backpropagation. For the two ablation tests, we stop the gradient at ① and ②, respectively.

Figure

Figure 2: Our 2-decoder summarization model with a pointer decoder and a closed-book decoder, both sharing a single encoder (this is during training; next, at inference time, we only employ the memory-enhanced encoder and the pointer decoder).

Figure

Figure 3: The summary generated by our 2-decoder model covers salient information (highlighted in red) mentioned in the reference summary, which is not presented in the baseline summary.

Figure

Figure 1: Baseline model repeats itself twice (italic), and fails to find all salient information (highlighted in red in the original text) from the source text that is covered by our 2-decoder model. The summary generated by our 2-decoder model also recovers most of the information mentioned in the reference summary (highlighted in blue in the reference summary).

Figure

Figure 2: The Filler and Role Binding operation of the TP-TRANSFORMER model architecture.

Figure

Figure 1: An example document and its one line summary from XSum dataset. Document content that is composed into an abstractive summary is color-coded.

Figure

Figure 3: TP-TRANSFORMER model architecture.

Figure

Figure 3: Comparison of section-selection strategies of SEHY paired with BigBird-large on the dataset DMath.

Figure

Figure 1: The distribution of summary sentences per section type, cited from (Gidiotis and Tsoumakas, 2020).

Figure

Figure 2: Comparison of section-selection strategies of SEHY paired with BigBird-large on the dataset DCS .

Figure

Figure 4: Comparison of section-selection strategies of SEHY paired with BigBird-large on the dataset DPhy .

Figure

Figure 3: The graph layer consists of a graph attention mechanism and a feed-forward network. Through the graph attention, each node merges the neighbor relations. The neighbor relations are represented as triples, and the incoming relations and the outgoing relations are obtained through different mappings, which are marked in red and green color respectively in the Figure.

Figure

Figure 1: An example sentence annotated with a semantic dependency graph. The green color represents the dependency of the root node “hoping”. Some dependency edges are omitted for display.

Figure

Figure 2: The overview of our SemSUM model

Figure

Figure 4: Human evaluation. They are rated on a Likert scale of 1 (worst) to 5 (best).

Figure

Figure 3: Rouge-1 score vs. the number of selected sentences.

Figure

Figure 1: An example document: There are two different relationships among sentences: the semantic similarity (yellow) and the natural connection (green). Sentences 2, 3, 21 are the oracle sentences.

Figure

Figure 2: Overview of the proposed Multi-GraS, the word block and Multi-GCN.

Figure

Figure 1: Social bias in automatic summarization: We take steps toward evaluating the impact of the gender, age, and race of the humans involved in the summarization system evaluation loop: the authors of the summaries and the human judges or raters. We observe significant group disparities, with lower performance when systems are evaluated on summaries produced by minority groups. See §3 and Table 1 for more details on the Rouge-L scores in the bar chart.

Figure

Figure 3: An example taken from the COVID-19 dataset. Text in the same color indicates that the described contents are the same.

Figure

Figure 2: The sentence position distribution of the extracted summaries and the oracle summaries.

Figure

Figure 1: Our proposed multi-view information bottleneck framework. I(s;Y ) denotes the mutual information between sentence s and correlated signal Y , and NSP is short for Next Sentence Prediction task.

Figure

Figure 3: Intersection of averaged summary sentence overlaps across the sub-aspects. We use First for Position, ConvexFall for Diversity, and N-Nearest for Importance. The number in parentheses, called Oracle Recall, is the averaged ratio of oracle sentences that are NOT chosen by the union of the three sub-aspect algorithms. Other corpora are in the Appendix with their Oracle Recalls: Newsroom (54.4%), PubMed (64.0%), and MScript (99.1%).
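
For concreteness, a small sketch of how the Oracle Recall figure in parentheses could be computed, following the definition in this caption; the variable names are illustrative.

```python
# Per document: the fraction of oracle sentences missed by the union of the
# three sub-aspect selections; Oracle Recall averages this over the corpus.
def oracle_miss_ratio(oracle_ids, position_ids, diversity_ids, importance_ids):
    union = set(position_ids) | set(diversity_ids) | set(importance_ids)
    missed = [i for i in oracle_ids if i not in union]
    return len(missed) / len(oracle_ids)

def corpus_oracle_recall(docs):
    """docs: iterable of (oracle_ids, position_ids, diversity_ids, importance_ids)."""
    ratios = [oracle_miss_ratio(*doc) for doc in docs]
    return sum(ratios) / len(ratios)
```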

Figure

Figure 2: Volume maximization functions. Black dots are sentences in source document, and red dots are chosen summary sentences. The red-shaded polygons are volume space of the summary sentences.

Figure

Figure 4: PCA projection of extractive summaries chosen by multiple aspects of algorithms (CNNDM). Source and target sentences are black circles and cyan triangles, respectively. The blue, green, and red circles are summary sentences chosen by First, ConvexFall, and NN, respectively. The yellow triangles are the oracle sentences. The shaded polygon represents the ConvexHull volume of a sample source document. Best viewed in color. Please find more examples in the Appendix.

Figure

Figure 5: Sentence overlap proportion of each sub-aspect (row) with the oracle summary across corpora (column). y-axis is the frequency of overlapped sentences with the oracle summary. X-axis is the normalized RANK of individual sentences in the input document where size of bin is 0.05. E.g., the first / the most diverse / the most important sentence is in the first bin. If earlier bars are frequent, the aspect is positively relevant to the corpus.

Figure

Figure 1: A simple change to an article choice (in bold) in the extractive summary can improve its readability and coherence.

Figure

Figure 3: Preference judgement scores of the three judges A1, A2 and A3 across various summarizers and datasets.

Figure

Figure 5: Performance of the learning models in terms of (average) overlap between the models’ decisions and those of the annotators on 100 randomly sampled summaries generated by BanditSum.

Figure

Figure 4: Performance of the learning models in terms of (average) overlap (in %) between the models’ decisions and those of the annotators on 100 randomly sampled summaries generated by the different summarizers. Abbreviations: ST: Stories, AR: Articles, SM: Summaries, sub: subset.

Figure

Figure 2: Description of baseline models. (a) Concat model. (b) Text model.

Figure

Figure 1: Example of a post and a reply with a quote and a reply with no quote. Implicit quote is the part of post that reply refers to, but not explicitly shown in the reply.

Figure

Figure 2: Description of our model, Implicit Quote Extractor (IQE). The Extractor extracts sentences and uses them as summaries. k and j are indices of the extracted sentences.

Figure

Figure 3: Correlation between ROUGE-1-F score and maximum PageRank of each post on ECS and EPS datasets. X-axis shows rounded maximum PageRank, and Y-axis shows ROUGE-1-F and the error bar represents the standard error.

Figure

Figure 1: Description of the Appropriateness Estimator.

Figure

Figure 1: Sentence extractor architectures: a) RNN, b) Seq2Seq, c) Cheng & Lapata, and d) SummaRunner. Attention is indicated in the diagram. Green blocks represent sentence encoder output and red blocks indicate learned "begin decoding" embeddings. Vertically stacked yellow and orange boxes indicate extractor encoder and decoder hidden states, respectively. Horizontal orange and yellow blocks indicate multi-layer perceptrons. The purple blocks represent the document and summary state in the SummaRunner extractor.

Figure

Figure 1: A strapline (“Don’t expect ...”) that is mistaken for a summary in the Newsroom corpus.

Figure

Figure 2: Relative locations of bigrams of the gold summary in the source text across different datasets.
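
One plausible way to compute the statistic plotted here, shown as a hedged sketch; the tokenization and first-match convention are assumptions.

```python
# For each bigram of the gold summary, find where it first occurs in the
# source and record that position normalized by source length.
def bigrams(tokens):
    return list(zip(tokens, tokens[1:]))

def relative_bigram_locations(source_tokens, summary_tokens):
    src_bigrams = bigrams(source_tokens)
    locations = []
    for bg in set(bigrams(summary_tokens)):
        if bg in src_bigrams:
            idx = src_bigrams.index(bg)               # first occurrence
            locations.append(idx / max(1, len(src_bigrams)))
    return locations  # histogram these values over a corpus to get the plot
```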

Figure

Figure 1: An example post of the TIFU subreddit.

Figure

Figure 6: Examples of abstractive summaries generated by our model and baselines. In each set, we also show the source text and reference summary.

Figure

Figure 4: Comparison between (a) the gated linear unit (Gehring et al., 2017) and (b) the proposed normalized gated tanh unit.

Figure

Figure 5: Examples of abstractive summaries generated by our model and baselines. In each set, we also show the source text and gold summary.

Figure

Figure 3: Illustration of the proposed multi-level memory network (MMN) model.

Figure

Figure 3: F-measure ROUGE-1 performance (%) vs. number of models for news-headline-generation task. X-axis is log scale (21–27).

Figure

Figure 1: Flow charts of current runtime-ensemble (a) and our proposed post-ensemble (b).

Figure

Figure 2: The left scatter plot shows a two-dimensional visualization of outputs generated from 10 models on the basis of multi-dimensional scaling (Cox and Cox, 2008), and the right list shows their contents. Each point in the plot represents the sentence embedding of the corresponding output, and the label indicates the model ID and ROUGE-1, i.e., “ID (ROUGE).” Color intensity indicates the kernel density estimation score of PostCosE (see the right color bar), and outputs are sorted by score. The reference and input are as follows; each bold word in the above list co-occurs with the reference below. Reference: interpol asks world govts to make rules for global policing Input: top interpol officers on wednesday asked its members to devise rules and procedures for policing at the global level and providing legal status to red corner notices against wanted fugitives .

Figure

Figure 4: Violation penalty for compression (left) and reconstruction (right). The x-axis is step and the y-axis is each ratio. The horizontal lines in the middle are ρ and τ , and the dashed lines represent ρ(t) and τ(t). The circles represent a step where the agent breaks the constraints.

Figure

Figure 1: Overview of previous (left) and proposed (right) approaches on CR learning paradigm.

Figure

Figure 2: Algorithmic visualization of iterative action prediction

Figure

Figure 3: Deterministic compression and reconstruction with masked language model

Figure

Figure 3: US H.R.1680 (115th)

Figure

Figure 2: Example US Bill

Figure

Figure 4: US H.R.6355 (115th)

Figure

Figure 5: California Bill Summary

Figure

Figure 1: Bill Lengths

Figure

Figure 1: Experimental set-up. Left: multi-task training, Right: training with structured input.

Figure

Figure 2: Example from the LipKey dataset, with gold-standard and generated summaries.

Figure

Figure 3: Example of articles and keyphrases in the LipKey dataset. We highlight words in the article that match its absent keyphrases with different colours. Yellow means partial match, green means acronym, and blue means morphology variants. The English translation is for illustration purposes.

Figure

Figure 1: A taxonomy of concepts.

Figure

Figure 1: CWE vs Voting for different simplicity and reading ease levels. ROUGE-2 precision and recall are shown for different levels of tuning achieved.

Figure

Figure 1: Variation in the attention coverage while summarizing an article for different topics

Figure

Figure 1: Procedure to create pretraining dataset using the nonsense corpus and our proposed pretraining tasks

Figure

Figure 1: Estimator model architecture used in COMES. Source, reference and hypothesis are all independently encoded with a pre-trained encoder. Pooling layer is used to create sentence embeddings from sequences of token embeddings. In the COMES variant, the last feed-forward layer has 4 outputs, corresponding to different summary evaluation dimensions. Dashed lines are used to indicate the reference-less variant. For the full COMET description see Rei et al. (2020).

Figure

Figure 2: ROUGE and novel n-grams results on the anonymized validation set for different runs of each model type. Lines indicate the Pareto frontier for each model type.

Figure

Figure 1: The network architecture with the decoder factorized into separate contextual and language models. The reference vector, composed of the context vectors c^tmp_t and c^int_t and the hidden state of the contextual model h^dec_t, is fused with the hidden state of the language model and then used to compute the distribution over the output vocabulary.

Figure

Figure 2: Pairwise similarities between model outputs computed using ROUGE. Above diagonal: Unigram overlap (ROUGE-1). Below diagonal: 4-gram overlap (ROUGE-4). Model order (M-) follows Table 6.

Figure

Figure 1: The distribution of important sentences over the length of the article according to human annotators (blue) and its cumulative distribution (red).

Figure

Figure 1: Procedure to generate synthetic training data. S is a set of source documents, T + is a set of semantically invariant text transformations, T − is a set of semantically variant text transformations, + is a positive label, − is a negative label.

Figure

Figure 1: Distribution of Gold Summary Rank

Figure

Figure 2: Validation losses for BERTSUM, RoBERTa, and SynRoBERTa (ns = {1, 2}) . “[CLS]” and “[ROOT]” indicate the tokens of sentence representations for predicting labels.

Figure

Figure 1: Different from the previous work, DISCOBERT (Xu et al., 2020), NeRoBERTa selects sentences by considering both intra- and inter-sentence relationships as a nested tree structure.

Figure

Figure 1: Motivating example. A document from CNN.com (keywords generated by masking procedure are bolded), the masked version of the article, and generated summaries by three Summary Loop models under different length constraints.

Figure

Figure 2: The Summary Loop involves three neural models: Summarizer, Coverage and Fluency. Given a document and a length constraint, the Summarizer writes a summary. Coverage receives the summary and a masked version of the document, and fills in each of the masks. Fluency assigns a writing quality score to the summary. The Summarizer model is trained, other models are pretrained and frozen.

Figure

Figure 4: Histogram and average copied span lengths for abstractive summaries. A summary is composed of novel words and word spans of various lengths copied from the document. Summary Loop summaries copy shorter spans than prior automatic systems, but do not reach abstraction levels of human-written summaries.
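
A simplified sketch of how copied span lengths might be measured for this histogram, using greedy longest matching of the summary against the document; this is an assumed procedure, not necessarily the authors' exact one.

```python
# Greedily match the summary against the document, recording the length of
# each maximal copied span; novel words are skipped.
def copied_span_lengths(doc_tokens, summary_tokens):
    positions = {}
    for i, tok in enumerate(doc_tokens):
        positions.setdefault(tok, []).append(i)
    spans, i = [], 0
    while i < len(summary_tokens):
        best = 0
        for start in positions.get(summary_tokens[i], []):
            length = 0
            while (i + length < len(summary_tokens)
                   and start + length < len(doc_tokens)
                   and summary_tokens[i + length] == doc_tokens[start + length]):
                length += 1
            best = max(best, length)
        if best > 0:
            spans.append(best)
            i += best
        else:
            i += 1  # novel word, not part of any copied span
    return spans
```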

Figure

Figure 3: The Coverage model uses a finetuned BERT model. The summary is concatenated to the masked document as the input, and the model predicts the identity of each blank from the original document. The accuracy obtained is the raw coverage score.

Figure

Figure 1: Example document with an inconsistent summary. When running each sentence pair (Di, Sj) through an NLI model, S3 is not entailed by any document sentence. However, when running the entire (document, summary) at once, the NLI model incorrectly predicts that the document highly entails the entire summary.

Figure

Figure 2: Diagram of the SUMMAC-ZS (top) and SUMMAC-Conv (bottom) models. Both models utilize the same NLI pair matrix (middle) but differ in how they process it to obtain a score. SUMMAC-ZS is zero-shot and has no trained parameters. SUMMAC-Conv uses a convolutional layer trained on a binned version of the NLI pair matrix.
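
A compact sketch of the zero-shot aggregation over the NLI pair matrix shown in the diagram; `nli_entailment_prob` is a hypothetical wrapper around any sentence-pair NLI model, and the max-then-mean aggregation is a simplification of the full method.

```python
# Score every (document sentence, summary sentence) pair with an NLI model,
# take the maximum entailment score per summary sentence, then average.
import numpy as np

def summac_zero_shot(doc_sentences, summary_sentences, nli_entailment_prob):
    # pair_matrix[i, j] = entailment prob of summary sentence j given doc sentence i
    pair_matrix = np.array([[nli_entailment_prob(d, s) for s in summary_sentences]
                            for d in doc_sentences])
    per_summary_sentence = pair_matrix.max(axis=0)  # best supporting doc sentence
    return per_summary_sentence.mean()              # document-level consistency score
```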

Figure

Figure 3: An example from our human evaluation.

Figure

Figure 1: Extractiveness of generated outputs versus automated metric scores for Entailment, FactCC and DAE on the Gigaword dataset. We use coverage defined in Grusky et al. (2018) to measure extractiveness, where summaries with higher coverage are more extractive. We observe that automated metrics of faithfulness are positively correlated with extractiveness.

Figure

Figure 2: Faithfulness-Abstractiveness trade-off curves. The blue dots represent the quartile models used to generate the curve. The purple dot corresponds to the baseline. DAE and Loss Truncation are depicted by the brown and orange dots respectively. The green dots correspond to our proposed systems.

Figure

Figure 3: Comparison between conditions for average time to summarize (per document) for Reddit and XSum. In general, participants in XSum took longer to complete the task, likely due to unfamiliarity with the domain.

Figure

Figure 1: Sample task interface for the AI post-edit condition for XSum, showing the provided, AI-generated summary in the text box.

Figure

Figure 8: Task interface and questions on the summarization task.

Figure

Figure 7: XSum tutorial first example with good and bad explanations.

Figure

Figure 5: This figure shows the document length distribution of both datasets. The average length of the Reddit posts is 243.8 words, and the average length of the XSum articles is 222.3 words.

Figure

Figure 2: Average overall quality ratings for the summaries by type and dataset. For Reddit, the human reference was the worst (aside from the Random summary). For XSum, the AI-generated summary was the worst.

Figure

Figure 4: User experience plots for task difficulty, “I found it difficult to summarize the article well”, frustration, “Performing the summarization tasks was frustrating”, and assistance utility, “The provided summaries were not useful to me when I was performing the summarization tasks” for Reddit (Left) and XSum (Right). Responses were collected using 7 point rating scales.

Figure

Figure 6: Reddit tutorial first example with good and bad explanations.

Figure

Figure 9: Annotation interfaces.

Figure

Figure 2: System architecture. In this example, a sentence pair is chosen (red) and then merged to generate the first summary sentence. Next, a sentence singleton is selected (blue) and compressed for the second summary sentence.

Figure

Figure 4: A sentence’s position in a human summary can affect whether it is created by compression or by fusion.

Figure

Figure 2: Frequency of each merging method. Concatenation is the most common method of merging.

Figure

Figure 1: Annotation interface. A sentence from a random summarization system is shown along with four questions.

Figure

Figure 3: Position of ground-truth singletons and pairs in a document. The singletons of XSum can occur anywhere; the first and second sentence of a pair also appear far apart.

Figure

Figure 1: Portions of summary sentences generated by compression (content is drawn from 1 source sentence) and fusion (content is drawn from 2 or more source sentences). Humans often grab content from 1 or 2 document sentences when writing a summary sentence.

Figure

Figure 3: The first attention head from the l-th layer is dedicated to coreferring mentions. The head encourages tokens of the same PoC to share similar representations. Our results suggest that the attention head of the 5th layer achieves competitive performance, while most heads perform better than the baseline. The findings are congruent with Clark et al. (2019), which provides a detailed analysis of BERT’s attention.

Figure

Figure 2: Comparison of various highlighting strategies. Thresholding obtains the best performance.

Figure

Figure 2: Our TRANS-LINKING model facilitates summary generation by reducing the shifting distance, allowing the model attention to shift from “John” to the tokens “[E]” then to “loves” for predicting the next summary word.

Figure

Figure 2: Statistics of PoC occurrences and types.

Figure

Figure 1: An illustration of the annotation interface. A human annotator is asked to highlight text spans referring to the same entity, then choose one from the five pre-defined PoC types.

Figure

Figure 2: Training progress on WIKI’s training and validation data

Figure

Figure 1: Architecture of our Q-network

Figure

Figure 3: Validation Performance among Masked Ratio for Mask-and-Fill with Masked Article. We experiment with each of the five combinations of article mask ratio and summary mask ratio, and then plot the interpolated results.

Figure

Figure 1: An example of generated negative summary using masked article. Spans that are highlighted are masked when generating the negative summary. Note that red spans are factually inconsistent with the given article and blue spans are factually consistent.

Figure

Figure 4: Generated negative summaries for various masking ratios on the CNN/DM dataset. For MFMA and MF, we fix the summary masking ratio γ_S = 0.6.

Figure

Figure 7: Case study on entailment-based models. The first example comes from FactCC-Test and the second example comes from XSumHall.

Figure

Figure 5: Validation set performance versus the BERTScore between the original reference summaries and the negative summaries we generate using various combinations of article and summary masking ratios.

Figure

Figure 2: Overall flow of our proposed negative summary generation method, Mask-and-Fill with Masked Article.

Figure

Figure 1: An example of a news story in our data set. The short manual summary is marked in red rectangle. The blue rectangle shows a post from a user. In the green rectangle, it is a link of a related news story. Some posts may only include comments, reactions, etc. without the link to the related news stories.

Figure

Figure 2: Our deep recurrent generative decoder (DRGD) for latent structure modeling.

Figure

Figure 1: Headlines of the top stories from the channel “Technology” of CNN.

Figure

Figure 1: Our key information guide model. It consists of key information guide network, encoder and decoder. In the key information guide network, we encode the keywords to the key information representation k.

Figure

Figure 2: Visualization of sentence selection vectors. Ii and Oi indicate the i-th sentence of the input and output, respectively. Obviously, our model can detect more salient sentences that are included in the reference summary.

Figure

Figure 1: Comparison of sentence-level attention distributions for the summaries in Table 1 on a news article. (a) is the heatmap for the gold reference summary, (b) is for the Seq2seq-baseline system, (c) is for the Point-gen-cov (See et al., 2017) system, (d) is for the Hierarchical-baseline system and (e) is for our system. Ii and Oi indicate the i-th sentence of the input and output, respectively. Obviously, the seq2seq models, including the Seq2seq-baseline model and the Point-gen-cov model, lose much salient information of the input document and focus on the same set of sentences repeatedly. The Hierarchical-baseline model fails to detect several specific sentences that are salient and relevant for each summary sentence and focuses on the same set of sentences repeatedly. On the contrary, our method with structural regularizations focuses on different sets of source sentences when generating different summary sentences and discovers more salient information from the document.

Figure

Figure 2: Average count of novel words (words that do not appear in the article). The Seq2seq model generates more novel words, but fewer of them appear in the reference compared to our model.

Figure

Figure 3: Comparison of the output of two models on a news article. Bold words in text are the key information. (Baseline: enc-dec+attn; Our model: KIGN+prediction-guide)

Figure

Figure 1: The framework of our model. The entailment-aware encoder is learned by jointly training summarization generation (left part of (a), which is a seq2seq model) and entailment recognition (right part of (a), in which sentence pairs from the entailment recognition corpus are encoded as u and v). The entailment-aware decoder is learned via entailment RAML training, in which the summary is rewarded if it is entailed by the source sentence.

Figure

Figure 1: Our abstractive document summarization model, which mainly consists of three layers: document encoder layer (the top part), information selection layer (the middle part) and summary decoder layer (the bottom part).

Figure

Figure 2: ROUGE-1, ROUGE-2 and ROUGE-L F1 scores of the KIGN+Prediction-guide model w.r.t. different values of the hyperparameter α.

Figure

Figure 2: Our hierarchical encoder-decoder model with structural regularization for abstractive document summarization.

Figure

Figure 3: The performance of (a) summarization generation on Gigaword validation set and (b) entailment recognition on SNLI (Bowman et al., 2015) validation set with different task batch switches (α).

Figure

Figure 4: The structural regularization reduces undesirable repetitions, while summaries from the Seq2seq-baseline and the Hierarchical-baseline contain many n-gram repetitions.

Figure

Figure 3: Comparisons of structural-compression and structural-coverage analysis results on random samples from CNN/Daily Mail datasets, which demonstrate that both the Seq2seq-baseline model and the Hierarchical-baseline model are not yet able to capture them properly, but our model with structural regularizations achieves similar behavior with the gold reference summary.

Figure

Figure 4: An example of the generated review summaries from S2S+Att, PGN and USN (italic and bold denote words that do not appear in the review).

Figure

Figure 3: Effects of user-specific vocabulary size on development set of Trip.

Figure

Figure 5: Speed comparison of classical DPPs sampling (blue), FGMInference (red) and BFGMInference (gray) with a batch size of 100.

Figure

Figure 2: The architecture of the User-aware Sequence Network (USN). USN encodes two kinds of user information, the user embedding (u) and the user-specific vocabulary memory (U), into its two basic modules (User-aware Encoder and User-aware Decoder). ① and ② show strategies based on the user embedding and represent the User Selection and User Prediction strategies, respectively. ③ and ④ indicate strategies based on the user-specific vocabulary memory and represent the User Memory Prediction and User Memory Generation strategies, respectively.

Figure

Figure 4: Conditional sampling in Macro DPPs

Figure

Figure 7: The representation degeneration problem in NLG. We use tSNE (Maaten and Hinton, 2008) to reduce the dimension of the word embeddings learned by the model.

Figure

Figure 1: Degenerated attention distribution behind OTR problem. The generated summary repeats the first sentence in article. We select the first 16 words of summary and show their attention over first 50 words of article.

Figure

Figure 6: Actual attention distribution learned by vanilla model and DPPs models.

Figure

Figure 1: Personalized review summarization is motivated by the observation that different users are likely to write different summaries for the same review, according to their own experiences, thoughts, or writing styles.

Figure

Figure 4: Effects of attribute-specific vocabulary size on review summarization on the development set of TripAtt. When there is no attribute-specific vocabulary (size 0) in ASN, our model degenerates into S2SATT. The primary axis is for ROUGE-1 and ROUGE-L, and the secondary axis is for ROUGE-2.

Figure

Figure 2: Comparison of different reweighting methods on a simulated distribution. DPPs sampling reweighting approximates the original distribution better since it captures the high-attention area around position 160. It also samples fewer adjacent points around position 110.

Figure

Figure 3: The architecture of the Attribute-aware Sequence Network (ASN). ASN encodes two kinds of attribute information, the attribute embedding (a) and the attribute-specific vocabulary memory (A), into its two basic modules (Attribute-aware Review Encoder and Attribute-aware Summary Decoder). ① and ② show strategies based on the attribute embedding and represent the Attribute Selection and Attribute Prediction strategies, respectively. ③ and ④ indicate strategies based on the attribute-specific vocabulary memory and represent the Attribute Memory Prediction and Attribute Memory Generation strategies, respectively.

Figure

Figure 3: Parameter tuning of k on the metrics of Rhyme, Integrity, and Micro-Dist-2.

Figure

Figure 1: Examples of text with rigid formats. In lyrics, the syllables of the lyric words must align with the tones of the notation. In SongCi and Sonnet, there are strict rhyming schemes and the rhyming words are labeled in red color and italic font.

Figure

Figure 2: An example of a generated summary sentence that is fused by cross-sentence EDUs.

Figure

Figure 3: Results of Co-Selective models with MTL and two-stage learning (TSL) for summarization task.

Figure

Figure 2: The framework of our model with co-selective encoding. During training, a BiLSTM reads the original sentence (x1, x2, ..., xn) and the ground-truth keywords (k1, k2, ..., km) into the first-level hidden states h^r_i and h^k_i. A jointly trained keyword extractor takes h^r_i as input to predict whether each input word is a keyword or not. The co-selective encoding layer builds the second-level hidden states h^{r'}_i and h^{k'}_i. Then the summary is generated via dual-attention and dual-copy over both the original sentence and the keyword sequence. During testing, the ground-truth keywords are replaced by the keywords predicted by our trained keyword extractor.

Figure

Figure 4: Results of Co-Selective models with (w/) and without (w/o) fine-tuning (FT) for summarization task.

Figure

Figure 1: The overlapping keywords (marked in red) between the input sentence and the reference summary cover the main ideas of the input sentence. Our motivation is to generate summary guided by the keywords extracted from the input sentence.

Figure

Figure 1: Overall Architecture of Our Model

Figure

Figure 2: The framework of our proposed model.

Figure

Figure 3: R1/R2/RL vs. sparsity for token-level and sentence-level models. For the sentence-level model, we require it to extract at least three sentences.

Figure

Figure 3: Results of CoCoNet + CoCoPretrain model with different pre-training data selection strategies. “RG” is short for “ROUGE”.

Figure

Figure 2: The process of constructing the pre-training data. Given a piece of text, we divide it into an input span and an output span, and calculate their overlap score using Equation 22. The top-K scored span pairs are selected.
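
A hedged sketch of this construction; since Equation 22 is not reproduced in the caption, a simple lexical-overlap score stands in for it, while the splitting and top-K selection follow the description above.

```python
# Split each text into an input span and an output span, score the pair by a
# placeholder overlap measure, and keep the top-K scored pairs.
from itertools import islice

def token_overlap(input_span, output_span):
    a, b = set(input_span), set(output_span)
    return len(a & b) / max(1, len(b))   # placeholder for Equation 22

def build_pretraining_pairs(documents, split_point, k):
    scored = []
    for doc in documents:                # doc: list of tokens
        inp, out = doc[:split_point], doc[split_point:]
        scored.append((token_overlap(inp, out), inp, out))
    scored.sort(key=lambda x: x[0], reverse=True)
    return [(inp, out) for _, inp, out in islice(scored, k)]  # top-K pairs
```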

Figure

Figure 4: The rate of correctly copied n-grams.

Figure

Figure 2: The Extractive-Abstractive model architecture. The extractor samples the evidence from the source which is used by the abstractor.

Figure

Figure 4: Summarization outputs with their evidence (highlighted), from our systems at different sparsity levels.

Figure

Figure 1: An example of a summary and its evidence (highlighted) as generated by our framework.

Figure

Figure 6: Performance on arXiv and PubMed when we filter test-set examples by summary length.

Figure

Figure 4: FAR’s performance against different values of β on arXiv and NYT.

Figure

Figure 2: Visualization of facet bias. Nodes refer to sentence representations and star is the document representation. Black solid circles mean facets. Red dashed circle means threshold in Section 3.1. The dashed bidirection arrows denote the sentence similarities.

Figure

Figure 5: The examples come from the New York Times dataset.

Figure

Figure 1: Examples from New York Times. We show a selection of key sentences from the source document. “...” indicates context sentences omitted due to space limitations.

Figure

Figure 3: Sentence position distribution of arXiv and NYT. We use the first 40 sentences for NYT and the first 120 sentences for arXiv.

Figure

Figure 7: Sentence position distribution of 8 datasets.

Figure

Figure 5: The inference time of each system. Each time is the average of multiple runs (10 runs). “×N” means the running time is N times (rounded up) that of our method.

Figure

Figure 6: Impact of hyper-parameters λ and α.

Figure

Figure 3: A diagram for document segmentation.

Figure

Figure 2: The workflow of our proposed coarse-to-fine facet-aware ranking framework.

Figure

Figure 4: The smooth similarity curve.

Figure

Figure 1: An example from the Gov-Report dataset to introduce the process of our method. “...” refers to context sentences omitted due to space limitations. Highlighted sentences are the final extracted summary sentences. The content each arrow points to is the facet description of the semantic block on its left. Bold facets represent vital facet-aware semantic blocks of the final summary.

Figure

Figure 1: Structure of our proposed Convolutional Gated Unit. We implement 1-dimensional convolution with a structure similar to the Inception (Szegedy et al., 2015) over the outputs of the RNN encoder, where k refers to the kernel size.

Figure

Figure 2: Percentage of the duplicates at sentence level. Evaluated on the Gigaword.

Figure

Figure 1: Model architecture for sequence-to-sequence with coarse-to-fine attention. The left side is the encoder that reads the document, and the right side is the decoder that produces the output sequence. On the encoder side, the top-level hidden states are used for the coarse attention weights, while the word-level hidden states are used for the fine attention weights. The context vector is then produced by a weighted average of the word-level states. In HIER, we average over the coarse attention weights, thus requiring computation of all word-level hidden states. In C2F, we make a hard decision for which chunk of text to use, and so we only need to compute word-level hidden states for one chunk.

Figure

Figure 2: Predicted summaries for each model. The source document is truncated for clarity.

Figure

Figure 3: Sentence attention visualizations for different models. From left to right: (1) STANDARD, (2) HIER, (3) C2F, (4) C2F +MULTI2 +POS.

Figure

Figure 10: Example extractive output / abstractive input for models in the “dewey & lebeouf” example. The extractive method used is tf-idf.

Figure

Figure 5: Translation examples from the Transformer-ED, L = 500.

Figure

Figure 4: Similarity at different lengths

Figure

Figure 1: CNN seq2seq model

Figure

Figure 9: Screenshot of side-by-side human evaluation tool. Raters are asked whether they prefer model output on the left or right, given a ground truth Wikipedia text.

Figure

Figure 7: Three different samples from a T-DMCA model trained to produce an entire Wikipedia article, conditioned only on the title. Samples 1 and 3 are truncated due to space constraints.

Figure

Figure 5: Variance at different lengths

Figure

Figure 6: An example decoded from a T-DMCA model trained to produce an entire Wikipedia article, conditioned on 8192 reference document tokens.

Figure

Figure 2: Modified Decoder

Figure

Figure 4: Shows predictions for the same example from different models. Example model input can be found in Appendix A.4.

Figure

Figure 3: Shows perplexity versus L for tf-idf extraction on the combined corpus for different model architectures. For T-DMCA, E denotes the size of the mixture-of-experts layer.

Figure

Figure 3: The bucket distribution of the dataset

Figure

Figure 8: Screenshot of DUC-style linguistic quality human evaluation tool.

Figure

Figure 1: The architecture of the self-attention layers used in the T-DMCA model. Every attention layer takes a sequence of tokens as input and produces a sequence of similar length as the output. Left: Original self-attention as used in the transformer-decoder. Middle: Memory-compressed attention, which reduces the number of keys/values. Right: Local attention, which splits the sequence into individual smaller sub-sequences. The sub-sequences are then merged together to get the final output sequence.

Figure

Figure 2: ROUGE-L F1 for various extractive methods. The abstractive model contribution is shown for the best combined tf-idf -T-DMCA model.

Figure

Figure 4: Examples of generated summaries. Colored spans contain key information from the gold reference.

Figure

Figure 1: The structure of the MemAttr model.

Figure

Figure 2: The structure of our Memory Network.

Figure

Figure 1: An example of news summarization. Colored spans are salient segments selected to form a summary, and their corresponding sentences are underlined.

Figure

Figure 2: Examples of discourse-level segmentation. a) spans in blue and yellow are the EDUs with semantically fragmented information and spans in red are the inaccurate EDU splits; b) the sub-sentential segments after merging.

Figure

Figure 1: Architecture of the original BERT model (left) and BERTSUM (right). The sequence on top is the input document, followed by the summation of three kinds of embeddings for each token. The summed vectors are used as input embeddings to several bidirectional Transformer layers, generating contextual vectors for each token. BERTSUM extends BERT by inserting multiple [CLS] symbols to learn sentence representations and using interval segmentation embeddings (illustrated in red and green color) to distinguish multiple sentences.

Figure

Figure 3: Content selector designs: a) RNN architecture; b) BERT architecture.

Figure

Figure 2: Proportion of extracted sentences according to their position in the original document.

Figure

Figure 1: Dependency discourse tree for a document from the CNN/DailyMail dataset (Hermann et al., 2015). Blue nodes indicate the roots of the tree (i.e., summary sentences) and parent-child links indicate dependency relations.

Figure

Figure 6: Overview of the neural selector architecture.

Figure

Figure 7: Position distribution of generated summaries from a strong baseline model BertEXT and our conditional summarization model with position code set to 0 (3 implementations). X axis is the position ratio. Y axis is the sentence-level proportion.

Figure

Figure 1: Proposed conditional generation framework exploiting sub-aspect functions.

Figure

Figure 2: Cumulative position distribution of oracles built on ROUGE (blue) and BertScore (orange). X axis is the ratio of article length. Y axis is the cumulative percentage of summary sentences.

Figure

Figure 5: Sentence-level clustering result labeled with sub-aspect features. X axis is the cluster index. Y axis is the proportion of sub-aspect features in each cluster.

Figure

Figure 3: Sample-level distribution of sub-aspect functions of the BertScore oracle. Values are the percentage in categorized samples, which adds up to 60.03% of CNN/Daily Mail training set. The remaining 39.97% do not belong to any of these 3 sub-aspects.

Figure

Figure 9: Sub-aspect mapping of generated summary with diversity-focus code [0,1,0]. Left panel: one sentence in the summary belongs to diversity sub-aspect. Right panel: two sentences in the summary belong to diversity sub-aspect. Contour lines denote the number of generated summaries.

Figure

Figure 4: Autoencoder with adversarial training strategy for unsupervised clustering of sentence-level distribution of sub-aspect functions.

Figure

Figure 8: Sub-aspect mapping of generated summary with importance-focus code [1,0,0]. Left panel: one sentence in the summary belongs to the importance sub-aspect. Right panel: two sentences in the summary belong to the importance sub-aspect. Contour lines denote the number of generated summaries.

Figure

Figure 2: ROUGE-1 distributions of the candidates in pretraining stage training set (pre-train), fine-tuning stage training set (meta-train) and fine-tuning stage test set (meta-test) on XSum dataset.

Figure

Figure 3: The Refactor’s success rates with different bin widths. W denotes the bin width measured by ROUGE-1. R denotes the success rate of the Refactor outperforming the single best base system.

Figure

Figure 1: Illustration of two-stage learning. “Doc, Hypo, Ref” represent “input document, generated hypothesis, gold reference” respectively. “Hypo’” represents texts generated during test phase. ΘBase and ΘMeta represent learnable parameters in two stages.

Figure

Figure 1: An illustration of sparse attention patterns ((a), (b), (c)) and their combination (d) in HETFORMER.

Figure

Figure 2: Test performance with different numbers of candidate summaries on CNNDM. Origin denotes the original performance of the baseline model.

Figure

Figure 4: Fine-tuned Refactor’s selection accuracy on CNNDM with different difficulties. The X-axis is the difference of ROUGE score of BART and pre-trained Refactor outputs.

Figure

Figure 1: SimCLS framework for two-stage abstractive summarization, where Doc, S, Ref represent the document, generated summary and reference respectively. At the first stage, a Seq2Seq generator (BART) is used to generate candidate summaries. At the second stage, a scoring model (RoBERTa) is used to predict the performance of the candidate summaries based on the source document. The scoring model is trained with contrastive learning, where the training examples are provided by the Seq2Seq model.

Figure

Figure 3: Positional Bias. X-axis: the relative position of the matched sentence in source documents. Y-axis: the ratio of the matched sentences. For fair comparison, articles are first truncated to the generator’s maximum input length. Origin denotes the original performance of the baseline model.

Figure

Figure 2: Illustration of our length-control algorithm.

Figure

Figure 3: Kendall’s τ correlation of evaluation metrics with and without compression ratio.

Figure

Figure 4: Comparing our length-control NAUS and the truncated CTC beam search on the Gigaword headline generation test set.

Figure

Figure 1: The overview of our NAUS approach. In each search step, input words corresponding to grey cells are selected. Besides, the blue arrow refers to the training process, and the green arrow refers to inference.

Figure

Figure 1: The comparison between PSP and previous methods. “E” and “D” represent the encoder and the decoder, respectively.

Figure

Figure 3: The overall framework of SEGTRANS model. The blue circles indicate input source text, where dark blue circles indicate paragraph boundaries. The yellow circles indicate output target text, where orange circles indicate heading boundaries. Dotted red lines indicate attention heads with segmentation-aware attention mechanism and dotted blue lines indicate attention heads with original full attention mechanism.

Figure

Figure 1: Overview of LAAM on Transformer Seq2seq. The bold values are boosted attention scores. The shadow boxes denote the attention scores of eos.

Figure

Figure 2: Loop of candidate generation and model finetuning.

Figure

Figure 3: Performance comparison (BART v.s. BRIO-Mul) w.r.t. reference summary novelty. The x-axis represents different buckets of test examples grouped by reference summary novelty (Eq. 11). Larger x-coordinates correspond to examples of which the reference summaries have higher novelty. The left figure shows the performance improvement of our model compared with the baseline model, while the right one shows model performance.

Figure

Figure 3: Architecture and training scheme of PSP. Squares in blue and red indicate frozen and tuned parameters, respectively.

Figure

Figure 2: Architecture of semantic distribution from auto-regressive language model.

Figure

Figure 4: Reliability graphs on the CNNDM and XSum datasets. The accuracy of model’s predictions is plotted against the model’s confidence on these predictions.

Figure

Figure 2: Visualization of the encoder-decoder attention weights. The x-axis are the encoder input, including prompts across the encoder Pen and the source document X . The y-axis are the decoder input, including prompts across the decoder Pde and the target summary Y . The area in the red box represents the attentions of Pde assigning to Pen. The area in the yellow box represents the attentions of Y assigning to X . Darker color shows the more highly related associations between tokens.

Figure

Figure 1: An overview of our CAST method.

Figure

Figure 3: Performance versus the number of training samples in the setting of Group B, Table 1. Notice that NAUS is trained by pseudo-groundtruth given by unsupervised edit-based search (Schumann et al., 2020). Thus, our approach is indeed unsupervised.

Figure

Figure 4: One example news article on CNN website. It contains human-annotated segments and heading-style summaries.

Figure

Figure 1: Comparison of MLE loss (LMLE) and the contrastive loss (LCtr) in our method. MLE assumes a deterministic (one-point) distribution, in which the reference summary receives all the probability mass. Our method assumes a nondeterministic distribution in which system-generated summaries also receive probability mass according to their quality. The contrastive loss encourages the order of model-predicted probabilities of candidate summaries to be coordinated with the actual quality metric M by which the summaries will be evaluated. We assign the abstractive model a dual role – a single model could be used both as a generation model and a reference-free evaluation model.
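
A rough sketch of one pairwise max-margin formulation consistent with this description; the function name, the margin scaling lambda_m, and the exact functional form are assumptions, not necessarily the paper’s loss.

```python
import torch

def contrastive_loss(scores, metric_values, lambda_m=0.001):
    """scores: model log-probabilities for each candidate summary (tensor of shape [n]).
       metric_values: quality of each candidate under the metric M (e.g., ROUGE), same shape."""
    order = torch.argsort(metric_values, descending=True)   # best candidate first
    s = scores[order]
    loss = scores.new_zeros(())
    n = s.size(0)
    for i in range(n):
        for j in range(i + 1, n):
            # candidate i is better than j under M, so its predicted score should be higher
            loss = loss + torch.relu(s[j] - s[i] + (j - i) * lambda_m)
    return loss
```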

Figure

Figure 1: One example from the segmentation-based summarization task SEGNEWS. The news article is taken from a CNN news article and we truncate the article for display. CNN editors have divided this article into several sections and written a heading for each section. The goal of this task is to automatically identify sub-topic segments of multiple paragraphs and generate the heading-style summary for each segment. Dotted lines in the figure indicate segment boundaries. In this article, paragraphs 1,2 are annotated as the first segment, paragraphs 3,4 as the second segment, paragraphs 5,6 as the third segment, and paragraphs 7,8 as the fourth segment. To the right of the article are the heading-style summaries for the segments. Since the first segment is usually an overview of the news, we do not assign a summary to it.

Figure

Figure 2: The frequency of the non-stop words in summary appearing at different positions of the source article. The positions range from [0, 1024].

Figure

Figure 2: Variance of generated summary lengths in gold length test with soft length control.

Figure

Figure 4: Var and R-2 (Pre) of arbitrary length test with soft length control on complete test sets.

Figure

Figure 6: k-shot summarization results on XSum.

Figure

Figure 3: Var and R-2(F1) scores of gold length test with soft length control on divided test sets.

Figure

Figure 2: Coverage and density distributions of the BigSurvey.

Figure

Figure 5: Visualization of the encoder-decoder attention weights of the model with only prompts across the encoder and the decoder (left) and PSP (right). See Figure 2 for a detailed description.

Figure

Figure 1: A document with its high-quality and low-quality summaries. The heat map marks the salient content in the document. The darker the colour, the more salient the content.

Figure

Figure 4: Different inner prompts for one example source document. Different colors indicate different inner prompt embeddings. “NO. of words” means the length of the text span.

Figure

Figure 2: The overall framework of HER is formulated as a contextual bandit and can be divided into a two-stage process containing rough reading and careful reading.

Figure

Figure 5: Statistics on the number of sentences extracted by our model. Frequency is the number of documents.

Figure

Figure 3: Statistics of the selected sentences’ indices for HER, BANDITSUM (Dong et al., 2018), and HER w/o Local Net across different document lengths, reported on the test-split documents whose lengths are all less than 80.

Figure

Figure 4: A case on sentence selection of HER and HER w/o policy. The article is from the CNN dataset. The highlighted indices indicate the corresponding sentences that should be extracted as the summary.

Figure

Figure 1: An example of how human beings extract a summary. The article is from the CNN/DailyMail dataset.

Figure

Figure 1: Model architecture.

Figure

Figure 2: An example of the mixed transitive negative sampling process. The original part is in white, while the modified part is indicated as grey blocks.

Figure

Figure 3: The architecture of our proposed model for abstractive summarization. Our model consists of three parts: 1. Transformer Encoder-Decoder, 2. Entity Pointer Network, 3. Relation Pointer Network. The encoder in the Transformer Encoder-Decoder shares parameters with that in the Relation Pointer Network.

Figure

Figure 2: A document from XSum dataset and the facts in it.

Figure

Figure 5: Sample generated summaries by our models. The intrinsic hallucinations in the summary are marked blue and the key information in the document is bolded.

Figure

Figure 1: A sample document with corresponding summaries generated by different abstractive summarization methods, in which extrinsic hallucinations are marked in yellow and intrinsic hallucinations in blue. Note that the results of PTGEN [25] and TCONVS2S [20] come from Maynez et al. [18].

Figure

Figure 4: Sample generated summaries by our models. The extrinsic hallucinations in the summary are marked yellow and the key information in the document is bolded.

Figure

Figure 1: The overview of our model.

Figure

Figure 3: Predicted and ORACLE global attention in BART. Attention distributions are shown for (a) the whole source, (b) the source without the start & end tokens, and (c) the source without the start & end tokens and full stops.

Figure

Figure 1: (a) The attention distribution is the summation of cross attention over the same-colored lines, distinguished from that over different-colored lines, which always equals 1 due to softmax. (b) Local attention gradually increases as decoding proceeds. (c) Desired situation: growing local attention stays lower than global attention during decoding and exactly reaches it at the end.

Figure

Figure 4: Changes of the attention distribution when (a) one word in the reference is replaced by a similar word (s1) and a random word (s2), (b) the sentence order of the reference is shuffled, (c) a piece of factual knowledge in the reference is distorted.

Figure

Figure 2: Annotation pipeline of ENTSUM

Figure

Figure 3: Distribution of sentence positions for salient and summary sentences.

Figure

Figure 1: Example of a generic summary (blue), with three entity-centric summaries from ENTSUM focusing on the entities in bold.

Figure

Figure 1: Proposed BERT-Multitask model.

Figure

Figure 2: Specificity prediction model used.

Figure

Figure 2: Different fine-tuning conditions for T5. (- -) indicates optional additive data for Paraphrasing.

Figure

Figure 1: Mixtext model and the modified MixGen for generative tasks.

Figure

Figure 1: Examples of ∆(y, y′) of the original MRT and ∆̃(y, y′) of GOLC, where ROUGE-1 recall is calculated based on unigrams. In the two examples, the reference y is ⟨malaysia, markets, closed, for, holiday⟩, the sampled summary y′ is ⟨markets, in, malaysia, closed, for, holiday⟩, and cb(y) = len(' '.join(y)) = 35 and cb(y′) = len(' '.join(y′)) = 38.
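
The caption’s character-based length cb(·) and unigram ROUGE-1 recall can be reproduced with a few lines of Python; the helper names below are illustrative.

```python
from collections import Counter

def cb(tokens):
    # character count of the space-joined summary, as in the caption's definition
    return len(' '.join(tokens))

def rouge1_recall(reference, candidate):
    ref, cand = Counter(reference), Counter(candidate)
    overlap = sum(min(ref[w], cand[w]) for w in ref)
    return overlap / sum(ref.values())

y  = ['malaysia', 'markets', 'closed', 'for', 'holiday']
y2 = ['markets', 'in', 'malaysia', 'closed', 'for', 'holiday']
print(cb(y), cb(y2))            # 35 38, with the sequences as listed above
print(rouge1_recall(y, y2))     # 1.0 -- every reference unigram appears in y'
```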

Figure

Figure 2: Summary length distributions on CNN/Daily Mail and Mainichi. Summary length is the number of characters.

Figure

Figure 6: ROUGE-1 score relative to that of BART(1k) system evaluated on different partitions by length.

Figure

Figure 2: Self-Attention Pattern.

Figure

Figure 4: The average mean distance across multiheads for each layer. The average mean distance of the random weight model is slightly lower than DU as some inputs are shorter than 1,024.

Figure

Figure 4: ∆R1 (Y-axis) against r at inference (X-axis).

Figure

Figure 1: Overview of the combined architecture where we highlight different aspects of this work. N0 is the original document length, N is the input length to the generation system, and M is the summary length.

Figure

Figure 8: LoBART positional embedding is initialized by copying and flipping BART’s positional embedding.

Figure

Figure 9: ROUGE-1 score relative to that of BART(1k) on Spotify Podcast (Len:Avg=5,727, 90th%=11,677).

Figure

Figure 7: Example of LoBART’s encoder-decoder attention evaluated on Podcast test set.

Figure

Figure 1: The sum of attention weights against the number of retained sentences (r) evaluated on CNNDM.

Figure

Figure 5: Example of BART’s encoder-decoder attention evaluated on CNNDM test set.

Figure

Figure 5: The impact of training-time content selection methods on BART(1k) performance.

Figure

Figure 3: Operating points for B=1 and M=144. (1) Section 4 studies local attention to reduce quadratic complexity to linear. As W decreases, the gradient of linear complexity decreases. (2) Section 5 studies content selection to move an operating point to the left.

Figure

Figure 8: Example of LoBART’s encoder-decoder attention evaluated on arXiv test set.

Figure

Figure 3: Modified architecture with model-based approximator where the base model is BART/LoBART. Model-based neural approximator is shown in orange.

Figure

Figure 6: Example of BART’s encoder-decoder attention evaluated on XSum test set.

Figure

Figure 4: Comparison of summarization metrics. Support sentences are marked in the same color as their corresponding facets. SCUs have to be annotated for each extracted summary during evaluation, while facet-aware evaluation can be conducted automatically by comparing sentence indices.

Figure

Figure 5: The first three figures show the ground-truth and estimated FAR scores via human-annotated FAMs and machine-created FAMs. The fourth figure shows the fitting of linear regression on the human-annotated samples (LR-Small) and the prediction on the whole test set of CNN/Daily Mail (LR-Large). Systems are sorted in an ascending order by the ground-truth FAR on the human-annotated samples.

Figure

Figure 2: Performance of extractive methods under ROUGE, FAR, and SAR. The results under ROUGE-1/2/L often disagree with each other. UnifiedSum(E) generally performs the best in the facet-aware evaluation.

Figure

Figure 3: Comparison of extractive methods under FAR and SAR reflects their capability of extracting salient and non-redundant sentences.

Figure

Figure 1: An illustration of facet-aware evaluation. Two of three support groups of facet 1 (r1) are covered. Facet 2 (r2) cannot be covered as document sentence 4 (d4) is missing in the extracted summary. The illustration corresponds to the example in Sec. 3.1.
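
A minimal sketch of this coverage logic as read from the illustration; the function name and the toy example are illustrative, not taken from the paper.

```python
def facet_recall(facets, extracted_indices):
    """facets: list of facets, each a list of support groups (sets of document sentence indices).
       extracted_indices: set of sentence indices in the extracted summary."""
    covered = sum(
        any(group <= extracted_indices for group in support_groups)   # any support group fully present
        for support_groups in facets
    )
    return covered / len(facets)

# toy example: facet 1 has two support groups, facet 2 needs a sentence that was not extracted
facets = [[{0}, {1, 2}], [{3}]]
print(facet_recall(facets, extracted_indices={0, 1, 2}))   # 0.5 -- facet 2 is not covered
```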

Figure

Figure 2: Visualization of the learned node embeddings in testing at each epoch. Red nodes are words (light) and sentences (heavy) in summary labels, while blue nodes are words (light) and sentences (heavy) in non-summaries. Purple nodes are words shared by sentences between summaries and non-summaries.

Figure

Figure 1: The overview architecture of the MuchSUM with three specific graph convolutional channels and a common convolutional channel shared by the three graph channels. We denote the three specific channels as the Node Lexical Feature Encoding Channel (A, X_s), the Node Centrality Feature Encoding Channel (A, X_c) and the Node Position Feature Encoding Channel (A, X_p). In the bipartite word-sentence heterogeneous graph, each sentence node (solid node) is connected to its contained word-related nodes (hollow nodes) and takes the weight of the relation as their edge feature. Different thicknesses of edges represent different edge weights.

Figure

Figure 1: (a) BERTSUMEXTABS model. An encoder encodes the document, and a word generator generates the next word given previous words, while paying attention to the document. (b) Sentence planner model. A shared encoder separately encodes the document and each sentence of the summary generated so far. The sentence generator takes the summary sentence embeddings and predicts the next sentence embedding, which the word generator is then conditioned on. Both generators integrate document information through attention.

Figure

Figure 2: Scatter plots of ROUGE scores and support scores: the X-axis presents the ROUGE-1 score between system and reference headlines, and the Y-axis presents the support score (the same as in Figure 1).

Figure

Figure 4: The distribution of the support scores on JAMUL.

Figure

Figure 5: Guideline for entailment labeling

Figure

Figure 3: The distribution of the support scores on the English Gigaword dataset.

Figure

Figure 1: Histogram of support scores (recall-oriented ROUGE-1 scores between generated headlines and their source documents).

Figure

Figure 6: Examples of the improved headlines.

Figure

Figure 3: Sample of question-answer pairs generated from hallucinated summaries that are correctly answered by their source articles. Highlighted spans in the summaries are marked as extrinsic or intrinsic hallucinations by our annotators.

Figure

Figure 2: Human assessment of a system-generated summary for the article in Figure 1. The annotation user interface is shown as it was presented to raters.

Figure

Figure 1: Hallucinations in extreme document summarization: the abbreviated article, its gold summary and the abstractive model generated summaries (PTGEN, See et al. 2017; TCONVS2S, Narayan et al. 2018a; and, GPTTUNED, TRANS2S and BERTS2S, Rothe et al. 2020) for a news article from the extreme summarization dataset (Narayan et al., 2018a). The dataset and the abstractive models are described in Section 3 and 4. We also present the [ROUGE-1, ROUGE-2, ROUGE-L] F1 scores relative to the reference gold summary. Words in red correspond to hallucinated information whilst words in blue correspond to faithful information.

Figure

Figure 1: Summaries produced by our model. For illustration, the compressive summary shows the removed spans struck through.

Figure

Figure 2: Illustration of our summarization system. The model extracts the most relevant sentences from the document by taking into account the WordEncoder representation of the current sentence e(s_i), the SentEncoder representation of the previous sentence h^s_i, the current summary state representation o^s_i, and the representation of the document e(D). If a sentence is selected (z_i = 1), its representation is fed to SentStates, and we move to the next sentence. Here, sentences s_1 and s_3 were selected. If the model is also compressing, the compressive layer selects words for the final summary (Compressive Decoder). See Figure 3 for details on the decoders.

Figure

Figure 5: Example output summaries on the CNN/DailyMail dataset, gold standard summary, and corresponding questions. The questions are manually written using the GOLD summary. The same EXCONSUMM summaries are shown in Figure 1, but the strikethrough spans are now removed.

Figure

Figure 3: Decision decoder architecture. The decoder contains an extractive level for sentences (orange box) and a compressive level for words (dashed gray box), using an LSTM to model the summary state. Red diamond shapes represent decision variables: z_i = 1 if p(z_i | p_i) > 0.5, selecting the sentence s_i, and z_i = 0 if p(z_i | p_i) ≤ 0.5, skipping this sentence. The same applies to y_ij, with p(y_ij | q_ij) > 0.5 deciding which words w_ij to keep in the summary.

Figure

Figure 4: Word distribution in comparison with the human summaries for CNN dataset. Density curves show the length distributions of human authored and system produced summaries.

Figure

Figure 2: Oracle sentence distribution over a paper. X-axis: 10,000 papers sampled from FacetSum, sorted by full text length from long to short; y-axis: normalized position in a paper. We provide each sub-figure’s density histogram on their right.

Figure

Figure 1: Editorial Network (EditNet)

Figure

Figure 2: An example mixed summary (annotated with the editor’s decisions) taken from the CNN/DM dataset

Figure

Figure 2: HOLMS: structure and value range (3D Gaussian peaks and spreads are both set to 1).

Figure

Figure 1: Illustration of the area under curve representing the HOLMS value.

Figure

Figure 3: Hierarchical encoder with hierarchical attention: the attention weights at the word level, represented by the dashed arrows are re-scaled by the corresponding sentencelevel attention weights, represented by the dotted arrows. The dashed boxes at the bottom of the top layer RNN represent sentence-level positional embeddings concatenated to the corresponding hidden states.

Figure

Figure 1: Feature-rich-encoder: We use one embedding vector each for POS, NER tags and discretized TF and IDF values, which are concatenated together with word-based embeddings as input to the encoder.

Figure

Figure 2: Switching generator/pointer model: When the switch shows ’G’, the traditional generator consisting of the softmax layer is used to produce a word, and when it shows ’P’, the pointer network is activated to copy the word from one of the source document positions. When the pointer is activated, the embedding from the source is used as input for the next time-step as shown by the arrow from the encoder to the decoder at the bottom.

Figure

Figure 1: SummaRuNNer: A two-layer RNN based sequence classifier: the bottom layer operates at word level within each sentence, while the top layer runs over sentences. Double-pointed arrows indicate a bi-directional RNN. The top layer with 1’s and 0’s is the sigmoid activation based classification layer that decides whether or not each sentence belongs to the summary. The decision at each sentence depends on the content richness of the sentence, its salience with respect to the document, its novelty with respect to the accumulated summary representation and other positional features.

Figure

Figure 2: Visualization of SummaRuNNer output on a representative document. Each row is a sentence in the document, while the shading-color intensity is proportional to its probability of being in the summary, as estimated by the RNN-based sequence classifier. In the columns are the normalized scores from each of the abstract features in Eqn. (6) as well as the final prediction probability (last column). Sentence 2 is estimated to be the most salient, while the longest one, sentence 4, is considered the most content-rich, and not surprisingly, the first sentence the most novel. The third sentence gets the best position based score.

Figure

Figure 7: Human evaluation instruction screenshots.

Figure

Figure 2: QAGen model: for an input text (p), it generates a question (q) followed by an answer (a).

Figure

Figure 4: Correlation between QUALS and QAGS on XSUM (left) and CNNDM (right). The average QAGS tend to increase with the increase in QUALS. The standard deviation of the QAGS for each bin is about 0.187 for XSUM and 0.127 for CNNDM.

Figure

Figure 6: Human evaluation interface using Amazon Sagemaker Ground Truth.

Figure

Figure 5: Negative log likelihood per subword token on two q-a pairs from the QAGen model according to the summary (blue) and the input document (orange). Higher values mean less likely. The first q-a pair (top figure) has a much higher average negative log likelihood according to the input document than according to the summary.

Figure

Figure 3: Correlation between QUALS and QAGS on XSUM (left) and CNNDM (right). The average QAGS tend to increase with the increase in QUALS.

Figure

Figure 1: Comparison between QAGS (top) and QUALS (bottom) protocols. QUALS uses only one QAGen model instead of the AE, QG and QA models used in QAGS.

Figure

Figure 2: AUTOSUMM-CREATE.

Figure

Figure 1: Overview of the proposed approach.

Figure

Figure 10: Model created with NAS module in AUTOSUMM-CREATE, as visualised through tensorboard

Figure

Figure 4: Efficiency comparison for the extractive summarization models on the CNN/DM dataset

Figure

Figure 9: Variation of performance with the increase in KD proportion on CNN DM dataset

Figure

Figure 5: Cell distribution across varying layer size

Figure

Figure 7: Cross-Data experiments

Figure

Figure 6: Layer distribution for XSUM and Contract

Figure

Figure 3: AUTOSUMM-DISTILL.

Figure

Figure 8: Performance vs Training data variation

Figure

Figure 1: An abridged example from our extreme summarization dataset showing the document and its one-line summary. Document content present in the summary is color-coded.

Figure

Figure 1: Extractive summarization model with reinforcement learning: a hierarchical encoder-decoder model ranks sentences for their extract-worthiness and a candidate summary is assembled from the top ranked sentences; the REWARD generator compares the candidate against the gold summary to give a reward which is used in the REINFORCE algorithm (Williams, 1992) to update the model.

Figure

Figure 2: Summaries produced by the LEAD baseline, the abstractive system of See et al. (2017) and REFRESH for a CNN (test) article. GOLD presents the human-authored summary; the bottom block shows manually written questions using the gold summary and their answers in parentheses.

Figure

Figure 2: Topic-conditioned convolutional model for extreme summarization.

Figure

Figure 3: Length distributions in ETCSum summaries on the CNN/DailyMail test set.

Figure

Figure 2: Stepwise HiBERT (left) and ETCSum (right) models. HiBERT builds summary informed representation by jointly modeling partially generated summary and the document during document encoding, while ETCSum takes as input the document appended with the partially generated summary.

Figure

Figure 1: Pretraining and finetuning for abstractive summarization with entity chains.

Figure

Figure 10: Instructions for human evaluations for overall quality of summaries.

Figure

Figure 8: CNN/DailyMail example predictions from FROST and CTRLSum along with their entity prompts and keywords, respectively.

Figure

Figure 6: Example XSum predictions for models presented in Tables 3 and 4. We highlight entities in orange that are not faithful to the input document. Entities in green are faithful to the input document.

Figure

Figure 9: Instructions for human evaluations for faithfulness.

Figure

Figure 3: Sentence-level vs summary-level entity chains. We report summary-level ROUGE-L (RL-Sum), entity chain-level ROUGE-2 (R2-EPlan), and ENTF1 on the CNN/DailyMail validation set. Similar observations were made for other measures.

Figure

Figure 4: Finetuning results on the XSum validation set using one of the base-sized pretrained models: PEGASUS, FROST(F), and FROST(P+F). All pretrained models were trained for 1.5m steps. See the text for more details. We only report on a subset of measures; similar observations were made for other measures.

Figure

Figure 5: Finetuning results on the XSum (in blue) and CNN/DailyMail (in red) validation sets at various steps during pretraining FROST-Large. Instead of pretraining from scratch, we start with a PEGASUS-Large checkpoint, and continue pretraining for additional 1.5m steps with the planning objective. We report finetuning results for the PEGASUS finetuned baseline and our models at 0.1m, 1m, and 1.5m steps.

Figure

Figure 7: An example of generating summaries with topical and style diversity using modified entity prompts cmod on XSum.

Figure

Figure 2: An example of sentence-level and summary-level entity chains along with the reference summary.

Figure

Figure 3: Proposed model for Query based Abstractive Summarization with (i) query encoder (ii) document encoder (iii) query attention model (iv) diversity based document attention model and (v) decoder. The green and red arrows show the connections for timestep 3 of the decoder.

Figure

Figure 1: The sentence position and length of extracted summaries

Figure

Figure 3: Generation of context states and class-based representations by text representation component.

Figure

Figure 1: Graphical representation of the model

Figure

Figure 2: Generation of the latent code zi for a review ri by the encoder. Yellow boxes represent the neural networks that compute the prior and variational posterior distributions of latent codes.

Figure

Figure 4: Fewshot learning: ROUGE-1 with the mean and standard deviation over 5 runs.

Figure

Figure 1: The overview of our framework, in which the backbone is in charge of generating two summaries for a document. The oracle then selects which summary is better for a given document. The reward model afterward transforms the oracle’s preference into a discrete signal to optimize the backbone. Our framework contains two novel components: efficient sampling from offline data and the preference-guided reward model.

Figure

Figure 9: Fewshot learning: ROUGE-2 and ROUGE-L score with the mean and std over 5 runs

Figure

Figure 5: Online learning on Reddit TIFU: ROUGE-1 with the mean and standard deviation over 5 runs.

Figure

Figure 8: Active learning: ROUGE-2 and ROUGE-L score with the mean and std over 5 runs

Figure

Figure 7: Online learning on RedditTIFU dataset: ROUGE-2 and ROUGE-L score with the mean and std over 5 runs

Figure

Figure 2: Reward model: (a) Accuracy of 3 models on all sets (b) The ROUGE-1 score of our ROMSR.

Figure

Figure 6: Ablation study: (a) Performance with different k values; (b) Quality of selected samples; (c) Semantic similarity between online documents and offline documents.

Figure

Figure 3: Active learning: ROUGE-1 with the mean and standard deviation over 5 runs.

Figure

Figure 8: ROUGE-L box plot for all candidate summary sets generated with LkO input perturbation method for k = 1, .., 5.

Figure

Figure 5: Length box plot for all candidate summary sets generated with LkO input perturbation method for k = 1, .., 5.

Figure

Figure 4: Confusion matrix for the Coherence Pairwise Classifier.

Figure

Figure 6: ROUGE-1 box plot for all candidate summary sets generated with LkO input perturbation method for k = 1, .., 5.

Figure

Figure 10: Confusion matrix for the Fluency Pairwise Classifier.

Figure

Figure 2: IP-SPR-2 scores (measuring IP-Diversity) box plot, for all pairs of candidate summaries generated with LkO input perturbation method for k = 1, ..., 5.

Figure

Figure 9: SPR-L box plot for all pairs of candidate summaries generated with LkO input perturbation method for k = 1, .., 5.

Figure

Figure 7: SPR-1 box plot for all pairs of candidate summaries generated with LkO input perturbation method for k = 1, .., 5.

Figure

Figure 3: ROUGE-2 F1 scores box plot, for all candidate summary sets generated with LkO input perturbation method for k = 1, ..., 5.

Figure

Figure 1: A diagram of the PASS components, with an example for a collection of reviews of size d = 4, k = 1.
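
A minimal sketch, assuming the LkO (leave-k-out) perturbation referenced in the neighbouring figures removes every size-k subset of the d input reviews and summarizes each perturbed collection with a base summarizer (passed in here as a placeholder callback).

```python
from itertools import combinations

def leave_k_out_candidates(reviews, k, summarize):
    """reviews: list of d review strings; summarize: any text -> summary function (assumed)."""
    candidates = []
    for left_out in combinations(range(len(reviews)), k):
        kept = [r for i, r in enumerate(reviews) if i not in left_out]
        candidates.append(summarize(' '.join(kept)))
    return candidates   # d-choose-k candidate summaries, e.g. 4 candidates for d=4, k=1
```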

Figure

Figure 1: Histogram of the position of sentences selected by our method and PacSum on CNN/DM. PacSum uses position information which allows it to take advantage of the lead bias. In contrast, our method is position-agnostic but still captures the fact that earlier sentences are more important in news articles.

Figure

Figure 3: Correlation between metrics and human judgement on subsets of data. The x- and y-axes represent the human judgement and the metric scores, respectively. The red line is a linear regression fitted on the full data. Each dotted line is a linear regression fitted on a model-dataset subset. Each colored point has coordinates equal to the average factuality judgement and metric score for its corresponding partition.

Figure

Figure 10: Article web pages are provided.

Figure

Figure 2: Proportion of summaries with factual errors based on collected annotations, with breakdown of the categories of errors within. Full specification of categories of errors in Table 1.

Figure

Figure 1: We propose a linguistically grounded typology of factual errors. We select crowd workers to annotate summaries from two datasets according to this typology, achieving near-perfect agreement with experts. We collect FRANK, the resulting dataset, to benchmark factuality metrics and state-of-the-art summarization systems.

Figure

Figure 8: The sentences being annotated is highlighted in yellow. Relevant text is underlined in the article plain text.

Figure

Figure 5: Variation in partial Pearson correlation when omitting error types. Higher variation indicates greater influence of an error type in the overall correlation.

Figure

Figure 11: Entity question to ensure annotators read the text.

Figure

Figure 4: Partial Pearson correlation on different partitions of the data. Entailment metrics have highest correlation on pretrained models in the CNN/DM dataset. Their performance degrades significantly on XSum.

Figure

Figure 7: Instructions can be toggled.

Figure

Figure 9: After selecting that the sentence is not factual annotators choose the category of error.

Figure

Figure 6: Confusion matrix of different types of errors. The entry at row i, column j corresponds to the frequency of annotations that have Fi as the majority class and for which the disagreeing annotator selected Fj.

Figure

Figure 1: Example of the QFS dataset and a constructed graph. Query nodes are denoted by blue circles, and document nodes by yellow circles. Root words in red letters indicate important words in a query and each sentence. The nodes in the purple dotted rectangle are especially important to generate a summary.

Figure

Figure 2: Overview of proposed QSG Transformer.

Figure

Figure 1: Comparing the uni-, bi-, and tri-gram novelty for the medium sized datasets. These datasets contain generated sequences up to 128 tokens in length. The methods are as follows: NLL (baseline), RwB-Hinge, RISK2, and RISK-3. The unique average n-gram novelty (n-grams that do not appear in the source text) is shown to increase across the board compared to the standard NLL baseline.

Figure

Figure 2: Comparison of each method for the full-data approach over a medium size dataset (CNN/DM). The methods are as follows: NLL (baseline), RwB-Hinge, RISK-2, and RISK-3. We see that the reinforcement learning approaches have led, on average, to higher ROUGE-L scores for the longer summaries compared to the NLL baseline.

Figure

Figure 2: Abstract from PLOS Medicine, entity grid, bipartite entity graph

Figure

Figure 3: One mode projection of the bipartite graph

Figure

Figure 1: Control flow of our summarization method

Figure

Figure 1: Abstract from PLOS Medicine, topical grid, bipartite topical graph, one-mode projection

Figure

Figure 2: (i) A sample text from PLOS Medicine; (ii) entity graph; (iii) projection graph of the text.

Figure

Figure 3: (i) A projection graph; (ii) several instances of a coherence pattern in Figure 1, ii.

Figure

Figure 1: (i) A sample of mined coherence patterns from abstracts; nodes are sentences and edges are entity connections; (ii) Sentences S1, S3 and S5 constitute the pattern in an input document.

Figure

Figure 4: An illustration of mapping variables to overlay graph g with coherence pattern patu.

Figure

Figure 2: Overview of our saliency predictor model.

Figure

Figure 1: Our sequence generator with RL training.

Figure

Figure 1: Illustration of the encoder and decoder attention functions combined. The two context vectors (marked “C”) are computed from attending over the encoder hidden states and decoder hidden states. Using these two contexts and the current decoder hidden state (“H”), a new word is generated and added to the output sequence.

Figure

Figure 2: Cumulative ROUGE-1 relative improvement obtained by adding intra-attention to the ML model on the CNN/Daily Mail dataset.

Figure

Figure 1: The blue distribution represents the score distribution of summaries available in the human judgment datasets of TAC-2008 and TAC-2009. The red distribution is the score distribution of summaries generated by modern systems. The green distribution corresponds to the score distribution of summaries we generated in this work, as described in Section 3.

Figure

Figure 2: Example of PD K in comparison to the word distribution of reference summaries for one topic of TAC-2008 (D0803).

Figure

Figure 2: Percentage of disagreement between metrics for increasing scores of summary pairs (Scores have been normalized).

Figure

Figure 4: Pairwise correlation between evaluation metrics on various scoring range. The generated dataset uses the topics from TAC-2008 and TAC-2009. The human judgments are the ones available as part of TAC-2008 and TAC-2009.

Figure

Figure 3: The x-axis is the normalized average score of s, given by (1/n) Σ_i m_i(s), after the metrics have been normalized between 0 and 1. The y-axis shows F_N associated with the sampled summary s. We also report the average performance of current systems.

Figure

Figure 1: Figure 1a represents an example distribution of sources, Figure 1b an example distribution of background knowledge, and Figure 1c the resulting target distribution that summaries should approximate.

Figure

Figure 2: Visualization of the effectiveness of using passage nodes to enhance sentence representations. The degree of highlighting expresses the important role of the passage in the document. Underlined sentences are model-selected summaries. As a result, the selected sentences belong to passages that have high scores of α (Equation 8).

Figure

Figure 1: Overview of HeterGraphLongSum model. Passages of each document are defined as a set of sentences in sequence with a fixed number of sentences. In this architecture, the edges from passage to word and sentence to passage are not taken into account because of the redundancy.

Figure

Figure 2: n-gram overlaps between the abstracts generated by different models and the input article on the arXiv dataset. We show in detail which part of the input was copied for our TLM conditioned on intro + extract.

Figure

Figure 1: Our approach for abstractive summarization of a scientific article. An older version of this paper is shown as the reference document. First, a sentence pointer network extracts important sentences from the paper. Next, these sentences are provided along with the whole scientific article to be arranged in the following order: Introduction, extracted Sentences, abstract & the rest of the paper. A transformer language model is trained on articles organized in this format. During inference, the introduction and the extracted sentences are given to the language model as context to generate a summary. In domains like news and patent documents, the introduction can be replaced by the entire document.

Figure

Figure 3: t-SNE visualization of the TLM-learned word embeddings. The model appears to partition the space based on the broad paper category in which it frequently occurs.

Figure

Figure 1: Joint distribution of different classes. For a pair of classes c_i and c_j, the value in a cell is n_ij × 100 / min{|c_i|, |c_j|}, where n_ij is the number of tweets that have been labeled with both c_i and c_j.
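
The cell statistic can be reproduced directly from multi-label annotations; the sketch below is illustrative, and the function name and data layout are assumptions.

```python
def joint_distribution(tweet_labels, classes):
    """tweet_labels: one set of class labels per tweet; classes: list of class names."""
    size = {c: sum(c in labels for labels in tweet_labels) for c in classes}
    matrix = {}
    for ci in classes:
        for cj in classes:
            # number of tweets labeled with both classes
            n_ij = sum(ci in labels and cj in labels for labels in tweet_labels)
            matrix[(ci, cj)] = 100.0 * n_ij / (min(size[ci], size[cj]) or 1)
    return matrix
```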

Figure

Figure 3: Frequency distribution of tweets corresponding to different concerns over time.

Figure

Figure 6: Top 20 frequent entities in CORD-SUM vocabulary.

Figure

Figure 5: Summary sentences distributions of models.

Figure

Figure 2: The model contains three main modules: 1) Local Encoder: is composed of an Entity Encoder and a Sentence Encoder, the embeddings of entities and sentences are the initial features of graph nodes; 2) Heterogeneous Graph Encoder: an iteratively computed graph with FacetWeight; and 3) Extraction & Postprocess: ranks sentences while minimizing redundancy with Trigram Blocking.

Figure

Figure 3: Heat map of five section categories.

Figure

Figure 1: An example in our CORD-SUM dataset. Texts highlighted with different colors denote different facets of the summary.

Figure

Figure 7: Top 20 frequent entities in ArXiv vocabulary.

Figure

Figure 4: Oracle sentence distributions over a paper.

Figure

Figure 1: The complete pipeline of the proposed method. In the first step, we split the input text into sentences by using a regular expression handcrafted specifically for scientific documents. In the second step, we compute the sentence embeddings of the parsed sentences using SBERT. In the third step, we create a graph by comparing all the pairs of sentence embeddings obtained using cosine similarity. In the fourth step, we rank the sentences by the degree centrality in the generated graph. In the fifth and final step, we only keep a certain number of sentences or words to adjust to the length requirements of the summary.

Figure

Figure 2: The process of graph generation and ranking of the sentences. Every node in the generated complete graph represents a sentence in the document, and the weight of each edge is given by the similarity between the nodes it connects. The importance of a sentence in the document is modelled as rank(s_i) = Σ_{j=1}^{n} (1 − sim(e_i, e_j)), where e_i and e_j are the corresponding SBERT sentence embeddings of s_i and s_j.
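
A minimal sketch of this ranking step as stated in the caption, assuming the SBERT sentence embeddings have already been computed (e.g., with the sentence-transformers package); the function name is illustrative.

```python
import numpy as np

def rank_sentences(embeddings):
    """embeddings: array of shape (n, d), one SBERT embedding per sentence."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = e @ e.T                                  # cosine similarity between all sentence pairs
    scores = np.sum(1.0 - sim, axis=1)             # rank(s_i) = sum_j (1 - sim(e_i, e_j))
    return np.argsort(-scores)                     # sentence indices, highest-ranked first
```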

Figure

Figure 7: ROUGE-1 on CNN/DM for k sampled candidates at inference time, with k ∈ {1, . . . , 15}. SR stands for SummaReranker, BS and DBS refer to beam search and diverse beam search, respectively.

Figure

Figure 4: Example of a summary generated by SummaReranker trained for {R-1, R-2, R-L} on CNN/DM. The sentence in green is included in the SummaReranker summary, while the one in red is discarded.

Figure

Figure 2: Expert utilization for a base PEGASUS with SummaReranker optimized with {R-1, R-2, R-L, BS, BaS} on CNN/DM, with 10 experts.

Figure

Figure 8: ROUGE-1 on XSum for k sampled candidates at inference time, with k ∈ {1, . . . , 15}. SR stands for SummaReranker, BS and DBS refer to beam search and diverse beam search, respectively.

Figure

Figure 5: Human evaluation results on all three datasets. Black vertical bars are standard deviation across human raters.

Figure

Figure 3: Best summary candidate recall with 15 diverse beam search candidates for PEGASUS on all three datasets. SR denotes SummaReranker. Dotted lines are random baselines, and dashed lines correspond to the base PEGASUS.

Figure

Figure 1: SummaReranker model architecture, optimizing N metrics. The summarization metrics here (ROUGE-1, ROUGE-2, ..., BARTScore) are displayed as examples.

Figure

Figure 6: Novel n-grams with PEGASUS, across all datasets and with beam search and diverse beam search.

Figure

Figure 9: ROUGE-1 on Reddit TIFU for k sampled candidates at inference time, with k ∈ {1, . . . , 15}. SR stands for SummaReranker, BS and DBS refer to beam search and diverse beam search, respectively.

Figure

Figure 1: Experimental Upper bounds of our sentence regression framework and existing sentence regression framework.

Figure

Figure 2: Human evaluation scores of the summaries averaged across the test set for the summarization models (sorted by EGFB). See §A.1 for details.

Figure

Figure 1: Distribution of system ranks with bootstrapped sampling showing significant differences between systems

Figure

Figure 4: Pairwise Kendall’s Tau correlations for all automatically computed features.

Figure

Figure 3: Kendall’s Tau correlation between automatically computed features and experts’ evaluations of EGFB values and the 8 boolean attributes described in Section §2.3 (converted to numerical judgments).

Figure

Figure 6: Examples of graph-based meaning representations parsed from sentences of documents and generated summaries.

Figure

Figure 1: Example of (a) a document, (b) a summary, and (c) the corresponding document and (d) summary graph-based meaning representations. The summary graph does not contain the "consider" node, indicating a factual error (red dashed edge).

Figure

Figure 5: (a) AMR and (b) dependency representations for the summary “police have appealed for help in tracing a woman who has been missing for six years.”

Figure

Figure 3: Variation in partial Pearson correlation when omitting error types. Higher variation indicates greater influence of an error type in the overall correlation.

Figure

Figure 4: An example of a document, its generated summary and factuality predictions for word pairs, based on the dependency graph (DAE) versus AMR graph (FACTGRAPH-E). +/− means the predicted label for that edge.

Figure

Figure 2: Overview of FACTGRAPH. Sentence-level summary and document graphs are encoded by the graph encoder with structure-aware adapters. Text and graph encoders use the same pretrained model, and only the adapter parameters are trained.

Figure

Figure 1: Model generated and reference summaries used for human evaluation. Words in orange correspond to incorrect or repeated information.

Figure

Figure 1: Comparison of how models adapt to target lengths from zero-shot to low-resource cases. We plot the average summary lengths for different models. We report results on XSum, similar patterns were found on CNN/DailyMail and SAMSum.

Figure

Figure 1: Architecture of the HiStruct+ model. The model consists of a base TLM for sentence encoding and two stacked inter-sentence Transformer layers for hierarchical contextual learning with a sigmoid classifier for extractive summarization. The two blocks shaded in light-green are the HiStruct injection components.

Figure

Figure 3: Two samples for human evaluation and case analysis of the extractive summaries predicted by the HiStruct+ model and the baseline model, in comparison with the gold summary (i.e., the abstract of the paper). The first sample is selected from the arXiv dataset, while the second sample is from PubMed. Top-7 sentences with the highest predicted scores are extracted, and then combined in their original order to construct a final summary. Their linear indices within the original document are shown in the second row of each table. The texts highlighted in yellow are the key words and the main content that appear in the gold summary. The phrases highlighted in green indicate typical parts of a scientific paper such as summary and future work. Sentences are split by ’<q>’.

Figure

Figure 2: Proportions of the extracted sentences at each linear position. The x-axis values are linear sentence indices, the y-axis values are percentages of the extracted sentences. In this figure, only the first 25 sentence indices are included due to space limitation.

Figure

Figure 3: (a) A network diagram for the NNLM decoder with additional encoder element. (b) A network diagram for the attention-based encoder enc3.

Figure

Figure 2: Example input sentence and the generated summary. The score of generating y_{i+1} (terrorism) is based on the context y_c (for . . . against) as well as the input x_1 . . . x_18. Note that the summary generated is abstractive, which makes it possible to generalize (russian defense minister to russia) and paraphrase (for combating to against), in addition to compressing (dropping the creation of); see Jing (2002) for a survey of these editing operations.

Figure

Figure 1: Example output of the attention-based summarization (ABS) system. The heatmap represents a soft alignment between the input (right) and the generated summary (top). The columns represent the distribution over the input after generating each word.

Figure

Figure 4: Example sentence summaries produced on Gigaword. I is the input, A is ABS, and G is the true headline.

Figure

Figure 1: Task template for our user study.

Figure

Figure 2: Example QBS for topic Airport Security

Figure

Figure 1: Aggregated human preference judgements across the same 400 instances measured in Table 3. The blue bars show preferences, the red bars show no preference.

Figure

Figure 1: Stemmed word frequencies for reference summary set d30001t from duc04: averaged across all reference summaries and for single reference summaries.

Figure

Figure 3: Orange crosses show the objective score optimized by exhaustive search minus the objective score optimized by FCHC. Blue pluses show the ROUGE-L difference between exhaustive search and FCHC. Plotted for the 1135 instances in the headline generation test set, where the source sentence has 30 words or fewer.

Figure

Figure 2: ROUGE F1 scores on the test set of headline generation for Lead-N and Lead-P baselines with different number n and percentage p of leading words.

Figure

Figure 1: Summarizing a sentence x by hill climbing. Each row is a Boolean vector a_t at search step t. A black cell indicates that a word is selected, and vice versa. Randomly swapping two values in the Boolean vector yields a new summary that is scored by an objective function measuring language fluency and semantic similarity. If the new summary increases the objective, it is accepted as the current best solution. Rejected solutions are not depicted.
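
A minimal sketch of this accept-if-better search; the objective combining fluency and semantic similarity is left as a caller-supplied function, and the step budget and function names are arbitrary choices, not the paper’s exact procedure.

```python
import random

def hill_climb(words, objective, summary_len, steps=1000, seed=0):
    rng = random.Random(seed)
    mask = [True] * summary_len + [False] * (len(words) - summary_len)
    rng.shuffle(mask)                                 # random initial Boolean vector
    best = objective([w for w, m in zip(words, mask) if m])
    for _ in range(steps):
        i, j = rng.sample(range(len(words)), 2)
        if mask[i] == mask[j]:
            continue                                  # swapping equal values changes nothing
        mask[i], mask[j] = mask[j], mask[i]           # propose a swap of two entries
        score = objective([w for w, m in zip(words, mask) if m])
        if score > best:
            best = score                              # accept the improved summary
        else:
            mask[i], mask[j] = mask[j], mask[i]       # reject and revert
    return [w for w, m in zip(words, mask) if m], best
```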

Figure

Figure 4: Positional bias for different systems, calculated for the headline generation test set. The source sentence is divided into four areas: 0–25%, 25–50%, 50–75%, and 75-100% of the sentence. The y-axis shows the normalized frequency of how often a word in the summary is extracted from one of the four source sentence areas.

Figure

Figure 4: Pearson correlation with humans on SummEval w.r.t. the QG beam size.

Figure

Figure 2: Variation of the Pearson correlations between various metrics and humans, versus the number of references available. QUESTEVAL is constant, since it is independent from the references.

Figure

Figure 3: Distribution of the log probabilities of answerability – i.e. log(1 − QA( |T, q)) – for two QA models. 1) solid lines: a model trained on SQuADv2 without the negative sampled examples. 2) dashed lines: a model trained on SQuAD-v2 with the negative sampled examples. The evaluated samples belong to three distinct categories: 1) answerable, 2) unanswerable questions (but present in SQuAD-v2) and 3) the negatively sampled ones (as described in §5.1).

Figure

Figure 1: Illustration of the QUESTEVAL framework: the blue area corresponds to the precision-oriented framework proposed by Wang et al. (2020). The orange area corresponds to the recall-oriented SummaQA (Scialom et al., 2019). We extend it with a weighter component for an improved recall (red area). The encompassing area corresponds to our proposed unified approach, QUESTEVAL.

Figure

Figure 4: Coverage eliminates undesirable repetition. Summaries from our non-coverage model contain many duplicated n-grams while our coverage model produces a similar number as the reference summaries.

Figure

Figure 1: Comparison of output of 3 abstractive summarization models on a news article. The baseline model makes factual errors, a nonsensical sentence and struggles with OOV words muhammadu buhari. The pointer-generator model is accurate but repeats itself. Coverage eliminates repetition. The final summary is composed from several fragments.

Figure

Figure 3: Pointer-generator model. For each decoder timestep a generation probability pgen ∈ [0,1] is calculated, which weights the probability of generating words from the vocabulary, versus copying words from the source text. The vocabulary distribution and the attention distribution are weighted and summed to obtain the final distribution, from which we make our prediction. Note that out-of-vocabulary article words such as 2-0 are included in the final distribution. Best viewed in color.
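
The weighted sum described here can be written out in a few lines; the extended-vocabulary indexing of out-of-vocabulary source words and the function name are assumptions about the layout, a sketch rather than the paper’s implementation.

```python
import numpy as np

def final_distribution(p_gen, vocab_dist, attention, source_ids, extended_vocab_size):
    """vocab_dist: probabilities over the fixed vocabulary; attention: weights over source
       positions; source_ids: extended-vocab id of each source token (OOVs get extra ids)."""
    final = np.zeros(extended_vocab_size)
    final[:len(vocab_dist)] = p_gen * vocab_dist            # generation part
    for pos, token_id in enumerate(source_ids):
        final[token_id] += (1.0 - p_gen) * attention[pos]   # copy part, incl. OOV source words
    return final
```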

Figure

Figure 6: Although our best model is abstractive, it does not produce novel n-grams (i.e., n-grams that don’t appear in the source text) as often as the reference summaries. The baseline model produces more novel n-grams, but many of these are erroneous (see section 7.2).

Figure

Figure 9: The baseline model incorrectly substitutes dutch for new zealand (perhaps reflecting the European bias of the dataset), fabricates irish, and struggles with out-of-vocabulary words saili and aucklandbased. Though it is not clear why, the phrase addition to our backline is changed to the nonsensical addition to their respective prospects. The pointer-generator model fixes these accuracy problems, and the addition of coverage fixes the repetition problem. Note that the final model skips over large passages of text to produce shorter sentences.

Figure

Figure 13: The baseline model appropriately replaces stumped with novel word mystified. However, the reference summary chooses flummoxed (also novel) so the choice of mystified is not rewarded by the ROUGE metric. The baseline model also incorrectly substitutes 600,000 for 25. In the final model’s output we observe that the generation probability is largest at the beginning of sentences (especially the first verb) and on periods.

Figure

Figure 8: The baseline model reports the wrong score 6-3, substitutes bedene for thiem and struggles with the uncommon word assimilation. The pointer-network models accurately reproduce the outof-vocabulary words thiem and aljaz. Note that the final model produces the novel word defeated to incorporate several fragments into a single sentence.

Figure

Figure 10: In this example, both our baseline model and final model produce a completely abstractive first sentence, using a novel word beat.

Figure

Figure 15: The baseline model fabricates a completely false detail about a u.n. peacekeeping force that is not mentioned in the article. This is most likely inspired by a connection between U.N. peacekeeping forces and northern sinai in the training data. The pointer-generator model is more accurate, correctly reporting the reshuffle of several senior military positions.

Figure

Figure 14: The baseline model incorrectly changes thwart criminals and others contributing to nigeria’s instability to destabilize nigeria’s economy – which has a mostly opposite meaning. It also produces a nonsensical sentence. Note that our final model produces the novel word says to paraphrase told cnn ‘s christiane amanpour.

Figure

Figure 5: Examples of highly abstractive reference summaries (bold denotes novel words).

Figure

Figure 12: Baseline model replaces cecily strong with mariah carey, and produces generally nonsensical output. The baseline model may be struggling with the out-of-vocabulary word beetlejuice, or perhaps the unusual non-news format of the article. Note that the final model omits – ever so slightly – from its first sentence.

Figure

Figure 2: Baseline sequence-to-sequence model with attention. The model may attend to relevant words in the source text to generate novel words, e.g., to produce the novel word beat in the abstractive summary Germany beat Argentina 2-0 the model may attend to the words victorious and win in the source text.

Figure

Figure 11: The baseline model makes several factual inaccuracies: it claims porto beat bayern munich not vice versa, the score is changed from 7-4 to 2-0, jackson is changed to james and a heroes reception is replaced with a trophy. Our final model produces sentences that are individually accurate, but they do not make sense as a whole. Note that the final model omits the parenthesized phrase ( left ) from its second sentence.

Figure

Figure 7: Examples of abstractive summaries produced by our model (bold denotes novel words).

Figure

Figure 1: Exploring scaling factor β

Figure

Figure 1: The weighted centroid embedding of text T = {t1, t2, ..., tn}
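
A minimal sketch of forming such a weighted centroid; the particular weighting scheme and function name are assumptions rather than the figure’s exact definition.

```python
import numpy as np

def weighted_centroid(token_embeddings, weights):
    """token_embeddings: array (n, d), one embedding per token t_i; weights: array (n,)."""
    w = np.asarray(weights, dtype=float)
    # weighted average of the token embeddings
    return (w[:, None] * np.asarray(token_embeddings)).sum(axis=0) / w.sum()
```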

Figure

Figure 1: R-1 scores of a few systems, evaluated against the 50-word reference set of DUC 01. Systems R, S and T are from DUC 01; ICSISumm is a later competitive system (Gillick et al., 2008).

Figure

Figure 1: Average Pearson and Spearman correlations with Pyramid scores as a function of number of SCUs evaluated per topic, on the DUC ’05 and ’06 data.

Figure

Figure 2: Average Pearson and Spearman correlations with Pyramid scores as a function of number of topics used for evaluation, on the DUC ’05 and ’06 data.

Figure

Figure 5: Sample summaries for an NYT article. Our model with coherence reward overlaps the most with human summary (green is ours, blue denotes human).

Figure

Figure 2: Our proposed entity-driven abstractive summarization framework. Entity-aware content selector extracts salient sentences and abstract generator produces informative and coherent summaries. Both components are connected using reinforcement learning.

Figure

Figure 9: Sample summaries for a CNN/DM article. Our models are able to capture important information.

Figure

Figure 6: Sample summaries for an NYT article. Our models capture salient information which is missed by comparisons. Numbers are replaced with “0”.

Figure

Figure 3: Our proposed entity-aware content selector. Arrows denote attention, with darker color representing higher weights.

Figure

Figure 1: Sample summary of an article from the New York Times corpus (Sandhaus, 2008). Mentions of the same entity are colored. The underlined sentence in the article occurs at a relatively earlier position in the summary.

Figure

Figure 7: Sample summaries for an NYT article. The comparison model contains grammatical errors; our model is more coherent and contains less redundant information. Numbers are replaced with “0”.

Figure

Figure 8: Sample summaries for a CNN/DM article. Our model overlaps the most with the human summaries.

Figure

Figure 4: Accuracy of our coherence model compared to different baselines and Wu and Hu (2018) on PAIRWISE and SHUFFLE test sets.

Figure

Figure 5: Examples of summaries produced by GPG. Each two samples from CNN/DM, Gigaword and XSum (up to down). bold denotes novel words and their pointed source tokens. Bracketed numbers are the pointing probability (1− pgen) during decoding.

Figure

Figure 3: Test perplexity when increasing k

Figure

Figure 1: Alignment visualization of our model when decoding “closes”. Posterior alignment is more accurate for model interpretation. In contrast, the prior alignment probability is spared to “announced” and “closure”, which can be manually controlled to generate desired summaries. Decoded samples are shown when aligned to “announced” and “closure” respectively. Highlighted source words are those that can be directly aligned to a target token in the gold summary.

Figure

Figure 6: Examples of generated summaries. Examples are taken from CNN/DM, Gigaword and XSum (from up to down). Darker means higher pointing probability.

Figure

Figure 2: Architecture of the generalized pointer. The same encoder is applied to encode the source and target. When decoding “closes”, we first find top-k source positions with the most similar encoded state. For each position, the decoding probability is computed by adding its word embedding and a predicted relation embedding.

Figure

Figure 4: Pointing Ratio of the standard pointer generator and GPG (evaluated on the test data). GPG enables the point mode more often, but quite a few pointed tokens are edited rather than simply copied.

Figure

Figure 2: Examples of summaries generated by SummVD and different baselines exposed in §4.2 on a same article belonging to CNN/DM corpus.

Figure

Figure 1: SummVD Pipeline illustrating the sequence of operations needed to achieve an extractive summary from a given text document.

Figure

Figure 3: Average time to compute a summary, against the number of input words for SummVD and TextRank (gensim implementation). Time is in logarithmic scale.

Figure

Figure 4: Heatmap when removing the penalization term. We can see s[0] does not receive attention at all. Best viewed in color.

Figure

Figure 2: An instance of our contrastive samples. Given an annotated (D, Ŝ), we randomly discard a summary sentence ŝ_j′ and fill d_i′1 and d_i′2 to form the contrastive pair. d_i′1 has the highest ROUGE score with ŝ_j′; d_i′2 is randomly sampled.

Figure

Figure 3: Example of attention heatmap between document sentences (rows) and gold summary sentences (columns). s[0]: The illusion is the work of German body-painting artist Joerg Duesterwald, who spent hours painting his model. s[1]: Stunning set of pictures was taken in front of a rockface in a forest in Langenfeld, Germany, yesterday. Best viewed in color.

Figure

Figure 1: The Architecture of the Hybrid MemNet Model

Figure

Figure 1: Overview of our approach to create selfsupervised pre-training datasets from unlabelled scientific documents. The aspect-based summarization model is pre-trained on unlabelled documents, the section headings as aspects, and the following paragraphs corresponding to the aspects as aspect-based summaries.

Figure

Figure 2: Histogram of 50 most frequent aspects in the self-supervised samples (top: PubMed⋆, bottom: FacetSum⋆). PubMed⋆ has [150K,1.4K,214,33] unique aspects with frequency of higher than [1,10,100,1000] (FacetSum⋆:[96K,841,120,21]). Aspects removed from the NoOverlap datasets are highlighted in red.

Figure

Figure 3: Aspect-based summarization performance with limited supervised examples. Pre-training with in-domain and out-of-domain datasets significantly improves the low-resource training sample performance. Top: evaluation done on PubMed dataset, Bottom: evaluation is done on FacetSum dataset. ( —– BART , –•– BART + pre-trained on PubMed⋆, –×– BART + pretrained on FacetSum⋆, - - - BART fine-tuned on all samples)

Figure

Figure A.4: Effect of each variable on HaRiM. ∆ represents ps2s − plm. The bottom-right panel shows the effect of replacing the auxiliary LM probability with empty-sourced decoder inference (HaRiMlmless). Figure 1 plots each article-summary pair as a datapoint; here each token of the decoded output is a datapoint.

Figure

Figure A.3: Pearson’s ρ correlation between metric scores on the FRANK-BBC/XSUM split. The higher the correlation, the more similar the metrics’ behavior.

Figure

Figure A.6: Graphical model representation of the factors that affect metric (M)–human (H) correlation. A is the graphical model that supports the use of partial correlation, as argued in (Pagnoni et al., 2021). B is the graphical model reflecting our argument that the correlation should be measured while ignoring the effect of the generation system (S), whose influence is screened off by the observed child node, the text.

Figure

Figure A.1: Permutation test done for metric scores on FRANK-CNN/DM. 1 (filled grid) represents significant difference in metric performance, 0 represents negligible difference with confidence >=.95 (p <= 0.05), i.e. HaRiM is significantly more correlated to human judgements than all the other metrics except itself with a confidence of >=95%.

Figure

Figure A.7: Averaged experts’ judgements vs. averaged Turkers’ judgements on SummEval (datapoints are outputs from abstractive summarization models)

Figure

Figure 1: Effects of replacing the auxiliary language model (q(yi|y<i)) with an empty-sourced encoder-decoder model (p(yi|y<i; {})). Left compares the values of plm, and Right compares the HaRiM values. The values are calculated on the summary-article pairs in the FRANK benchmark. The high correlation of HaRiM suggests that the effect of the replacement is minimal.

Figure

Figure A.2: Pearson’s ρ correlation between metric scores on the FRANK-CNN/DM split. The higher the correlation, the more similar the metrics’ behavior. Red boxes highlight a notable observation: an unexpected behavioral similarity between metrics.

Figure

Figure A.8: Averaged experts’ judgements vs. averaged Turkers’ judgements on SummEval (datapoints are outputs from extractive summarization models)

Figure

Figure 2: Factuality label counts from FRANK benchmark. Legend shows the value of factuality annotation, varying from 0 (unfactual) to 1 (factual). The factuality labels for XSUM corpus are almost binary.

Figure

Figure A.5: Boxplot of HaRiM and log-likelihood scales, varying with the evaluated summarizer weights. base+cnn: BART-base fine-tuned on CNN/DailyMail, brio: BRIO (Meng et al., 2021), large+cnn: BART-large fine-tuned on CNN/DailyMail, large+cnn+para: further fine-tuned checkpoint of the previous model on the ParaBank2 corpus as suggested in (Yuan et al., 2021).

Figure

Figure 3: System architectures for ‘Struct+2Way+Word’ (left) and ‘Struct+2Way+Relation’ (right). βt,i (left) measures the structural importance of the i-th source word; βt,i (right) measures the saliency of the dependency edge pointing to the i-th source word. gep,i is the structural embedding of the parent. In both cases δt,i replaces αt,i to become the new attention value used to estimate the context vector ct.

Figure

Figure 2: System architectures for ‘Struct+Input’ (left) and ‘Struct+Hidden’ (right). A critical question we seek to answer is whether the structural embeddings (sei ) should be supplied as input to the encoder (left) or be exempted from encoding and directly concatenated with the encoder hidden states (right).

Figure

Figure 1: An example dependency parse tree created for the source sentence in Table 1. If important dependency edges such as “father← had” can be preserved in the summary, the system summary is likely to preserve the meaning of the original.

Figure

Figure 1: f_decoder^tree (top) consumes the partial tree representations at time t one by one to build the hidden representation h_t^T; f_decoder^seq (bottom) consumes the embeddings of summary words to build the partial summary representation h_t^y.

Figure

Figure 2: F-scores of systems on preserving relations of reference summaries (top) and source texts (bottom). We vary the threshold from 1.0 (strict match) to 0.7 in the x-axis to allow for strict and lenient matching of dependency relations.

Figure

Figure 1: An illustration of our CopyTrans architecture. The self-attention mechanism allows (i) a source word to attend to lower-level representations of all source words (including itself) to build a higher-level representation for it, and (ii) a summary word to attend to all source words, summary words prior to it, as well as the token at the current position (‘MASK’) to build a higher-level representation.

Figure

Figure 2: Effectiveness of position-aware beam search (§3.1). A larger beam tends to give better results.

Figure

Figure 1: An illustration of the generation process. A sequence of placeholders (“[MASK]”) are placed following the source text. Our model simultaneously predicts the most probable tokens for all positions, rather than predicting only the most probable next token in an autoregressive setting. We obtain the token that has the highest probability, and use it to replace the [MASK] token of that position. Next, the model makes new predictions for all remaining positions, conditioned on the source text and all summary tokens seen thus far. Our generator produces a summary having the exact given length and with a proper endpoint.
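
A schematic of the length-controlled filling loop described above; `predict_all_positions` is an assumed placeholder for the model's forward pass, so this is a sketch of the procedure rather than the authors' code.

MASK = "[MASK]"

def fill_masks(source_tokens, length, predict_all_positions):
    """Start from `length` [MASK] placeholders appended to the source. At each step the
    model scores every remaining masked position; we commit the single most probable
    (position, token) pair, then re-predict conditioned on everything decided so far.
    `predict_all_positions(source, summary)` is assumed to return
    {position: (best_token, probability)} for the still-masked positions."""
    summary = [MASK] * length
    while MASK in summary:
        preds = predict_all_positions(source_tokens, summary)
        pos = max(preds, key=lambda p: preds[p][1])    # most confident remaining position
        summary[pos] = preds[pos][0]                   # fix that token, keep the rest masked
    return summary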

Figure

Figure 2: Strong position bias can cause the abstractor to use only content at the beginning of the input to generate a summary. By exposing the chunks progressively, our approach makes use of this characteristic to consolidate information from multiple transcript chunks.

Figure

Figure 1: An example of a grounded summary where spans of summary text are tethered to the original audio. The user can tap to hear the audio clip, thus interpreting a system-generated summary in context.

Figure

Figure 6: The n-gram abstractiveness and percentage of novel n-grams metrics, across different n-gram orders, on TLDRHQ’s test set. As seen, BART generates more abstractive summaries than BERTSUMABS, narrowing the gap between BERTSUMABS and the ground-truth summaries.

Figure

Figure 3: S score inter-rater agreement for annotation without context (left), and annotation with context (right)

Figure

Figure 4: The proportion of instances containing a TLDR in the TLDR9+ dataset. As seen, the number of TLDRs is increasing each year. At the time of conducting this research, the submission data dumps were only partially uploaded for 2021 (until 2021-06), and no comment dumps for 2021 had been uploaded to the Pushshift repository.

Figure

Figure 2: The proportion of TLDRs over all posts (submissions and comments) submitted per year (Figures (c) and (d)). At the time of writing this paper, submission dumps were only partly uploaded for 2021 (until 2021-06), and no comment dumps for 2021 had been uploaded.

Figure

Figure 5: Heatmaps of TLDRHQ showing (a) the oracle sentence’s importance to its relative position; (b) percentage of novel n-grams; and (c) n-gram abstractiveness. The heat extent shows the number of the instances within the specific bin.

Figure

Figure 7: A sample from TLDRHQ test set along with the model generated summaries. Underlined text in source shows the important regions of the source for generating TLDR summary.

Figure

Figure 1: An example Reddit post with TLDR summary. As seen, the TLDR summary is extremely short, and highly abstractive.

Figure

Figure 3: Detailed illustration of our summarization framework. Task-1 (t1): source sentence extraction (right-hand gray box). Task-2 (t2): introductory sentence extraction (left-hand gray box). As shown, the identified salient introductory sentences at training stages are incorporated into the representations of source sentences by the Select(·) function (orange box) with k = 3. Plus sign shows the concatenation layer. The feed-forward neural network is made of one linear layer.

Figure

Figure 4: (a) Our system’s generated summary, (b) Sentence graph visualization of our system’s generated summary. Green and gray nodes are introductory and non-introductory sentences, respectively. Edge thickness denotes the strength of the ROUGE score between pairs of sentences. Parts from which sentences are sampled are shown inside brackets. The summary is truncated due to space limitations. Ground-truth summary-worthy sentences are underlined, and colored spans show pointers from introductory to non-introductory sentences.

Figure

Figure 1: A truncated human-written extended summary. Top box: introductory information, bottom box: non-introductory information. Colored spans are pointers from introductory sentences to associated non-introductory detailed sentences.

Figure

Figure 2: Our model uses introductory sentences as pointers to the source sentences. It then forms the final extended summary by extracting salient sentences from the source. Highlights in red show the salient parts.

Figure

Figure 3: Time spent on annotation (in minutes) vs. correlation with the full-sized score. We gather annotation times in buckets with a width of ten minutes and show the 95% confidence interval for each bucket.

Figure

Figure 4: Relation of type I error rates at p < 0.05 to the total number of annotators for different designs, all with 100 documents and 3 judgements per summary. We conduct the experiment with both the t-test and approximate randomization test (ART). We show results both with averaging results per document and without any aggregation. We run 2000 trials per design. The red line marks the nominal error rate of 0.05.

Figure

Figure 6: Reliabilities of nested vs. crossed designs for Rank and Likert for both tasks.

Figure

Figure 8: Screenshots of the Annotator Instructions.

Figure

Figure 9: Screenshots of the Annotation Interfaces.

Figure

Figure 2: Score distribution of Likert for both tasks. Each data point shows the number of times a particular score was assigned to each system.

Figure

Figure 5: Power for 100 documents and 3 judgements per summary with different number of total annotators.

Figure

Figure 7: Power for p < 0.05 of nested and crossed designs for ARTagg and regression. X-axis shows the number of judgements elicited, Y-axis the power-level.

Figure

Figure 1: Schematic representation of our study design. Rows represent annotators, columns documents. Each blue square corresponds to a judgement of the summaries of all five systems for a document. Every rectangular group of blue squares forms one block.

Figure

Figure 4: Ranking accuracy between shuffled and original summaries of different lengths (in characters). We sample 10,000 pairs and group them in buckets of 20 characters and clamp differences between -200 and 200.

Figure

Figure 6: Histograms of the lengths of summaries generated by the summarizers in SummEval and their mean lengths. Both in characters.

Figure

Figure 3: Bias matrix for BAS with specific analysis for BART and Pegasus. The upper triangular matrix indicates τ+ for the given summarizer pair, the lower τ−. The area of each circle is proportional to the number of pairs in H+/H− for the cell. To read off the behaviour of the CM on a specific summarizer, we follow both the corresponding row and column. A high score in the row, combined with a low score in the corresponding cell in the column implies the CM is biased towards generations by this particular summarizer.

Figure

Figure 1: Distribution of human coherence scores for the 17 systems in the SummEval dataset. The red dots indicate the mean score of each system.

Figure

Figure 5: Intra-system correlations of the best CMs as well as the human upper bound on the SummEval dataset. Bars indicate 95% confidence intervals determined by bootstrap resampling with 1000 samples.

Figure

Figure 2: Bias Matrices for the best CMs. We also show the bias matrix for the architecture confounder for reference. See Figure 3 for a brief tutorial to bias matrix analysis.

Figure

Figure 7: Summary quality as a function of metric optimized and amount of optimization, using best-of-N rejection sampling. We evaluate ROUGE, our main reward models, and an earlier iteration of the 1.3B model trained on approximately 75% as much data (see Table 11 for details). ROUGE appears to peak both sooner and at a substantially lower preference rate than all reward models. Details in Appendix G.3.
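
Best-of-N rejection sampling is simple to state; the sketch below assumes placeholder `sample_summary` and `reward_model` callables and is only meant to make the procedure concrete, not to reproduce the paper's setup.

def best_of_n(prompt, sample_summary, reward_model, n=16):
    """Best-of-N rejection sampling: draw N candidate summaries from the policy and
    keep the one the reward model scores highest. `sample_summary(prompt)` and
    `reward_model(prompt, summary)` are assumed interfaces for the policy and the
    learned reward model."""
    candidates = [sample_summary(prompt) for _ in range(n)]
    return max(candidates, key=lambda s: reward_model(prompt, s))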

Figure

Figure 1: Fraction of the time humans prefer our models’ summaries over the human-generated reference summaries on the TL;DR dataset. Since quality judgments involve an arbitrary decision about how to trade off summary length vs. coverage within the 24-48 token limit, we also provide length-controlled graphs in Appendix F; length differences explain about a third of the gap between feedback and supervised learning at 6.7B.

Figure

Figure 3: Evaluations of four axes of summary quality on the TL;DR dataset.

Figure

Figure 4: Transfer results on CNN/DM. (a) Overall summary quality on CNN/DM as a function of model size. Full results across axes shown in Appendix G.2. (b) Overall scores vs. length for the 6.7B TL;DR supervised baseline, the 6.7B TL;DR human feedback model, and T5 fine-tuned on CNN/DM summaries. At similar summary lengths, our 6.7B TL;DR human feedback model nearly matches T5 despite never being trained to summarize news articles.

Figure

Figure 5: Preference scores versus degree of reward model optimization. Optimizing against the reward model initially improves summaries, but eventually overfits, giving worse summaries. This figure uses an earlier version of our reward model (see rm3 in Appendix C.6). See Appendix H.2 for samples from the KL 250 model.

Figure

Figure 6: Reward model performance versus data size and model size. Doubling amount of training data leads to a ~1.1% increase in reward model validation accuracy, whereas doubling the model size leads to a ~1.8% increase. The 6.7B model trained on all data begins approaching the accuracy of a single human.

Figure

Figure 1: The framework of QFS-BART. The QA module calculates the answer relevance scores, and we incorporate the scores as explicit answer relevance attention to the encoder-decoder attention.

Figure

Figure 4: Screenshot of Content Support Task.

Figure

Figure 3: Screenshot of Best-Worst Scaling Task.

Figure

Figure 1: Overview of the OPINIONDIGEST framework.

Figure

Figure 2: Sensitivity analysis on hyper-parameters. Above row: Top-k opinion (k) vs merging threshold (θ); Bottom row: Top-k opinion (k) vs max token size (L).

Figure

Figure 5: Screenshot of Aspect-Specific Summary Task.

Figure

Figure 2: Cosine similarities between summaries generated by lead systems and reference in embedding space on the CNN/DailyMail test set.

Figure

Figure 2: Truncated articles lead to performance improvement for max, avg and InferSent representation.

Figure

Figure 1: ROUGE recall, precision and F1 scores for lead, random, textrank and Pointer-Generator on the CNN/DailyMail test set.

Figure

Figure 3: This scatter plot shows the human coverage score and embedding similarity on DUC2001. The baseline system is shortened to ‘b’.

Figure

Figure 1: For each number of articles, we sample and compute the correlation 50 times and plot the average as well as the standard deviation. The decreasing size of the error bars shows that enough articles are provided for each system, and that this is not the reason for the performance discrepancy between DUC2001 and DUC2002.

Figure

Figure 2: Process for dataset creation. In this example, two source sentences, s1 and s2, are selected, and the hypothesis sentence is generated by a sentence fusion operation. The selected sentences are considered error-originating if the hypothesis sentence poses an error. These source sentences are referred to as corresponding sentences.

Figure

Figure 6: Comparison of FactCCX and SumPhrase outputs. The sentence with a blue underline was identified as an error-corresponding sentence by SumPhrase, whereas the span with a red underline was localized by FactCCX. The dependency relation between the red phrases was determined as erroneous by SumPhrase.

Figure

Figure 4: Example of phrase-level labels. The blue and red edges denote consistent and inconsistent labels, respectively, whereas the green edge indicates that the inter-phrase relation is unlabeled.

Figure

Figure 3: Motivating example for phrase-level labeling.

Figure

Figure 1: Overview of proposal. A synthetic dataset is created and used as weak supervision to train the SumPhrase error localization model.

Figure

Figure 5: SumPhrase model. The blue and red frames display the intra-phrase and inter-phrase detection parts, respectively, and the green frame indicates the corresponding sentence localization part.

Figure

Figure 4: Attention heatmap when generating the example summary. Ii and Oi indicate the i-th sentence of the input and output, respectively.

Figure

Figure 1: The framework of summary-to-headline generation.

Figure

Figure 1: Hierarchical encoder-decoder framework and comparison of the attention mechanisms.

Figure

Figure 4: Headlines generated by lead-flat-att and summ-hieratt for two examples from the NYT test data. S1 indicates the summary extracted by the Lead method.

Figure

Figure 3: Examples of generated summaries.

Figure

Figure 3: Heat map of the distribution of summary-level attention weights when generating every word for the example in Figure 2. Darker color indicates higher weight.

Figure

Figure 2: An example of headline generation from NYT test data. G is the true headline, L is the output of lead-flat-att, O is the output of our approach summ-hieratt. S1 to S5 are the document-level summaries, and each summarization method is indicated in “[]” at the end.

Figure

Figure 2: Results of different settings of hyperparameters tested on 500 samples from the DailyMail test set.

Figure

Figure 3: Proportions of model outputs that get a human score ≥ 4. For example, around 95% of summaries by Weak-Sup (ours) are scored 4 or 5 in terms of accuracy.

Figure

Figure 2: Visualizing the ROUGE-1 results in Table 2. The green dashed line marks the performance of BART fine-tuned on the whole MA-News training set.

Figure

Figure 1: Illustration of our approach. Left: Constructing weak supervisions using ConceptNet, including (1) extracting aspects and (2) synthesizing aspect-based summaries. Right: Augmenting aspect information, including (3) identifying aspect related words in the document using Wikipedia and (4) feeding both aspect and related words into summarization model.

Figure

Figure 3: A sample summary comparison on the Multi-News dataset. OTExtSum based summary sentences are

Figure

Figure 1: Illustration of Optimal Transport Extractive Summariser (OTExtSum): the formulation of extractive summarisation as an optimal transport (OT) problem. Optimal sentence extraction is conceptualised as obtaining the optimal extraction vector m∗, which achieves an OT plan from a document D to its optimal summary S∗ that has the minimum transportation cost. Such a cost is defined as the Wasserstein distance between the document’s semantic distribution TFD and the summary’s semantic distribution TFS and is used to measure the summary’s semantic coverage.
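
To make the scoring idea concrete, here is a small numpy sketch that approximates the transport cost between a document's and a summary's term-frequency distributions with entropically regularized Sinkhorn iterations; the cost matrix built from one minus cosine similarity of token embeddings is an assumption for illustration, not necessarily the paper's exact choice.

import numpy as np

def sinkhorn_cost(tf_doc, tf_sum, cost, reg=0.1, n_iter=200):
    """Approximate optimal transport cost between two term-frequency distributions
    (nonnegative vectors summing to 1) using Sinkhorn iterations with entropic
    regularization. `cost` holds pairwise token distances."""
    K = np.exp(-cost / reg)
    u = np.ones_like(tf_doc)
    for _ in range(n_iter):
        v = tf_sum / (K.T @ u)
        u = tf_doc / (K @ v)
    plan = np.diag(u) @ K @ np.diag(v)      # approximate transport plan
    return float((plan * cost).sum())

def summary_coverage_cost(doc_tf, sum_tf, token_embs):
    """Cost matrix from token embeddings: 1 - cosine similarity (an illustrative choice)."""
    e = token_embs / np.linalg.norm(token_embs, axis=1, keepdims=True)
    cost = 1.0 - e @ e.T
    return sinkhorn_cost(doc_tf, sum_tf, cost)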

Figure

Figure 4: A sample summary comparison on the BillSum dataset. OTExtSum based summary sentences are

Figure

Figure 7: This is how our task will look to Mechanical Turk Workers.

Figure

Figure 2: Interpretable visualisation of the OT plan from a source document to a resulting summary on the CNN/DM dataset. The higher the intensity, the more the semantic content of a particular document token is covered by a summary token. The purple line highlights the transportation from the document to the summary of the semantic content of the token “month”, which appears in both the document and the summary. The red line highlights how the semantic content of the token “sponsor”, which appears in the document only but not the summary, is transported to the tokens “tour” and “extension”, which are semantically closer and have lower transport costs, and thus achieve a minimum transportation cost in the OT plan.

Figure

Figure 5: A sample summary comparison on the PubMed dataset. OTExtSum based summary sentences are

Figure

Figure 2: Score distribution of LS10 across CNN/DM and XSum. Each data point shows the number of times a score was assigned to each system.

Figure

Figure 4: Screenshot of the evaluation page for BWS annotation.

Figure

Figure 6: Screenshot of the evaluation page for Likert Scale annotation.

Figure

Figure 1: Score distribution of LS with a 5-point scale across CNN/DM and XSum. Each data point shows the number of times a score was assigned to each system.

Figure

Figure 3: Screenshot of the instruction page for BWS annotation.

Figure

Figure 5: Screenshot of the instruction page we used for Likert Scale annotation.

Figure

Figure 6: A sample summary comparison on the CNN/DM dataset. OTExtSum based summary sentences

Figure

Figure 4: XSum class-level performance, averaged across all models.

Figure

Figure 9: XSum correlations.

Figure

Figure 6: Pearson correlations for Extractive, Paraphrase and Evidence samples in Gigaword and CNN/DM.

Figure

Figure 7: Gigaword correlations.

Figure

Figure 1: Distribution of the different class of samples in all datasets.

Figure

Figure 2: Gigaword class-level performance, averaged across all models.

Figure

Figure 8: CNN/DM correlations.

Figure

Figure 3: CNN/DM class-level performance, averaged across all models.

Figure

Figure 5: Pearson correlation between different metrics for all three datasets.

Figure

Fig. 5 to perform human analysis.

Figure

Figure 2: Overview of Hard Typed Decoder (left) and Our Reinforced Hard Typed Decoder (right).

Figure

Figure 1: Current models tend to output general and less meaningful summaries.

Figure

Figure 3: Learning Curve of RHTD
Figure 4: Generated Summaries on E.g. 2
Figure 5: Generated Summaries on E.g. 3

Figure

Figure 1: A high level description of the NHG model. The model predicts the next headline word yt given the words in the document x1 . . . xN and already generated headline words y1 . . . yt−1.

Figure

Figure 2: Validation set (EN) perplexities of the NHG model with different pre-training methods.

Figure

Figure 3: Overarching system process flow

Figure

Figure 5: ROUGE-1 score comparisons for various budgets, on the 3 datasets used in this study.

Figure

Figure 1: Undirected, weighted graph-of-words example. W = 8 and the window spans across sentences. Stemmed words, weighted k-core decomposition. Numbers inside parentheses are CoreRank scores. For clarity, words other than nouns and adjectives (shown in italics) have been removed.

Figure

Figure 2: k-core decomposition of a graph and illustration of the value added by CoreRank. While the two marked nodes have the same core number (= 2), one of them has a greater CoreRank score (3+2+2=7 vs 2+2+1=5), which better reflects its more central position in the graph.
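
A toy sketch of the CoreRank idea (a node's score is the sum of its neighbours' core numbers), using networkx for the k-core decomposition; this is an unweighted illustration and does not reproduce the paper's weighted decomposition or the example graph.

import networkx as nx

def corerank(G):
    """CoreRank of a node = sum of the core numbers of its neighbours, which can
    separate nodes that share the same core number, as in the figure's example."""
    core = nx.core_number(G)   # unweighted k-core; the paper uses a weighted variant
    return {n: sum(core[m] for m in G.neighbors(n)) for n in G.nodes}

# toy usage on an arbitrary small graph-of-words
G = nx.Graph([("budget", "cuts"), ("cuts", "hurt"), ("hurt", "schools"), ("budget", "hurt")])
print(nx.core_number(G), corerank(G))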

Figure

Figure 4: Size of the DUC2001 documents in development and test sets.

Figure

Figure 3: Comparison between NLI models augmented with Falsesum and FactCC across different measures of summary extractiveness. The x-axis shows the median overlap score of each test subset.

Figure

Figure 1: Overview of the Falsesum generation framework. Falsesum preprocesses and formats the source document (A) and a gold summary (B) before feeding it to a fine-tuned generator model. The model produces a factually inconsistent summary, which can then be used to obtain (A,D) or (A,E) as the negative (non-entailment) NLI premise-hypothesis example pair. We also use the original (A,B) as a positive NLI example (entailment).

Figure

Figure 2: Input format design of Falsesum. The framework first extracts the predicate and argument spans from the source document and the gold summary. The spans are then corrupted, lemmatized, and shuffled before being inserted into the input template.

Figure

Figure 5: Drop of the mean BLANC-tune value when parameters differ from optimal. The drop is shown as a fraction of the optimal mean BLANC value. The summaries probed are: CNN and DM (from the CNN/Daily Mail dataset), Top and Rand (first two sentences and random two sentences from random news articles). The parameters probed are: ’gap-infer 2/1’ is gap = 2 and gap_mask = 1; ’gap-tune 2/1’ is gap_tune = 2 and gap_mask_tune = 1; ’p-replace 0.1’ is p_replace = 0.1; ’toks-normal 4’ is L_normal = 4; ’tune-rand’ is making token masking random rather than even at tuning.

Figure

Figure 5: Spearman correlation between SummEval experts scores and ESTIME by embeddings from different layers of the model. Thick lines: ESTIME (as defined by Equation 1 and considered throughout the paper). Thin lines: Nw as defined by Equation 3.

Figure

Figure 2: Spearman correlation between SummEval experts scores and ESTIME using embeddings taken from different layers of the model.

Figure

Figure 3: Spearman and Kendall Tau-c correlations - system level - between SummEval experts scores of consistency and ESTIME using embeddings taken from different layers of the model.

Figure

Figure 12: Kendall Tau-c correlation between SummEval experts scores and ESTIME by embeddings from different layers of the model. Thick lines: sparsity of the masking is defined by the distance 8 and the margin 50 (see Section 2), as used throughout the paper. Thin lines: distance 8, margin 100.

Figure

Figure 4: Kendall Tau-c correlation between SummEval experts scores and ESTIME by embeddings from different layers of the model. Thick lines: ESTIME (as defined by Equation 1 and considered throughout the paper). Thin lines: Nw as defined by Equation 3.

Figure

Figure 1: Factor by which Spearman correlation of BLANC with human scores increases when only part of text is used for BLANC. The text part is selected as sentences with top BLANC values (thin lines) or as contiguous sentences with highest BLANC (thick lines).

Figure

Figure 1: Kendall Tau-c correlation between SummEval experts scores and ESTIME using embeddings taken from different layers of the model.

Figure

Figure 9: Spearman correlation between SummEval experts scores and ESTIME by embeddings from different layers of the model. Thick lines: the model is bert-large-uncased-whole-word-masking. Thin lines: the model is bert-base-uncased.

Figure

Figure 7: Spearman correlation between SummEval experts scores and ESTIME by embeddings from different layers of the model. Thick lines: unnormalized embeddings (as used throughout the paper). Thin lines: normalized embeddings.

Figure

Figure 2: Factor by which Spearman correlation of BLANC with human scores increases when only part of text is used for BLANC. The text part is selected as sentences with top BLANC values (thin lines) or as contiguous sentences having highest average BLANC (thick lines). The resulting BLANC is calculated as average over BLANC of the sentences.

Figure

Figure 13: Spearman correlation between SummEval experts scores and ESTIME by embeddings from different layers of the model. Thick lines: sparsity of the masking is defined by the distance 8 and the margin 50 (see Section 2), as used throughout the paper. Thin lines: distance 8, margin 100.

Figure

Figure 11: Spearman correlation between SummEval experts scores and ESTIME by embeddings from different layers of the model. Thick lines: all tokens are used, as is done throughout the paper. Thin lines: tokens of determiners (part of speech) are not used.

Figure

Figure 8: Example of text from SummEval dataset.

Figure

Figure 8: Kendall Tau-c correlation between SummEval experts scores and ESTIME by embeddings from different layers of the model. Thick lines: the model is bert-large-uncased-whole-word-masking. Thin lines: the model is bert-base-uncased.

Figure

Figure 6: Example of a summary with a wide coverage (left) and a narrow coverage (right). Both summaries are supposed to cover the first four paragraphs of ’Harry Potter and the Sorcerer’s Stone’ by J.K. Rowling.

Figure

Figure 6: Kendall Tau-c correlation between SummEval experts scores and ESTIME by embeddings from different layers of the model. Thick lines: unnormalized embeddings (as used throughout the paper). Thin lines: normalized embeddings.

Figure

Figure 3: Factor by which Spearman correlation of BLANC with human scores increases when only part of text is used for BLANC. The text part is selected as sentences with BLANC exceeding threshold.

Figure

Figure 4: Drop of the mean BLANC-help value when parameters differ from optimal. The drop is shown as a fraction of the optimal mean BLANC value. The summaries probed are: CNN and DM (from the CNN/Daily Mail dataset), Top and Rand (first two sentences and random two sentences from random news articles). The parameters probed are: ’gap 3/1’ is gap = 3 and gap_mask = 1; ’gap 3/2’ is gap = 3 and gap_mask = 2; ’toks-normal 5’ is L_normal = 5; ’toks-lead 2’ is L_lead = 2; ’toks-follow 2’ is L_follow = 2.

Figure

Figure 10: Kendall Tau-c correlation between SummEval experts scores and ESTIME by embeddings from different layers of the model. Thick lines: all tokens are used, as is done throughout the paper. Thin lines: tokens of determiners (part of speech) are not used.

Figure

Figure 7: Example of a summary with a wide coverage (left) and a narrow coverage (right). Both summaries are supposed to cover the same text taken from CNN/Daily Mail dataset. The text is shown in Figure 8.

Figure

Figure 1: At t = 4, p_gen weighs the probability of copying a word from V_ext higher than generating a word from the fixed vocabulary V†. The decoder learns to interpret the weighted sum of h_4^L and c_4 in order to compute a probability distribution for the most appropriate text realisation given the context of the triples. The attention mechanism highlights f2 as the most important triple for the generation of the upcoming token. The attention scores are distributed among the entries of V_ext and accumulated into the final distribution over V. As a result, the model copies “science fiction”, one of the surface forms associated with f2.
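
A numpy sketch of the mixing step this caption walks through: the fixed-vocabulary distribution is scaled by p_gen and the attention mass over source positions is scattered into an extended vocabulary with weight 1 − p_gen; shapes and names are illustrative of the generic copy mechanism, not the paper's exact code.

import numpy as np

def final_distribution(p_gen, p_vocab, attention, src_ids, ext_vocab_size):
    """Mix generation and copying:
    P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum of attention on source positions where w occurs.
    `src_ids` maps each source position to an id in the extended vocabulary
    (fixed vocabulary plus per-example surface forms such as "science fiction")."""
    p_final = np.zeros(ext_vocab_size)
    p_final[: len(p_vocab)] = p_gen * p_vocab               # generate from the fixed vocabulary
    np.add.at(p_final, src_ids, (1.0 - p_gen) * attention)  # accumulate copy probability per source token
    return p_final

# toy usage: 5-word fixed vocabulary, one out-of-vocabulary source token (id 5)
p = final_distribution(p_gen=0.3, p_vocab=np.full(5, 0.2),
                       attention=np.array([0.1, 0.7, 0.2]),
                       src_ids=np.array([2, 5, 0]), ext_vocab_size=6)
print(p, p.sum())  # the mixture still sums to 1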

Figure

Figure 2: The percentage with which the unique predicates from the input triples are covered in the summaries.

Figure

Figure 1. ROUGE-1 vs. λ

Figure

Figure 3. Average braille length vs. λ

Figure

Figure 2. ROUGE-2 vs. λ

Figure

Figure 6: Faithfulness-abstractiveness trade-off curve, shown as the dashed red line, on Gigaword dataset. We plot each model’s average faithfulness score evaluated by AMT against its extractiveness level. Our model lies above the graph, performing better than MLE-baseline, DAE (Goyal and Durrett, 2021), and Loss Truncation (Kang and Hashimoto, 2020).

Figure

Figure 7: Summaries changed using the corrector. We mark hallucinated entities in the summaries with red.

Figure

Figure 2: Example output using different strategies of corrector and contrastor. The first two rows show the original document and summary with highlighted entities and their respective labels (date, number, ent). We mark hallucinated entities in the summaries with red, factual entities in document and summary with green and underlined, and

Figure

Figure 8: Example summaries from XSum and Gigaword. Nonfactual components are marked with red.

Figure

Figure 3: Relative effect using different sentence selection criteria on XSum. Adding FactCC to criteria consistently improves factuality. Full result in Table 10.

Figure

Figure 4: Zero-shot and few-shot results. The lines represent each model’s performance when fine-tuned on 0 (zero-shot), 1, 10, 100, and 1000 examples. FACTPEGASUS consistently improves sentence error with more training data. Without the corrector and contrastor, factuality decreases with just 10 examples.

Figure

Figure 1: Illustration of FACTPEGASUS. For pre-training (a), we use the factGSG objective introduced in Section 3.1 that transforms a text document into a pseudo-summarization dataset. We select the pseudo-summary using the combination of ROUGE and FactCC. Here, sentence A is selected as the pseudo-summary, and we mask this sentence in the original text to create the pseudo-document. During fine-tuning (b), the connector (i) simulates the factGSG task by appending the same mask token used in (a) to the input document, so that we have the same setup in both training stages. Then, corrector (ii) removes hallucinations (highlighted in red) from the summary. Finally, contrastive learning in (iii) encourages the model to prefer the corrected summary over the perturbed summary.

Figure

Figure 5: Factuality dynamics result. We show token error, sentence error, and FactCC as training progresses. FACTPEGASUS slows down factuality degradation for all metrics compared to BART-base.

Figure

Figure 9: Example summaries from WikiHow. The article is truncated to fit the page. Nonfactual information is marked in red.

Figure

Figure 3: Sample summaries generated by different systems on movie reviews and arguments. We only show a subset of reviews and arguments due to limited space.

Figure

Figure 4: Sampling effect on RottenTomatoes.

Figure

Figure 1: Examples for an opinion consensus of professional reviews (critics) about movie “The Martian” from www.rottentomatoes.com, and a claim about “death penalty” supported by arguments from idebate.org. Content with similar meaning is highlighted in the same color.

Figure

Figure 2: Evaluation of importance estimation by mean reciprocal rank (MRR), and normalized discounted cumulative gain at top 3 and 5 returned results (NDCG@3 and NDCG@5). Our regression model with pairwise preference-based regularizer uniformly outperforms baseline systems on both datasets.

Figure

Figure 3: When the second “arrested” appears and the sentence becomes ungrammatical, the discriminator determines that this example comes from the generator. Hence, after this time-step, it outputs low scores.

Figure

Figure 4: Real examples with methods referred in Table 1. The proposed methods generated summaries that grasped the core idea of the articles.

Figure

Figure 1: Proposed model. Given a long text, the generator produces a shorter text as a summary. The generator is learned by minimizing the reconstruction loss together with the reconstructor and by making the discriminator regard its output as human-written text.

Figure

Figure 2: Architecture of the proposed model. The generator network and reconstructor network are seq2seq hybrid pointer-generator networks, but for simplicity we omit the pointer and the attention parts. The reconstruction loss varies widely from sample to sample, and thus the rewards to the generator are not stable either; hence we add a baseline to reduce their difference. We apply self-critical sequence training (Rennie et al., 2017); the modified reward rR(x, x̂) from reconstructor R with the baseline for the generator is
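
For reference, the standard self-critical baseline of Rennie et al. (2017) subtracts the reward of the greedily decoded output from that of the sampled output; the form below is shown only as an illustration of that general scheme, not necessarily the exact reward defined in this model.

\tilde{r}_R(x, \hat{x}) \;=\; r_R(x, \hat{x}) \;-\; r_R\big(x, \hat{x}^{\text{greedy}}\big)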

Figure

Figure 1: A graphical illustration of the topic-aware convolutional architecture. Word and topic embeddings of the source sequence are encoded by the associated convolutional blocks (bottom left and bottom right). Then we jointly attend to words and topics by computing dot products of decoder representations (top left) and word/topic encoder representations. Finally, we produce the target sequence through a biased probability generation mechanism.

Figure

Figure 4: The Rouge-1 and Rouge-L scores for each pre-training method and the basic model on the development set during the training process.

Figure

Figure 4: Quality of rankings given by Fast Rerank.

Figure

Figure 3: Quality of candidate templates under different ranges.

Figure

Figure 3: ROUGE metrics on Gigaword and DUC2004 w.r.t. a different number of concept candidates. Updates were conducted by hard assignment (P̂_i^C via argmax) and by random selection (P̂_i^C chosen at random).

Figure

Figure 1: “summary1” only copies keywords from the source text, while “summary2” generates new concepts to convey the meaning.

Figure

Figure 3: This figure shows the Rouge-2 score for each pre-training method and the basic model on the development set during the training process. We put the results for the Rouge-1 and Rouge-L scores in Appendix A.2.

Figure

Figure 1: Overview of the Fast Rerank Module.

Figure

Figure 2: The structure of the proposed model: (a) the Bi-Directional Selective Encoding with Template model (BiSET) and (b) the bi-directional selective layer.

Figure

Figure 2: The structure of the Basic Model. We use an LSTM and a self-attention module to encode the sentence and document respectively. Xi represents the word embeddings for sentence i. Si and Di represent the independent and document-involved sentence embeddings for sentence i, respectively.

Figure

Figure 1: An example of the Mask pre-training task. A sentence is masked in the original paragraph, and the model is required to predict the missing sentence from the candidate sentences.

Figure

Figure 2: The architecture of our model. The blue bar represents the attention distribution over the inputs. The purple bar represents the concept distribution over the inputs. Note that this distribution can be sparse since not every word has an upper concept. The green bar represents the vocabulary distribution generated from the seq2seq component.

Figure

Figure 3: Graph structure of HETERDOCSUMGRAPH for multi-document summarization (corresponding to the Graph Layer part of Figure 1). Green, blue and orange boxes represent word, sentence and document nodes respectively. d1 consists of s11 and s12 while d2 contains s21 and s22. With word nodes acting as relay nodes, document-document, sentence-sentence, and sentence-document relations can be built through the common word nodes. For example, sentences s11, s12 and s21 share the same word w1, which connects them across documents.

Figure

Figure 8: A generated summary example of CNN/DM.

Figure

Figure 3: Annotation interface and instructions for XSUM factual consistency task.

Figure

Figure 4: The plot of the improvement of BertSUM+TA over BertSUM as a function of the document length for (a) CNN/DM and (b) Xsum, where the improvement is measured by the amount of increase in the ROUGE scores. The documents in each corpus are equally divided into 10 different groups based on their lengths. Each point of a curve indicates the average ROUGE score in its corresponding group.

Figure

Figure 3: (a) TA for multi-head attention; (b) Mask matrices in decoder SA (left) and decoder CA (right).

Figure

Figure 1: Model Overview. The framework consists of three major modules: graph initializers, the heterogeneous graph layer and the sentence selector. Green circles and blue boxes represent word and sentence nodes respectively. Orange solid lines denote the edge feature (TF-IDF) between word and sentence nodes and the thicknesses indicate the weight. The representations of sentence nodes will be finally used for summary selection.

Figure

Figure 5: Relationship between number of source documents (x-axis) and R̃ (y-axis).

Figure

Figure 1: The structure of BertSUM with TA, where the names in bold are our proposed modules in TA.

Figure

Figure 3: Exploring anchor set size k.

Figure

Figure 2: Distribution of tokens represented by ϕv in SIA, learned on CNN/DM using PFA and (5).

Figure

Figure 4: Relationships between the average degree of word nodes of the document (x-axis) and R̃, which is the mean of R-1, R-2 and R-L (lines, left y-axis), and ∆R̃, which is the difference in R̃ between HETERSUMGRAPH and Ext-BiLSTM (histograms, right y-axis).

Figure

Figure 2: Annotation interface and instructions for CNN/DM factual consistency task.

Figure

Figure 2: An example of anchor set for the bigram “great success” when top-3 results are extracted.

Figure

Figure 1: Overview of QAGS. A set of questions is generated based on the summary. The questions are then answered using both the source article and the summary. Corresponding answers are compared using a similarity function and averaged across questions to produce the final QAGS score.
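
A compact sketch of the QAGS scoring loop, with token-level F1 as the answer-similarity function and placeholder `generate_questions` / `answer` callables standing in for the QG and QA models; the interfaces are assumptions for illustration, not the authors' exact code.

def token_f1(answer_a, answer_b):
    """Token-level F1 between two answer strings (a common choice of answer similarity)."""
    a, b = answer_a.lower().split(), answer_b.lower().split()
    common = sum(min(a.count(t), b.count(t)) for t in set(a))
    if common == 0:
        return 0.0
    p, r = common / len(b), common / len(a)
    return 2 * p * r / (p + r)

def qags_score(article, summary, generate_questions, answer):
    """QAGS-style score: ask questions about the summary, answer them against both the
    summary and the article, and average the answer similarity across questions."""
    questions = generate_questions(summary)
    sims = [token_f1(answer(q, summary), answer(q, article)) for q in questions]
    return sum(sims) / len(sims) if sims else 0.0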

Figure

Figure 9: A generated summary example of NYT, where the generation of BertSUM comes from the original paper (Liu and Lapata, 2019).

Figure

Figure 1: The transition from old-fashioned to newly-introduced protocol for designing reference-based automatic evaluation metrics in the document summarization task. The curved arrows on the right show that both summaries are derived from the source document.

Figure

Figure 2: The detailed update process of word and sentence nodes in the Heterogeneous Graph Layer. Green and blue nodes are the word and sentence nodes involved in this turn. Orange edges indicate the current information flow direction. First, for sentence s1, words w1 and w3 are used to aggregate word-level information in (a). Next, w1 is updated by the new representations of s1 and s2 in (b), which are the sentences in which it occurs. See Section 3.3 for details on the notation.

Figure

Figure 5: The conditional generation results obtained by plugging TEMA into GPT-2 small.

Figure

Figure 1: The framework of the proposed method

Figure

Figure 1: Overview of our MADY model, containing a hierarchical multi-scale abstraction modeling module (HMAM) on the left and a dynamic key-value memory-augmented attention network (DMA) on the right.

Figure

Figure 2: (a) A visualization of the interaction matrix. (b) The reshaped matrix controlled by novelty. (c) The reshaped matrix controlled by relevance. For simplicity, both thresholds were set to 0.5.

Figure

Figure 2: The influence and relationship of some major factors, i.e., the coefficient β, tweet volume, and tweet latency

Figure

Figure 1: (a) Position of highlight hits in the documents; (b) Top-4 tweet hits in the documents; (c) The probability of a highlight hit vs. a tweet hit in the documents; (d) The maximum similarity between highlights and tweets per document

Figure

Figure 1: Example contrasting the Autoencoder (AE) and Information Bottleneck (IB) approaches to summarization. While AE (top) preserves any detail that helps to reconstruct the original, such as population size in this example, IB (bottom) uses context to determine which information is relevant, which results in a more appropriate summary.

Figure

Figure 3: Bar plot of per-token pgen and the entropy of the generation distribution (purple) and copy distribution (blue), plotted under the correlation contributions CC(pgen, Hgen) (purple) and CC(pgen, Hcopy) (blue) for randomly sampled CNN/DailyMail test summaries.

Figure

Figure 2: Distribution of pgen across all tokens in the test split of the CNN/DailyMail corpus. Sentence-final punctuation makes up 5% of tokens in the dataset, which accounts for 22% of pgen’s mass.

Figure

Figure 1: (Top) Correlation contributions CC(pgen, Hgen) (green) and CC(pgen, Hcopy) (purple) for a randomlysampled summary. (Bottom) Bar plot of per-token pgen (orange), and entropy of the generation distribution (green) and copy distribution (purple) for the same summary.

Figure

Figure 4: Bar plot of per-token pgen and the entropy of the generation distribution (purple) and copy distribution (blue), plotted under the correlation contributions CC(pgen, Hgen) (purple) and CC(pgen, Hcopy) (blue) for randomly sampled XSum test summaries.

Figure

Figure 1: Illustration of the neural coherence model, which is built upon ARC-II proposed by Hu et al. (2014).

Figure

Figure 1: Model Framework. The top figure describes the framework for contrastive learning, where for each document x we create different types of negative samples and compare them with x to get a ranking loss. The bottom figure is the evaluator, which generates the final evaluation score. For brevity, we use SS, SL and SLS to denote the S Score, L Score and LS Score.

Figure

Figure 3: Comparison of HT, GraphSum (GSum in the figure), and BASS under various input token lengths.

Figure

Figure 4: Illustration of novel n-grams in generated summaries from different systems.

Figure

Figure 1: Illustration of a unified semantic graph and its construction procedure for a document containing three sentences. In Graph Construction, underlined tokens represent phrases, and co-referent phrases are shown in the same color. In The Unified Semantic Graph, nodes of different colors indicate different types, according to Section 3.1.

Figure

Figure 2: Illustration of our graph-based summarization model. The graph node representations are initialized by merging token representations at two levels. The graph encoder models the augmented graph structure. The decoder attends to both token and node representations and utilizes the graph structure via graph-propagation attention.

Figure

Figure 4: The figure compares high-frequency semantic units and the semantic units in the summary of each article in CNN/Daily Mail, which includes 287k article-summary pairs in total. The x-axis represents the ratio of high-frequency semantic units that also show up in summaries. The y-axis is the number of articles in the CNN/Daily Mail training set. The threshold for cosine similarity is set to 0.5.

Figure

Figure 1: Our training and inference stages. The semantic unit embeddings with darker colors indicate that greater attention mask values are applied.

Figure

Figure 2: Our model overview. (a) The two-stage training process. (b) The inference process.

Figure

Figure 3: Comparison of the mean gold scores assigned for Q2 and Q3 to each of the 32 systems in the DUC05 dataset, and the corresponding scores predicted by SUM-QE. Scores range from 1 to 5. The systems are sorted in descending order according to the gold scores. SUM-QE makes more accurate predictions for Q2 than for Q3, but struggles to put the systems in the correct order.

Figure

Figure 1: SUM-QE rates summaries with respect to five linguistic qualities (Dang, 2006a). The datasets we use for tuning and evaluation contain human assigned scores (from 1 to 5) for each of these categories.

Figure

Figure 2: Illustration of different flavors of the investigated neural QE methods. An encoder (E) converts the summary to a dense vector representation h. A regressor Ri predicts a quality score SQi using h. E is either a BiGRU with attention (BiGRU-ATT) or BERT (SUM-QE). R has three flavors, one single-task (a) and two multi-task (b, c).

Figure

Figure 1: A histogram illustrating the score distribution in the real learner data

Figure

Figure 3: The merged LSTM model

Figure

Figure 4: Attention mechanism architecture in the attention-based LSTM model for summary assessment

Figure

Figure 5: Combining three approaches using ensemble modelling

Figure

Figure 2: Similarity matrices of two summaries for the same reading passage from the simulated learner data. Summary A is a good summary and Summary B is a bad summary. The rows of the matrix represent sentences in the summary and the columns of the matrix represent sentences in the reading passage.

Figure

Figure 2: A comparison between our model, SummaRuNNer and the Oracle when applied to documents of increasing length. Top-left: ROUGE-1 on the Pubmed dataset; top-right: ROUGE-2 on the Pubmed dataset; bottom-left: ROUGE-1 on the arXiv dataset; bottom-right: ROUGE-2 on the arXiv dataset.

Figure

Figure 1: The structure of our model. sei and sri represent the sentence embedding and sentence representation of sentence i, respectively. The binary decision of whether the sentence should be included in the summary is based on the sentence itself (A), the whole document (B) and the current topic (C). The document representation is simply the concatenation of the last hidden states of the forward and backward RNNs, while the topic segment representation is computed by applying LSTM-Minus, as shown in detail in the left panel (Detail of C).
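
The LSTM-Minus computation mentioned in the caption can be sketched as boundary subtractions over the bidirectional RNN states; the indexing and zero-vector boundary handling below follow one common convention and may differ from the paper's exact formulation.

import numpy as np

def lstm_minus_segment(h_fwd, h_bwd, i, j):
    """LSTM-Minus sketch: represent the topic segment spanning sentences i..j
    (inclusive, 0-indexed) by subtracting hidden states at the segment boundaries.
    h_fwd, h_bwd: (num_sentences, dim) hidden states of the forward/backward RNNs.
    Positions outside the document are treated as zero vectors (an assumed convention)."""
    dim = h_fwd.shape[1]
    left = h_fwd[i - 1] if i > 0 else np.zeros(dim)
    right = h_bwd[j + 1] if j + 1 < h_bwd.shape[0] else np.zeros(dim)
    fwd_seg = h_fwd[j] - left          # what the forward RNN accumulated inside the segment
    bwd_seg = h_bwd[i] - right         # what the backward RNN accumulated inside the segment
    return np.concatenate([fwd_seg, bwd_seg])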

Figure

Figure 4: Copy rate learning curve of two balancing mechanisms. Unbalanced 100: |m_r| = 10^2 |m_c|; Unbalanced 1000: |m_r| = 10^3 |m_c|.

Figure

Figure 7: The relative position distribution of different redundancy reduction methods on Pubmed(left) and arXiv(right) datasets.

Figure

Figure 4: Comparing the average ROUGE scores and average unique n-gram ratios of different models on the Pubmed dataset, conditioned on different degrees of redundancy and lengths of the document (extremely long documents, i.e., 1% of the dataset, are not shown because of space constraints).

Figure

Figure 5: Comparing the average ROUGE scores and average unique n-gram ratios of different models with different word length limits on the Pubmed dataset. See Appendices for similar results on arXiv.

Figure

Figure 3: The average ROUGE scores, average unique n-gram ratios, and average NID scores for different λ used in MMR-Select on the validation set. Recall that the higher the unique n-gram ratio and the lower the NID, the less redundancy the summary contains.
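
A small sketch of the MMR-style selection that the λ parameter controls, with placeholder `relevance` and `similarity` callables; it illustrates the salience/redundancy trade-off rather than reproducing MMR-Select exactly.

def mmr_select(sentences, relevance, similarity, budget=3, lam=0.6):
    """Greedily pick the sentence maximizing
    lam * relevance - (1 - lam) * max similarity to already-selected sentences,
    trading off salience against redundancy. `relevance(s)` and `similarity(a, b)`
    stand in for whatever scoring the model uses."""
    selected = []
    candidates = list(sentences)
    while candidates and len(selected) < budget:
        def mmr(s):
            redundancy = max((similarity(s, t) for t in selected), default=0.0)
            return lam * relevance(s) - (1.0 - lam) * redundancy
        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected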

Figure

Figure 1: Information amount evaluation with language models. Here we take the subsequence x3x4 as an example. [M] denotes a mask, and PLMs/MLMs/ALMs are three different options for language models. I(x3x4 | ·) = − log[P(x3 | ·) P(x4 | ·)], where the conditions for different models are omitted for brevity.

Figure

Figure 1: Sample summary of an article from the test set of CNN/DailyMail corpus (Hermann et al., 2015). The words used in summary are colored. In this sample, one sentence (green) is directly copied from article and two are rewritten to be concise.

Figure

Figure 3: Architecture of our hierarchical reinforcement learning and reward composition (green lines) of extraction.

Figure

Figure 6: Comparing the average ROUGE scores and average unique n-gram ratios of different models on the arXiv dataset, conditioned on different degrees of redundancy and lengths of the document.

Figure

Figure 1: The average unique n-gram ratio in the documents across different datasets. To reduce the effect of length difference, stopwords were removed.

Figure

Figure 5: Summary comparison with the baseline models. Overlapped content is colored. Our model overlaps the most with human summary and generates less redundancy.

Figure

Figure 2: The pipeline of the MMR-Select+ method, where Ŝ, Ŷ and S̄, Ȳ are the summaries and labels generated by the MMR-Select algorithm and the normal greedy algorithm, respectively. S and Y are the ground-truth summary and the oracle labels.

Figure

Figure 2: The overview of the HYSUM framework. The hierarchical representation module first encodes the article sentences s_i into vectors h_j. Then each sentence vector becomes two versions by adding two different markers m_c, m_r. When the pointer network (arrows denote attention and darker color represents higher weights) selects the copy version h_i^c of a sentence, it will be copied. Otherwise, when the rewriting version h_i^r is selected, the sentence will be rewritten to reduce redundancy.

Figure

Figure 2: Comparison of CoCo values assigned to high (top 50%) and low (bottom 50%) human judgments.

Figure

Figure 1: (a) The causal graph of text summarization reflects the causal relationships among the fact C, source document X , language prior K, and the modelgenerated summary Y . (b) According to Eq. (6), the causal effect of X on Y can be obtained by subtracting the effect of K on Y from the total effect.

Figure

Figure 2: The position distribution of extracted sentences by different models on the PubMed-Long test set.

Figure

Figure 1: The model architecture of GRETEL

Figure

Figure 2: Relative position distributions of selected sentences in the original document of two testing corpora (CNN/DM and XSum), obtained by different lead bias demoting strategies.

Figure

Figure 1: The overall architecture of our proposed lead bias demoting method.

Figure

Figure 1: Architecture of the EDU rewriter. The group tag embedding builds a connection between the encoder and decoder.

Figure

Figure 1: Pyramid scores of KPLM+TSR under different summary length limits.

Figure

Figure 3: Sentence extraction module of JECS. Words in input document sentences are encoded with BiLSTMs. Two layers of CNNs aggregate these into sentence representations hi and then the document representation vdoc. This is fed into an attentive LSTM decoder which selects sentences based on the decoder state d and the representations hi, similar to a pointer network.

Figure

Figure 1: Diagram of the proposed model. Extraction and compression are modularized but jointly trained with supervision derived from the reference summary.

Figure

Figure 5: Effect of changing the compression threshold on CNN. The y-axis shows the average of the F1 of ROUGE-1,-2 and -L. The dotted line is the extractive baseline. The model outperforms the extractive model and achieves nearly optimal performance across a range of threshold values.

Figure

Figure 2: Text compression example. In this case, “intimate”, “well-known”, “with their furry friends” and “featuring ... friends” are deletable given compression rules.

Figure

Figure 4: Text compression module. A neural classifier scores the compression option (with their furry friends) in the sentence and broader document context and decides whether or not to delete it.

Figure

Figure 2: An example of masked sentences prediction. The third sentence in the document is masked and the hierarchical encoder encodes the masked document. We then use TransDecS to predict the original sentence one token at a time.

Figure

Figure 1: Next token entropies computed on 10K generation steps from PEGASUSCNN/DM, PEGASUSXSUM, BARTCNN/DM and BARTXSUM respectively, broken into two cases: an Existing Bigram means the bigram just generated occurs in the input document, while a Novel Bigram is an organic model generation. These cases are associated with low entropy and high entropy actions, respectively. The x-axis shows the entropy (truncated at 5), and the y-axis shows the count of bigram falling in each bin. The dashed lines indicate the median of each distribution.
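
The entropy analysis in this caption can be reproduced in outline as follows; the sketch assumes access to the model's next-token probabilities and a precomputed set of source bigrams, and glosses over tokenization details.

import math

def next_token_entropy(probs):
    """Shannon entropy (nats) of the model's next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def bucket_step(prev_token, new_token, probs, source_bigrams):
    """Label a generation step as an 'existing' bigram (copy-like) or a 'novel' bigram
    (organic generation) depending on whether the just-generated bigram occurs in the
    source, and pair it with that step's entropy (mirrors the caption's analysis)."""
    label = "existing" if (prev_token, new_token) in source_bigrams else "novel"
    return label, next_token_entropy(probs)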

Figure

Figure 2: A document (left) weighted with respect to a reference summary and two system outputs (right), with Corr-F/Corr-A scores. The colour represents the sum of argument- and fact-level weights for each token (Eqs. 3 and 4). The darker the colour, the more important the fact is.

Figure

Figure 1: Illustration of DISCOBERT for text summarization. The sentence-based BERT model (baseline) selects whole sentences 1, 2 and 5. The proposed discourse-aware model DISCOBERT selects EDUs {1-1, 2-1, 5-2, 20-1, 20-3, 22-1}. The right side of the figure illustrates the two discourse graphs we use: (i) Coref(erence) Graph (with the mentions of ‘Pulitzer prizes’ highlighted as examples); and (ii) RST Graph (induced by RST discourse trees).

Figure

Figure 1: The framework of our proposed model. Based on the encoder self-attention graph, we calculate the centrality score for each source word to guide the copy module.

Figure

Figure 4: Proportion of extracted sentences by different unsupervised models against their positions.

Figure

Figure 3: Correlating syntactic distance between neighboring tokens with the entropy change in those tokens’ generation decisions for PEGASUS summaries. The median entropy change is depicted as a dashed black line. At points of high syntactic distance, the model’s behavior is less restricted by the context, correlating with higher entropy.

Figure

Figure 6: The instruction for argument-level human highlight annotation.

Figure

Figure 4: Correlation between attention entropy and prediction entropy of PEG(ASUS) and BART on C(NN/DM) and X(Sum). We compute the mean value of the attention entropy within each bucket of prediction entropy. The uncertainty of attention strongly correlates with the entropy of the model’s prediction.

Figure

Figure 2: KL divergence with ROUGE F1 on the Gigaword test set for the SAGCopy Indegree-1 model. Each point in the above plots represents a sample. The bottom plots show the average ROUGE score for different KL values.

Figure

Figure 5: The human annotation interface for fact level. Human judges are required to highlight content in the document that is supporting the fact printed in bold “The Queen has tweeted her thanks” (FACT1 of the summary in Figure 1 in the paper).

Figure

Figure 3: (Left) Model architecture of DISCOBERT. The Stacked Discourse Graph Encoders contain k stacked DGE blocks. (Right) The architecture of each Discourse Graph Encoder (DGE) block.

Figure

Figure 3: The guidance process for SAGCopy Indegree model, showing that the keyword “northern” is correctly copied for our model.

Figure

Figure 2: Prediction entropy values by relative sentence positions. For example, 0.0 indicates the first 10% of tokens in a sentence, and 0.9 is the last 10% of tokens. PEGASUSCNN/DM and BARTCNN/DM make highly uncertain decisions to start, but then entropy decreases, suggesting that these models may be copying based on a sentence prefix. Entropies on XSum are more constant across the sentence.

Figure

Figure 5: Vocabulary projected attention attending to the last input yt−2, current input yt−1, current output yt, and next output yt+1. When the prediction entropy is low, the attention mostly focuses on a few tokens, including the current input yt−1 and current output yt.

Figure

Figure 7: The human annotation interface for argument level. Human judges are required to highlight content in the document that is supporting the phrase printed in bold “on social media” (argument ARGM-LOC of FACT2 of the summary in Figure 1 in the paper).

Figure

Figure 1: List of SRL propositions and corresponding tree MR with two facts for the sentence “The queen has tweeted her thanks to people who sent her 90th birthday messages on social media”.

Figure

Figure 1: The architecture of our hierarchical encoder: the token-level Transformer encodes tokens, and the sentence-level Transformer then learns final sentence representations from the representations at <s>.

Figure

Figure 3: Another document (left) weighted with respect to a reference summary and two system outputs (right), with Corr-F/Corr-A scores (see Fig. 2 for details).

Figure

Figure 9: Human highlight annotation for the FACT1 “The Queen has tweeted her thanks” of the summary in Figure 1 in the paper.

Figure

Figure 2: Example of discourse segmentation and RST tree conversion. The original sentence is segmented into 5 EDUs in box (a), and then parsed into an RST discourse tree in box (b). The converted dependency-based RST discourse tree is shown in box (c). Nucleus nodes ([2], [3] and [5]) and Satellite nodes ([1] and [4]) are denoted by solid lines and dashed lines, respectively. Relations are in italic. The EDU [2] is the head of the whole tree (span [1-5]), while the EDU [3] is the head of the span [3-5].

Figure

Figure 4: The instruction for fact-level human highlight annotation.

Figure

Figure 3: An example of Sentence Shuffling. The sentences in the document are shuffled and passed through the hierarchical encoder; a Pointer Network with TransDecP as its decoder then predicts the positions of the original sentences in the shuffled document.

Figure

Figure 8: Human highlight annotation for the argument ARG1 of FACT1 “her thanks” of the summary in Figure 1 in the paper.

Figure

Figure 5: Visualization of attention weights between the historical summaries and the input review.

Figure

Figure 4: Ablation experiments on the Sports dataset. Figure (a) shows the results on the ROUGE-1 and ROUGE-2 metrics, and Figure (b) shows the results on the ROUGE-L metric.

Figure

Figure 3: Four-way evaluation of our content attribution methods. The reported value is the NLL loss with respect to the predicted token. Lower is better for display methods and higher is better for removal methods (we "break" the model more quickly). n = 0 is the baseline, where no token or sentence is displayed in DISP or removed or masked in RM.

Figure

Figure 3: The framework of our model. Attentions marked in grey are from the naive Transformer. Then, the reasoning unit consists of the inter-reasoning attention marked in green and the personalized intra-reasoning attention marked in yellow. Finally, the memory-decoder attention marked in red incorporates the historical reasoning memory into the decoder layer.

Figure

Figure 1: An example of product review and its corresponding summary and historical summaries of corresponding user and product. We mark the relevant historical summaries in red.

Figure

Figure 2: Comparison of Summarization tasks. Single-document Summarization (SDS task) focuses on generating summary S based on a single document D. Multi-document Summarization (MDS task) creates a holistic summary S covering multiple articles D. The MIRANEWS task differs by producing summary S based only on the events pertinent to the main article D, while drawing on a set of assisting documents A for complementary background.

Figure

Figure 1: An example where the summary (top section) contains information that is not explicitly included in its main document (middle section), but is covered in the related assisting document (bottom section). We highlight the information in the summary that is aligned to its corresponding main and assisting documents with yellow and pink colors, respectively.

Figure

Figure 4: An example from MIRANEWS, where the key information in the gold summary and summaries generated by systems conditioning on the main document (BART-S) or both on the main and assisting documents (rest variants) was only mentioned in the assisting documents. Facts in the gold summary supported by the assisting documents…

Figure

Figure 1: Our two-stage ablation-attribution framework. First, we compare a decoder-only language model (not fine-tuned on the summarization task and not conditioned on the input article) with a full summarization model; they are colored in gray and orange, respectively. The higher the difference, the more heavily the model depends on the input context. For those context-dependent decisions, we conduct content attribution to find the relevant supporting content with methods such as Integrated Gradients or Occlusion.
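
A minimal sketch of the context-dependence comparison described here, assuming access to the two next-token distributions; the L1 distance is one plausible choice of distance, not necessarily the paper's exact metric:

```python
# Minimal sketch: measure how much a generation decision depends on the input
# article by comparing the full summarization model's next-token distribution
# with that of a context-free language model. The distance choice is an
# assumption, not the paper's exact setup.
import numpy as np

def context_dependence(p_full, p_lm_no_context):
    """L1 distance between two next-token distributions
    (higher = the decision depends more on the input context)."""
    p_full = np.asarray(p_full, dtype=np.float64)
    p_lm = np.asarray(p_lm_no_context, dtype=np.float64)
    return float(np.abs(p_full - p_lm).sum())
```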

Figure

Figure 3: Example of a page on newser.com: a newser.com article is a news event including editor-picked links to relevant news articles from other news websites. This example shows the webpage https://www.newser.com/story/305823/starship-prototype-lands-doesnt-explode.html. In the webpage (D1), three extra news pieces (D2, D3, D4) from nytimes, newser, and CNBC are linked. All four news articles report on the same event of the Starship prototype landing.

Figure

Figure 6: Several generated summaries together with the corresponding review and reference summary.

Figure

Figure 2: Map of model behavior on XSum (top) and CNN/DM (bottom). The x-axis and y-axis show the distance between LM∅ and Sfull, and distance between S∅ and Sfull. The regions characterize different generation modes, defined in Section 3.

Figure

Figure 6: Ablation on the number of keyphrases in testing.

Figure

Figure 3: Distribution of the number of keyphrases.

Figure

Figure 5: Case study.

Figure

Figure 2: General framework of our model. It has three main parts: the keyphrase prediction network, the induction network, and the condition generation network.

Figure

Figure 1: Proposed summarization framework: generative process and neural parametrization. Shaded nodes represent observed variables, unshaded nodes indicate latent variables, arrows represent conditional dependencies between variables, and plates refer to repetitions of sampling steps. Dashed lines denote optional queries at test time. Latent queries create a query-focused view of the input document, which together with a query-agnostic view serve as input to a decoder for summary generation.

Figure

Figure 4: Fine-tuning the hyper-parameter λ.

Figure

Figure 1: Our framework trains guidance induction and summary generation jointly. It avoids the domain mismatch of external tools, and the guidance extraction is refined during training.

Figure

Figure 6: Screenshot of the (partial) screening test workers had to pass before participating in the HITs.

Figure

Figure 4: Snapshot of the page with instructions for users on the data collection interface.

Figure

Figure 5: Interface to construct summary used to collect data from workers on Amazon Mechanical Turk.

Figure

Figure 2: Intents used in the SUBSUME dataset.

Figure

Figure 7: Screenshot of the summary overview page.

Figure

Figure 3: ROUGE-L F1 for SBERT-EX and SBERT-QB for each intent. From left to right, intents are ordered in increasing order of their subjectiveness score shown in Table 1. The Pearson’s correlation between the subjectiveness score and the F1 score for SBERT-EX and SBERT-QB is −0.97 and −0.77 respectively.
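
A minimal sketch of how such a correlation can be computed with SciPy; the per-intent numbers below are placeholders, not the dataset's actual scores:

```python
# Minimal sketch: correlate per-intent subjectiveness scores with per-intent
# ROUGE-L F1, as in the reported -0.97 / -0.77 correlations.
from scipy.stats import pearsonr

subjectiveness = [0.1, 0.3, 0.5, 0.7, 0.9]       # hypothetical per-intent scores
rouge_l_f1     = [0.42, 0.38, 0.33, 0.29, 0.25]  # hypothetical SBERT-EX F1 values
r, p_value = pearsonr(subjectiveness, rouge_l_f1)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
```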

Figure

Figure 3: Selected RTT questions with the FQD, PRQD and QSV objective measures.

Figure

Figure 2: Question semantic volume maximization using a convex hull. Selected and non-selected candidate RTT questions are distinguished by different markers. The left-side figure shows a toy example of the convex hull. The right-side figure shows the selected RTT questions with respect to the gold question.
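
A minimal sketch of convex-hull-based selection over question embeddings; the 2-D PCA projection is an assumption added so the hull stays well defined with few candidates, and is not necessarily how the paper computes it:

```python
# Minimal sketch: select a diverse subset of candidate RTT questions as the
# vertices of the convex hull of their (dimensionality-reduced) embeddings.
import numpy as np
from sklearn.decomposition import PCA
from scipy.spatial import ConvexHull

def select_hull_questions(question_embeddings):
    """question_embeddings: (n_candidates, dim) array; returns selected indices."""
    points = PCA(n_components=2).fit_transform(np.asarray(question_embeddings))
    hull = ConvexHull(points)
    return sorted(hull.vertices.tolist())
```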

Figure

Figure 1: The highlighted text shows important key aspects of the question which need to be considered while generating the summary.

Figure

Figure 2: Learning curves of HATS in terms of ROUGE-L.

Figure

Figure 1: The overall architecture of our model.

Figure

Figure 5: Proportion of novel n-grams in summaries generated by different models on the CNN/DM test set.

Figure

Figure 4: An example of a generated summary by TED. The reference summary and parts of the input article are also included.

Figure

Figure 1: Overall structure of our model. TED first pretrains on news articles and then finetunes with theme modeling and denoising (from left to right).

Figure

Figure 2: An example of the pretraining task: predict the Lead-3 sentences (as the target summary) using the rest of the article.

Figure

Figure 3: Theme modeling essentially updates TED with a semantic classifier. The input sentence pair is first processed by adding a "class" token at the beginning and a "separation" token between the two sentences. The sentence pair is then fed into the transformer encoder, and the first output vector is classified as "similar" or "distinct".

Figure

Figure 1: An example of a long document abstractive summary from the LongSumm data set, presented using SUMMVis (Vig et al., 2021).

Figure

Figure 3: The first three images are test cases in the top quartile of both ROUGE and BERTSCORE. The last four images are complementary high-quality summaries in the top quartile suggested by SPICE and BLEU. The figures depict portions of the source document that align with the system-generated summary.

Figure

Figure 2: BART summaries in the ROUGE top quartile (left) and the SPICE top quartile (right).

Figure

Figure 1: Abstract and citations of (Bergsma and Lin 2006). The abstract emphasizes their pronoun resolution techniques and improved performance; the citation sentences reveal that their noun gender dataset is also a major contribution to the research community, but it is not covered in the abstract.

Figure

Figure 2: Overview of the dataset construction process.

Figure

Figure 4: Comparison of our hybrid summary with the abstract and pure cited text spans summary, for paper P05-1004 in the CL-SciSumm 2016 test set. Our hybrid summary covers both the authors’ original motivations (green) and the technical details influential to the research community (red).

Figure

Figure 3: Overview of our summarization models.

Figure

Figure 1: CNNLM for learning sentence representations. d = 300, l = 3; 3 left context words are used in this figure.

Figure

Figure 2: The saliency-selection network.

Figure

Figure 1: The focus-attention mechanism.

Figure

Figure 5: Semantic similarity between model outputs and reference summaries. Desired length is set as reference summary length.

Figure

Figure 2: TAPT performance over different pretraining epoch numbers in the email domain in terms of using and not using RecAdam.

Figure

Figure 2: Examining hyperparameter ℵ on the AEG dataset. ROUGE-L F1 scores and Length Variance Var of different models under different ℵ are shown (ℵ = 2, 10, 50, 250).

Figure

Figure 3: ROUGE-1 results of BART fine-tuning, DAPT and SDPT over different numbers of training data for email (left) and dialog (right) domains. We consider both low-resource settings (50, 100, 200 and 300 (∼2%) samples), medium-resource settings (25% and 50% samples), and high-resource settings (75% and 100% samples).

Figure

Figure 3: Length distribution of reference summaries on the Annotated English Gigaword dataset. Summaries with 30 to 75 characters cover the majority cases.

Figure

Figure 1: Illustration of the Length Attention Unit. Firstly, decoder hidden state (blue) and remaining length (yellow) are employed to compute the attention weights al. Then, the length context vector clt (green) is produced by calculating the weighted sum between attention weights and pre-defined length embeddings (purple). Better viewed in color.
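
A minimal PyTorch sketch of a length-attention unit along these lines; dimensions, embedding tables, and the scoring function are assumptions, not the paper's exact parameterization:

```python
# Minimal sketch: the decoder state and the remaining-length embedding attend
# over a table of predefined length embeddings to produce a length context
# vector (the weighted sum of length embeddings).
import torch
import torch.nn as nn

class LengthAttention(nn.Module):
    def __init__(self, hidden_dim, len_dim, num_length_bins):
        super().__init__()
        self.length_table = nn.Embedding(num_length_bins, len_dim)   # predefined length embeddings
        self.remaining_emb = nn.Embedding(num_length_bins, len_dim)  # remaining-length embedding
        self.score = nn.Linear(hidden_dim + 2 * len_dim, 1)

    def forward(self, decoder_state, remaining_len):
        # decoder_state: (batch, hidden_dim); remaining_len: (batch,) bin indices
        batch = decoder_state.size(0)
        table = self.length_table.weight.unsqueeze(0).expand(batch, -1, -1)   # (B, K, len_dim)
        rem = self.remaining_emb(remaining_len).unsqueeze(1).expand_as(table)
        query = decoder_state.unsqueeze(1).expand(batch, table.size(1), -1)
        logits = self.score(torch.cat([query, rem, table], dim=-1)).squeeze(-1)  # (B, K)
        weights = torch.softmax(logits, dim=-1)                                  # attention weights a^l
        context = torch.bmm(weights.unsqueeze(1), table).squeeze(1)              # length context c^l_t
        return context, weights
```

The weighted sum over the embedding table is the design choice highlighted in the caption: the length context vector changes smoothly as the remaining budget shrinks.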

Figure

Figure 4: Length distribution of reference summaries on the CNN/Daily Mail dataset. Summaries exceeding 2000 characters are ignored, since they only cover 0.009% of the dataset.

Figure

Figure 2: The framework of the summarization model.

Figure

Figure 1: A dependency tree example. We extract the following two fact descriptions: Ahmadinejad essentially called Yukiya Amano, the director general of the IAEA, a U.S. puppet ||| said the U.N. agency has no jurisdiction in Iran and Iraq

Figure

Figure 1: The overall architecture of our proposed method and the training process for an inconsistent example.

Figure

Figure 1: Latent variable extractive summarization model. sent_i is a sentence in a document and sum_sent_i is a sentence in a gold summary of the document.

Figure

Figure 1: Visualization of copy probabilities

Figure

Figure 2: Average ROUGE-L improvement on CNN/Daily Mail test set samples with different gold summary lengths.

Figure

Figure 1: Model overview. N represents the number of decoder layers and L represents the summary length.

Figure

Figure 2: The architecture of our extractive summarization model. The sentence and document level transformers can be pretrained.

Figure

Figure 1: The architecture of HIBERT during training. sent_i is a sentence in the document above, which has four sentences in total. sent_3 is masked during encoding and the decoder predicts the original sent_3.

Figure

Figure 3: Truncated examples from the test sets along with human, PG baseline and RLR+C outputs. Factual accuracy scores (s) are also shown for the model outputs. For the Stanford example, clinical observations

Figure

Figure 4: Distributions of the top 10 most frequent trigrams from model outputs on the Stanford test set.

Figure

Figure 1: A (truncated) radiology report and summaries with their ROUGE-L scores. Compared to the human summary, Summary A has high textual overlap (i.e., ROUGE-L) but makes a factual error; Summary B has a lower ROUGE-L score but is factually correct.

Figure

Figure 5: More examples from the test splits of both datasets along with human, PG baseline and RLR+C summaries. In the first example, the baseline output successfully copied content from the context, but missed important observations. In the second exam-

Figure

Figure 2: Our proposed training strategy. Compared to existing work which relies only on a ROUGE reward rR, we add a factual correctness reward rC which is enabled by a fact extractor. The summarization model is updated via RL, using a combination of the NLL loss, a ROUGE-based loss and a factual correctness-based loss. For simplicity we only show a subset of the clinical variables in the fact vectors v and v̂.
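
A minimal sketch of how such a combined objective can be formed, assuming self-critical reinforcement learning with a greedy-decoding baseline; the weighting scheme and names are illustrative rather than the paper's exact formulation:

```python
# Minimal sketch: combine an NLL term with a policy-gradient term whose reward
# mixes a ROUGE-based reward and a factual-correctness reward from a fact
# extractor. The self-critical baseline and the weights are common choices.
import torch

def combined_loss(nll, logp_sample, r_rouge, r_fact, r_rouge_base, r_fact_base,
                  lambda_nll=1.0, lambda_rl=1.0, beta=0.5):
    # Mixed reward for the sampled summary and for the greedy baseline summary.
    reward = (1 - beta) * r_rouge + beta * r_fact
    baseline = (1 - beta) * r_rouge_base + beta * r_fact_base
    # Self-critical policy-gradient loss: reward advantage times sample log-prob.
    rl_loss = -(reward - baseline) * logp_sample
    return lambda_nll * nll + lambda_rl * rl_loss
```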

Figure

Figure 1: The architecture of our hierarchical T5.

Figure

Figure 4: Lite2.xPyramid curves (for system-level correlations) and its comparison to replacing random sentences’ SCUs with STUs.

Figure

Figure 5: A part of the Amazon Mechanical Turk webpage used for human evaluation.

Figure

Figure 3: Precision-recall curves on different datasets, taking FactCC as the baseline.

Figure

Figure 2: Illustration of the consistency checking stage. In the evidence reasoning process, each sentence pair is scored by a reasoning model. The scores are then combined into a consistency score in the evidence aggregation process.

Figure

Figure 3: The Amazon Mechanical Turk user interface for collecting human labels of SCUs’ presence.

Figure

Figure 1: The illustration of our metrics. This data example is from REALSumm (Bhandari et al., 2020) (we omit unnecessary content by ‘...’). For gold labels, ‘1’ stands for ‘present’ and ‘0’ stands for ‘not present’. Other scores are the 2-class entailment probabilities, p2c(e), from our finetuned NLI model.

Figure

Figure 4: A part of the Amazon Mechanical Turk webpage used for collecting summaries.

Figure

Figure 1: Our two-stage fact consistency assessment framework.

Figure

Figure 2: Number of entities in the generated summary from BART and ECC.

Figure

Figure 1: Visualization of teacher cross attention weights when generating pseudo labels with normal (left) and smoothed (right) attention weights. This example is generated by the BART teacher trained on CNNDM (see §4.4). Its training and inference hyperparameters are described in detail in §4.2.
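
A minimal sketch of attention-weight smoothing via a softmax temperature, which is the operation this figure contrasts (the temperature value is illustrative):

```python
# Minimal sketch: smooth cross-attention by re-normalizing the attention logits
# with a temperature T > 1 before the softmax (T = 1 recovers normal attention).
import torch

def smoothed_attention(attn_logits, temperature=2.0):
    # attn_logits: (..., num_source_tokens) unnormalized attention scores
    return torch.softmax(attn_logits / temperature, dim=-1)
```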

Figure

Figure 1: Example pipeline of our method to generate to-do items from email. Highlight sentences are extracted from the email text. Highlight action nodes are extracted from the constructed action graph. The highlight actions and sentences are then utilized to generate the to-do item.

Figure

Figure 2: Model Architecture. The sentences and actions are first encoded and then fed to the highlight classifiers. The hidden representations of sentences and actions, along with their probability of being highlights are then used in the cross-attention layer in the decoder. The email encoder has the same structure as BART encoder. The graph encoder utilizes graph attention networks to encode the action graph.

Figure

Figure 3: ROUGE scores when varying the value of k in the top k sentences extracted as highlights.

Figure

Figure 4: Example 2 of the visualization of cross-attention weights when the student generates summaries with different attention temperatures.

Figure

Figure 3: Example 1 of the visualization of cross-attention weights when the student generates summaries with different attention temperatures.

Figure

Figure 2: Distributions of evident cross-attention weights (≥ 0.15) with respect to token positions when teachers generate pseudo labels with different attention temperatures.

Figure

Figure 1: Entity Coverage Control for seq2seq model.

Figure

Figure 1: Architecture of HERMAN. Note that the binary classifier for predicting whether a summary is verified (z labels) is omitted here. It simply takes the context vectors of the summary tokens and runs them through an MLP classifier.

Figure

Figure 1: PACSUM’s performance against different values of λ1 on the NYT validation set with λ2 = 1. Optimal hyper-parameters (λ1, λ2, β) are (−2, 1, 0.6).
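
A simplified sketch of position-weighted centrality in the spirit of PACSUM; the default hyper-parameters mirror the optimum reported in the caption, but the thresholding details are a simplified reading rather than a faithful reimplementation:

```python
# Minimal sketch: threshold pairwise sentence similarities (controlled by beta)
# and weight backward- and forward-looking edges by lambda1 and lambda2 to
# score each sentence.
import numpy as np

def position_weighted_centrality(sim, lambda1=-2.0, lambda2=1.0, beta=0.6):
    sim = np.asarray(sim, dtype=np.float64)      # (n, n) sentence similarity matrix
    n = sim.shape[0]
    threshold = sim.min() + beta * (sim.max() - sim.min())
    e = sim - threshold                          # negative entries act as penalties
    scores = np.zeros(n)
    for i in range(n):
        scores[i] = lambda1 * e[i, :i].sum() + lambda2 * e[i, i + 1:].sum()
    return scores
```

With a negative λ1, similarity to earlier sentences is penalized, which is what the caption's optimum (λ1 = −2) encodes for news-style lead bias.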

Figure

Figure 1: An overview of our model which generates summaries with guiding entities. Instead of generating summaries from left to right, our approach can control the process of generation and incorporate selected entities into summaries precisely.

Figure

Figure 3: The enhanced abstract generator of our LSTM-L module. To make the model generate different entities, we encode all possible entities to initialize the LSTM-L generator. This can also guide the LSTM-R to generate different entities.

Figure

Figure 2: Our controllable neural model with guiding entities. The original article texts are encoded with a BiLSTM layer. We utilize a pretrained BERT named entity recognition tool to extract entities from input texts. The decoder consists of two LSTMs: LSTM-L and LSTM-R. Our model starts generating the left and right part of a summary with selected entities and can guarantee that entities appear in final output summaries.

Figure

Figure 4: The total entity (a) and novel entity (b) proportion of our model outputs compared to different baselines on Gigaword dataset. Our model can generate summaries with significantly more entities than other methods.

Figure

Figure 2: The position and content attribution on CNN/DM, the test set is broken down based on COMPRESSION.

Figure

Figure 2: Results of different document encoders with Pointer on normal and shuffled CNN/DailyMail. ∆R denotes the decrease of performance when the sentences in document are shuffled.

Figure

Figure 1: Different behaviours of two decoders (SeqLab and Pointer) under different testing environment. (a) shows repetition scores of different architectures when extracting six sentences on CNN/DailyMail. (b) shows the relationship between ∆R and positional bias. The abscissa denotes the positional bias of six different datasets and ∆R denotes the average ROUGE difference between the two decoders under different encoders. (c) shows average length of k-th sentence extracted from different architectures.

Figure

Figure 1: The accuracy on CNN/DM dataset, test set is broken down based on P-Value and C-Value.

Figure

Figure 3: ∆(D) for different datasets.

Figure

Figure 1: MATCHSUM framework. We match the contextual representations of the document with gold summary and candidate summaries (extracted from the document). Intuitively, better candidate summaries should be semantically closer to the document, while the gold summary should be the closest.
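
A minimal sketch of the matching step: score candidates by cosine similarity between candidate and document embeddings and return the closest one (the embeddings are assumed to come from a Siamese encoder, which is not shown here):

```python
# Minimal sketch: pick the candidate summary whose embedding is semantically
# closest to the document embedding.
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, dtype=np.float64), np.asarray(b, dtype=np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def best_candidate(doc_embedding, candidate_embeddings):
    scores = [cosine(doc_embedding, c) for c in candidate_embeddings]
    return int(np.argmax(scores)), scores
```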

Figure

Figure 2: Distribution of z(%) on six datasets. Because the number of candidate summaries for each document is different (short text may have relatively few candidates), we use z / number of candidate summaries as the X-axis. The Y-axis represents the proportion of the best-summaries with this rank in the test set.

Figure

Figure 5: ψ of different datasets. Reddit is excluded because it has too few samples in the test set.

Figure

Figure 4: Dataset splitting experiment. We split the test sets into five parts according to z described in Section 3.2. The X-axis from left to right indicates the subsets of the test set with the value of z from small to large, and the Y-axis represents the ROUGE improvement of MATCHSUM over BERTEXT on each subset.

Figure

Figure 4: ROUGE-2 F1 score on different groups of input sentences in terms of their length for s2s+att baseline and our SEASS model on English Gigaword test sets.

Figure

Figure 2: Overview of the Selective Encoding for Abstractive Sentence Summarization (SEASS).

Figure

Figure 3: Precision of extracted sentence at step t of the NN-SE baseline and the NEUSUM model.

Figure

Figure 1: Overview of the NEUSUM model. The model extracts S5 and S1 at the first two steps. At the first step, we feed the model a zero vector 0 to represent empty partial output summary. At the second and third steps, the representations of previously selected sentences S5 and S1, i.e., s5 and s1, are fed into the extractor RNN. At the second step, the model only scores the first 4 sentences since the 5th one is already included in the partial output summary.

Figure

Figure 2: Position distribution of selected sentences of the NN-SE baseline, our NEUSUM model and oracle on the test set. We only draw the first 30 sentences since the average document length is 27.05.

Figure

Figure 2: Examples of alignment results generated by our unsupervised method between the abstractive summaries and corresponding source sentences in the Gigaword test set.

Figure

Figure 1: Generative process of the contextual matching model.

Figure

Figure 2: The overview of the BERT-based model for sub-sentential extraction (SSE). In this simplified example, the document has 3 sentences. The first and the third sentences have two extraction units each and the second sentence has one. After encoding the document with the pre-trained BERT encoder, an average pooling layer is used to aggregate the information of each extraction unit. The final Transformer layer captures the document-level information and then the MLP predicts the extraction probability.

Figure

Figure 1: A screenshot example of the document-summary pair in the CNN/Daily Mail dataset.

Figure

Figure 1: Hierarchical Meeting Summary Network (HMNet) model structure. [BOS] is the special start token inserted before each turn, and its encoding is used in turn-level transformer encoder. Other tokens’ encodings enter the cross-attention module in decoder.

Figure

Figure 2: Percentage of novel n-grams in the reference and the summaries generated by HMNet and UNS (Shang et al., 2018) in AMI’s test set.

Figure

Figure 4: Qualitative analysis studying how the number of sections in a document affects model performance. The mean ROUGE and the ROUGE delta are reported.

Figure

Figure 2: Boundary distance distribution for summary-worthy sentences on the arXiv and PubMed datasets.

Figure

Figure 1: The model architecture of FASUM. It has L layers of transformer blocks in both the encoder and decoder. The knowledge graph is obtained from information extraction results and it participates in the decoder’s attention.

Figure

Figure 5: ROUGE scores for BART-LB on DUC2003 and DUC2004 under different hyper-parameters: beam width and maximum length of summary. Other hyper-parameters are set according to Table 3.

Figure

Figure 3: The detailed update process of word, sentence, and section nodes in HEROES. The figure is a toy example consisting of 3 sections, 7 sentences, and 5 unique words, where the vertical dashed lines are section boundaries. Green, blue, red nodes are word, sentence, section nodes involved in the update in this turn. Orange edges denote the direction of information flow.

Figure

Figure 2: Percentage of novel n-grams for summaries in XSum test set.

Figure

Figure 3: Ratio of novel n-grams, i.e. not in the article’s leading sentence, in summaries from BART, BART-LB, BARTFinetune and the reference summary in XSum’s test set.

Figure

Figure 2: The distribution of the overlapping ratio of nonstopping words between: (Red) the reference summary and the article; (Green) the reference summary and the article excluding the first 3 sentences, i.e. Rest; and (Blue) the leading 3 sentences, i.e. Lead-3, and Rest. The area under each curve is 1. All ratios are computed on CNN/DailyMail.

Figure

Figure 1: Average overlapping ratio of words between an article sentence and the reference summary, grouped by the normalized position of the sentence (the size of bin is 0.05). The ratio is computed on the training set of corresponding datasets.
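
A minimal sketch of the overlap statistic behind this plot, using naive whitespace tokenization purely for illustration:

```python
# Minimal sketch: for each article sentence, compute the ratio of its words
# that also appear in the reference summary, then group the ratios by the
# sentence's normalized position (bin size 0.05).
from collections import defaultdict

def overlap_by_position(article_sentences, summary, bin_size=0.05):
    summary_words = set(summary.lower().split())
    bins = defaultdict(list)
    n = len(article_sentences)
    for i, sent in enumerate(article_sentences):
        words = sent.lower().split()
        if not words:
            continue
        ratio = sum(w in summary_words for w in words) / len(words)
        position = i / n                      # normalized position in [0, 1)
        bins[round(position // bin_size * bin_size, 2)].append(ratio)
    return {pos: sum(r) / len(r) for pos, r in sorted(bins.items())}
```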

Figure

Figure 1: The overall workflow of HEROES. Contents with solid (resp. dashed) borders are retained (resp. filtered) by the content ranking module. Sentences selected by the extractive summarization module are marked in colors.

Figure

Figure 4: Averaged difference in ROUGE-1 score between the summaries from BART and BART-LB, grouped by the length of reference summary. For example, the first bin refers to the 0-20% percentile of articles with the shortest reference summary in the corresponding dataset.

Figure

Figure 3: Agreement trajectories averaged over 100 runs per topic in the TAC 2009 corpus.

Figure

Figure 1: Illustration of traditional evaluation models based on reference summaries (top) and the new model (bottom) which is based on pairwise preferences.

Figure

Figure 2: Example of a pairwise preference annotation of two sentences. The first sentence is preferred over the second sentence because the first sentence contains more important information given that the information is not already known.

Figure

Figure 1: Assume a document (x1, x2, · · · , x8) contains three sentences (i.e., SENT. 1, SENT. 2 and SENT. 3). A SEQ2SEQ Transformer model can be pre-trained with our proposed objective. It takes the transformed document (i.e., a shuffled document, the first segment of a document, or a masked document) as input and learns to recover the original document (or part of the original document) by generation. SR: Sentence Reordering; NSG: Next Sentence Generation; MDG: Masked Document Generation.
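
A minimal sketch of the three document transformations (SR, NSG, MDG) as (input, target) pair builders; masking here is done at sentence granularity and the split point is fixed, both simplifications that may differ from the paper's exact setup:

```python
# Minimal sketch: build (input, target) pairs for the three pre-training
# objectives: Sentence Reordering (SR), Next Sentence Generation (NSG) and
# Masked Document Generation (MDG). Mask token and ratios are illustrative.
import random

def sentence_reordering(sentences):
    shuffled = sentences[:]
    random.shuffle(shuffled)
    return shuffled, sentences                 # recover the original order

def next_sentence_generation(sentences, split_ratio=0.5):
    cut = max(1, int(len(sentences) * split_ratio))
    return sentences[:cut], sentences[cut:]    # generate the rest of the document

def masked_document_generation(sentences, mask_prob=0.15, mask_token="[MASK]"):
    masked = [mask_token if random.random() < mask_prob else s for s in sentences]
    return masked, sentences                   # recover the original document
```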