OASum: Large-Scale Open Domain Aspect-based Summarization

Abstract

Aspect or query-based summarization has recently caught more attention, as it can generate differentiated summaries based on users’ interests. However, the current dataset for aspect or query-based summarization either focuses on specific domains, on a relatively small scale, or contains only a few aspect types. Such limitations hinder further explorations in this direction. In this work, we take advantage of crowd-sourcing knowledge on Wikipedia and automatically create a high-quality, large-scale open-domain aspect-based summarization dataset named OASum, which contains more than 3.7 million instances with around 1 million different aspects on 2 million Wikipedia pages. We provide benchmark results on OASum and demonstrate its ability for diverse aspect-based summarization generation. To overcome the data scarcity problem on specific domains, we also perform zero-shot, few-shot, and fine-tuning on seven downstream datasets. Specifically, zero/few-shot and fine-tuning results show that the model pre-trained on our corpus demonstrates a strong aspect or query-focused generation ability compared with the backbone model. Our dataset and pre-trained checkpoints are publicly available.

Publication
In Findings of the Association for Computational Linguistics
Kaiqiang Song
Kaiqiang Song
Senior Research Scientist

Kaiqiang Song (宋凯强) is a Senior Research Scientist at Tencent AI Lab, Seattle, specializing in Natural Language Processing. His research focuses on advancing artificial intelligence through machine learning, NLP, and large language models. He is dedicated to optimizing AI model architectures for practical applications like text summarization and text generation, bridging the gap between foundational AI research and real-world impact.