Abstract
State-of-the-art models for keyphrase generation require large amounts of training data to achieve good performance. However, obtaining keyphrase-labeled documents can be challenging and costly. To address this issue, we present a self-compositional data augmentation method. More specifically, we measure the relatedness of training documents based on their shared keyphrases, and combine similar documents to generate synthetic samples. The advantage of our method lies in its ability to create additional training samples that preserve domain coherence, without relying on external data or resources. Our results on multiple datasets spanning three different domains demonstrate that our method consistently improves keyphrase generation. A qualitative analysis of the keyphrases generated for the Computer Science domain confirms this improvement with respect to their representativity.
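
The augmentation idea summarized above can be illustrated with a minimal sketch. The abstract does not specify how relatedness is scored or how documents are combined, so everything below is an assumption made for illustration: relatedness is taken to be the Jaccard overlap of the two keyphrase sets, similar documents are combined by simple text concatenation, and the names `keyphrase_overlap`, `augment`, and `threshold` are hypothetical rather than taken from the paper.

```python
from itertools import combinations


def keyphrase_overlap(keys_a, keys_b):
    """Jaccard overlap of two keyphrase collections (assumed relatedness measure)."""
    set_a, set_b = set(keys_a), set(keys_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def augment(docs, threshold=0.3):
    """Build synthetic samples from pairs of related training documents.

    docs: list of (text, keyphrases) pairs.
    threshold: hypothetical cut-off on the overlap score; not from the paper.
    """
    synthetic = []
    for (text_a, keys_a), (text_b, keys_b) in combinations(docs, 2):
        if keyphrase_overlap(keys_a, keys_b) >= threshold:
            # Assumed combination step: concatenate the texts and take the
            # order-preserving union of both keyphrase lists.
            merged_keys = list(dict.fromkeys(list(keys_a) + list(keys_b)))
            synthetic.append((text_a + " " + text_b, merged_keys))
    return synthetic
```

Because the synthetic samples are built only from pairs of existing training documents, this sketch needs no external data, which matches the property claimed for the method; the actual scoring and combination procedure may differ.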