Abstract
Identifying DNA-binding proteins (DBPs) and RNA-binding proteins (RBPs) is crucial for understanding cell function, molecular interactions, and regulatory functions. Owing to the high similarity between DBPs and RBPs, most existing approaches struggle to differentiate them, leading to high cross-prediction errors. Moreover, identifying proteins that bind to both DNA and RNA (DRBPs) is also a challenging task. To mitigate these issues, we propose a novel framework, LAMP-PRo, which is based on a pre-trained protein language model (PLM), attention mechanisms, and multi-label learning. First, a pre-trained PLM such as ESM-2 is used to embed the protein sequences, followed by a convolutional neural network (CNN). Subsequently, a multi-head self-attention mechanism is applied to capture contextual information, while label-aware attention is used to compute class-specific representations by attending to the sequence in a way that is tailored to each label (DBP, RBP, and non-NABP) in a multi-label setup. We also include a novel cross-label attention mechanism to explicitly capture dependencies between DNA- and RNA-binding proteins, enabling more accurate prediction of DRBPs. Finally, a linear layer followed by a sigmoid function is used for the final prediction. Extensive experiments are carried out to compare LAMP-PRo with existing methods, in which the proposed model shows consistently competitive performance. Furthermore, we provide visualizations to showcase model interpretability, highlighting which parts of the sequence are most relevant for a predicted label. The original datasets are available at http://bliulab.net/iDRBP_MMC and the code is available at https://github.com/NimishaGhosh/LAMP-PRo.
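The label-aware attention, cross-label attention, and sigmoid head described above can be illustrated with a minimal sketch in plain Python. This is not the LAMP-PRo implementation: the ESM-2 embedding, the CNN, and the multi-head self-attention stage are replaced here by random residue vectors, and the scaled dot-product form of both attention steps, the residual connection, and all names (`label_aware_attention`, `cross_label_attention`, `predict`, the query and head parameters) are assumptions made for illustration only.

```python
import math
import random

LABELS = ["DBP", "RBP", "non-NABP"]  # the three labels named in the abstract


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sigmoid(x):
    # numerically stable form for both signs of x
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def label_aware_attention(H, label_queries):
    """Pool residue features H (length-L list of d-dim vectors) once per label.

    Each label has its own query vector (assumed form), so each label gets
    its own attention weights over residues and its own pooled representation.
    """
    d = len(H[0])
    reps, weights = [], []
    for q in label_queries:
        w = softmax([dot(q, h) / math.sqrt(d) for h in H])
        reps.append([sum(w[i] * H[i][j] for i in range(len(H))) for j in range(d)])
        weights.append(w)
    return reps, weights


def cross_label_attention(reps):
    """Let every label representation attend over all label representations.

    Assumed form: scaled dot-product attention plus a residual connection.
    """
    d = len(reps[0])
    out = []
    for r in reps:
        w = softmax([dot(r, other) / math.sqrt(d) for other in reps])
        mixed = [sum(w[k] * reps[k][j] for k in range(len(reps))) for j in range(d)]
        out.append([a + b for a, b in zip(r, mixed)])
    return out


def predict(H, label_queries, head_weights, head_biases):
    reps, weights = label_aware_attention(H, label_queries)
    reps = cross_label_attention(reps)
    # one linear unit + sigmoid per label (multi-label output)
    probs = [sigmoid(dot(w, r) + b) for w, r, b in zip(head_weights, reps, head_biases)]
    return probs, weights


# Toy inputs: random vectors stand in for the PLM/CNN/self-attention features.
rng = random.Random(0)
d, L = 8, 12
H = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(L)]
queries = [[rng.gauss(0, 1) for _ in range(d)] for _ in LABELS]
head_w = [[rng.gauss(0, 0.5) for _ in range(d)] for _ in LABELS]
head_b = [0.0 for _ in LABELS]

probs, attn = predict(H, queries, head_w, head_b)
for label, p in zip(LABELS, probs):
    print(f"{label}: {p:.3f}")
```

In such a setup, a protein would be called a DRBP when both the DBP and RBP probabilities exceed a chosen threshold (for example 0.5, which is an assumption here), and the per-label weights in `attn` are the kind of quantity that can be plotted to show which residues were most relevant to each label.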