Abstract
In this study, we investigate multi-class disease classification with pre-trained language models on the Medical-Abstracts-TC-Corpus, which spans five medical conditions. We excluded non-cancer conditions and examined four specific diseases. We assessed four LLMs: BioBERT, XLNet, and BERT, as well as a novel base model (LastBERT). BioBERT, which was pre-trained on medical data, demonstrated superior performance in medical text classification (97% accuracy). Surprisingly, XLNet followed closely (96% accuracy), demonstrating its generalizability across domains even though it was not pre-trained on medical data. LastBERT, a custom model based on a lighter version of BERT, also proved competitive with 87.10% accuracy (just under BERT's 89.33%). Our findings confirm the importance of specialized models such as BioBERT and also support the case for more general solutions like XLNet and for well-tuned transformer architectures with fewer parameters (in this case, LastBERT) in medical domain tasks.