Adapting pre-trained language models (PrLMs) (e.g., BERT) to new domains has gained much attention recently. Instead of fine-tuning PrLMs as done in most previous work, we investigate how to adapt the features of PrLMs to new domains without fine-tuning. We explore unsupervised domain adaptation (UDA) in this paper. With the features from PrLMs, we adapt the models trained with labeled data from the source domain to the unlabeled target domain. Self-training, which predicts pseudo labels on the target domain data for training, is widely used for UDA. However, the predicted pseudo labels inevitably include noise, which negatively affects the training of a robust model. To improve the robustness of self-training, we present class-aware feature self-distillation (CFd) to learn discriminative features from PrLMs, in which PrLM features are self-distilled into a feature adaptation module and the features from the same class are more tightly clustered. We further extend CFd to a cross-language setting, in which language discrepancy is studied. Experiments on two monolingual and multilingual Amazon review datasets show that CFd can consistently improve the performance of self-training in cross-domain and cross-language settings.
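
To make the self-training baseline concrete, the following is a minimal sketch of generic pseudo-label self-training on fixed features. It is not the CFd method: the nearest-centroid classifier, the softmax over negative squared distances as a confidence score, the function names, and the `threshold` and `rounds` parameters are all assumptions chosen for illustration, and plain Python lists stand in for frozen PrLM features.

```python
import math


def centroids(xs, ys):
    """Return the mean feature vector of each class."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict(x, cents):
    """Return (label, confidence) via a softmax over negative squared distances."""
    labels = sorted(cents)
    scores = [-sum((a - b) ** 2 for a, b in zip(x, cents[y])) for y in labels]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    best = max(range(len(labels)), key=lambda i: exps[i])
    return labels[best], exps[best] / sum(exps)


def self_train(src_x, src_y, tgt_x, threshold=0.9, rounds=3):
    """Pseudo-label confident target points, refit on source plus them, repeat.

    Hypothetical stand-in for a self-training loop; target points whose
    confidence stays below `threshold` keep the pseudo label None.
    """
    pseudo = [None] * len(tgt_x)
    for _ in range(rounds):
        train_x = list(src_x) + [x for x, p in zip(tgt_x, pseudo) if p is not None]
        train_y = list(src_y) + [p for p in pseudo if p is not None]
        cents = centroids(train_x, train_y)
        for i, x in enumerate(tgt_x):
            label, conf = predict(x, cents)
            pseudo[i] = label if conf >= threshold else None
    return pseudo


# Toy example: labeled source features and unlabeled target features.
src_x = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 4.0]]
src_y = [0, 0, 1, 1]
tgt_x = [[1.0, 1.0], [4.0, 4.0], [2.5, 2.5]]
pseudo = self_train(src_x, src_y, tgt_x)
```

In this toy run the two target points near a source cluster receive pseudo labels, while the point equidistant from both clusters stays unlabeled. A confidence threshold of this kind only filters low-confidence predictions; confident but wrong pseudo labels still enter training, which is the noise problem the abstract refers to.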