In this paper, we address text and image matching for cross-modal retrieval in the fashion industry. Unlike matching in the general domain, fashion matching must pay much closer attention to the fine-grained information in fashion images and texts. Pioneering approaches detect regions of interest (RoIs) in images and use the RoI embeddings as image representations. In general, RoIs tend to represent "object-level" information in fashion images, whereas fashion texts tend to describe finer details, e.g., styles and attributes. RoIs are thus not fine-grained enough for fashion text and image matching. To this end, we propose FashionBERT, which leverages patches as image features. With the pre-trained BERT model as the backbone network, FashionBERT learns high-level representations of texts and images. In addition, we propose an adaptive loss to balance the multiple tasks learned jointly in FashionBERT. Two tasks (i.e., text and image matching and cross-modal retrieval) are used to evaluate FashionBERT. Experiments on a public dataset demonstrate that FashionBERT achieves significant performance improvements over the baseline and state-of-the-art approaches. In practice, FashionBERT is applied in a concrete cross-modal retrieval application, for which we provide a detailed analysis of matching performance and inference efficiency.
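To make the idea of using patches rather than RoIs as image features concrete, the following is a minimal sketch of cutting an image into a regular grid of equal-size patches. It is an illustration only, not FashionBERT's actual preprocessing: the function name `split_into_patches`, the grid size, and the nested-list image format are all assumptions made here, and the abstract does not specify how patches are cut or embedded.

```python
def split_into_patches(image, rows, cols):
    """Split a 2-D image (a list of pixel rows) into rows * cols patches.

    Hypothetical helper for illustration. Patches are returned in
    row-major order; any remainder pixels on the right and bottom
    edges are dropped so that all patches have the same size.
    """
    height, width = len(image), len(image[0])
    patch_h, patch_w = height // rows, width // cols
    if patch_h == 0 or patch_w == 0:
        raise ValueError("grid is finer than the image resolution")

    patches = []
    for r in range(rows):
        for c in range(cols):
            # Take the block of pixel rows for grid row r, then slice
            # out the columns that belong to grid column c.
            block = image[r * patch_h:(r + 1) * patch_h]
            patch = [row[c * patch_w:(c + 1) * patch_w] for row in block]
            patches.append(patch)
    return patches


# Usage: a 4 x 6 toy "image" whose pixel values are their flat indices.
toy_image = [[y * 6 + x for x in range(6)] for y in range(4)]
toy_patches = split_into_patches(toy_image, rows=2, cols=3)
```

Under this sketch, every patch covers a fixed image region regardless of whether a detector would consider it an object, which is the property that lets patch features retain details such as texture or trim. In a real pipeline each patch would then be passed through an image encoder to obtain an embedding; that step is omitted here.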