Large-scale cross-lingual language models (LM), such as mBERT, Unicoder and XLM, have achieved great success in cross-lingual representation learning. However, when applied to zero-shot cross-lingual transfer tasks, most existing methods use only single-language input for LM finetuning, without leveraging the intrinsic cross-lingual alignment between different languages that is essential for multilingual tasks. In this paper, we propose FILTER, an enhanced fusion method that takes cross-lingual data as input for XLM finetuning. Specifically, FILTER first encodes text input in the source language and its translation in the target language independently in the shallow layers, then performs cross-lingual fusion to extract multilingual knowledge in the intermediate layers, and finally performs further language-specific encoding. During inference, the model makes predictions based on the text input in the target language and its translation in the source language. For simple tasks such as classification, translated text in the target language shares the same label as the source language. However, this shared label becomes less accurate or even unavailable for more complex tasks such as question answering, NER and POS tagging. For better model scalability, we further propose an additional KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language. Extensive experiments demonstrate that FILTER achieves new state of the art (77.0 on average) on the challenging multilingual multi-task benchmark, XTREME.
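
As a rough illustration of the self-teaching idea, the sketch below computes a KL-divergence loss between soft pseudo-labels produced by a teacher and the predictions of a student on the translated target-language text. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the KL direction (teacher relative to student), the absence of a temperature term, and the toy logits are all hypothetical choices made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def self_teaching_loss(teacher_logits, student_logits):
    """KL between teacher soft pseudo-labels and student predictions.

    Assumption: the teacher logits come from a previously trained model
    run on the translated target-language text, and the student is the
    model currently being trained. Neither name comes from the source.
    """
    soft_pseudo_labels = softmax(teacher_logits)
    student_probs = softmax(student_logits)
    return kl_divergence(soft_pseudo_labels, student_probs)


if __name__ == "__main__":
    # Toy three-class example: the loss is zero when the student matches
    # the teacher and grows as the two distributions diverge.
    print(self_teaching_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
    print(self_teaching_loss([2.0, 0.0], [0.0, 2.0]))
```

In a training loop, a term like this would typically be added to the supervised task loss. The weighting between the two terms is left open here because the text above does not specify it.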