Language model pre-training has shown promising results in various downstream tasks. In this context, we introduce a cross-modal pre-trained language model, called Speech-Text BERT (ST-BERT), to tackle end-to-end spoken language understanding (E2E SLU) tasks. Taking phoneme posterior and subword-level text as input, ST-BERT learns a contextualized cross-modal alignment via our two proposed pre-training tasks: Cross-modal Masked Language Modeling (CM-MLM) and Cross-modal Conditioned Language Modeling (CM-CLM). Experimental results on three benchmarks show that our approach is effective across various SLU datasets and suffers surprisingly marginal performance degradation even when only 1% of the training data is available. Our method also achieves a further SLU performance gain via domain-adaptive pre-training with domain-specific speech-text pair data.