End-to-end (E2E) spoken language understanding (SLU) can infer semantics directly from the speech signal without cascading an automatic speech recognizer (ASR) with a natural language understanding (NLU) module. However, paired utterance recordings and corresponding semantics may not always be available or sufficient to train an E2E SLU model in a real production environment. In this paper, we propose to unify a well-optimized E2E ASR encoder (speech) and a pre-trained language model encoder (language) into a transformer decoder. The unified speech-language pre-trained model (SLP) is continually enhanced on limited labeled data from a target domain by using a conditional masked language model (MLM) objective, and thus can effectively generate a sequence of intent, slot type, and slot value for the given input speech at inference time. The experimental results on two public corpora show that our approach to E2E SLU is superior to the conventional cascaded method. It also outperforms the present state-of-the-art approaches to E2E SLU with much less paired data.
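
To make the target side of the conditional MLM objective concrete, the following is a minimal sketch of how an annotation might be flattened into a sequence of intent, slot type, and slot value, and then partially masked for training. It is an illustration only, not the model described here: the `[MASK]` symbol, the helper names `linearize` and `mask_targets`, the whitespace tokenization, the fixed number of masked positions, and the example utterance labels are all assumptions, and the speech conditioning itself is omitted.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol, not taken from the paper


def linearize(intent, slots):
    """Flatten an intent and (slot_type, slot_value) pairs into one sequence.

    The ordering (intent, then each slot type followed by its value tokens)
    is an assumed layout chosen for illustration.
    """
    tokens = [intent]
    for slot_type, slot_value in slots:
        tokens.append(slot_type)
        tokens.extend(slot_value.split())  # naive whitespace tokenization
    return tokens


def mask_targets(tokens, num_masked, rng):
    """Replace num_masked randomly chosen positions with MASK.

    Returns (masked_tokens, labels). labels holds the original token at each
    masked position and None elsewhere, so a loss would only be computed on
    the masked positions.
    """
    positions = set(rng.sample(range(len(tokens)), num_masked))
    masked = [MASK if i in positions else t for i, t in enumerate(tokens)]
    labels = [t if i in positions else None for i, t in enumerate(tokens)]
    return masked, labels


# Hypothetical annotation for a single utterance.
target = linearize("set_alarm", [("time", "seven am"), ("day", "monday")])
masked, labels = mask_targets(target, num_masked=3, rng=random.Random(0))
print(target)
print(masked)
print(labels)
```

In a full system, the decoder would predict the masked tokens conditioned on both the unmasked tokens and the encoded speech; here only the construction of the masked target and its labels is shown.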