Much recent work on Spoken Language Understanding (SLU) falls short in at least one of three ways: models were trained on oracle text input and neglected Automatic Speech Recognition (ASR) outputs, models were trained to predict only intents without the slot values, or models were trained on a large amount of in-house data. We propose a clean and general framework to learn semantics directly from speech with semi-supervision from transcribed speech to address these shortcomings. Our framework is built upon pretrained end-to-end (E2E) ASR and self-supervised language models, such as BERT, and is fine-tuned on a limited amount of target SLU data. In parallel, we identify two settings under which SLU models have been inadequately tested: noise-robustness and E2E semantics evaluation. We test the proposed framework under realistic environmental noise and with a new metric, the slots edit F1 score, on two public SLU corpora. Experiments show that our SLU framework with speech as input can perform on par with frameworks that take oracle text as input in semantic understanding, even when environmental noise is present and only a limited amount of labeled semantics data is available.