In recent years, the use of deep learning in language models has gained much attention. Some research projects claim that they can generate text that can be interpreted as human writing, enabling new possibilities in many application areas. Among the different areas related to language processing, one of the most notable for applying this type of modeling is programming languages. For years, the machine learning community has been researching this software engineering area, pursuing goals such as auto-completing, generating, fixing, or evaluating code written by humans. Considering the increasing popularity of deep-learning-enabled language models, we detected a lack of empirical papers that compare different deep learning architectures for creating and using language models based on programming code. This paper compares different neural network architectures, such as AWD-LSTMs, AWD-QRNNs, and Transformers, while using transfer learning and different tokenizations, to see how they behave when building language models from a Python dataset for code generation and fill-mask tasks. Considering the results, we discuss the strengths and weaknesses of each approach and the gaps we find in evaluating the language models or applying them in a real programming context.