Human ratings are one of the most prevalent methods to evaluate the performance of natural language processing algorithms. Similarly, it is common to measure the quality of sentences generated by a natural language generation model using human raters. In this paper, we argue for exploring the use of subjective evaluations within the process of training language generation models in a multi-task learning setting. As a case study, we use a crowd-authored dialogue corpus to fine-tune six different language generation models. Two of these models incorporate multi-task learning and use subjective ratings of lines as part of an explicit learning goal. A human evaluation of the generated dialogue lines reveals that utterances generated by the multi-tasking models were subjectively rated as the most typical, as moving the conversation forward the most, and as the least offensive. Based on these promising first results, we discuss future research directions for incorporating subjective human evaluations into language model training and thereby keeping the human user in the loop during the development process.