In training a deep learning system to perform audio transcription, two practical problems may arise. Firstly, most datasets are weakly labelled, having only a list of events present in each recording without any temporal information for training. Secondly, deep neural networks need a very large amount of labelled training data to achieve good performance, yet in practice it is difficult to collect enough samples for most classes of interest. In this paper, we propose factorising the final task of audio transcription into multiple intermediate tasks in order to improve the training performance when dealing with this kind of low-resource dataset. We evaluate three data-efficient approaches to training a stacked convolutional and recurrent neural network for the intermediate tasks. Our results show that different methods of training have different advantages and disadvantages.