This paper presents a deep reinforcement learning algorithm for online accompaniment generation, with potential for real-time interactive human-machine duet improvisation. Unlike offline music generation and harmonization, online music accompaniment requires the algorithm to respond to human input and generate the machine counterpart in sequential order. We cast this as a reinforcement learning problem, in which the generation agent learns a policy to generate a musical note (action) based on the previously generated context (state). The key to this algorithm is a well-functioning reward model. Instead of defining it using music composition rules, we learn this model from monophonic and polyphonic training data. The model considers the compatibility of the machine-generated note with both the machine-generated context and the human-generated context. Experiments show that the algorithm is able to respond to the human part and generate a melodic, harmonic, and diverse machine part. Subjective preference evaluations show that the proposed algorithm generates music pieces of higher quality than the baseline method.
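The state, action, and reward roles described above can be illustrated with a minimal rollout sketch. It is not the proposed algorithm: the names `PITCHES`, `CONSONANT`, `toy_reward`, `random_policy`, and `rollout` are hypothetical, the reward is a hand-written placeholder standing in for the learned reward model, and the policy is uniform random rather than trained. The sketch also assumes one note per time step and that the agent sees only notes from earlier steps.

```python
import random

PITCHES = list(range(60, 72))   # hypothetical action space: MIDI pitches C4..B4
CONSONANT = {0, 3, 4, 7, 8, 9}  # pitch-class intervals treated as consonant here


def toy_reward(machine_ctx, human_ctx, note):
    """Hand-written placeholder; the paper learns this model from data instead."""
    reward = 0.0
    if machine_ctx:
        # Compatibility with the machine context: favour small melodic steps.
        reward -= abs(note - machine_ctx[-1]) / 12.0
    if human_ctx:
        # Compatibility with the human context: favour consonant intervals.
        reward += 1.0 if abs(note - human_ctx[-1]) % 12 in CONSONANT else -1.0
    return reward


def random_policy(machine_ctx, human_ctx, rng):
    """Placeholder policy; a trained agent would condition on both contexts."""
    return rng.choice(PITCHES)


def rollout(human_part, policy, rng):
    """Generate the machine part one note at a time, scoring each action."""
    machine_ctx, human_ctx, rewards = [], [], []
    for human_note in human_part:
        # State: only notes from earlier time steps are visible to the agent.
        note = policy(machine_ctx, human_ctx, rng)
        rewards.append(toy_reward(machine_ctx, human_ctx, note))
        machine_ctx.append(note)
        human_ctx.append(human_note)  # the human note is revealed afterwards
    return machine_ctx, rewards


machine_part, rewards = rollout([60, 62, 64, 65], random_policy, random.Random(0))
```

In a training setting, the per-step rewards collected by such a rollout would be the signal used to update the policy; the update rule itself is omitted here.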