Abstract
Visual paragraph generation aims to automatically describe a given image from different perspectives and organize the sentences in a coherent way. In this paper, we address three critical challenges for this task in a reinforcement learning setting: mode collapse, delayed feedback, and the time-consuming warm-up for policy networks. Specifically, we propose a novel Curiosity-driven Reinforcement Learning (CRL) framework to jointly enhance the diversity and accuracy of the generated paragraphs. First, by modeling paragraph captioning as a long-term decision-making process and measuring the prediction uncertainty of state transitions as intrinsic rewards, the model is incentivized to memorize precise but rarely spotted descriptions in context, rather than being biased towards frequent fragments and generic patterns. Second, since the extrinsic reward from evaluation is not available until the complete paragraph is generated, we estimate its expected value at each time step with temporal-difference learning, by considering the correlations between successive actions. The estimated extrinsic rewards are then complemented by dense intrinsic rewards produced by the derived curiosity module, in order to encourage the policy to fully explore the action space and find a global optimum. Third, discounted imitation learning is integrated for learning from human demonstrations, without separately performing the time-consuming warm-up in advance. Extensive experiments conducted on the Stanford image-paragraph dataset demonstrate the effectiveness and efficiency of the proposed method, improving the performance by 38.4% compared with the state of the art.
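
As a rough illustration of how a delayed extrinsic reward and dense intrinsic rewards might be combined, the following minimal sketch assumes that the intrinsic reward is the squared prediction error of a forward model over state features and that the extrinsic signal is spread over time steps via one-step temporal-difference targets. The function names, the weight `eta`, and the discount `gamma` are hypothetical and are not taken from the paper; the exact formulation used by CRL may differ.

```python
from typing import List, Sequence


def curiosity_rewards(predicted_next: Sequence[Sequence[float]],
                      actual_next: Sequence[Sequence[float]]) -> List[float]:
    """Intrinsic reward per step: half the mean squared error between the
    forward model's predicted next-state features and the observed ones.
    (Assumed form; larger error means a less familiar transition.)"""
    rewards = []
    for pred, actual in zip(predicted_next, actual_next):
        mse = sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(actual)
        rewards.append(0.5 * mse)
    return rewards


def td_targets(values: Sequence[float], terminal_reward: float,
               gamma: float = 0.99) -> List[float]:
    """One-step TD targets when the extrinsic reward arrives only at the end.
    values[t] is a critic's estimate V(s_t). Intermediate steps bootstrap
    from gamma * V(s_{t+1}); the last step uses the terminal reward."""
    last = len(values) - 1
    return [terminal_reward if t == last else gamma * values[t + 1]
            for t in range(len(values))]


def combined_rewards(extrinsic: Sequence[float], intrinsic: Sequence[float],
                     eta: float = 0.1) -> List[float]:
    """Per-step training signal: estimated extrinsic value plus a weighted
    intrinsic bonus (eta is a hypothetical trade-off weight)."""
    return [e + eta * i for e, i in zip(extrinsic, intrinsic)]


if __name__ == "__main__":
    intrinsic = curiosity_rewards([[1.0, 0.0], [0.0, 0.0], [0.5, 0.5]],
                                  [[0.0, 0.0], [0.0, 0.0], [0.5, 0.5]])
    extrinsic = td_targets([0.2, 0.5, 0.8], terminal_reward=1.0, gamma=0.9)
    print(combined_rewards(extrinsic, intrinsic, eta=0.1))
```

In this sketch only the first transition is mispredicted, so it alone receives a curiosity bonus, while every step still receives an extrinsic estimate even though the evaluation score is observed only once at the end.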