When describing images with natural language, the descriptions can be made more informative if tuned using downstream tasks. This is often achieved by training two networks: a "speaker network" that generates sentences given an image, and a "listener network" that uses them to perform a task. Unfortunately, jointly training multiple networks to communicate in order to achieve a joint task faces two major challenges. First, the descriptions generated by a speaker network are discrete and stochastic, making optimization very hard and inefficient. Second, joint training usually causes the vocabulary used during communication to drift and diverge from natural language. We describe an approach that addresses both challenges. We first develop a new effective optimization based on partial sampling from a multinomial distribution combined with straight-through gradient updates, which we name PSST for Partial-Sampling Straight-Through. Second, we show that the generated descriptions can be kept close to natural by constraining them to be similar to human descriptions. Together, this approach creates descriptions that are both more discriminative and more natural than previous approaches. Evaluations on the standard COCO benchmark show that PSST Multinomial dramatically improves recall@10 from 60% to 86% while maintaining comparable language naturalness, and human evaluations show that it also increases naturalness while keeping the discriminative power of generated captions.
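To make the optimization idea concrete, the following is a minimal sketch of a generic straight-through gradient update through a multinomial sample, written in plain Python. It is not the PSST procedure itself: the abstract does not specify how the partial sampling is carried out, so that part is not reproduced here. The function names, the three-word vocabulary, and the listener gradient `grad_output` are all hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(logits):
    """Convert speaker logits into a multinomial distribution over words."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_one_hot(probs, rng):
    """Forward pass: draw one word from the multinomial as a one-hot vector."""
    idx = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return [1.0 if i == idx else 0.0 for i in range(len(probs))]


def straight_through_grad(probs, grad_output):
    """Backward pass: pretend the one-hot sample was `probs`.

    The sampling step has no useful gradient, so the straight-through
    estimator passes the listener's gradient through the softmax instead:
    grad_z[j] = p[j] * (g[j] - sum_i g[i] * p[i]).
    """
    dot = sum(g * p for g, p in zip(grad_output, probs))
    return [p * (g - dot) for p, g in zip(probs, grad_output)]


# Hypothetical 3-word vocabulary and speaker logits (assumptions).
rng = random.Random(0)
logits = [2.0, 0.5, -1.0]
probs = softmax(logits)

# The listener receives a discrete word in the forward pass ...
one_hot = sample_one_hot(probs, rng)

# ... and returns a gradient with respect to that word vector (made up here).
grad_output = [1.0, 0.0, 0.0]

# The speaker's logits are updated as if the word had been continuous.
grad_logits = straight_through_grad(probs, grad_output)
```

The design point is that the listener always sees a discrete word, which keeps the communication channel word-like, while the speaker still receives a deterministic gradient signal rather than a high-variance sampled one.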