Active speaker detection is an important component in video analysis algorithms for applications such as speaker diarization, video re-targeting for meetings, speech enhancement, and human-robot interaction. The absence of a large, carefully labeled audio-visual dataset for this task has constrained algorithm evaluations with respect to data diversity, environments, and accuracy. This has made comparisons and improvements difficult. In this paper, we present the AVA Active Speaker detection dataset (AVA-ActiveSpeaker) that will be released publicly to facilitate algorithm development and enable comparisons. The dataset contains temporally labeled face tracks in video, where each face instance is labeled as speaking or not, and whether the speech is audible. This dataset contains about 3.65 million human-labeled frames or about 38.5 hours of face tracks, and the corresponding audio. We also present a new audio-visual approach for active speaker detection, and analyze its performance, demonstrating both its strength and the contributions of the dataset.