Generative adversarial networks (GANs) are a powerful subclass of generative models. Despite a very rich research activity leading to numerous interesting GAN algorithms, it is still very hard to assess which algorithm(s) perform better than others. We conduct a neutral, multi-faceted large-scale empirical study on state-of-the-art models and evaluation measures. We find that most models can reach similar scores with enough hyperparameter optimization and random restarts. This suggests that improvements can arise from a higher computational budget and tuning more than from fundamental algorithmic changes. To overcome some limitations of the current metrics, we also propose several data sets on which precision and recall can be computed. Our experimental results suggest that future GAN research should be based on more systematic and objective evaluation procedures. Finally, we did not find evidence that any of the tested algorithms consistently outperforms the original one.