Abstract
In this paper, we introduce the MLM (Multiple Languages and Modalities) dataset, a new resource to train and evaluate multitask systems on samples in multiple modalities and three languages. The generation process and inclusion of semantic data provide a resource that further tests the ability of multitask systems to learn relationships between entities. The dataset is designed for researchers and developers who build applications that perform multiple tasks on data encountered on the web and in digital archives. A second version of MLM provides a geo-representative subset of the data with weighted samples for countries of the European Union. We demonstrate the value of the resource in developing novel applications in the digital humanities with a motivating use case and specify a benchmark set of tasks to retrieve modalities and locate entities in the dataset. Evaluation of baseline multitask and single-task systems on the full and geo-representative versions of MLM demonstrates the challenges of generalising on diverse data. In addition to the digital humanities, we expect the resource to contribute to research in multimodal representation learning, location estimation, and scene understanding.