A Survey on Data Collection for Machine Learning: A Big Data -- AI Integration Perspective

Abstract

Data collection is a major bottleneck in machine learning and an active research topic in multiple communities. There are largely two reasons data collection has recently become a critical issue. First, as machine learning is becoming more widely used, we are seeing new applications that do not necessarily have enough labeled data. Second, unlike traditional machine learning, deep learning techniques automatically generate features, which saves feature engineering costs, but in return may require larger amounts of labeled data. Interestingly, recent research in data collection comes not only from the machine learning, natural language, and computer vision communities, but also from the data management community due to the importance of handling large amounts of data. In this survey, we perform a comprehensive study of data collection from a data management point of view. Data collection largely consists of data acquisition, data labeling, and improvement of existing data or models. We provide a research landscape of these operations, provide guidelines on which technique to use when, and identify interesting research challenges. The integration of machine learning and data management for data collection is part of a larger trend of Big Data and Artificial Intelligence (AI) integration and opens many opportunities for new research.
