AI Competitions and Benchmarks: Dataset Development

Abstract

Machine learning is now used in many applications thanks to its ability to predict, generate, or discover patterns from large quantities of data. However, the process of collecting and transforming data for practical use is intricate. Even in today's digital era, where substantial data is generated daily, it is uncommon for it to be readily usable; most often, it necessitates meticulous manual data preparation. The haste in developing new models can frequently result in various shortcomings, potentially posing risks when deployed in real-world scenarios (e.g., social discrimination, critical failures), leading to the failure or substantial escalation of costs in AI-based projects. This chapter provides a comprehensive overview of established methodological tools, enriched by our practical experience, in the development of datasets for machine learning. Initially, we describe the tasks involved in dataset development and offer insights into their effective management (including requirements, design, implementation, evaluation, distribution, and maintenance). Then, we provide more details about the implementation process, which includes data collection, transformation, and quality evaluation. Finally, we address practical considerations regarding dataset distribution and maintenance.
