Abstract
The success of multi-task learning can depend heavily on which tasks are grouped together. Naively grouping all tasks, or a random set of tasks, can result in negative transfer, with the multi-task models performing worse than single-task models. Although many efforts have been made to identify task groupings and to measure the relatedness among different tasks, defining a metric that identifies the best task grouping out of a pool of many potential task combinations remains a challenging research topic. We propose a metric of task relatedness based on task difficulty measured by pointwise V-usable information (PVI). PVI is a recently proposed metric that estimates how much usable information a dataset contains given a model. We hypothesize that tasks whose PVI estimates are not statistically different are similar enough to benefit from the joint learning process. We conduct comprehensive experiments to evaluate the feasibility of this metric for task grouping on 15 NLP datasets in the general, biomedical, and clinical domains. We compare the results of the joint learners against single learners, existing baseline methods, and recent large language models, including Llama 2 and GPT-4. The results show that, by grouping tasks with similar PVI estimates, the joint learners yielded competitive results with fewer total parameters, with consistent performance across domains.
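To make the grouping criterion concrete, the following is a minimal sketch rather than the paper's implementation. It assumes the standard PVI definition from the PVI literature, in which the PVI of an instance is the log2-probability a model fine-tuned on real inputs assigns to the gold label minus the log2-probability a model fine-tuned on null (empty) inputs assigns to it. The abstract does not name the statistical test used to compare PVI estimates, so the permutation test, the threshold `alpha`, and all function names here are illustrative assumptions.

```python
import math
import random
from statistics import mean


def pvi(p_with_input: float, p_null: float) -> float:
    """PVI in bits: log2 p(gold | input) - log2 p(gold | null input).

    Both probabilities are assumed to come from two separately
    fine-tuned models (one sees the input, one sees an empty input).
    """
    return math.log2(p_with_input) - math.log2(p_null)


def permutation_p_value(a, b, n_resamples=2000, seed=0):
    """Two-sided permutation test on the difference of mean PVI.

    ASSUMPTION: the choice of test is illustrative; the abstract
    only says the estimates should not be statistically different.
    """
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_resamples + 1)


def similar_difficulty(a, b, alpha=0.05):
    """Hypothetical grouping rule: treat two tasks as groupable when
    the test fails to reject equal mean PVI at level alpha."""
    return permutation_p_value(a, b) >= alpha
```

Under these assumptions, two tasks would be candidates for joint training when `similar_difficulty` returns `True` for their per-instance PVI lists. A higher mean PVI indicates an easier task for the given model.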