Suggesting Code Edits in Interactive Machine Learning Notebooks Using Large Language Models

Abstract

Machine learning developers frequently use interactive computational notebooks, such as Jupyter notebooks, to host code for data processing and model training. Jupyter notebooks provide a convenient tool for writing machine learning pipelines and interactively observing outputs. However, maintaining Jupyter notebooks, e.g., to add new features or fix bugs, can be challenging due to the length and complexity of the notebooks. Moreover, there is no existing benchmark related to developer edits on Jupyter notebooks. To address this, we present the first dataset of 48,398 Jupyter notebook edits derived from 20,095 revisions of 792 machine learning repositories on GitHub, and perform the first study of using LLMs to predict code edits in Jupyter notebooks. Our dataset captures granular details of cell-level and line-level modifications, offering a foundation for understanding real-world maintenance patterns in machine learning workflows. We observed that the edits on Jupyter notebooks are highly localized, with changes averaging only 166 lines of code in repositories. While larger models outperform smaller counterparts in code editing, all models have low accuracy on our dataset even after fine-tuning, demonstrating the complexity of real-world machine learning maintenance tasks. Our findings emphasize the critical role of contextual information in improving model performance and point toward promising avenues for advancing large language models' capabilities in engineering machine learning code.
