Efficient and Accurate In-Database Machine Learning with SQL Code Generation in Python

Abstract

Following an analysis of the advantages of SQL-based Machine Learning (ML) and a short literature survey of the field, we describe a novel method for In-Database Machine Learning (IDBML). We contribute a process for SQL code generation in Python using template macros in Jinja2, as well as a prototype implementation of the process. We describe our implementation of the process to compute multidimensional histogram (MDH) probability estimation in SQL. For this, we contribute and implement a novel discretization method called equal quantized rank (EQR) variable-width binning. Based on this, we provide data gathered in a benchmarking experiment for the quantitative empirical evaluation of our method and system using the Covertype dataset. We measured accuracy and computation time. Our multidimensional probability estimation was significantly more accurate than Naive Bayes, which assumes independent one-dimensional probabilities and/or densities. Our method was also significantly more accurate and faster than logistic regression. However, our method was 2-3% less accurate than the best state-of-the-art methods we found (decision trees and random forests) and 2-3 times slower for one in-memory dataset. Nevertheless, this result motivates further research into accuracy improvement and into IDBML with SQL code generation for big data and larger-than-memory datasets.
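To make the idea of generating SQL for a multidimensional histogram concrete, the following is a minimal sketch, not the implementation described above. It rests on several assumptions: plain Python string formatting stands in for the Jinja2 template macros so that the example has no third-party dependencies, SQLite's `NTILE` window function stands in for a rank-based quantile binning (it is not claimed to be the EQR method), and the names `generate_mdh_sql`, `run_demo`, and the table `t` are hypothetical.

```python
import sqlite3

# Hypothetical SQL template: each feature gets a rank-based bin column, and a
# GROUP BY over all bin columns plus the class label yields the cell counts of
# a multidimensional histogram.
SQL_TEMPLATE = """
WITH binned AS (
    SELECT {bin_exprs}, {label} AS label
    FROM {table}
)
SELECT {bin_names}, label, COUNT(*) AS n
FROM binned
GROUP BY {bin_names}, label
ORDER BY {bin_names}, label
"""


def generate_mdh_sql(table, features, label, n_bins):
    """Render the histogram query. Identifiers are assumed to be trusted."""
    # NTILE is an assumed stand-in for rank-based binning, not the paper's EQR.
    bin_exprs = ", ".join(
        f"NTILE({int(n_bins)}) OVER (ORDER BY {f}) AS {f}_bin" for f in features
    )
    bin_names = ", ".join(f"{f}_bin" for f in features)
    return SQL_TEMPLATE.format(
        bin_exprs=bin_exprs, bin_names=bin_names, label=label, table=table
    )


def run_demo():
    """Build a tiny in-memory table and run the generated query inside SQLite."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (x REAL, y REAL, label INTEGER)")
    # Toy data with distinct values so that the rank-based bins are unambiguous.
    rows = [(i, 9 - i, i % 2) for i in range(1, 9)]
    conn.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)
    sql = generate_mdh_sql("t", ["x", "y"], "label", n_bins=2)
    result = conn.execute(sql).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for x_bin, y_bin, label, n in run_demo():
        print(x_bin, y_bin, label, n)
```

Dividing each count `n` by the total count of its `(x_bin, y_bin)` cell would give an estimate of the class probability conditional on the joint bin, which is the kind of multidimensional estimate that Naive Bayes replaces with a product of one-dimensional terms. The sketch requires an SQLite build with window-function support (version 3.25 or later).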
