Abstract
Leastmean squares (LMS) solvers such as Linear / Ridge / LassoRegression,SVD and ElasticNet not only solve fundamental machine learning problems, butare also the building blocks in a variety of other methods, such as decisiontrees and matrix factorizations. We suggest an algorithm that gets a finite set of $n$ $d$dimensional realvectors and returns a weighted subset of $d+1$ vectors whose sum is\emph{exactly} the same. The proof in Caratheodory's Theorem (1907) computessuch a subset in $O(n^2d^2)$ time and thus not used in practice. Our algorithmcomputes this subset in $O(nd)$ time, using $O(\log n)$ calls to Caratheodory'sconstruction on small but "smart" subsets. This is based on a novel paradigm offusion between different data summarization techniques, known as sketches andcoresets. As an example application, we show how it can be used to boost theperformance of existing LMS solvers, such as those in scikitlearn library, upto x100. Generalization for streaming and distributed (big) data is trivial.Extensive experimental results and complete open source code are also provided.
Quick Read (beta)
Fast and Accurate LeastMeanSquares Solvers
Abstract
Leastmean squares (LMS) solvers such as Linear / Ridge / LassoRegression, SVD and ElasticNet not only solve fundamental machine learning problems, but are also the building blocks in a variety of other methods, such as decision trees and matrix factorizations.
We suggest an algorithm that gets a finite set of $n$ $d$dimensional real vectors and returns a weighted subset of $d+1$ vectors whose sum is exactly the same. The proof in Caratheodory’s Theorem (1907) computes such a subset in $O({n}^{2}{d}^{2})$ time and thus not used in practice. Our algorithm computes this subset in $O(nd)$ time, using $O(\mathrm{log}n)$ calls to Caratheodory’s construction on small but "smart" subsets. This is based on a novel paradigm of fusion between different data summarization techniques, known as sketches and coresets.
As an example application, we show how it can be used to boost the performance of existing LMS solvers, such as those in scikitlearn library, up to x100. Generalization for streaming and distributed (big) data is trivial. Extensive experimental results and complete open source code are also provided.
Fast and Accurate LeastMeanSquares Solvers
Alaa Maalouf ^{†}^{†}thanks: These authors contributed equally to this work. [email protected] Ibrahim Jubran^{1}^{1}footnotemark: 1 [email protected] Dan Feldman [email protected] The Robotics and Big Data Lab, Department of Computer Science, University of Haifa, Haifa, Israel
noticebox[b]\[email protected]
1 Introduction and Motivation
LeastMeanSquares (LMS) solvers are the family of fundamental optimization problems in machine learning and statistics that include linear regression, Principle Component Analysis (PCA), Singular Value Decomposition (SVD), Lasso and Ridge regression, Elastic net, and many more Golub und Reinsch (1971); Jolliffe (2011); Hoerl und Kennard (1970); Seber und Lee (2012); Zou und Hastie (2005); Tibshirani (1996); Safavian und Landgrebe (1991). See formal definition below. First closed form solutions for problems such as linear regression were published by e.g. Pearson Pearson (1900) around 1900 but were probably known before. Nevertheless, today they are still used extensively as building blocks in both academy and industry for normalization Liang u. a. (2013); Kang u. a. (2011); Afrabandpey u. a. (2016), spectral clustering Peng u. a. (2015), graph theory Zhang und Rohe (2018), prediction Copas (1983); Porco u. a. (2015), dimensionality reduction Laparra u. a. (2015), feature selection Gallagher u. a. (2017) and many more; see more examples in Golub und Van Loan (2012).
LeastMeanSquares solver in this paper is an optimization problem that gets as input an $n\times d$ real matrix $A$, and another $n$dimensional real vector $b$ (possibly the zero vector). It aims to minimize the sum of squared distances from the rows (points) of $A$ to some hyperplane that is represented by its normal or vector of $d$ coefficients $x$, that is constrained to be in a given set $X\subseteq {\mathbb{R}}^{d}$:
$$\underset{x\in X}{\mathrm{min}}f({\parallel Axb\parallel}_{2})+g(x).$$ 
Here, $g$ is called a regularization term. For example: in linear regression $X={\mathbb{R}}^{d}$, $f(x)={x}^{2}$ and $g(x)=0$ for every $x\in X$. In Lasso $f(y)={y}^{2}$ and $g(y)=\alpha \cdot {\parallel x\parallel}_{1}$ for every $y\in {\mathbb{R}}^{d}$ and $\alpha >0$. Such LMS solvers can be computed via the covariance matrix ${A}^{T}A$. For example, the solution to linear regression of minimizing ${\parallel Axb\parallel}_{2}$ is ${({A}^{T}A)}^{1}{A}^{T}b$.
1.1 Related work
While there are many LMS solvers and implementations, there is always a tradeoff between their accuracy and running time; see comparison table in Bauckhage (2015) with references therein. The reason is related to the fact that computing the covariance matrix of $A$ can be done essentially in two ways: (i) summing the $d\times d$ outer product ${a}_{i}{a}_{i}^{T}$ of the $i$th row ${a}_{i}^{T}$ of $A$ over every $i$, $1\le i\le n$. This is due to the fact that ${A}^{T}A={\sum}_{i=1}^{n}{a}_{i}{a}_{i}^{T}$, or (ii) factorization of $A$, e.g. using SVD or the QR decomposition.
Numerical issues.
Method (i) is easy to implement for streaming rows of $A$ by maintaining only ${d}^{2}$ entries of the covariance matrix for the $n$ vectors seen so far, or maintaining its inverse ${({A}^{T}A)}^{1}$ as explained e.g. in Golub und Van Loan (2012). This takes $O({d}^{2})$ for each vector insertion. However, every such addition may introduce another numerical error which accumulates over time. This error increases significantly when running the algorithms using 32 bit floating point representation, which is common for GPU computations. This solution is similar to maintaining the set of $d$ rows of the matrix $D{V}^{T}$, where $A=UD{V}^{T}$ is the SVD of $A$, which is not a subset of the original input matrix $A$ but has the same covariance matrix ${A}^{T}A=V{D}^{2}V$. A common problem is that to compute ${({A}^{T}A)}^{1}$ the matrix ${A}^{T}A$ must be invertible. This may not be the case due to numerical issues. In algorithms such as Lasso, the input cannot be a covariance matrix, but can be its Cholesky decomposition Bjorck (1967) $VD$. However, Cholesky decomposition can be applied only on positivedefinite matrices, which is not the case even for small numerical errors that are added to ${A}^{T}A$. See Section 4 for more details and empirical evidence.
Runningtime issues.
Method (ii) above utilizes factorizations such as SVD, i.e., $A=UD{V}^{T}$ to compute the covariance matrix via ${A}^{T}A=V{D}^{2}{V}^{T}$ or the QR decomposition $A=QR$ to compute ${A}^{T}A={R}^{T}{Q}^{T}Q{R}^{T}={R}^{T}R$. This approach is known to be much more stable. However, it is much more time consuming: while in theory the running time is $O(n{d}^{2})$ as in the first method, the constants that are hidden in the $O(\cdot )$ notation are significantly larger. Moreover, unlike Method (i), it is impossible to compute such factorizations exactly for streaming data Clarkson und Woodruff (2009).
Caratheodory’s Theorem Carathéodory (1907)
states that every point contained in the convex hull of $n$ points in ${\mathbb{R}}^{d}$ can be represented as a convex combination of a subset of at most $d+1$ points, which we call the Caratheodory set; see Section 2 and Fig. 1. This implies that we can maintain a weighted (scaled) set of ${d}^{2}+1$ points (rows) whose covariance matrix is the same as $A$, since $(1/n){\sum}_{i}{a}_{i}{a}_{i}^{T}$ is the mean of $n$ matrices and thus in the convex hull of their corresponding points in ${\mathbb{R}}^{{d}^{2}}$; see Algorithm 3. The fact that we maintain such a small sized subset of points instead of updating linear combinations of all the $n$ points seen so far, significantly reduces the numerical errors as shown in Fig. 9(b)–9(a). Unfortunately, computing this set from Caratheodory’s Theorem takes $O({n}^{2}{d}^{2})$ or $O(n{d}^{3})$ time via $O(n)$ calls to an LMS solver. This fact makes it nonpractical to use in an LMS solver, as we aim to do in this paper, and may explain the lack of source code for this algorithm on the web.
Approximations via Coresets and Sketches.
In the recent decades numerous approximation and data summarization algorithms were suggested to approximate the covariance matrix or its singular values. This is by computing a small matrix $S$ whose covariance ${S}^{T}S$ approximates, in some sense, the covariance matrix ${A}^{T}A$ of the input data $A$. The term coreset is usually used when $S$ is a weighted (scaled) subset of rows from the $n$ rows of the input matrix. The matrix $S$ is sometimes called a sketch if each rows in $S$ is a linear combination of few or all its rows, i.e. $S=WA$ for some matrix $W\in {\mathbb{R}}^{s\times n}$. However, those coresets and sketches usually yield $(1+\epsilon )$multiplicative approximations for ${\parallel Ax\parallel}_{2}^{2}$ by ${\parallel Sx\parallel}_{2}^{2}$ where the matrix $S$ is of ${(d/\epsilon )}^{O(1)}$ rows and $x$ may be any vector, or the smallest/largest singular vector of $S$ or $A$; see lower bounds in Feldman u. a. (2010). Moreover, a $(1+\epsilon )$approximation to ${\parallel Ax\parallel}_{2}^{2}$ by ${\parallel Sx\parallel}_{2}^{2}$ does not guarantee an approximation to the actual entries or eigenvectors of $A$ by $S$ that may be very different; see Drineas u. a. (2006).
Accurately handling big data.
The algorithms in this paper return accurate coresets ($\epsilon =0$), which is less common in the literature, including computations of the exact covariance matrix ${A}^{T}A$. Such coresets can easily support infinite stream of input rows using memory that is linear in their size, and also support dynamic/distributed data in parallel. This is by the useful mergeandreduce property of coresets that allow them to handle big data; see details e.g. in Agarwal u. a. (2004). Unlike traditional coresets that pay additional logarithmic multiplicative factors due to the usage of mergereduce trees and increasing error, the suggested weighted subsets in this paper do not introduce additional error to the resulting compression since they preserve the result accurately. The actual numerical errors are measured in the experimental results, with analysis that explain the differences.
A main advantage of a coreset over a sketch is that it preserves sparsity of the input rows Feldman u. a. (2016), which usually reduces theoretical running time. Our experiments show, as expected, that coresets can also be used to significantly improve the numerical stability of existing algorithms, even if the running time is the same. Another advantage is that the same coreset can be used for parameter tuning over a large set of candidates. In addition to other reasons, this significantly reduced the running time of such algorithms in our experiments; see Section 4.
1.2 Our contribution
A natural question that follows from the previous section is: can we maintain the optimal solution for LMS problems both accurately and fast? We answer this question affirmably by suggesting:

(i)
the first algorithm that computes the Caratheodory set of $n$ input points in time that is linear in the input $O(nd)$ for asymptotically large $n$, and using only $O(\mathrm{log}n)$ calls to an LMS solver. This is by using a novel approach of coreset/skecthes fusion that is explained in the next section; see Algorithm 2 and Theorem 2.

(ii)
an algorithm that maintains a ("coreset") matrix $S\in {\mathbb{R}}^{({d}^{2}+1)\times d}$ such that: (a) its set of rows is a weighted subset of the matrix $A\in {\mathbb{R}}^{n\times d}$ whose rows are the input points, and (b) the covariance matrices of $S$ and $A$ are the same, i.e., ${S}^{T}S={A}^{T}A$; see Algorithm 3 and Theorem 3.2.

(iii)
example applications for boosting the performance of existing solvers by running them on the matrix $S$ above or its variants for Linear/Ridge/Lasso Regressions and Elasticnet.

(iv)
extensive experimental results on synthetic and realworld data for common LMS solvers of Scikitlearn library with either CPython or Intel’s distribution. Either the running time or numerical stability is improved up to two orders of magnitude.
 (v)
1.3 Novel approach: Coresets meet Sketches
As explained in Section 1.1, the covariance matrix ${A}^{T}A$ of $A$ itself can be considered as a sketch which is relatively less numerically stable to maintain (especially its inverse, as desired by e.g. linear regression). The Caratheodory set that corresponds to the outer products of the rows of $A$ is a coreset whose covariance matrix is ${A}^{T}A$ and, as a weighted subset of the original rows, is more numerically stable but takes much more time to compute; see Theorem 2.2.
We thus suggest a metaalgorithm that combines these two approaches: sketches and coresets. It may be generalized to other, notnecessarily accurate $\epsilon $coresets and sketches ($\epsilon >0$); see Section 5.
The input to our metaalgorithm is 1) a set $P$ of $n$ items, 2) an integer $k$ from $1$ to $n$ where $n$ is maximum accuracy but longest running time, and 3) a pair of coreset and sketch construction schemes for the problem at hand.
The output is a coreset for the problem whose construction time is faster; see Fig. 1.
Step I: Compute a balanced partition $\{{P}_{1},\mathrm{\cdots},{P}_{k}\}$ of the input set $P$ into $k$ clusters of roughly the same size. While the correctness holds for any such arbitrary partition (e.g. see Algorithm 3.1), to reduce numerical errors – a partition that minimizes the sum of loss with respect to the problem at hand would be optimal.
Step II: Compute a sketch ${S}_{i}$ for each cluster ${P}_{i}$, where $i\in \{1,\mathrm{\cdots},k\}$, using the input sketch scheme. This step does not return a subset of $P$ as desired, and is usually numerically less stable.
Step III: Compute a coreset $B$ for the union $S={S}_{1}\cup \mathrm{\cdots}{S}_{k}$ of sketches from Step II, using the input coreset scheme. Note that $B$ is not a subset (or coreset) of $P$.
Step IV: Compute the union $C$ of clusters in ${P}_{1},\mathrm{\cdots},{P}_{k}$ that correspond to the selected sketches in Step III, i.e. $C={\bigcup}_{{S}_{i}\in B}{P}_{i}$. By definition, $C$ is a coreset for the problem at hand.
Step V: Recursively compute a coreset for $C$ until a sufficiently coreset is obtained. This step is used to reduce running time, without selecting $k$ that is too small.
We then run an existing solver on the coreset $C$ to obtain a faster accurate solution for $P$. Algorithm 2 and 3.1 are special cases of this metaalgorithm, where the sketch is simply the sum of a set of points/matrices, and the coreset is the existing (slow) implementation of the Caratheodory set from Theorem 2.2.
Paper organization.
In Section 2 we give our notations, definitions and the current stateoftheart result. Section 3 presents our main algorithms for efficient computation of the Caratheodory (core)set and a subset that preserves the inputs covariance matrix, their theorems of correctness and proofs. Section 4 demonstrates the applications of those algorithms to common LMS solvers, with extensive experimental results on both realworld and synthetic data via the Scikitlearn library with either CPython or Intel’s Python distributions. We conclude the paper with open problems and future work in Section 5.
2 Notation and Preliminaries
For integers $n,d\ge 1$, we denote by ${\mathbb{R}}^{n\times d}$ the set of $n\times d$ real matrices, and $[n]=\{1,\mathrm{\cdots},n\}$. To avoid abuse of notation, we use the big O notation where $O(\cdot )$ is a set Cormen u. a. (2009). A weighted set is a pair $(P,u)$ where $P=\{{p}_{1},\mathrm{\cdots},{p}_{n}\}$ is an ordered finite set in ${\mathbb{R}}^{d}$, and $u:P\to [0,\mathrm{\infty})$ is a positive weights function. A linear system solver is an algorithm that solves a system of $n$ linear equations with $d$ variables, i.e., return $x\in {\mathbb{R}}^{d}$ such that $Ax=b$ for a given $A\in {\mathbb{R}}^{n\times d}$ and $b\in {\mathbb{R}}^{n}$, assuming there is such a solution.
Given a point $q$ inside the convex hull of a set of points $P$, Caratheodory’s Theorem proves that there a subset of at most $d+1$ points in $P$ whose convex hull also contains $q$. This geometric definition can be formulated as follows.
Definition 2.1 (Caratheodory set).
Let $\mathrm{(}P\mathrm{,}u\mathrm{)}$ be a weighted set of $n$ points in ${\mathrm{R}}^{d}$ such that ${\mathrm{\sum}}_{p\mathrm{\in}P}u\mathit{}\mathrm{(}p\mathrm{)}\mathrm{=}\mathrm{1}$. A weighted set $\mathrm{(}S\mathrm{,}w\mathrm{)}$ is called a Caratheodory Set for $\mathrm{(}P\mathrm{,}u\mathrm{)}$ if: (i) its size is $\mathrm{}S\mathrm{}\mathrm{\le}d\mathrm{+}\mathrm{1}$, (ii) its weighted mean is the same, ${\mathrm{\sum}}_{p\mathrm{\in}S}w\mathit{}\mathrm{(}p\mathrm{)}\mathrm{\cdot}p\mathrm{=}{\mathrm{\sum}}_{p\mathrm{\in}P}u\mathit{}\mathrm{(}p\mathrm{)}\mathrm{\cdot}p$, and (iii) its sum of weights ${\mathrm{\sum}}_{p\mathrm{\in}S}w\mathit{}\mathrm{(}p\mathrm{)}\mathrm{=}\mathrm{1}$.
Caratheodory’s Theorem suggests a constructive proof for computing this set in $O({n}^{2}{d}^{2})$ time Carathéodory (1907); Cook und Webster (1972). This is implemented in Algorithm 1, which takes as input a weighted set $(P,u)$ such that ${\sum}_{p\in P}u(p)=1$ and computes a Caratheodory set $(S,w)$ for $(P,u)$ in $O({n}^{2}{d}^{2})$ time. However, as observed e.g. in Nasser u. a. (2015) it can be computed only for the first $m=d+1$ points, and then be updated point by point in $O(m{d}^{2})=O({d}^{3})$ time per point, to obtain $O(n{d}^{3})$ overall time. This still takes $\mathrm{\Theta}(n)$ calls to a linear system solver that solves $Ax=b$ for a given matrix $A$ and vector $b$, each of $\mathrm{\Theta}(d)$ rows and columns, in $O({d}^{3})$ time per call.
Theorem 2.2 (Carathéodory (1907)).
A Caratheodory set $\mathrm{(}S\mathrm{,}w\mathrm{)}$ can be computed for any weighted set $\mathrm{(}P\mathrm{,}u\mathrm{)}$ where ${\mathrm{\sum}}_{p\mathrm{\in}P}u\mathit{}\mathrm{(}p\mathrm{)}\mathrm{=}\mathrm{1}$ in $t\mathit{}\mathrm{(}n\mathrm{,}d\mathrm{)}\mathrm{\in}O\mathit{}\mathrm{(}\mathrm{1}\mathrm{)}\mathrm{\cdot}\mathrm{min}\mathit{}\mathrm{\{}{n}^{\mathrm{2}}\mathit{}{d}^{\mathrm{2}}\mathrm{,}n\mathit{}{d}^{\mathrm{3}}\mathrm{\}}$ time.
3 Faster Caratheodory Set
In this section, we present our main algorithm that reduces the running time for computing a Caratheodory set from $O(\mathrm{min}\{{n}^{2}{d}^{2},n{d}^{3}\})$ in Theorem 2.2 or Nasser u. a. (2015) to $O(nd)$ for sufficiently large $n$; see Theorem 3.1. A visual illustration of the corresponding Algorithm 2 is shown in Fig 1. We also present a second algorithm, called CaratheodoryMatrix, which computes a small weighted subset of a the given input that has the same covariance matrix as the input data; see Algorithm 3.
Theorem 3.1 (CaratheodorySet Booster).
Let $\mathrm{(}P\mathrm{,}w\mathrm{)}$ be a weighted set of $n$ points in ${\mathrm{R}}^{d}$, and $k\mathrm{\ge}d\mathrm{+}\mathrm{2}$ be an integer. Let $\mathrm{(}C\mathrm{,}u\mathrm{)}$ be the output of a call to $\mathrm{\text{FastCaratheodorySet}}\mathit{}\mathrm{(}P\mathrm{,}u\mathrm{,}k\mathrm{)}$; See Algorithm 2. Let $t\mathit{}\mathrm{(}k\mathrm{,}d\mathrm{)}$ be the time it takes to compute a Caratheodory Set for $k$ points in ${\mathrm{R}}^{d}$, as in Theorem 2.2. Then $\mathrm{(}C\mathrm{,}u\mathrm{)}$ is a Caratheodory set of $\mathrm{(}P\mathrm{,}w\mathrm{)}$ that can be computed in time
$$O\left(nd+t(k,d)\cdot \frac{\mathrm{log}n}{\mathrm{log}(k/d)}\right).$$ 
Proof.
We use the notation and variable names as defined in Algorithm 2.
Identify the input set $P=\{{p}_{1},\mathrm{\cdots},{p}_{n}\}$ and the set $C$ that is computed at Line 2 of Algorithm 2 as $C=\{{c}_{1},\mathrm{\cdots},{c}_{C}\}$. We will first prove that the weighted set $(C,w)$ computed in Lines 2–2 at the current (some) iteration is a Caratheodory set for $(P,u)$, i.e., ${\sum}_{p\in P}u(p)\cdot p={\sum}_{p\in C}w(p)\cdot p$, ${\sum}_{p\in P}u(p)={\sum}_{p\in C}w(p)$ and $C\le d+1$.
For every $i\in \{1,\mathrm{\cdots},k\}$, let $U({P}_{i})={\sum}_{p\in {P}_{i}}u(p)$.
Let $(\stackrel{~}{\mu},\stackrel{~}{w})$ be the pair computed at the current iteration at Line 2. By the definition of Caratheodory, we have that the pair $(\stackrel{~}{\mu},\stackrel{~}{w})$ computed at Line 2 is a Caratheodory set of the weighted set $(\{{\mu}_{1},\mathrm{\cdots},{\mu}_{k}\},{u}^{\prime})$, i.e., it satisfies that
$$\sum _{{\mu}_{i}\in \stackrel{~}{\mu}}\stackrel{~}{w}({\mu}_{i})=1,\sum _{{\mu}_{i}\in \stackrel{~}{\mu}}\stackrel{~}{w}({\mu}_{i}){\mu}_{i}=\sum _{i=1}^{k}U({P}_{i})\cdot {\mu}_{i}\text{and}\mathit{\hspace{1em}}\stackrel{~}{\mu}\le d+1.$$  (1) 
Observe that by the definition of ${\mu}_{i}$ for every $i\in \{1,\mathrm{\cdots},k\}$ at Line 2 we have that
$$\sum _{i=1}^{k}U({P}_{i})\cdot {\mu}_{i}=\sum _{i=1}^{k}U({P}_{i})\cdot \left(\frac{1}{U({P}_{i})}\cdot \sum _{p\in {P}_{i}}u(p)\cdot p\right)=\sum _{i=1}^{k}\sum _{p\in {P}_{i}}u(p)p=\sum _{p\in P}u(p)p.$$  (2) 
We now have that
$$\begin{array}{cc}\hfill \sum _{p\in C}w(p)p& =\sum _{{\mu}_{i}\in \stackrel{~}{\mu}}\sum _{p\in {P}_{i}}\frac{\stackrel{~}{w}({\mu}_{i})u(p)}{U({P}_{i})}\cdot p=\sum _{{\mu}_{i}\in \stackrel{~}{\mu}}\stackrel{~}{w}({\mu}_{i})\sum _{p\in {P}_{i}}\frac{u(p)}{U({P}_{i})}p=\sum _{{\mu}_{i}\in \stackrel{~}{\mu}}\stackrel{~}{w}({\mu}_{i}){\mu}_{i}\hfill \\ & =\sum _{i=1}^{k}U({P}_{i})\cdot {\mu}_{i}=\sum _{p\in P}u(p)p.\hfill \end{array}$$  (3) 
where the first equality holds by the definitions of $C$ and $w$, the third equality holds by the definition of ${\mu}_{i}$ at Line 2, the fourth equality is by (1), and the last equality is by (2).
We also have that the new sum of weights is equal to
$$\sum _{p\in C}w(p)=\sum _{{\mu}_{i}\in \stackrel{~}{\mu}}\sum _{p\in {P}_{i}}\frac{\stackrel{~}{w}({\mu}_{i})u(p)}{U({P}_{i})}=\sum _{{\mu}_{i}\in \stackrel{~}{\mu}}\frac{\stackrel{~}{w}({\mu}_{i})}{U({P}_{i})}\cdot \sum _{p\in {P}_{i}}u(p)=\sum _{{\mu}_{i}\in \stackrel{~}{\mu}}\frac{\stackrel{~}{w}({\mu}_{i})}{U({P}_{i})}\cdot U({P}_{i})=\sum _{{\mu}_{i}\in \stackrel{~}{\mu}}\stackrel{~}{w}({\mu}_{i})=1.$$  (4) 
Combining (3) and (4) yields that the weighted $(C,w)$ computed before the recursive call at Line 2 of the algorithm is a Caratheodory set for the weighted input set $(P,u)$. Since at each iteration we either return such a Caratheodory set $(C,w)$ at Line 2 or return the input weighted set $(P,u)$ itself at Line 2, by induction we get that the output weighted set of a call to $\mathrm{\text{FastCaratheodorySet}}(P,u,k)$ is a Caratheodory set for the original input $(P,u)$.
By (1) we have that $C$ contains at most $C\le (d+1)\cdot \frac{n}{k}=n\cdot \frac{d+1}{k}$ points. Hence, there are at most ${\mathrm{log}}_{\frac{k}{d+1}}(n)$ recursive calls before the stopping condition at line 2 is met. The time complexity of each iteration is ${n}^{\prime}+t(k,d)$ where ${n}^{\prime}=P\cdot d$ is the number of points in the current iteration. Thus the total time complexity is
$$\sum _{i=1}^{\mathrm{log}(n)}\frac{nd}{{2}^{i1}}+t(k,d)\le 2nd+{\mathrm{log}}_{\frac{k}{d+1}}(n)\cdot t(k,d)\in O\left(nd+\frac{\mathrm{log}n}{\mathrm{log}(k/(d+1))}\cdot t(k,d)\right).$$ 
∎
\nl$\{{P}_{1},\mathrm{\cdots},{P}_{k}\}:=$  a partition of $P$ into $k$ disjoint subsets (clusters), each contains at most $n/k$ points. 
Theorem 3.2.
Let $A\mathrm{\in}{\mathrm{R}}^{n\mathrm{\times}d}$ be a matrix, and $k\mathrm{\ge}{d}^{\mathrm{2}}\mathrm{+}\mathrm{2}$ be an integer. Let $S\mathrm{\in}{\mathrm{R}}^{\mathrm{(}{d}^{\mathrm{2}}\mathrm{+}\mathrm{1}\mathrm{)}\mathrm{\times}d}$ be the output of a call to $\mathrm{\text{CaratheodoryMatrix}}\mathit{}\mathrm{(}A\mathrm{,}k\mathrm{)}$; see Algorithm 3. Let $t\mathit{}\mathrm{(}k\mathrm{,}d\mathrm{)}$ be the computation time of Caratheodory given $k$ point in ${\mathrm{R}}^{{d}^{\mathrm{2}}}$. Then $S$ satisfies that ${A}^{T}\mathit{}A\mathrm{=}{S}^{T}\mathit{}S$. Furthermore, $S$ can be computed in $O\mathit{}\mathrm{(}n\mathit{}{d}^{\mathrm{2}}\mathrm{+}t\mathit{}\mathrm{(}k\mathrm{,}{d}^{\mathrm{2}}\mathrm{)}\mathrm{\cdot}\frac{\mathrm{log}\mathit{}n}{\mathrm{log}\mathrm{(}k\mathrm{/}{d}^{\mathrm{2}}\mathrm{)}\mathrm{)}}\mathrm{)}$ time.
Proof.
We use the notation and variable names as defined in Algorithm 3.
Since $(C,w)$ at Line 3 of Algorithm 3 is the output of a call to $\mathrm{\text{FastCaratheodorySet}}(P,u,k)$, by Theorem 3.1 we have that: (i) the weighted means of $(C,w)$ and $(P,u)$ are equal, i.e.,
$$\sum _{p\in P}u(p)\cdot p=\sum _{p\in C}w(p)\cdot p,$$  (5) 
(ii) $C\le {d}^{2}+1$ since $P\subseteq {\mathbb{R}}^{({d}^{2})}$, and (iii) $C$ is computed in $O(n{d}^{2}+{\mathrm{log}}_{\frac{k}{{d}^{2}+1}}(n)\cdot t(k,{d}^{2}))$ time.
Combining (5) with the fact that ${p}_{i}$ is simply the concatenation of the entries of ${a}_{i}{a}_{i}^{T}$, we get that
$$\sum _{{p}_{i}\in P}u({p}_{i}){a}_{i}{a}_{i}^{T}=\sum _{{p}_{i}\in C}w({p}_{i})\cdot {a}_{i}{a}_{i}^{T}.$$  (6) 
By the definition of $S$ on Line 3, we have that
$${S}^{T}S=\sum _{{p}_{i}\in C}(\sqrt{n\cdot w({p}_{i})}\cdot {a}_{i}){(\sqrt{n\cdot w({p}_{i})}\cdot {a}_{i})}^{T}=n\cdot \sum _{{p}_{i}\in C}w({p}_{i})\cdot {a}_{i}{a}_{i}^{T}.$$  (7) 
We also have that
$${A}^{T}A=\sum _{i=1}^{n}{a}_{i}{a}_{i}^{T}=n\cdot \sum _{{p}_{i}\in P}(1/n){a}_{i}{a}_{i}^{T}=n\cdot \sum _{{p}_{i}\in P}u({p}_{i}){a}_{i}{a}_{i}^{T},$$  (8) 
where the second derivation holds since $u\equiv 1/n$.
4 Experimental Results
In this section we apply our fast construction of the Carathoodory Set $S$ from previous section to boost the running time of common LMS solvers in Table 1 by a factor of ten to hundreds, or to improve their numerical accuracy by a similar factor to support, e.g. 32 bit floating point representation. This is by running the given solver as a black box on the small matrix $C$ that is returned by Algorithms 5–8, which is based on $S$. That is, our algorithm does not compete with existing solvers but relies on them, which is why we called it "booster". Open code for our algorithms is provided Maalouf u. a. (2019).
From Caratheodory Matrix to LMS solvers.
As stated in Theorem 3.2, Algorithm 3 gets an input matrix $A\in {\mathbb{R}}^{n\times d}$ and an integer $k>d+1$, and returns a matrix $S\in {\mathbb{R}}^{({d}^{2}+1)\times d}$ of the same covariance ${A}^{T}A={S}^{T}S$, where $k$ is a parameter for setting the desired accuracy. To "learn" a given label vector $b\in {\mathbb{R}}^{n}$, Algorithm 4 partitions the matrix ${A}^{\prime}=(A\mid b)$ into $m$ partitions, computes using Algorithm 3 a subset for each partition that preserves its covariance matrix, and returns the union of subsets as a pair $(C,y)$ where $C\in {\mathbb{R}}^{(m{(d+1)}^{2}+m)\times d}$ and $y\in {\mathbb{R}}^{m{(d+1)}^{2}+m}$. For $m=1$, it is easy to see that for every $x\in {\mathbb{R}}^{d}$,
$$\begin{array}{c}\hfill \parallel Axb\parallel =\parallel {A}^{\prime}{(x\mid 1)}^{T}\parallel =\parallel S{(x\mid 1)}^{T}\parallel =\parallel (C\mid y){(x\mid 1)}^{T}\parallel =\parallel Cxy\parallel ,\end{array}$$  (9) 
where the third and fourth equalities are by Theorem 3.2 and the construction of $C$, respectively. This enables us to replace the original pair $(A,b)$ by the smaller pair $(C,y)$ for the solvers in Table 1 as in Algorithms 5–8. A scaling factor $\beta $ is also needed in Algorithms 7–8.
Cross validation.
To select the value of the regularization term $\alpha $, the existing Python solvers we used partition the rows of $A$ into $m$ folds (subsets) and run the solver $m\cdot \mathbb{A}$ times for every fold and $\alpha \in \mathbb{A}$ to select the desired $\alpha $; see Kohavi u. a. (1995) for details. For consistency, Algorithm 4 computes a coreset for each of these $m$ folds in Line 4 and concatenate them in Line 4. Thus, (9) holds similarly for $m>1$.
The experiments.
We evaluated the algorithms in Table 1 using the common Python’s solvers in its right two columns. Most of these experiments were repeated twice: using the default CPython distribution Wikipedia contributors (2019a) and Intel’s distribution LTD (2019) of Python. All the experiments were conducted on a standard Lenovo Z70 laptop with an Intel i75500U CPU @ 2.40GHZ and 16GB RAM. We used the $3$ following realworld datasets in our experiments:

(i)
3D Road Network (North Jutland, Denmark) Data Set Kaul u. a. (2013). It contains $434874$ records. We used the $2$ attributes: (Web Mercaptor (Google format) longitude [real], Web Mercaptor (Google format) latitude [real]) to predict the attribute (Height in meters [real]).

(ii)
Individual household electric power consumption Data Set dataset:power (2012). It contains $2075259$ records. We used the $2$ attributes: (global active power [kilowatt  real], global reactive power [kilowatt  real]) to predict the attribute (voltage [volt  real]).

(iii)
House Sales in King County, USA dataset:sales (2015). It contains $21,600$ records. We used the following $8$ attributes: (bedrooms [integer], sqft living [integer], sqft lot [integer], floors [integer], waterfront [boolean], sqft above [integer], sqft basement [integer], year built [integer]) to predict the (house price [integer]) attribute.
The synthetic data is an $n\times d$ matrix $A$ and vector $b$ of length $n$, both of random entries in $[0,1000]$. As expected by the analysis, since our compression introduces no error to the computation accuracy, the actual values of the data had no affect on the results, unlike the size of the input which affects the computation time. Table 2 summarizes the experimental results.
4.1 Discussion
Why running time is faster?
The number of rows in the reduced matrix $C$ is $O({d}^{2})$, which is usually much smaller than the original matrix $A$. This also explains why some coresets (dashed red line) failed for small values of $n$ in Fig. 1(b),1(c),3(b) and 3(c). The construction of $C$ takes $O(n{d}^{2})$. Solving Linear Regression takes the same time, with or without the coreset. However, the constant are much smaller since the time for computing $C$ becomes neglected for large values of $n$, as shown in Fig. 11. We emphasize that, unlike common coresets, there is no accuracy loss due to the use of our coreset, ignoring $\pm {10}^{15}$ additive errors/improvements. The improvement factor in running time due to our booster is in order of up to x10. The contribution of the coreset is much significant, already for smaller values of $n$, when it boosts other solvers that use cross validation for parameter tuning as explained above. In this case, the time complexity reduced by a factor of $m\cdot \mathbb{A}$ since the coreset is computed only once for each of the $m$ folds, regardless of the size $\mathbb{A}$. In this case, the running time improvement is between x10–x100.
As shown in the graphs, the computations via Intel’s Python distribution reduced the running times by 1540%, with or without the booster, probably due to its tailored implementation for our hardware.
Why numerical stability is better?
A sketch that simply sums the $1$rank matrices of outer products of rows in the input matrix ${A}^{\prime}=(A\mid b)$ yields its covariance matrix $B={A}^{\prime T}{A}^{\prime}$. The Cholesky decomposition $B={L}^{T}L$ then returns a small matrix $L\in {\mathbb{R}}^{d\times d}$ that can be plugged to the solvers, similarly to our coreset. This algorithm which we call SKETCH + CHOLESKY is so simple and there is no hope to improve its running time via our much more involved booster. Nevertheless, it is numerically unstable for the reasons that are explained in Section 1.1. In fact, on most of our experiments we could not even apply this technique at all using 32bit floating point representation. This is because the resulting approximation to ${A}^{\prime T}{A}^{\prime}$ was not a positive definite matrix as required by the Cholesky Decomposition, and we could not compute the matrix $L$ at all. In case of success, the running time of the booster was slower by at most a factor of $2$ but even in these cases numerical accuracy is improved up to orders of magnitude; See Fig. 9(b)–9(a) for histogram of errors using such 32bit float representation which is especially common in GPUs for saving memory, running time and power Wikipedia contributors (2019b). For the special case of Linear regression, we can avoid Cholesky decomposition and compute the solution ${({A}^{T}A)}^{1}{A}^{T}b$ directly after maintaining such a sketch for ${A}^{T}A$ and ${A}^{T}b$. However, this sketch wich we call SKETCH + INVERSE still has large numerical issues compared to our coreset computation as shown in Fig. 9(b)–9(a).
\nl$C:=$  the first $d$ columns of $S$. 
\nl$y:=$  the last column of $S$. 
5 Conclusion and Future Work
We presented a novel framework that combines sketches and coresets. As an example application, we proved that the set from the Caratheodory Theorem can be computed in $O(nd)$ overall time for sufficiently large $n$ (for a set of $n$ points in ${\mathbb{R}}^{d}$). This is instead of $O({n}^{2}{d}^{2})$ time as in the original theorem. We then generalized the result for a matrix $S$ whose rows are a weighted subset of the input matrix and their covariance matrix is the same. Our experimental results section shows how to significantly boost the numerical stability or running time of existing LMS solvers by applying them on $S$. Future work includes: (a) applications of our framework to combine other sketchcoreset pairs e.g. as listed in Phillips (2016), (b) Experiments for streaming/distributed/GPU data and other potential applications such as for Deep Learning e.g. as part of the Stochastic Gradient Descent that uses the LMS adaptive filter Widrow u. a. (1977); Mandic (2004), and (c) experiments with higher dimensional data: we may compute each of the $O({d}^{2})$ entries in the covariance matrix by calling our algorithm with $d=2$ and the corresponding pair of columns in the $d$ columns of the input matrix.
References
 dataset:power (2012) \[email protected]dataset:power 2012 : Individual household electric power consumption Data Set . https://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption. 2012
 dataset:sales (2015) \[email protected]dataset:sales 2015 : House Sales in King County, USA. https://www.kaggle.com/harlfoxem/housesalesprediction. 2015
 Afrabandpey u. a. (2016) \[email protected]Afrabandpey u. a. 2016 Afrabandpey, Homayun ; Peltola, Tomi ; Kaski, Samuel: Regression Analysis in SmallnLargep Using Interactive Prior Elicitation of Pairwise Similarities. In: FILM 2016, NIPS Workshop on Future of Interactive Learning Machines, 2016
 Agarwal u. a. (2004) \[email protected]Agarwal u. a. 2004 Agarwal, Pankaj K. ; HarPeled, Sariel ; Varadarajan, Kasturi R.: Approximating extent measures of points. In: Journal of the ACM (JACM) 51 (2004), Nr. 4, S. 606–635
 Bauckhage (2015) \[email protected]Bauckhage 2015 Bauckhage, Christian: NumPy/SciPy Recipes for Data Science: Ordinary Least Squares Optimization. In: researchgate. net, Mar (2015)
 Bjorck (1967) \[email protected]Bjorck 1967 Bjorck, Ake: Solving linear least squares problems by GramSchmidt orthogonalization. In: BIT Numerical Mathematics 7 (1967), Nr. 1, S. 1–21
 Carathéodory (1907) \[email protected]Carathéodory 1907 Carathéodory, Constantin: Über den Variabilitätsbereich der Koeffizienten von Potenzreihen, die gegebene Werte nicht annehmen. In: Mathematische Annalen 64 (1907), Nr. 1, S. 95–115
 Clarkson und Woodruff (2009) \[email protected]Clarkson und Woodruff 2009 Clarkson, Kenneth L. ; Woodruff, David P.: Numerical linear algebra in the streaming model. In: Proceedings of the fortyfirst annual ACM symposium on Theory of computing ACM (Veranst.), 2009, S. 205–214
 Cook und Webster (1972) \[email protected]Cook und Webster 1972 Cook, WD ; Webster, RJ: Caratheodory’s theorem. In: Canadian Mathematical Bulletin 15 (1972), Nr. 2, S. 293–293
 Copas (1983) \[email protected]Copas 1983 Copas, John B.: Regression, prediction and shrinkage. In: Journal of the Royal Statistical Society: Series B (Methodological) 45 (1983), Nr. 3, S. 311–335
 Cormen u. a. (2009) \[email protected]Cormen u. a. 2009 Cormen, Thomas H. ; Leiserson, Charles E. ; Rivest, Ronald L. ; Stein, Clifford: Introduction to algorithms. MIT press, 2009
 Drineas u. a. (2006) \[email protected]Drineas u. a. 2006 Drineas, Petros ; Mahoney, Michael W. ; Muthukrishnan, Shan: Sampling algorithms for l 2 regression and applications. In: Proceedings of the seventeenth annual ACMSIAM symposium on Discrete algorithm Society for Industrial and Applied Mathematics (Veranst.), 2006, S. 1127–1136
 Feldman u. a. (2010) \[email protected]Feldman u. a. 2010 Feldman, Dan ; Monemizadeh, Morteza ; Sohler, Christian ; Woodruff, David P.: Coresets and sketches for high dimensional subspace approximation problems. In: Proceedings of the twentyfirst annual ACMSIAM symposium on Discrete Algorithms Society for Industrial and Applied Mathematics (Veranst.), 2010, S. 630–649
 Feldman u. a. (2016) \[email protected]Feldman u. a. 2016 Feldman, Dan ; Volkov, Mikhail ; Rus, Daniela: Dimensionality Reduction of Massive Sparse Datasets Using Coresets. In: Advances in neural information processing systems (NIPS), 2016
 Gallagher u. a. (2017) \[email protected]Gallagher u. a. 2017 Gallagher, Neil ; Ulrich, Kyle R. ; Talbot, Austin ; Dzirasa, Kafui ; Carin, Lawrence ; Carlson, David E.: Crossspectral factor analysis. In: Advances in Neural Information Processing Systems, 2017, S. 6842–6852
 Golub und Reinsch (1971) \[email protected]Golub und Reinsch 1971 Golub, Gene H. ; Reinsch, Christian: Singular value decomposition and least squares solutions. In: Linear Algebra. Springer, 1971, S. 134–151
 Golub und Van Loan (2012) \[email protected]Golub und Van Loan 2012 Golub, Gene H. ; Van Loan, Charles F.: Matrix computations. Bd. 3. JHU press, 2012
 Hoerl und Kennard (1970) \[email protected]Hoerl und Kennard 1970 Hoerl, Arthur E. ; Kennard, Robert W.: Ridge regression: Biased estimation for nonorthogonal problems. In: Technometrics 12 (1970), Nr. 1, S. 55–67
 Jolliffe (2011) \[email protected]Jolliffe 2011 Jolliffe, Ian: Principal component analysis. Springer, 2011
 Kang u. a. (2011) \[email protected]Kang u. a. 2011 Kang, Byung ; Lim, Woosang ; Jung, Kyomin: Scalable kernel Kmeans via centroid approximation. In: Proc. NIPS, 2011
 Kaul u. a. (2013) \[email protected]Kaul u. a. 2013 Kaul, Manohar ; Yang, Bin ; Jensen, Christian S.: Building accurate 3d spatial networks to enable next generation intelligent transportation systems. In: 2013 IEEE 14th International Conference on Mobile Data Management Bd. 1 IEEE (Veranst.), 2013, S. 137–146
 Kohavi u. a. (1995) \[email protected]Kohavi u. a. 1995 Kohavi, Ron u. a.: A study of crossvalidation and bootstrap for accuracy estimation and model selection. In: Ijcai Bd. 14 Montreal, Canada (Veranst.), 1995, S. 1137–1145
 Laparra u. a. (2015) \[email protected]Laparra u. a. 2015 Laparra, Valero ; Malo, Jesús ; CampsValls, Gustau: Dimensionality reduction via regression in hyperspectral imagery. In: IEEE Journal of Selected Topics in Signal Processing 9 (2015), Nr. 6, S. 1026–1036
 Liang u. a. (2013) \[email protected]Liang u. a. 2013 Liang, Yingyu ; Balcan, MariaFlorina ; Kanchanapally, Vandana: Distributed PCA and kmeans clustering. In: The Big Learning Workshop at NIPS Bd. 2013 Citeseer (Veranst.), 2013
 LTD (2019) \[email protected]LTD 2019 LTD, Intel: Accelerate Python* Performance. https://software.intel.com/enus/distributionforpython. 2019
 Maalouf u. a. (2019) \[email protected]Maalouf u. a. 2019 Maalouf, Alaa ; Jubran, Ibrahim ; Feldman, Dan: Open code for the algorithms in this paper. 2019. – Open code will be provided upon the publication of this paper.
 Mandic (2004) \[email protected]Mandic 2004 Mandic, Danilo P.: A generalized normalized gradient descent algorithm. In: IEEE signal processing letters 11 (2004), Nr. 2, S. 115–118
 Nasser u. a. (2015) \[email protected]Nasser u. a. 2015 Nasser, Soliman ; Jubran, Ibrahim ; Feldman, Dan: Coresets for Kinematic Data: From Theorems to RealTime Systems. In: arXiv preprint arXiv:1511.09120 (2015)
 Pearson (1900) \[email protected]Pearson 1900 Pearson, Karl: X. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. In: The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 50 (1900), Nr. 302, S. 157–175
 Peng u. a. (2015) \[email protected]Peng u. a. 2015 Peng, Xi ; Yi, Zhang ; Tang, Huajin: Robust subspace clustering via thresholding ridge regression. In: TwentyNinth AAAI Conference on Artificial Intelligence, 2015
 Phillips (2016) \[email protected]Phillips 2016 Phillips, Jeff M.: Coresets and sketches. In: arXiv preprint arXiv:1601.00617 (2016)
 Porco u. a. (2015) \[email protected]Porco u. a. 2015 Porco, Aldo ; Kaltenbrunner, Andreas ; Gómez, Vicenç: Lowrank approximations for predicting voting behaviour. In: Workshop on Networks in the Social and Information Sciences, NIPS, 2015
 Safavian und Landgrebe (1991) \[email protected]Safavian und Landgrebe 1991 Safavian, S R. ; Landgrebe, David: A survey of decision tree classifier methodology. In: IEEE transactions on systems, man, and cybernetics 21 (1991), Nr. 3, S. 660–674
 Seber und Lee (2012) \[email protected]Seber und Lee 2012 Seber, George A. ; Lee, Alan J.: Linear regression analysis. Bd. 329. John Wiley & Sons, 2012
 Tibshirani (1996) \[email protected]Tibshirani 1996 Tibshirani, Robert: Regression shrinkage and selection via the lasso. In: Journal of the Royal Statistical Society: Series B (Methodological) 58 (1996), Nr. 1, S. 267–288
 Widrow u. a. (1977) \[email protected]Widrow u. a. 1977 Widrow, Bernard ; McCool, John ; Larimore, Michael G. ; Johnson, C R.: Stationary and nonstationary learning characteristics of the LMS adaptive filter. In: Aspects of Signal Processing. Springer, 1977, S. 355–393
 Wikipedia contributors (2019a) \[email protected]Wikipedia contributors 2019a Wikipedia contributors: CPython — Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/w/index.php?title=CPython&oldid=896388498. 2019
 Wikipedia contributors (2019b) \[email protected]Wikipedia contributors 2019b Wikipedia contributors: List of Nvidia graphics processing units — Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/w/index.php?title=List_of_Nvidia_graphics_processing_units&oldid=897973746. 2019
 Zhang und Rohe (2018) \[email protected]Zhang und Rohe 2018 Zhang, Yilin ; Rohe, Karl: Understanding Regularized Spectral Clustering via Graph Conductance. In: Advances in Neural Information Processing Systems, 2018, S. 10631–10640
 Zou und Hastie (2005) \[email protected]Zou und Hastie 2005 Zou, Hui ; Hastie, Trevor: Regularization and variable selection via the elastic net. In: Journal of the royal statistical society: series B (statistical methodology) 67 (2005), Nr. 2, S. 301–320