Abstract
Traditional online algorithms encapsulate decision making under uncertainty,and give ways to hedge against all possible future events, while guaranteeing anearly optimal solution as compared to an offline optimum. On the other hand,machine learning algorithms are in the business of extrapolating patterns foundin the data to predict the future, and usually come with strong guarantees onthe expected generalization error. In this work we develop a framework for augmenting online algorithms with amachine learned oracle to achieve competitive ratios that provably improve uponunconditional worst case lower bounds when the oracle has low error. Ourapproach treats the oracle as a complete black box, and is not dependent on itsinner workings, or the exact distribution of its errors. We apply this framework to the traditional caching problem  creating aneviction strategy for a cache of size $k$. We demonstrate that naivelyfollowing the oracle's recommendations may lead to very poor performance, evenwhen the average error is quite low. Instead we show how to modify the Markeralgorithm to take into account the oracle's predictions, and prove that thiscombined approach achieves a competitive ratio that both (i) decreases as theoracle's error decreases, and (ii) is always capped by $O(\log k)$, which canbe achieved without any oracle input. We complement our results with anempirical evaluation of our algorithm on real world datasets, and show that itperforms well empirically even using simple offtheshelf predictions.
Quick Read (beta)
Competitive caching with machine learned advice
Abstract
Traditional online algorithms encapsulate decision making under uncertainty and give ways to hedge against all possible future events, while guaranteeing a nearly optimal solution, as compared to an offline optimum. On the other hand, machine learning algorithms are in the business of extrapolating patterns found in the data to predict the future, and usually come with strong guarantees on the expected generalization error.
In this work we develop a framework for augmenting online algorithms with a machine learned oracle to achieve competitive ratios that provably improve upon unconditional worst case lower bounds when the oracle has low error. Our approach treats the oracle as a complete black box, and is not dependent on its inner workings, or the exact distribution of its errors.
We apply this framework to the traditional caching problem—creating an eviction strategy for a cache of size $k$. We demonstrate that naively following the oracle’s recommendations may lead to very poor performance, even when the average error is quite low. Instead we show how to modify the Marker algorithm to take into account the oracle’s predictions, and prove that this combined approach achieves a competitive ratio that both decreases as the oracle’s error decreases, and is always capped by $O(\mathrm{log}k)$, which can be achieved without any oracle input. We complement our results with an empirical evaluation of our algorithm on real world datasets, and show that it performs well empirically even when using simple offtheshelf predictions.
1 Introduction
Despite the success and prevalence of machine learned systems across many application domains, there are still a lot of hurdles that one needs to overcome to deploy an ML system in practice [SHG${}^{+}$15]. As these systems are rarely perfect, a key challenge is dealing with errors that inevitably arise.
There are many reasons that learned systems may exhibit errors when deployed. First, most of them are trained to be good on average, minimizing some expected loss. In doing so, the system may invest its efforts in reducing the error on the majority of inputs, at the expense of increased error on a handful of outliers. Another problem is that generalization error guarantees only apply when the train and test examples are drawn from the same distribution. If this assumption is violated, either due to distribution drift or adversarial examples [SZS${}^{+}$14], the machine learned predictions may be very far from the truth. In all cases, any system backed by machine learning needs to be robust enough to handle occasional errors.
While machine learning is in the business of predicting the unknown, online algorithms provide guidance on how to act without any knowledge of future inputs. These powerful methods show how to hedge decisions so that, regardless of what the future holds, the online algorithm performs nearly as well as the optimal offline algorithm. However these guarantees come at a cost: since they protect against the worst case, online algorithms may be overly cautious, which translates to high competitive ratios even for seemingly simple problems.
In this work we ask:
What if the online algorithm is equipped with a machine learned oracle? How can one use this oracle to combine the predictive power of machine learning with the robustness of online algorithms?
We focus on a prototypical example of this area: the online paging, or caching problem. In this setting, a series of requests arrives one at a time to a server equipped with a small amount of memory. Upon processing a request, the server places the answer in the memory (in case an identical request comes in the future). Since the local memory has limited size, the server must decide which of the current elements to evict. It is well known that if the local memory or cache has size $k$, then any deterministic algorithm incurs competitive ratio $\mathrm{\Omega}(k)$. However, an $O(k)$ bound can be also achieved by almost any reasonable strategy, thus this metric fails to distinguish between algorithms that perform well in practice and those that perform poorly. The competitive ratio of the best randomized algorithm is $\mathrm{\Theta}(\mathrm{log}k)$ which, despite far from trivial, is also much higher than what is observed on real inputs.
In contrast, we show how to use machine learned predictions to achieve a competitive ratio of $2+O(\mathrm{min}(\sqrt{\eta /\mathrm{\text{Opt}}},\mathrm{log}k)$, when using an oracle with total absolute loss $\eta $ (see Section 3.2 for a precise statement of results). Thus when the predictions are accurate (small $\eta $), our approach circumvents the worst case lower bounds. On the other hand, even when the oracle is inaccurate (large $\eta $), the performance degrades gracefully to almost match the worst case bound.
1.1 Our contribution
The conceptual contribution of the paper lies in formalizing the interplay between machine learning and competitive analysis by introducing a general framework (Section 2), and a set of desiderata for online algorithms that use a machine learned oracle.
We look for approaches that:

•
Make minimal assumptions on the machine learned oracle. Specifically, since most machine learning guarantees are on the expected performance, our results are parametric as a function of the error of the oracle, $\eta $, and not the distribution of the error.

•
Are robust: a better oracle (one with lower $\eta $) results in a smaller competitive ratio

•
Are worstcase competitive: no matter the performance of the oracle on the particular instance, the algorithm behaves comparably to the best online algorithm for the problem.
We instantiate the general framework to the online caching problem, specifying the prediction made by the oracle, and presenting an algorithm that uses that prediction effectively (Section 3.2). Along the way we show that algorithmic innovation is necessary: simply following the recommendations of the predictor may lead to poor performance, even when the average error is small (Section 3.1). Instead, we adapt the Marker algorithm [FKL${}^{+}$91] to carefully incorporate the feedback of the predictor. The resulting approach, which we call Predictive Marker has guarantees that capture the best of both worlds: the algorithm performs better as the error of the oracle decreases, but performs nearly as well as the best online algorithm in the worst case.
Our analysis generalizes to multiple loss functions (such as absolute loss and squared loss). This freedom in the loss function with the blackbox access to the oracle allows our results to be strengthened with future progress in machine learning and reduces the task of designing better algorithms to the task of finding better predictors.
We complement our theoretical findings with empirical results (Section 5). We test the performance of our algorithm on public data using offtheshelf machine learning models. We compare the performance to the Least Recently Used (LRU) algorithm, which serves as the gold standard, the original Marker algorithm, as well as directly using the predictor. In all cases, the Predictive Marker algorithm outperforms known approaches.
Before moving to the main technical content, we provide a simple example that highlights the main concepts of this work.
1.2 Example: Faster Binary Search
Consider the classical binary search problem. Given a sorted array $A$ on $n$ elements and a query element $q$, the goal is to either find the index of $q$ in the array, or state that it is not in the set. The textbook method is binary search: compare the value of $q$ to that of the middle element of $A$, and recurse on the correct half of the array. After $O(\mathrm{log}n)$ probes, the method either finds $q$ or returns.
Instead of applying binary search, one can train a classifier, $h$, to predict the position of $q$ in the array. (Although this may appear to be overly complex, Kraska et.al [KBC${}^{+}$17] empirically demonstrate the advantages of such a method.) How to use such a classifier? A simple approach is to first probe the location at $h(q)$; if $q$ is not found there, we immediately know whether it is smaller or larger. Suppose $q$ is larger than the element in $A[h(q)]$ and the array is sorted in increasing order. We probe elements at $h(q)+2,h(q)+4,h(q)+8$, and so on, until we find an element larger than $q$ (or we hit the end of the array). Then we simply apply binary search on the interval that’s guaranteed to contain $q$.
What is the cost of such an approach? Let $t(q)$ be the true position of $q$ in the array (or the position of the largest element smaller than $q$ if it is not in the set). The absolute loss of the classifier on $q$ is then ${\u03f5}_{q}=h(q)t(q)$. On the other hand, the cost of running the above algorithm starting at $h(q)$ is at most $2(\mathrm{log}h(q)t(q))=2\mathrm{log}{\u03f5}_{q}$.
If the queries $q$ come from a distribution, then the expected cost of the algorithm is:
$$2{\mathbb{E}}_{q}\left[\mathrm{log}\left(h(q)t(q)\right)\right]\le 2\mathrm{log}{\mathbb{E}}_{q}\left[h(q)t(q)\right]=2\mathrm{log}{\mathbb{E}}_{q}[{\u03f5}_{q}],$$ 
where the inequality follows by Jensen’s inequality. This gives a tradeoff between the performance of the algorithm and the absolute loss of the predictor. Moreover, since ${\u03f5}_{q}$ is trivially bounded by $n$, this shows that even relatively weak classifiers (those with average error of $\sqrt{n}$) this can lead to an improvement in asymptotic performance.
1.3 Related work
Our work builds upon the foundational work on competitive analysis and online algorithms; for a great introduction see the book by Borodin and ElYaniv [BEY98]. Specifically, we look at the standard caching problem, see, for example, [MR95]. While many variants of caching have been studied over the years, our main starting point will be the Marker algorithm by Fiat et al. [FKL${}^{+}$91].
As we mentioned earlier, competitive analysis fails to distinguish between algorithms that perform well in practice, and those that perform well only in theory. Several fixes have been proposed to address these concerns, ranging from resource augmentation, where the online algorithm has a larger cache than the offline optimum [ST85], to models of realworld inputs that restrict the inputs analyzed by the algorithm, for example, insisting on locality of reference [AFG02], or the more general Working Set model [Den68].
The idea of making assumptions on the nature of the input to prove better bounds is common in the literature. The most popular of these is that the data arrive in a random order. This is a critical assumption in the secretary problem, and, more generally, in other streaming algorithms, see for instance the survey by McGregor [McG14]. While the assumption leads to algorithms that give good insight into the structure of the problem, it rarely holds true, and is often very hard to verify.
Another common assumption on the structure of the input gives rise to Smoothed Analysis, introduced in a pioneering work by Spielman and Teng [ST04] explaining the practical efficiency of the Simplex method. Here, they assume that any worst case instance is perturbed slightly before being passed to the algorithm; the idea is that this perturbation may be due to measurement error, or some other noise inherent in the data. The goal then is to show that the worst case inputs are brittle, and do not survive addition of random noise. Since its introduction this method has been used to explain the unusual effectiveness of many practical algorithms such as ICP [AV06], Lloyd’s method [AMR11], and local search [ERV16], in the face of exponential worst case bounds.
The prior work that is closest in spirit to ours looks for algorithms that optimistically assume that the input has a certain structure, but also have worst case guarantees when that fails to be the case. One such assumption is that the data are coming from a stochastic distribution and was studied in the context of online matching [MGZ12] and bandit learning [BS12]; both of these works provide improved guarantees if the input is stochastic but retain the worstcase guarantees otherwise. Subsequent work has provided a graceful decay in performance when the input is mostly stochastic (analogous to our robustness property) both in the context online matching [EKM15] and bandit learning [LMPL18]. In a related note, Ailon et al. [ACC${}^{+}$11] consider “selfimproving” algorithms that effectively learn the input distribution, and adapt to be nearly optimal in that domain. Contrasting to these works, our approach utilizes a different structure in the data: the fact that the sequence can be predicted.
Our work is not the first to use predictions to enhance guarantees in online decisionmaking. Predictability as an assumption has also used been used in online learning by Rakhlin and Sridharan [RS13] where losses of next round are predicted and the guarantees scale with how erroneous these precitions are. Our focus is on competitive analysis approaches where requests affect the state of the system; as a result, a single misprediction can have longlasting effect on the system.
With respect to using predictions in competitive analysis, another approach was suggested by Mahdian et al. [MNS12], who assume the existence of an optimistic, highly competitive, algorithm, and then provide a meta algorithm with a competitive ratio that interpolates between that of the worstcase algorithm and that of the optimistic one. Although this sounds similar to our approach, one of our key challenges lies in developing an algorithm that can use the predictions effectively. As we show, naively following the predictions can lead to disastrous results.
In other words, we do not assume anything about the data, or the availability of good algorithms that work in restricted settings. Rather, we use the oracle to implicitly classify instances into “easy” and “hard” depending on their predictability. The “easy” instances are those on which the latest machine learning technology, be it perceptrons, decision trees, SVMs, Deep Neural Networks, GANs, LSTMs, or whatever else may come in the future, has small error. On these instances our goal is to take advantage of the oracle, and obtain low competitive ratios. (Importantly, our approach is completely agnostic to the inner workings of the predictor and treats it as a black box.) The “hard” instances, are those where the prediction quality is poor, and we have to rely more on classical competitive analysis to obtain good results.
A previous line of work has also considered the benefit of enhancing online algorithms with oracle advice (see [BFK${}^{+}$16] for a recent survey). This setting assumes access to an infallible oracle and studies the amount of information that is needed to achieve desired competitive ratio guarantees. Our work differs in two major regards. First, we do not assume that the oracle is perfect, as that is rarely the case in machine learning scenarios. Second, we study the tradeoff between oracle error and the competitive ratio, rather than focusing on the number of perfect predictions necessary.
Another avenue of research close to our setting asks what happens if the algorithm cannot view the whole input, but must rely on a sample of the input to make its choices. Introduced in the seminal work of Cole and Roughgarden [CR14], this notion of Learning from Samples, can be viewed as first designing a good prediction function, $h$, and then using it in the algorithms. Indeed, some of the follow up work [MR16, BRS17] proves tight bounds on precisely how many samples are necessary to achieve good approximation guarantees. In contrast, we assume that the online algorithm is given access to a machine learned oracle, but does not know any details of its inner workings—we know neither the average performance of the oracle, nor the distribution of the errors.
Very recently, two papers explored domains similar to ours. Medina and Vassilvitskii [MV17] showed how to use a machine learned oracle to optimize revenue in repeated posted price auctions. Their work has a mix of offline calculations and online predictions and focuses on the specific problem of revenue optimization. Kraska et al. [KBC${}^{+}$17] demonstrated empirically that introducing machine learned components to classical algorithms (in their case index lookups) can result in significant speed and storage gains. Unlike this work, their results are experimental, and they do not provide tradeoffs on the performance of their approach visàvis the error of the machine learned predictor.
Finally, since publication of the original paper, learning augmented algorithms has emerged as a rich area. Subsequently to our work, researchers have studied how to incorporate machine learned predictions in other settings such as ski rental and nonclairvoyant scheduling [PSK18], bin packing [ADJ${}^{+}$19], bloom filters [Mit18], and streaming algorithms [HIKV19].
2 Online Algorithms with Machine Learned Advice
In this section, we introduce a general framework for combining online algorithms with machine learning predictions, which we term Online with Machine Learned Advice model (OMLA). Before introducing the model, we review some basic notions from machine learning and online algorithms.
2.1 Preliminaries
Machine learning basics
We are given a feature space $\mathcal{X}$, describing the salient characteristics of each item and a set of labels $\mathcal{Y}$. An example is a pair $(x,y)$, where $x\in \mathcal{X}$ describes the specific features of the example, and $y\in \mathcal{Y}$ gives the corresponding label. In the binary search example, $x$ can be thought as the query element $q$ searched and $y$ as its true position $t(x)$.
A hypothesis is a mapping $h:\mathcal{X}\to \mathcal{Y}$ and can be probabilistic in which case the output on $x\in \mathcal{X}$ is some probabilistically chosen $y\in \mathcal{Y}$. In binary search, $h(x)$ corresponds to the predicted position of the query.
To measure the performance of a hypothesis, we first define a loss function $\mathrm{\ell}:\mathcal{Y}\times \mathcal{Y}\to {\mathbb{R}}^{\ge 0}$. When the labels lie in a metric space, we define absolute loss ${\mathrm{\ell}}_{1}(y,\widehat{y})=y\widehat{y}$, squared loss ${\mathrm{\ell}}_{2}(y,\widehat{y})={(y\widehat{y})}^{2}$, and, more generally, classification loss ${\mathrm{\ell}}_{c}(y,\widehat{y})={\mathrm{\U0001d7cf}}_{y\ne \widehat{y}}$.
Competitive analysis
To obtain worstcase guarantees for an online algorithm (that must make decisions as each element arrives), we compare its performance to that of an offline optimum (that has the benefit of hindsight). Let $\sigma $ be the input sequence of elements for a particular online decisionmaking problem, ${\mathrm{\mathit{c}\mathit{o}\mathit{s}\mathit{t}}}_{A}(\sigma )$ be the cost incurred by an online algorithm $\mathcal{A}$ on this input, and $\mathrm{\text{Opt}}(\sigma )$ be the cost incurred by the optimal offline algorithm. Then algorithm $\mathcal{A}$ has competitive ratio cr if for all sequences $\sigma $,
$${\mathrm{\mathit{c}\mathit{o}\mathit{s}\mathit{t}}}_{\mathcal{A}}(\sigma )\le \mathrm{\text{cr}}\cdot \mathrm{\text{Opt}}(\sigma ).$$ 
The Caching Problem
The caching (or online paging) problem considers a system with two levels of memory: a slow memory of size $m$ and a fast memory of size $k$. A caching algorithm is faced with a sequence of requests for elements. If the requested element is in the fast memory, a cache hit occurs and the algorithm can satisfy the request at no cost. If the requested item is not in the fast memory, a cache miss occurs, the algorithm fetches the item from the slow memory, and places it in the fast memory before satisfying the request. If the fast memory is full, then one of the items must be evicted. The eviction strategy forms the core of the problem. The goal is to find an eviction policy that results in the fewest number of cache misses.
It is well known that the optimal offline algorithm at time $t$ evicts the element from the cache that will arrive the furthest in the future; this is typically referred in the literature as Bélády’s optimal replacement paging algorithm [Bel66]. On the other hand, without the benefit of foresight, any deterministic caching algorithm achieves a competitive ratio of $\mathrm{\Omega}(k)$, and any randomized caching algorithm achieves a competitive ratio of $\mathrm{\Omega}(\mathrm{log}k)$ [MR95].
2.2 OMLA Definition
We first specify the input and the predictions made by the machine learned predictor $h\in \mathscr{H}$ from a hypothesis class $\mathscr{H}$. The online input consists of a set of elements $\mathcal{Z}$. For a specific input $\sigma $, its elements are denoted by $z({\sigma}_{1}),z({\sigma}_{2}),\mathrm{\dots}$ and its length by $\sigma $. Formalizing the machine learning task, we assume a feature space $\mathcal{X}$ and a label space $\mathcal{Y}$. The $i$th element $z({\sigma}_{i})$ has features $x({\sigma}_{i})\in \mathcal{X}$ and a label $y({\sigma}_{i})\in \mathcal{Y}$. For any element $i$, the predictor returns a predicted label $h(x({\sigma}_{i}))$. To ease notation we will also denote this by $h({\sigma}_{i})$. For ease of presentation, we assume that this mapping from features to labels is deterministic; our results extend to randomized mappings by applications of Jensen’s inequality (see Section 3.5).
In defining the framework we are not concerned with the semantics of the labels, i.e. what is the quantity that $h$ is predicting or how it was trained – we are only interested in its performance. The error of the predictor $h$ on a sequence $\sigma $ with respect to loss function $\mathrm{\ell}$ is therefore:
$${\eta}_{\mathrm{\ell}}(h,\sigma )=\sum _{i}\mathrm{\ell}(y({\sigma}_{i}),h({\sigma}_{i})).$$ 
Instantiated with the absolute loss function, the error of the predictor is ${\eta}_{{\mathrm{\ell}}_{1}}(h,\sigma )={\sum}_{i}y({\sigma}_{i})h({\sigma}_{i})$. We will use ${\eta}_{1}(h,\sigma )$ as a shorthand for this absolute loss.
Definition 1.
The Online with Machine Learned Advice (OMLA) instance consists of:

•
An input $\sigma =\{z({\sigma}_{1}),z({\sigma}_{2}),\mathrm{\dots},z({\sigma}_{\sigma })$; each $z({\sigma}_{i})\in \mathcal{Z}$ has features $x({\sigma}_{i})\in \mathcal{X}$ and labels $y({\sigma}_{i})\in \mathcal{Y}$.

•
A predictor $h:\mathcal{X}\to \mathcal{Y}$ that predicts a label $h({\sigma}_{i})$ for each $x({\sigma}_{i})\in \mathcal{X}$.

•
The error of predictor $h$ at sequence $\sigma $ w.r.t. loss $\mathrm{\ell}$, ${\eta}_{\mathrm{\ell}}(h,\sigma )$.
Our goal is to create online algorithms that, when augmented with a predictor $h$, can use its advice to achieve an improved competitive ratio. To evaluate how well an algorithm $\mathcal{A}$ performs with respect to this task, we extend the definition of competitive ratio to be a function of the predictor’s error. We first define the set of predictors that are sufficiently accurate.
Definition 2.
For a fixed optimization problem $\mathrm{\Pi}$, let ${\mathrm{\text{Opt}}}_{\mathrm{\Pi}}\mathit{}\mathrm{(}\sigma \mathrm{)}$ denote the value of the optimal solution on the input $\sigma $. We say that a predictor $h$ is $\u03f5$accurate with respect to a loss function $\mathrm{\ell}$ for problem $\mathrm{\Pi}$ if for any $\sigma $:
$${\eta}_{\mathrm{\ell}}(h,\sigma )\le \u03f5\cdot {\mathrm{\text{Opt}}}_{\mathrm{\Pi}}(\sigma ).$$ 
We will use ${\mathrm{H}}_{\mathrm{\ell}}\mathit{}\mathrm{(}\u03f5\mathrm{)}$ to denote the class of $\u03f5$accurate predictors, omitting the quantifier on $\mathrm{\Pi}$ for notational clarity.
At first glance, it may appear unnatural to tie the error of the prediction to the value of the optimal solution. However, our goal is to have a definition that is invariant to simple padding arguments. For instance, consider a sequence ${\sigma}^{\prime}=\sigma \sigma $, which concatenates two copies of an instance $\sigma $. It is clear that the prediction error of any predictor doubles, but this is not due to the predictor suddenly being worse. One could instead normalize the prediction error by the length of the sequence, but in many problems, including caching, one can artificially increase the length of the sequence without impacting the value of the optimum solution, or the impact of predictions. Normalizing by the value of the optimum addresses both of these problems.
Call an algorithm $\mathcal{A}$ $\u03f5$assisted if it has access to an $\u03f5$accurate predictor. The competitive ratio of an $\u03f5$assisted algorithm is itself a function of $\u03f5$.
Definition 3.
Let ${\mathrm{\text{cr}}}_{\mathrm{A}\mathit{}\mathrm{(}h\mathrm{)}}\mathit{}\mathrm{(}\sigma \mathrm{)}$ be the competitive ratio of algorithm $\mathrm{A}$ which uses a predictor $h$ when applied on sequence $\sigma $. The competitive ratio of an $\u03f5$assisted algroithm $\mathrm{A}$ is:
$${\mathrm{\text{cr}}}_{\mathcal{A},\mathrm{\ell}}(\u03f5)=\underset{\sigma ,h\in {\mathscr{H}}_{\mathrm{\ell}}(\u03f5)}{\mathrm{max}}{\mathrm{\text{cr}}}_{\mathcal{A}(h)}(\sigma ).$$ 
We now define the desiderata that we wish our algorithm to satisfy. We would like our algorithm to perform as well as the offline optimum when the predictor is perfect, degrade gracefully with the error of the predictor, and perform as well as the best online algorithm regardless of the error of the predictor. We define these properties formally.
Definition 4.
$\mathcal{A}$ is $\beta $consistent if ${\mathrm{\text{cr}}}_{\mathrm{A}\mathrm{,}\mathrm{\ell}}\mathit{}\mathrm{(}\mathrm{0}\mathrm{)}\mathrm{=}\beta $.
Definition 5.
$\mathcal{A}$ is $\alpha $robust for a function $\alpha \mathit{}\mathrm{(}\mathrm{\cdot}\mathrm{)}$, if ${\mathrm{\text{cr}}}_{\mathrm{A}\mathrm{,}\mathrm{\ell}}\mathit{}\mathrm{(}\u03f5\mathrm{)}\mathrm{=}O\mathit{}\mathrm{(}\alpha \mathit{}\mathrm{(}\u03f5\mathrm{)}\mathrm{)}$.
Definition 6.
$\mathcal{A}$ is $\gamma $competitive if ${\mathrm{\text{cr}}}_{\mathrm{A}\mathrm{,}\mathrm{\ell}}\mathit{}\mathrm{(}\u03f5\mathrm{)}\mathrm{\le}\gamma $ for all values of $\u03f5$.
Our goal is to find algorithms that simultaneously optimize the aforementioned three properties. They are ideally $1$consistent: recovering the optimal solution when the predictor is perfect. They are $\alpha (\cdot )$robust for a slow growing function $\alpha $: seamlessly handling errors in the predictor. Finally, they are worstcase competitive: they perform as well as the best online algorithms even when the predictor’s error is high.
2.3 Caching with ML Advice
To instantiate the framework for the caching problem, we need to define the oracle, the label space of the predictions, and their semantics. The element space $\mathcal{Z}$ corresponds to the universe of elements that may be requested. The input sequence $\sigma ={\sigma}_{1},{\sigma}_{2},\mathrm{\dots},{\sigma}_{n}$ is the actual sequence of elements that are requested (fixed in advance and oblivious to the choices of the algorithm but unknown to it). Each element $z({\sigma}_{i})\in \mathcal{Z}$ has corresponding features $x({\sigma}_{i})$ . These can encapsulate everything that is known about $z({\sigma}_{i})$ at the time $i$, for example, the times this element arrived in the past. The exact choice of $\mathcal{X}$ is orthogonal to our setting, though of course richer features typically lead to smaller errors.
One of the design choices when adding ML advice to the problem is the question of what to predict. For caching problems, a natural candidate is predicting the next time a particular element is going to appear. It is well known [Bel66] that when such predictions are perfect, the online algorithm can recover the best offline optimum.
Formally, the label space $\mathcal{Y}$ is a set of positions in the sequence, $\mathcal{Y}={\mathbb{N}}^{+}$. Given a sequence $\sigma $, $y({\sigma}_{i})={\mathrm{min}}_{t>i}\left\{\tau :x({\sigma}_{t})=x({\sigma}_{i})\right\}$. If the element is never seen again, we set $y({\sigma}_{i})=n+1$. Note that $y({\sigma}_{i})$ is completely determined by the sequence $\sigma $. We use $h({\sigma}_{i})$ to denote the outcome of the prediction on an element with features $x({\sigma}_{i})$. Note that the feature is not only a function of the element identity $z({\sigma}_{i})$; when an element reappears, its features may be drastically different.
3 Main result: Predictive Marker
In this section, we describe the main result: an algorithm that satisfies the three desiderata of the previous section. Before describing our algorithm, we show that combining the predictions with ideas from competitive analysis is to a large extent essential; blindly evicting the element that is predicted the furthest in the future by the predictor (or simple modifications of this idea) can result to poor performance both with respect to robustness and competitiveness.
3.1 Blindly following the predictor is not sufficient
Evicting element predicted the furthest in the future.
An immediate way to use the predictor is to treat its output as truth and optimize based on the whole predicted sequence. This corresponds to the Bélády rule that evicts the element predicted to appear the furthest in the future. We refer to this algorithm as algorithm $\mathcal{B}$ (as it follows the Bélády rule). Since this rule achieves offline optimality, this approach is consistent, i.e. if the predictor is perfect, this algorithm is expost optimal. Unfortunately this approach does not have similarly nice performance with respect to the other two desiderata. With respect to robustness, the degradation with the average error of the predictor is far from the best possible, while a completely unreliable predictor leads to unbounded competitive ratios, far from the ones of the best online algorithm.
Theorem 3.1.
The competitive ratio of $\u03f5$assisted algorithm $\mathcal{B}$ is ${\mathrm{\text{cr}}}_{\mathcal{B},{\mathrm{\ell}}_{1}}(\u03f5)=\mathrm{\Omega}(\u03f5)$.
The implication is that when the error of the predictor is much worse than the offline optimum, the competitive ratio becomes unbounded. With respect to robustness, the rate of decay is far from optimal as we will see in Section 3.3.
Proof of Theorem 3.1.
We will show that for every $\u03f5$, there exist a sequence $\sigma $ and a predictor $h$ such that the absolute error ${\eta}_{1}(h,\sigma )\le \u03f5\cdot \mathrm{\text{Opt}}$ while the competitive ratio of algorithm $\mathcal{B}$ is $\frac{\u03f51}{2}$. For ease of presentation, assume that $\u03f5>3$.
Consider a cache of size $k=2$ and three elements $a,b,c$; the initial configuration of cache is $a,c$. The sequence consists of repetitions of the following sequence with $\u03f5$ requests per repetition. The actual sequence will be $a\underset{\u03f51}{\underset{\u23df}{bcbc\mathrm{\dots}bc}}a\underset{\u03f51}{\underset{\u23df}{bcbc\mathrm{\dots}bc}}\mathrm{\dots}$ ($a$ appears once and then $bc$ appears $(\u03f51)/2$ times).
In any repetition, the predictor accurately predicts the arrival time of all elements apart from two: i) when element $a$ arrives, it predicts that it will arrive again two steps after the current time (instead of in the first step of the next repetition) and ii) when $b$ arrives for the last time in one repetition, it predicts it to arrive again in the fourth position of the next repetition (instead of the second). As a result, the absolute error of the predictor is $\u03f5$ ($\u03f52$ error in the $a$misprediction and $2$ error in the $b$misprediction).
The optimal solution has two mistakes per repetition (one to bring $a$ in the cache and one to directly evict it afterwards). Instead, the algorithm never evicts $a$ as it is predicted to arrive much earlier than all other elements, and therefore has $\u03f51$ cache misses. This means that the competitive ratio of this algorithm is $\mathrm{\Omega}({\eta}_{1}(h,\sigma )/\mathrm{\text{Opt}}(\sigma ))$ which completes the proof. ∎
Evicting elements with proven wrong predictions.
The problem in the above algorithm is that algorithm $\mathcal{B}$ keeps too much faith in predictions that have been already proven to be wrong (as the corresponding elements are predicted to arrive in the past). It is tempting to “fix” the issue by evicting any element whose predicted time has passed, and evict again the element predicted the furthest in the future if no such element exists. We call this algorithm $\mathcal{W}$ as it takes care of wrong predictions. Formally, let $h(j,t)$ denote the last prediction about ${z}_{j}$ at or prior to time $t$. At time $t$ algorithm $\mathcal{W}$ evicts an arbitrary item from the set $$ if ${S}_{t}\ne \mathrm{\varnothing}$ and $\mathrm{arg}{\mathrm{max}}_{{z}_{i}\in \text{Cache(t)}}h(i,t)$ otherwise. We show that algorithm $\mathcal{W}$ has similarly bad performance guarantees.
Theorem 3.2.
The competitive ratio of $\u03f5$assisted algorithm $\mathcal{W}$ is ${\mathrm{\text{cr}}}_{\mathcal{W},{\mathrm{\ell}}_{1}}(\u03f5)=\mathrm{\Omega}(\u03f5)$.
Proof.
Consider a cache of size $k=3$ and four elements $a,b,c,d$; the initial configuration of cache is $a,b,c$ and then $d$ arrives. The actual sequence consists of repetitions of the following sequence with $(\u03f5/2)+1$ requests per repetition (for ease of presentation, assume that $\u03f5>6)$. The actual sequence is $d\underset{\u03f5/2}{\underset{\u23df}{abcabc\mathrm{\dots}abc}}d\underset{\u03f5/2}{\underset{\u23df}{abcabc\mathrm{\dots}abc}}\mathrm{\dots}$ and is denoted by $\sigma $.
In any repetition, the predictor $h$ accurately predicts the arrival time of element $d$ but always makes mistake in elements $a,b,c$ by predicting them to arrive two time steps earlier. As a result, the absolute error of the predictor is $\u03f5$ (error of $2$ for any of the appearances of $a,b,c$).
The optimal solution has two mistakes per repetition (one to bring element $d$ and one to evict it afterwards). Instead the algorithm always evicts elements $a,b,c$ because they are predicted earlier than their actual arrival and are therefore evicted as “wrong” predictions. This means that the competitive ratio of this algorithm is also $\mathrm{\Omega}({\eta}_{1}(h,\sigma )/\mathrm{\text{Opt}}(\sigma ))$ which completes the proof. ∎
Beyond blindly trusting the predictor.
The common problem in both examples is that there is an element that should be removed but the algorithm is tricked into keeping it in the cache. To deal with this in practice, most popular heuristics such as LRU (Least Recently Used) and FIFO (First In First Out) avoid evicting recent elements when some elements have been dormant for a long time. This tends to utilize nice locality properties leading to strong empirical performance (especially for LRU). However, such heuristics impose a strict eviction policy which leads to weak performance guarantees. Moreover, incorporating additional information provided by the predictor becomes complicated.
Competitive analysis has also built on the idea of evicting dormant elements via developing algorithms with stronger theoretical guarantees such as Marker. In the next subsection, we show how we can incorporate predictions in the Marker algorithm to enhance its performance when the predictions are good while retaining the worstcase guarantees. Interestingly, via our framework, we can provide improved guarantees for the aforementioned heuristics such as LRU, improving their worstcase guarantees while retaining their practical performance (see Section 4.2).
3.2 Predictive Marker Algorithm
We now present our main technical contribution, a predictionbased adaptation of the Marker algorithm [FKL${}^{+}$91]. This $\u03f5$assisted algorithm gets a competitive ratio of $2\cdot \mathrm{min}(O(\sqrt{\u03f5},2{H}_{k})$ where ${H}_{k}=1+1/2+\mathrm{\cdots}+1/\mathrm{k}$ denotes the $k$th Harmonic number.
Classic Marker algorithm.
We begin by recalling the Marker algorithm and the analysis of its performance. The algorithm runs in phases. At the beginning of each phase, all elements are unmarked. When an element arrives and is already in the cache, the element is marked. If it is not in the cache, a random unmarked element is evicted, the newly arrived element is placed in the cache and is marked. Once all elements are marked and a new cache miss occurs, the phase ends and we unmark all of the elements.
For the purposes of analysis, an element is called clean in phase $r$ if it appears during phase $r$, but does not appear during phase $r1$. In contrast, elements that also appeared in the previous phase are called stale. The marker algorithm has competitive ratio of $2{H}_{k}1$ and the analysis is tight [ACN00]. We use a slightly simpler analysis that achieves competitive ratio of $2{H}_{k}$ below.
The crux of the upper bound lies in two claims. The first relates the performance of the optimal offline algorithm to the total number of clean elements $Q$ across all phases.
Claim 1 ([FKL${}^{+}$91]).
Let $Q$ be the number of clean elements. Then the optimal algorithm suffers at least $\mathrm{Q}/2$ cache misses.
The second comes from bounding the performance of the algorithm as a function of the number of clean elements.
Claim 2 ([FKL${}^{+}$91]).
Let $Q$ be the number of clean elements. Then the expected number of cache misses of the marker algorithm is $Q\cdot {H}_{k}$.
Predictive Marker.
The algorithm of [FKL${}^{+}$91] is part of a larger family of marking algorithms, which never evict marked elements when there are unmarked elements present. Any algorithm in this family has a worst case competitive ratio of $k$. Therefore pairing predictions with a marking style algorithm would avoid the pathological examples we saw previously.
A natural approach is to use predictions for tiebreaking, specifically evicting the element whose predicted next appearance time is furthest in the future. When the predictor is perfect (and has zero error), the stale elements never result in cache misses, and therefore, by Claim 1, the algorithm has a competitive ratio of $2$. On the other hand, by using the Marker algorithm and not blindly trusting the oracle, we can guarantee a worstcase competitive ratio of $k$.
We extend this direction to further reduce the worstcase competitive ratio to $O({H}_{k})$. To achieve this, we combine the predictionbased tiebreaking rule with the random tiebreaking rule. Suppose an element $e$ is evicted during the phase. We construct a blame graph to understand the reason why $e$ is evicted. There are two cases: either it was evicted when a clean element $c$ arrived, then we add a directed edge from $e$ to $c$. Alternatively, it was evicted because a stale element $s$ arrived, but $s$ was previously evicted. In this case, we add a directed edge from $e$ to $s$. Note that the graph is always a set of chains (paths). The total length of the chains represents the total number of evictions incurred by the algorithm during the phase, whereas the number of distinct chains represents the number of clean elements. We call the lead element in every chain representative and denote it by $\omega (r,c)$, where $r$ is the index of the phase and $c$ the index of the chain in the phase.
Our modification is simple – when a stale element arrives, it evicts a new element in a predictionbased manner if the chain containing it has length less than ${H}_{k}$. Otherwise it evicts a random unmarked element. (Looking ahead to the analysis, this switch to uniform evictions results in at most ${H}_{k}$ additional elements added to any chain during the course of the phase. This guarantees that the competitive ratio is at most $O({H}_{k})$ in expectation; we make the argument formal in Theorem 3.3.
The key to the analysis is the fact that the chains are disjoint, thus the interactions between evictions can be decomposed cleanly. We give a formal version of the algorithm in Algorithm 3.2. For simplicity, we drop dependence on $\sigma $ from the notation.
[!h] \[email protected]@algorithmic[1] \REQUIRECache $\mathcal{C}$ of size $k$ initially empty ($\mathcal{C}\leftarrow \mathrm{\varnothing}$). \STATEInitialize phase counter $r\leftarrow 1$, unmark all elements ($\mathcal{M}\leftarrow \mathrm{\varnothing}$), and set round $i\leftarrow 1$. \STATEInitialize clean element counter ${q}_{r}\leftarrow 0$ and tracking set $\mathcal{S}\leftarrow \mathrm{\varnothing}$. \STATEElement ${z}_{i}$ arrives, and the predictor gives a prediction ${h}_{i}$. Save prediction $p({z}_{i})\leftarrow {h}_{i}$. \IF${z}_{i}$ results in cache hit or the cache is not full (${z}_{i}\in \mathcal{C}$ or $$) \STATEAdd to cache $C\leftarrow C\cup \{{z}_{i}\}$ without evicting any element and go to step 26 \ENDIF\IFthe cache is full and all cache elements are marked ($\mathcal{M}=k$) \STATEIncrease phase ($r\leftarrow r+1$), initialize clean counter (${q}_{r}\leftarrow 0$), save current cache ($\mathcal{C}\to \mathcal{S}$) as the set of elements that are possibly stale in the new phase, and unmark elements ($\mathcal{M}\leftarrow \mathrm{\varnothing}$). \ENDIF\IF${z}_{i}$ is a clean element (${z}_{i}\notin \mathcal{S}$) \STATEIncrease number of clean elements: ${q}_{r}\leftarrow {q}_{r}+1$. \STATEInitialize size of new clean chain: $n(r,{q}_{r})\leftarrow 1$. \STATESelect to evict unmarked element with highest predicted time: $e={\mathrm{arg}\mathrm{max}}_{z\in \mathcal{C}\mathcal{M}}p(z)$. \ENDIF\IF${z}_{i}$ is a stale element (${z}_{i}\in \mathcal{S}$) \STATEIt is the representative of some clean chain. Let $c$ be this clean chain: ${z}_{i}=\omega (r,c)$. \STATEIncrease length of the clean chain $n(r,c)\leftarrow n(r,c)+1$. \IF$n(r,c)\le {H}_{k}$ \STATESelect to evict unmarked element with highest predicted time: $e={\mathrm{arg}\mathrm{max}}_{z\in \mathcal{C}\mathcal{M}}p(z)$. \ELSE\STATESelect to evict a random unmarked element $e\in \mathcal{C}\mathcal{M}$. \ENDIF\STATEUpdate cache by evicting $e$: $\mathcal{C}\leftarrow \mathcal{C}\cup \{{z}_{i}\}\{e\}$. \STATESet $e$ as representative for the chain: $\omega (r,c)\leftarrow e$. \ENDIF\STATEMark incoming element ($\mathcal{M}\leftarrow \mathcal{M}\cup \{{z}_{i}\}$), increase round ($i\leftarrow i+1$), and go to step 3.
3.3 Analysis
In order to analyze the performance of the proposed algorithm, we begin with a technical definition that captures how slowly a loss function $\mathrm{\ell}$ can grow.
Definition 7.
Let ${A}_{T}\mathrm{=}{a}_{\mathrm{1}}\mathrm{,}{a}_{\mathrm{2}}\mathrm{,}\mathrm{\dots}\mathrm{,}{a}_{T}$, be a sequence of increasing integers of length $T$, that is $$, and ${B}_{T}\mathrm{=}{b}_{\mathrm{1}}\mathrm{,}{b}_{\mathrm{2}}\mathrm{,}\mathrm{\dots}\mathrm{,}{b}_{T}$ a sequence of nonincreasing reals of length $T$, ${b}_{\mathrm{1}}\mathrm{\ge}{b}_{\mathrm{2}}\mathrm{\ge}\mathrm{\dots}\mathrm{\ge}{b}_{T}$. For a fixed loss function $\mathrm{\ell}$, we define its spread ${S}_{\mathrm{\ell}}\mathrm{:}{\mathrm{N}}^{\mathrm{+}}\mathrm{\to}{\mathrm{R}}^{\mathrm{+}}$ as:
$${S}_{\mathrm{\ell}}(m)=\mathrm{min}\{T:\underset{{A}_{T},{B}_{T}}{\mathrm{min}}\mathrm{\ell}({A}_{T},{B}_{T})\ge m\}$$ 
The following Lemma instantiates the spread metric for loss metrics we consider and is proved in the Appendix A.
Lemma 3.1.
For absolute loss, ${\mathrm{\ell}}_{1}(A,B)={\sum}_{i}{a}_{i}{b}_{i}$, the spread of ${\mathrm{\ell}}_{1}$ is ${S}_{{\mathrm{\ell}}_{1}}(m)\le \sqrt{5m}$.
For squared loss, ${\mathrm{\ell}}_{2}(A,B)={\sum}_{i}{({a}_{i}{b}_{i})}^{2}$, the spread of ${\mathrm{\ell}}_{2}$ is ${S}_{{\mathrm{\ell}}_{2}}(m)\le \sqrt[3]{14m}.$
We now prove the main theorem of the paper.
Theorem 3.3.
Let a loss function $\mathrm{\ell}$ with spread bounded by ${S}_{\mathrm{\ell}}$. If ${S}_{\mathrm{\ell}}$ is concave, the competitive ratio of $\u03f5$assisted Predictive Marker PM is bounded by:
$${\mathrm{\text{cr}}}_{\mathrm{\text{PM}},\mathrm{\ell}}(\u03f5)\le 2\cdot \mathrm{min}(1+2{S}_{\mathrm{\ell}}\left(\u03f5\right),2{H}_{k}).$$ 
To prove this theorem, we first introduce an analogue of Claim 2, which decomposes the total cost into that incurred by each of the chains individually.
To aid in our analysis, we consider the following marking algorithm, which we call SM (Special Marking). Initially we simply evict an arbitrary unmarked element. At some point, the adversary designates an arbitrary element not in the cache as special. For the rest of the phase, upon a cache miss, if the arriving element is special, the algorithm evicts a random unmarked element and designates the evicted element as special. If the arriving element is not special, the algorithm proceeds as before, evicting an arbitrary unmarked element.
Lemma 3.2.
Using algorithm SM, in expectation at most ${H}_{k}$ special elements cause cache misses per phase.
Proof.
Consider the unmarked elements in the cache that never reappear during the phase. If one of these is designated special, it will not cause another cache miss before the end of the phase. We turn our analysis to elements that will reappear during the phase.
Let $A$ denote the subset of these elements that may become special; this set dynamically shrinks over time as elements appear and become marked. At the time the first special element causes a cache miss, these are the elements that are not already marked in the cache that will reappear during the phase, as well as those outside the cache that will appear before the end of the phase. Order the elements in $A$ in decreasing order of their first arrival time. Observe that at the outset $A$ contains at most $k1$ elements.
For any $i\in \{1,\mathrm{\dots},k1\}$, we define ${\mathcal{E}}_{i}$ as the event that an element becomes special when it is the $i$th element in the active ordering. Our goal will be to show that
$$Pr[{\mathcal{E}}_{i}]\le \frac{1}{i+1}.$$  (1) 
A key to the analysis is the fact that once event ${\mathcal{E}}_{i}$ occurs, only elements arriving even later (i.e. those with lower index in the active set) can become special. Therefore, given Equation (1), we can bound the expected number of misses caused by special elements as:
$$1+\sum _{i=1}^{k1}\frac{1}{i+1}={H}_{k},$$ 
where the first term is due to the first special element arriving, the the second term is due to the events ${\mathcal{E}}_{1}$ through ${\mathcal{E}}_{k1}$.
Consider the last time an element becomes special while there are more than $i$ elements in the active ordering; let $\omega $ be the special element. As we argued above, until this point ${\mathcal{E}}_{i}$ could not have occurred.
Now consider the time that $\omega $ reappears. If there are exactly $i$ elements in the active set, the probability that ${\mathcal{E}}_{i}$ occurs is bounded by $\frac{1}{i+1}$. We may have selected either one of the first $i$ active elements, or an element in the cache that never appears before the end of the phase (at least one such element must exist, otherwise there are no cache misses during the phase). If there are fewer than $i$ elements, the probability of ${\mathcal{E}}_{i}$ occurring is $0$. ∎
We now provide the lemma that lies in the heart of our robustness property.
Lemma 3.3.
For any loss metric $\mathrm{\ell}$, any phase $r$, the expected length of any chain is at most $1+{\mathcal{S}}_{\mathrm{\ell}}({\eta}_{r,c})$ where ${\eta}_{r,c}$ is the cumulative error of the predictor on the elements in the chain and ${\mathcal{S}}_{\mathrm{\ell}}$ is the spread of the loss metric.
Proof.
The clean element that initiates the clean chain evicts one of the unmarked elements upon arrival. Since it does so based on the Belady rule, it evicts the element ${s}_{1}$ that is predicted to reappear the latest in the future. If the predictor is perfect, this element will never appear in this phase. If, on the other hand, ${s}_{1}$ comes back (is a stale element) let ${s}_{2}$ be the element it evicts, which is predicted to arrive the furthest among the current unmarked elements.
Suppose there are $m$ such evictions: ${s}_{1},{s}_{2},\mathrm{\dots},{s}_{m}$. The elements were predicted to arrive in reverse order of their evictions. This is because elements ${s}_{j}$ for $j>i$ were unmarked and in the cache when element ${s}_{i}$ got evicted; therefore ${s}_{i}$ was predicted to arrive later. However, the actual arrival order is the reverse. If ${\eta}_{r,c}$ is the total error of these elements, setting the actual arriving times as the sequence ${A}_{T}$ and the predicted ones as the sequence ${B}_{T}$ in the definition of spread (Definition 7), it means that $m\le {S}_{\mathrm{\ell}}({\eta}_{r,c})$. ∎
Combining the two above lemmas, we can obtain a bound on the expected length of any chain.
Lemma 3.4.
For any loss metric $\mathrm{\ell}$, any phase $r$, the expected length of any chain is at most $\mathrm{min}(1+2{\mathcal{S}}_{\mathrm{\ell}}({\eta}_{r,c}),2\mathrm{log}k)$ where ${\eta}_{r,c}$ is the cumulative error of the predictor on the elements in the chain and ${\mathcal{S}}_{\mathrm{\ell}}$ is the spread of the loss metric.
Proof.
The proof follows from combining the two above lemmas. By Lemma 3.2, if the chain switches to random evictions, it incurs another ${H}_{k}$ cache misses in expectation after the switch point (and its length increases by the same amount), capping the total length by $2{H}_{k}\le 2\mathrm{log}k$. If the chain does not switch to random evictions, it has Belady evictions and, by Lemma 3.3, it incurs at most ${\mathcal{S}}_{\mathrm{\ell}}({\eta}_{r,c})$ misses from stale elements. ∎
Proof of Theorem 3.3.
Fix an arbitrary sequence of arrivals. Let $Q$ be the number of clean elements (and therefore also chains). Any cache miss corresponds to a particular eviction in one clean chain; there are no cache misses not charged to a chain by construction. By Lemma 3.4, we can bound the evictions from the clean chain $c$ of the $r$th phase by $\mathrm{min}(1+2\cdot {S}_{\mathrm{\ell}}({\eta}_{r,c}),2\mathrm{log}k)$. Since both ${S}_{\mathrm{\ell}}$ and the minimum operator are concave functions, the way to maximize the length of chains is to apportion the total error, $\eta $, equally across all of the chains. Thus for a given error $\eta $ and number $Q$ of clean chains, the competitive ratio is maximized when the error in each chain is ${\eta}_{r,c}=\eta /Q$ each. The total number of stale elements is then: $Q\cdot \mathrm{min}(2\cdot {S}_{\mathrm{\ell}}(\eta /\mathrm{Q}),2{\mathrm{H}}_{\mathrm{k}})$. By Claim 1, $\mathrm{Q}/2\le \mathrm{\text{Opt}}(\sigma )$, implying the result since $\mathrm{\text{Opt}}\le Q$. ∎
We now specialize the results for the absolute and squared losses.
Corollary 1.
The competitive ratio of $\u03f5$assisted Predictive Marker with respect to the absolute loss metric ${\mathrm{\ell}}_{1}$ is bounded by ${\mathrm{\text{cr}}}_{\mathcal{P}\mathcal{M},{\mathrm{\ell}}_{1}}(\u03f5)\le \mathrm{min}(2+2\cdot \sqrt{5\u03f5},4{H}_{k})$.
Corollary 2.
The competitive ratio of $\u03f5$assisted Predictive Marker with respect to the absolute loss metric ${\mathrm{\ell}}_{2}$ is bounded by ${\mathrm{\text{cr}}}_{\mathcal{P}\mathcal{M},{\mathrm{\ell}}_{1}}(\u03f5)\le \mathrm{min}(2+2\cdot \sqrt[3]{14\u03f5},4{H}_{k})$.
3.4 Tightness of analysis
Robustness rate of Predictive Marker.
We show that our analysis is tight: any marking algorithm that uses the predictor in a deterministic way cannot achieve an improved guarantee with respect to robustness.
Theorem 3.4.
Any deterministic $\u03f5$assisted marking algorithm $\mathcal{A}$, that only uses the predictor in tiebreaking among unmarked elements in a deterministic fashion, has a competitive ratio of ${\mathrm{\text{cr}}}_{\mathcal{A},\mathrm{\ell}}(\u03f5)=\mathrm{\Omega}\left(\mathrm{min}({\mathcal{S}}_{\mathrm{\ell}}\left(\u03f5\right),k)\right)$.
Proof.
Consider a cache of size $k$ with $k+1$ elements and any $\u03f5$ such that $$. We will construct an instance that exhibits the above lower bound. Since $\mathcal{A}$ uses marking, we can decompose its analysis into phases. Let $\sigma $ be the request sequence, and assume that we do not have any repetition of an item inside the phase; as a result the $i$th element of phase $r$ corresponds to element ${\sigma}_{(r1)k+i}$.
Suppose the predictor is always accurate on elements 2 through $k{S}_{\mathrm{\ell}}(\u03f5)+1$ in each phase.
For the last ${S}_{\mathrm{\ell}}(\u03f5)1$ elements of phase $r$ as well as the first element of the of the next phase, the elements are predicted to come again at the beginning of the subsequent phase, at time $t=rk+1$. Since the algorithm is deterministic, we order the elements so that their evictions are in reverse order of their arriving time. By the definition of spread, the error of the predictor in these elements is exactly $\u03f5$ and the algorithm incurs a cache miss in each of them. On the other hand, the offline optimum has only $1$ miss per phase, which concludes the proof. ∎
On the rate of robustness in caching.
Theorem 3.4 establishes that the analysis of Predictive Marker is tight with respect to the rate of robustness, and suggests that algorithms that use the predictor in a deterministic manner may suffer from similar rates. However, a natural question that comes up is whether a better rate can be achieved using the predictor in a randomized way. We conjecture that a rate of $\mathrm{log}(1+\sqrt{\u03f5})$ with respect to the absolute loss is possible, similar to the exponential improvement randomized schemes obtain over the deterministic guarantees of $k$ with respect to worstcase competitiveness.
3.5 Randomized predictors
We now remove the assumption that the predictor $h$ is deterministic and extend the definition of $\u03f5$accurate predictors (Definition 2) to hold in expectation. The randomness may either come in how the inputs are generated or in the predictions of $h$.
Definition 8.
For a fixed optimization problem $\mathrm{\Pi}$, let ${\mathrm{\text{Opt}}}_{\mathrm{\Pi}}\mathit{}\mathrm{(}\sigma \mathrm{)}$ denote the value of the optimal solution on input $\sigma $. Assume that the predictor is probabilitic and therefore the error of the predictor at $\sigma $ is a random variable ${\eta}_{\mathrm{\ell}}\mathit{}\mathrm{(}h\mathrm{,}\sigma \mathrm{)}$. We say that a predictor $h$ is $\u03f5$accurate in expectation for $\mathrm{\Pi}$ if:
$$\mathbb{E}\left[{\eta}_{\mathrm{\ell}}(h,\sigma )\right]\le \u03f5\cdot \mathbb{E}\left({\mathrm{\text{Opt}}}_{\mathrm{\Pi}}(\sigma )\right).$$ 
Similarly an algorithm is $\u03f5$assisted if it has access to an $\u03f5$accurate predictor in expectation.
Analogously to the previous part, we can show:
Theorem 3.5.
Let a loss function $\mathrm{\ell}$ with spread bounded by ${S}_{\mathrm{\ell}}$. If ${S}_{\mathrm{\ell}}$ is concave, the competitive ratio of $\u03f5$assisted (in expectation) Pr
$${\mathrm{\text{cr}}}_{\mathcal{P}\mathcal{M},\mathrm{\ell}}(\u03f5)\le 2\cdot \mathrm{min}(1+2{S}_{\mathrm{\ell}}\left(\u03f5\right),2{H}_{k}).$$ 
Proof.
For ease of notation assume that the outcomes of the predictors are finite. For each of these potential realizations, we can bound the performance of the algorithm by Theorem 3.3. The proof then follows by appliyng an additional Jensen’s inequality on all the possible realizations due to the concavity of the spread and the min operator. ∎
4 Discussion and extensions
Thus far we have shown how to use an $\u03f5$accurate predictor to get a caching algorithm with an $O(\sqrt{\u03f5})$ competitive ratio for the absolute loss metric. We now provide a deeper discussion of the main results. In Section 4.1, we give a finer tradeoff of competitiveness and robustness. We then discuss some traits that limit the impact of the errors of the predictors in Section 4.2.Subsequently, we show that common heuristic approaches, such as LRU, can be expressed as predictors in our framework. This allows us to combine their predictive power with robust guarantees when they fail. Finally, in Section 4.3, we provide a blackbox way to combine robust and competitive approaches.
4.1 Robustness vs competitiveness tradeoffs.
One of the free parameters in Algorithm 3.2 is the length of the chain when the algorithm switches from following the predictor to random evictions. If the switch occurs after chains grow to $\gamma {H}_{k}$ in length, this provides a tradeoff between competitiveness and robustness.
Theorem 4.1.
Suppose that, for $\gamma >0$, the algorithm uses $\gamma {H}_{k}$ as switching point (line 18 in Algorithm 3.2); denote this algorithm by $\mathcal{P}\mathcal{M}(\gamma )$. Let a loss function $\mathrm{\ell}$ with spread bounded by ${S}_{\mathrm{\ell}}$. If ${S}_{\mathrm{\ell}}$ is concave, the competitive ratio of $\u03f5$assisted $\mathcal{P}\mathcal{M}(\gamma )$ is bounded by:
$${\mathrm{\text{cr}}}_{\mathcal{P}\mathcal{M}(\gamma ),\mathrm{\ell}}(\u03f5)\le 2\cdot \mathrm{min}(1+\frac{1+\gamma}{\gamma}{S}_{\mathrm{\ell}}\left(\u03f5\right),\gamma {H}_{k},k).$$ 
Proof.
The proof follows the proof of Theorem 3.3 but slightly changes the Lemma 3.2 to account for the new switching point. In particular, with respect to the second term, the expected length of each clean chain is at most ${H}_{k}$ after the switching point, and, at most $\gamma {H}_{k}$ before the switching point by construction.
With respect to the robustness term, the length of each clean chain before the switch is bounded by the spread of the metric on this subsequence. Since the total length is in expectation at most $(1+\gamma )/\gamma $ higher, we need to adjust the first term accordingly.
Finally, the length of its clean chain is at most $k$ regardless of the tiebreaking since we are using marking which provides the last term. ∎
Let us reflect on the above guarantee. When $\gamma \to 0$ then the algorithm is more conservative (switching to random evictions earlier); this reduces the worstcase competitive ratio but at the cost of abandoning the predictor unless it is extremely accurate. On the other hand, setting $\gamma $ very high makes the algorithm trust the predictor more, reducing the competitive ratio when the predictor is accurate at the expense of a worst guarantee when the predictor is unreliable.
4.2 Practical traits of Predictive Marker
Locality.
The guarantee in Theorem 3.3 bounds the competitive ratio as a function of the quality of the prediction. One potential concern is that if the predictions have of a small number of very large errors, then the applicability of Predictive Marker may be quite limited.
Here we show that this is not the case. Due to the phasebased nature of the analysis, the algorithm essentially “resets” at the end of every phase, and therefore the errors incurred one phase do not carry over to the next. Moreover, the competitive ratio in every phase is bounded by $O({H}_{k})$.
Formally, for any sequence $\sigma $, we can define phases that consist of exactly $k$ distinct elements. Let $\mathrm{\text{cl}}(r,\sigma )$ be the number of clean elements in phase $r$ of sequence $\sigma $, and let ${\eta}_{\mathrm{\ell},r}(h,\sigma )$ denote the error of predictor $h$ restricted only to elements occurring in phase $r$.
Theorem 4.2.
Consider a loss function $\mathrm{\ell}$ with spread ${S}_{\mathrm{\ell}}$. If ${S}_{\mathrm{\ell}}$ is concave, the competitive ratio of Predictive Marker $PM$ at sequence $\sigma $ when assisted by a predictor $h$ is at most:
$${\mathrm{\text{cr}}}_{\mathcal{P}\mathcal{M},\mathrm{\ell}}\le \frac{{\sum}_{r}\mathrm{\text{cl}}(r,\sigma )\cdot \mathrm{min}(1+2{S}_{\mathrm{\ell}}({\eta}_{\mathrm{\ell},r}(h,\sigma ),2{H}_{k})}{{\sum}_{r}\mathrm{\text{cl}}(r,\sigma )}$$ 
Proof.
This theorem illustrates a nice property of our algorithm. If the predictor $h$ is really bad for a period of time (i.e. its errors are localized) then the clean chains of the corresponding phases will contribute the second term (the logarithmic worstcase guarantee) but the other phases will provide enhanced performance utilizing the predictor’s advice. In this way, the algorithm adapts to the quality of the predictions, and bad errors do not propagate beyond the end of a phase. This quality is very useful in caching where most patterns are generally well predicted but there may be some unforeseen sequences.
Robustifying LRU.
Another practical property of our algorithm is that it can seamlessly incorporate heuristics that are known to perform well in practice. In particular, the popular Least Recently Used (LRU) algorithm can be expressed within the Predictive Marker framework. Consider the following predictor, $h$: when an element ${\sigma}_{i}$ arrives at time $i$, the LRU predictor predicts next arrival time $h({\sigma}_{i})=i$.
Note that, by doing so, at any point of time, among the elements that are in the cache, the element that is predicted the furthest in the future is exactly the one that has appeared the least recently. Also note that any marked element needs to have arrived later than any unmarked element. As a result, if we never switched to random evictions (or had $k$ in the RHS of line 18 in Algorithm 3.2), the Predictive Marker algorithm assisted with the LRU predictor is exactly the LRU algorithm.
The nice thing that comes from this observation is that we can robustify the analysis of LRU. LRU, and its variants like LRU(2), tend to have very good empirical performance as using the recency of requests is a good predictor about how future requests will arise. However, the worstcase guarantee of LRU is unfortunately $\mathrm{\Theta}(k)$ since it is a deterministic algorithm. By expressing LRU as a predictor in the Predictive Marker framework and using a switching point of ${H}_{k}$ for each clean chain, we exploit most of this predictive power while also guaranteeing a logarithmic worstcase bound on it.
4.3 Combining robustness and competitiveness in a blackbox manner
In the previous section, we showed how we can slightly modify a classical competitive algorithm to ensure that it satisfies nice consistency and robustness properties when given access to a good predictor, while retaining the worstcase competitiveness guarantees otherwise. In this part, we show that, in fact, achieving the requirements individually is enough. In particular, we show a blackbox way to combine an algorithm that is robust and one that is worstcase competitive. This reduction leads to a slightly worse bound, but shows that proving the robustness property (i.e. a graceful degradation with the error of the predictor) is theoretically sufficient to augment an existing worstcase competitive algorithm.
Theorem 4.3.
For the caching problem, let $A$ be an $\alpha $robust algorithm and $B$ a $\gamma $competitive algorithm. We can then create a blackbox algorithm $ALG$ that is both $9\alpha $robust and $9\gamma $competitive.
Proof.
We proceed by simulating $A$ and $B$ in parallel on the dataset, and maintaining the cache state and the number of misses incurred by each. Our algorithm switches between following the strategy of $A$ and the strategy of $B$. Let ${c}_{t}(A)$ and ${c}_{t}(B)$ denote the cost (number of misses) of $A$ and $B$ up to time $t$. Without loss of generality, let $ALG$ begin by following strategy of $A$; it will do so until a time $t$ where ${c}_{t}(A)=2\cdot {c}_{t}(B)$. At this point $ALG$ switches to following the eviction strategy of $B$, doing so until the simulated cost of $B$ is double that of $A$: a time ${t}^{\prime}$ with ${c}_{{t}^{\prime}}(B)=2\cdot {c}_{{t}^{\prime}}(A)$. At this point it switches back to following eviction strategy of $A$, and so on. When $ALG$ switches from $A$ to $B$, the elements that $A$ has in cache may not be the same as those that $B$ has in the cache. In this case, it needs to reconcile the two. However, this can be done lazily (at the cost of an extra cache miss for every element that needs to be reconciled). To prove the bound on the performance of the algorithm, we need to show that ${c}_{t}(ALG)\le 9\cdot min({c}_{t}(A),{c}_{t}(B))$ for all $t$. We decompose the cost incurred by $ALG$ into that due to following the different algorithms, which we denote by ${f}_{t}(ALG)$, and that due to reconciling caches, ${r}_{t}(ALG)$.
We prove a bound on the following cost ${f}_{t}$ by induction on the number of switches. Without loss of generality, suppose that at time $t$, $ALG$ switched from $A$ to $B$, and at time ${t}^{\prime}$ it switches from $B$ back to $A$. By induction, suppose that ${f}_{t}(ALG)\le 3\mathrm{min}({c}_{t}(A),{c}_{t}(B))=3{c}_{t}(B)$, where the equality follows since $ALG$ switched from $A$ to $B$ at time $t$. In both cases, assume that caches are instantly reconciled. Then:
${f}_{{t}^{\prime}}(ALG)$  $={f}_{t}(ALG)+({c}_{{t}^{\prime}}(B){c}_{t}(B))$  
$={f}_{t}(ALG)+2{c}_{{t}^{\prime}}(A)1/2{\mathrm{c}}_{\mathrm{t}}(\mathrm{A})$  
$\le 3{c}_{t}(B)+2({c}_{{t}^{\prime}}(A){c}_{t}(A))+3/2\cdot {\mathrm{c}}_{\mathrm{t}}(\mathrm{A})$  
$=3{c}_{t}(A)+2({c}_{{t}^{\prime}}(A){c}_{t}(A))$  
$\le 3{c}_{{t}^{\prime}}(A)$  
$=3\mathrm{min}({c}_{{t}^{\prime}}(A),{c}_{{t}^{\prime}}(B))$ 
What is left is to bound the following cost for the time since the last switch. Let $s$ denote the time of the last switch and, assume without loss of generality that it was done from $A$ to $B$. Let ${s}^{\prime}$ denote the last time step. By the previous set of inequalities (changing the second equation to inequality) and the fact that the algorithm never switched back to $A$ after $s$, it holds that ${f}_{{s}^{\prime}}(ALG)\le 3{c}_{{s}^{\prime}}(A)\le 6\mathrm{min}({c}_{{s}^{\prime}}(A),{c}_{{s}^{\prime}}(B))$.
To bound the reconciliation cost, assume the switch at time $t$ is from $A$ to $B$. We charge the reconciliation of each element in $B\setminus A$ to the cache miss when the element was last evicted by $A$. Therefore the overall reconciliation cost is bounded by ${r}_{t}(ALG)\le {c}_{t}(A)+{c}_{t}(B)\le 3\mathrm{min}({c}_{t}(A),{c}_{t}(B)$. ∎
Observe that the above construction can extend beyond caching and applies to any setting where we can bound the cost that the algorithm needs to incur to reconcile the states of the robust and the worstcase competitive algorithm. In particular, this occurs in the more general $k$server problem.
5 Experiments
In this section we evaluate our approach on real world datasets, empirically demonstrate its dependence on the errors in the oracle, and compare it to standard baselines.
Datasets and Metrics
We consider two datasets taken from different domains to demonstrate the wide applicability of our approach.

•
BK is data extracted from BrightKite, a now defunct social network. We consider sequences of checkins, and extract the top $100$ users with the longest nontrivial check in sequences—those where the optimum policy would have at least $50$ misses. This dataset is publicly available at [CML11, Bri]. Each of the user sequences represents an instance of the caching problem.

•
Citi is data extracted from CitiBike, a popular bike sharing platform operating in New York City. We consider citi bike trip histories, and extract stations corresponding to starting points of each trip. We create 12 sequences, one for each month of 2017 for the New York City dataset. We consider only the first 25,000 events in each file. This data is publicly available at [Cit].
We give some additional statistics about each datasets in Table 1.
Dataset  Num Sequences  Sequence Length  Unique Elements 

BK  100  2,101  67– 800 
Citi  24  25,000  593 – 719 
Our main metric for evaluation will be the competitive ratio of the algorithm, defined as the number of misses incurred by the particular strategy divided by the optimum number of misses.
Predictions
We run experiments with both synthetic predictions to showcase the sensitivity of our methods to learning errors, and with preditions using an off the shelf classifier, published previously [AKTV14].

•
Synthetic Predictions. For each element, we first compute the true next arrival time, $y(t)$, setting it to $n+1$ if it does not appear in the future. To simulate the performance of an ML system, we set $h(t)=y(t)+\u03f5$, where $\u03f5$ is drawn i.i.d. from a lognormal distribution with mean parameter $0$ and standard deviation $\sigma $. We chose the lognormal distribution of errors to showcase the effect of rare but large failures of the learning algorithm. Finally, observe that since we only compare the relative predicted times for each method, adding a bias term to the predictor would not change the results.

•
PLECO Predictions. In their work Anderson et al. [AKTV14] developed a simple framework to model repeat consumption, and published the parameters of their PLECO (Power Law with Exponential Cut Off) model for the BrightKite dataset. While their work focused on predicting the relative probabilities of each element (re)appearing in the subsequent time step, we modify it to predict the next time an element will appear. Specifically, we set $h(t)=t+1/p(t)$, where $p(t)$ represents the probability that element that appeared at time $t$ will reappear at time $t+1$.
Algorithms
We consider multiple algorithms for evaluation.

•
LRU is the Least Recently Used policy that is wildly successful in practice.

•
Marker is the classical Marker algorithm due to Fiat et al. [FKL${}^{+}$91].

•
PredictiveMarker is the algorithm we develop in this work. We set the switching cost to $k$, and therefore never switch to random evictions.

•
Blind Oracle is the algorithm $\mathcal{B}$ described in Section 3.1, which evicts the element predicted to appear furthest in the future.
5.1 Results
We set $k=10$, and summarize the synthetic results on the BK dataset in Figure 1. Observe that the performance of Predictive Marker is consistently better than LRU and standard Marker, and degrades slowly as the average error increases, as captured by the theoretical analysis. Second, we empirically verify that blindly following the oracle works well when the error is very low, but quickly becomes incredibly costly.
Algorithm  Competitive Ratio on BK  Competitive Ratio on Citi 

Blind Oracle  2.049  2.023 
LRU  1.280  1.859 
Marker  1.310  1.869 
Predictive Marker  1.266  1.810 
The results using the PLECO predictor are shown in Table 2, where we keep $k=10$ for the BK dataset and set $k=100$ for Citi; we note that the ranking of the methods is not sensitive to the cache size, $k$. We can again see that the Predictive Marker algorithm outperforms all others, and is 2.5% better than the next best method, LRU. While the gains appear modest, we note they are statistically significant at $$. Moreover, the offtheshelf PLECO model was not tuned or optimized for predicting the next appearance of each element.
In that regard, the large difference in performance between using the predictor directly (Blind Oracle) and using it in combination with Marker (Predictive Marker) speaks to the power of the algorithmic method. By considering only the straightforward use of the predictor in the Blind Oracle setting, one may deem the ML approach not powerful enough for this application; what we show is that a more judicious use of the same model can result in tangible and statistically significant gains.
6 Conclusion
In this work, we introduce the study of online algorithms with the aid of machine learned predictors. This combines the empirical success of machine learning with the rigorous guarantees of online algorithms. We model the setting for the classical caching problem and give an oraclebased algorithm whose competitive ratio is directly tied to the accuracy of the machine learned oracle.
Our work opens up two avenues for future work. On the theoretical side, it would be interesting to see similar predictor based algorithms for other online settings such as the $k$server problem; this has already led to a fruitful line of current research as we discussed in Section 1.3. On the practical side, our caching algorithm shows how we can use machine learning in a safe way, avoiding problems caused by rare wildly inaccurate predictions. At the same time, our experimental results show that even with simple predictors, our algorithm provides an improvement compared to LRU. In essence, we have reduced the worst case performance of the caching problem to that of finding a good (on average) predictor. This opens up the door for practical algorithms that need not be tailored towards the worstcase or specific distributional assumptions, but still yield provably good performance.
Acknowledgements
The authors would like to thank Andrés MuñozMedina and Éva Tardos for valuable discussions on the presentation of the paper, Shuchi Chawla and Seffi Naor for useful feedback regarding Section 3.4, Ola Svensson for suggesting the locality extension (Theorem 4.2) as well as an anonymous reviewer for pointing towards the direction of Theorem 4.3.
References
 [ACC${}^{+}$11] Nir Ailon, Bernard Chazelle, Kenneth L. Clarkson, Ding Liu, Wolfgang Mulzer, and C. Seshadhri. Selfimproving algorithms. SIAM J. Comput., 40(2):350–375, 2011.
 [ACN00] Dimitris Achlioptas, Marek Chrobak, and John Noga. Competitive analysis of randomized paging algorithms. Theor. Comput. Sci., 234(12):203–218, 2000.
 [ADJ${}^{+}$19] Spyros Angelopoulos, Christoph Dürr, Shendan Jin, Shahin Kamali, and Marc Renault. Online computation with untrusted advice. arXiv preprint arXiv:1905.05655, 2019.
 [AFG02] Susanne Albers, Lene M. Favrholdt, and Oliver Giel. On paging with locality of reference. In Proceedings of the Thiryfourth Annual ACM Symposium on Theory of Computing, STOC ’02, pages 258–267, New York, NY, USA, 2002. ACM.
 [AKTV14] Ashton Anderson, Ravi Kumar, Andrew Tomkins, and Sergei Vassilvitskii. The dynamics of repeat consumption. In Proceedings of the 23rd International Conference on World Wide Web, WWW ’14, pages 419–430, New York, NY, USA, 2014. ACM.
 [AMR11] David Arthur, Bodo Manthey, and Heiko Röglin. Smoothed analysis of the kmeans method. J. ACM, 58(5):19:1–19:31, 2011.
 [AV06] David Arthur and Sergei Vassilvitskii. Worstcase and smoothed analysis of the ICP algorithm, with an application to the kmeans method. In 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2006), 2124 October 2006, Berkeley, California, USA, Proceedings, pages 153–164, 2006.
 [Bel66] L. A. Belady. A study of replacement algorithms for a virtualstorage computer. IBM Syst. J., 5(2):78–101, June 1966.
 [BEY98] Allan Borodin and Ran ElYaniv. Online Computation and Competitive Analysis. Cambridge University Press, New York, NY, USA, 1998.
 [BFK${}^{+}$16] Joan Boyar, Lene M. Favrholdt, Christian Kudahl, Kim S. Larsen, and Jesper W. Mikkelsen. Online algorithms with advice: A survey. SIGACT News, 47(3):93–129, August 2016.
 [Bri] Brightkite data. http://snap.stanford.edu/data/locbrightkite.html.
 [BRS17] Eric Balkanski, Aviad Rubinstein, and Yaron Singer. The limitations of optimization from samples. In STOC, 2017.
 [BS12] Sébastien Bubeck and Aleksandrs Slivkins. The best of both worlds: Stochastic and adversarial bandits. In COLT 2012  The 25th Annual Conference on Learning Theory, June 2527, 2012, Edinburgh, Scotland, pages 42.1–42.23, 2012.
 [Cit] Citibike system data. http://https://www.citibikenyc.com/systemdata.
 [CML11] Eunjoon Cho, Seth A. Myers, and Jure Leskovec. Friendship and mobility: User movement in locationbased social networks. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’11, pages 1082–1090, New York, NY, USA, 2011. ACM.
 [CR14] Richard Cole and Tim Roughgarden. The sample complexity of revenue maximization. In Symposium on Theory of Computing, STOC 2014, New York, NY, USA, May 31  June 03, 2014, pages 243–252, 2014.
 [Den68] Peter J. Denning. The working set model for program behavior. Commun. ACM, 11(5):323–333, May 1968.
 [EKM15] Hossein Esfandiari, Nitish Korula, and Vahab Mirrokni. Online allocation with traffic spikes: Mixing adversarial and stochastic models. In Proceedings of the Sixteenth ACM Conference on Economics and Computation, pages 169–186. ACM, 2015.
 [ERV16] Matthias Englert, Heiko Röglin, and Berthold Vöcking. Smoothed analysis of the 2opt algorithm for the general TSP. ACM Trans. Algorithms, 13(1):10:1–10:15, 2016.
 [FKL${}^{+}$91] Amos Fiat, Richard M. Karp, Michael Luby, Lyle A. McGeoch, Daniel D. Sleator, and Neal E. Young. Competitive paging algorithms. J. Algorithms, 12(4):685–699, December 1991.
 [HIKV19] ChenYu Hsu, Piotr Indyk, Dina Katabi, and Ali Vakilian. Learningbased frequency estimation algorithms. In International Conference on Learning Representations, 2019.
 [KBC${}^{+}$17] Tim Kraska, Alex Beutel, Ed H. Chi, Jeff Dean, and Neoklis Polyzotis. The case for learned index structures. 2017.
 [LMPL18] Thodoris Lykouris, Vahab Mirrokni, and Renato Paes Leme. Stochastic bandits robust to adversarial corruptions. In Proceedings of the 50th ACM Annual Symposium on Theory of Computing (STOC), 2018.
 [McG14] Andrew McGregor. Graph stream algorithms: A survey. SIGMOD Rec., 43(1):9–20, May 2014.
 [MGZ12] Vahab S. Mirrokni, Shayan Oveis Gharan, and Morteza Zadimoghaddam. Simultaneous approximations for adversarial and stochastic online budgeted allocation. In Proceedings of the TwentyThird Annual ACMSIAM Symposium on Discrete Algorithms, SODA 2012, Kyoto, Japan, January 1719, 2012, pages 1690–1701, 2012.
 [Mit18] Michael Mitzenmacher. A model for learned bloom filters and optimizing by sandwiching. In Advances in Neural Information Processing Systems (NeurIPS). 2018.
 [MNS12] Mohammad Mahdian, Hamid Nazerzadeh, and Amin Saberi. Online optimization with uncertain information. ACM Trans. Algorithms, 8(1):2:1–2:29, 2012.
 [MR95] Rajeev Motwani and Prabhakar Raghavan. Randomized Algorithms. Cambridge University Press, New York, NY, USA, 1995.
 [MR16] Jamie Morgenstern and Tim Roughgarden. Learning simple auctions. In Proceedings of the 29th Conference on Learning Theory, COLT 2016, New York, USA, June 2326, 2016, pages 1298–1318, 2016.
 [MV17] Andres Muñoz Medina and Sergei Vassilvitskii. Revenue optimization with approximate bid predictions. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 49 December 2017, Long Beach, CA, USA, pages 1856–1864, 2017.
 [PSK18] Manish Purohit, Zoya Svitkina, and Ravi Kumar. Improving online algorithms via ml predictions. In Advances in Neural Information Processing Systems, pages 9661–9670, 2018.
 [RS13] Alexander Rakhlin and Karthik Sridharan. Online learning with predictable sequences. In Proceedings of the 26th Annual Conference on Learning Theory (COLT), 2013.
 [SHG${}^{+}$15] D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, JeanFrancois Crespo, and Dan Dennison. Hidden technical debt in machine learning systems. In Proceedings of the 28th International Conference on Neural Information Processing Systems, NIPS’15, pages 2503–2511, Cambridge, MA, USA, 2015. MIT Press.
 [ST85] Daniel D. Sleator and Robert E. Tarjan. Amortized efficiency of list update and paging rules. Commun. ACM, 28(2):202–208, February 1985.
 [ST04] Daniel A. Spielman and ShangHua Teng. Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time. J. ACM, 51(3):385–463, 2004.
 [SZS${}^{+}$14] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014.
Appendix A Proof of Lemma 3.1
In this section, we provide the proof of the lemma connecting spread to absolute and squared loss. Before doing so, we provide a useful auxiliary lemma.
Lemma A.1.
For odd $T=2n+1$, one pair $({A}_{T},{B}_{T})$ minimizing either absolute or squared loss subject to the constraints of the spread definition is ${A}_{2n+1}=(0\mathrm{\dots}2n)$ and ${B}_{T}=(n\mathrm{\dots}n)$.
Proof.
First we show that there exists a ${B}_{T}$ minimizing the loss with ${b}_{i}={b}_{j}$ for all $i,j$. Assume otherwise; then there exist two subsequent $i,j$ with ${b}_{i}^{\prime}>{b}_{j}^{\prime}$. Since $$ by the assumption on spread, ${\mathrm{min}}_{x\in {b}_{i},{b}_{j}}\{\mathrm{\ell}({a}_{i},b)+\mathrm{\ell}({a}_{j},b)\}\le \mathrm{\ell}({a}_{i},{b}_{i})+\mathrm{\ell}({a}_{j},{b}_{j})$. Applying this recursively, we conclude that such a ${B}_{T}$ exists.
Second, we show that there exist an ${A}_{T}$ that consists of elements ${a}_{i+1}={a}_{i}+1$. Since the elements of ${B}_{T}$ are all equal to $b$, the sequence ${\sum}_{i=0}^{2n}\mathrm{\ell}({a}_{i},b)$ is minimized for both absolute and squared loss when ${a}_{i}=b+in$.
Last, the exact value of $b$ does not make a difference and therefore we can set it to be $b=n$ concluding the lemma. ∎
Lemma 3.1 restated:
For absolute loss, ${\mathrm{\ell}}_{1}(A,B)={\sum}_{i}{a}_{i}{b}_{i}$, the spread of ${\mathrm{\ell}}_{1}$ is ${S}_{{\mathrm{\ell}}_{1}}(m)\le \sqrt{5m}$.
For squared loss, ${\mathrm{\ell}}_{2}(A,B)=\sum {({a}_{i}{b}_{i})}^{2}$, the spread of ${\mathrm{\ell}}_{2}$ is ${S}_{{\mathrm{\ell}}_{2}}(m)\le \sqrt[3]{14m}.$
Proof.
It will be easier to restrict ourselves to odd $T=2n+1$ and also assume that $T\ge 3$. This will give an upper bound on the spread (which is tight up to small constant factors). By Lemma A.1, a pair of sequence minimizing absolute/squared loss is ${A}_{T}=(0,\mathrm{\dots},2n)$ and ${B}_{T}=(n,\mathrm{\dots},n)$. We now provide bounds on the spread based on this sequence, that is we find a $T=2n+1$ that satisfies the inequality $\mathrm{\ell}({A}_{T},{B}_{T})\le m$.
Absolute loss: The absolute loss of the above sequence is:
$$\mathrm{\ell}({A}_{T},{B}_{T})=2\cdot \sum _{j=1}^{n}j=2\cdot \frac{n(n+1)}{2}=n(n+1)=\frac{T1}{2}\cdot \frac{T+1}{2}=\frac{{T}^{2}1}{4}.$$ 
A $T$ that makes $\mathrm{\ell}({A}_{T},{B}_{T})\ge m$ is $T=\sqrt{4m+1}$. Therefore, for absolute loss ${S}_{\mathrm{\ell}}(m)\le \sqrt{5m}$, since $m\ge 1$
Squared loss: The squared loss of the above sequence is:
$$\mathrm{\ell}({A}_{T},{B}_{T})=2\cdot \sum _{j=1}^{n}{j}^{2}=2\cdot \frac{n(n+1)(2n+1)}{6}=\frac{({T}^{2}1)\cdot T}{12}=\frac{{T}^{3}T}{12}\ge \frac{8{T}^{3}}{9\cdot 12}=\frac{2{T}^{3}}{27}$$ 
where the inequality holds because $T\ge 3$.
A $T$ that makes $\mathrm{\ell}({A}_{T},{B}_{T})\ge m$ is $T=\sqrt[3]{14m}$. Therefore, for squared loss ${S}_{\mathrm{\ell}}(m)\le \sqrt[3]{14m}$. ∎