Online Algorithms for Weighted Paging with Predictions
Formal Metadata
Title: Online Algorithms for Weighted Paging with Predictions
Number of Parts: 123
License: CC Attribution 3.0 Germany: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/49440 (DOI)
Transcript: English (auto-generated)
00:10
Hello, my name is Kevin Sun. Today I'll be talking about weighted paging with predictions. This is joint work with Zhihao Jiang and Debmalya Panigrahi. In the paging problem, there are N pages, and our cache has size K. At each time,
00:24
a page P is requested. If P is in the cache, then we do nothing, and this is known as a cache hit. If P is not in the cache, this is known as a cache miss. We have to choose a page to evict from the cache, and then we fetch P into the cache. The goal is to minimize the number of evictions. In the weighted
00:41
version of this problem, every page has a weight, and the goal is to minimize the total weight of the evicted pages. Now let's consider some known results for this problem. We measure an algorithm's performance by its competitive ratio, defined as the worst-case ratio, taken over all inputs, of the algorithm's cost to the optimal cost.
01:04
For the unweighted problem, the optimal offline algorithm is Belady's rule, which states that among the K pages in the cache, we should evict the one that will be requested farthest in the future.
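As a concrete illustration, here is a minimal sketch of Belady's rule for the unweighted problem (not from the talk; it assumes the entire request sequence is known in advance and simply counts evictions):

    # Minimal sketch (illustration only): Belady's rule for unweighted paging.
    # Assumes the full request sequence is known in advance; on a miss with a
    # full cache, it evicts the page whose next request is farthest in the future.
    def belady(requests, k):
        cache = set()
        evictions = 0
        for t, p in enumerate(requests):
            if p in cache:
                continue                      # cache hit: do nothing
            if len(cache) == k:               # cache miss with a full cache
                def next_use(q):
                    for s in range(t + 1, len(requests)):
                        if requests[s] == q:
                            return s
                    return float("inf")       # never requested again
                victim = max(cache, key=next_use)
                cache.remove(victim)
                evictions += 1
            cache.add(p)                      # fetch p into the cache
        return evictions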
01:23
For deterministic online algorithms, Sleator and Tarjan proved that commonly used heuristics, such as first-in-first-out (FIFO) and least recently used (LRU), are O(K)-competitive, and they also gave a matching lower bound. For randomized algorithms, Fiat and others gave an algorithm known as the Marker algorithm, showed that it is O(log K)-competitive, and also proved a matching lower bound. In the weighted setting, the offline version of this
01:41
problem is reducible to minimum-cost maximum flow. For deterministic algorithms, Chrobak and others gave an algorithm that is O(K)-competitive, and the lower bound from the unweighted setting continues to hold. For randomized algorithms in the weighted setting, Bansal and others
02:01
gave an O(log K)-competitive algorithm. All of these algorithms assume that the input is worst case, but the question we want to ask now is: what if we have predictions about the future? In practice, the input to the paging problem is not adversarial; rather, we have historical data that might be
02:21
able to tell us what the future holds. In order to discuss paging with predictions, we first have to define the prediction models. The first one, which we'll call per-request prediction (PRP for short), was introduced by Lykouris and Vassilvitskii in 2018. Upon the
02:41
arrival of a page P, the prediction tells us the next arrival time of P. For example, if the cache state is currently A, B, C, D, the time is 5, and the current request is B, then for each of the pages that have appeared so far, we see its next arrival time: B will next appear at time 7, A at 8, D at 10, and C
03:04
will appear at 12. Another prediction model we look at is the L-strong lookahead model, introduced by Albers in 1997. This model says that at each time, the algorithm sees the current request and the next L distinct requests, not
03:21
including the current request. In this example, 2-strong lookahead sees the next 2 distinct requests other than the current request B, so we would see E, B, A, which are the next 3 requests. If the predictions are entirely correct, notice that PRP allows us to use Belady's
03:42
rule to obtain an optimal solution: recall that among the K pages in the cache, we simply evict the one that will appear farthest in the future; in this example, it would be C. For L-strong lookahead, Albers proved that a lookahead of size L improves the deterministic competitive ratio
04:00
from O(K) to O(K - L), gave a randomized algorithm with competitive ratio O(log(K - L)), and showed that matching lower bounds hold as well. Now let's consider the per-request prediction model with errors. Recall that if the predictions are perfect, then we have an optimal
04:24
offline solution. But what if the prediction has errors? Lykouris and Vassilvitskii (2018) gave a modification of the well-known Marker algorithm whose competitive ratio satisfies three desirable properties. One is that the performance improves as the error decreases, where the error is suitably
04:44
defined, as we'll see later. Second, when the error is zero, that is, if the predictions are perfect, the algorithm is constant-competitive. And third, no matter what the predictions say, the algorithm is always O(log K)-competitive. Rohatgi gave another algorithm that also satisfies these
05:05
properties, but with a smaller dependence on the prediction error. Now let's consider our results for weighted paging. The first thing we show is that PRP and L-strong lookahead both yield the same asymptotic bounds as algorithms without predictions. In
05:22
other words, a deterministic algorithm equipped with perfect per-request predictions is still at least Omega(K)-competitive, randomized algorithms are at least Omega(log K)-competitive, and the same holds for L-strong lookahead as long as L is at most N - K. In light of these lower bounds, we consider a different
05:41
prediction model, which we'll call strong PRP; it combines the strengths of PRP and L-strong lookahead. The details will be given later in the talk. We show that if the strong PRP is perfect, then we can obtain a 2-competitive algorithm, and if it is not perfect, then we give various upper and
06:02
lower bounds in terms of the error, which we'll also define later in the talk. For now, we'll focus on the deterministic lower bound for per-request predictions. Before looking at the lower bound, let's take a look at a simple example that illustrates why weighted paging is difficult. Suppose the cache
06:21
has size two, and there are three pages A, B, and C, whose weights are 1, W, and W^2 for some large constant W, and suppose the cache currently contains B and C. The example is similar to the ski rental problem, for those who are familiar with it. At each point, the adversary can request the pair A, B. In order to serve this
06:43
pair, the algorithm has two choices. One is to evict B and fetch A, and then, when B is requested, evict A and fetch B; serving the pair this way costs W + 1. The other choice is to evict C for a cost of W^2. Initially this is more expensive, but the benefit is
07:01
that any subsequent A, B pairs would be automatically served if we evict C. The problem is that the algorithm doesn't know how many times the pair A, B will be requested. If this is a very large number, then it's worth it to evict C in the beginning. And if it's a very small number, then it's worth it to swap A and B repeatedly. But if the algorithm never evicts C, then swapping A and B will
07:24
eventually cost more than evicting C in the beginning. So at some point the algorithm evicts C; suppose this happens after T of these pairs. The adversary then requests C, and now we examine the costs of the algorithm and of the optimal solution on this sequence. The
07:42
algorithm has paid about W times T (roughly W for each pair), plus W^2 at the end, whereas the optimal solution pays the minimum of these two quantities. So if we look at the ratio, it's at least two.
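Writing this calculation out in symbols (with lower-order terms suppressed):

    \mathrm{ALG} \;\gtrsim\; T\,W + W^{2},
    \qquad
    \mathrm{OPT} \;\lesssim\; \min\{\, T\,W,\; W^{2} \,\},
    \qquad\Longrightarrow\qquad
    \frac{\mathrm{ALG}}{\mathrm{OPT}} \;\gtrsim\; \frac{T\,W + W^{2}}{\min\{T\,W,\; W^{2}\}} \;\ge\; 2 ,

since for any positive quantities a and b, (a + b) / min{a, b} is at least 2.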
08:02
Now let's generalize the example to a cache of size K. In this case, there are K + 1 pages, A_0, A_1 through A_K, and we set the weight of page A_i to W^i for some constant W at least 2. At each time, the algorithm's cache is missing some page; the adversary identifies this page, let's call it A_L. At this point, the adversary requests an entire L-block, rather than just the page
08:23
A_L as we saw before, where an L-block is the sequence A_0, A_1 through A_L. The purpose of requesting an L-block is to keep the arrival-time predictions given by the PRP model satisfied. They aren't quite satisfied if we do exactly this, but as we'll see later, by
08:42
modifying the L-blocks, we'll be able to satisfy the predictions. (If the adversary were to request only A_L right now, that might not satisfy predictions given previously to the algorithm.) To serve this L-block, the algorithm pays at least W^L, because A_L is missing from the cache and so must be fetched. Note
09:02
that we can charge the cost of any algorithm to fetching rather than evicting, because over the course of any input sequence, the number of times a page is fetched differs by at most one from the number of times it is evicted. So
09:21
now we know that to serve an L-block, the algorithm has to pay at least W^L. The question becomes: how much does the optimal algorithm pay? Instead of considering the optimal algorithm directly, we consider K algorithms, one for each i in 1 through K, where algorithm Alg_i is defined as follows: on a cache miss, we
09:42
evict A_0 if possible, but if A_0 is the page being requested, then we evict A_i. Notice that this is well defined because there are K + 1 pages and the cache has size K, so exactly one of these cases happens. Also notice that Alg_i essentially swaps between A_0 and A_i, so its cache always contains all the pages except that one of A_0 and A_i is missing.
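For concreteness, here is a minimal sketch of Alg_i (illustration only; it assumes pages are indexed 0 through K, page j has weight W^j, and the cache initially holds A_1 through A_K):

    # Sketch (illustration only) of Alg_i from the lower-bound argument:
    # on a miss, evict A_0 if it is cached; if A_0 itself is requested, evict A_i.
    # Pages are represented by their indices 0..K; page j costs W**j to fetch.
    def alg_i_cost(requests, K, i, W):
        cache = set(range(1, K + 1))          # assumption: start with A_1, ..., A_K cached
        cost = 0
        for p in requests:                    # p is the index of the requested page
            if p in cache:
                continue                      # cache hit
            victim = 0 if 0 in cache else i   # evict A_0 if possible, else A_i
            cache.remove(victim)
            cache.add(p)
            cost += W ** p                    # charge the fetch of p (equivalent to charging evictions)
        return cost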
10:04
Now let's consider how much Alg_i pays on an L-block. Recall that we're charging all the algorithms their fetching cost, because fetching and evicting are essentially the same. If i is at most L, then Alg_i, in order to serve
10:23
A_0, might have to fetch it at a cost of 1, and to serve A_i, which appears in this L-block, it might have to fetch it at a cost of W^i, so the total cost to serve this L-block is at most W^i + 1. If i is bigger than L, then in order to serve A_0, it might have to fetch
10:42
A_0 at a cost of 1, but A_i will not appear in this L-block, so the cost of the algorithm in this case is at most 1. Summing across the K algorithms, we see that the total cost of all the algorithms to serve this L-block is on the order of W^L. So now we
11:03
know that the online algorithm pays at least W^L per L-block, and that the K algorithms Alg_1 through Alg_K, taken all together, pay on the order of W^L. So we can conclude with an averaging argument: the optimal solution costs at most as much as the best algorithm Alg_i in hindsight, and the cost of this best algorithm is
11:20
at most the average cost. So we can say that the optimal solution costs at most (1/K) times O(W^L) per block, which is at most (1/K) times O(ALG), and so ALG is Omega(K) times OPT.
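In symbols, summing over all the blocks in the request sequence (and hiding constants that depend on W):

    \mathrm{ALG} \;\ge\; \sum_{\text{blocks}} W^{L},
    \qquad
    \sum_{i=1}^{K} \mathrm{cost}(\mathrm{Alg}_i) \;=\; \sum_{\text{blocks}} O(W^{L}) \;=\; O(\mathrm{ALG}),

    \mathrm{OPT} \;\le\; \min_{i} \mathrm{cost}(\mathrm{Alg}_i)
    \;\le\; \frac{1}{K} \sum_{i=1}^{K} \mathrm{cost}(\mathrm{Alg}_i)
    \;=\; O\!\left(\frac{\mathrm{ALG}}{K}\right)
    \quad\Longrightarrow\quad
    \mathrm{ALG} \;=\; \Omega(K) \cdot \mathrm{OPT}.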
11:40
So far, however, the predictions might not be satisfied. Recall that the adversary checks which page A_L is missing from the cache and then requests an L-block, A_0, A_1 through A_L. The problem is that this exact sequence might not satisfy the predictions given previously to the algorithm. Recall that under PRP, every time the
12:01
algorithm receives a page request, PRP tells the algorithm the exact next arrival time of that page, and if we simply request A_0 through A_L, these earlier predictions might be violated. There are two ways a prediction could be violated, and in both cases we show how to modify the L-block so as to simultaneously satisfy the
12:21
predictions and still provide this adversarial input to the algorithm. One problem is if the algorithm has not evicted some page X, but PRP previously predicted that X will arrive soon. Think of X as a large page like A_K. The small pages are handled fine because every L-block
12:42
contains A_0, A_1 through A_L, but if the algorithm hasn't evicted some large page like A_K, then the adversary doesn't get the opportunity to request A_K and meet the prediction. In this case, we simply request an X-block instead of the L-block determined by the missing page. We show that the number of times
13:03
this happens isn't very large, and we can charge it to the regular blocks. The other situation is that the algorithm has evicted some page, but the prediction says it will arrive much later than now. In this case, instead of giving a standard L-block, we pad the block by duplicating the
13:22
requests within the L-block as necessary, so that this page X arrives as late as the prediction requires. We've now seen the lower bound for deterministic algorithms with PRP: any deterministic algorithm that is told the next arrival time of each request does not
13:41
perform asymptotically any better than an algorithm without predictions. The same holds for randomized algorithms, except that the input is constructed randomly rather than deterministically. For L-strong lookahead, we omit the proof, but we note that the same result holds. So an algorithm for weighted paging with L-strong lookahead, if L is
14:02
not too big, performs no better asymptotically than algorithms without predictions. So now we consider a different prediction model that effectively combines the strengths of PRP and strong lookahead. We'll show that if the predictions are perfect, this prediction model can
14:21
yield a 2-competitive algorithm, and we also consider various upper and lower bounds in terms of the error of this prediction. As I mentioned before, strong PRP combines the two prediction models. In particular, at each time T, the algorithm can see the current request P_T, and it can also see the next arrival time of P_T; this is from
14:42
the PRP. It can also see all the pages requested until that arrival time; this is like strong lookahead. Going back to our example, suppose the cache currently has pages A, B, C, D, the time is 5, and the current request is B. According to PRP, we can see the next arrival times of A, B, C, and D. From 2-
15:01
strong lookahead, we can see the next three pages, E, B, and A. But SPRP allows us to see everything from E, B, A all the way to C, because C is the farthest-appearing page that we've already seen. Notice that after every page has been seen once, SPRP is equivalent to (N - 1)-strong lookahead.
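To make the comparison concrete, here is the running example written out (illustration only; the requests at times 9 and 11 are not specified in the example, so they appear as placeholders):

    # What each prediction model reveals at time 5 of the running example
    # (cache = {A, B, C, D}, current request = B). Illustration only.
    prp = {"B": 7, "A": 8, "D": 10, "C": 12}            # predicted next arrival of each page seen so far
    two_strong_lookahead = ["E", "B", "A"]              # window covering the next 2 distinct requests other than B
    # SPRP: everything up to time 12, the farthest predicted arrival among
    # pages seen so far (namely C); times 9 and 11 are unspecified in the example.
    sprp_window = ["E", "B", "A", "?", "D", "?", "C"]   # requests at times 6 through 12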
15:21
Now we discuss the algorithm that is 2-competitive, assuming that the strong per-request predictions are always correct. At each time J, we let L_J denote the lookahead that we can see, that is, all the pages we can currently
15:40
see. The algorithm proceeds as follows. At the first time step, we check how far ahead we can see, and we run an optimal algorithm on these pages; recall that the SPRP is completely correct, so we can simply serve these pages optimally. Say we
16:00
finish at time T; then, at that time, we look ahead in our SPRP again and run an optimal algorithm on everything we can see. Notice that while we're handling the current batch of pages, we ignore all predictions that arrive while we are serving these pages. We continue in this batch-by-batch manner, where within each batch we ignore future input until the batch ends.
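A minimal sketch of this batch-by-batch scheme (illustration only; `lookahead` and `serve_optimally` are hypothetical helpers standing in for the SPRP oracle and an offline solver):

    # Sketch (illustration only) of the batch-by-batch algorithm under perfect SPRP.
    # lookahead(t) returns the list of requests currently visible at time t;
    # serve_optimally(batch, cache) serves a fully known batch at optimal cost
    # and returns the resulting cache contents. Both helpers are hypothetical.
    def serve_with_sprp(lookahead, serve_optimally, cache, num_requests):
        t = 0
        while t < num_requests:
            batch = lookahead(t)              # everything SPRP lets us see right now
            # Serve the whole batch optimally, ignoring any predictions that
            # arrive while the batch is being served.
            cache = serve_optimally(batch, cache)
            t += len(batch)                   # resume at the first request we have not yet seen
        return cache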
16:24
Now let's analyze this algorithm. Notice that when a page is requested for the last time in a particular batch, the SPRP tells the algorithm the next time that this page will appear, which is in essence in the
16:41
next batch, as well as all the pages until that request. In other words, it essentially adds this page to the next batch, which means the batches satisfy a nesting structure in which each one's set of pages is a subset of the next one's, except possibly for the second-to-last batch. Now, recall that the
17:02
algorithm serves the input batch by batch: it runs an optimal solution on each batch, ignoring everything that comes after it. One option the algorithm has when serving a particular batch is to transition to the cache state that the optimal solution for
17:21
the entire sequence has at this current time, and then do whatever that optimal solution (again, for the entire sequence) does on this particular batch. The algorithm does not necessarily know this optimal solution; in fact, it doesn't know the optimal solution for
17:41
the entire sequence at all. But in computing the optimal solution for this batch, one option it implicitly considers is to move to that cache state and imitate OPT. Based on the nesting structure, we can show that the cost of transitioning is at most the total optimal cost, and that running the
18:04
optimal solution on this batch, just imitating it, also costs at most OPT. So overall, this algorithm is 2-competitive. So far, we've assumed that the predictions, for both our lower and upper bounds, are perfect, that they never contain
18:20
any error. But in practice, they are likely to contain errors, so now we discuss some ways of measuring this error. Recall that at each time there is a request P_T; there is an actual next arrival time, which we'll call Y_T, and the prediction gives us a predicted next arrival time, which we'll call Y-hat
18:41
T. To measure the error of the overall prediction, we sum over T the weighted absolute difference between the two. To give an example, suppose the input is A B C A C B; when the first A arrives, its actual next arrival time is 4, for B it's 6,
19:01
and for C it's 5. Suppose the predictions were 2, 5, and 8. In this case, the error would be 2 times the weight of A (that's 4 minus 2), plus 1 times the weight of B, plus 3 times the weight of C. Instead of considering this error on its own, we actually consider a normalized version, obtained by dividing the error by the optimal cost to serve the input.
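In symbols, with y_t the actual next arrival time of the request p_t, \hat{y}_t the predicted one, and w(p_t) the page's weight:

    \eta \;=\; \sum_{t} w(p_t)\,\bigl|\, y_t - \hat{y}_t \,\bigr| ,
    \qquad
    \varepsilon \;=\; \frac{\eta}{\mathrm{OPT}} .

For the example above:

    \eta \;=\; |4-2|\,w(A) + |6-5|\,w(B) + |5-8|\,w(C) \;=\; 2\,w(A) + w(B) + 3\,w(C).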
19:22
This gives us a few benefits. The first is that if we duplicate the input and the prediction so that they're both twice as long, the raw error value increases, but in some sense the prediction hasn't gotten worse just because the input is longer; so it is beneficial that epsilon doesn't increase in
19:42
this situation. Another benefit is that if the raw error value is large but the optimal cost is also large, then in some sense the prediction is not actually that bad; a large error is much worse when the optimal cost is small. With
20:03
epsilon defined this way, Lykouris and Vassilvitskii gave an algorithm whose competitive ratio is the minimum of 1 plus the square root of epsilon and O(log K). Notice that as the error epsilon decreases, if it's small enough, then so does the competitive ratio, and no matter what, the competitive ratio is always at
20:22
most O(log K). This overcomes the lower bound for algorithms with no predictions. Rohatgi improved the dependence on epsilon from the square root of epsilon to (epsilon / K) times log K.
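Stated symbolically (up to constant factors, as the bounds are stated in the talk; epsilon denotes the normalized error defined above):

    \text{Lykouris--Vassilvitskii:}\quad \min\Bigl\{\, O(1) + O(\sqrt{\varepsilon}),\; O(\log K) \,\Bigr\},
    \qquad
    \text{Rohatgi:}\quad \min\Bigl\{\, O(1) + O\!\Bigl(\tfrac{\varepsilon \log K}{K}\Bigr),\; O(\log K) \,\Bigr\}.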
20:41
Notice that both algorithms have a sublinear dependence on epsilon, and if epsilon is small enough, then they're both o(log K), which again overcomes the lower bound for algorithms without predictions. Our first result for the error defined this way says that such a result is not obtainable in the weighted paging
21:00
setting. In particular, no SPRP algorithm for weighted paging has competitive ratio o(log K) + o(epsilon). Unfortunately, the error as defined before is somewhat sensitive to off-by-one mistakes; in other words, a small mistake in the prediction can yield a
21:21
large error. For example, if the prediction looks like A B C B D C A, and the actual input is exactly the same thing except there's a page Y inserted at the beginning, then the prediction as defined before is off for every single page by a distance of
21:40
at least one. So the error value we get is at least the sum of the weights of all the pages in the entire input, which seems quite large given that the only real mistake is being off by one. For this reason, we consider eta prime, a modified version of this distance. The exact definition we won't get
22:01
into in this talk, but for this example, what happens is that the pages that are correct can all be matched up, and so eta prime for this prediction sequence would only be w(Y), the weight of the missing page Y. For this new measure of error, we also give a similar
22:24
lower bound: we show that no SPRP algorithm with this error measure has competitive ratio o(log K) + o(epsilon prime). But we also give an algorithm that uses one additional cache slot and is O(1 + epsilon prime)
22:41
competitive. In other words, we do get a linear dependence on the error. Now we discuss some opportunities for future work. Our first direction is to find and analyze other prediction models. The prediction models mentioned today, including per-request prediction, strong lookahead, and strong per-
23:01
request prediction, all allow us to see arbitrarily far into the future, and in practice this isn't very likely to be available. So the first question is: what are some realistic prediction models for paging? And using these prediction models, can we
23:20
design competitive algorithms that, ideally, beat the lower bounds for algorithms without predictions and are constant-competitive if the predictions are entirely correct? Another approach would be to open the black box of predictions. In
23:41
other words, the algorithms discussed today treat the predictions as a single black box and have no control over how these predictions are obtained. Perhaps by digging deeper and specializing the predictions to fit the algorithm's needs, we can design algorithms with better competitive ratios. The second line of
24:01
work would be to extend predictions to other algorithmic frameworks. In particular, the optimal algorithm for weighted paging without predictions solves and rounds a linear program, whereas the algorithms discussed today are all combinatorial in nature. So one question would be: would
24:22
we be able to incorporate predictions of this combinatorial nature into a framework such as linear programming? Being able to do this would give us a truly robust algorithm for weighted paging: in particular, an algorithm whose competitive ratio beats O(log K) if the
24:45
prediction error is small, but is also always at most O(log K). That's it. Thanks.