
Algorithmic and statistical perspectives of randomized sketching for ordinary least-squares


Formal Metadata

Title
Algorithmic and statistical perspectives of randomized sketching for ordinary least-squares
Series Title
Part
6
Number of Parts
10
Author
License
CC Attribution 3.0 Unported:
You may use, modify, and reproduce the work or its content in unchanged or modified form for any legal purpose, and distribute it and make it publicly available, provided you credit the author/rights holder in the manner specified by them.
Identifiers
Publisher
Year of Publication
Language

Content Metadata

Subject Area
Genre
Abstract
Garvesh Raskutti - Algorithmic and statistical perspectives of randomized sketching for ordinary least-squares. In large-scale data settings, randomized 'sketching' has become an increasingly popular tool. In the numerical linear algebra literature, randomized sketching based on either random projections or sub-sampling has been shown to achieve optimal worst-case error. In particular, the sketched ordinary least-squares (OLS) solution and the CUR decomposition have been shown to achieve optimal approximation error bounds in a worst-case setting. However, until recently there has been limited work considering the performance of the sketched OLS estimator under a statistical model and with statistical metrics. In this talk I present some recent results that address the performance of sketching in the statistical setting, where we assume an underlying statistical model, and show that many of the existing intuitions and results are quite different from those in the worst-case algorithmic setting.
Transcript: English (automatically generated)
So firstly, thanks very much to the organizers for inviting me. It's a real privilege to have the opportunity to present here.
And thanks to everyone for coming. So yeah, today I'll be talking about kind of statistical and then computational or algorithmic perspectives of randomized sketching algorithms. So in particular, I'll be kind of trying to sort of unify or explain sort of two different perspectives of sketching, which I'll introduce. So kind of the broad focus of this workshop is the idea of kind of
when you have large scale data problems in machine learning or statistics or optimization or computer science. You kind of are looking at either statistical or computational or sort of both perspectives of that. And that's sort of a real issue in many kind of modern, large scale data problems. And so kind of in an ideal world, what you really want is something that
is computationally feasible, that you can run on sort of whatever computational resources you have available. But then it also has kind of good performance in some statistical or other metric that you might be interested in. And so, I mean, there's many different ways that people have
done that, and we've had a few different approaches in this workshop, and we will hear more today. But one approach that's kind of received a fair amount of attention in the last sort of five to ten years is this notion of sketching. And also, I'll define what that precisely means in the context of the problem that I talk about. But what kind of sketching refers to is if you have very large scale
data, so at a high level you have very, very large scale data. If you kind of sketch or project it onto something low dimensional and then run whatever algorithm you were going to run on the large data set, but now on this sketched or smaller data set, then that's one way to certainly improve,
like reduce, computation, cuz you've reduced the size of your data. And ideally, you sort of have not lost too much, relative to the full computation, from a statistical or accuracy perspective. And so, I mean, how you do the sketching is sort of, it depends on the problem, but there has been, and this is a very,
very sparse sampling of the sketching literature. There's been lots and lots of other work done, but here are just a few relevant examples. But if you do particularly things like randomized projections or sub-sampling, there are some results to show that, in some sense, you gain computationally because you've reduced the data size and run your algorithm in a much smaller data set.
But then also potentially not lost too much in terms of accuracy. And this is sort of broadly based on ideas due to the Johnson-Lindenstrauss lemma for dimension reduction and then concentration of measure. But in particular for ordinary least squares, which is what I'll be talking about today, there are some results to show that you potentially don't lose much while gaining
computationally. There's also some work related to the CUR decomposition. So that's kind of trying to do like SVD type decompositions or some kind of low dimensional reduction techniques, but using kind of random projections to speed up computation.
And there's also been this kind of line of work on iterative sketching, if you want to solve linear systems. So these are just a couple of examples where this idea of sketching, of reducing the data size, is potentially going to sort of allow us to do faster computations without losing too much.
So I guess just to sort of explain the genesis of this problem. So in some ways, there's two sort of perspectives of this. So there's a lot of the work that's been done on sketching prior to this has largely been, I would say, more in the applied math theoretical computer
science literature. And so I'm kind of describing this as the algorithmic or computational perspective of sketching. So people like Michael Mahoney, Dan Spielman, and others have kind of worked on this a lot. And the general principle of it is that by doing sketching, you still get sort of close to optimal worst case error bounds. And I'll kind of define what that means for
this particular problem, even by doing sketching. So you're kind of gaining a lot computationally and basically not losing much from an accuracy perspective. And there's some kind of results that sort of support this. But I'm sort of originally trained in statistics and kind of used to thinking about things in a statistical way.
And so when I think about sort of the sketching type methods that are being talked about in the literature here, in some sense you're throwing away a lot of your data. And I'll talk about exactly how much for the problem that I'll be focusing on, but it's a lot, like more than half your data, basically.
And in some sense, what these results are saying is that you're not losing much by kind of throwing away a lot of your data. So this didn't really sort of make much sense to me. Cuz obviously if you have more data you're gonna do better, and if you keep less than half your data, you should be losing something. And so this was something that at some level, when I started working on this problem in early 2013, I sort of didn't really
understand why you were gaining something, or why you were not losing too much. And so the goal of this talk is basically to kind of unify these two perspectives. So essentially, I sort of ran into Michael, I guess at a conference at the beginning of 2013. And sort of having read some of this work on sketching,
I sort of didn't really understand exactly sort of how you were not losing too much. And we'll see that the answer to that in the context of this ordinary least squares or regression type problem is that it's just that sort of we're coming at it from two different perspectives. So we had a lot of long discussions and
arguments about this. And I guess eventually we just decided to sort of resolve by just writing a paper together, which I'll talk about. So just to sort of be concrete about the problem that I'm looking at, so this is kind of the simplest problem that you can think of, but it's still sort of very widely used. It's this notion of you're doing least squares for
large scale problems and analogously just solving large scale linear systems. So if we think about ordinary least squares, you have two things, your data, like your x and your y. So x is an n by p matrix and y is an n dimensional vector. And to make things concrete, we're assuming that both n and p are extremely large, but n's much, much larger than p.
And for simplicity and without loss of generality, we're assuming that the rank of x is just p. So you can kind of easily apply this if you have lower rank structure, but we're gonna assume for simplicity that the rank of x equals p. And so we all sort of know how to solve the ordinary least squares problem.
So this is something that can be done, but obviously the computational cost of that is order of np squared, which is perfectly reasonable in many cases. But when you're in the extremely large scale data setting, this is potentially not feasible. And particularly if you're doing least squares
iteratively, you in some sense ideally would wanna sort of potentially reduce this computation. So although this is sort of a very simple problem that in some sense you would think is more or less solved, as has kind of been described in lots of earlier talks, so Alex touched upon this in his talk. And then Vinay mentioned this, that in some sense a lot of problems are reduced to sort of solving the least
squares problem, so any gains you can get computationally to solve that are certainly beneficial. So the idea of doing sketching in the least squares sense is, again, pretty simple. You apply some sketching matrix S, so S is this sort of r by n matrix, where r is much, much smaller
than n. And then you're going to do your least squares, but now on your sketched data. So you have SX and Sy, so now SX is r by p, and Sy is just an r-dimensional vector. And you're gonna get an estimator that's based on minimizing this sketched least squares problem.
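To make the setup concrete, here is a minimal NumPy sketch of the full versus sketched least-squares solve; the dense Gaussian sketching matrix, the toy dimensions, and the variable names are illustrative assumptions rather than the specific scheme discussed in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, r = 20_000, 50, 1_000             # n >> p and p < r << n (toy sizes)

X = rng.standard_normal((n, p))          # data matrix, assumed to have rank p
y = rng.standard_normal(n)               # response vector

# Full ordinary least squares: costs O(n p^2).
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Sketched least squares: draw an r x n sketching matrix S and solve
#   min_beta || S y - S X beta ||_2^2 ,
# which costs O(r p^2) once (S X, S y) have been formed.
# (A dense Gaussian sketch is used here only for illustration; forming S X this
#  way is itself expensive, which is why fast projections or row sampling are
#  used in practice.)
S = rng.standard_normal((r, n)) / np.sqrt(r)
beta_sketch, *_ = np.linalg.lstsq(S @ X, S @ y, rcond=None)

print("||beta_sketch - beta_ols|| =", np.linalg.norm(beta_sketch - beta_ols))
```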
Okay, and so the computational complexity of this is order of rp squared, plus whatever computation was involved in getting that sketch. So if you can come up with a computationally efficient way to get that sketch, then you can potentially save a lot of computation by doing this. And I'm gonna talk a little bit about what are reasonable
choices for sketching matrices later. But the idea is that you, in some sense, want your beta S to be a reasonably good approximation to beta OLS under some metrics. And I'll talk about what metrics are used and sort of how the choice of the metrics is really important.
Okay, so this is sort of the prior work that was basically saying that sketching works really well. So this is kind of work by Drineas, Mahoney and others back in 2011, and what basically supported the idea of doing sketching for least squares was that you're gaining a lot computationally and
not losing much from an accuracy perspective. So basically what you assume is that y and X are fixed, and they can be whatever they are. And a reasonable way of measuring how much you lose by doing your sketching is you kind of compare
these kind of residual mean squared errors. So the norm of y minus X beta S squared, divided by the same quantity for the original OLS estimator. And here, to make it worst case, I'm taking a supremum over y.
Equivalently, I mean, a question you might have is, why not take a supremum over x as well? And for most of the results I talk about, you can apply that, but there's gonna be one result where it gets a little bit complicated. So essentially you can assume that it's just sort of any worst case setup. And what their results show is basically that essentially you
get that this quantity is only sort of one plus delta worse than the original least squares estimator. So that's why in some sense, and this was for various sketching schemes, and I'll illustrate some of them shortly, like doing random projections or doing subsampling types of sketching schemes.
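Writing out the criterion being described verbally here (my reconstruction in LaTeX, with beta_S the sketched solution, beta_OLS the full solution, and delta the approximation parameter mentioned):

```latex
C_{\mathrm{WC}}(S) \;=\; \sup_{y}\;
  \frac{\lVert y - X\beta_S \rVert_2^2}{\lVert y - X\beta_{\mathrm{OLS}} \rVert_2^2}
  \;\le\; 1 + \delta,
\qquad \text{for the sketching schemes above, once } r \gtrsim p \log p .
```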
So this was kind of the computational or algorithmic perspective, which was saying that you're not really losing anything by doing sketching. Okay, all right, and so now, so again, going back to the original goal, this was something I
didn't get when I sort of saw these results. How can you throw away a lot of your data? So r is much, much smaller than n, so you would think that by taking only a fraction r over n of the samples you must be losing something. And yet, according to these results, you're not really losing anything. So up to constants,
you're doing just as well as you were before. And so this was something I sort of wasn't really understanding, so I'm used to sort of thinking about things typically, question? Was there any condition on n and p in the previous result? Yes, so the condition, the only condition you need is that the, yeah, so the r has to be bigger than p log p.
So that's the only condition you need, that r has to be bigger than p log p. So you kind of, like, you're getting enough of the rows so that you're capturing most of it. Yeah, okay, and so,
so the way I'm sort of used to thinking about things, and this is really the typical way to think about things, is that when you're trying to solve least squares, this comes from some underlying generative model. So like a typical, like a Gaussian linear model. So you have this simple Gaussian linear model where you
have some true parameter beta, and then you're adding some noise which we're assuming for simplicity is isotropic, but that's sort of a fairly easy assumption to relax. So it's zero mean noise that's isotropic. And then there's a number of different metrics you can choose, and I'm gonna talk about two particular metrics
here. One, and typically you're often used to talking about efficiency when you're comparing estimators. So you can think of your sketched least squares solution as an estimator, and then your original OLS solution as an estimator too. And so one, this first metric which looks quite similar to the original one is the so-called residual efficiency,
which is just kind of comparing the residuals. But the difference is you're now taking an expectation, and this expectation's being taken over the noise epsilon. So you're taking an expectation over the noise epsilon, and this is how you're kind of comparing these two estimators. But the second one, which is often what statisticians are more interested in, is the so-called prediction efficiency.
So we're used to talking about prediction error, which is like x beta hat minus x beta, where beta's the true parameter. And this is the second metric we're gonna see. And we'll see that there's gonna be a very important distinction between these two metrics. And so ideally you want to determine good sketching
schemes that sort of do well in terms of all three of these criteria. But these two are really statistical types of criteria, whereas the one I presented on the previous slide was an algorithmic or computational criterion.
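In symbols, and again as my reconstruction of the slide: under the generative model just described, the two statistical criteria compare the sketched and full OLS estimators as

```latex
y = X\beta + \varepsilon, \qquad \mathbb{E}[\varepsilon] = 0, \qquad \operatorname{Cov}(\varepsilon) = \sigma^2 I_n,

C_{\mathrm{RE}}(S) = \frac{\mathbb{E}_{\varepsilon}\lVert y - X\beta_S\rVert_2^2}
                          {\mathbb{E}_{\varepsilon}\lVert y - X\beta_{\mathrm{OLS}}\rVert_2^2},
\qquad
C_{\mathrm{PE}}(S) = \frac{\mathbb{E}_{\varepsilon}\lVert X\beta_S - X\beta\rVert_2^2}
                          {\mathbb{E}_{\varepsilon}\lVert X\beta_{\mathrm{OLS}} - X\beta\rVert_2^2}.
```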
Okay, so now I'll talk a little bit about what sketching schemes are actually typically used in practice. And so there's a number that are widely used. So essentially there's two classes of sketching that are, I would say, very common. They're both based on kind of randomization. There are also some deterministic sketching schemes that are suggested.
Computationally, randomized sketching schemes generally work well. So one's based on sampling, and in particular, I'll talk about what leverage score sampling is shortly. But trying to sample in a way that's random, so you could think about sampling rows or columns or doing something more clever than that. And the other sort of approach that's widely used is
random projections. So you can imagine if your S is a random Gaussian or using that. And then there's also a Hadamard projection, that's a different kind of projection. But essentially there's two classes of randomized sketching that are widely used. And I'll be sort of focusing on these classes of sketching,
either doing sort of sampling or doing projection. Okay, so I mentioned this notion of leverage score sampling. So this is something that's sort of gained, particularly by Michael Mahoney and others, gained a lot of sort of traction in the last maybe sort of five years or so
in the theoretical computer science and machine learning literature. And so what leverage score. So when you're thinking about sampling kind of your Xs and your Ys, I mean the natural choice or thing to do would be just to sample uniformly. And that works okay, but essentially,
and we'll see this with some simulations later too. If you can do sampling in a more potentially clever way or more efficient way that picks the samples that do best in terms of mean squared error, you can potentially do more with less samples. And so one way to do this that's sort of been shown is
to pick the samples that have what are known as high leverage scores. So to explain leverage scores, you can think about the singular value decomposition of X, so that's U Sigma V transpose. If you take your U matrix,
which contains the left singular vectors, and you take the squared two norm of each of its rows, then these are what are called the leverage scores. These are numbers between zero and one, such that the sum of these n leverage scores adds up to p. And in some sense, the idea of leverage score sampling is
that you pick the rows that have the highest leverage scores. Okay, and it's been shown that the points with high leverage are, in some sense, best in terms of kind of reducing the mean squared error that we've talked about.
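A minimal computation of the scores via the thin SVD, as just described; the sampling step at the end is one common variant (probability proportional to leverage), and the commented alternative is the "take the top scores" variant mentioned here. Variable names and sizes are illustrative.

```python
import numpy as np

def leverage_scores(X):
    """Leverage scores: squared row norms of U from the thin SVD X = U S V^T.
    They lie in [0, 1] and sum to p (the rank of X)."""
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return np.sum(U**2, axis=1)

rng = np.random.default_rng(0)
n, p, r = 5_000, 20, 200
X = rng.standard_normal((n, p))

ell = leverage_scores(X)                 # shape (n,), sums to p
probs = ell / ell.sum()                  # leverage-based sampling probabilities

# Sample r rows with probability proportional to their leverage scores.
idx = rng.choice(n, size=r, replace=True, p=probs)

# (The 'top-k' variant mentioned in the talk would instead take
#  idx = np.argsort(ell)[-r:], i.e. the r rows with the largest scores.)
X_sub = X[idx]
```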
And we'll see in the simulations that they tend to work pretty well. And I guess on another sort of note, and this is kind of a quick diversion from what I'm talking about, but it's also related, in general this concept of leverage scores and points with high leverage actually goes back to some work in the robust statistics community in the 50s and 60s.
And in fact, so if you present this idea of sampling points with high leverage score to a statistics community, they'll often say that that's sort of a really, like not a smart thing to do at all. And the reason for that is because in some sense, typically points with high leverage reflect points that
may be outliers. And so in some sense like this, and so this is being proposed as a scheme to include points that reduce your mean squared error, but a statistician might often say that you're picking those points because they're sort of outliers. And so this again also illustrates maybe a different perspective in terms of how a statistician might look at high
leverage points and someone in sort of numerical linear algebra or computer science. Because in some sense, the high leverage points are best in terms of reducing mean squared error because they, if you view your least squares problems as having like no noise and no outliers, then those points are the ones that in some sense contain
the most information. And so they skew your mean squared error estimate, your ordinary least squares estimate the most. But on the other hand, if what you're fitting is instead the fact that they're outliers or noise, which is what typically people in statistics care about, then you're essentially just biasing your estimator in completely the wrong way.
So that's sort of a side note as to another sort of difference in perspective, but for now I'm focusing on the fact that these, in some sense your model's right, which is obviously a very strong assumption. But we, and so we're picking these high leverage points
because they reduce the overall mean squared error, okay? Any questions at this point? And so a natural question you might sort of have related to this is that our original goal was to reduce the computation of ordinary least squares and in some sense to get these leverage scores which we're gonna use for sampling,
we had to compute an SVD, which is the same computation as ordinary least squares. So why are we doing this, and are we gaining anything? And so there's some other work, done by Drineas, Mahoney and others, that shows that you can actually compute these leverage scores, not the exact leverage scores, but approximate leverage scores, fairly efficiently.
So with a computation of roughly order np, you can potentially do this step fairly efficiently as well. So that's why doing this computation isn't sort of, you're not just solving a problem that's as hard as the original one. Can you do this sampling based sketching like in
online fashion or you have to store all your data and pick the rows you're going to keep? Yeah, so that's a good question. I don't know of any scheme that allows you to do online because to compute the leverage- Something you can. Sorry? If you just do- Yeah, yeah, absolutely. So that's one of the, yeah, I agree.
So that's one of the potential weaknesses that if you're doing this in an online way, I don't think there's been any work on that. And I don't know how you'd do that cuz you sort of need to know all of the rows, yeah. Okay, so I guess I'm sort of building up to the goal is to
sort of explain these two different perspectives. And so just to sort of build up to that, there's a slide that in some sense talks about how you can relate how these two metrics perform, or how these three metrics perform, in terms of properties of this simple projection matrix.
So typically when you're solving ordinary least squares, and recall that we had the SVD of X on the previous slide, if you're just solving least squares, that's projecting y onto the column space of X. And so if we do sketching, then what you're doing is, rather than doing an orthogonal projection onto the column space of X,
you're kind of doing a projection using this sort of oblique projection matrix. So it is a projection matrix, in that if you square it, you get back what you had originally. But then it's oblique in the sense that it's not symmetric: if you take its transpose, you don't get the same matrix.
And so you can relate it to quantities that involve this oblique projection matrix. And I guess the main thing to take away from this is that if you look at these last two statistical metrics, then in some sense that you can see that up to constant, that this prediction efficiency kind of scales like n over p
times the residual efficiency. And that in a nutshell explains kind of the difference between these two perspectives and why you potentially will do very well in terms of worst case efficiency and residual efficiency, but not so well in terms of prediction efficiency, which is what kind of statisticians would potentially be most interested in.
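One rough way to see where the n over p factor comes from (my own back-of-the-envelope reconstruction, not a quote of the slide): the OLS residual is orthogonal to the column space of X, so the excess error of the sketched solution adds on top of both baselines, but the baselines have very different sizes, E||y - X beta_OLS||^2 = (n - p) sigma^2 versus E||X beta_OLS - X beta||^2 = p sigma^2. Ignoring a cross term in the prediction error, this gives roughly

```latex
\lVert y - X\beta_S\rVert_2^2
  = \lVert y - X\beta_{\mathrm{OLS}}\rVert_2^2 + \lVert X\beta_{\mathrm{OLS}} - X\beta_S\rVert_2^2
\;\;\Rightarrow\;\;
C_{\mathrm{RE}}(S) - 1 = \frac{\mathbb{E}\lVert X\beta_{\mathrm{OLS}} - X\beta_S\rVert_2^2}{(n-p)\,\sigma^2},
\qquad
C_{\mathrm{PE}}(S) - 1 \approx \frac{\mathbb{E}\lVert X\beta_{\mathrm{OLS}} - X\beta_S\rVert_2^2}{p\,\sigma^2},
```

so an excess that is a negligible fraction of the residual can still dominate the prediction error by a factor of roughly (n - p)/p.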
And so this is, I guess, the main result and probably the one thing that you should sort of take most away from this talk. If we look at a comparison of these results, and so remember that p is kind of the low dimension of your least squares, r is the number of rows that you've sketched,
and n is kind of the big dimension or the number of samples that you have. If we look at, and these are three particular sketching schemes, so SR stands for kind of leverage scores with, I guess, that's with some sort of correction for the bias, SGP is kind of a Gaussian or sub-Gaussian type projection.
And then HAD is just a Hadamard projection. But for all of these sketching schemes, you see that for the worst case and for the residual error, we see that we get kind of a one plus a p over r term. So as long as r is a little bit bigger than p, which is like p log p relating to what Simone asked,
then you can see that you get the performance guarantees like what Drineas and Mahoney were getting earlier. But then if you look at the prediction efficiency, you get something that's very different, which is consistent with what my sort of intuition was initially, that you need r to be of order n to even get close
in terms of performance according to this metric, which is what statisticians typically care about. So in some sense, sketching isn't really helping you in any reasonable way if you kind of are looking at this prediction efficiency metric: even though it works well from a worst case perspective, it's not working well at all in terms of prediction efficiency.
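Paraphrasing the comparison on the slide in shorthand (constants and log factors dropped; this is my compressed summary, not the exact statements):

```latex
\text{for the SR, SGP and HAD schemes:}\qquad
C_{\mathrm{WC}}(S),\; C_{\mathrm{RE}}(S) \;\lesssim\; 1 + \frac{p}{r},
\qquad
C_{\mathrm{PE}}(S) \;\lesssim\; \frac{n}{r}.
```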
But do you have lower bounds at all? I'm getting to that on the last slide, yeah. That's a good question, yeah. Okay, so, yeah, as I was pointing out, these are just upper bounds for these sketching schemes. Are there lower bounds? Yeah, there are, and I'll talk about those.
Yep. Is it possible to remove the expectations in the ratios? So, and get high probability results? I suspect that it probably is. We haven't done that, but I, sorry, it made the analysis easier to not do that, but I think you can probably get away with that, yeah.
Do you think it could change the magnitude of the last bound? To like, you mean to like- Is it because it's a ratio of expectation and? Do you think it would change to something better than n over r? So I guess I'll talk about lower bounds at the end that suggests that you can't really do better than that.
Unless you have very specialized cases, and I'll talk about that on the next slide. But yeah, in general, you can't really do better than that. Again, so this is like just a different upper bound. And this is for a particular sketching scheme that I guess
is not traditionally used in the sketching literature. But it's trying to take advantage of the fact that you potentially have non-uniform leverage score distributions that you can exploit. And it also supports some of the simulation results I'll show. And it shows that you can potentially do better in terms
of the statistical metrics if you do this leverage score sketching without doing a kind of correction for bias. And so the k here is typically gonna be less than n, cuz it's taking the largest k leverage scores. And so you're gonna potentially do better using this sketching
scheme, but it's under this very, very strong assumption. Again, just to show you some simulations to sort of compare these different sketching schemes and also to compare the different metrics. So there's gonna be six different sketching schemes I use, the first four are related to sampling type approaches
that I talked about and the last couple are related to projection type approaches. So what we're doing, and this is not a very large data setting, but I've got some simulations in a larger data setting, but here we've got n equals 1,000 and p equals 50. And we vary r, r is varied as we'll see from around
maybe 50 to about 1,000, cuz we typically want r to be bigger than p but smaller than n. And we're gonna sample the rows of x according to a multivariate t distribution. And the reason for that is cuz I've talked a bit about leverage score sketching and leverage score sampling.
And so what this allows us to do is it allows us to generate x that has different leverage score distributions. So some x that has sort of very uniform leverage scores and some x that don't. So we're gonna use t distributions with three different numbers of degrees of freedom, 1, 2, and
10 degrees of freedom. And we're gonna compare these six different sampling or sketching schemes. So in particular, just to illustrate sort of why we're using these different distributions, this is a plot of what the leverage scores look like for these three different values.
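A minimal version of this part of the experiment, assuming rows drawn i.i.d. from a p-variate t distribution (the exact simulation settings in the talk may differ; the helper names are mine):

```python
import numpy as np

def mvt_rows(n, p, nu, rng):
    """Rows drawn i.i.d. from a p-variate t distribution with nu degrees of
    freedom (identity scale matrix): z / sqrt(chi2_nu / nu), z ~ N(0, I_p)."""
    z = rng.standard_normal((n, p))
    w = rng.chisquare(nu, size=n) / nu
    return z / np.sqrt(w)[:, None]

def leverage_scores(X):
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return np.sum(U**2, axis=1)

rng = np.random.default_rng(0)
n, p = 1_000, 50
for nu in (1, 2, 10):
    ell = leverage_scores(mvt_rows(n, p, nu, rng))
    # Heavier tails (small nu) concentrate leverage on a few rows;
    # nu = 10 is close to Gaussian, so the scores are nearly uniform (~ p/n).
    print(f"nu={nu:2d}: max leverage = {ell.max():.3f}, "
          f"uniform value p/n = {p/n:.3f}")
```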
So in some sense, the intuition for this is that if you have nu equals one, that's a very heavy tailed distribution. And so that means that you're gonna have, in some sense, more non-uniform leverage. So nu equals one has a very heavy tailed distribution cuz you're kind of dividing by something with fewer degrees of freedom. So in some sense, there's gonna
be a less uniform leverage score distribution, which is what you see here. And for nu equals ten, that's kind of getting close to a normal distribution, and so you have very uniform leverage scores. So we're just comparing the performance for different choices of x here.
And so if we look at some of these, sorry, that's a bit crowded. But if we look at these six different sketching schemes, I mean, one of the main points to take out of this, particularly relating to the results, is to look at the scale of this. So this is looking at the algorithmic or computational perspective.
If you look at the y-axis, you can see that everything's pretty close to one, or particularly for the first two cases, everything's kind of fairly close to one. So this supports the theory that you're kind of within one plus delta if you look at the kind of worst case type of performance that Drineas and Mahoney were dealing with.
And then on the other hand, if you look at the statistical perspective, which is that third metric that I talked about, the prediction efficiency, you can see that the y-axis here is substantially larger than one. Because in some sense, you're losing in terms of efficiency
by having less data to deal with. So this is kind of just showing the plots for when we look at these two different metrics, okay? So we can see that in general sampling schemes and projection schemes both seem to work reasonably well.
But in particular, the reason why I showed that second result was that that scheme seems to do the best empirically, even though there's not much in terms of theory for it. So this was kind of doing these leverage score sketching schemes without doing any kind of bias correction.
So what is the green curve? The green curve, that's what I guess Drineas and Mahoney typically do, which is leverage score sketching, but then they do a bias correction or a rescaling. So that is in some sense what's typically been proposed and has the most theoretical guarantees, but it doesn't do so well in practice.
The red is doing that but without doing the bias correction or rescaling. That hasn't really been proposed; the reason we tried it is cuz it's not proposed in the theory and there's very little theoretical work on it. So what is rescaling? So typically when you're sampling, say, and you take r samples,
you typically need to rescale things so that you, in some sense, correct for the fact that you've got fewer samples. So you might multiply each sampled row by a factor like square root of n over r, to make sure that everything is sort of accounted for.
For OLS, what do you multiply by? So if you were to take r samples uniformly, you would multiply by square root of n over r to make sure that everything kind of works out. But for leverage scores, remember, you have to do things a little bit differently, because every point has a different leverage score.
So instead of a single factor of square root of n over r, you divide each sampled row by the square root of r times its sampling probability, which is proportional to its leverage score. This is to get the correct expectation. Yes, exactly, yeah, that's exactly right. To get the correct expectation; but what's interesting is that according to the simulations, you do a lot better in many cases if you don't do this correction.
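To pin down the mechanics being described, here is a minimal construction of the sampling sketch with and without the rescaling; the 1/sqrt(r * p_i) factor is the standard unbiasing convention, which is what I take the "correction" to mean, and the heavy-tailed toy data and function names are my own.

```python
import numpy as np

def sample_sketch(X, y, r, probs, rescale, rng):
    """Row-sampling sketch: draw r row indices with the given probabilities.
    With rescale=True each sampled row is divided by sqrt(r * probs[i]), so that
    E[(SX)^T (SX)] = X^T X; with rescale=False the sampled rows are kept as-is
    (the 'no bias correction' variant discussed here)."""
    idx = rng.choice(len(y), size=r, replace=True, p=probs)
    w = 1.0 / np.sqrt(r * probs[idx]) if rescale else np.ones(r)
    return w[:, None] * X[idx], w * y[idx]

rng = np.random.default_rng(0)
n, p, r = 2_000, 20, 200
X = rng.standard_t(df=2, size=(n, p))          # heavy tails -> non-uniform leverage
y = X @ rng.standard_normal(p) + rng.standard_normal(n)

U, _, _ = np.linalg.svd(X, full_matrices=False)
lev = np.sum(U**2, axis=1)
probs = lev / lev.sum()                         # leverage-based sampling probabilities

beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
for rescale in (True, False):
    SX, Sy = sample_sketch(X, y, r, probs, rescale, rng)
    beta_s, *_ = np.linalg.lstsq(SX, Sy, rcond=None)
    print(f"rescale={rescale}: ||X(beta_s - beta_ols)|| = "
          f"{np.linalg.norm(X @ (beta_s - beta_ols)):.2f}")
```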
And so that's sort of why I included this, and that's why we had that second theoretical result. Because we wanted to get some theoretical justification for why we're doing so much better in terms of the simulations. But I mean, it has to be pointed out also that I think this works very well under this model. But under a model misspecification, it's sort of unclear how well this would really do.
Could you show the previous slide again? Okay, so for both metrics, green is much worse than- Yeah, yeah, exactly. So yeah, for both metrics, green's much worse than red, in general. But there's more theoretical justification for green than red.
And in general, doing sampling, as you might expect, is less stable than doing projections. Cuz with projections, you're kind of keeping all of the data, but rotating it, whereas for sampling, you're kind of losing a lot of your data. And as you'd expect, the performance is also worse for,
or more unstable for, the nu equals 1 case, which is the case where you have more non-uniformity in your leverage scores. Again, so getting back to Francis' earlier question. So this was actually work that was done subsequently by
Pilanci and Wainwright. So we suspected that there was a lower bound, but I guess when we were trying to prove it, we discovered that Pilanci and Wainwright had been working on this sort of independently. And they'd done a bunch of other things, too. But here is one of their results, though it wasn't written in this way exactly.
But in some sense, if you assume that your sketching matrix has this property, so this is for randomized sketching. So to be clear, this expectation is taken over S, over the sketching scheme that you're using. And if you kind of assume that it has this property, which more or less just says that you've got kind of r rows in your sketch, then the prediction
efficiency scales no better than n over r, basically. So this suggests that those upper bounds we had cannot really be improved by using a different sketching scheme, and it's not that our analysis was simply not tight. So this kind of really confirms what we thought,
that in fact this prediction efficiency, which is what statisticians are typically thinking about, is an intrinsically harder metric than what was kind of being looked at in the algorithmic or computational setting.
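For reference, the shape of the statement as it is presented here (the precise condition and constants are in Pilanci and Wainwright's paper; the 128 and the 1/2 are the loose constants that come up in the questions below, and the condition is the one on the expectation over S that, roughly speaking, says the sketch has only r effective rows):

```latex
\text{If } S \text{ satisfies the stated condition, then with probability at least } \tfrac12 \text{ over } S:\qquad
C_{\mathrm{PE}}(S) \;\ge\; \frac{n}{128\, r}.
```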
And this was proved kind of using pretty standard information theoretic lower bounding techniques. Okay, and so this condition's kind of satisfied for all of the Ss that we talked about, except for that, I guess, the red line here.
So for the red line, we saw that we actually got a better result for this particular case. And that's because it doesn't satisfy this, it does not kind of satisfy this condition. But all of the other lines that I plotted basically satisfy this condition. Okay, and so I guess the work that they did also kind of led.
So what it's basically saying is that doing one sketch of your data when you do unconstrained least squares, or even constrained least squares, is really not enough if you wanna do well in terms of the statistical metric. And so what they kind of proposed is that you do iterative sketching schemes, and this is computationally more intensive cuz you're taking kind of more and more sketches.
If you do iterative sketching, then you can potentially sort of get the original, like this one plus delta type bound that you want. But you have to do kind of iterative sketching. So one sketch sort of isn't enough when you're doing least squares and you're interested in this prediction efficiency.
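A minimal sketch of the iterative idea, under my own assumptions about the details: a fresh Gaussian sketch per iteration is used only to approximate the Hessian X^T X, while the exact gradient X^T (y - X beta) is kept. This is meant to convey the flavour of the iterative scheme mentioned here, not to reproduce Pilanci and Wainwright's exact algorithm.

```python
import numpy as np

def iterative_sketched_ls(X, y, r, n_iter=10, rng=None):
    """Refine beta with Newton-like steps that use a sketched Hessian
    (S X)^T (S X) ~= X^T X but the exact gradient X^T (y - X beta).
    n_iter = 1 corresponds to a single sketch; more iterations approach
    the full OLS solution (and hence its statistical accuracy)."""
    if rng is None:
        rng = np.random.default_rng()
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        S = rng.standard_normal((r, n)) / np.sqrt(r)   # fresh sketch each round
        SX = S @ X
        H = SX.T @ SX                                   # p x p sketched Hessian
        g = X.T @ (y - X @ beta)                        # exact gradient direction
        beta = beta + np.linalg.solve(H, g)
    return beta

rng = np.random.default_rng(0)
n, p, r = 10_000, 50, 500
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + rng.standard_normal(n)

beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
beta_it = iterative_sketched_ls(X, y, r, n_iter=10, rng=rng)
print("||beta_iter - beta_ols|| =", np.linalg.norm(beta_it - beta_ols))
```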
Is there any better bound for the 50% probability there? Sorry? You have, so that's holding the probability around 50%. Do you have any, can it be improved? Well, so you can make this 128 bigger and make that half smaller. So that's like, typically how this works is you kind of,
you get like, there's constants here. And so you get like a one minus delta and then it's like n over like that 128 depends on whatever that probability is. So if you wanna make that bigger, you can make this closer to one, I guess. So yeah, that's typically how, yeah. So I mean, these constants are very loose, both for
the upper and the lower bounds. So the upper bounds are kind of based on concentration of measure type arguments. And these lower bounds are using information theory techniques, but yeah, the constants are really, really loose, really, really loose here, okay? Any other questions?
All right, so just to sort of conclude and to talk about, like this is sort of a, we looked at a very simple problem. So there's a number of ways that this work can be extended. So what this analysis shows is that if we're interested in this, we had these two perspectives. The computational algorithmic perspective saying that doing
sketching once preserves most of the information you wanna preserve and does well. But then we came at it thinking that you're throwing away a lot of your data. And so this was resolved by just the fact that we're in some sense looking at different criteria and different models. And so in terms of prediction efficiency,
which is what you care about in a statistical setting, this is substantially more challenging than the standard kind of algorithmic or worst case setting. And then there was this other sort of interesting thing that came up, that when you do this leverage score sampling kind of without correcting for the bias to make the expectation right, it seems to do
well from a simulation performance perspective. But yeah, I mean, as pointed out, and relating to some of these questions, there's a number of ways that this work can be extended. So as pointed out, there's the online case. So this certainly only works in the batch mode case,
a lot of these ideas. But is there a way to kind of extend this to the online setting? So this is kind of the simplest problem that you can look at, this ordinary least squares problem. And so what we're looking at right now is, when you have low rank matrices or tensors, there's this notion of doing
sketching to come up with an approximation to the SVD. And this is called the CUR decomposition. And so right now what we're doing is an analysis for that. Then, and there's also a perspective of sketching that's related to regularization, implicit regularization.
And we're exploring this connection. And also looking at whether sketching can, in some sense, recover low rank tensor structure. So in general, tensors are more difficult to deal with than vectors or matrices. And we're seeing if sketching can potentially give you kind of computational gains there if you do the sketching in a
reasonably sort of clever and efficient way. All right, thanks very much. Okay, so we have time for questions. Is there anyone? Yes.
And, so the leverage score sampling. So do you have the results again? Yeah, sure, yeah. The simulations? Yeah, the simulations. And so you had a fast version of one of those leverage. Yeah, yeah, yeah, exactly. So that's this SSHR, which is, sorry, the pink.
Yeah. Is this like the analog of the SR one, but fast? Yeah. Do you have an analog of the SNR one, which is fast? Yeah, so that also does better, yeah. Yeah, so that does better than the SR one, which is fast. So the SNR one, which is fast, does better than the-
Fast means you do basically the leverage scores approximately? Yeah, exactly. So we use that scheme to compute the leverage scores approximately. And so you can see you do worse than if you knew them exactly, but- So that scheme does work? I've read the paper, but it seemed like you get a proof rather than how to implement it, or how to check it; is it easy to implement?
or how to check, is this what implement is? Yeah, this was implemented, yeah, it works reasonably well. Which color is this? That's pink, yeah. Pink, so you compare pink to green? Yeah, yeah, that's right, so pink, yeah, pink to green, basically. So pink is better than green? Yeah, yeah, and that's probably because of the fact that you're doing some kind of regularization
by doing the approximate, and that actually does better, which is interesting, yeah. Could you say the same thing about scaling versus not scaling? The fact that you're not scaling probably acts as a regularization and means that it works better. Yeah, that's, yeah. Totally counterintuitive, right? Yeah, no, I agree, I think it is, because it is a regularization type argument.
Can you write it down in a certain way? So this was something we tried to do, and I guess that was what this theorem was kind of trying to get at, but we needed to assume something very specific, and so we're trying to do that in a more formal way. So with non-uniform leverage scores, you divide by the leverage score, I guess?
Yeah, yeah. This may introduce a lot of variance, right? Exactly, that's exactly right, so that's exactly why it does terribly when you have very non-uniform leverage scores, because you're dividing by something that's potentially close to zero. You're doing like importance weights. Yes, exactly. Importance weights, when the weights, when the distributions are quite
mismatched, don't they blow up? Yeah, it does, it's terrible, yeah, yeah. So yeah, and I mean, one way that other people try to overcome that is they'll do like a linear combination of leverage score plus uniform, but that seems to still do worse than just not doing any rescaling at all, or re-weighting at all, yeah.
And so there's another technique which is random rotation, which makes the leverage scores uniform, and then you sample uniformly. Yeah, yeah, that's right. Is that completely intractable because there's no way to speed up the random rotation step? As I understand it, that's true, yeah, that doing the random rotation step, yeah. That's okay. Yeah, that's, as I understand it, yeah.
So could you do a sparse random rotation? Well, I guess not. It's like with sensing matrices, there's a limit on how sparse you can make them; I guess it might be the same issue here. And so the idea of random rotation is that you do the random rotation, and then…
If you do the random rotation, then the leverage scores are supposed to be uniform, so you don't care about computing them anymore. And you don't need exact random rotations, you can do like very simple random ones, I think. Yeah, I, yeah, I don't know that there's a fast way to do that, but yeah, that's a good point, yeah. Okay, is there any other question?
Do you think the lower bound is breakable, and how fast do you think it's possible to go, maybe under some assumptions? Ah, so when you say breakable, you mean, sorry, like… You said that the other one is- Well, it's not mine, it's not mine, so it's, yeah, yeah, so it's alright,
but like, I mean, so when you say, yeah, yeah. I mean, you said that without normalization, you do not satisfy the assumptions. Yeah, so yeah, like, yes. So you mean you can go below? Yeah, yeah, so certainly, like, yeah, the lower bound is certainly under this assumption, so yeah, you can potentially, yeah, so how far you can go, like I guess, for example,
like if you have a leverage score distribution which is very, like, non-uniform, then you can see you can get very close to, you know, you can do much, much better than the lower bound, but this, again, is under a very restrictive assumption,
so basically, like, what it's saying is if all of your leverage score mass is in, like, p of the samples, then you can kind of get those p samples and that takes care of everything. But yeah, is that often satisfied? So that's kind of, yeah, it depends, I think, on what your x matrix is like, but for a general x,
I don't know if you can do much better than the lower bound. Okay, another question? Yes? What's the quick intuition why SNR doesn't satisfy the assumption? So the intuition is that essentially, like, if you have, for example, like,
if you think about an example where p rows have, like, a leverage score of one and the rest are all zero, you could just take those p rows, and it's like the rest of the rows are zero, and so essentially this S S transpose matrix would have a, you know, very different condition
number, because what this condition is basically saying is that you're effectively getting an r over n fraction of the leverage score weight, and so if you don't do rescaling, you're potentially getting way more or way less than that.