1/3 The cutoff phenomenon and rate of escape for Markov chains


Formal Metadata

Title
1/3 The cutoff phenomenon and rate of escape for Markov chains
Number of Parts
27
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Transcript: English (auto-generated)
Good afternoon. I'm actually going to survey two topics in Markov chains,
two of my favorite topics. One is a subtopic of mixing, namely the cutoff phenomenon. And the other topic, which is not up here, but we'll see it later, is quite different. It's the rate of escape of Markov chains. And so that's quite demanding, given
that we have Anna here, who is one of the greatest experts on this topic. But I'll still present it. But today, I want to start on cutoff. And some of this will be useful also for those coming to Eyal Lubetsky's lectures that start tomorrow, since he will be analyzing some examples of cutoff on random and Ramanujan graphs. And these notes were prepared together with David Levin. And we have a book, Markov Chains and Mixing Times,
also with Elizabeth Wilmer. And there is a second edition coming out. And so you can find that book linked from my page or on David's page. It should be easy to find. Today, much of what I'll talk about is there,
since I'm not assuming that you read it or remember it. But we'll see more advanced material as we move through the course. And also, throughout, I'll be giving some problems that are maybe a little more advanced.
And so those who already know the elementary material can spend the time solving the problems. And we'll discuss the solutions in the exercise session at the end. So before defining, I'm just going to give a quote from Diaconis 96.
So Aldous, Diaconis, and Shahshahani were the initiators of this phenomenon in the early 80s. And until the last decade, the way this phenomenon was
understood is by really specific examples, spectral analysis of eigenvalues and eigenvectors. And despite the overall title, we're going to focus mostly, in my lectures, mostly on the non-spectral approaches. But we'll connect with the spectrum later. So just to fix some notation, so we're
going to look at ergodic chains; this just means for us irreducible and aperiodic finite Markov chains. So just from every state, you can reach every state. And it's also aperiodic, so you can think of it as: there is some power of the matrix which is strictly positive. So there's a unique stationary distribution, which is denoted by pi. So pi equals pi times P; it's a row vector.
Now, we measure distance to stationarity in several ways. But the most important one will be the total variation distance, so this total variation norm, which has multiple equivalent definitions. So it's half the L1 distance. And it's also, so the total variation distance
between two measures is just the maximum over all sets in the state space of the probability of one measure minus the other measure. So in other words, what we want is to measure d of t, the distances time t is the maximum over starting states,
maximum over subsets of the state space of the probability of being in A minus the stationary probability of A. So we know in this setting of irreducible aperiodic chains that this tends to 0. And the goal is to quantify at what rate it tends to 0.
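As an aside (not from the lecture), the two equivalent definitions of total variation distance, half the L1 distance and the maximum over events, can be checked numerically. A minimal Python sketch with a made-up three-state example:

```python
# Two equivalent definitions of total variation distance between
# probability vectors mu and nu: half the L1 distance, and the maximum
# over events A of mu(A) - nu(A), attained at A = {x : mu(x) > nu(x)}.
# The three-state example below is made up for illustration.

def tv_half_l1(mu, nu):
    return 0.5 * sum(abs(m - v) for m, v in zip(mu, nu))

def tv_best_event(mu, nu):
    # The maximizing event is the set of states where mu exceeds nu.
    return sum(m - v for m, v in zip(mu, nu) if m > v)

mu = [0.5, 0.3, 0.2]
nu = [0.2, 0.3, 0.5]
assert abs(tv_half_l1(mu, nu) - tv_best_event(mu, nu)) < 1e-12
```

Both definitions give 0.3 here, with the maximizing event being the single state where mu exceeds nu.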
But the classical point of view on this was: you fix a chain of n states, you drive time to infinity, and you see how fast this goes to 0. And then this is really determined by the top non-trivial eigenvalue.
But really the modern point of view initiated by Aldous and Diaconis, but also closely related to statistical physics, is different. Instead of driving the distance to 0, we will have some goal. So we want the distance to be less than a quarter or one
tenth or 1%, some fixed epsilon. And we'll ask how long to get there, but with the sequence of chains. So we will not consider one chain on 10 states or 100 states, but we'll consider a sequence of growing chains. All this will become clearer with the examples.
A few other distances will come up, and I'll recall them. So instead of the L1 distance, one can also look at the Lp distance. And the right normalization is to look at the density of the distribution at time t. So we look at P^t(x, y) and divide it by pi of y; pi is strictly positive.
And we look at the Lp distance between 1 and this ratio, Lp with respect to pi. So all of these are useful, but the most important are 1, 2, and infinity. So for p equal to 1, you just get the L1 distance
twice the total variation. The L-infinity distance will just measure for us the maximum deviation of the density from 1. Now here is a somewhat more subtle definition, called separation, due to Aldous and Diaconis.
And so at first, this looks just like L infinity, but it's not. So notice there is no absolute value here. So the separation distance measures
to what extent the distribution is blanketing the space. So we want to control states with too low a probability, but we don't worry if some state has too high a probability. And it turns out that for reversible chains this is very, very close to the total variation distance,
while the L-infinity distance can be much larger and decay more slowly. So we'll discuss some examples of this. Finally, something very close to the total variation distance to pi is the total variation distance between two starting states.
So this is kind of measuring how much we forget the initial state. So if we start from x and from y, we want to see at time t how these distributions differ in total variation. And d bar is very close to d.
So remember, d of t was the distance to pi. d bar is at least d, but at most twice d. So I'll reiterate this later. But maybe I'll write this on the board.
So both of these follow from triangle inequality. So this one is because if I measure the distance from p
t x to p t y, it can just go through pi and use the triangle inequality. While in the other direction, if I take d bar, so this definition: take the measures P^t(y, ·) and average them over y with weights given by pi. Each of these is a probability distribution starting at y. I average them with weights pi of y. What measure do I get?
Pi. So just stationarity tells us that. So just from convexity of the norm, or again, triangle inequality, you see that d bar will be at least d. But d bar has some, so though d is usually what we care
about, d bar has some nicer properties. In particular, it's sub-multiplicative. So this is an easy exercise. Let me say one interpretation of, so what is d bar of t?
So d bar of t is the probability of failing to match the two chains under the best coupling. So if you look at the chain starting from x, starting from y, you have two measures, p t x dot and p t y
dot, and you look at the coupling of them. So some joint distribution that projects to both of these measures. And you want to couple as well as possible, meaning that the measure of the diagonal will be as large as possible. And the probability of that coupling failing,
that is the measure of the complement of the diagonal, will be d bar. And if you think of that interpretation, it's easy to see this inequality. So maybe for those who are not familiar, this is the first exercise you can do. And all these initial exercises, you can also look up in the Markov chain book.
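The inequalities d(t) <= dbar(t) <= 2 d(t) from the board can also be sanity-checked by brute force. A Python sketch on a hypothetical lazy walk on a triangle (not an example from the lecture; pi is uniform there):

```python
# A brute-force check of d(t) <= dbar(t) <= 2 d(t) on a tiny reversible
# chain: a hypothetical lazy walk on a triangle, where pi is uniform.

def mat_mul(A, B):
    m = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(m)]
            for i in range(m)]

def tv(mu, nu):
    return 0.5 * sum(abs(a - b) for a, b in zip(mu, nu))

P = [[0.50, 0.25, 0.25],
     [0.25, 0.50, 0.25],
     [0.25, 0.25, 0.50]]
pi = [1 / 3] * 3

Pt = [row[:] for row in P]                 # P^t, starting at t = 1
for t in range(1, 6):
    d = max(tv(row, pi) for row in Pt)                       # d(t)
    dbar = max(tv(Pt[x], Pt[y]) for x in range(3) for y in range(3))
    assert d <= dbar + 1e-12 and dbar <= 2 * d + 1e-12
    Pt = mat_mul(Pt, P)
```

For this chain one can even see the sandwich explicitly: d(t) = 2|b - 1/3| and dbar(t) = 3|b - 1/3|, where b is an off-diagonal entry of P^t.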
Now, separation distance, again, it's a kind of pointwise distance, but just from below. And so it's very easy to see that d of t is bounded above by the separation distance. Again, just by averaging.
But something which is maybe more surprising is that the separation distance at time 2t is bounded by 4 times the total variation distance at time t. So this is only true in the reversible case.
So again, total variation tells us one thing, and separation seems to tell us something different, because it looks at all locations y. And here we see that roughly they are equivalent. And that's important that it uses reversibility.
So the rough idea is that if we want to go from x to y in 2t steps, we can think of going forward from x and backward from y. And if we are mixed in total variation, these two measures have to be close, so we can kind of interpolate and get this fact.
So again, this you can find either in the Aldous-Diaconis paper, where it's first proved, or in the Aldous-Fill unpublished notes, or in our Markov chain book.
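Both comparisons, d(t) <= s(t) and the reversible-case bound s(2t) <= 4 d(t), can be checked numerically. A sketch on the same kind of toy reversible chain (my choice of example, not the lecture's):

```python
# Separation distance s(t) = max over x, y of 1 - P^t(x,y)/pi(y): no
# absolute value, so it only penalizes states with too *low* probability.
# Checked on a small reversible chain: d(t) <= s(t), and s(2t) <= 4 d(t).

def mat_mul(A, B):
    m = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(m)]
            for i in range(m)]

P = [[0.50, 0.25, 0.25],
     [0.25, 0.50, 0.25],
     [0.25, 0.25, 0.50]]
pi = [1 / 3] * 3

def power(P, t):
    Q = P
    for _ in range(t - 1):
        Q = mat_mul(Q, P)
    return Q

def d(t):
    Pt = power(P, t)
    return max(0.5 * sum(abs(p - q) for p, q in zip(row, pi)) for row in Pt)

def sep(t):
    Pt = power(P, t)
    return max(1 - Pt[x][y] / pi[y] for x in range(3) for y in range(3))

for t in range(1, 6):
    assert d(t) <= sep(t) + 1e-12          # averaging bound
    assert sep(2 * t) <= 4 * d(t) + 1e-12  # reversible-case comparison
```

Here sep(t) = 4^(-t) exactly, so the bound at time 2t holds with lots of room.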
So we're going to very soon go to examples, but I want to emphasize in the examples, we're not going to focus just on one fixed chain and drive time to infinity. We're really going to think of a family of chains. So the canonical examples will be, the simplest examples will be cycles and hypercubes.
Then we'll see other examples as well. So when we have a sequence of chains, sometimes we'll emphasize this by putting t mix n epsilon, meaning the mixing time in the n-th chain. So that's the first time where the total variation distance in that chain goes below epsilon.
Now, because of the sub-multiplicativity, the value of epsilon is sometimes de-emphasized, and so we will write t mix n to just mean we plug in epsilon equals a quarter. So again, the point of view we'll
discuss, following Aldous-Diaconis and others, is how does this scale? So you fix epsilon and let n go to infinity. So for instance, one of the simplest examples of an aperiodic irreducible chain is lazy random walk on a cycle, right?
So you have n states, and we want to understand the mixing time here. So this is a very simple case. Again, to make it aperiodic, we make it lazy. So with probability one quarter you walk right, with probability one quarter you walk left, and with probability half you stay in place. So the lazy chain means that with probability half it stays where it is, okay? That is a simple device to ensure aperiodicity. So once we look at the lazy random walk on the cycle, it's quite easy to see in many ways
that the order of the mixing time is n squared. So one way: we all know the central limit theorem and the local central limit theorem for random walk on Z, and random walk on the cycle is just that random walk on Z taken mod n. And because the Gaussian distribution is roughly flat over an interval of the standard deviation, that will be one way to see it. And we will see better ways shortly. So the order of the mixing time will be n squared.
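The n squared scaling can also be seen by brute force for small n: compute d(t) from matrix powers and find when it first drops below a quarter. A Python sketch (the specific sizes 8 and 16 are my choice, small enough for exact computation):

```python
# The n^2 scaling of the mixing time for the lazy walk on the n-cycle,
# checked by brute force: compute d(t) = max_x TV(P^t(x, .), pi) from
# matrix powers and find t_mix(1/4). A small-n sketch, not a proof.

def lazy_cycle(n):
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        P[i][i] = 0.5                      # lazy: stay put w.p. 1/2
        P[i][(i + 1) % n] += 0.25          # step right w.p. 1/4
        P[i][(i - 1) % n] += 0.25          # step left w.p. 1/4
    return P

def t_mix(P, eps=0.25):
    n = len(P)
    pi = [1.0 / n] * n                     # uniform is stationary here
    Pt = [row[:] for row in P]
    for t in range(1, 10000):
        d = max(0.5 * sum(abs(p - q) for p, q in zip(row, pi)) for row in Pt)
        if d <= eps:
            return t
        Pt = [[sum(Pt[i][k] * P[k][j] for k in range(n)) for j in range(n)]
              for i in range(n)]
    return None

# Doubling n should roughly quadruple the mixing time.
t8, t16 = t_mix(lazy_cycle(8)), t_mix(lazy_cycle(16))
assert 3.0 <= t16 / t8 <= 5.0
```

The ratio lands near 4, consistent with t_mix of order n squared.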
But kind of the question we will want to explore, which is easy in this case, but we'll explore it in other chains: how does this depend on epsilon? So fix epsilon and let n tend to infinity. Will the mixing time be asymptotic to a constant times n squared
that depends on epsilon or doesn't depend on epsilon? So, and that kind of answer will differ from one class of chain to the other. So we'll see for the cycle actually, you cannot have such a relation. So this constant will have to depend on epsilon.
While for other examples, which will have cutoff, this constant won't depend on epsilon. So one key tool used in this business is coupling. So again, given starting states x and y (this is a little dry with definitions; it will come to life when we go to the examples), we'll have xt, yt, a joint Markov chain. So a Markov chain evolving in the product space so that when we project on the first coordinate,
it will have this transition matrix p, but started at x. The second coordinate will have the same transition matrix, but started at y. But these two coordinates are not independent. So they could have arbitrary dependence between them. And suppose that this Markov chain in the product space
has the property that after some random time tau, xt and yt coincide. So usually this tau will be a stopping time, but sometimes in an enlarged filtration, so in an enlarged probability space. But suppose that the chain is arranged
so that xt equals yt after some random time. Then we can bound the total variation distance at time t very simply by the probability that tau is still bigger than t, because on the event that tau is less than t,
then xt and yt agree. That's how tau is defined. So it's an elementary calculation written here, just using the definition of the L1 norm
that this total variation distance is bounded by the probability that tau, this meeting time, exceeds t. And so d of t, which is our goal, the total variation distance to stationarity is bounded by d bar of t, which is the maximum over xy of this quantity.
So as I said, that's bounded by 2d of t, but this inequality means that we can bound d of t using the tails of this meeting time if we can get these tails analyzed uniformly
in the starting points x and y. So finally, let's see an example where this works. This is the most basic example in mixing, which is the lazy random walk on the hypercube. So the hypercube is just {0,1}^n, right? This would be the two-dimensional one. So every vertex has n neighbors, corresponding to flipping one of the n bits, but we're going to make it lazy.
So one move of the chain corresponds to picking a coordinate uniformly at random and not necessarily flipping it, but just randomizing that coordinate. So replacing that coordinate, the bit in that coordinate by a uniform random bit. So this suggests a very natural coupling.
If we start at two states, x and y, how are we going to couple the chains starting from x and y? We just choose a coordinate, and in both x and y we randomize that coordinate. Well, let's do it together.
So we chose the coordinate, we chose this one, and now we replace whatever is sitting here in x and y by a new random bit and the same bit in x and y. Okay, any questions? So this kind of coupling relates our problem
to the classical coupon collector problem because each time we're picking a coordinate at random and by the time we picked all the coordinates, which corresponds to collecting all the coupons, the two chains have coalesced, they've reached the same state, and they will stay in the same state from there on.
So we have a natural random time tau here, which is the coupon collector time, the time where we have picked all the coordinates. At that time, the two chains must have met. So what is that time?
What is the asymptotic of that coupon collector time? Anyone awake? Yes? n log n, right? So this means that, moreover, by time a little more than n log n, it's very, very likely that all the coupons have been collected. So let's just go through this.
So the probability that tau is bigger than t is bounded above by a sum over all the coordinates. So here a crude union bound would be good enough for us. So it's bounded above by this sum, which is n times (1 minus 1 over n) to the t, right?
And so if you take t, which is n log n plus cn, already this crude bound suffices to tell you that d of t is bounded by e to the minus c, right? If you just plug in this value of t. So think of n log n plus cn, where c is a large constant,
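A quick simulation makes the e^(-c) bound concrete. The sketch below models the coupling time as the coupon-collector time over coordinates; the values of n, c, the seed, and the trial count are arbitrary choices, not from the lecture:

```python
import math, random

# The coordinate-randomizing coupling on the hypercube coalesces once
# every coordinate has been picked at least once, so the coupling time tau
# is the coupon-collector time. The union bound gives
# P(tau > n log n + c n) <= e^{-c}. A simulation sketch.

def coupling_time(n, rng):
    # Time until every coordinate has been refreshed at least once.
    remaining = set(range(n))
    t = 0
    while remaining:
        t += 1
        remaining.discard(rng.randrange(n))
    return t

rng = random.Random(0)
n, c, trials = 50, 1.0, 4000
t_star = n * math.log(n) + c * n
fail = sum(coupling_time(n, rng) > t_star for _ in range(trials)) / trials
assert fail <= math.exp(-c)   # consistent with the e^{-c} bound
```

The empirical failure probability comes out somewhat below e^(-1), as the crude union bound predicts.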
then at that time the total variation distance is very small and you see that the epsilon here comes in in the kind of correction term, not in the first term. So this suggests that maybe the mixing time is n log n because this looks like the best possible coupling,
but it's not. So the mixing time is actually asymptotic to n log n over 2. So that's kind of the first surprise in this business. However, we'll see later, this coupling is significant.
If you want to look at the separation time, the time where the separation distance goes below epsilon, then you'll get the n log n coming in here, rather than n log n over 2.
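One can actually watch this cutoff at (1/2) n log n numerically for moderate n by tracking only the Hamming weight of the state (the one-dimensional projection discussed later in the lecture). A sketch; n = 500 is an arbitrary choice:

```python
import math

# Cutoff for the coordinate-randomizing walk on the hypercube, seen
# numerically via the Hamming-weight projection: the total variation
# distance stays near 1 well before (1/2) n log n and is tiny well after,
# with a window of order n.

n = 500

def weight_dist(steps):
    w = [0.0] * (n + 1)
    w[n] = 1.0                              # start from the all-ones state
    for _ in range(steps):
        v = [0.0] * (n + 1)
        for k, p in enumerate(w):
            v[k] += p * 0.5                 # lazy half
            if k > 0:
                v[k - 1] += p * k / (2 * n)
            if k < n:
                v[k + 1] += p * (n - k) / (2 * n)
        w = v
    return w

pi_w = [math.comb(n, k) / 2 ** n for k in range(n + 1)]  # Binomial(n, 1/2)

def d(t):
    return 0.5 * sum(abs(p - q) for p, q in zip(weight_dist(t), pi_w))

tc = round(0.5 * n * math.log(n))           # the cutoff location
d_before = d(tc - round(2.5 * n))
d_after = d(tc + round(2.5 * n))
assert d_before > 0.9 and d_after < 0.1
```

So the distance really does fall from near 1 to near 0 within a window of order n around (1/2) n log n.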
So on the hypercube, one can compute the transition probability, write it explicitly using the eigenvalues and the eigenfunctions. They're all very easy to write down explicitly. And then computing p to the t is just
a matter of taking a matrix to a high power. If we know the eigenvalues and eigenvectors, we can do that. And this will give sharp estimates in this case. But I mean, we'll write down this calculation soon. But the problem with that approach
is it requires all this detailed knowledge, which is harder to get in more involved examples. It will be useful to look at the simple example of the hypercube from several perspectives and to try to get the sharp result in a non-spectral way, which will then generalize to other examples
like Glauber Dynamics for the easing model. So I said the naive coupling, which gave us n e to the minus t over n to this bound, will then will yield the n log n bound for the mixing time.
But we'll see in the sharper analysis that instead of using this coupling all the way to the end, you really only want to use it until the distance is root n and then switch to a different way of bounding.
So OK, so let me already illustrate how we improve this n log n bound. So again, this is a very special case. Lots of symmetry, so many alternative things we can use.
So one way to use the symmetry is to look at the Hamming weight of a configuration, so just the total number of ones, the sum of the coordinates, so call that wt. Also observe that because of the symmetry of the hypercube, if we want to understand the mixing time,
it doesn't matter what initial state we use. So it will be convenient to fix, say, the all-ones state as our initial state; it's easy to check that the distance to stationarity at time t doesn't depend on the initial state. So we're going to fix the initial state to be all ones. And here is one useful observation
exploiting the symmetry that the total variation distance between xt, so xt is the lazy walk on the hypercube, wt is the Hamming weight at time t. So the Hamming weight, as we'll see,
is itself a Markov chain, but just a one-dimensional Markov chain on the integers. And the total variation distance between the hyperstate in the hypercube and pi, pi is just uniform here, pi is uniform measure on the hypercube, is the same as the total variation distance between wt and its stationary distribution.
So its stationary distribution is just obtained from the uniform distribution on the hypercube. So pi w of k is just the number of configurations that have k ones, multiplied by the weight of each such configuration. Now, why is this equality true? Because when you're computing the distance here to pi,
so what we're supposed to do is look at half the L1 distance. When we're computing the L1 distance, we look at the probability that xt equals y, minus pi of y. So we're supposed to sum such terms and divide by 2.
But observe that, of course, pi of y is just uniform. It doesn't depend on y. But when I start at the all-ones state, this probability just depends on the Hamming weight of y. All states y with the same Hamming weight have the same probability. So because of that, when I sum this over all y,
I can just sum it level by level. And if you look at the total contribution from vertices y with weight k, and you sum over that, you'll just get the difference between probability that wt equals k and pi w of k.
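This projection identity is easy to verify by brute force for small n. A Python sketch (n = 6 and the ten time steps are arbitrary choices):

```python
import math
from itertools import product

# A brute-force check of the projection identity for small n: started from
# the all-ones state, the TV distance of the hypercube chain to uniform
# equals the TV distance of its Hamming-weight chain to Binomial(n, 1/2).

n = 6
states = list(product([0, 1], repeat=n))

def cube_step(mu):
    # One move: pick a uniform coordinate and replace it by a uniform bit.
    nu = {s: 0.0 for s in states}
    for s, p in mu.items():
        for i in range(n):
            for b in (0, 1):
                nu[s[:i] + (b,) + s[i + 1:]] += p / (2 * n)
    return nu

def weight_step(w):
    v = [0.0] * (n + 1)
    for k, p in enumerate(w):
        v[k] += p * 0.5
        if k > 0:
            v[k - 1] += p * k / (2 * n)
        if k < n:
            v[k + 1] += p * (n - k) / (2 * n)
    return v

mu = {s: 0.0 for s in states}; mu[(1,) * n] = 1.0   # start at all ones
w = [0.0] * (n + 1); w[n] = 1.0
pi_cube = 1 / 2 ** n
pi_w = [math.comb(n, k) / 2 ** n for k in range(n + 1)]

for t in range(10):
    mu, w = cube_step(mu), weight_step(w)
    tv_cube = 0.5 * sum(abs(p - pi_cube) for p in mu.values())
    tv_w = 0.5 * sum(abs(p - q) for p, q in zip(w, pi_w))
    assert abs(tv_cube - tv_w) < 1e-9
```

The two distances agree at every step, exactly as the level-by-level summation argument predicts.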
But this simplifies the problem, reduces it to understanding this one-dimensional chain. Now, the coupling, when we looked at the hypercube, the coupling we were using, which is also called the greedy coupling, looks like the best you can do.
But once you make this observation that you reduce to the birth and death chain: this chain wt either stays in place with probability half, or goes down with probability wt over 2n. Because if you have k ones, then for the probability of going down, you have to be non-lazy, so that's a half, and then it's k over n to pick one of the ones and reduce the weight by one. And it goes up with probability (n minus wt) over 2n. This is known as a birth and death chain, because it either stays in place, goes down, or goes up. And now for this chain, we see other possibilities of coupling.
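These transition probabilities can be written down as a small matrix and checked against the binomial stationary distribution mentioned above. A quick sketch for an arbitrarily chosen small n:

```python
import math

# The weight chain as a transition matrix on {0, ..., n}: down with
# probability k/(2n), up with probability (n-k)/(2n), hold with 1/2.
# Its stationary distribution should be Binomial(n, 1/2).

n = 6
P = [[0.0] * (n + 1) for _ in range(n + 1)]
for k in range(n + 1):
    if k > 0:
        P[k][k - 1] = k / (2 * n)          # pick one of the k ones, set it to 0
    if k < n:
        P[k][k + 1] = (n - k) / (2 * n)    # pick a zero, set it to 1
    P[k][k] = 1 - sum(P[k])                # laziness: this is exactly 1/2

pi = [math.comb(n, k) / 2 ** n for k in range(n + 1)]    # Binomial(n, 1/2)
piP = [sum(pi[k] * P[k][j] for k in range(n + 1)) for j in range(n + 1)]
assert all(abs(a - b) < 1e-12 for a, b in zip(pi, piP))  # pi P = pi
```

Detailed balance holds here too, since C(n,k)(n-k) = C(n,k+1)(k+1), which is why the binomial weights are stationary.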
So this chain is just a projection of the hypercube chain. And because in the hypercube, we saw that we had the bound that d of t is bounded by n e to the minus t over n. So take t, which is half n log n. I'm going to omit writing integer parts; t should be an integer, but you can add the integer parts yourself. So when we plug in the half n log n, I don't want to use the bound for d of t. I really want to use a bound for the expected Hamming weight.
So let me see if I have that down. So this bound is correct, but we want to do something better. So again, here is a recap of the evolution of wt.
And so if we look at the expected difference between w(t+1) and wt: going back to the previous description of the chain, it stays in place so its value doesn't change, and it's going to go down with this probability and up with this probability. So if you then use this to compute the expectation, you'll get that the expectation is just minus wt divided by n. And this rate of decay actually corresponds to the coupling we discussed before. But it's just elementary from the previous equation.
So again, this will give something a little better than this total variation distance. So that the expectation of wt is bounded by ne to the minus t over n, because w0 was n. But now we have something else we can use.
So if we're just using this inequality and we want to get this to be below 1, we would still have to wait for n log n. But instead, we're going to just wait till this goes to root n. So that will be n half n log n.
So in half n log n steps, this distance will be root n. So, okay, I'm doing something I should explain.
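A quick numeric check of this arithmetic (a sketch of my own, not part of the lecture): at t equal to half n log n the contraction bound n e^(-t/n) is exactly root n, while pushing the bound itself below 1 would need t of order n log n.

```python
import math

def contraction_bound(n, t):
    """The coupling bound: expected distance at time t is <= n * e^(-t/n)."""
    return n * math.exp(-t / n)

n = 10_000
t = 0.5 * n * math.log(n)
# at t = (1/2) n log n the bound equals exactly sqrt(n) ...
assert abs(contraction_bound(n, t) - math.sqrt(n)) < 1e-6
# ... while pushing the bound itself below 1 needs t of order n log n
assert abs(contraction_bound(n, n * math.log(n)) - 1.0) < 1e-9
```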
So this is not the W_t itself; I'm sorry, there is a switch I need to make. You can either switch to plus-minus-one coordinates, or look at the chain starting at all ones and the chain starting at all zeros. So this is maybe the right way to say it: think of the distance between the Hamming weight chain started at all ones and the one started at all zeros. Using the coupling we discussed before, by time half n log n we're going to get the expected distance to be bounded by root n. So we started with distance n, and we're going to get the expected distance
bounded by root n at this time t. Now once the distance is root n, continuing to use this contraction is too slow. At that point, what we want to use is the diffusive fluctuation of the walk. The point is, okay, it's not explained well in the slides, so let me explain this on the board.
So let's write these two variables, W_t started at all ones and W_t started at all zeros. Once we know their distance is at most root n, let's look at a different coupling, which is not induced directly by the hypercube, but lives only in this birth-and-death chain. In this coupling, we're going to first toss a fair coin to decide which of the two is going to move. And then the one that moves
will move with the correct probability. So this will move either up or down and this one will stay in place. So again, we toss a fair coin, one of them stays in place, the other one moves. So, and the probability it will move.
So suppose the moving particle is at k; then it moves to k plus one with probability (n minus k) over n, and to k minus one with probability k over n, all right? So it's the non-lazy version of this chain,
but this is only with probability half. With probability half, this will stay in place and the other one will move. So now look at the difference between these two locations or think of them as particles. What you can check is that this difference
is going to be dominated by a simple random walk. So, and hence the time until these two coalesce will be dominated by the time that the random walk
that starts at root n reaches zero. Now if you have a simple random walk on all of Z started at root n, of course the expected time to reach zero is infinite; but by time of order n, the probability that you have reached zero is very large. So again, the key thing here is the drift: if the W_t started from ones is larger than the W_t started from zeros, then its drift to the left is stronger than that of the chain started from zeros. So the difference between them actually has a negative drift.
So it's going to be, so formally the difference between them is a super martingale. And so the time until they couple is going to be order n. And so we will get, we can get the coupling
of the birth-and-death chains at time half n log n plus order n. I'll say more about this example, from other methods, later. But any questions about this?
Yes. So you say it's dominated, but it actually has negative drift at all times; can we not just use the minimum absolute value of the negative drift? Well, the problem is that when the two chains get very close, the drift is very weak. The drift is what we were using before, and near coalescence you really get very little from it. The idea is to actually use the diffusive nature: we basically replace this drift by zero drift and just say that, because the variance is positive each time one of them has a chance to move, two particles that start at distance root n are going to coalesce in order n time. So in this coupling, the difference is actually just one process, a supermartingale with non-trivial variance, and just from the reflection principle you can easily check that. But if you take the minimum absolute value of the drift? Well, the minimum absolute value of the drift is very, very small: if they're adjacent, it's one over n, and that is too weak. What you use is the fact that when they're far away, the drift is strong. So until they get to distance root n, we use the drift.
After they get to distance root n, the diffusion is the dominant force bringing them together rather than the drift. And that's something that's harder to exploit originally in the hypercube. That's why we project to the birth and death chain and then exploit it there. Any other question?
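The two-stage coupling just described can be simulated directly; here is a hedged Python sketch (function name and parameters are mine) of the second stage, where a fair coin picks which of the two birth-and-death particles takes a non-lazy step:

```python
import math
import random

def coalescence_time(n, a, b, rng):
    """Coupling of two birth-and-death chains on {0, ..., n}: a fair coin
    decides which particle moves; the mover takes one non-lazy step
    (up with prob (n - k)/n, down with prob k/n). Returns steps to meet."""
    t = 0
    while a != b:
        t += 1
        if rng.random() < 0.5:        # particle a moves, b stays
            a += 1 if rng.random() < (n - a) / n else -1
        else:                         # particle b moves, a stays
            b += 1 if rng.random() < (n - b) / n else -1
    return t

rng = random.Random(0)
n = 400
gap = int(math.sqrt(n))               # start at distance about root n
t = coalescence_time(n, n // 2 + gap, n // 2, rng)
assert 0 < t < 10**6                  # coalescence happens (order n in expectation)
```

Starting the particles at distance about root n near the middle, the loop exits quickly; the supermartingale argument in the lecture says the expected coalescence time is of order n.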
Okay, so the key thing is that half n log n. What one finds, if you do this argument with more care, is that t_mix of epsilon is bounded above by n log n over two, plus a constant that depends on epsilon, times n. So if you want to be sure that they coalesce, you just have to wait an additional large constant times n; this c_epsilon is actually of order log one over epsilon. So the key is that the epsilon only affects the second-order term. It turns out there's also a lower bound of the same type, n log n over two minus c_epsilon times n. And this will be the cutoff phenomenon. But before that, here's the contrast, an example where there is no cutoff: random walk on the cycle.
So for a lazy random walk on the cycle, I said that you can get from the central limit theorem mixing time of n squared, but you can also easily get it using coupling. So then if we have lazy walk on the cycle,
we'll use the same trick as before where with probability half, we toss a coin and decide which of the two particles moves and then the other one will stay in place. And then the distance between the two particles will just itself be a simple random walk on the interval zero n,
because once the distance reaches either zero or n, they've coupled. And now the classical gambler's ruin estimate tells us that if you have a gambler starting at k who stops at either zero or n, doing simple random walk, not lazy now, then the expected time till he reaches the boundary is just k times (n minus k). So it's always bounded by n squared over four, and hence the expected coupling time, or coalescence time, is bounded by n squared over four. And this gives a bound of n squared, or here I was generous and put a factor of two, for the mixing time.
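The gambler's ruin formula quoted here is easy to sanity-check; this small Python sketch (mine, not from the lecture) verifies that k(n - k) satisfies the defining recursion and is maximized at n squared over four:

```python
def expected_ruin_time(k, n):
    """Expected absorption time at {0, n} for simple random walk from k."""
    return k * (n - k)

n = 100
# the formula satisfies the harmonic recursion h(k) = 1 + (h(k-1) + h(k+1))/2
for k in range(1, n):
    assert expected_ruin_time(k, n) == 1 + (expected_ruin_time(k - 1, n)
                                            + expected_ruin_time(k + 1, n)) / 2
# boundary conditions, and the maximum n^2/4 attained at the middle
assert expected_ruin_time(0, n) == 0 and expected_ruin_time(n, n) == 0
assert max(expected_ruin_time(k, n) for k in range(n + 1)) == n * n // 4
```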
But this is t mix of a quarter. If you want really to mix t mix of epsilon where epsilon is small, then this will not be enough. It's easy from this argument to get a bound
of n squared times log one over epsilon, because you can just run repeated trials. Each time, you succeed in coupling with probability at least a half by time two n squared. And then if you do repeated experiments, the chance you still haven't coupled
after k such experiments is exponentially small in k. So you can easily get a bound t mix epsilon, which is n squared log one over epsilon, but that log is really needed in this problem. So the difference between the hypercube and the cycle is that the epsilon on the cycle affects the first order term in the mixing, so it multiplies the n squared,
while at the hypercube it does not. So the holy grail of the topic is to, starting from these two very elementary examples, understand in more complicated groups, in more complicated chains, when is there cutoff and when is there not.
Let me give you one simple example that was just resolved this year. So this is known as the random to random shuffle. So you have n cards, one on top of the other. And you pick a card uniformly at random, take it out of the deck, and put it in another uniform place.
How long does this take to mix? It's pretty easy to guess, and not so hard to prove, that the order is n log n. What's the constant in front? Is it concentrated? Is there cutoff? That was open for more than 20 years, and it was just solved this year; I'll give the citation later. But the answer is three quarters n log n for that chain. And that's still one of the examples
which is understood by very specialized arguments, upper bound using representation theory and lower bound by careful definition of the right event. There's no general theory that can solve for us now this type of example. So the holy grail is to understand when can we expect cutoff and when can we prove it.
So just for the cycle, here's just the elementary argument why the mixing time indeed is of order n squared and not less. Because if time is less than n squared over 32, then the chance starting at 0, the chance you will reach as far as n over 4
is too small by Chebyshev. So the mixing time is at least n squared over 32. So this type of argument won't identify the right constant, but it will give you the order.
So here is one tool to lower bound the total variation distance. This is not a sharp estimate, just the estimate you get from Chebyshev. If you have two distributions, for us one will be the stationary distribution and the other the distribution at time t, and the means of some statistic f differ by r times its standard deviation, then the total variation distance is at least 1 minus a constant over r squared. The way you check that: remember one of the two definitions of total variation. The total variation distance between mu and nu is the maximum over sets A of mu(A) minus nu(A). So you just have to define a good event. If you have these two measures, mu and nu, one way is to look at the threshold,
which is the average of the means; give this point a name, say theta, so theta is (E_mu of f plus E_nu of f) over 2. And the event A that you take is the set of x such that f(x) is bigger than theta. One mean lies below theta and the other above it. So by Chebyshev, the probability that f(x) is bigger than theta is small under the measure with the smaller mean, and the probability that it is smaller than theta is small under the measure with the larger mean. So just by using Chebyshev twice with this threshold, you get this type of bound. Again, if you haven't seen it,
go through the Chebyshev argument to do it. And in the book, we have this as well as the sharper bound of this type. So you can do better than Chebyshev here. So this can be applied in many examples. For instance, in the hypercube, you can apply it.
So let's look at the Hamming weight chain. We can write the expected total Hamming weight as a combination of two terms. If r_t is the expected number of coordinates that have not been updated by time t, then those coordinates still carry a 1, because we started from all ones, while among the coordinates that have been updated, in expectation half are 1. So this expectation is (r_t plus n) over 2, and since r_t is n times (1 minus 1/n) to the t, the expectation starting from all ones of W(X_t) is n over 2 times (1 plus (1 minus 1/n) to the t). And it's very easy to see that in the stationary distribution the variance of W is n over 4; it's possible to check from this formula that the variance is bounded by n over 4 for all t. So again, I leave this verification to you.
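The expectation formula can be double-checked against the one-step recursion; a short Python sketch (mine, not from the lecture):

```python
def expected_weight(n, t):
    """E_1[W(X_t)] = (n/2) * (1 + (1 - 1/n)^t) for the lazy hypercube walk
    started from the all-ones state."""
    return (n / 2) * (1 + (1 - 1 / n) ** t)

n = 50
assert expected_weight(n, 0) == n     # started from all ones
for t in range(1, 200):
    # one-step recursion E[W_t] = (1 - 1/n) E[W_{t-1}] + 1/2, since the
    # updated coordinate forgets its bit and becomes 1 with probability 1/2
    lhs = expected_weight(n, t)
    rhs = (1 - 1 / n) * expected_weight(n, t - 1) + 0.5
    assert abs(lhs - rhs) < 1e-9
```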
And once you know that, you can just use the previous lemma to lower bound the total variation distance. The expectation under pi of W is n over 2, and the expectation starting from all ones is the formula written above, so the difference of means is (n over 2) times (1 minus 1/n) to the t. If we take t equal to half n log n minus alpha n (oops, there was a stray n on the slide; the correction term is alpha n, there is no extra n there), then (1 minus 1/n) to the t is about e to the alpha over root n, so the difference of means is about (root n over 2) times e to the alpha, which is order e to the alpha standard deviations. At that time, d(t) is bounded below accordingly: if alpha is very large, then d(t) will be close to 1. So this is the other part of what I told you: at time half n log n minus a linear correction, you're going to get d(t) close to 1. Sorry. So for the hypercube, L2 and L1, that is, the total variation mixing and the L2 mixing, both occur at the same time,
half n log n with corrections of order n. So the L2 calculation we'll see shortly. So there is a window of order n.
So this is an example of the cutoff phenomena, which I will define next. Now one thing to notice is that in this chain, we said the mixing time is order n log n. The relaxation time, which is defined as the inverse of the spectral gap. So we'll use this more, but let me write this.
So the relaxation time is defined as 1 over the spectral gap. For a stochastic matrix, the largest eigenvalue lambda_1 is 1, and the next eigenvalue lambda_2 is smaller. Actually, I'm going to use not lambda_2 but lambda_star, the largest eigenvalue in absolute value: the maximum of the absolute value of lambda over eigenvalues lambda different from 1. Sometimes the eigenvalue of largest absolute value is negative, so this is what we use for the relaxation time. For the hypercube, the relaxation time will be n, since the largest absolute value is attained by the second eigenvalue, which is 1 minus 1 over n, and all the eigenvalues are non-negative. Maybe one thing I should comment: we work a lot with lazy chains. It's convenient because it removes periodicity.
It also, of course, kills all the negative eigenvalues. Going to a lazy chain is just averaging the transition matrix P with the identity. A reversible stochastic matrix is self-adjoint with respect to pi, so its eigenvalues are real and lie in the interval from minus 1 to 1. If you average with the identity, the eigenvalues move into the interval from 0 to 1, so you don't have to worry about negative eigenvalues. I'm pointing this out because it will be part of a general phenomenon.
So here the relaxation time on the hypercube is n, while the mixing time is order n log n.
OK, so here is a general definition of cutoff that I've alluded to before. But here is one formal definition. So this d is really dn. So we look at the total variation distance in the n-th chain. And maybe there is a picture here.
So this picture is maybe a good one to look at. So what we want, and I'll come back to the definition, what we want is that the total variation distance will descend very rapidly from near 1 to near 0. So there are several equivalent ways to write this.
So one is here: there is some window w_n, and w_n should be little o of t_n in order to get cutoff. This is not written in the slides, but let me add it.
So tn formally represents the mixing time. wn is a window where the mixing happens. So if we are at a time which is tn minus some large constant times wn, the total variation distance should be close to 1.
If you're at tn plus a large constant wn, total variation distance should be close to 0. And so that will be cutoff and an equivalent definition of cutoff. So maybe easier to remember. Cutoff means that t mix n of epsilon
doesn't depend on epsilon to first order. In other words, if you like t mix n of epsilon and divide by t mix n of 1 minus epsilon, this ratio should go to 1 as n goes to infinity. So that's equivalent to cutoff.
It's just easy to see it's equivalent to this. And another useful notion is precutoff, which means I don't require that this will go to 1, but I just require that this ratio doesn't blow up with epsilon. So formally, for any epsilon, once you fix epsilon and you take a lim sup in n, this lim sup
should be bounded in a way which is independent of epsilon. So there are a few examples that have precutoff but don't have cutoff. Often what happens is there is a chain. It probably has cutoff or sequence of chains. It probably has cutoff, but we can't prove it. So the substitute is, well, we prove precutoff.
In fact, for many chains where we conjecture cutoff holds, we know pre-cutoff, but we just can't refine it. OK, so here is another way to think of cutoff: if you evaluate d_n at a constant c times the mixing time of the n-th chain, then the limit in n should be 1 if c is less than 1, and 0 if c is bigger than 1. So I told you that for the hypercube,
there's a very easy argument using eigenvalues. Let's go through that, because that actually bounds something stronger. It bounds the L2 distance rather than just the L1 distance. The drawback of this argument is it's specialized. It requires this detailed knowledge of the eigenvalues and the eigenvectors. So again, we have p reversible, eigenvalues,
lambda_1 at least lambda_2, down to lambda_n, and f_j the eigenfunction corresponding to eigenvalue lambda_j. Then the total variation distance squared, multiplied by 4 because total variation is half the L1 norm, is bounded by the L2 distance squared. And this can now be written explicitly: you just diagonalize, and you easily see that the L2 distance squared is the sum over j at least 2 of f_j(x) squared times lambda_j to the 2t, where f_j is a normalized eigenfunction, normalized in L2 of pi.
Now the relaxation time, as I defined it, is 1 over (1 minus lambda_star), where lambda_star is the largest absolute value of an eigenvalue different from 1. And so using the diagonalization, it's easy to prove an upper bound.
t_mix of epsilon is bounded by t_rel times a logarithmic factor, log of one over (epsilon times pi_min), which involves two terms: the epsilon that we want and pi_min, the minimum stationary probability.
So again, this is a general upper bound. In some cases, it is sharp. But in many, it is off. So even in the hypercube, this bound is too generous. So in the hypercube, the relaxation time is n.
The log one over epsilon should be there. But pi_min is 1 over 2 to the n in the n-dimensional hypercube, so log one over pi_min is of order n, and this will give you an upper bound of order n squared on the mixing time. So this kind of bound, although sometimes useful, is often not sharp. One example where it is sharp, and you'll
see that later, say, in Eyal Lubetsky's lectures, is in expander graphs. So expander graphs have a bounded degree and have relaxation time, which is order 1. And pi min is 1 over n. So this general bound for expander graph
on n nodes, it will give a bound of log n for the mixing time, which is the right order. So there are examples where this bound is the right order. But often, it is too generous. So if you compute using eigenvalues the distance
in the hypercube for the lazy walk: the eigenvalues, as I said, are all in the interval from 0 to 1, and in fact they are 1 minus j over n with multiplicity n choose j. Again, that's an easy exercise. For instance, the largest non-trivial eigenvalue is 1 minus 1 over n, and it has multiplicity n, because each
of the coordinate functions has this eigenvalue. So we have n coordinate functions, the first, the second, up to the n. Each one is an eigenfunction with eigenvalue 1 minus 1 over n. And in general, you'll get eigenvalues 1 minus j over n
with multiplicity n choose j. The eigenfunctions are characters. You can write them explicitly. And if you use that, you'll get the L2 norm squared. You can, again, write this binomial expression
and bound it in this way. And this bound is close to sharp: (1 plus e to the minus 2t over n) to the n, minus 1. So if t is half n log n plus cn, then e to the minus 2t over n equals e to the minus 2c, over n; plug this in and use 1 plus x bounded by e to the x, and the right-hand side is at most e to the (e to the minus 2c), minus 1. So the key thing is, as c grows, the exponent e to the minus 2c tends to zero, so this bound tends to e to the 0 minus 1, which is 0. So this means that at this time, half n log n plus cn, not only are we controlling the L1 distance, but we're controlling the L2 distance as well. Any questions?
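Here is a numeric sanity check (my own sketch) of this eigenvalue computation for a small hypercube: the exact L2 distance squared is dominated by the bound (1 + e^(-2t/n))^n - 1, which becomes tiny once t = (1/2) n log n + cn with c large:

```python
import math

def l2_sq(n, t):
    """L2 distance squared for the lazy hypercube walk:
    sum over j >= 1 of C(n, j) * (1 - j/n)^(2t)."""
    return sum(math.comb(n, j) * (1 - j / n) ** (2 * t) for j in range(1, n + 1))

def l2_bound(n, t):
    """The bound (1 + e^(-2t/n))^n - 1, from 1 - x <= e^(-x)."""
    return (1 + math.exp(-2 * t / n)) ** n - 1

n = 30
for c in (0, 1, 2, 3):
    t = int(0.5 * n * math.log(n) + c * n)
    assert l2_sq(n, t) <= l2_bound(n, t)      # the bound holds term by term
# at t = (1/2) n log n + c n the bound is about e^(e^(-2c)) - 1: small for large c
t = int(0.5 * n * math.log(n) + 3 * n)
assert l2_bound(n, t) < 0.01
```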
Maybe before going on to the next example, let me connect to something that Charles was telling you about yesterday: the lamplighter groups. That's a good, interesting example where L1 and L2 are different. So let's look at the lamplighter group on the cycle. I want to give you one more interesting example before continuing with the theory. So what is an element of this group? I won't draw the whole group; I'm just drawing one element. So this is the lamplighter group on the cycle: we have a cycle of n nodes.
And an element consists of a bit for each node, which you can think of the lamp being on or off, and also a marker on one of the nodes.
So this picture is not of the Lamplighter group. It's a picture of one element. Now, I don't care now about the group operation, but rather just about the graph. So in order to define the graph, I just have to tell you for a given state,
who are the neighbors of a given state, of a given vertex? Here there are three neighbors: one obtained by flipping the bit where the marker is, and two obtained by moving the marker left or right. The easiest chain to analyze will be not the simple random walk on this graph, but what's called the randomize-move-randomize chain: given where the marker is, first randomize the lamp at its location, then have the marker make one lazy step, and then randomize the lamp at its new location. But this is just a technicality, to make the chain nice and reversible, and to ensure that all the locations the marker has visited have randomized lamps.
So what is the mixing time in this example? So then it's pretty easy to see that the mixing time in total variation is going to be order n squared. Because by time n squared, this marker
will have moved around the cycle. If I wait 100 n squared with high probability, it moves around the cycle. And it has time to randomize all the lamps. So I'm not after the constants now. So this example doesn't have cutoff, but I want to understand just orders of magnitude.
So in this example, the mixing time is going to be order n squared. But now suppose we want to understand the mixing time in L2. So this, I claim, will have a higher order. And to see that, just consider the event
that the marker stays in one half of the cycle, say in this half. That event is very reasonable if I just wait for a time period of n squared. So with constant probability, the marker will stay in this half cycle.
So let's assume that the initial state was just all zero state. So that was the initial state, all zeros. And the marker here, and it stayed in this half. So what this means is that if it stayed in this half,
it means all the lamps in the other half, the lower half, stayed zero. So instead of being spread over 2 to the n states, the distribution, on this event of constant probability, is concentrated on a tiny collection of about 2 to the n over 2, times n, states, a tiny part of the state space. So the density is exploding: you have a constant amount of probability mass on only about 2 to the n over 2 out of 2 to the n states, so when you look at the density with respect to the stationary measure, it's huge, and the L2 norm is exponentially large at this time. And this will remain true as long as you have a substantial probability of staying
in this half. So the only time the mixing in L2 will occur is when the probability that you stay in this half is exponentially small. And for this, you need to wait for n rounds, where each round is time n squared. So it turns out that the mixing time in L2 for this example is n cubed, is order n cubed.
So the ratio between, so the mixing time in L1 is order n squared. The mixing time in L2 is order n cubed, which is the maximum possible ratio between them, because it's log of the size of the state space.
So now in the hypercube, that doesn't happen, and we get mixing in L1 and L2 to happen, not just in the same order, but really at the same time, so with the same constant. All right, so I want to mention
the spectrally necessary condition for cutoff, which involves the mixing time and the relaxation time. Here I took the relaxation time minus one, just to rule out trivial two-state examples; in all examples of interest, the relaxation time goes to infinity, so this minus one doesn't matter. Okay, so here is a necessary condition for cutoff, even for pre-cutoff: the mixing time and the relaxation time should be of different orders. This is already enough to exclude cutoff on the cycle.
So if we have, on the cycle, again, the first eigenvalue is one, the next eigenvalue is one minus constant over n squared, and so the relaxation time on the cycle is n squared, and so is the mixing time, they're of the same order,
and hence there is no cutoff. So in general, why is this true? Basically, it's true because of the following inequality. I gave you an upper bound before, which was the relaxation time times log of one over (epsilon times pi_min); but there's also a lower bound: t_mix of epsilon is at least the relaxation time minus one, multiplied by log of one over two epsilon. Again, you can find that in my Markov chain book, but the idea behind this inequality is to use the second eigenfunction. So let me explain the intuition here.
Take the stationary distribution and perturb it by the second eigenfunction. Okay, so we move it by some constant times the second eigenfunction. This perturbation, if you wait the relaxation time, it just brings down this perturbation by a factor of e.
So the perturbation doesn't disappear. So you won't mix until really this perturbation becomes less than epsilon. So since the perturbation only is squeezed by a factor e in every round of the relaxation time, you will need log one over epsilon rounds until this perturbation becomes negligible.
So that's the idea behind this lower bound on the mixing time; a simple argument along these lines is written out in the book. And then if you look at the ratio t_mix of epsilon over t_mix, where t_mix means t_mix of one quarter, you'll get a lower bound proportional to log of one over two epsilon, under the assumption that the mixing and relaxation times are of the same order. So this kind of inequality tells you that you don't have cutoff, and even that you don't have pre-cutoff, because this ratio is bounded below by a quantity that blows up as epsilon goes to zero, okay?
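For the cycle, the claim that the relaxation time is of order n squared can be checked numerically; a small sketch (mine, using numpy):

```python
import numpy as np

def lazy_cycle_gap(n):
    """Spectral gap of the lazy simple random walk on the n-cycle."""
    P = np.zeros((n, n))
    for i in range(n):
        P[i, i] = 0.5
        P[i, (i + 1) % n] = 0.25
        P[i, (i - 1) % n] = 0.25
    lam = np.sort(np.linalg.eigvalsh(P))[::-1]  # P is symmetric here
    return 1 - lam[1]                           # gap = 1 - (second eigenvalue)

# second eigenvalue is (1 + cos(2 pi / n))/2, so the gap is about pi^2 / n^2
for n in (20, 40, 80):
    gap = lazy_cycle_gap(n)
    assert np.isclose(gap, 1 - (1 + np.cos(2 * np.pi / n)) / 2)
    assert 0.5 < (1 / gap) / (n**2 / np.pi**2) < 2   # relaxation time order n^2
```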
So one kind of, okay, so again, for the cycle, I said both mixing and relaxation time are order n squared and there is no cutoff. For the hypercube, hypercube the ratio with order log n goes to infinity and there is cutoff. So if you look at many examples,
this suggests the conjecture that this ratio going to infinity is the condition for cutoff. We know it's essentially necessary (with this minus one), and it's usually sufficient, but there's a warning attached to that "usually": there are various weird counterexamples.
So one thing we are still missing is an exact condition that will inform us of cutoff without calculating exact constants. So to find where the cutoff is, you have to do precise calculations, but the hope is that we will find qualitative properties
of a chain that will tell us when there is cutoff and when there's not. And philosophically, I think by now we understand that just the theorems are somewhat behind. So I'll give more examples, but I'll say a word about the philosophy that a cutoff occurs when your state space
has a lot of independent or weakly dependent variables, and mixing means that most of these have to mix separately, while if your state space is low dimensional or tightly correlated, then you don't expect cutoff.
So then in the cycle, I said there's no cutoff if you have any torus of bounded dimension, right? So instead of the cycle, so we looked at the cycle Zn, Z mod n, but you could look at Zn to the d for fixed dimension and do a simple lazy random walk there.
And for any fixed d, this will not have cutoff. However, you could say let n tend to infinity and also let the dimension tend to infinity. And it turned out that if the dimension tends to infinity with n, no matter how slowly, then this implies cutoff.
So somehow cutoff captures a high dimensional behavior of your chain. From the spectral point of view, cutoff represents that mixing isn't the work just of one eigenfunction,
but somehow there has to be a cooperation of many important eigenfunctions. Okay, so one signature of cutoff is high multiplicity of the largest non-trivial eigenvalue, but that's not necessary for cutoff, because you might instead have many eigenvectors
corresponding to close by eigenvalues. So exactly quantifying the condition is still a challenge. So here is a nice example by Aldous showing that just having mixing and relaxation time of different order is not enough to give cutoff.
So this will be a reversible chain. I'm describing a transition matrix, but the actual chain we'll use is the lazy version of this: with probability half we stay in place, with probability half we move according to these rules. There is a tail here of length five n where we move with probability one third left, two thirds right. So this has a strong drift to the right.
And then here the path breaks in two pieces. So this lower path is of length n and here the probabilities are one fifth to the left, four fifths to the right. On the upper path it's of length two n,
so it's longer, you don't see it in the picture, but this is length two n. And here you have one third to the left, two thirds to the right. So in the upper one it's longer and slower. The lower one is shorter and faster. However, these probabilities are exactly planned to make this reversible. So how does the stationary distribution behave?
So because of these transitions, the stationary distribution just doubles every time as you go here to the right. Now here as you go on the bottom path, the stationary distribution has to grow by factor four every time to compensate for these probabilities.
And here on the top, it doubles every time. So because the top path is double the length two n, these two things are consistent. And when you reach this point, its stationary measure is just four to the n larger than this point, that's the idea.
And so this defines a reversible chain. The stationary measure at the right end is four to the n times that at the branch point, up to an extra bounded factor of two, but the exact constant is not important. Okay, so this chain is actually a weighted expander. By that I mean the following; note that almost all the stationary measure is concentrated near this right node.
So if I take any set of measure less than half, so it means it really cannot contain all these nodes on the right, its boundary has measure which is at least this constant times the size of the set. So this is a weighted expander.
And hence, using the Cheeger inequality, which I haven't explained here, but again, you can find it in many sources, including our Markov chain book, the relaxation time of this chain is actually order one: the second eigenvalue is bounded away from one.
The mixing time is of course not order one, it's much larger because it's clearly order n. So the worst starting point will be here, almost all the stationary measure is concentrated here. So the mixing time is determined by the time it takes a particle from here to reach this point.
However, that time depends on whether the particle moves along the bottom path or the top path. So if I look at the total variation distance to stationarity, it behaves in this funny way, which I'll explain back in this picture.
So when the particle reaches here, it takes the top path with positive probability and the bottom path with positive probability. These are not exactly half-half. If it chooses the top path, then after this initial part it will take another six n. Okay, I'm ignoring the laziness, so everything has to be doubled. Without the laziness, this initial part would take time 15 n, and then another six n on the top path, because here you move at speed one third and the length is two n. On the other hand, the bottom path is shorter and you move faster, so it will only take five thirds n, again ignoring the laziness; the laziness doubles all the times. The point is that these two possibilities take different multiples of n time.
So if I look at the total variation distance to stationarity, it will start near one, and at a time which is twice (15 n plus five thirds n), the factor of two accounting for the laziness, it will drop,
but it will not drop to near zero. It will drop to a constant which corresponds to the probability that the particle takes the slow road. And then finally, when the time of the slow road arrives, which is (15 n plus six n) times two, the total variation distance will go down to near zero.
So in this example, there is no cutoff. This is what it looks like; again, everything has to be doubled to account for the laziness. Okay, so this is just what I explained in words.
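The speed calculation underlying this example, namely that a walk with drift p minus q crosses a segment of length l in about l over (p minus q) steps, can be checked with a quick simulation. This is my own illustrative sketch, not the chain from the picture: just a biased walk on the integers with p equal to two thirds, where a segment of length 2n should take about 6n steps without laziness.

```python
import random

def hitting_time(length, p_right, rng):
    """Steps for a walk started at 0, moving +1 w.p. p_right and -1
    otherwise, to first reach `length`."""
    pos, steps = 0, 0
    while pos < length:
        pos += 1 if rng.random() < p_right else -1
        steps += 1
    return steps

rng = random.Random(0)
n = 50
times = [hitting_time(2 * n, 2 / 3, rng) for _ in range(2000)]
mean = sum(times) / len(times)
# drift = 2/3 - 1/3 = 1/3, so the expected hitting time is (2n)/(1/3) = 6n = 300
print(round(mean))
```

With laziness (holding with probability one half) all these times simply double, as stated in the lecture.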
And so in this example, if you look at it closely, there is a pre-cutoff, but no cutoff. Yes?
Are there such examples where the stationary distribution is uniform?
Yes, there are. So I was debating whether to give that, but let me give one. This is an example modified from an idea of Igor Pak. So let's start with the hypercube, and, okay, I'll just write it in words:
with probability one minus one over L, take one lazy move in the hypercube, and with probability one over L, jump to the uniform distribution.
Okay, so at any state this defines a chain. It can be done in many places, but to be concrete, start with the hypercube, so this will still be a chain on the hypercube. I still have to tell you what L is: L is the geometric mean of the mixing time and the relaxation time, so it is n times the square root of log n. So the state space is the hypercube.
With probability one over L, you go from where you are to a uniformly chosen configuration. With probability one minus one over L, you take one lazy move in the hypercube, which we know how to do: choose a uniform coordinate and randomize it, okay? Now, what is the relaxation time of this chain?
It's easy to see that the relaxation time is essentially what it was: if you look at how the eigenvalues change, the second eigenvalue is still close to one minus one over n.
But the mixing time has now gone down from n log n to n root log n, because the main thing governing the mixing is no longer mixing in the hypercube: by time n root log n, which is before the hypercube mixes, we will typically have made one of these uniformizing moves, and that determines the mixing time. But because this is just the waiting time for a rare event, the total variation distance will have an exponential profile rather than a cutoff.
So the mixing time is order L, but t mix of epsilon will be L times log of one over epsilon. So in this example the stationary measure is uniform, and there is no pre-cutoff, even though the mixing time is order L, which is much larger than the relaxation time. So we do have such examples, but you see they're kind of contrived.
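A minimal sketch of this modified chain, written as my own illustration with n and L as in the lecture: with probability one over L the chain jumps to a uniform state, and otherwise makes one lazy hypercube move. The simulation also checks the key point that the waiting time for the first uniformizing move is geometric with mean L, which is why the total variation profile is exponential rather than a cutoff.

```python
import math
import random

def step(state, n, L, rng):
    """One step of the modified chain on the hypercube {0,1}^n."""
    if rng.random() < 1.0 / L:
        # rare uniformizing move: jump to a uniform configuration
        return tuple(rng.randint(0, 1) for _ in range(n))
    # otherwise one lazy hypercube move: choose a uniform coordinate
    # and randomize it
    s = list(state)
    s[rng.randrange(n)] = rng.randint(0, 1)
    return tuple(s)

n = 10
L = n * math.sqrt(math.log(n))  # geometric mean of t_mix ~ n log n and t_rel ~ n
rng = random.Random(1)

x = tuple([0] * n)
for _ in range(100):
    x = step(x, n, L, rng)

# waiting time for the first uniformizing move: geometric with mean L
waits = []
for _ in range(5000):
    t = 1
    while rng.random() >= 1.0 / L:
        t += 1
    waits.append(t)
mean_wait = sum(waits) / len(waits)
print(round(mean_wait, 1))  # should be close to L, about 15.2 for n = 10
```
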
So I guess we're out of time, but as I said, there are other interesting counterexamples. The one you saw is modified from an example of Pak. There are also interesting examples by Lacoin.
So, okay, maybe let me end with one example you can think about. Take a binary tree of depth k with n vertices, so n is about two to the k, and try to figure out whether in this example there is cutoff or not. So maybe I'll end with these two exercises, and I'll tell you what the answer is.
One: show that on a finite binary tree of depth k, lazy simple random walk has no cutoff.
One way to do it is to check the relaxation time and the mixing time and show they're of the same order; in fact, they're both order n, where n is the number of nodes. Okay, so that's one. Two, which I'll star: find a sequence of trees on which lazy simple random walk has cutoff.
It's quite surprising that such trees exist. If you have a path and you do simple random walk on it, there's certainly no cutoff; it's just like the cycle. So it's a little surprising that there are trees, a sequence of trees without weights that you can write down, of bounded degree (maybe I'll emphasize that: bounded degree), such that lazy simple random walk has a cutoff.
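For the first exercise, the claim that the relaxation time is of order n can be checked numerically on small trees. This is a sketch of my own, assuming the standard complete-binary-tree indexing (root 0, children of v at 2v+1 and 2v+2): it builds the lazy walk's transition matrix and reads the relaxation time off the spectral gap, which should grow roughly in proportion to the number of nodes.

```python
import numpy as np

def lazy_srw_tree(depth):
    """Transition matrix of lazy simple random walk on the complete
    binary tree of the given depth."""
    n = 2 ** (depth + 1) - 1
    P = np.zeros((n, n))
    for v in range(n):
        nbrs = []
        if v > 0:
            nbrs.append((v - 1) // 2)        # parent
        for c in (2 * v + 1, 2 * v + 2):     # children, if present
            if c < n:
                nbrs.append(c)
        for w in nbrs:
            P[v, w] = 0.5 / len(nbrs)        # move w.p. 1/2, uniform neighbor
        P[v, v] = 0.5                        # lazy: hold w.p. 1/2
    return P

trels = {}
for depth in (3, 5, 7):
    P = lazy_srw_tree(depth)
    # the walk is reversible, so the spectrum is real
    eig = np.sort(np.linalg.eigvals(P).real)
    trels[depth] = 1.0 / (1.0 - eig[-2])     # relaxation time = 1 / spectral gap
    print(depth, P.shape[0], round(trels[depth]))
```

Going from depth 3 to 5 to 7 multiplies the number of nodes by roughly four each time, and the computed relaxation times should grow by a comparable factor, consistent with t_rel of order n.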
So that's enough for today. Okay, so I'll switch. In my next lecture, which is not tomorrow, I want to tell you a little bit about rate of escape questions as well, and then we'll come back to cutoff in the last lecture.
Thanks a lot for the beautiful lecture. Are there any more questions?
I have a question. At some point you mentioned the connection between the spectrum of the operator and the mixing time, and the ratio between the relaxation time and the mixing time. What about bounds in terms of the spectral density close to one, close to the maximal eigenvalue?
We don't have that; there should be some connection like that, but we don't know the right one yet. Does it depend on the structure of the eigenstates, or just the spectral density? Do you think it depends on the localization of the eigenstates? In general it must depend on that, but under some symmetry assumptions maybe you can remove it,
and it's still not fully understood. We have a coffee break for 30 minutes, and we will be back at four o'clock. Thanks again. Thank you.