A Dyadic Simulation Approach to Efficient Range-Summability
Formal Metadata

Title | A Dyadic Simulation Approach to Efficient Range-Summability
Number of Parts | 13
License | CC Attribution - NonCommercial - NoDerivatives 3.0 Germany: You may use, copy, distribute, and make the work or its content publicly available in unaltered form for any legal and non-commercial purpose, provided you credit the author/rights holder in the manner specified by them.
Identifiers | 10.5446/57488 (DOI)
Transcript: English (auto-generated)
00:01
Hello everyone, I am Jingfan Meng from Georgia Tech. Today I will present joint work with Huayi Wang, Jun Xu, and Mitsunori Ogihara. In this work, we propose a novel solution to the following problem: how can we answer range-sum queries on a long sequence of random variables, both consistently and efficiently?
00:27
Suppose we have a long sequence of billions of random variables, denoted X0 through XU-1. These random variables are ideally i.i.d., and they all have the same user-defined target distribution.
00:43
This efficient range-summability (ERS) problem involves two phases. In the initialization phase, the realizations of these random variables are fixed. This means their values are determined, though perhaps not computed explicitly.
01:02
In the query phase, given any query range, we want to return the range sum of these random variables. I will use the following example to illustrate this problem. Here we have eight random variables, distributed as steps of a 1D random walk with values +1 and -1.
01:23
Suppose we have fixed them to the values here in the initialization phase. Then the answer to any range query in the query phase is already determined. For example, if the query range is 2 to 6, not including 6, then the answer is the sum of the 4 random variables in this box, and it is 0.
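The toy example above can be written out in a few lines. This is a sketch: the talk does not list the eight fixed realizations, so the ±1 values below are assumed for illustration, chosen so that the query over [2, 6) sums to 0 as stated.

```python
# Hypothetical realizations of the eight +/-1 random-walk variables
# (assumed values, chosen so that the range [2, 6) sums to 0 as in the talk).
x = [1, -1, 1, -1, -1, 1, 1, -1]

# Range query [2, 6): sum x2 + x3 + x4 + x5, not including x6.
answer = sum(x[2:6])
print(answer)  # → 0
```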
01:49
So, any efficient range summability solution should satisfy these two requirements, consistency and efficiency.
02:00
Consistency means the answer to any range query is determined as the sum of the random variables, as realized in the initialization phase, over this range. In particular, for any range, a user would get the same result whether he queries for the range sum
02:21
of the entire range or first queries for every random variable in this range and then sums the answers up. Efficiency means each query should be answered in only O(log U) time. This is because query ranges may cover billions of random variables.
02:42
If we simply added up the realized random variables in the query range, it would take O(U) time. In the worst case, that is too long, because U can be very large. Our ERS problem has two applications. The first application is tracking the L2 norm of data streams.
03:03
In this problem, the precise state of the system is described by U counters, sigma 0 through sigma U-1. Initially, all counters have value 0. The state is continuously updated by a long stream of data items that arrive one by one.
03:27
Since U is very large, we cannot afford to store all counters. We want to track the L2 norm of this stream using some sketches. This problem was already solved in 2006 for the case in which each data item is a point update.
03:46
That means increasing a specific counter by an amount delta. This solution uses some sketches, and any one of the sketches works as follows. In the initialization phase, we fix a sequence of 4-wise independent Gaussian random variables, X0 through XU-1.
04:10
And then we associate each counter with a random variable that has the same subscript. When data items arrive, we update the sketch so that it is always the weighted sum of counter values
04:25
with weights given by the corresponding random variables. It means when a data item (i, delta) comes, we update the sketch, incrementing it by an amount delta multiplied by Xi. Finally, we can estimate the L2 norm from these sketches.
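The point-update scheme just described can be sketched in code. This is a minimal illustration, not the paper's implementation: the class and method names are made up, and the Gaussian variables X0 through XU-1 are pre-drawn for clarity — which is exactly what ERS lets a real implementation avoid, by generating them on demand from short seeds.

```python
import random

class L2Sketch:
    """Minimal sketch of the point-update scheme (illustrative names)."""
    def __init__(self, U, seed=0):
        rng = random.Random(seed)
        # Pre-draw X_0..X_{U-1}; feasible only for small U.
        self.x = [rng.gauss(0.0, 1.0) for _ in range(U)]
        self.value = 0.0  # maintained as sum_i sigma_i * X_i

    def point_update(self, i, delta):
        # Data item (i, delta): counter sigma_i increases by delta,
        # so the weighted sum increases by delta * X_i.
        self.value += delta * self.x[i]

sk = L2Sketch(U=8, seed=1)
sk.point_update(3, 2.0)
sk.point_update(3, -2.0)  # the two updates cancel exactly
print(sk.value)  # → 0.0
```

Averaging the squares of several independent such sketches then estimates the squared L2 norm of the counter vector.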
04:45
Our ERS solution is useful in two extensions to this problem. The first extension is to handle range update items. It means to increase every counter in a particular range by amount delta.
05:00
In this case, the update rule becomes: increment the sketch by delta multiplied by the range sum of these random variables. In the second extension, we can estimate the sum of the counters in a range, with the estimator being the sketch multiplied by a range sum of random variables.
05:23
In both extensions, we need to compute range sums of random variables at very high speed, so that we can keep up with the arrival rate of data items. So an ERS solution is necessary.
05:40
The second application is approximate nearest neighbor search in Manhattan distance. A state-of-the-art solution to this problem calls for a solution that can answer a large number of range sum queries in several milliseconds. In that work, the authors have to use a very large precomputed table to do so.
06:04
We can eliminate the use of this table while keeping all the guarantees of the table-based solution. Having formulated this ERS problem, we require that any desirable solution meet these three points.
06:23
It should support a wide range of target distributions. It should have strong independence guarantees on the generated random variables. And third, it should have implementations that actually run very fast. Unfortunately, all three existing schemes listed in this table only work on the random walk distribution.
06:49
And the only work, RM7, that goes beyond 4-wise independence is very slow in practice. In contrast, our proposed solution, named dyadic simulation, can support in principle any target distribution.
07:05
It can guarantee k-wise independence of the generated random variables for any k larger than zero. And it has very fast implementations for the Gaussian, Cauchy, and random walk distributions.
07:22
For example, if we have 1 billion random variables, then any range-sum query takes only several hundred nanoseconds to compute. Now I will start describing our scheme. It is based on an idea that was briefly mentioned in a theory paper in 2002.
07:41
The starting point is the following dyadic simulation tree. It stores the realizations of all random variables, and the range sums over every dyadic interval of this universe. So this tree has all the random variables and their sums, combined two by two.
08:02
Say x0 plus x1 gives the sum over [0, 2), and the sum over [0, 2) plus the sum over [2, 4) gives the sum over [0, 4). And at the top we have the root, which is the sum of all random variables in this universe. Once we have this tree, computing any range sum becomes a very easy problem.
08:23
Because any range sum is equal to the sum of O(log U) values in this tree. For example, if the query range is from 1 to 7, not including 7, as in this trapezoid, then the range sum is equal to x1 plus the sum over [2, 4) plus the sum over [4, 6) plus x6.
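The decomposition of a query range into O(log U) dyadic intervals can be sketched as follows. This is a minimal helper of our own, assuming the universe size is a power of two:

```python
def dyadic_cover(l, r, lo=0, hi=8):
    """Return the dyadic sub-intervals of [lo, hi) whose disjoint union
    is [l, r). At most two intervals are taken per level of the tree,
    hence O(log U) intervals in total."""
    if l >= r:
        return []
    if l == lo and r == hi:
        return [(lo, hi)]
    mid = (lo + hi) // 2
    return (dyadic_cover(l, min(r, mid), lo, mid)
            + dyadic_cover(max(l, mid), r, mid, hi))

print(dyadic_cover(1, 7))  # → [(1, 2), (2, 4), (4, 6), (6, 7)]
```

This reproduces the talk's example: the range [1, 7) decomposes into x1, the sum over [2, 4), the sum over [4, 6), and x6.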
08:47
However, we cannot use this initialize-and-store scheme, because it has the following two limitations. First, it takes a lot of time to initialize this tree.
09:00
Second, it takes a lot of space to store this tree after initialization. Instead, we propose a fix-and-compute-on-demand scheme. It works as follows. In the initialization phase, we fix some seeds that determine all the values.
09:20
That means all the values of the random variables and the range sums over all the dyadic intervals of this tree. In the query phase, we compute these values on demand using these seeds. The two schemes are equivalent in the sense that they generate random variables and range sums that have the same distribution.
09:42
Our scheme has three benefits. The first benefit is low initialization complexity, because we do not compute the value of every node in the initialization phase. Once the seeds are fixed, they fix a deterministic algorithm by which every node value can be computed.
10:06
But this algorithm is executed only in the query phase. The second benefit, it has low storage cost because we only store these seeds and we do not store any other information of these trees across queries.
10:23
In practice, these seeds are very short. The third benefit is we still have low query complexity because the node values are computed on-demand. Now I will describe how to fix this simulation tree. It is more natural to describe how we generate it.
10:44
To generate means we fix and compute at the same time. After this, I will describe how to fix a particular realization of this tree in the initialization phase without computing it. We want to generate a dyadic simulation tree that satisfies these three conditions.
11:06
Efficiency, distribution, and consistency. Efficiency means any node can be computed from the seeds in just O(log U) time. Distribution means the leaves, the generated random variables, should ideally be i.i.d. with the target distribution.
11:28
Consistency means every non-leaf node of this tree is equal to the sum of its two children. As a simple example, suppose we have eight random variables and the target distribution is standard Gaussian.
11:44
We use the top-down approach to generate this tree. If we want to know what the value of X5 is, we first generate the range sum from 0 to 8. It is the root of this tree, and then the sum from 4 to 8, sum from 4 to 6, and finally X5.
12:04
To start with, this tree is completely random, so we generate its root from a Gaussian with variance 8. To proceed, we need to generate each node conditioned on what has already been generated, to ensure that every node has the correct distribution.
12:23
Thanks to this tree structure, we only need one procedure: generating a node conditioned on its parent. We call this procedure a split operation, because it splits a parent into two children that sum up to the parent. This split operation has two steps.
12:42
In the first step, we generate its left child as a fresh sample from a conditional distribution that we will specify later. In the second step, we generate its right child as the parent minus the left child.
13:02
This split operation has the following properties. We can split all the way down to generate any node that we want, in O(log U) time from the seeds. And it guarantees that the two newly generated children are independent, and each has the correct distribution.
13:21
So if, ideally, all nodes are independently split, then this generates random variables that are i.i.d., each with the target distribution. This split operation is universal: it works for any continuous or discrete target distribution, because we only need to determine the conditional distribution from this formula.
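For the standard Gaussian case, this conditional distribution has a simple closed form: if a parent node covering 2m i.i.d. N(0, 1) leaves has realized value s, then conditioned on s, its left child (covering m leaves) is distributed N(s/2, m/2). A sketch of the resulting split operation (the function name is ours):

```python
import random

def split_gaussian(parent_sum, n_leaves, rng):
    """Split a node covering n_leaves i.i.d. standard-Gaussian leaves
    into two children that sum to parent_sum.
    Given the parent's value s, the left half (m = n_leaves // 2 leaves)
    is drawn from N(s/2, m/2); the right child is s minus the left."""
    m = n_leaves // 2
    sigma = (m / 2.0) ** 0.5            # gauss() takes a std deviation
    left = rng.gauss(parent_sum / 2.0, sigma)
    right = parent_sum - left
    return left, right

rng = random.Random(0)
left, right = split_gaussian(4.0, 8, rng)
# left + right equals 4.0 up to floating-point rounding:
# consistency holds by construction.
```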
13:47
This conditional distribution depends on the parent, on the number of leaves under each child, and on the target distribution. Since these split operations are actually performed in the query phase, we
14:01
need a very efficient procedure to sample from this conditional distribution. If the target distribution is Gaussian, then we only need to generate Gaussian random variables, and that is a well-studied problem. For other distributions, the formula of this conditional distribution may be very complicated.
14:23
So generating a random variable from this conditional distribution efficiently is not a trivial task. We have worked out efficient rejection-sampling implementations for the cases where the target distribution is Cauchy or random walk.
14:43
Now I will show that any range-sum query can be answered in O(log U) time. Note that in our fix-and-compute-on-demand scheme, answering any range-sum query is equivalent to generating the range sum from the seeds.
15:02
It is easy to show an O(log² U) bound, because any range-sum query is equal to the sum of O(log U) nodes in this tree, and each node can be computed in O(log U) splits. In fact, for any range-sum query, if we count more carefully, only two split operations are needed at every level of this tree.
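Putting the pieces together, here is a toy end-to-end sketch for the Gaussian case. It is our illustration, not the paper's code: each node's split randomness is derived deterministically from a seed and the node's position (a simplified stand-in for the hash-based seeding the talk describes), so every query recomputes the same values and consistency holds.

```python
import random

def _node_rng(seed, lo, hi):
    # Deterministic per-node randomness: same node, same draw, every query.
    return random.Random(hash((seed, lo, hi)))

def range_sum(l, r, lo, hi, node_sum, seed):
    """Range sum over [l, r) inside dyadic node [lo, hi) whose value is
    node_sum (standard-Gaussian leaves). At most two recursive branches
    survive per level, giving O(log U) splits per query."""
    if l >= r:
        return 0.0
    if l == lo and r == hi:
        return node_sum
    mid = (lo + hi) // 2
    m = (hi - lo) // 2
    rng = _node_rng(seed, lo, hi)
    left = rng.gauss(node_sum / 2.0, (m / 2.0) ** 0.5)
    right = node_sum - left  # children sum to the parent by construction
    return (range_sum(l, min(r, mid), lo, mid, left, seed)
            + range_sum(max(l, mid), r, mid, hi, right, seed))

def query(l, r, U, seed):
    # The root is a Gaussian with variance U, drawn from the seed alone.
    root = random.Random(seed).gauss(0.0, U ** 0.5)
    return range_sum(l, r, 0, U, root, seed)

print(query(2, 6, 8, 42) == query(2, 6, 8, 42))  # → True (consistent)
```

The consistency requirement from the start of the talk holds here: querying a whole range and summing sub-range queries give numerically identical answers.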
15:25
So the total time complexity of answering this query is just O(log U). Now I will describe how to fix this tree, having described how to generate it. In the generating algorithm, the only random part is the split operation.
15:44
Therefore, fixing a tree means to fix how each node is split. The most straightforward way to fix it is to store the value of every generated child. Unfortunately, if we want to fix an ideal tree in which all splits are independent, we need to
16:03
store a large number of independent random variables, which is very costly. We will describe a space-efficient way that is not ideal but still has some guarantees. In our space-efficient scheme, we fix a dyadic simulation tree by fixing O(log U) independent instances of k-wise independent hash functions.
16:30
The seeds of these hash functions become the generation seeds of this dyadic simulation tree. A k-wise independent hash function means for any set of k different keys, their hash values are independent.
16:44
Such hash functions can be implemented with short seeds and evaluated very quickly, as in these two examples. In our scheme, we use one hash function to determine all split operations on a particular level of this tree.
17:05
For example, if we want to split the j-th node on the i-th level, we use the hash value H i j, which is the hash function H i with its key being j, to sample from the conditional distribution.
17:23
This hash value is a uniform random integer, and it is a determined value, fixed by the seeds of these hash functions. To answer any range-sum query, we need to evaluate only O(log U) hash values, one for each split operation.
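One standard construction of a k-wise independent hash family, assumed here for illustration (the talk's two example constructions are on the slide, not in the transcript), is a uniformly random polynomial of degree k-1 over a prime field:

```python
import random

MERSENNE_P = (1 << 61) - 1  # prime field size (an assumed, common choice)

def make_kwise_hash(k, rng):
    """Draw h from a k-wise independent family:
    h(j) = (a_0 + a_1*j + ... + a_{k-1}*j^(k-1)) mod p with uniformly
    random coefficients. The seed is just the k coefficients; one
    independent instance would be drawn per level of the tree."""
    coeffs = [rng.randrange(MERSENNE_P) for _ in range(k)]
    def h(j):
        acc = 0
        for a in reversed(coeffs):  # Horner's rule, highest degree first
            acc = (acc * j + a) % MERSENNE_P
        return acc
    return h

h = make_kwise_hash(4, random.Random(7))  # a 4-wise independent instance
```

The short seed (k field elements per level) is what makes the fix-and-compute-on-demand scheme's storage cost low.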
17:43
Although our scheme is not ideal, we have proved the following two guarantees. The first guarantee is that if we implement this tree using k-wise independent hash functions, then the generated random variables are also k-wise independent.
18:03
The second guarantee is that if k equals 2, then we guarantee that every returned range sum has the correct distribution. For example, if the target distribution is standard Gaussian, and the query range covers 99 random variables, then we return a range sum that is Gaussian with variance 99.
18:27
In conclusion, we propose a dyadic simulation framework for the efficient range-summability problem. It consistently computes any range sum in just O(log U) time.
18:41
It supports in principle any target distribution of the random variables. It guarantees k-wise independence of the generated random variables, and guarantees the correct distribution of any generated range sum. It has very fast implementations for the Gaussian, Cauchy, and random walk distributions.
19:04
Thank you for listening to my talk. Here are the references.