Approximate Nearest Neighbor for Curves --- Simple, Efficient, and Deterministic
This is a modal window.
The media could not be loaded, either because the server or network failed or because the format is not supported.
Formal Metadata
Title |
| |
Subtitle |
| |
Title of Series | ||
Number of Parts | 123 | |
Author | ||
Contributors | ||
License | CC Attribution 3.0 Germany: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor. | |
Identifiers | 10.5446/49333 (DOI) | |
Publisher | ||
Release Date | ||
Language |
Content Metadata
Subject Area | |
Genre |
14
41
45
49
54
77
108
111
00:00
SicNeuroinformatikCurveApproximationDeterminismDistanceData structureoutputCurveApproximationComputer animation
00:26
Convex hullNormed vector spaceScalable Coherent InterfaceDean numberCurveQuery languageCurveMassQuery languageSet (mathematics)Data structureDistanceEndliche ModelltheorieDemosceneRevision controlMeasurementApproximationoutputGroup actionMultiplication signComputer animation
01:04
CurveQuery languageIRIS-TReduction of orderArtificial neural networkDistanceApproximationMultiplication signReduction of orderDecision theoryRevision controlCurveParameter (computer programming)Compact operatorCASE <Informatik>Product (business)Term (mathematics)Computer clusterComputer animation
01:54
CurveLengthMeasurementCurveDistanceDiagram
02:13
CurveSequenceMultiplication signPoint (geometry)1 (number)Thomas BayesSequenceCurvePosition operatorComputer animation
02:44
CurveSequenceSubject indexingSubject indexingPosition operatorCurveSequencePoint (geometry)Different (Kate Ryan album)NumberCellular automatonTotal S.A.Line (geometry)Computer animation
03:19
CurveSequenceSubject indexingDistanceDifferent (Kate Ryan album)NumberCurveTotal S.A.DistanceFlow separationMeasurementMaxima and minimaOpen setPoint (geometry)Descriptive statisticsComputer animation
03:38
Maxima and minimaCurveDistanceData flow diagramDistancePolygonDiscrete groupMaxima and minimaCAN busComputer animation
04:00
DistanceCurveMaxima and minimaData flow diagramComputer programmingSummierbarkeitDistanceFamilyNetbookProcess (computing)Multiplication signCondition numberApproximationAngleCurveAlgorithmDiscrete groupComputer animation
04:29
CurvePoint (geometry)Query languageDistanceDistanceCurveQuery languageoutputSet (mathematics)SpacetimePreprocessorDiscrete groupAlgorithmData storage deviceMultiplication signNeuroinformatikProcess (computing)ApproximationComputer animation
05:46
Query languageApproximationSpacetimeMetric systemProduct (business)System on a chipArtificial neural networkData flow diagramNumbering schemeDomain nameDivisorApproximationQuery languageMultiplication signPolynomialCurveDiscrete groupNumbering schemeSpacetimeResultantPrice indexMatrix (mathematics)DistanceProduct (business)Logical constantSensitivity analysisMetropolitan area networkPotenz <Mathematik>Integrated development environmentLocal ringDigital photographyComputer animation
06:36
Metric systemSystem on a chipProduct (business)Numbering schemeSpacetimeApproximationQuery languageCurveData flow diagramBound stateSpacetimeMultiplication signFeasibility studyData storage deviceQuery languagePotenz <Mathematik>Point (geometry)ApproximationProduct (business)DivisorData structureBasis <Mathematik>NeuroinformatikInfinityComputer animation
07:16
Product (business)Metric systemNumbering schemeQuery languageExtreme programmingApproximationSpacetimeData flow diagramCurveSpacetimeQuery languageLinearizationResultantApproximationMultiplication signDiscrete groupNeuroinformatikDistanceComputer animation
07:56
Data structureUniform convergenceData flow diagramLengthoutputLimit of a functionSquare numberRootForm (programming)NeuroinformatikComputer animation
08:11
Data structureUniform convergenceData flow diagramCurveQuery languageVertex (graph theory)Set (mathematics)Vertex (graph theory)CurveQuery languageData storage deviceMetropolitan area networkUsabilityGraph coloringRoundness (object)Computer animation
08:56
CurveInequality (mathematics)Square numberDistancePole (complex analysis)CurveSequenceSet (mathematics)Point (geometry)Discrete groupRadiusRootData structureMaxima and minimaRoundness (object)Query languageMultiplication signLengthAlgorithmMereologyTraffic reportingNumberCellular automatonCausalityQueue (abstract data type)PlanningUsabilityComputer animation
11:11
Point (geometry)Point (geometry)Computer configurationGoodness of fitMultiplication signNumberDistanceCurveTotal S.A.SequenceLimit of a functionOrder (biology)RadiusComputer animation
12:01
Point (geometry)Total S.A.NeuroinformatikMultiplication sign40 (number)NumberPower (physics)CurveComputer animation
12:21
Point (geometry)Limit of a functionMultiplication signGoodness of fitNumberPoint (geometry)Total S.A.DistanceOrder (biology)CurveComputer animation
12:59
CurveTotal S.A.CurveLimit of a functionTotal S.A.Constructor (object-oriented programming)CountingPoint (geometry)NumberOrder (biology)CubeSet (mathematics)Wave packetTerm (mathematics)CASE <Informatik>Computer configurationUsabilityImplementationFreewareMultiplication signData structureComputer animationDiagram
14:43
Computer configurationNetwork topologyCurveImplementationConstructor (object-oriented programming)Computer configurationMultiplication signTable (information)CurveOrder (biology)Hash functionDivisorImplementationQuery languageSet (mathematics)Goodness of fitFamilyExtreme programmingMathematical optimizationSystem callComputer animation
15:23
CurveImage warpingAerodynamicsSystem on a chipDistanceLengthPower (physics)Maxima and minimaSummierbarkeitL1-NormVector spaceProgrammschleifeMultiplication signOrder (biology)VotingDescriptive statisticsSet (mathematics)Normal (geometry)Instance (computer science)Computer animation
16:00
CurveImage warpingAerodynamicsData flow diagramSimilarity (geometry)ProgrammschleifeDiscrete groupDynamical systemMultiplication signDistanceComputer animation
16:23
Similarity (geometry)Inequality (mathematics)Parameter (computer programming)SpacetimeBound stateFinitismusDistanceBound stateCountingMaxima and minimaSpacetimeDifferent (Kate Ryan album)Point (geometry)SummierbarkeitParameter (computer programming)LengthPower (physics)Squeeze theoremPlanningForm (programming)FamilySocial classXMLComputer animation
17:29
Data flow diagramApproximationQuery languageSpacetimeApproximationPotenz <Mathematik>AlgorithmDivisorQuery languageMultiplication signComputer animation
17:53
Query languageExtension (kinesiology)ApproximationSpacetimePlanningObservational studyoutputExtension (kinesiology)CASE <Informatik>NeuroinformatikNumberSet (mathematics)Cellular automatonLengthQuery languageCurveBound stateData structureComputer animation
18:35
Query languageApproximationCountingRange (statistics)outputCurveSpacetimeExtension (kinesiology)Bound stateNumberRange (statistics)outputDistanceCurveData structureApproximationMultiplication signSet (mathematics)Extension (kinesiology)Adaptive behaviorAdditionComputer animation
19:17
Bound stateLogical constantSpacetimePolynomialApproximationOpen setSpacetimeNumberApproximationObservational study2 (number)ExponentiationData structureBound stateOpen setDivisorPolynomialComputer animation
19:50
Logical constantSpacetimeApproximationPolynomialBound stateOpen setComputer animation
Transcript: English(auto-generated)
00:12
Hi, my name is Omrit Filzer and I will talk about the Approximate Nearest Neighbors problem for curves. This is a joint work with Arnold Filzer and Mattia Katz.
00:26
Assume you have a massive dataset of curves. Now you want to arrange them in some data structure such that given a query curve, you can quickly find the closest curve from your input set. Let's denote our dataset by C, let delta be some distance measure for curves, and Q denotes
00:47
a query curve. In the Nearest Neighbors problem, our goal is to preprocess the set C, such that given a query Q, we can return the input curve P*, which is the closest to Q.
01:03
We can also consider an approximation version of this problem, where the goal is to return a curve P within distance 1 plus epsilon times the distance to the nearest neighbor. To solve the approximation version, we can first consider a decision version of the
01:21
problem, in which we are also given a distance parameter r. The guarantee in this case is that if there exists a curve P in the dataset within distance r of Q, then we return some curve P' within distance 1 plus epsilon r.
01:41
And there exists an efficient reduction between the approximation problem and the approximate decision problem, and therefore, we can focus on the decision problem. Before talking about the Nearest Neighbors problem, we first need to define our distance
02:01
measure for curves. And for that, let's first talk about curve alignments. Consider two polygonal curves, A and B, both of length m. Now, imagine two frogs. The A frog is standing on the first red point, and the B frog is standing on the
02:21
first blue point. They keep hopping from point to point along their sequence. Each time, either the A frog, the B frog, or both hop exactly one step forward. They continue until they reach the end point of the respective sequences.
02:44
A curve alignment, tau, is a sequence of positions, or pairs of indexes, that describes the movement of the two frogs along the curve. Every alignment starts with the position 1,1 and ends with the position m,m, which
03:05
are the indexes of the start point and ending points. The indexes in each pair can increase by at most 1, comparing to the previous pair. It is an easy exercise to check that the total number of different curve alignments
03:25
is bounded by 2 to the 2m. There are several different distance measures that are based on these curve alignments. For two polygonal curves, A and B, each with m d-dimensional points, if we minimize
03:48
over all possible alignments the maximum distance between the frogs, then we get the discrete Fréchet distance. If instead of max, we minimize the sum of distances between the frogs, then we get
04:06
dynamic time-warping. Both these distances between two curves can be computed in quadratic time using a simple dynamic programming algorithm. And for the discrete Fréchet distance, there exists a conditional lower bound showing
04:23
that strongly subquadratic algorithm is unlikely to exist. Back to the approximate nearest neighbor problem, recall that our goal is to preprocess a set C of n curves such that given a query curve Q, if there exists a curve P within
04:45
distance r of Q, then we want to return a curve P' within distance 1 plus epsilon r. In general, there are two extreme approaches for a solution.
05:00
The first one is not doing any preprocessing. Just store the set C and let the query algorithm do all the work. But then, obviously, the query time is huge. The second one is to precompute and then store the answers to a discretization of
05:22
all possible yes queries. And by yes queries, I mean that they have some input curve close enough to them. And the query algorithm just do a quick look up to get the answer. But then obviously the space bound will be large.
05:42
But how large it will be? The first to study this problem under the discrete Fréchet distance was Indic in 2002. He used the idea of product matrix and he got a constant approximation with query time polynomial in m log n.
06:04
The space bound is exponential in m and depends on the size of the domain on which the curves are defined. Then, only in 2017, Dremel and Silvestri presented a locality-sensitive Hessian scheme for curves under the discrete Fréchet distance.
06:23
They improved the result of Indic for short curves, but both space and the query time are exponential in m, and the approximation factor is polynomial in D. A year later, Emiris and Seiros managed to obtain a one-plus-epsilon approximation
06:43
factor. Their idea was to try all the possible alignments, which is why their query time is still exponential. Once the alignment is fixed, the problem can be reduced to the near-neighbor problem for points in L-infinity product of L2 spaces.
07:06
As you can see, all these data structures have either an exponential in m query time or an infeasible storage space bound. In our paper, we actually take the extreme approach of precomputing a discretization of
07:23
all the s queries to get a one-plus-epsilon approximation. Surprisingly, not only we get a linear query time, but we also significantly improve the space bound over the previous results. It is also interesting to note that the query time is significantly smaller than the lower
07:44
bound for computing the distance, because it means that verifying the correctness of a solution is more expensive than finding it. The basic idea in our approach is as follows.
08:02
Start by laying a uniform grid over the input curves, with edge length epsilon r over square root D. Then compute a set I of grid curves, representing all the possible yes queries, and store the
08:22
answers for all the curves in the set I. Now given a query Q, just snap its vertices to the grid, and this can be simply done by rounding, and we get a grid curve Q hat.
08:44
And the guarantee is that if Q is a yes query, then Q hat is in the set I, and we can return the precomputed answer. So how do we choose the set of grid curves?
09:03
For one curve P in the dataset, denote by I of P the set of all grid curves that are within distance one-plus-epsilon over two times r from P. And let's assume for simplicity from now on that r equals one.
09:27
So the points of the grid curves are in this sequence of poles of radius one-plus-epsilon over two around the points of P. For a query curve Q, what will be the distance
09:45
between Q and the rounded curve, Q hat? So since the length of the grid edges is epsilon over square root d, we get that the distance between a point of Q to its closest grid point is at most epsilon over two.
10:08
That means that under the discrete crochet distance, which is the maximum over all pairs in an alignment, the distance between Q and Q hat is at most epsilon over two, because
10:22
we can just match each point to the closest grid point. The correctness of the query algorithm now follows from the trianguline quality, because if Q is a yes query, then the distance between Q and P is at most one.
10:40
And then by the trianguline quality, the distance between Q hat and P is at most one-plus-epsilon over two, which means that Q hat is in the set I of P. So the data structure will return a curve with distance at most one-plus-epsilon over
11:02
two to Q hat. And this curve will be a distance at most one-plus-epsilon to Q. But how many grid curves are there in I of P? The number of grid points in a ball of radius one-plus-epsilon over two is order
11:24
of one over epsilon to the d. And again, the points of our grid curves are contained in this sequence of balls of radius one-plus-epsilon over two around the points of P. But there can be many different options to choose the points of a grid curve that
11:44
has this distance to P. Some curves can have many points in the first ball. Others can have only one point in the first two balls. The total number of relevant grid points is m times the size of one ball.
12:07
And naively, the number of grid curves is bounded by this number to the power of m. But can we get a better bound by counting more carefully?
12:25
I claim that we can get one over epsilon to the m d, and that we can also compute I of P in this time. So we know that the number of grid points in one ball is order of one over epsilon to the d, and that the total number of curve alignments is two to the two m.
12:47
Let's fix an alignment tau. Now we ask, how many grid curves are there within distance one-plus-epsilon over two from P, when using only the alignment tau?
13:01
I saw for example that our alignment is as follows. And now we need to pick a grid curve for the set I-P. We have to pick the first point, q1, from the ball around p1. And q2, from the ball around p2.
13:22
The third point, q3, lies in the intersection of three balls, the one around p3, the one around p4, and the one around p5. But we allow ourselves to do overcounting and choose q3 from one of them,
13:42
say the ball around p3. So the counting is constructive. For each k from one to m, we pick qk, which is the kth point of the grid curve, from the ball of Pj, for some pair jk in the alignment tau.
14:08
Now since we pick each point of the grid curve from a ball that contains one over epsilon to the d points, the total number of grid curves for a fixed alignment is order of one over epsilon to the md.
14:25
Therefore, the total number of grid curves is bounded by the number of alignments, multiplied by the number of curves per alignment, which is in total, order of one over epsilon to the md.
14:44
For the implementation of the data structure, we suggest two options. The first option is the most efficient. Just store the set of grid curves in a hash table using cuckoo hashing. The lookup time is then optimal.
15:02
Order of md, which is the size of the query. And we have a randomized construction. The second option is to store the grid curves in a prefix tree, which gives a fully deterministic solution, but at the cost of a logarithmic factor in the query time.
15:24
Now let's consider dynamic time looping. Can we use the same approach here? Recall that for two polygonal curves, a and b of length m, the fresh air distance is minimizing the maximum distance over all alignments.
15:41
And the dynamic time looping is minimizing the sum of distances. More generally, Emirs and Psaros defined the lp2 distance for curves, which is the lp norm of the vector of power's distances in the alignment.
16:01
Notice that when p is infinity, we get the discrete fresh air distance, and for p equals one, we get dynamic time looping. We apply a similar approach for the lp2 distance, but there are some challenges here.
16:23
First, we cannot use that triangle inequality, which does not apply to the lp2 distance for finite p. The second challenge is that the grid that we use has to be more dense.
16:43
The edge length now depends on m to the power of one over p. If we use our previous arguments, we get a larger bound. So we have to use a different counting argument that takes into account the sum of distances and not their maximum.
17:03
For example, if the distance between the first two points is already large, then the rest of the points are chosen from much smaller balls. Surprisingly, we obtain almost the same space bounds.
17:20
For any finite p, there is only this additional plus one here, in the stored space. Some of the previous solutions were also applied to dynamic time looping, but again, either with a larger approximation factor
17:41
or with an exponential query time. Our algorithm has a one plus epsilon approximation factor and linear running time. We can also extend our solution to the asymmetric case, where the length of the query can be smaller than the length of the input curves.
18:05
This setting, in general, was first studied in a paper by Dremelb, Seuss, and Schmitt. We adapt our data structure for the asymmetric case by computing simplifications of length k for the input curves,
18:22
and construct the set of grid curves i with respect to the simplifications. We get similar bounds except that m is replaced by k. Another extension of this work is for the approximate range counting problem,
18:43
when we want to return the number of input curves within distance r from q. The approximation here means that this number might include additional input curves that are within distance one plus epsilon times r from q.
19:02
Here, the adaptation of our data structure is easy, because we just need to add a counter to each grid curve in the set i. The counter will count how many closed curves are there.
19:21
Two main questions are left open. One is getting lower bounds for the nearest neighbor problem. Our conjecture is that our space bound is almost optimal, up to a constant factor in the exponent. The second question is to find a much smaller data structure with polynomial space,
19:44
but at the cost of a constant approximation, instead of one plus epsilon. Thank you for listening!