We're sorry but this page doesn't work properly without JavaScript enabled. Please enable it to continue.
Feedback

Approximate Nearest Neighbor for Curves --- Simple, Efficient, and Deterministic

00:00

Formal Metadata

Title
Approximate Nearest Neighbor for Curves --- Simple, Efficient, and Deterministic
Subtitle
Q/A Session A - Paper A1.E
Title of Series
Number of Parts
123
Author
Contributors
License
CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
SicNeuroinformatikCurveApproximationDeterminismDistanceData structureoutputCurveApproximationComputer animation
Convex hullNormed vector spaceScalable Coherent InterfaceDean numberCurveQuery languageCurveMassQuery languageSet (mathematics)Data structureDistanceEndliche ModelltheorieDemosceneRevision controlMeasurementApproximationoutputGroup actionMultiplication signComputer animation
CurveQuery languageIRIS-TReduction of orderArtificial neural networkDistanceApproximationMultiplication signReduction of orderDecision theoryRevision controlCurveParameter (computer programming)Compact operatorCASE <Informatik>Product (business)Term (mathematics)Computer clusterComputer animation
CurveLengthMeasurementCurveDistanceDiagram
CurveSequenceMultiplication signPoint (geometry)1 (number)Thomas BayesSequenceCurvePosition operatorComputer animation
CurveSequenceSubject indexingSubject indexingPosition operatorCurveSequencePoint (geometry)Different (Kate Ryan album)NumberCellular automatonTotal S.A.Line (geometry)Computer animation
CurveSequenceSubject indexingDistanceDifferent (Kate Ryan album)NumberCurveTotal S.A.DistanceFlow separationMeasurementMaxima and minimaOpen setPoint (geometry)Descriptive statisticsComputer animation
Maxima and minimaCurveDistanceData flow diagramDistancePolygonDiscrete groupMaxima and minimaCAN busComputer animation
DistanceCurveMaxima and minimaData flow diagramComputer programmingSummierbarkeitDistanceFamilyNetbookProcess (computing)Multiplication signCondition numberApproximationAngleCurveAlgorithmDiscrete groupComputer animation
CurvePoint (geometry)Query languageDistanceDistanceCurveQuery languageoutputSet (mathematics)SpacetimePreprocessorDiscrete groupAlgorithmData storage deviceMultiplication signNeuroinformatikProcess (computing)ApproximationComputer animation
Query languageApproximationSpacetimeMetric systemProduct (business)System on a chipArtificial neural networkData flow diagramNumbering schemeDomain nameDivisorApproximationQuery languageMultiplication signPolynomialCurveDiscrete groupNumbering schemeSpacetimeResultantPrice indexMatrix (mathematics)DistanceProduct (business)Logical constantSensitivity analysisMetropolitan area networkPotenz <Mathematik>Integrated development environmentLocal ringDigital photographyComputer animation
Metric systemSystem on a chipProduct (business)Numbering schemeSpacetimeApproximationQuery languageCurveData flow diagramBound stateSpacetimeMultiplication signFeasibility studyData storage deviceQuery languagePotenz <Mathematik>Point (geometry)ApproximationProduct (business)DivisorData structureBasis <Mathematik>NeuroinformatikInfinityComputer animation
Product (business)Metric systemNumbering schemeQuery languageExtreme programmingApproximationSpacetimeData flow diagramCurveSpacetimeQuery languageLinearizationResultantApproximationMultiplication signDiscrete groupNeuroinformatikDistanceComputer animation
Data structureUniform convergenceData flow diagramLengthoutputLimit of a functionSquare numberRootForm (programming)NeuroinformatikComputer animation
Data structureUniform convergenceData flow diagramCurveQuery languageVertex (graph theory)Set (mathematics)Vertex (graph theory)CurveQuery languageData storage deviceMetropolitan area networkUsabilityGraph coloringRoundness (object)Computer animation
CurveInequality (mathematics)Square numberDistancePole (complex analysis)CurveSequenceSet (mathematics)Point (geometry)Discrete groupRadiusRootData structureMaxima and minimaRoundness (object)Query languageMultiplication signLengthAlgorithmMereologyTraffic reportingNumberCellular automatonCausalityQueue (abstract data type)PlanningUsabilityComputer animation
Point (geometry)Point (geometry)Computer configurationGoodness of fitMultiplication signNumberDistanceCurveTotal S.A.SequenceLimit of a functionOrder (biology)RadiusComputer animation
Point (geometry)Total S.A.NeuroinformatikMultiplication sign40 (number)NumberPower (physics)CurveComputer animation
Point (geometry)Limit of a functionMultiplication signGoodness of fitNumberPoint (geometry)Total S.A.DistanceOrder (biology)CurveComputer animation
CurveTotal S.A.CurveLimit of a functionTotal S.A.Constructor (object-oriented programming)CountingPoint (geometry)NumberOrder (biology)CubeSet (mathematics)Wave packetTerm (mathematics)CASE <Informatik>Computer configurationUsabilityImplementationFreewareMultiplication signData structureComputer animationDiagram
Computer configurationNetwork topologyCurveImplementationConstructor (object-oriented programming)Computer configurationMultiplication signTable (information)CurveOrder (biology)Hash functionDivisorImplementationQuery languageSet (mathematics)Goodness of fitFamilyExtreme programmingMathematical optimizationSystem callComputer animation
CurveImage warpingAerodynamicsSystem on a chipDistanceLengthPower (physics)Maxima and minimaSummierbarkeitL1-NormVector spaceProgrammschleifeMultiplication signOrder (biology)VotingDescriptive statisticsSet (mathematics)Normal (geometry)Instance (computer science)Computer animation
CurveImage warpingAerodynamicsData flow diagramSimilarity (geometry)ProgrammschleifeDiscrete groupDynamical systemMultiplication signDistanceComputer animation
Similarity (geometry)Inequality (mathematics)Parameter (computer programming)SpacetimeBound stateFinitismusDistanceBound stateCountingMaxima and minimaSpacetimeDifferent (Kate Ryan album)Point (geometry)SummierbarkeitParameter (computer programming)LengthPower (physics)Squeeze theoremPlanningForm (programming)FamilySocial classXMLComputer animation
Data flow diagramApproximationQuery languageSpacetimeApproximationPotenz <Mathematik>AlgorithmDivisorQuery languageMultiplication signComputer animation
Query languageExtension (kinesiology)ApproximationSpacetimePlanningObservational studyoutputExtension (kinesiology)CASE <Informatik>NeuroinformatikNumberSet (mathematics)Cellular automatonLengthQuery languageCurveBound stateData structureComputer animation
Query languageApproximationCountingRange (statistics)outputCurveSpacetimeExtension (kinesiology)Bound stateNumberRange (statistics)outputDistanceCurveData structureApproximationMultiplication signSet (mathematics)Extension (kinesiology)Adaptive behaviorAdditionComputer animation
Bound stateLogical constantSpacetimePolynomialApproximationOpen setSpacetimeNumberApproximationObservational study2 (number)ExponentiationData structureBound stateOpen setDivisorPolynomialComputer animation
Logical constantSpacetimeApproximationPolynomialBound stateOpen setComputer animation
Transcript: English(auto-generated)
Hi, my name is Omrit Filzer and I will talk about the Approximate Nearest Neighbors problem for curves. This is a joint work with Arnold Filzer and Mattia Katz.
Assume you have a massive dataset of curves. Now you want to arrange them in some data structure such that given a query curve, you can quickly find the closest curve from your input set. Let's denote our dataset by C, let delta be some distance measure for curves, and Q denotes
a query curve. In the Nearest Neighbors problem, our goal is to preprocess the set C, such that given a query Q, we can return the input curve P*, which is the closest to Q.
We can also consider an approximation version of this problem, where the goal is to return a curve P within distance 1 plus epsilon times the distance to the nearest neighbor. To solve the approximation version, we can first consider a decision version of the
problem, in which we are also given a distance parameter r. The guarantee in this case is that if there exists a curve P in the dataset within distance r of Q, then we return some curve P' within distance 1 plus epsilon r.
And there exists an efficient reduction between the approximation problem and the approximate decision problem, and therefore, we can focus on the decision problem. Before talking about the Nearest Neighbors problem, we first need to define our distance
measure for curves. And for that, let's first talk about curve alignments. Consider two polygonal curves, A and B, both of length m. Now, imagine two frogs. The A frog is standing on the first red point, and the B frog is standing on the
first blue point. They keep hopping from point to point along their sequence. Each time, either the A frog, the B frog, or both hop exactly one step forward. They continue until they reach the end point of the respective sequences.
A curve alignment, tau, is a sequence of positions, or pairs of indexes, that describes the movement of the two frogs along the curve. Every alignment starts with the position 1,1 and ends with the position m,m, which
are the indexes of the start point and ending points. The indexes in each pair can increase by at most 1, comparing to the previous pair. It is an easy exercise to check that the total number of different curve alignments
is bounded by 2 to the 2m. There are several different distance measures that are based on these curve alignments. For two polygonal curves, A and B, each with m d-dimensional points, if we minimize
over all possible alignments the maximum distance between the frogs, then we get the discrete Fréchet distance. If instead of max, we minimize the sum of distances between the frogs, then we get
dynamic time-warping. Both these distances between two curves can be computed in quadratic time using a simple dynamic programming algorithm. And for the discrete Fréchet distance, there exists a conditional lower bound showing
that strongly subquadratic algorithm is unlikely to exist. Back to the approximate nearest neighbor problem, recall that our goal is to preprocess a set C of n curves such that given a query curve Q, if there exists a curve P within
distance r of Q, then we want to return a curve P' within distance 1 plus epsilon r. In general, there are two extreme approaches for a solution.
The first one is not doing any preprocessing. Just store the set C and let the query algorithm do all the work. But then, obviously, the query time is huge. The second one is to precompute and then store the answers to a discretization of
all possible yes queries. And by yes queries, I mean that they have some input curve close enough to them. And the query algorithm just do a quick look up to get the answer. But then obviously the space bound will be large.
But how large it will be? The first to study this problem under the discrete Fréchet distance was Indic in 2002. He used the idea of product matrix and he got a constant approximation with query time polynomial in m log n.
The space bound is exponential in m and depends on the size of the domain on which the curves are defined. Then, only in 2017, Dremel and Silvestri presented a locality-sensitive Hessian scheme for curves under the discrete Fréchet distance.
They improved the result of Indic for short curves, but both space and the query time are exponential in m, and the approximation factor is polynomial in D. A year later, Emiris and Seiros managed to obtain a one-plus-epsilon approximation
factor. Their idea was to try all the possible alignments, which is why their query time is still exponential. Once the alignment is fixed, the problem can be reduced to the near-neighbor problem for points in L-infinity product of L2 spaces.
As you can see, all these data structures have either an exponential in m query time or an infeasible storage space bound. In our paper, we actually take the extreme approach of precomputing a discretization of
all the s queries to get a one-plus-epsilon approximation. Surprisingly, not only we get a linear query time, but we also significantly improve the space bound over the previous results. It is also interesting to note that the query time is significantly smaller than the lower
bound for computing the distance, because it means that verifying the correctness of a solution is more expensive than finding it. The basic idea in our approach is as follows.
Start by laying a uniform grid over the input curves, with edge length epsilon r over square root D. Then compute a set I of grid curves, representing all the possible yes queries, and store the
answers for all the curves in the set I. Now given a query Q, just snap its vertices to the grid, and this can be simply done by rounding, and we get a grid curve Q hat.
And the guarantee is that if Q is a yes query, then Q hat is in the set I, and we can return the precomputed answer. So how do we choose the set of grid curves?
For one curve P in the dataset, denote by I of P the set of all grid curves that are within distance one-plus-epsilon over two times r from P. And let's assume for simplicity from now on that r equals one.
So the points of the grid curves are in this sequence of poles of radius one-plus-epsilon over two around the points of P. For a query curve Q, what will be the distance
between Q and the rounded curve, Q hat? So since the length of the grid edges is epsilon over square root d, we get that the distance between a point of Q to its closest grid point is at most epsilon over two.
That means that under the discrete crochet distance, which is the maximum over all pairs in an alignment, the distance between Q and Q hat is at most epsilon over two, because
we can just match each point to the closest grid point. The correctness of the query algorithm now follows from the trianguline quality, because if Q is a yes query, then the distance between Q and P is at most one.
And then by the trianguline quality, the distance between Q hat and P is at most one-plus-epsilon over two, which means that Q hat is in the set I of P. So the data structure will return a curve with distance at most one-plus-epsilon over
two to Q hat. And this curve will be a distance at most one-plus-epsilon to Q. But how many grid curves are there in I of P? The number of grid points in a ball of radius one-plus-epsilon over two is order
of one over epsilon to the d. And again, the points of our grid curves are contained in this sequence of balls of radius one-plus-epsilon over two around the points of P. But there can be many different options to choose the points of a grid curve that
has this distance to P. Some curves can have many points in the first ball. Others can have only one point in the first two balls. The total number of relevant grid points is m times the size of one ball.
And naively, the number of grid curves is bounded by this number to the power of m. But can we get a better bound by counting more carefully?
I claim that we can get one over epsilon to the m d, and that we can also compute I of P in this time. So we know that the number of grid points in one ball is order of one over epsilon to the d, and that the total number of curve alignments is two to the two m.
Let's fix an alignment tau. Now we ask, how many grid curves are there within distance one-plus-epsilon over two from P, when using only the alignment tau?
I saw for example that our alignment is as follows. And now we need to pick a grid curve for the set I-P. We have to pick the first point, q1, from the ball around p1. And q2, from the ball around p2.
The third point, q3, lies in the intersection of three balls, the one around p3, the one around p4, and the one around p5. But we allow ourselves to do overcounting and choose q3 from one of them,
say the ball around p3. So the counting is constructive. For each k from one to m, we pick qk, which is the kth point of the grid curve, from the ball of Pj, for some pair jk in the alignment tau.
Now since we pick each point of the grid curve from a ball that contains one over epsilon to the d points, the total number of grid curves for a fixed alignment is order of one over epsilon to the md.
Therefore, the total number of grid curves is bounded by the number of alignments, multiplied by the number of curves per alignment, which is in total, order of one over epsilon to the md.
For the implementation of the data structure, we suggest two options. The first option is the most efficient. Just store the set of grid curves in a hash table using cuckoo hashing. The lookup time is then optimal.
Order of md, which is the size of the query. And we have a randomized construction. The second option is to store the grid curves in a prefix tree, which gives a fully deterministic solution, but at the cost of a logarithmic factor in the query time.
Now let's consider dynamic time looping. Can we use the same approach here? Recall that for two polygonal curves, a and b of length m, the fresh air distance is minimizing the maximum distance over all alignments.
And the dynamic time looping is minimizing the sum of distances. More generally, Emirs and Psaros defined the lp2 distance for curves, which is the lp norm of the vector of power's distances in the alignment.
Notice that when p is infinity, we get the discrete fresh air distance, and for p equals one, we get dynamic time looping. We apply a similar approach for the lp2 distance, but there are some challenges here.
First, we cannot use that triangle inequality, which does not apply to the lp2 distance for finite p. The second challenge is that the grid that we use has to be more dense.
The edge length now depends on m to the power of one over p. If we use our previous arguments, we get a larger bound. So we have to use a different counting argument that takes into account the sum of distances and not their maximum.
For example, if the distance between the first two points is already large, then the rest of the points are chosen from much smaller balls. Surprisingly, we obtain almost the same space bounds.
For any finite p, there is only this additional plus one here, in the stored space. Some of the previous solutions were also applied to dynamic time looping, but again, either with a larger approximation factor
or with an exponential query time. Our algorithm has a one plus epsilon approximation factor and linear running time. We can also extend our solution to the asymmetric case, where the length of the query can be smaller than the length of the input curves.
This setting, in general, was first studied in a paper by Dremelb, Seuss, and Schmitt. We adapt our data structure for the asymmetric case by computing simplifications of length k for the input curves,
and construct the set of grid curves i with respect to the simplifications. We get similar bounds except that m is replaced by k. Another extension of this work is for the approximate range counting problem,
when we want to return the number of input curves within distance r from q. The approximation here means that this number might include additional input curves that are within distance one plus epsilon times r from q.
Here, the adaptation of our data structure is easy, because we just need to add a counter to each grid curve in the set i. The counter will count how many closed curves are there.
Two main questions are left open. One is getting lower bounds for the nearest neighbor problem. Our conjecture is that our space bound is almost optimal, up to a constant factor in the exponent. The second question is to find a much smaller data structure with polynomial space,
but at the cost of a constant approximation, instead of one plus epsilon. Thank you for listening!