Tight Trade-offs for the Maximum k-Coverage Problem in the General Streaming Model
This is a modal window.
The media could not be loaded, either because the server or network failed or because the format is not supported.
Formal Metadata
Title |
| |
Title of Series | ||
Number of Parts | 155 | |
Author | ||
License | CC Attribution 3.0 Germany: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor. | |
Identifiers | 10.5446/42836 (DOI) | |
Publisher | ||
Release Date | ||
Language |
Content Metadata
Subject Area | ||
Genre | ||
Abstract |
|
SIGMOD 20198 / 155
34
37
44
116
120
122
144
148
155
00:00
Maxima and minimaStreaming mediaData modelData managementUniverse (mathematics)SubsetoutputMaxima and minimaAlgorithmData recoverySet (mathematics)SubsetElement (mathematics)Universe (mathematics)Lecture/ConferenceComputer animation
00:22
Data managementUniverse (mathematics)outputMaxima and minimaParameter (computer programming)SubsetInstance (computer science)Parameter (computer programming)CASE <Informatik>Set (mathematics)Lecture/ConferenceComputer animation
00:38
Data managementUniverse (mathematics)outputSubsetMaxima and minimaParameter (computer programming)Instance (computer science)Set (mathematics)System callRight angleCASE <Informatik>Classical physicsLecture/ConferenceComputer animation
00:53
outputUniverse (mathematics)SubsetMaxima and minimaApproximationParameter (computer programming)Information retrievalData miningMilitary operationMachine learningQuery languageSign (mathematics)Streaming mediaModel theoryCovering spaceAlgorithmLinear mapSpacetimeSet (mathematics)Element (mathematics)Euclidean vectorConnectivity (graph theory)Random numberSample (statistics)Sampling (music)DivisorMathematical optimizationGraph (mathematics)Software testingSampling (statistics)NumberGraph (mathematics)ResultantString (computer science)Set (mathematics)Service (economics)Fiber (mathematics)Instance (computer science)CASE <Informatik>Covering spaceStreaming mediaPhysical systemTask (computing)Graph (mathematics)Group actionConnectivity (graph theory)WordMultiplication signMessage passingRandomizationMaxima and minimaFrequencyState observeroutputElement (mathematics)Wave packetArchaeological field surveyNoise (electronics)Goodness of fitSet (mathematics)ForestQuicksortSummierbarkeitData streamData miningInformation retrievalComplex (psychology)SpacetimeLimit of a functionModel theoryDivisorApproximationsalgorithmusFamily of setsE (mathematical constant)Fiber bundleDifferent (Kate Ryan album)Cartesian coordinate systemStandard deviationIterationParticle systemWeb 2.0AreaNetwork topologyFraction (mathematics)Representation (politics)Graph (mathematics)MereologyEndliche ModelltheorieAlgorithmGreedy algorithmLambda calculusSample (statistics)Logical constantUniverse (mathematics)Computer animation
09:33
Maxima and minimaGraph (mathematics)ApproximationSpacetimeEstimationTraffic reportingCovering spaceAlgorithmMathematical optimizationSet (mathematics)Public key certificateMathematicsElement (mathematics)Universe (mathematics)Reduction of orderSample (statistics)Computer clusterSign (mathematics)ACIDSampling (music)Random numberCASE <Informatik>CountingStreaming mediaoutputFraction (mathematics)NumberBit rateSet (mathematics)Equals signKolmogorov complexityFamilyTotal S.A.SubsetMultiplicationSquare numberGamma functionCASE <Informatik>String (computer science)PolynomialPublic key certificateSampling (statistics)Set (mathematics)Element (mathematics)NumberEstimatorUniverse (mathematics)Bit rateGraph (mathematics)Vector spaceStreaming mediaScaling (geometry)MereologyAlpha (investment)Instance (computer science)Equaliser (mathematics)Connectivity (graph theory)FrequencyReduction of orderData structureSocial classPresentation of a groupNormal (geometry)Term (mathematics)AreaSpacetimeComplex (psychology)Multiplication signDivisorMaxima and minimaImage resolutionApproximationsalgorithmusMessage passingParalleler AlgorithmusGraph (mathematics)ResultantAlgorithmFraction (mathematics)Set (mathematics)Mathematical optimizationCovering spaceLogical constantSimilarity (geometry)Fiber bundleOptimization problemRevision controlGroup actionSummierbarkeitRight angleBeta functionOffice suiteSoftware testingPlotterWeb pageCompass (drafting)DecimalMatching (graph theory)Likelihood functionFormal languageFamilyComputer animation
18:13
Lecture/Conference
Transcript: English(auto-generated)
00:04
I'm going to talk about the streaming algorithm for maximum k-coverage problem. Let me first define the problem so in this problem we are given a collection of sets s1 to Sm and Each is a subset of a universe u that contains n elements e1 to en
00:22
okay, so for instance you can see the example here and We are also given a parameter k and the goal is to find k sets in collection f in the input such that the union is maximized, so it's a very
00:40
So for instance if k is equal to 2 then s1 and s2 are the best two sets whose union is maximized in this example so this problem is a very classic classical problem, and it's known to be empty heart and We know that the standard greedy algorithm that runs in Iteration and iterations and in each iteration it picks
01:02
the sets that cover the maximum number of yet uncovered elements achieve 1 minus 1 over e approximation and Actually, it's the best we can hope for unless p is equal Okay, so the complexity of the problem is very clear in the standard setting And it's also has I mean many applications so the maximum k coverage problem and also
01:25
May very similar problem the set cover problem has many application in different area such as information retrieval data mining web search advertising and so on you can count And But I mean by the in this talk I'm going to focus on this problem in the streaming setup
01:44
So basically the motivation is when we want to solve the problem for large data sets and in particular in the case of streaming Model so people started looking at this problem So so actually I mentioned that is the general streaming case in the title of the paper, so let's see what it means
02:05
so people have started Looking at this problem in the set arrival stream in which that They assume we have a data stream where the sets are coming one by one and when a set arrives You will see all its elements so for instance here
02:22
We will see a swan that contains e1 to e9 e1 e6 e3 e9 and so on Okay I mean I'm most of the results over in this model the set arrival stream But there is a more general streaming model and in particular
02:40
captured the set arrival stream, which is called edge arrival stream so in this model we look at the bipartisan graph representation of the set systems Where we have the elements in one side and the sets in the other side And then there exists an an edge between an element and a set if if the sets contain the elements
03:01
okay, and then we assume that in the streaming model we receive the edges one by one and There is no guarantee that the elements that appear in a in the same sets comes next to each other They may come in an arbitrary order so for instance here we may first see that e3 is in a swan and then we may see just e5 in is in s4 and
03:26
Maybe like that and so on okay as you see. I mean the elements of a Certain set may come out of order And it's actually clear easy to see that the edge arrival stream is more general than the set arrival one
03:40
Okay, so let's see what exactly the input to this problem and what you we want to solve so we are given a edge arrival stream of the maximum coverage problem and we would like to design an algorithm that read through the input in one pass and Only use a sublinear amount of a space and find the approximate solution
04:04
So here we assume that we have M sets and N elements, so The the total space that the input can have could be as large as m times n so by sublinear space I mean little of m times n and we will see actually the space is much better than
04:22
Okay, so before going to going through our result lets me first Just revisit some of the Sampling component that have been widely used in this for this problem of sub cover problem and maximum coverage problem in a streaming setting
04:44
So one is the and this actually are very simple and but So the first one is set sampling which says You can pick just L random sets and with high probability the frequency elements will be covered So, let's see more carefully and so basically it says if you pick all random sets
05:04
Then the elements that appear more than M over L sets will be covered So for instance here if I pick two random set I expect that the elements that appear more than half of the sets in my set system will be covered so here for instance I If I pick these two just randomly then eat as you see e2
05:24
E6 and e9 that are appearing more than half of the sets will be covered Actually, it's a very simple observation But it's a very nice because we can just randomly pick L sets and get rid of the high frequency elements and these elements are those that are contributing the most to the
05:44
space complexity of the problem actually and The other component is saying that okay if we want to solve the problem Let's say set cover or in our case max cover problem You can just sample a few elements and solve your problem on the sample set of elements
06:01
And then the solution that you find there will be with high probability is a good solution for the original instance as well Okay, let's see more carefully Suppose I just sampled e1 e3 e7 e10 in this example, then I just forget about the rest of elements and Find the cake good cake cover for these sampled element. Let's say the three I
06:26
Have the budget of K equal to 3 and this is the one that actually cover everything in my sample set and If we see the original instance here, we see that it's actually a good solution. It's cover almost all elements so more formally if
06:42
if we know that the there exists cases that cover that cover everything then we can just Randomly pick K over lambda elements from the from the universe Find the find the solution to the random random sampled element
07:01
Then that solution is going to cover 1 minus lambda fraction of the whole elements Okay so I mean this is helpful because now our space complexity would be something like the Sample size the number of element that we have samples times the number of sets rather than n times m
07:23
Okay, and it's actually I mean the the idea here is that you can just focus on much smaller Instance and solve that instance and that will help you to To decrease the number of uncovered elements and you can just repeat that that thing together solution
07:41
Okay So these are these are two simple sampling components that that actually was used before for set cover problem Okay, so now now that we know that these two simple component Let's try to see how the algorithm looks like So this is the requirement that we had for the streaming maximum K coverage problem
08:04
and just I have a few remarks so edge arrival assumes are more general and the sampling based approaches that I describe are actually worked on edge arrival streams because just like a sketching we can just always compute them even if the edges are coming out of order and
08:22
But there is some Some subtlety here for instance in the set arrival stream There is a this primitive task of testing the size of set which is very simple to to do, right? I mean for just testing what's the size of a set in a set? I will assume is actually trivial
08:42
but this is this is more challenge is where this is challenging actually in the edge arrival stream because you have to wait until the End of a stream to figure out what's the size of a set? Okay, and that's actually I mean this just a small thing make a make a very different Make make the complexity of these two
09:02
stream models very different Okay, so let's see what what we know about this problem We know that in edge arrival stream you can get actually the almost optimal Approximation 1 minus 1 over e minus epsilon in a space almost order of M Okay, and then this is actually tight
09:23
However We know that if you don't Really want the optimal constant factor and you are okay with any constant factor Let's say half is good for you Then you can get much better space complexity for the maximum coverage problem in set arrival stream
09:40
Okay, so I mean this is the set arrival. It's important So now the question that we were asking was if we can get the similar result for edge arrival stream If we if we can get a constant factor approximation in edge arrival stream that has
10:00
Complexity like let let's say order of n Actually, we know that this is not possible at least for half Approximation because if you want to get half approximation Then you have to pay omega of M in a space complexity And we actually generalized the result of botany at all and show that if you want to get one over alpha
10:22
Approximation you have to pay omega of M over alpha square so which means that for any constant factor you have to You have you need omega of M and but but the still there's a question if we can get the
10:40
Matching upper bound to this to this lower bound and We show that that's actually the case and we can get a one or alpha approximation algorithm in order of M over alpha square space Okay, and This is for the estimation part when you want to just estimate the size of the optimal coverage
11:04
and if you want to actually report the case sets that achieve the optimal coverage you have to pay an extra K in the space complex and Here I'm going to The rest of the talk. I'm going to talk about this Okay, so now that's the idea so we want to solve the estimation version of the problem
11:24
So it means that and we want to design the one over alpha approximation algorithm that use order of M over alpha square space and solve the problem in a gyrable string and this means that if we can provide a
11:43
Fraction of the optimal one Okay, let's revisit the sampling component that I described So the first one was the element sampling that says you can just sample a bunch of elements and solve the problem where the sample set Sampled element and that would give you a good solution But there we assume that the case there exists cases that cover all the universe
12:04
However, this is not the case here, right? I mean we can have the optimal solution only cover gamma fraction of the universe and Together anything meaningful from the sampling we have to sample at least one over gamma gamma elements Okay, and gamma if gamma is a very small
12:21
I mean it could be I mean one over gamma could be something like polynomial in which is not good for us So basically now the first question is if we can amplify gamma to some constant no matter what's the what's the impetus? And we showed that actually by simple hashing Trick you can do that You can just hash the elements to the to the optimal coverage size and then you will have the result
12:46
You are looking for you will get a new instance in which the optimal coverage size is order of the universe Okay, so now we have we have this assumption that gamma is equal order of one and we can just
13:01
Just easily do the element sampling as before. Okay, so that's part is fixed and Now I mean since we I know that the optimal solution is covering the constant fraction of the universe I can say that actually if I'm looking for k set whose coverage is at least one over alpha fraction of the universe Okay, so that's the first part and
13:22
We also I mean I'm going does now I'm going to briefly describe the main ideas in the rest of the algorithm and the other one is the Generalization of the set sampling So we are going to apply set sampling with different scales and there are two cases either We find the solution either we can find a certificate by that which is good or it will impose some
13:46
Structure on the freak on the size of on the number of the elements in each frequency class so for instance, then we if it doesn't succeed it means that The number of element that appear more than m over k set is less than n over alpha
14:03
Those that appear more than m over 2 case that is less than 2 n over alpha and so on So we have some then we have some nice structure Okay, so either via these these set sampling will find a certificate that is good actually I mean here we use some zero estimation algorithm, but I'm not going to describe that
14:24
otherwise, we have some structure on the size of each frequency class okay, so this is the second part and then we can just apply element sampling by having that nice structure over the frequency classes and
14:40
We can show that the total space that then we need here is m over alpha, but it's not good for us We were looking for m over alpha square. So here so far we have just sampled the elements, right? So one may ask, okay, why not sample the set as well? So as far as to find the set side as well and that's actually what we are going to test next so
15:01
If we can sample the sets by rate 1 over alpha and it would just work for us Then hopefully I mean the hope is that we can get m over alpha square Okay, so we can think of two extreme cases just For the case of presentation. So one case if the optimal solution has almost equal size sets
15:21
So then if we just sample by a Factor 1 over alpha then we expect that the survived set in the optimal solution is going to have 1 over alpha coverage of the of the universe Okay, and actually we can work it out for this case And actually this is this is good for us and we can get m over alpha square
15:42
space that we're looking for but there is another case that there are few critical sets are just only few sets that are contributing to the The coverage of the optimal solution then if we sample by a factor 1 over alpha We are going to miss basically those a few critical set with high probability. So we have to do something else
16:04
Okay, so basically now it's about finding the elephants finding some large set and if you are familiar with this streaming area it means that okay, we should look for the heavy hitters and this is actually what we are going to do and
16:20
Just Just for those of you who are not familiar with heavy heaters, let me describe it very fast so we are given a stream of items and we look at the The frequency vectors of the item in the stream and we say an item is heavy heaters let's say phi a heavy term is respect to the L2 norm if the
16:41
Frequency of the items square is greater than phi times the L2 norm of the vector Okay, and actually I mean this is good because we know that all phi heavy heaters can be found in order of 1 over phi space and Basically, I mean we can also we can just massage the instance in this case as well
17:02
I use the algorithm for the heavy heater to show that in this case we can get We can get m over alpha square space complexity as well This is the reason here is that because eventually we need to solve alpha square over m heavy heater for some properly
17:20
sample the streams on the sets Okay, and that would also gives us the solution and this is actually the whole picture I Described so far. So this was for the hashing part. This is for the Set sampling this was for the case that the optimal solution has almost equal size
17:40
And this was for the case that we have few critical sets Okay, so this is the first I mean, I just described is the universe reduction. This is the set sampling part So basically I was just to emphasize that all these Components are running parallel or running in parallel so that we can just implement everything in a single pass
18:02
this is for the case that the Optimal solution has almost equal size sets and this is also for the one with few critical sets and thank you
Recommendations
Series of 10 media