
Tight Trade-offs for the Maximum k-Coverage Problem in the General Streaming Model


Formal Metadata

Title: Tight Trade-offs for the Maximum k-Coverage Problem in the General Streaming Model
Number of Parts: 155
License: CC Attribution 3.0 Germany. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
We study the maximum k-coverage problem in the general edge-arrival streaming model: given a collection F of m sets, each a subset of a ground set U of n elements, the task is to find k sets whose coverage is maximized. The sets are specified as a sequence of (element, set) pairs in an arbitrary order. Our main result is a tight (up to polylogarithmic factors) trade-off between the space complexity and the approximation factor α ∈ (1/(1 − 1/e), Ω̃(√m)] of any single-pass streaming algorithm that estimates the maximum coverage size. Specifically, we show that the optimal space bound is Θ̃(m/α²). Moreover, we design a single-pass algorithm that reports an α-approximate solution in Õ(m/α² + k) space. Our algorithm heavily exploits data stream sketching techniques, which could lead to further connections between vector sketching methods and streaming algorithms for combinatorial optimization tasks.
Transcript: English (auto-generated)
I'm going to talk about streaming algorithms for the maximum k-coverage problem. Let me first define the problem. We are given a collection F of m sets S1, ..., Sm, each a subset of a universe U that contains n elements e1, ..., en; you can see an example here on the slide. We are also given a parameter k, and the goal is to find k sets in the input collection F whose union is maximized.

For instance, if k is equal to 2, then S1 and S2 are the best two sets, the pair whose union is maximized in this example. This is a very classical problem, and it is known to be NP-hard. The standard greedy algorithm, which runs in k iterations and in each iteration picks the set that covers the maximum number of yet-uncovered elements, achieves a (1 − 1/e)-approximation, and that is the best we can hope for unless P = NP. So the complexity of the problem is well understood in the standard offline setting.
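To make the greedy baseline concrete, here is a minimal offline sketch of it in Python; the toy instance and the set and element names are made up for illustration, and this is not the streaming algorithm discussed in the talk.

```python
def greedy_max_coverage(sets, k):
    """Classical greedy for maximum k-coverage: repeatedly pick the set that
    covers the most yet-uncovered elements. `sets` maps a set name to the
    collection of elements it contains."""
    covered = set()
    chosen = []
    for _ in range(k):
        best = max(sets, key=lambda s: len(sets[s] - covered), default=None)
        if best is None or not (sets[best] - covered):
            break  # nothing new can be covered
        chosen.append(best)
        covered |= sets[best]
    return chosen, covered

# Toy instance (hypothetical data, not the example from the slides).
example = {
    "S1": {"e1", "e2", "e3", "e4", "e5"},
    "S2": {"e4", "e5", "e6", "e7"},
    "S3": {"e1", "e6"},
}
print(greedy_max_coverage(example, k=2))
```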
The problem also has many applications: the maximum k-coverage problem, and the very similar set cover problem, show up in different areas such as information retrieval, data mining, web search, advertising, and so on. In this talk I am going to focus on the problem in the streaming setup.

The motivation is that we want to solve the problem for large data sets, in particular in the streaming model, so people started looking at it in that setting. The title of the paper says the general streaming model, so let's see what that means.

People first studied this problem in the set-arrival stream, where the sets arrive one by one and, when a set arrives, you see all of its elements at once. For instance, here we would see S1 together with all of its elements, e1, e6, e3, e9, and so on. Most of the previous results are in this set-arrival model, but there is a more general streaming model.
This more general model, which captures the set-arrival stream, is called the edge-arrival stream. In this model we look at the bipartite graph representation of the set system: the elements are on one side, the sets are on the other side, and there is an edge between an element and a set if the set contains the element. We then assume that the stream delivers the edges one by one, and there is no guarantee that the elements of the same set come next to each other; they may arrive in an arbitrary order. For instance, here we may first see that e3 is in S1, then that e5 is in S4, and so on. As you see, the elements of a given set may come out of order, and it is easy to see that the edge-arrival stream is more general than the set-arrival one.
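As a small illustration, here is how the same toy instance could be presented in the two models; the concrete pairs and their order are invented for the example, since any order is allowed in the edge-arrival case.

```python
# Set-arrival stream: each set arrives together with all of its elements.
set_arrival_stream = [
    ("S1", ["e1", "e3", "e6", "e9"]),
    ("S4", ["e5", "e7"]),
]

# Edge-arrival stream: a sequence of (element, set) pairs in arbitrary order,
# so the elements of one set can be interleaved with those of other sets.
edge_arrival_stream = [
    ("e3", "S1"), ("e5", "S4"), ("e1", "S1"),
    ("e7", "S4"), ("e9", "S1"), ("e6", "S1"),
]
```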
So let's see what exactly the input to this problem is and what we want to compute. We are given an edge-arrival stream of a maximum coverage instance, and we would like to design an algorithm that reads through the input in one pass, uses only a sublinear amount of space, and finds an approximate solution. Here we have m sets and n elements, so the total size of the input can be as large as m times n; by sublinear space I mean o(mn), and we will see that the space we actually achieve is much better than that.
Before going through our results, let me first revisit some sampling components that have been widely used for the set cover and maximum coverage problems in the streaming setting. They are actually very simple.

The first one is set sampling, which says that if you pick just L random sets, then with high probability all the high-frequency elements are covered. More precisely, if you pick L random sets, the elements that appear in more than m/L sets will be covered. For instance, if I pick two random sets, I expect that the elements appearing in more than half of the sets of my set system are covered: here, if I pick these two sets at random, then e2, e6, and e9, which appear in more than half of the sets, are covered. This is a very simple observation, but it is very useful: we can randomly pick L sets and get rid of the high-frequency elements, and those elements are exactly the ones that contribute the most to the space complexity of the problem.
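Here is a minimal sketch of this set-sampling primitive, assuming for illustration that the instance fits in memory; the guarantee (elements appearing in more than m/L sets are covered) holds with good probability over the random choice, up to the logarithmic factors I am glossing over.

```python
import random

def set_sampling(sets, ell, seed=0):
    """Pick ell sets uniformly at random and return them with the elements
    they cover. Elements that appear in more than m/ell sets are covered with
    good probability (a sketch of the primitive, not the paper's subroutine)."""
    rng = random.Random(seed)
    names = list(sets)
    sample = rng.sample(names, min(ell, len(names)))
    covered = set().union(*(sets[name] for name in sample))
    return sample, covered
```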
The other component says that if we want to solve the problem, say set cover or, in our case, max coverage, we can sample a few elements, solve the problem on the sampled elements, and then, with high probability, the solution we find there is also a good solution for the original instance.

Let's look at it more carefully. Suppose I sample just e1, e3, e7, e10 in this example. I then forget about the rest of the elements and find a good k-cover for these sampled elements; say I have a budget of k equal to 3, and this is the solution that covers everything in my sample. Going back to the original instance, we see that it is actually a good solution there too: it covers almost all of the elements.

More formally, if we know that there exist k sets that cover everything, then we can randomly pick on the order of k/λ elements from the universe and find a solution for the sampled elements; that solution then covers a (1 − λ) fraction of all the elements. This is helpful because now the space complexity is roughly the sample size, the number of sampled elements, times the number of sets, rather than n times m. The idea is that you can focus on a much smaller instance, solve it, decrease the number of uncovered elements, and repeat to put together a full solution.
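Here is a small sketch of the element-sampling step, reusing the greedy routine from above as the offline solver on the sampled instance; the sample size and the choice of solver are illustrative, not the exact ones from the paper.

```python
import random

def element_sampling(sets, universe, sample_size, k, seed=0):
    """Sample elements, project every set onto the sample, and solve max
    k-coverage on the much smaller projected instance. Assuming k sets can
    cover (most of) the universe, the returned sets also cover a large
    fraction of the full universe with good probability."""
    rng = random.Random(seed)
    sampled = set(rng.sample(sorted(universe), min(sample_size, len(universe))))
    projected = {name: elems & sampled for name, elems in sets.items()}
    chosen, _ = greedy_max_coverage(projected, k)  # greedy sketch from above
    return chosen
```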
So these are the two simple sampling components that were used before for the set cover problem. Now that we know them, let's see what the algorithm looks like; recall the requirements we had for the streaming maximum k-coverage problem.

A few remarks first. Edge-arrival streams are more general, yet the sampling-based approaches I described still work on them, because, just like sketches, we can compute them even if the edges come out of order. But there is some subtlety. For instance, in the set-arrival stream there is the primitive task of testing the size of a set, which is trivial there: when a set arrives you immediately see how large it is. In the edge-arrival stream this becomes challenging, because you may have to wait until the end of the stream to figure out the size of a set. This seemingly small difference makes the complexity of the two stream models very different.
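For instance, even just computing all set sizes from an edge-arrival stream requires reading the whole stream, as in this small sketch (exact counters here, assuming each pair appears at most once; the algorithms in the talk would of course use sketches instead).

```python
from collections import Counter

def set_sizes(edge_arrival_stream):
    """Exact set sizes from an edge-arrival stream of (element, set) pairs.
    A set's size is only known once the entire stream has been read, since
    its elements may appear anywhere in the stream."""
    sizes = Counter()
    for element, set_name in edge_arrival_stream:
        sizes[set_name] += 1
    return sizes
```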
So let's see what we know about this problem. In the edge-arrival stream you can get an almost optimal approximation, 1 − 1/e − ε, in space roughly Õ(m), and this is tight. However, if you do not insist on the optimal constant factor and any constant factor is fine, say a 1/2-approximation is good enough, then you can get a much better space complexity for the maximum coverage problem in the set-arrival stream.

The set-arrival part is important here. So the question we asked is whether we can get a similar result for the edge-arrival stream: a constant-factor approximation with space complexity like, say, Õ(n). We know this is not possible, at least for a 1/2-approximation: to get a 1/2-approximation you have to pay Ω(m) space. We generalized this lower bound (due to Bateni et al.) and showed that to get a 1/α-approximation you have to pay Ω(m/α²), which means that for any constant factor you need Ω(m) space. The remaining question is whether we can match this lower bound with an upper bound.

We show that this is indeed the case: we give a 1/α-approximation algorithm using Õ(m/α²) space. This is for the estimation version, where you only want to estimate the size of the optimal coverage; if you also want to report the k sets that achieve it, you pay an extra k term in the space complexity. In the rest of the talk I am going to focus on the estimation version.
So we want to design a 1/α-approximation algorithm that uses Õ(m/α²) space and solves the problem over an edge-arrival stream; this means the value we report should be at least a 1/α fraction of the optimal coverage.

Let's revisit the sampling components I described. The first one was element sampling: sample a bunch of elements, solve the problem on the sampled elements, and that gives a good solution. But there we assumed that there exist k sets covering the whole universe. That is not the case here: the optimal solution may cover only a γ fraction of the universe, and to get anything meaningful from the sampling we would have to sample at least about 1/γ elements. If γ is very small, 1/γ could be polynomial in n, which is not good for us.

So the first question is whether we can amplify γ to a constant, no matter what the input is. We showed that a simple hashing trick does exactly that: hash the elements down to a universe whose size is roughly the optimal coverage size, and you get a new instance in which the optimal coverage is a constant fraction of the new universe. So now we may assume γ = Ω(1) and do the element sampling as before. That part is fixed, and since the optimal solution now covers a constant fraction of the universe, I can say I am looking for k sets whose coverage is at least a 1/α fraction of the universe. That is the first part.
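Here is a minimal sketch of that universe-reduction step, assuming we are given a guess of the optimal coverage size; the function name and the use of Python's built-in hash are just for illustration.

```python
import random

def reduce_universe(edge_arrival_stream, opt_guess, seed=0):
    """Map every element into a universe of roughly opt_guess buckets and
    replace each (element, set) pair by (bucket, set). If opt_guess is close
    to the optimal coverage size, the optimal solution of the new instance
    covers a constant fraction of the reduced universe."""
    rng = random.Random(seed)
    salt = rng.randrange(1 << 30)
    for element, set_name in edge_arrival_stream:
        yield hash((salt, element)) % opt_guess, set_name
```

Since the optimal coverage size is not known in advance, a one-pass implementation would typically run this for several guesses of that size in parallel.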
Now let me briefly describe the main ideas in the rest of the algorithm. The next component is a generalization of set sampling: we apply set sampling at different scales, and there are two possible outcomes. Either we find a certificate, a solution that is already good, or the sampling imposes some structure on the number of elements in each frequency class. For instance, if it does not succeed, it means that the number of elements appearing in more than m/k sets is less than n/α, the number appearing in more than m/(2k) sets is less than 2n/α, and so on, so we get a nice structure. In other words, either this set sampling finds a good certificate (here we use an L0, i.e. distinct-elements, estimation algorithm, which I am not going to describe), or we get structure on the size of each frequency class. That is the second part.

Then we can apply element sampling using that nice structure over the frequency classes, and we can show that the total space needed this way is about m/α. But that is not good enough: we were aiming for m/α². So far we have only sampled elements, so one may ask: why not subsample on the set side as well? That is exactly what we try next.
If we could sample the sets at rate 1/α and it just worked, then we could hope for m/α² space. For the sake of presentation, think of two extreme cases. In the first case the optimal solution consists of sets of almost equal size. Then, if we sample the sets at rate 1/α, we expect that the sets of the optimal solution that survive the sampling still cover about a 1/α fraction of the universe. We can work this case out; it is good for us and gives the m/α² space we are looking for. In the second case there are only a few critical sets that contribute most of the coverage of the optimal solution; if we sample at rate 1/α we will miss those few critical sets with high probability, so we have to do something else.
So now it is about finding the elephants, those few large sets, and if you are familiar with the streaming area this means we should look for heavy hitters, and that is what we do. For those of you who are not familiar with heavy hitters, let me describe them very quickly. We are given a stream of items and we look at the frequency vector of the items in the stream; we say an item is a φ-heavy hitter with respect to the L2 norm if the square of its frequency is greater than φ times the squared L2 norm of the frequency vector. This is good for us because all φ-heavy hitters can be found in roughly O(1/φ) space.
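As a quick illustration of such a primitive, here is a minimal CountSketch-style frequency estimator, one standard way to recover L2 heavy hitters; the table dimensions and the use of Python's built-in hash are illustrative only, not the parameters from the paper.

```python
import random

class CountSketch:
    """Minimal CountSketch frequency estimator (a standard building block for
    finding L2 heavy hitters)."""

    def __init__(self, rows=5, cols=256, seed=0):
        rng = random.Random(seed)
        self.rows, self.cols = rows, cols
        self.table = [[0] * cols for _ in range(rows)]
        # Independent seeds for the bucket hash and the +/-1 sign hash per row.
        self.bucket_seeds = [rng.randrange(1 << 30) for _ in range(rows)]
        self.sign_seeds = [rng.randrange(1 << 30) for _ in range(rows)]

    def _bucket(self, r, item):
        return hash((self.bucket_seeds[r], item)) % self.cols

    def _sign(self, r, item):
        return 1 if hash((self.sign_seeds[r], item)) % 2 else -1

    def add(self, item, count=1):
        for r in range(self.rows):
            self.table[r][self._bucket(r, item)] += self._sign(r, item) * count

    def estimate(self, item):
        # Median over rows of the signed counters estimates the frequency.
        vals = sorted(self._sign(r, item) * self.table[r][self._bucket(r, item)]
                      for r in range(self.rows))
        return vals[len(vals) // 2]
```

On top of such frequency estimates, a heavy-hitter routine keeps a small pool of candidate items and reports those whose estimated squared frequency exceeds the φ times squared-L2-norm threshold.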
We can massage our instance so that a heavy-hitter algorithm applies, and show that in this case we also get m/α² space. The reason is that, eventually, we need to find (α²/m)-heavy hitters on suitably subsampled streams over the sets, which takes about m/α² space, and that gives us the solution for this case.

So this is the whole picture of what I described: the universe reduction via hashing, the generalized set sampling part, the case where the optimal solution has sets of almost equal size, and the case with a few critical sets. I just want to emphasize that all these components run in parallel, so that we can implement everything in a single pass. Thank you.