
From Shopping Baskets to Structural Patterns


Formal Metadata

Title: From Shopping Baskets to Structural Patterns
Number of Parts: 611
License: CC Attribution 2.0 Belgium
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Production Year: 2017

Content Metadata

Abstract
Mining frequent itemsets is an established approach to data mining and supported by productive data mining solutions. For example, one can get insights about buyers' behavior by analyzing frequent co-occurrences of products in shopping baskets. In contrast, frequent subgraph mining (FSM), the graph variant of frequent itemset mining, not only evaluates entity co-occurrence but also relationships among entities, i.e., structural patterns. However, existing implementations are all research prototypes which are tailored to textbook problems.

In our talk, we want to give an introduction to the FSM problem on distributed collections of graphs and our implementation in Gradoop, an open source system for scalable graph analytics based on Apache Flink. In contrast to other iterative graph algorithms like PageRank, in FSM not only the final state but also the intermediate results of iterations form the desired result. Here, the major technical challenge is the respective usage of Flink's distributed iterations. We will explain different implementation approaches, discuss implementation details which influence scalability, and show benchmark results.

Intended audience and goal of the talk: developers and analysts interested in relationship-centric data mining techniques.
Transcript: English (auto-generated)
Yeah. Hello, everyone. Thank you. So, hello. I'm Andrew from the University of Leipzig, and today I'm going to talk about frequent subgraph mining on Apache Flink, which I hid behind the title of
From Shopping Baskets to Structural Patterns, because it's related to something some of you might know under the term shopping basket analysis, or association rules. So, first of all, my question: who of you regularly works with graphs? And with data mining?
Or with data mining? Yeah. And graphs and data mining? All right. So, I'll slow down a little bit, and I'll try to make the talk not too scientific, and rather give you an introduction to the problem and the challenges in the implementation on Apache Flink, or in distributed data flows in general.
So, the contents are as follows. First, I introduce the problem and explain details about the current state of these algorithms. Then I present a solution which we built on top of Apache Flink in the context of the Gradoop framework. Another question: who was attending the talk before?
So, all right. Quite a lot. So, yeah. It's in the context of an open source framework for distributed graph analytics called Gradoop. And finally, I'll talk a little bit about some optimizations of our solution if I find the time, and of course, conclude. So, part one is the problem.
So, the general class of data mining problems is called frequent pattern mining. We are talking about the transactional setting, which means we have a collection of things, for example shopping baskets, click streams, XML documents, chemical compounds, whatever. It just depends on the data structure,
but we have a collection of, let's say, datasets which might have different structures. And in frequent pattern mining, we're interested in the patterns that, for example, occur in 80 percent of these things. Frequent pattern mining can be categorized, starting with the basic one, frequent itemset mining, for example shopping baskets, where we're just interested in
which things frequently co-occur together. If we add the constraint that the order of items matters, then we have frequent sequence mining, for example a click stream: we're interested in which paths users regularly take on a website, or something like that. We can, again, extend it:
if we also consider branching of paths, then we are at frequent subtree mining, which is a little bit more difficult than sequence mining, but still very closely related. And at the end, if we are also interested in all the relationships, so, unlike trees, we may have closing cycles and stuff like that,
then we have the problem of frequent subgraph mining, and an example are chemical compounds. So, let's first go back to the simplest variation of this problem. Some of you might know association rules. This is something like: people who bought
bread and butter often also bought cheese and wine, so we can use this for recommendations. That's what some web shops do, I guess. For example, if you already added two items to your shopping basket on Amazon, then one way to recommend other things is using association rules, which are based on the results
of frequent pattern mining on past data sets. Frequent subgraph mining can be used for similar things. For example, if we have an unknown chemical substance which contains a substructure
that is known to be often part of illegal stimulants, we can have a look at this substance; maybe it's a candidate for a new illegal stimulant. I don't know if this example makes sense, but I hope you got the idea. So, the problem definition, just to conclude the introduction,
is: our input is a graph collection, so our collection of things are, in this case, graphs. And we have a min support threshold, which specifies the minimum percentage of the collection's elements which have to contain a pattern for it to be considered frequent. And the output is the complete set of frequent subgraph patterns
which are supported above the given threshold, so the min support. Now let me give you a toy example. On the bottom, I hope it's visible for most of you guys. No? So, on the bottom, just believe me,
there are three graphs. This is often called a graph database. And, yeah. Fair for you?
It's smaller, but now more people can enjoy the fun of seeing the graph. So, this is the input, a graph collection. First, I need to tell you: in our problem scenario, in the context of Gradoop, we are talking about directed multigraphs, which is a difference to typical solutions.
This means our edges have a direction, like in graph databases such as Neo4j, and we support multiple edges between any pair of vertices, which changes the problem a little bit from the algorithmic point of view. So, now we have three graphs, and we're interested in subgraphs
which we can describe by patterns. A subgraph, just to not confuse you, is a part of a graph. And we're interested in subgraphs which occur in at least 50%, so in two of three. Typical algorithms use the abstraction of a frequent pattern lattice,
which contains different levels. In the first level, there are all zero-edge subgraphs, which means just vertices. And we see we have a vertex A here, and A occurs in this graph, in this graph, and in that one. Thus, we have a frequency of three. The same for B, frequency of three. We have C, frequency of one,
because it is only contained in this one. And thus we can say, all right, it's not above the threshold of 50%, which would be two, and thus we can prune it. So now we continue and grow children from that. We have, for example, AAB, ABB, or this loop pattern.
And they are children of the zero-edge patterns. And we still see, okay, this one has a frequency of three, one and two; one is below the threshold, that's why we can prune it. And finally, the largest frequent subgraph contained is this two-edge subgraph, which is contained in two graphs at level k = 2.
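The counting and pruning on the first lattice level can be sketched in a few lines of Python. The graphs are reduced to their vertex-label sets here, which is enough to show graph-based support counting; the real algorithm of course tracks structure, not just labels, and `frequent_labels` is a made-up name for this sketch.

```python
# Toy sketch of the first lattice level: count label frequencies per graph
# and prune everything below the min support threshold. The real FSM
# algorithm counts structural patterns, not just labels.

MIN_SUPPORT = 0.5  # a pattern must be contained in at least 50% of the graphs

# three "graphs", reduced to the vertex labels they contain
graph_db = [{"A", "B"}, {"A", "B"}, {"A", "B", "C"}]

def frequent_labels(db, min_support):
    threshold = min_support * len(db)  # here: 1.5, so at least 2 graphs
    counts = {}
    for graph in db:
        for label in graph:  # graph-based support: count once per graph
            counts[label] = counts.get(label, 0) + 1
    # prune, as on the first level of the pattern lattice
    return {l: c for l, c in counts.items() if c >= threshold}

print(frequent_labels(graph_db, MIN_SUPPORT))  # A and B survive; C is pruned
```

This mirrors the toy example from the talk: A and B have frequency three, C has frequency one and falls below the threshold of two.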
This is actually the desired result, but we do not implement it as a data structure; it's just the theoretical model behind these algorithms. From the data structure point of view, of course, we have graphs and we have patterns, which we need to store somehow. But the most important thing are embeddings. It's very similar to the previous talk
about pattern matching: we need embeddings which map the patterns to actual subgraphs. And in this case we see that, due to automorphism, a graph might contain a pattern multiple times, which means this pattern is contained in this graph two times.
So once is the green line, so this vertex is this vertex and this vertex is this one, and because these vertices cannot be distinguished by their pattern, we create a second embedding. So this is everything you need to know before I start explaining the algorithms. So we have graphs, which are the input elements.
We have subgraphs, which are parts of the input elements, and patterns, which are nodes in our pattern lattice and isomorphic to subgraphs. Isomorphic means there is a one-to-one mapping between vertices and edges, and this is the tricky point of these algorithms, the same as in pattern matching.
And we have embeddings, which are mappings between patterns and subgraphs. The challenge in implementing such algorithms, especially in the context of shared-nothing clusters, is that we need to meet the dataflow programming model, in this case of Apache Flink, but this also applies to Apache Spark or actually even MapReduce.
And what we need to find is an efficient algorithm which does not rely on shared memory. I don't want to become too scientific now, so I'll just explain what makes these algorithms efficient.
The problem of these algorithms is that they contain the so-called subgraph isomorphism problem, which is NP-complete. We need to enumerate all isomorphisms between graphs, which is very expensive, and we cannot avoid this completely. If we want the complete set of frequent subgraphs, it's not possible to avoid this problem, but what efficient algorithms
can do is minimize the isomorphism testing effort. So let me explain: most of this work has been done in the early 2000s, but not in the context of distributed dataflow systems; it was done for single machines. First, there were a priori approaches.
These are based on a so-called breadth-first search (BFS) in the lattice, which means we first process all frequent subgraphs of level one, then generate candidates, do pattern matching in the graph database, find the frequent ones and join them to children. And then we again do the pattern matching
and repeat this until there are no frequent subgraphs left. The good thing about these algorithms is that they only need one iteration per level, and this allows effective frequency pruning. But the problem of these algorithms is that they contain subgraph isomorphism testing in the pattern matching.
They additionally contain subgraph isomorphism testing in the candidate generation: when you have two frequent patterns and you want to generate the child, you need to enumerate all subgraphs of a certain level to generate children, which is also very expensive. You may generate candidates that don't even occur in the database, because they're just candidates.
There's no guarantee for them to occur. And the candidate generation itself is hard to parallelize without shared memory, because you actually need to build the cross product of all combinations. This is why the more efficient single-machine approaches
are based on a depth-first search (DFS) in the lattice. They go from, let's say, this root pattern to this pattern, then to this one, until no more frequent patterns are found. Then they jump back to a frequent parent, and so on.
And the idea behind these algorithms is that we can skip some of the links in the lattice due to specific tricks, which I will not explain now because this really takes a lot of time. But with these algorithms it's possible to avoid checking all the combinations. The idea is an edge-wise extension
of frequent patterns, where growth rules reduce the search to a tree. For example, there's the gSpan algorithm, a very popular example of this category. Where the isomorphism problem still occurs is that sometimes we don't want to visit, for example, this edge or link in the lattice,
but we accidentally visit it. This is why we need to check the pattern's graph representation to see if it's really on the tree that we wanted to traverse. So, if we look at these algorithms: they automatically find canonical labels, which means we can compare
graph patterns by a simple string comparison. The isomorphism testing is reduced to a minimum: we just need to validate these labels in the end, and we only check patterns that actually exist. So we solved a lot of the problems of the a priori algorithms.
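The effect of canonical labels can be illustrated with a deliberately simplified Python sketch: instead of isomorphism testing, patterns are compared as strings. Note this is not gSpan's real canonical form (which is the minimum DFS code); a sorted edge list only works for toy inputs like this one and is not a correct canonical form in general.

```python
# Simplified illustration of canonical labels: two patterns are compared by a
# canonical string instead of explicit isomorphism testing. gSpan actually
# uses minimum DFS codes; the sorted edge list below ignores vertex identity,
# so it is NOT a correct canonical form in general, just a toy stand-in.

def canonical(edges):
    # edges: list of (source_label, edge_label, target_label) triples
    return "|".join(",".join(triple) for triple in sorted(edges))

p1 = [("A", "x", "B"), ("B", "y", "A")]
p2 = [("B", "y", "A"), ("A", "x", "B")]  # same pattern, listed differently

# a simple string comparison now suffices to detect equal patterns
assert canonical(p1) == canonical(p2)
```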
But the problem is that we need a large number of iterations for that, and in a shared-nothing cluster with many, many workers it can lead to unbalanced parallelization, because we can maybe parallelize single paths, but not a whole level. The main idea of the new dataflow systems
is to bring the computation to the data, which means processing, for example, all the embeddings of the same edge count at the same time. And this is not easily possible with this DFS approach. Coming back to our toy example: this is why I used dotted lines for some of the links in the lattice.
The a priori approaches would go along all the links, or there are even some optimizations which maybe skip some, but still, this is the basic difference. In a pattern growth approach, we can just leave out the dotted ones. Okay, so now this is everything you need to know about frequent subgraph mining
and the research of the past. Now we are facing these problems in the context of distributed dataflow systems on shared-nothing clusters. Our solution is something we call level-wise pattern growth, which means we do a parallel depth-first search in the lattice.
We process the lattice level-wise, but still skip the links we do not want to visit, and still avoid the isomorphism problem due to candidate generation, for example. This means we only need k_max iterations, where k_max is the edge count of the largest frequent pattern, and we can skip these links and still minimize isomorphism testing. This approach fits the dataflow programming model very well. I will show you, using some pseudocode, how we can implement this on Apache Flink.
So, for all who don't know Apache Flink: the basic abstraction is that you have data sets, and then you have transformations on these data sets, where a higher-order function is a parameter; it's similar to lambda expressions, but for data sets. The idea is that the function is executed for each element of a data set,
or for a group of elements, for example. The functions are executed concurrently, and for everything which requires global knowledge exchange, we need to create barriers in the data flow
to exchange global knowledge. That's the problem in the shared-nothing context. Knowing these constraints, what you would naively program in a usual programming language would be: there is a data set of graphs, a data set of all frequent patterns, and a data set of k-edge frequent patterns.
Those are all the data sets we need. And then we have a loop which terminates when we do not find any k-edge frequent patterns anymore. And at the end, we are interested in all frequent patterns. That's what our algorithm should do. What do we do? For example, we do a pattern growth at the beginning,
for which we use something called broadcasting in Flink. It's like a very efficient cross product in a distributed data flow: we just send the last iteration's frequent patterns to all machines, and then we can access them in the pattern growth step.
Initially, this set is empty; this is the dummy root of the lattice. We apply the pattern growth using a map function, which means we update the graphs, because the graph data set does not only contain graphs, it also contains a map between patterns and embeddings.
This means in the pattern growth step we just check the map: all the keys which are also contained in the frequent pattern set we need to grow. In the next step, we report the patterns. So after we grew patterns, we report them using flatMap. FlatMap is a map function,
but with arbitrary output cardinality, so it could output zero or many patterns. Now it becomes a little relational: we group by pattern, so the pattern is the key; it's very similar to the reduce function in MapReduce. We group by key, sum the frequency, filter out the frequent ones, and we additionally need to filter the valid ones,
because sometimes there are some false positives. And we found out in our evaluations that it's far more efficient to group and then filter invalid patterns instead of validating them in the growth step, because there we would have to validate for each graph and each pattern, and here we only have to validate
each pattern once in the whole system. Which means: imagine we have an input size of 10 million graphs. We do not repeat the same operation 10 million times; we do it here, and in comparison to the complexity of counting on a large data set, it's basically free.
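The counting part of this dataflow can be modeled with plain Python instead of Flink operators; `report_patterns` and `is_valid` are made-up stand-ins for the real functions, and the "graphs" here are already just lists of pattern strings.

```python
# One "round" of the dataflow, modeled with plain Python instead of Flink's
# DataSet API: flatMap-style reporting, groupBy + sum, then the two filters.
from collections import Counter

MIN_FREQUENCY = 2

def report_patterns(graph):
    # flatMap: emit a (pattern, 1) pair for every pattern the graph supports
    return [(pattern, 1) for pattern in graph]

def is_valid(pattern):
    # stand-in for the (expensive) validation done once per pattern
    return not pattern.startswith("invalid:")

graphs = [["A->B", "A->B->B"], ["A->B"], ["invalid:X"]]

counts = Counter()
for g in graphs:                      # groupBy(pattern) + sum(frequency)
    for pattern, one in report_patterns(g):
        counts[pattern] += one

frequent = {p: c for p, c in counts.items()
            if c >= MIN_FREQUENCY and is_valid(p)}  # filter frequent & valid
print(frequent)                       # {'A->B': 2}
```

In the real Flink program these steps roughly correspond to a `flatMap`, a `groupBy` with a sum aggregation, and filter transformations, executed concurrently on all workers.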
Finally, we add the current round's frequent patterns to the all-frequent-patterns data set using a union operator, and that's how everybody would naively program it. But the problem is that Flink's bulk iteration, which is a dataflow iteration
and not a typical while loop (it's part of the batch API of Apache Flink), allows only exactly one data set to be passed along iterations. And the problem is that here, we actually have three data sets which we manipulate during our iterations,
but there's the constraint that from the iteration body we cannot access the other data sets. So this doesn't work. For the first one, the k-edge frequent patterns, there's a quite simple solution: we just pull it out of the iteration, repeat a little bit of work, and instantiate the set of k-edge frequent patterns inside the loop, so we can avoid the external access.
But this does not work for the union operation and our data set collecting the frequent patterns of all iterations. This is where we used a very hacky approach. At the beginning of the data flow, we added to the graph data set
a new empty graph element, which is actually an empty graph with an empty map. And because it's an empty graph, and we ensure that it's the only empty graph in the whole data set, we can identify it. Then we extend the map function: we do not only apply pattern growth, but pattern growth or store.
We just check: if the graph is empty, we use its pattern-embeddings map to store frequent patterns. Because we reuse the data structure we already have, we don't need any additional one. And at the end, we filter out this single dummy graph, which was collecting all the frequent patterns,
and use a flatMap which creates a new pattern element in the output data set, including its frequency (the frequency we can also encode in the mappings), and we have the same result. This really works; that was the solution to our problem.
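The dummy-graph trick can be illustrated with a small Python model: dicts stand in for Flink data sets, and all names (`grow_or_store`, the `edges`/`patterns` fields) are invented for this sketch.

```python
# Sketch of the "dummy graph" workaround: Flink's bulk iteration passes
# exactly one data set between iterations, so an empty graph element travels
# with the real graphs, and its pattern->embeddings map doubles as the
# accumulator for all frequent patterns found so far.

graphs = [
    {"edges": 2, "patterns": {}},   # real graphs
    {"edges": 3, "patterns": {}},
    {"edges": 0, "patterns": {}},   # the single empty dummy graph
]

def grow_or_store(graph, frequent_this_round):
    if graph["edges"] == 0:                       # dummy: store the results
        graph["patterns"].update(frequent_this_round)
    else:                                         # real graph: grow patterns
        pass  # pattern growth would happen here
    return graph

for round_result in [{"A->B": 2}, {"A->B->B": 2}]:   # two fake iterations
    graphs = [grow_or_store(g, round_result) for g in graphs]

# at the end, filter out the dummy and read the collected patterns
all_frequent = next(g for g in graphs if g["edges"] == 0)["patterns"]
print(all_frequent)   # {'A->B': 2, 'A->B->B': 2}
```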
Okay, so this was the basic idea; I hope you understood the problem and how we implemented this on a distributed dataflow system like Apache Flink. Now, we did some optimizations. Actually there are four of them, and because two of them are very difficult to explain without becoming too scientific,
we just skip those. One is: it's even more efficient to do the validation step in the combine function, because it matters to have fewer tuples before network traffic occurs; for everyone programming with distributed dataflow systems, this seems obvious. And we use a merge join instead of a cross
during the pattern growth process, but this would need a lot of explanation now. What we also do is a pre-processing, where we use dictionary encoding and label pruning to reduce the input size, and, most effectively, we do a pattern encoding with a fast and effective compression technique.
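The label pruning and dictionary encoding just mentioned can be sketched in plain Python; the actual Gradoop implementation is Java on Flink, and the data layout here (graphs as pairs of label sets) is made up for illustration.

```python
# Sketch of the pre-processing: keep only frequent vertex and edge labels,
# then build string->integer dictionaries for the survivors.
from collections import Counter

MIN_FREQUENCY = 2

# graphs simplified to (vertex_labels, edge_labels) pairs
graphs = [({"A", "B"}, {"x"}), ({"A", "B"}, {"x"}), ({"A", "C"}, {"y"})]

# 1) count vertex labels once per graph and keep only the frequent ones
vertex_counts = Counter(l for vlabels, _ in graphs for l in vlabels)
frequent_vertices = {l for l, c in vertex_counts.items() if c >= MIN_FREQUENCY}

# 2) same for edge labels (in the real pipeline this happens only after the
#    edges incident to removed vertices are gone, which can change the counts)
edge_counts = Counter(l for _, elabels in graphs for l in elabels)
frequent_edges = {l for l, c in edge_counts.items() if c >= MIN_FREQUENCY}

# 3) dictionary encoding: integers compare and hash much faster than strings
vertex_dict = {label: i for i, label in enumerate(sorted(frequent_vertices))}
edge_dict = {label: i for i, label in enumerate(sorted(frequent_edges))}
print(vertex_dict, edge_dict)   # {'A': 0, 'B': 1} {'x': 0}
```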
The pre-processing's idea is basically that we keep only vertices with frequent labels. If we remove vertices, we also remove some of their edges. Then we consider only the remaining edges, those which occur together with frequent vertices,
and also count their frequencies. And we can drop all the edges with infrequent labels. Based on these steps, we have the frequencies of all the labels, so we can just use them to create dictionaries between strings and integers. And because frequent subgraph mining involves a lot of comparisons, equality tests,
hash code building and whatever, it's much more efficient to use integer labels instead of strings. The workflow is then as follows: we first encode the graphs, we run the actual mining, and we decode the patterns, because in many scenarios we do not want patterns consisting of numbers; we really want our patterns,
like RDF patterns, for example. Just to illustrate this approach: we see here there's a vertex C which is only contained in a single graph. And we see here's an edge labeled B which, if we only consider the edge label, is still frequent, because it is contained here and here.
But if we first remove this C vertex, then, to preserve consistency, this B edge goes away too, and B is not frequent anymore. If we do this in a pre-processing, such patterns will never be discovered. And the second optimization technique
is the pattern encoding and compression. We do not store the graphs in a typical way, as you would naively program graphs using Java objects and so on, because we found out that avoiding Java objects is one of the most effective tuning techniques you can do in Apache Flink.
This is at least the case for this scenario. We use multiplexed integer arrays to represent patterns, graphs and embeddings. We represent everything in multiplexed integer arrays, which require six positions in the array for each edge.
And we know that we only have positive values with very low upper bounds: the upper bound is the edge count or the size of the label dictionaries. Typically, we do not exceed four to eight bits per value. I say just typically; of course, you can create data sets where it's different, but in most use cases that's enough.
This is much less than the 32 bits usually required for a single integer, so we can use an integer compression. In our data sets, we can roughly encode a complete edge from six integers down to one, and we use Simple16 from a nice open source package for this.
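The bit-packing idea can be illustrated in Python; note this is a fixed-width toy, not the actual Simple16 codec, which selects one of 16 layouts per 32-bit word depending on the values.

```python
# Toy fixed-width bit packing in the spirit of Simple16: every value of an
# encoded edge is a small non-negative integer, so several values fit into
# one 32-bit word. Here we simply pack fixed 5-bit fields (values below 32).

BITS = 5                 # bits per value; real codecs choose this adaptively
PER_WORD = 32 // BITS    # 6 values per 32-bit word

def pack(values):
    words = []
    for i in range(0, len(values), PER_WORD):
        word = 0
        for j, v in enumerate(values[i:i + PER_WORD]):
            assert 0 <= v < (1 << BITS), "value does not fit into 5 bits"
            word |= v << (j * BITS)
        words.append(word)
    return words

def unpack(words, n):
    values = []
    for word in words:
        for j in range(PER_WORD):
            values.append((word >> (j * BITS)) & ((1 << BITS) - 1))
    return values[:n]  # drop the padding of the last word

edge = [3, 0, 7, 1, 2, 4]  # six small values describing one encoded edge
packed = pack(edge)
assert len(packed) == 1            # six integers compressed into one word
assert unpack(packed, len(edge)) == edge
```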
The effect of this is that we decrease network traffic, because the patterns traveling over the network are smaller. We have a smaller grouping key in the group-by operation, which makes it notably faster, and also the map access, when we access the embeddings of our patterns,
uses much smaller map keys, so this is more efficient as well. Additionally, we experimented with graph and embedding compression, mainly to decrease memory usage and to allow faster serialization between transformations. This is also something we have learned really matters in Apache Flink.
Decompression we execute only on demand; for example, embeddings of infrequent patterns will never be decompressed, and so on. There are a lot of detailed optimizations. Then we ran some experiments. I will not talk about absolute runtimes now,
because it doesn't make sense if you do not have any background knowledge about the data sets and what the numbers mean. But what we did in these experiments: we have two data sets. One is synthetic; we have one million graphs and 90% support, but this data set already gives a lot of patterns, and it contains directed multigraphs. And the classical scenario, which is used by all other work
in the field of frequent subgraph mining, is a molecular data set with undirected simple graphs, one million graphs and 10% support. So it has a quite low support, and it's one million molecules. I mean, with other types of data you can have big data problems with much larger input sizes,
but here the problem is the intermediate results and the computational complexity. We used these data sets and scaled up from 6 to 96 threads, which is roughly the same as 1 to 16 machines in a cluster. And we see that, especially on the synthetic data set,
for which we optimized our algorithm for directed multigraphs, it shows a quite good scale-up. And even for the molecular database it's not bad; it's not linear, but still quite good. Then we also evaluated the impact
of our optimizations. It depends on the data set: in this synthetic data set we deliberately created many, many infrequent labels, so of course this technique is very efficient there, but there are real-world data sets, such as RDF, which have these characteristics, and which we considered in the design
of this synthetic data set. We see that, over all parallelism levels, so from 6 to 96 parallel slots, the dictionary encoding in this case can improve the total runtime by up to 400%. Then we compared against avoiding pattern compression,
so only using the integer representation, which is already 10 times faster than using strings. This is a notable effect, especially on the molecular database. The embedding compression still has a notable impact, probably due to the serialization time,
but graph compression is a little tuning effect; it's not too much. All right. So, in conclusion: what we did is the first approach to graph-collection frequent subgraph mining using an in-memory dataflow system. There are MapReduce approaches, but for this kind of system,
to the best of our knowledge, we're the first, and we support directed multigraphs from arbitrary string-labeled inputs. So you can just enter a string-labeled input file and you get string-labeled patterns; we do not need any external pre-processing. It works, but using Flink requires some workarounds, as you have seen,
and we think it's a much better choice than MapReduce, because in each iteration you have to update three data sets, and for MapReduce you also need workarounds to solve these problems, but those MapReduce workarounds are by far more expensive than the workarounds you need for the Apache-Flink-based approach.
Regarding availability: it's part of the Gradoop framework, the distributed framework for graph processing, and it fits its algorithm API. The current version you will find in the master branch is not optimized; it's still using Java objects and strings and stuff like that, so it's not very efficient,
and the optimized version is still a messy research prototype, but it will be merged to master soon. So, thank you. If you have any questions?
Audience question: did you try this on a real-world data set, so evaluate this project with a business partner, on real systems and data? No, the problem is the availability of such data sets. We have a data generator which creates this type of data, and this is why we use the synthetic data set, which contains loops, parallel edges
and many automorphisms. Yeah, you're welcome. Audience: I happen to know that MSD in Prague has a data set like that. Excuse me? I happen to know that MSD in Prague has a data set like that. Yeah? Yes, because they're working on it. What is in there?
Molecular data. Yeah, molecular data, of course. There are tons of molecular data sets, but the thing is that in molecular data sets you have only two types of edges: single bond and double bond. You have a very limited set of vertex labels, and there is a very low number
of specific patterns that occur, like some cycles. So, of course, you can tune the algorithm to molecular databases, but we wanted to make a general approach fitting all data sets, and for that we had no real data set.