CATAPULT: Data-driven Selection of Canned Patterns for Efficient Visual Graph Query Formulation
This is a modal window.
The media could not be loaded, either because the server or network failed or because the format is not supported.
Formal Metadata
Title |
| |
Title of Series | ||
Number of Parts | 155 | |
Author | ||
License | CC Attribution 3.0 Germany: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor. | |
Identifiers | 10.5446/43067 (DOI) | |
Publisher | ||
Release Date | ||
Language |
Content Metadata
Subject Area | ||
Genre | ||
Abstract |
|
00:00
Visual systemGraph (mathematics)Query languageMagneto-optical driveData managementDigital filterWeb pageFormal grammarQuery languageDatabaseInterface (computing)Graph theoryAudiovisualisierungGraph (mathematics)Query languageContext awarenessXML
00:17
Digital filterGraph (mathematics)Query languageFormal grammarWeb pageQuery languageGroup actionTask (computing)Error messageInterface (computing)Directed setVisual systemClassical physicsSoftware maintenancePortable communications deviceProteinRepository (publishing)Physical systemTime domainSource codeConstructor (object-oriented programming)Query languageSet (mathematics)Interface (computing)Variety (linguistics)1 (number)Power (physics)NumberInterface (computing)Graph (mathematics)MathematicsTerm (mathematics)ProteinSelectivity (electronic)Multiplication signSoftware frameworkDomain nameTask (computing)Pattern languagePhysical systemError messageGroup actionConnectivity (graph theory)Formal languageAudiovisualisierungExpert systemCodeProgrammer (hardware)Formal grammarData structureDirection (geometry)Electronic mailing listContent (media)Bit2 (number)SoftwareDirected graphGraph theoryConstructor (object-oriented programming)Pairwise comparisonCartesian coordinate systemDampingCompilation albumSoftware developerDegree (graph theory)CASE <Informatik>DatabaseDifferent (Kate Ryan album)Cellular automatonMusical ensembleRight angleFitness functionWritingConformal mapCondensationComputer animation
05:51
Electronic visual displaySpacetimePattern languageQuery languageGraph (mathematics)Division (mathematics)Graph theorySubgraphPareto distributionMathematical optimizationAddress spaceWeight functionMultiplicationFunction (mathematics)Bit ratePhysical systemGroup actionBoom (sailing)Web pageLocal ringStructural loadSoftware frameworkManufacturing execution systemPrice indexPartition (number theory)Similarity (geometry)Algebraic closureInterface (computing)Graph (mathematics)Physical systemPattern languageSpacetimeUtility softwarePareto distributionGraph theoryMathematical optimizationMaxima and minimaMultiplicationElectric generatorFunctional (mathematics)SummierbarkeitSheaf (mathematics)Different (Kate Ryan album)DatabaseReal numberSelectivity (electronic)Query languagePartition (number theory)Set (mathematics)SoftwareEntire functionWeb pageDirected graphElectronic mailing listTotal S.A.Algebraic closureVector space modelStructural loadNumberSubgraphAudiovisualisierungSoftware frameworkConstructor (object-oriented programming)Configuration spaceData structureCognitionSampling (statistics)Object (grammar)Domain namePoint (geometry)Portable communications deviceGoodness of fitState observerLikelihood functionTerm (mathematics)2 (number)Multiplication signCAN busBitoutputContext awarenessSubject indexingRevision controlCodeGroup actionContent (media)Limit (category theory)Social classDuality (mathematics)CASE <Informatik>Uniformer RaumComputer animation
11:19
Weight functionPattern languagePartition (number theory)Graph theorySimilarity (geometry)Graph (mathematics)Algebraic closureWindowVertex (graph theory)Extension (kinesiology)Element (mathematics)Attribute grammarDatabaseStrategy gameWeightStatisticsParameter (computer programming)Sound effectSubgraphPairwise comparisonGraphical user interfaceObservational studyScalabilityCognitionConstructive solid geometryQuery languageExecution unitRange (statistics)Total S.A.Population densityTexture mappingRandom numberVariety (linguistics)Data structureInclusion mapInterface (computing)Visual systemFocus (optics)Software maintenanceGraphical user interfaceSocial classMereologyGraph (mathematics)Electric generatorPattern languageQuery languageFile formatWeightRange (statistics)Random number generationNumberVertex (graph theory)VirtualizationAttribute grammarAsynchronous Transfer ModeExtension (kinesiology)DatabaseMultiplication signComputer fileStructural loadMaxima and minimaSubgraphCognitionObservational studyCuboidMixed realityTerm (mathematics)Software testingReduction of orderVarianceResultantData structureRandomizationStrategy gameTotal S.A.Variety (linguistics)Set (mathematics)Algebraic closureSelectivity (electronic)Constructor (object-oriented programming)Pairwise comparisonSoftware frameworkRow (database)Parameter (computer programming)StatisticsSoftware maintenanceSound effectMortality rateBitInformation securityGame theorySampling (statistics)Logical constantBit ratePerturbation theoryProcess (computing)CodeComputer virusFood energySinc functionRandom walkPresentation of a groupCompass (drafting)Compact spaceState of matter10 (number)Graph coloringMachine visionInterface (computing)Limit (category theory)Content (media)Group actionIndependence (probability theory)Disk read-and-write headMatching (graph theory)Sheaf (mathematics)Control flowComputer animation
Transcript: English(auto-generated)
00:02
I'm gonna talk about Catapult, a data-driven approach of creating visual graph query interfaces. So let me dive right into it. The way we interact or search graphs in the context of database research is basically through formal query languages.
00:21
So we have SPARQL queries, or SPARQL language or Cypher, which are the popular ones where we can formulate queries using these languages. And here is an example of a query written in SPARQL over biological networks. This is fine for people who are in this room, but if you consider graphs being ubiquitous,
00:43
ubiquitous in most of the different applications domain, then this is a pretty hard query to write for a biologist or somebody who is working in that domain. Basically, to be very honest, they are not going to write this because they have to learn the syntax of SPARQL. Okay, so it leads to frustration.
01:00
Okay, and then this is the first step of interacting with the database, but then the first step itself create a problem. A popular way of elevating this challenge is to give direct manipulation interfaces. Basically, you visually present the task concepts, and then these are easy to learn, easy to retain, and allows errors to be avoided.
01:21
And basically, it appears to the novice users, intermittent users, expert users, and so on. And indeed, in the real life, these are two interfaces I'm showing you to write from commercial data sources. The first one is the PubChem, which is a large collection of smaller,
01:41
medium-sized chemical compounds, and then the below one is DrugBank. Both of them provides these interfaces. And you can see these interfaces, they formulate query by dragging and dropping on the query panel without writing any code in query languages. The classical way of constructing these interfaces is that you will sit with,
02:02
there's a bunch of programmers who will sit with the domain experts, and then construct the different components of these interfaces manually by writing code, right? So this has always been the way we created graph query, any query interfaces, but in specifically for this stock graph query interfaces, right? But in this case, the problem is
02:22
because you're doing it manually, the labels and patterns used in these interfaces are hard coded, they have limited variety depending on limited understanding of the data sources, which means it may not lead to efficient query formulation Second, if your data sets changes or evolves, the interface does not evolve automatically
02:40
because these are manually coded. So you have to do it manually. And the third one is, let's say you are going to replace this chemical compound data sets with something else like protein structures, then your interface is totally useless because these patterns don't make sense in a different data sources, right? So you have to scrap this and write it from scratch again. So what in this world we are trying to tell you
03:03
is that the way we can do this is in a data driven manner. So that means you have a data source, you have the interface structure, and we can fill up this content automatically and judiciously in a data driven manner. So entire interface is constructed in a data driven manner.
03:20
So what is the advantage of this? First of all, it leads to efficient formulation of queries. It becomes portable, and you can maintain these automatically, okay? So the goal of this work is to basically talk about this in a little bit more detail, okay? So here is my general structure of a visual query interface.
03:41
You can see the panel two is basically a list of labels, distinct labels in a graph, in a graph data source. Panel five and six are basically small patterns which you can use, you can drag and drop to formulate a query, we refer to the scan patterns. And the panel in the middle, panel four, is where the user takes actions to formulate the query.
04:03
So you can imagine that panel two and panel five and panel six are basically data driven. They are basically containing, they're reflecting data on which this interface is built. Whereas panel four is user actions, depends on user actions. So in this talk, we are going to talk about
04:20
how to construct, automatically select these small patterns or canned patterns from the data sources, which will enable you to formulate queries. What is the benefit of this? First is it gives you efficient query formulation, primarily because we make much judicious selection of these patterns, otherwise if you have done it manually
04:41
and we compare it with real commercial systems. And efficient in terms of the number of steps you take to formulate the query, as well as the time you take. Okay, second is by doing this, you can slap this interface on top of any data sources like this and it will automatically fill up the content. So it makes you portable, distinct portable
05:02
across different sources and domain. Okay, and our broad goal is going towards one constructor fits all, so you change your data sources, you change your domain, doesn't matter. If I have this particular framework running, which we call as Aurora, then it will automatically
05:21
fill up the contents of your interfaces and you can go ahead and formulate your queries. Okay, why this is important, or why efficient query formulation matters? So I just want to highlight here by quoting Ben Shneiderman. What he said was computing time power is getting cheaper,
05:41
but user's time isn't. So user's time and experience matters, especially with comparison to query languages and so on. So what makes a good pattern set, can patterns we want to discover? First of all, you're looking at patterns which have high coverage of the data sources.
06:00
Okay, high coverage of your underlying data. Second is the pattern should be diverse. That means you should not have structurally similar patterns on your UI because you have limited space. Third one is this pattern, because you're going to look into this pattern when you formulate queries, and this pattern should have low cognitive load on you. Otherwise you will have to figure out whether, and spend substantial amount of time figuring out
06:22
whether this pattern is useful for your query. So just give you an example, take a look at this two pattern. These are small subgraphs, but you can see the first one is much more dense than the second one. So the second one you can easily, instantaneously you know the relationship between the nodes. But the first one you have to a little bit
06:41
look into carefully and figure it out because they're edge crossing, the graph is denser. So the first one intuitively has higher cognitive load than the second one. And if you consider this pattern, then the second and the third patterns are structurally very similar. There's no point having both these patterns on your UI because you're basically giving very similar patterns
07:00
and wasting the limited space you have. So putting this together, I'm going to introduce a problem. What we are trying to do is that given your graph data sets, which is a collection of small or medium sized data graphs, a pattern budget is basically specify the minimum size and the maximum size of patterns you want on your UI
07:21
and the total number of patterns. The CAN pattern selection problem is to find a set of CAN patterns from this data source D. What is the goal here? We want to maximize the coverage in terms of label coverage as well as structural coverage. We want to maximize diversity of patterns and we want to minimize the cognitive load.
07:41
And this is the objectives we have. And we want to make sure that the size of the pattern is greater than two because we are looking at, in this particular paper, looking at patterns greater than two ages to fill up the panel, okay? Now if you look into this, it is basically, you can see that it may be difficult to achieve Pareto optimality
08:00
due to conflicting objectives. So a traditional approach to address this problem is varied sum based approach. But in this context, it's very hard to know what are the weights of priority. So we use a multiplative utility function which is given here below and which is what we're going to optimize. Okay, so let me give you a brief overview of how this entire system look like by the time.
08:23
Aurora is a novel data driven visual subgraph query interface construction framework for a collection of small or medium sized data graph. Panel one allows user to generate useful patterns that supports query formulation and to load previously generated patterns.
08:41
Panel two lists all vertex labels on the data graph. Panel three is the query construction panel. Panel four and five contain the basic and generated patterns of the data graph respectively. Aurora leverages catapult to automatically populate panels two, four and five. We shall demonstrate pattern generation using catapult
09:02
with real world networks. In the first example, we use the PubChem dataset and configure catapult to generate 10 patterns between sizes three and 12. The generated patterns can be configured to display in different formats such as in a single page
09:23
or to be grouped by size. Now we shall demonstrate the portability and data driven nature of catapult using a different data graph. We now generate five instead of 10 patterns for the AIDS dataset. Observe that we use the same interface for this dataset.
09:43
Panels two, four and five are updated accordingly based on the input dataset. Note that the patterns of AIDS is different from that of PubChem. In summary, catapult provides a common framework to generate canned patterns of low cognitive load and high diversity from any source and domain centered
10:01
on small or medium sized graphs. To handle the competence section problem, we present the catapult framework giving a graph database that contains large set of small or medium sized graphs. Small graph clustering is performed to divide these graphs
10:23
into a set of topologically similar partition. Each partition is also known as a cluster. So small graph clustering consists of K-means based clustering and graph structure based clustering. K-means based clustering adopts frequent sub-tree to represent each graph as a feature vector
10:42
and then utilize the feature vector to perform K-means method. There is a likelihood that the cluster generated by K-means method are too large and hence, expensive to process. In this case, graph structure based clustering is performed.
11:04
Note that sampling is used here to handle larger dataset. Next, closure summary graph generation is performed. Each cluster is summarized using graph closure to generate a set of closure summary graph.
11:22
Finally, the k-pattern selection adopt weighted random work to generate a set of k-patterns and these k-patterns will be displayed on the GUI for query formulating. So we should now go into some details of catapult.
11:40
CFG generation is to generate a set of closure summary graph by taking following step. Firstly, it performs two-step clustering, namely cost and file clustering on data graph. Then, it extends each graph in every cluster by inserting dummy vertex or edge accordingly. We call this graph as extended graph.
12:06
Followed by performing element-wise union of attribute value of each vertex and edge of extended graph pairs. Sampling-based strategy is used here to handle very large graph dataset. K-pattern selection aims to select pattern
12:23
directly from CFG using random works by taking following four-step. Firstly, it assigns coverage-based weight to each CFG. Then, it performs random works on the weighted CFG, followed by generating candidate k-pattern
12:40
using statistics from the random works. Finally, it greatly selects k-pattern using pattern score based on the coverage, diversity, and cognitive load. The experimental study covered effects of sampling and variance parameter, comparison with commercial GUI, and the frequent subgraph-based technical.
13:03
User study and the Scarpitti study will discuss two experimental results and their setup. We test Calport on three real-world datasets, namely Aids, PopCane, and Emonucule. And we test Calport against two commercial GUI,
13:21
PopCane GUI and Emonucule GUI. The patterns displayed on these two GUI are manually selected. In addition, we will compare Calport with frequent subgraph-based selection method. For each dataset, 1,000 subgraph query in size range of four to 14 are randomly generated.
13:45
In terms of parameter setting, mix mode and maximum pattern size is said to be three and 12. Maximum class size is 20. K in K-means clustering is a ratio of database size to maximum class size.
14:01
And number of K-pattern is 30. The PopCane GUI contains terra patterns in size range of three to eight, where 11 of these patterns are unlabeled. Emonucule consists of six unlabeled patterns in size range of three to eight.
14:22
Since PopCane and Emonucule use unlabeled patterns, and vertex of Calport's K-pattern are labeled, so we should adopt vertex relabeling step to assign a common label to all vertex of query and the pattern. So vertex relabeling step is to simulate user
14:41
assigning correct label to vertex. There are three performing measures, namely the step reduction ratio, mu g, missing percentage, m t, and the cognitive load, c o g. The results are shown as following, where the green underline and the red box are good results. Compared to Emonucule, Calport has superior c o g,
15:03
mu g, and m t. Compared to Calport, PopCane has superior m t. Since unlabeled vertex relax vertex mapping, however, Calport still performs superior in terms of c o g and mu g.
15:20
We further conduct a user study to see if Calport can lead to more efficient query formulating. So query set consists query with variety of structure and vertex label. Each query was formulated by five different user.
15:42
A total of 25 participants, each from 20 to 30, were on trial for this study. Each was given a query of random order and was asked to make a maximum use of can pattern to construct the query. Time and steps taken to construct the query was recorded.
16:03
From the experiment, we found that Calport required less query formation time and fewer steps. In summary, we present a novel framework towards data-driven construction of virtual query interface.
16:22
The framework focused on can pattern generation. It can make virtual query formation more efficient by exposing pattern that are large, diverse, but lower the cognitive load and it can enhance probability. As part of our future work, we will explore incremental maintenance of GUI
16:42
with evolving data. Thank you for your attendance.