Using Neo4j for exploring the research graph connections made by RD-Switchboard
Video in TIB AV-Portal
Formal Metadata
Title | Using Neo4j for exploring the research graph connections made by RD-Switchboard
License | CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Release Date | 2016
Language | English
Content Metadata
Abstract | Jingbo Wang (NCI) and Amir Aryani (ANDS) present the Neo4j queries that can help data managers to explore the connections between datasets, researchers, grants, and publications using the graph model and Research Data Switchboard. In addition, they discuss the recent paper on "Graph connections made by RD-Switchboard using NCI’s metadata", presented in the Reproducible Open Science workshop in Hannover earlier this month.
00:01
Good afternoon everyone, and welcome to the webinar this afternoon. My name is Amir Aryani; I'm working for ANDS, and with me today is Jingbo Wang from NCI, who will be the co-presenter in this talk. We are going to talk about the Neo4j technology that is used as part of the Research Data Switchboard. I will give some background to the talk, and then Jingbo will talk about the NCI implementation of this technology.
00:33
So, the agenda for the talk: we start with the background on the Research Data Switchboard and the Research Graph, which is the data model behind it. Then we'll talk about the Neo4j queries and spend something like 10 to 15 minutes on the technical side of things, and then we move to the NCI implementation. At the end we will have time for questions and answers. On the background:
00:58
this work started from the challenge of cross-platform discovery of research data. It goes back to 2014, in a Research Data Alliance working group, when we had the problem of finding related connections.
01:12
Some of you might have seen this slide. It is actually the first slide in this work, and I have kept using it because it shows the early stage of the problem. Although we have solved this issue to a great degree, there are still many repositories that have it: if you have a dataset and you want to know what else in scholarly communication can be linked to it, the usual way of finding this information is keyword search, which is not an efficient way. In this case, for example, on the ANDS page in 2014 we had a dataset, and the page had a link to related records that ran a keyword search on the title and the keywords. The problem was that the results coming back from that query included a lot of false positives: in this result we have more than 1,000 records connected to the dataset. This was supposed to be a recommendation to the researcher of information related to this dataset, but in practice it is not useful to have so many recommendations. One of the initial ideas behind this work was how Amazon or other retail stores do this, looking at their "also" service: if you look at a book, they tell you that you may want to look at these five other books by the same author or publisher, or that people who purchased this book also purchased these other books. Usually those recommendations are very precise and really limited: they give five options, not a thousand options, to their end users. In this context we started the working group, and these are the initial partners
03:00
that joined the group at the time, basically to address the cross-platform discovery challenge that we had. [inaudible] Over time we opened the doors and other partners joined us [inaudible], and there are also other universities who have been involved in this project, but this was the core in the early stages of the working group. Now, this is under the Research Data Alliance,
03:31
and usually when I get to this part of the presentation I talk about the structure of the Research Data Alliance. I think in the scope of this talk we don't have that much time for this, so I will just note briefly that the Research Data Alliance is a joint venture by the main funders that invest in research data infrastructure. The main goal is that people who work on different projects and need to coordinate and collaborate can form working groups, and each working group runs almost like a project [inaudible]. In this environment, in the Research Data Alliance, we had a working group which started in 2014 and concluded its main deliverables in 2015, and at the moment we are continuing to maintain the work and extend the platform. Now, the working group's main recommendation,
04:31
after multiple prototypes, was that datasets can be connected using the co-authorship model. Although in principle this is a simple idea, in practice it requires connecting information across different infrastructures. When we were doing this at the time, the first stage of the process was looking at how this can be done by librarians. We actually went through the process of asking a librarian, when we had cases, to look at this process and tell us what they would do, so we learned from their practice. So in this context, this
05:12
is [inaudible] a screenshot of the same dataset from 2014 [inaudible]. This dataset is from the University of Sydney. When you, as a researcher or a librarian, look at this dataset, you can identify the researcher who authored or contributed to it. You can then search for that person in Google, and you can
05:42
find the researcher's profile page, including publications. When you go to their
05:49
publication list, you can, if you want, read every single paper, and in the
05:54
content of those papers you will find
05:56
datasets in other repositories that have been cited or mentioned. In this case you have a dataset from the Dryad repository.
06:04
Now, in the Dryad repository you can actually see the same researcher under a different name variation, and the publication is basically disconnected from the University of Sydney. I have just put the two together in this slide to emphasize the connection, but in practice, when you look at Dryad alone, you don't know that this work is connected to a researcher at the University of Sydney. So if we go through the chain of all of these connections, we get something that goes from
06:34
a dataset in ANDS, to a researcher, to an article in PLOS ONE, to datasets in the Dryad repository. Now, the first step of the work was to actually demonstrate that this works. We went through about 250 collections at the time [inaudible] and established links for them by hand. It is understandable that you cannot do this manually with every new dataset, so in the end we decided to use machines. In this context,
07:07
the goal was to have a solution without spending too much time on research or on inventing new standards. So we adopted the standards and tools from other groups and other platforms, and we tried to implement something that is simple for others to adopt and also easy to maintain. The structure basically has three different layers.

The first one is the harvesting layer, which reads from the repositories: [inaudible] is the most obvious source, but there is also a series of other sources from different repositories [inaudible]. The list of sources is in the working group material, and I believe I have a link for this further on in the slides. The harvesting layer puts the information onto a set of machines; in our case we implemented that on Amazon, but you can select other cloud or high-performance computing platforms. The main requirement for those platforms is that they should be able to run Java programs, because everything is implemented in Java.

What those programs do is basically read information from all of those sources and connect them together wherever a connection is possible. Where there are identifiers, it uses the CrossRef integration to get the metadata for DOIs, it does the same for DataCite, and for ORCID it uses the ORCID API integration. When we have a grant or a paper, we search within certain university domains to find profile pages. It uses some level of disambiguation, and then it goes into the linking layer, which goes across the graph and links the nodes. There is an inference component for those connections to happen. Again, I cannot go into the details of this; there is a document, which I will include in the presentation, that talks about the relationship called "known as". If you find two different elements, and those elements either have the same identifier [inaudible] or have the same title with similarity in other elements, they are marked as "known as" elements, and there is a link between those nodes [inaudible].

Now, the point is that all of that information comes together in one database, and in the case of our project we use Neo4j. The main reasons for that were, first, the simplicity of implementation: it was very easy to hire programmers who can code in Java and program for Neo4j. And second, the performance: the speed of querying the database in Neo4j is much, much quicker than if you implement something [inaudible]. So Neo4j was the main point of aggregating all of these connections for us. In the database access layer we have something called metadata harmonization, which basically means harmonizing the property names as much as possible [inaudible]. We ended up with a graph model that we called Research Graph at the time.

Now, there are characteristics of this model which are different from other models. First of all, it is mainly based on URIs: wherever it is possible, we convert identifiers to URIs. An example of that is converting a DOI to a URI, or converting a grant ID to a URI; everywhere we had the option, we converted the identifier to a URI. That enables the scalability of this graph to a distributed graph across multiple platforms. The other thing is that we have a separate relation object. Some of you are familiar with RIF-CS: there we have parties, we have data, which is a collection, we have services and we have activities. In this model we also have a relation object, and the main advantage of this is that it enables connecting nodes to URIs that don't actually exist yet or are not persistent. An example of this: if you are looking at a data repository at a university, a record might have an ORCID as a related identifier. In this model we just put a relation object there that says this record links to that ORCID, and we don't have to resolve it; it sits there until the time that we can resolve the record for that identifier. The graph system handles this as a bidirectional relationship between nodes.

Now, I am not supposed to talk too much about the architecture, because we have a lot to talk about in the actual Neo4j queries, so I will just quickly wrap up this topic there.
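The two model characteristics described above (identifiers converted to URIs, and relation objects whose target may not be resolved yet) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the RD-Switchboard implementation, which the talk says is written in Java: the `Graph` class, the `doi_to_uri` helper, the `relatedTo` label, the DOI and the ORCID below are all hypothetical.

```python
from dataclasses import dataclass, field


def doi_to_uri(doi: str) -> str:
    """Normalise a DOI in any common spelling to a single URI form."""
    doi = doi.strip()
    for prefix in ("https://doi.org/", "http://doi.org/", "http://dx.doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    # DOIs are case-insensitive, so lower-casing gives one canonical key.
    return "https://doi.org/" + doi.lower()


@dataclass
class Graph:
    nodes: dict = field(default_factory=dict)      # URI -> properties
    relations: list = field(default_factory=list)  # (from_uri, to_uri, label)

    def add_node(self, uri, **props):
        self.nodes[uri] = props

    def relate(self, from_uri, to_uri, label="relatedTo"):
        # The target URI does not have to exist as a node yet.
        self.relations.append((from_uri, to_uri, label))

    def unresolved(self):
        """Relation targets that no harvested node matches so far."""
        return sorted({to for _, to, _ in self.relations if to not in self.nodes})

    def neighbours(self, uri):
        """Follow relations in both directions, once both ends exist."""
        found = set()
        for a, b, _ in self.relations:
            if a == uri and b in self.nodes:
                found.add(b)
            if b == uri and a in self.nodes:
                found.add(a)
        return sorted(found)


g = Graph()
dataset = doi_to_uri("doi:10.1234/EXAMPLE")        # hypothetical DOI
orcid = "https://orcid.org/0000-0000-0000-0000"    # placeholder ORCID
g.add_node(dataset, type="dataset", title="Example dataset")
g.relate(dataset, orcid)
print(g.unresolved())        # the ORCID is still waiting to be resolved
g.add_node(orcid, type="researcher")
print(g.neighbours(orcid))   # now the link works from the researcher side too
```

Keeping the relation as its own object means the harvester never has to drop a link just because the other end has not been loaded yet.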
12:21
There are also examples of multiple degrees of separation, just to show how two datasets can be connected. For example, a dataset has a contributor, who is also an author and has published a paper, and that paper cites another dataset: this is what we call three degrees of separation. We have multiple examples of this in different slides.
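Counting degrees of separation is a shortest-path count over the graph. The sketch below does it with a breadth-first search over a tiny in-memory edge list; the node names are made up for illustration, and edges are treated as undirected, matching the bidirectional relationships mentioned in the talk.

```python
from collections import deque


def degrees_of_separation(edges, start, goal):
    """Return the number of hops on the shortest path, or None if unconnected."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, distance = queue.popleft()
        if node == goal:
            return distance
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, distance + 1))
    return None


# dataset A -> contributor R -> paper P -> dataset B (hypothetical records)
edges = [
    ("dataset:A", "researcher:R"),
    ("researcher:R", "publication:P"),
    ("publication:P", "dataset:B"),
]
print(degrees_of_separation(edges, "dataset:A", "dataset:B"))  # 3
```

Inside Neo4j the same question would normally be asked with a variable-length pattern or Cypher's `shortestPath`, for example `MATCH p = shortestPath((a)-[*..3]-(b)) RETURN p`, with `a` and `b` bound to the two datasets.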
12:46
This is a link to the Research Graph model. In the interest of time I think I will skip this slide; there are a lot of elements there [inaudible]. If we have time at the end of the webinar and you have questions about it, I can come back to it [inaudible]. The slides are
13:04
available, so you can actually download them and follow the links [inaudible]. This brings us to the next part of
13:12
the talk, which is about Neo4j. That was actually the main motivation for this webinar. In the Research Data Switchboard project, we implemented the Research Data Switchboard [inaudible] at institutions in Australia: NCI has adopted it, [inaudible] university has adopted it and is using it, and in Europe we also have multiple partners using this technology. Now, we came across the same questions again and again, different queries that people ask: how can I do this, how can I find my dataset using a DOI, how can I find all the datasets that are from this particular publisher? So in this part of the presentation I want to walk through some of these scenarios.
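Questions of this kind map to short Cypher queries. The sketch below builds such queries as parameterised strings in Python. The label `dataset` and the property names `doi` and `publisher` are assumptions made for illustration; the actual labels and properties depend on the Research Graph schema loaded in a given database.

```python
def find_by_property(label, prop, value):
    """Build a parameterised Cypher query that matches one property exactly."""
    # Labels and property names cannot be sent as Cypher parameters,
    # so they are validated before being placed in the query text.
    if not (label.isidentifier() and prop.isidentifier()):
        raise ValueError("label and property must be plain identifiers")
    query = f"MATCH (n:{label}) WHERE n.{prop} = $value RETURN n"
    return query, {"value": value}


# "How can I find my dataset using a DOI?" (hypothetical DOI)
query, params = find_by_property("dataset", "doi", "10.1234/example")
print(query)
print(params)

# "How can I find all the datasets from this particular publisher?"
query, params = find_by_property("dataset", "publisher", "Example Publisher")
print(query)
```

With the official Neo4j Python driver, a query and parameter pair like this can be passed to `session.run(query, params)`; the same query text can also be pasted into the Neo4j browser with the value written inline.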
13:57
Now, one thing about Neo4j is that it has a graph browser. For the purpose of this presentation, and also for the Research Data Switchboard, we have an extended version of this browser, which has some built-in functionality and queries related to scholarly works. Basically, you can think about this as an extended graph explorer for Neo4j. In this environment, one of the things that you can do, as an example of these queries, is search for a dataset. What I'm going to do is search for the same dataset that we just looked at in our example. So here [inaudible] from the slides. So what
14:49
I said is: give me a dataset whose DOI is this one, and the DOI I just copy and paste. When I run it, it basically
14:57
comes back with a node [inaudible]. If you click on the record you get its content: it tells you this is a record from Dryad, this is the title, and these are the authors. What we can do in this environment is expand the node, either by double-clicking on it or by clicking the little expand button. Here you get the other information: there are other datasets in Dryad linked to this one; there is a paper, which is from PLOS ONE, and this was also in one of the slides; and this is a researcher. In this environment I can keep expanding the nodes and walk further and further into the
15:42
graph. So here I can see all the publications for that
15:48
record, for the researcher [inaudible], and then all the grants as well. For these grants, some of them actually have connections, and I believe one of them is also connected to another dataset [inaudible]. Now, in this environment, if I want to look at the metadata of a record, I can expand this panel at the bottom of the page, and here I can see the title; this is the first dataset [inaudible]. So, back to the point of [inaudible]: here I have a dataset that
16:22
leads to a researcher and goes all the way back to the initial dataset. Now, in
16:28
this environment you can run a set of queries, and I actually have a list of
16:33
things that we are going to look at. So
16:36
we're going to look at how to find a dataset; how to find a publication, grant or researcher; how to find a node by DOI; how to find datasets that have a DOI; how to find DOIs using a prefix; how to find highly connected datasets, that is, using the number of edges in the graph; connections with multiple degrees of separation; and how to find the shortest path between two nodes. Now, it might end up being a bit overwhelming to go into all of these queries. What I will do is go through some of them; the slides will be available online, so you can go and try them, and then send an e-mail and we can have an offline conversation about the syntax of the queries. We have already checked finding a dataset by DOI, but you can also find it by title. The way it works is that all nodes in our Research Graph model have a property called title, and in this case the title for the dataset is the one that we get from the repository. This is a simple query. You can do the same thing for a publication: you can get the publication's record. This is the publication query [inaudible]; looking
17:55
at [inaudible] here, I can actually get this record from
18:01
the database. Now, one thing I want to point out is that, as you see, the query by DOI can actually take longer than usual because of the size of the graph in the database: this one has about 6 million nodes [inaudible]. What I have just done here is ask for a string search in a graph database. Some of us are familiar with the architecture and design of databases: graph databases are not designed for string search [inaudible]. In this case it has to go through 6 million nodes to find the one that matches. But you can actually do this much better by making the query more precise. If you know where the record is from, we can just add the source here, and also we can add a
19:03
LIMIT 1 at the end. What it does is say: the first one that you find, come back; don't go for the rest, because I know there is only one instance of this. So now, if I hit the button, it comes back immediately. So the way you write a query has a direct impact on the performance of the graph database. Now, this is another example, of how one can
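The performance point made here (narrow the match with labels and stop at the first hit) can be sketched roughly as follows. This is only an illustrative Python helper that builds the Cypher text and its parameters; the label names `dataset` and `dryad`, the property name `doi`, and the example DOI are assumptions for illustration, not confirmed details of the Research Graph schema. The `$doi` parameter form is the one used by current Neo4j releases; older releases wrote parameters as `{doi}`.

```python
import re

def dataset_by_doi(doi, source_label=None):
    """Build a Cypher lookup for a single dataset node by DOI.

    Labels narrow the set of nodes that has to be scanned, and LIMIT 1
    tells the database to stop at the first match.
    """
    labels = ["dataset"]  # assumed type label
    if source_label:
        # Labels cannot be passed as query parameters, so validate the
        # name before interpolating it into the query text.
        if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", source_label):
            raise ValueError("unexpected label: %r" % source_label)
        labels.append(source_label)
    pattern = "(n:" + ":".join(labels) + ")"
    query = "MATCH " + pattern + " WHERE n.doi = $doi RETURN n LIMIT 1"
    return query, {"doi": doi}

# Hypothetical DOI, used only to show the shape of the result.
query, params = dataset_by_doi("10.1234/example", source_label="dryad")
print(query)
print(params)
```

The returned pair is in the shape most Neo4j client libraries expect (query text plus a parameter map), but no driver call is made here.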
19:27
find a publication by title: this is the same as for a dataset, so there is nothing complex about that. This is another one, a record from CERN, and the publication has a title, and this is the one. Now, grants have another property that is useful, and that is called PURL, persistent URL. There is a structure for all the ARC and NHMRC grants that we have, which basically is this namespace with the grant ID at the end. Basically you have purl.org/au-research/grants, then /arc/ if it is an ARC grant, followed by the ARC grant number, or, if it is NHMRC, the NHMRC grant number. So I can copy this query here and paste
20:17
it. But before I run it, another thing: every query you type, you can hit the star and add it to your favorites. So I
20:26
have the list of favorites with all the queries that I want to run, just in case. So in this case I can go to this list and pick the same query, and just run it. Now this is
20:43
the result, and you can look at the content of the grant here. Now, in the same way, you can
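The grant PURL structure described here can be sketched as a small helper. The base URL follows what the talk appears to describe (purl.org/au-research/grants, then the funder, then the grant number); the `http://` scheme, the `grant` label, the `purl` property name and the grant number below are assumptions for illustration.

```python
PURL_BASE = "http://purl.org/au-research/grants"  # scheme is an assumption
FUNDERS = {"arc", "nhmrc"}  # the two funders mentioned in the talk

def grant_purl(funder, grant_number):
    """Compose a grant PURL as <base>/<funder>/<grant number>."""
    funder = funder.lower()
    if funder not in FUNDERS:
        raise ValueError("unknown funder: %r" % funder)
    return "%s/%s/%s" % (PURL_BASE, funder, grant_number)

def grant_by_purl(funder, grant_number):
    """Build a Cypher lookup for one grant node by its PURL."""
    query = "MATCH (n:grant) WHERE n.purl = $purl RETURN n LIMIT 1"
    return query, {"purl": grant_purl(funder, grant_number)}

# "DP0000000" is a placeholder, not a real grant number.
print(grant_by_purl("ARC", "DP0000000"))
```

Because the PURL is an exact identifier, the lookup can again stop at the first match.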
20:48
also search for a grant by title. Obviously, for researchers there are a lot of options: you can search by first name and last name, you can search by ORCID, and you can also search by Scopus ID, because we actually get the Scopus IDs as well. So if you are looking for a researcher with this Scopus ID, you can find them, and the query would look like
21:15
this: you say, I want a researcher with the Scopus ID equal to this, and when I run this query, it returns the
21:24
node. Now, into something a lot more complicated: what we are going to do here is find connections. So I am not only looking for a particular node; I am looking for the nodes that satisfy these criteria. In this case we are looking for a record from Dryad that is linked to a record from ORCID. Now, one thing: if
21:45
you try to copy these Cypher queries from a PowerPoint presentation or other text editors, they fiddle around with the format of the dashes and other special characters, so just be mindful that when you copy and paste, sometimes the queries do not work. So, back to the topic. In this case we have a dataset and it is from Dryad, and the syntax basically says: I want only the nodes from Dryad that are linked to ORCID, and then I want a count of them. So this is actually the introduction of another syntax element, count: instead of returning the nodes, you get the number of nodes. In this case we have 1,231 Dryad datasets that are connected to publications that are included in ORCID profiles. We can also see what they look like with just 10 of those records, and I can expand one of them. When you look at the nodes in the Research Graph model, there are multiple labels for each node. The labels identify the source. In this case you have one publication which came from ORCID and CrossRef, which means we have two different sources of information that had the metadata for this record, and we merged them. In one of the presentations someone asked me what happens if there is a conflict between the metadata, say the title in ORCID is different from the title in CrossRef. The way we manage this is that there is a priority: basically, different sources have authority for different information. For example, for anything that is related to a DOI, if the DOI is registered with CrossRef, the information from CrossRef overrides the other nodes. So let's say in this case the title in the ORCID record might have been modified by the researcher; when we ingest this and then do the integration with CrossRef, that will override the ORCID information with what is provided by CrossRef for that DOI. I know this might actually lead to false negatives, but for us this was the most practical way to manage it. Now, this publication we can expand to the ORCID record, and that would be expandable further if the ORCID record were linked to other publications, which in this case it is not; so it is one ORCID record linked to one publication. Going back to the slides.
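The source-priority rule described here can be sketched in a few lines of Python. This is not the RD-Switchboard implementation; the priority list, the record layout and the titles are assumptions made up for illustration. It only shows the idea that records sharing a DOI collapse into one node carrying every source as a label, with the title taken from the most authoritative source.

```python
# Most authoritative source first for DOI-related fields.
# The ordering is an assumption based on the talk's CrossRef/ORCID example.
DOI_FIELD_PRIORITY = ["crossref", "orcid"]

def merge_publication(records):
    """Merge records that share one DOI into a single node-like dict."""
    dois = {record["doi"] for record in records}
    if len(dois) != 1:
        raise ValueError("records must share a single DOI")

    def rank(record):
        source = record["source"]
        if source in DOI_FIELD_PRIORITY:
            return DOI_FIELD_PRIORITY.index(source)
        return len(DOI_FIELD_PRIORITY)  # unknown sources rank last

    best = min(records, key=rank)
    return {
        "doi": best["doi"],
        "title": best["title"],  # the highest-priority source wins
        "labels": sorted({"publication"} | {r["source"] for r in records}),
    }

# Hypothetical records: the researcher edited the title in their profile.
records = [
    {"source": "orcid", "doi": "10.1234/example", "title": "My paper (edited)"},
    {"source": "crossref", "doi": "10.1234/example", "title": "My paper"},
]
print(merge_publication(records))
```

As noted in the talk, a rule like this is a pragmatic choice: the overridden value is simply dropped rather than reconciled.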
24:34
Now, this is another type of query that might be useful for many of us: how to find records from ANDS that are connected to ORCID. In this case I want to look for records from the University of Sydney
24:50
that were contributed to Research Data Australia and have a connection to ORCID. The only difference here is that I use a new element called ANDS group. Now, this is another characteristic of our graph model: some of the sources have a concept of group in their records, while the core graph model does not have the concept of group. What we do is that we have the core fields, but also, if repositories hold metadata in a specific field, we can just ingest that into the system. An example is what we did for NCI, ingesting additional nodes into the graph structure. The system is actually agnostic to the metadata, so basically in the graph you can have a hybrid model. So in this case you have group, which is a metadata element from ANDS, and I say, OK, that is equal to the University of Sydney, and it comes back with the records. So these are records from the University of Sydney in ANDS which have a connection to ORCID. For example, this is a dataset; I can expand this: this is a publication, and this is an ORCID record. So this is a triple again. So this is another
26:21
example. The useful thing here is that one of the things you can do in the graph database is search for properties. So I want to know all the datasets that have a DOI in our graph database system, and
26:39
you can count them: we will see how many datasets have a DOI, and the answer is 15
26:48
7 thousand records in our database with the property you want. Now you can replace this with other properties: let's say grants that actually have a PURL, and the answer is 45 thousand grants.
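Counting nodes that carry a given property, as described here, might look like the sketch below. The helper only builds Cypher text; the label and property names (`dataset`/`doi`, `grant`/`purl`) are assumptions for illustration. `IS NOT NULL` is used as the existence check because it is long-standing Cypher syntax.

```python
import re

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def count_with_property(label, prop):
    """Build a Cypher query counting nodes of `label` that have `prop`."""
    for name in (label, prop):
        # Labels and property names cannot be query parameters,
        # so they are validated before being placed in the text.
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError("unexpected identifier: %r" % name)
    return ("MATCH (n:%s) WHERE n.%s IS NOT NULL "
            "RETURN count(n) AS total" % (label, prop))

print(count_with_property("dataset", "doi"))
print(count_with_property("grant", "purl"))
```

Swapping the two arguments is all that is needed to move from datasets with a DOI to grants with a PURL.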
27:03
Now, the other example of a query that we were asked about, especially by our European partners, was finding datasets by DOI prefix. Basically the goal of this is that you want to find the datasets or publications with DOIs from a particular publisher. So
27:21
this actually uses a syntax called regular expressions. In Neo4j you put a tilde after the equals sign, and then you put dot-star at the end, which means that everything after that is acceptable. I actually put a limit of 10 at the end, and this returns 10
27:41
records. I can just pick up one of them, and it has a DOI that matches my criteria. Now, another example
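The prefix search described here (the `=~` operator with `.*` at the end) can be sketched as follows. The prefix `10.5061/` and the label and property names are assumptions for illustration. One detail the sketch adds: the dot in a DOI prefix is itself a regular-expression wildcard, so the prefix is escaped before `.*` is appended.

```python
import re

def doi_prefix_pattern(prefix):
    """Regular expression matching any DOI that starts with `prefix`."""
    # Escape the prefix so "." is literal; ".*" then accepts the rest.
    return re.escape(prefix) + ".*"

def datasets_by_doi_prefix(prefix, limit=10):
    """Build a Cypher regex query; the pattern travels as a parameter."""
    query = ("MATCH (n:dataset) WHERE n.doi =~ $pattern "
             "RETURN n LIMIT %d" % int(limit))
    return query, {"pattern": doi_prefix_pattern(prefix)}

query, params = datasets_by_doi_prefix("10.5061/", limit=10)
print(query)

# Local check of the pattern. Cypher's =~ must match the whole string,
# which corresponds to re.fullmatch here.
print(bool(re.fullmatch(params["pattern"], "10.5061/dryad.abc")))
```

For a plain prefix test, Cypher's `STARTS WITH` operator is an alternative that avoids regular expressions altogether.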
27:56
here would be how we can find highly connected datasets. So things are getting more and more complicated, and I have good news: I have only two more complicated queries to go. So here, what we have
28:10
done is rank all the datasets based on the number of connections that they have. We want to find the ANDS datasets that have the most connections. So I say: OK, return the key and the title, and I also want the number of connections that each one has, sorted by the number of those connections in reverse order. It goes into the system and comes back quickly. So here we find all the datasets that we have, ordered by the number of connections linked to them. If you look at this particular node, you end up with 757 nodes linked to it. Now
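Ranking nodes by their number of connections, as described here, can be sketched in two ways. The Cypher string is an assumed reconstruction (the labels `dataset` and `ands` and the properties `key` and `title` are illustrative, not taken from the slides), and the Python function performs the same degree count on a tiny made-up edge list.

```python
from collections import Counter

# Assumed Cypher equivalent: count relationships per dataset, largest first.
TOP_CONNECTED = (
    "MATCH (n:dataset:ands)-[r]-() "
    "RETURN n.key AS key, n.title AS title, count(r) AS connections "
    "ORDER BY connections DESC LIMIT 10"
)

def rank_by_connections(edges, top=10):
    """Count the edges touching each node and return the busiest first."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return degree.most_common(top)

# Made-up graph: ds1 touches three other nodes, pub1 touches two.
edges = [("ds1", "pub1"), ("ds1", "res1"), ("ds1", "grant1"), ("ds2", "pub1")]
print(rank_by_connections(edges, top=2))  # [('ds1', 3), ('pub1', 2)]
```

Unlike the Cypher query, the in-memory version ranks every node rather than only datasets; filtering by node type would be a small extension.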
29:01
this is the next of our queries: an example of finding connections with multiple degrees of separation. The application of this: someone might be thinking, how many connections do we have to Dryad? Well, we might have connections directly, but we might also have connections indirectly. An example of an indirect connection would be: I have a dataset that is connected to another record, which is connected to an ORCID, and that ORCID profile also has connections to Dryad. So the syntax for this would be something like this: it says, I want to get all the connections from 1 to 3 steps. So if I run the query like
29:57
this, and all I ask it to return is the title and the key, and only 25 of them, what it does is return the titles: this one is from NCI, and then we have the dataset that it found in Dryad. Now, the last of the queries.
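The "1 to 3 degrees of separation" idea can be sketched as below. The Cypher string is an assumed reconstruction using variable-length path syntax (`[*1..3]`); the labels and properties are illustrative. The Python function does the equivalent bounded breadth-first search on a small made-up graph.

```python
from collections import deque

# Assumed Cypher equivalent: any path of 1 to 3 relationships.
INDIRECT_LINKS = (
    "MATCH (a:dataset:nci)-[*1..3]-(b:dataset:dryad) "
    "RETURN DISTINCT a.title, a.key, b.title, b.key LIMIT 25"
)

def build_adjacency(edges):
    """Undirected adjacency lists from (a, b) pairs."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, []).append(b)
        adjacency.setdefault(b, []).append(a)
    return adjacency

def within_hops(adjacency, start, max_hops=3):
    """Return {node: hops} for nodes reachable from start in 1..max_hops."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in adjacency.get(node, ()):
            if neighbour not in hops:
                hops[neighbour] = hops[node] + 1
                queue.append(neighbour)
    del hops[start]
    return hops

# Made-up chain: NCI dataset -> publication -> ORCID researcher -> Dryad dataset.
edges = [("nci:ds", "pub"), ("pub", "orcid:res"),
         ("orcid:res", "dryad:ds"), ("dryad:ds", "extra")]
print(within_hops(build_adjacency(edges), "nci:ds", max_hops=3))
```

In this toy graph the Dryad dataset is found at exactly three hops, while the node one step further is left out.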
30:18
I found this one is interesting for a lot of people in the publication domain, when they are looking at two researchers or two different datasets and wondering whether these two are actually connected. What you can do in a graph database is search for something called the shortest
30:40
path. This is an example of the copy-and-paste problem: pasted into the browser, the characters have been modified and an error appears, so I should go back and fix it, because this is one of the queries we don't have in the favorites. We don't have this
30:53
one saved, so we have to fix it. So
30:56
what this query has done is say: OK, I have a dataset from Dryad with this key, and I have a dataset from ANDS with this key; find me the shortest path between these two. And it looks like this: these two datasets are linked through a researcher and a publication. Now, in this case you can replace a dataset with a researcher, or you can replace a dataset with a grant, to find how the two things you want are linked. Now, if you use the extended version of Neo4j with the Research Graph model, you actually get this tab here, which you can open, and it has a
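The shortest-path query described here can be sketched as follows. The Cypher string uses the built-in `shortestPath` function; the labels, the `key` property, the parameter names and the upper bound of 10 hops are assumptions for illustration. The Python function shows what the database is doing conceptually: a breadth-first search that returns the first, and therefore shortest, path found.

```python
from collections import deque

# Assumed Cypher equivalent (parameter names are hypothetical).
SHORTEST_PATH = (
    "MATCH (a:dataset:dryad {key: $from_key}), (b:dataset:ands {key: $to_key}), "
    "p = shortestPath((a)-[*..10]-(b)) RETURN p"
)

def shortest_path(edges, start, goal):
    """Breadth-first search over an undirected edge list.

    Returns the list of nodes from start to goal, or None if unreachable.
    """
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, []).append(b)
        adjacency.setdefault(b, []).append(a)
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:  # walk back to the start
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in adjacency.get(node, ()):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None

# Made-up graph: a 3-hop route via a publication and a researcher,
# plus a longer 4-hop detour.
edges = [("dryad:ds", "pub"), ("pub", "researcher"), ("researcher", "ands:ds"),
         ("dryad:ds", "x"), ("x", "y"), ("y", "z"), ("z", "ands:ds")]
print(shortest_path(edges, "dryad:ds", "ands:ds"))
```

As in the talk, the endpoints do not have to be datasets: any two nodes (a researcher, a grant) can be used as start and goal.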
31:40
template for all of these queries. So, for example, for the shortest path, I click here and it fills the box automatically for me; I only need to fill in the keys. We are extending this further: at the moment we have about 10 queries as examples, and we are planning to add more queries to this template. So this is basically the
32:04
last slide for these Neo4j queries. What we will do is, at the end of the presentation, we will have time for a discussion and for questions. The next part of the talk is by Dr Jingbo Wang from NCI. I did not actually introduce Jingbo properly. Jingbo is the data collection manager for NCI, which is located at ANU in Canberra. Her background is in earth science: she got her PhD in seismology. Basically, NCI is collecting information across different platforms, providing this ecosystem for researchers to make the research environment more efficient. With that introduction, I will hand this presentation over to Jingbo. Hello everyone.
33:16
I will use about 10 to 15 minutes to share my experience as a user of RD-Switchboard, and I want to show the graph connection experiments using our NCI metadata. This talk was also presented two weeks ago at the first Reproducible Open Science workshop in Hannover, Germany, and it got positive feedback. For people who don't know much about NCI: NCI is short for National Computational Infrastructure. We are a national-level centre, physically located on the Australian National University campus. From 2013, we received a big chunk of money to store research data, and the
34:11
motivation is that the data are getting bigger and bigger, from gigabytes to terabytes and even much larger, and especially in domains such as the environment it is growing so fast. An individual PC or hard disk cannot hold large-scale data, and transferring and sharing the data became a problem. That is why we got this funding and installed a large store, as one of the eight storage nodes that support the research data infrastructure. At the moment we have more than 10 petabytes of data, as you can see in this figure, and our data range from astronomy observations to satellite images and climate models for climate change research, down to the ground, like geophysics exploration, and even deeper, like geodynamic processing data. With the funding of
35:24
being one of the research data infrastructure nodes, we make use of this advantage to work on the data that we have collected, to make seamless connections across different disciplines. So, as you can see
35:41
here, we care about data formats because we want to make use of the HPC facility, and we care about providing open access to researchers. Because of the large scale of the data, it is impossible for them to download it to their local machine and do the processing; it is better to provide some kind of virtual environment for them to log in and do the processing at our centre. Because we have so much
36:08
data, we need to modernize the catalog so that people know what datasets are available at NCI. This is one of the common questions researchers care about. The catalog we built was based on the relationships between researcher, data, grant and paper. For example, if you look at the first line, it says a researcher used data 1, supported by a grant, and generated papers 1 and 2. Similarly, for each line we have one record. However, the obvious thing here is the redundancy: researcher B appears twice, data 1 appears twice, grant B appears twice. If every one of these is a node in our database, it creates a lot of redundancy. So the idea of adopting RD-Switchboard is that we use identifiers: we use the same identifier for the same researcher, like ORCID; we use the same identifier for the data, like DOI; we use the same identifier, like PURL, for a grant, and so on. After we merge those different nodes with the same identifier, each entity of researcher, data, grant and paper is connected through a graph relationship. That is my understanding, as a user, of how RD-Switchboard can help us make the connections. With this graph view I can answer questions, for
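The move from a redundant table to an identifier-keyed graph, as described here, can be sketched as below. The rows, the identifiers and the choice of which entities are linked to which are all made up for illustration; this is not how RD-Switchboard stores its data. The point is only that repeating an identifier across rows does not create a second node.

```python
# Each row: (researcher ORCID, dataset DOI, grant PURL, paper DOIs).
# All identifiers are placeholders.
rows = [
    ("orcid:A", "doi:data1", "purl:grantA", ["doi:paper1", "doi:paper2"]),
    ("orcid:B", "doi:data1", "purl:grantB", ["doi:paper3"]),
    ("orcid:B", "doi:data2", "purl:grantB", ["doi:paper4"]),
]

def build_graph(rows):
    """Collapse table rows into unique nodes and edges keyed by identifier."""
    nodes, edges = set(), set()
    for researcher, dataset, grant, papers in rows:
        nodes.update([researcher, dataset, grant], papers)
        edges.add((researcher, dataset))    # researcher used the data
        edges.add((researcher, grant))      # researcher was supported by the grant
        for paper in papers:
            edges.add((researcher, paper))  # researcher wrote the paper
            edges.add((dataset, paper))     # paper was generated from the data
    return nodes, edges

nodes, edges = build_graph(rows)
cells = sum(3 + len(row[3]) for row in rows)
print("table cells:", cells)
print("unique nodes:", len(nodes))
print("edges:", len(edges))
```

Here researcher B, data 1 and grant B each appear twice in the table but only once as a node, and both of researcher B's rows hang off the same node.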
37:51
example: what is the usage of NCI datasets? That can be translated directly into an RD-Switchboard query, as you were shown a while ago: how many datasets published at NCI are being referenced in research journal articles? Another question, such as what the awareness of the available datasets is within the research community, translates into the query question: how many researchers and institutes are connected to the datasets? And so on. There are questions which are even more specific: if I would like to know more about this dataset, who should I contact? Who generated this data, or used the data to publish a paper? What previous research has been done using this dataset? I believe this is a very common question for researchers: when they start a new topic they do this kind of research, like myself, doing a Google search first. If we provide this kind of infrastructure, it will make the researcher's literature review much easier. Now I will use two slides
39:09
to explain exactly how we organize our catalog and how we adopt RD-Switchboard technically. We organize our catalog in a hierarchical structure. At the top is a node, as you can see here: it is the NCI GeoNetwork node, which holds only the collection-level, high-level summary of the data collections; at the moment we have more than 2 100. At the middle level, you can see every single project has its own GeoNetwork catalog. GeoNetwork is our metadata display interface, but you can use other interfaces as well. For each individual project we might have thousands of records at the file level or granule level. It is not appropriate to have all these different granularities cataloged in a single GeoNetwork node, because then it is harder to separate them and harder to browse by research domain, for example. So we use this structure, as it gives us the flexibility to do more aggregation at a later stage. You can check out our main GeoNetwork website using this link. So what does RD-Switchboard do? Every single GeoNetwork has
40:44
its own database, and we dump those databases into the RD-Switchboard graph database, and the connections are made here. I don't show exactly how it does it, but the magic happens in this box, where the identifier is used to merge the two different nodes; that is how the connection is made. This screenshot is the current status
41:14
using NCI's metadata. We find connections, for example, between the datasets, the researchers and the institutes. But I also notice that there are nodes that are disconnected. In my follow-up processing I find out that they are actually connected, but our metadata lacks some critical information, so when I present this database in a graph view, they are disconnected. It means that I have to correct some of the metadata information in our database. So far,
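Using disconnected nodes as a metadata quality check, as described here, can be sketched as follows. The Cypher string is an assumed query for nodes with no relationships at all (the `nci` label and the `key` and `title` properties are illustrative), and the Python function does the same on a made-up node and edge list.

```python
# Assumed Cypher equivalent: nodes that take part in no relationship.
DISCONNECTED = "MATCH (n:nci) WHERE NOT (n)--() RETURN n.key, n.title"

def isolated_nodes(nodes, edges):
    """Return the nodes that do not appear in any edge, sorted by name."""
    connected = set()
    for a, b in edges:
        connected.add(a)
        connected.add(b)
    return sorted(set(nodes) - connected)

# Made-up catalog: ds3 has no links, which flags its record for review.
nodes = ["ds1", "ds2", "ds3", "res1", "inst1"]
edges = [("ds1", "res1"), ("ds2", "res1"), ("res1", "inst1")]
print(isolated_nodes(nodes, edges))  # ['ds3']
```

A node in this list is not necessarily unrelated to everything else; as the talk notes, it may only be missing the identifier that would have linked it.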
41:56
as I explained, RD-Switchboard helps me identify missing critical metadata entries, which I can then provide so that the metadata is more complete. It also helps me identify errors in the catalog, which can then be fixed. Without RD-Switchboard this is almost impossible, because we have hundreds of thousands of records and it is so hard to check them manually, but RD-Switchboard can tell me immediately. Second, the RD-Switchboard graph view provides an analytical view of how research data has been used so far. This is a very common question, asked many times by our users, because they care about who uses their data and they care about how to make the data even more public, to make more connections to the external world, and RD-Switchboard is an ideal tool to make that happen. It can also help me evaluate the impact of datasets, researchers and institutes, based on the assumption that the more connections a node has, the bigger its impact, if you like. Finally, as you saw in Amir's demonstration, if a researcher doesn't have an ORCID, those queries wouldn't work. So it is a really good motivation or encouragement for researchers to register for an ORCID, for data managers to mint DOIs for their datasets, and for data repositories to provide persistent identifiers, to increase the accessibility of the datasets. So this is our experience so far. I would like to end
43:56
my presentation by giving you a real example of how I feel it is really helpful from a data repository point of view. The basic question would be: are datasets connected to each other? We have a group from the Bureau of Meteorology, and they downloaded climate reanalysis data from the U.S. Because it is too large, and there are a number of people who want to use that data, they were approved by NCI to store it at NCI so that they can use it. After a little while, another group, from CSIRO, also a climate research group, were downloading climate reanalysis data from the same source, but a different resolution and a different subset, and they wanted to do some research. Those two groups don't know each other, but they both came to me, wanting to find some storage at NCI to support their research. I suggested: since you share a common interest, why don't you talk to each other and see if group A has already downloaded some data that group B can use, and vice versa? And then they started talking to each other. After a few months, a third group, also from the Bureau of Meteorology but from a different branch, asked a very similar question about using and sharing reanalysis data, and I suggested the same thing. However, as a human being acting as a communication hub, it is very difficult and it is time-consuming. I can see a good chance for RD-Switchboard to play my role as the communication hub and present those connections automatically in the graph database, so people can go there any time, 24 hours a day, check the connections of the dataset, and start talking to each other without talking to me. My workload is also reduced, and it might also motivate collaboration between different groups when they see the connections. So that is my hope for RD-Switchboard when NCI adopts it, so that we can offer those kinds of services to our users. In summary, RD-Switchboard is a great
46:28
tool to create linkages among researchers, datasets and publications. The Neo4j graph view is a very eye-catching and straightforward way to show complex interconnections within the research community. I also view data management as a joint effort by the whole research community and the librarian community. That is the end of my talk; I will hand over to Amir for questions. Thank you very much. What will happen
47:00
after this: after this webinar we will have a meeting, and that would be one place where you can
47:12
find us; Jingbo will be there as well, and we will talk about this technology further. For those of you who are interested in getting the Neo4j database that I was using for the demonstration: it is quite a big file. I can give it to you on a USB disk, so that is one way of getting it. A quick way: next week, if you come to the eResearch conference and find me, I can give you the file there. If that doesn't happen, Jingbo and I are working on putting that data on the NCI platform as an open access repository so other people can download it, but that is going to take a little while. Regardless, the slides will be available online. If you have any further questions about Neo4j technology, you can send me an email at this address.
48:19
The entire code for RD-Switchboard and the Research Graph database is all on GitHub, so the links to the repositories are on the slide, or I
48:32
can also bring up a screenshot of this. So what will happen is, the
48:37
first thing you would probably need to do, if you want to create a graph database:
48:42
you might want to go to Neo4j and
48:44
download the Neo4j source code and compile it. The easiest way, though: we have a Neo4j repository
48:52
here that has every plugin and everything already built, so you can just download it. That would be the database. If you are asking what the schema is, there is the schema repository here, and
49:06
also there is a wiki page explaining how the schema works. There are some crosswalks in the repository,
49:14
and you can look at the files; you can get a crosswalk if you want to import into Neo4j.
49:20
There is also a harvesting point there, so the harvesting information is with RD-Switchboard, and that is also on GitHub, in a repository under the RD-Switchboard name. We have multiple repositories here, and the code that makes the magic happen is in the inference repository. Now, we have about 5 minutes that we can allocate to questions and answers. So
49:52
we have one question from Christopher. The question is: what processing power is required for the queries? The answer is that it very much depends on two things. Graph size is the obvious element, and the other one is your indexing of the properties. If you index everything, then obviously queries run much quicker with less computation, but that requires more storage. The trade-off would be that you can index fewer properties, and then you need more computation allocated to the queries. But the example that I showed was running on my MacBook, which has an i5 processor, with the graph database, PowerPoint for the presentation, and Neo4j running in the background. So that is not expensive to run. The thing which is expensive to run is the inference engine: that needs high-performance computing and a lot of memory. In our case, the machine, which I believe was on one of the earlier slides, has 36 cores, and on that machine it takes about 72 hours to complete the pipeline. The rule of thumb is: if you have a machine with half of the power, it doesn't take twice the time; it actually takes 8 times longer. So for inference on large graph databases you need a very powerful machine or a cluster of machines. Now, this opens a conversation about distributed graphs, which I briefly mention, and that is why the Research Graph project is now adopting the idea of having a cluster of graphs running on different platforms. That is probably something for a future conversation and another webinar, as it is more technical.
So, one question here: is a form-based search option planned, other than the search queries? The answer is yes. We are working on a couple of different options for this. One is with institutional repositories: at the moment we are exploring the idea of having an integration or plug-in for repositories such as DSpace. We are also working on the idea of providing a form-based search, so that you can type queries and get to the graph without actually loading Neo4j.
OK, so the next question is: is there any study comparing Neo4j graphs with other traditional and NoSQL technologies? There are a lot of studies on the web, and in fact Neo4j itself compares the database with lots of other options. There are comparisons with traditional databases like MySQL, there are comparisons of Neo4j with other NoSQL databases, and comparisons between Neo4j and triple stores. In this context, the first one is quite obvious: the problem of those databases with this kind of scenario is that finding a chain of relationships is a very, very expensive process in SQL databases. For NoSQL, well, Neo4j is one of those, so you are comparing different items in the same category, and there are a couple of those in this group; the main differences are performance, simplicity of use, and interoperability with other tools and platforms. For the Semantic Web, the main difference between Neo4j and a triple store is in the inference model. I would say Neo4j is far less capable of making complex logic, but at the same time it provides you with simplicity of implementation and performance in querying, so it is much quicker to get the data out of Neo4j. However, there are different triple stores and different technologies. I remember in 2014, when we were doing the prototype for the first time, our experience at that time suggested that the triple store technology requires more computation, but that might have changed in the last two years.
OK, so the next question is about inference and the search queries; the short answer is yes. The next question: is RD-Switchboard associated with the Research Data Alliance? Well, yes, it is a collaborative project that was initially started by ANDS, and other people joined, so we had infrastructure contributions, data contributions, and coding contributions from partners. Overall we can say this is the implementation of what the working group recommended: the working group came up with a combination of different recommendations, and what implemented them is RD-Switchboard. The last question was: is RD-Switchboard implemented in Research Data Australia?
55:54
So the answer to this one is that we use RD-Switchboard to enrich Research Data Australia; it is one of the linking capabilities that ANDS has. But at the moment we do not have the graph visualizer. This is one of the items in the pipeline, which we have already moved from planning to the development cycle, and in future versions of Research Data Australia we are looking at having a graph visualizer that highlights some of this information. OK, thank you everyone. I think we are out of time for the webinar, so I would like to thank everyone.
