RD-Switchboard: find connections to your data

In this webinar, Dr. Amir Aryani (ANDS) gives a brief presentation on this collaborative project and answers the question: "how can universities discover their datasets or connections to works by their researchers without manually searching in Google?" What is the Research Data Switchboard? This webinar describes the Research Data Switchboard (RD-Switchboard), which addresses the problem of cross-platform discovery by operating online services that connect datasets across multiple registries. RD-Switchboard is a collaborative project by ANDS and a number of other international partners in the Research Data Alliance. Driven by the rapid development of data storage technology, the number of research data repositories is growing fast, and researchers more than ever have access to a range of data repositories, including university data storage, discipline-specific repositories, and national (regional) level data infrastructures. The problem is that these infrastructures often operate in silos; that is, they cannot connect their datasets to the related research or datasets in other platforms.
OK, so this talk is about the Research Data Switchboard platform. This is a collaborative project that came out of one of the working groups of the Research Data Alliance. In the next 10 to 20 minutes I will give an overview of what it means, how it happened, why it happened, how it works, and how you can actually benefit from it. So, the actual project: this came out of the working group in the Research Data Alliance called Data Description Registry Interoperability. This group was created by quite a number of partners, initially around enabling cross-platform discovery and inference services. The actual goal of the project was creating a platform so we can actually connect datasets between the systems that these partners were hosting in their own infrastructure. At the moment the project is at the final stage of the working group. Before I go to the project, I will talk a bit about the Research Data Alliance and how these things are actually coming together. So this
Research Data Alliance is a collaboration of several international partners who are all involved in research data infrastructure. It is funded by the Australian Commonwealth Government, the European Commission, and the U.S. National Science Foundation. It started in 2013 and has had rapid growth. There are different groups inside the Alliance; basically there are interest groups and working groups. An interest group is a group of people who are interested in a topic and talk about that topic, but the actual work and the projects happen in the working groups. Working groups work with a timeline: you start with a specific proposal and you work for 12 to 18 months to complete it. At the time, the
group, when it started, really had only ANDS and a few others as part of it. Later on a number of other partners joined this group, and you can see the names of those groups and partners on this slide. We initially started with the idea that we can actually do much better than just keyword search across platforms. One of the ways of thinking about this was looking at Amazon's system: when you look at a book, it tells you what other books there are by the same author, what other publications are in the same category, which ones are related to this one, and what has been purchased by the same people who bought it before. So we thought, OK, we can implement the same kind of structure and connections, and connect datasets across our platforms under joint collaborations, under the same grants, or through connected publications. Basically, beyond just a keyword search inside the content of the datasets, we were planning to use the information we have, and other information in the collaboration network, to connect the data to publications and grants in other systems and to use those things to establish connections. So you will have a bridge between two datasets if they are under a joint grant, or if a professor collaborates with someone else in another country and they authored the same dataset or connected papers, so that you can see all of those connections. So what I am going
to do in the next few slides is show you what the problem actually is in practice, by explaining how you can do this today using a manual process and why that is limited. This is a dataset, and this dataset has been created by a couple of researchers from the University of Sydney. It does not just have the metadata: you can click on it, go to the page, and review the dataset. Now the question I have is this: we have two authors here, a professor and a doctor, and I want to know whether I can actually find anything else from them. Can I find out whether they have published other datasets in this domain? Can I find other datasets by the same author? If I search for one of them, I would find a page for the professor at the University of Sydney, and when you look at this page you can see that there is different information about the author, including grants and publications. On this page, if I scroll down, you can see journal publications, conference papers, and grants. There are 105 more publications listed, and we can technically search these one by one, as in a literature review, to find out exactly what they are and whether any datasets are connected to them. If you do that, you would find that one of them is a paper in PLOS ONE, and in the body of this paper there is a link to a dataset in Dryad. So I have followed a trail of links from the original dataset in ANDS to the author, to the paper, to the second dataset, which I actually discovered in the body of that paper. So the process that I explained basically fulfils the requirements of the initial idea: it means we can actually find another dataset by the same author. This is actually a much more accurate way to find related datasets than a keyword search on the same topic. I remember that when I was working on a literature review, finding other works by the same authors gave me a cohesive view of their activities and discoveries in the field. Now the problem is that this is not really a scalable process. So that is how
we got to the question of whether we can automate this process, and that is how we got to the idea of a "switchboard". In the past, in the telecom companies, there was a group of people who were actually connecting the lines; they were connecting one person to another person using an address book. These days this sort of activity happens completely automatically, using computer-integrated systems, and the goal here is the same thing in the research data environment. So to make this happen we
actually built a system. One of the rules of this working group was not to create a new metadata schema or a new standard. The goal was to work purely on a software development project using existing technologies. The reason for that was that we had only 12 months, later extended; we had limited time and wanted a production infrastructure. To make that happen we used existing data mining techniques and existing software development technology with the existing schemas, and we also made the platform as agnostic as possible to the metadata schema and format, so it can ingest different formats. The actual architecture of this system contains a few different layers. We have the harvest layer. We have the graph creation layer, which does all of the graph work; it has different integrations to external systems and uses inference technology to work out where connections exist. And then we push this information into the API consumer layer, where you can either ingest this information into your own system or use the browser interface to explore the connections in this environment. Now, to explain
exactly how this actually works, I would like to show you the graph behind it. This is almost like looking under the bonnet, but it is a good exercise to see the concept. At the core of the Research Data Switchboard is a graph database called Neo4j, which basically aggregates all the connections and information in the system. In this environment you can actually write queries to search for different records and information. In this example we look at the dataset from before; if you look at the slide that I had here, this was the one I used for the manual connection search. So in this environment I can write a query. Note that you do not need to do that yourself; it is just for illustration purposes here. What will happen here is that you get a dataset, which is the Dryad dataset. If I click on it and expand it, it gives me another record from Dryad, which is a researcher. This researcher is discovered by the computer; basically there is no manual intervention at this level of the infrastructure. This is all done by the harvesting and an automated system. So for the researcher here, which was discovered and connected to the dataset, if we look at the URL here, I can actually
transfer this URL to a browser and check the researcher. In this case this is the same page for the professor that I had on the earlier slide, with the list of papers. But now, in this environment, we can quickly explore this system further. I click on the researcher and I get all the publications by this researcher. I can get the grants, and I can get another dataset, which is also from Dryad. All of this information was collected through the harvesting system. Now I can get to the different grants, and these grants can also be investigated further. In this case I can clear some of this space, move this grant here, and open it further. In this scenario I have other connected records, all coming from ANDS, and this is the party record from ANDS for the professor, who is the manager of this grant or a participant in the grant. If I click on these I can actually see further records from the same person, which in this case is not a dataset. Then for the other record I can grab this URL, go back, and look at the record. So this is the record from the first slide. The process here is that the graph database actually collects all of this information in the core phase, and then a later layer exposes this information in an environment that you can actually look at without going under the bonnet. This is a browser interface which is based on the ANDS
software package. Here we go to the home page; this is similar to ANDS Research Data Australia. You have the option to search for different records, and when you find one you can actually get further information about that particular dataset and its connections to other records. One thing to note here is that we have two versions of this website. This is the American-hosted version. What I will do here is switch to the other version, which is currently hosted in Australia. If you want to access these, have a look, and play with them, I will post the links in the chat. So
that is as much as I can actually show here for now. When you browse this environment and go to a particular dataset, such as the one we looked at in the first place, there is a region on the page that actually visualizes the same connections. We have these images for every single dataset, grant, paper, and researcher in this environment. So if I look at a dataset in this environment, or a grant, I can double-click on it and go to that page. If the record is a dataset, you get a graph that has the dataset at its core. The system is collecting information about datasets, so it previews and visualizes information for the datasets in this environment by default. That means that at this stage the graph visualization is limited to connections with a dataset as the initiating point. Note that the widget that you can see here is open source: you can not only install it as it is on your own website, but also use the API to pull the graph from the Switchboard and put it in your own page, so that you have a graph of your dataset connectivity on your own site. The source code for the software is on GitHub under the MIT open-source license. And this is another
slide, about how these things are connected and what we mean by degree of separation. This is an example of the degrees of separation for connections between different datasets. We can have a dataset which is linked to a researcher; the researcher is also the author of a paper, and that paper is connected to another dataset. We can have a dataset which is linked to a paper; the paper acknowledges a grant, and the grant is linked to other datasets. Note that for this sort of connectivity we actually get a lot from our international partners. For example, for this kind of connection between papers and grants we rely a lot on our international partners, and a lot of the information comes from Europe and the US, but this kind of connectivity is all aggregated into the body of the Switchboard. In the last example we have a dataset which is linked to a researcher, the researcher is a participant in a grant, and the grant is also linked to a dataset. So this is an example of two degrees of separation. The Switchboard browser actually visualizes everything up to four degrees of separation, and that is the limit of the system's exploration at this phase. We might extend this later on, depending on further demand. The point is that the more degrees of separation there are, the harder the inference engine has to work, and the graph in the system is also more complicated to visualize, with more information coming together. So if you want
to access this information, you can use the browser interface. It is in beta at the moment; we are planning the final software release, probably in September. Then there is the API. The API is available right now for active participants in this project: you can request an API key and you will have access to it at this stage. The code is also available on GitHub. Now, if you want to actually include your
data in the system: the previous slide gave an overview of the system, but if you want to put your data into it, currently the Switchboard platform is harvesting from the ANDS repository, from Dryad, from CERN, and in future from a number of other sources. You also have the capability of delivering your own data to the platform. If you have an institutional repository and you want to have it alongside the other datasets, you can do that; there is a harvester plug-in that can do this independently from the Switchboard platform. If you want to see improved discovery and dataset connections, these are the things that we suggest you include in your registry objects, so that the inference engine provides a better result and a larger number of connections. One of these is the connection from datasets to researchers: when you have a researcher connected to the dataset, the engine has more information to go on. When you have researchers connected to grants, the level of accuracy improves. The same goes for connections from researchers to publications, and for datasets connected to grants and publications. All of these sorts of connectivity improve the accuracy of the results of the system. By default the system is tuned to a higher accuracy. That means that if there is a link that we have identified as not 100 per cent accurate, or close to 100 per cent, the system drops the connection. That is why including this information actually helps to find more connections: otherwise there might be instances where the inference engine says, "I do not have enough evidence; there is some ambiguity here, and for that reason I will not establish a connection." And the most important point of all of this is using persistent identifiers, like DOIs for datasets and publications, PURLs for grants, or ORCIDs for researchers. In some cases we can also identify the records in other ways; if there is a connection to international publications, that also counts, and the National Library identifier also helps. The more of those you use, the more connectivity you bring into our environment.
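As an editorial illustration of the degree-of-separation idea described in the talk, here is a minimal Python sketch. It is not the RD-Switchboard implementation: the function name `connections_within`, the in-memory dictionary graph, and the record labels are all hypothetical. It assumes, following the talk's last example (dataset, researcher, grant, dataset counted as two degrees), that the degree of separation is the number of intermediate records on the shortest path, and it stops at four degrees as the browser is said to do.

```python
from collections import deque


def connections_within(graph, start, max_degree=4):
    """Breadth-first search from `start` over an undirected adjacency dict.

    Returns {record: degree}, where degree is the number of intermediate
    records on the shortest path (an assumption based on the talk's example).
    Records more than `max_degree` degrees away are not returned.
    """
    hops = {start: 0}  # number of links followed to reach each record
    queue = deque([start])
    while queue:
        node = queue.popleft()
        # A record reached after h links has h - 1 intermediates, so do not
        # expand records that are already max_degree + 1 links away.
        if hops[node] > max_degree:
            continue
        for neighbour in graph.get(node, ()):
            if neighbour not in hops:
                hops[neighbour] = hops[node] + 1
                queue.append(neighbour)
    return {record: h - 1 for record, h in hops.items() if record != start}


# Hypothetical records mirroring the talk's last example:
# dataset -> researcher -> grant -> dataset.
example_graph = {
    "dataset:A": ["researcher:R"],
    "researcher:R": ["dataset:A", "grant:G"],
    "grant:G": ["researcher:R", "dataset:B"],
    "dataset:B": ["grant:G"],
}

found = connections_within(example_graph, "dataset:A")
print(found["dataset:B"])  # 2, i.e. two degrees of separation
```

Each extra degree widens the search, which is in line with the remark in the talk that more degrees of separation mean more work for the inference engine and a more crowded graph to visualize.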
Now, to get more information about this, we have the Research Data Switchboard dot org website, which gives you general information about the project. There will be new updates on that site, information about the releases, and information about other infrastructure components of the system. There is also a GitHub repository; the link is on the slide, and I will post it in the chat box later on after this. There is a Twitter handle for the project as well. You can also join the RDA group: it is part of the activities of this project, and you can join if you would like to collaborate. The next plenary happens in September in Europe, and there will be more presentations and talks about the progress of this project. In the GitHub repository for this, this is the core part of the system, and you can see the harvesters here. There are some other repositories connected to our system. All the code here is open source, and all of it is licensed under an open-source license, so you can use it. Since at this stage we are actually in the pipeline of development, there are some repositories here that contain code that you will not be able to compile on its own. Part of the reason is that these are integrated with other code and are not meant to be compiled by themselves. If you have any questions about this, contact me and I will gladly guide you through how to compile the components. OK, so with this I can actually go back to our final slide, and then
from this point we can actually open the discussion, if you have questions. Thank you very much.