Who is using our linked data?
Formal Metadata

Title: Who is using our linked data?
Number of Parts: 16
License: CC Attribution - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers: 10.5446/47581 (DOI)
Transcript: English (auto-generated)
00:47
So our next presenter is Corinne de Léon from the British Library to talk about who's using our linked data in their project. Good afternoon everybody.
01:03
So I'm a collection metadata analyst at the British Library and, as you can see, there was more than one person involved in this project. Unfortunately Luca and Pierre-Yves, who are the developers of the platform, couldn't be here, so I hope I do justice to their work. So this is what I'd like to talk about: to give you some context about, well, Osma has already mentioned very briefly the linked
01:27
open BNB, what the collaboration between the British Library and Fujitsu was in building this RDF analytics platform. And give you a few highlights of what we learned about the usage of the BNB and what we think
01:42
the value of RDF analytics has been for us. So the British Library was created in 1973 by the British Library Act and in that act our role is defined as being the national centre for bibliographic and other information services.
02:02
So originally a lot of those, well many of those bibliographic services were priced but more recently we've been offering open data and that's partly because of the impetus
02:20
of the UK government which is pushing public sector institutions into publishing their public data as open data. And so the linked open BNB is just one offering and part of our strategy which we published in 2015.
02:42
So when we were looking at publishing linked open data we decided to publish the British National Bibliography, so it's about 3.7 million records representing publications published in the Republic of Ireland and the UK, on all subjects, in all languages,
03:02
and from 1950 to the present day. We decided to pick that dataset because, well it's part of our core function as a national library to produce the British National Bibliography. We thought it would be a reusable dataset because it's not a unique institutional catalogue.
03:28
We also were able to publish it under CC0 because we either create the metadata or if we don't create it ourselves we've secured the right to distribute it in perpetuity.
03:45
It also includes Things with a capital T of interest — people, places, dates, subjects — and it's also consistent, or as consistent as a MARC dataset can be over 60 years.
04:02
So it is well maintained, it's authority controlled, we've got Dewey, we've got LCSH, but we also have to bear in mind that over the years, over 60 years, we've changed policy, we've changed cataloguing standards from AACR to RDA, we've also changed formats because
04:25
it used to be in UKMARC and in 2004 we moved over to MARC 21. So it's as consistent as can be. I'm not talking about the data model — I mean Osma has put it on his diagram —
04:44
but what is available as the linked open BNB is a subset of the overall dataset. We've got a book dataset and a serial one available on the linked data platform, so you've got a few links there. The data is hosted externally, we haven't got internal capacity at the moment.
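As a hedged illustration of using the dataset described next (it is exposed via a SPARQL endpoint), here is a minimal sketch of building a query request in Python. The endpoint URL and the query are assumptions for illustration only; check the British Library's documentation for the current address and data model.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed endpoint URL, for illustration only.
BNB_ENDPOINT = "https://bnb.data.bl.uk/sparql"

# An illustrative query: fetch a few titles from the dataset.
QUERY = """PREFIX dct: <http://purl.org/dc/terms/>
SELECT ?book ?title WHERE { ?book dct:title ?title } LIMIT 10"""

def build_request(endpoint: str, query: str) -> Request:
    """Build an HTTP GET request asking for SPARQL JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})

req = build_request(BNB_ENDPOINT, QUERY)
# urllib.request.urlopen(req) would execute the query; omitted here.
```

The request is built but not sent, so the sketch works without network access; swapping in the real endpoint and a query from the BNB documentation is all a live call would need.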
05:05
You've got a SPARQL endpoint, a SPARQL editor, and we also make bulk downloads available in two serialisations, RDF/XML and N-Triples, and this is updated monthly. So, some challenges — not all of them, I'm only dealing with those that are more pertinent to
05:27
the usage of our data. Well, first, resources, both human and financial: I think the British Library, like many UK public sector institutions, is continuing to face a very challenging situation.
05:48
So to give you an example, from 2010 to 2015 the library lost 30% of its budget and 23% of its staff, so we really must ask: is it worth continuing to provide this service,
06:01
where should we focus our efforts, there is less and less money and the library has a very ambitious programme, so there's more competition for funds. We also get limited user feedback, so I think that's not a problem unique to the British Library, I think it's the experience of many linked open data publishers and I'll talk
06:26
about the kind of feedback we get a bit later. So we really need to know who uses our data and what for and how best we can support those users, and another challenge is that there's a kind of lack of linked data specific analytics
06:42
tool, as far as we're aware anyway, and another perspective to that is that it's not just any tool that we were looking for, but also a user-friendly tool, because at the library we are principally metadata librarians, we're not linked
07:06
data specialists really, so we really need some user-friendly tools with good visualisations.
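One example of a linked-data-specific metric that such a tool might compute — counting SPARQL query forms seen in an endpoint log — can be sketched as follows. This is a naive illustration, not the Fujitsu tool's actual approach; the regex would be confused by a URI that happens to contain a form keyword.

```python
import re
from collections import Counter

# Matches the first SPARQL query form keyword in a query string.
QUERY_FORM = re.compile(r"\b(SELECT|ASK|DESCRIBE|CONSTRUCT)\b", re.IGNORECASE)

def query_form(query: str) -> str:
    """Return the query form of a SPARQL query string, or 'UNKNOWN'."""
    m = QUERY_FORM.search(query)
    return m.group(1).upper() if m else "UNKNOWN"

# Illustrative logged queries.
queries = [
    "SELECT ?s WHERE { ?s ?p ?o } LIMIT 5",
    "ASK { ?s a ?c }",
    "PREFIX dct: <http://purl.org/dc/terms/>\nDESCRIBE <http://example.org/book/1>",
]
counts = Counter(query_form(q) for q in queries)
```

A production tool would parse the queries properly rather than pattern-match, but even this crude breakdown by query form is information a generic web analytics package cannot give you.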
07:20
So how do we currently monitor BNB data usage? We keep some basic statistics — the number of hits against the SPARQL endpoint, the number of downloads from the British Library webpage — and we get some basic weblog analysis reports, but there's a lot of work required at the moment to extract information, so a lot of the
07:45
information we get about the data usage is kind of anecdotal. I mean, we know it's been used in pilot projects because we've ourselves given the data: it's been used as test data for a semantic search demonstrator, and also we gave it to Microsoft for them to work on one
08:01
of their projects. We know it's been used in tutorials — for example Owen Stephens, who I'm sure some of you know, has done a tutorial about an API. It might get tweeted: people do something with it and tweet about it. Or, more rarely, but it does happen, people contact us and say, I've used your data, thank you very much, this is great work.
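The kind of basic weblog analysis just described might look something like this sketch: filter out obvious robot traffic by user agent, then count requests per path. The log lines, paths, and robot heuristics are made up for illustration; real crawler filtering is considerably more involved.

```python
import re
from collections import Counter

# Parse the request path and user agent out of a combined-format log line.
LOG_RE = re.compile(
    r'"(?:GET|POST) (?P<path>\S+) HTTP/[\d.]+" \d+ \d+ "[^"]*" "(?P<agent>[^"]*)"'
)
ROBOT_HINTS = ("bot", "crawler", "spider")  # crude heuristic

def parse(lines):
    for line in lines:
        m = LOG_RE.search(line)
        if not m:
            continue
        agent = m.group("agent").lower()
        if any(h in agent for h in ROBOT_HINTS):
            continue  # drop search-engine crawlers and robots
        yield m.group("path")

sample = [
    '1.2.3.4 - - [01/Mar/2014:10:00:00 +0000] "GET /sparql?query=SELECT HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '5.6.7.8 - - [01/Mar/2014:10:00:01 +0000] "GET /doc/resource/123 HTTP/1.1" 200 2048 "-" "python-requests/2.9"',
    '9.9.9.9 - - [01/Mar/2014:10:00:02 +0000] "GET /sparql?query=ASK HTTP/1.1" 200 128 "-" "Googlebot/2.1"',
]
hits = Counter(p.split("?")[0] for p in parse(sample))
```

This is exactly the "lot of work required to extract information" the talk mentions: everything beyond raw hit counts has to be scripted by hand.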
08:25
So we were after something — we had all these questions about who is using our data, and which data, and how — when Fujitsu for their part wanted some test data to be able to develop the tool. So we gave them 13 months' worth of weblogs, and also we provided
08:49
feedback on the interface to try to improve the functionality and the user experience. Fujitsu very graciously gave their time and expertise to develop that tool,
09:09
and I also have to add that they developed it with other weblogs: they had five years' worth from the French chapter of DBpedia. So, as I was mentioning, this tool
09:32
incorporates some features that traditional web analytics tool have, for example location
09:41
and network provider, but it has some distinctive features, for example some SPARQL-specific metrics: it counts and returns the different query types, you know, ASK, DESCRIBE, etc. It provides some fine-grained analytics for each category of resources, so it counts the number of instances and tells you which ones they are —
10:02
classes, properties, and graphs. It supports the 303 pattern for RDF dereferencing. It also detects visitor sessions and gives you some idea about the depth and length of those sessions. It also attempts to classify SPARQL queries and their complexity, whether they're
10:24
light or heavy, and it also tries to classify the visits to see whether they're done by humans or machines. So this is what the system looks like overall, so the logs are processed,
10:44
the first thing that happens is that all the access information from robots and search engine crawlers is filtered out, because it produces noise, so we are left with quote-unquote genuine queries. The traffic metrics are extracted and stored in a data warehouse, whereby they can
11:06
be queried via a web user interface. This is what the web user interface looks like: you're welcomed by an overview dashboard that you can customize. At the top you've got the protocol,
11:23
you can either see the overall view or select whether you want to see SPARQL or HTTP dereferencing. You can also select your user agent, so you can decide whether you want to see just browsers, mobile browsers, or software libraries, and you can customize the date span, so you can have
11:45
a look at the whole year or you can just go down to a day if you want. And on the left hand side you have the metrics, the request counts, the response codes, and the audience metrics, looking at location, user agents, the types of visitors and sessions, and looking at the protocol, what
12:05
data access the visitors have been using. So what did we learn? We're looking at 13 months, from March 2014 to April 2015. Most of the requests were the results of
12:24
search engine crawlers or some robot activity — about 43.7 million requests — so we're left with 252,000 that were kept. Overall, we can see that the request flow is stable,
12:48
and there's even a slight increase, because we start with 18,000 requests and move on to 24,000. And within that time frame, the number of SPARQL queries increased from 67 in March to
13:08
just over 11,000. We've got new users coming in all the time, but what we also see is that there's a bounce rate of 48%, which means that users come in but they only look
13:21
up one single resource, and then they go away. So we need to have some retention strategy to make sure that they stay longer and explore the system and the data. So, I mentioned before that there'd been tutorials, and so we had an idea
13:46
that quite a lot of the users were novice users using the SPARQL endpoint as an educational tool, and that's confirmed when we looked at the top five instances. In the top five instances,
14:03
we've got these two, and we know that the Hobbit was set as an example by a tutorial done by Leigh Dodds, and the Lewis example is based on a SPARQL query we've got in our documentation, so people are obviously following that. And if we look
14:25
at the top five classes, out of those top five there are three that do not exist in the data, or even do not exist at all. For the second one, there is no bibo:Author, as far as I know.
14:44
In the BIO ontology — however you pronounce it — Birth exists as a class, but you see there's an issue of capitalization, and we haven't created an Author as part of our classes.
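A check of the kind this analysis implies — comparing queried class names against the published vocabulary case-insensitively, to spot mistakes like a lowercase Birth — could be sketched like this. The vocabulary set is illustrative, not the actual BNB data model.

```python
# Assumed, illustrative class names; not the full BNB vocabulary.
VOCAB = {"Birth", "Death", "Person", "Book"}
VOCAB_LOWER = {t.lower(): t for t in VOCAB}

def check_class(name: str) -> str:
    """Classify a queried class name: 'ok', a capitalization hint, or 'missing'."""
    if name in VOCAB:
        return "ok"
    if name.lower() in VOCAB_LOWER:
        return f"did you mean {VOCAB_LOWER[name.lower()]}?"
    return "missing"
```

Running the queried class names from the logs through a check like this separates simple capitalization slips from terms that genuinely are not in the data model, which is the distinction the speaker draws.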
15:05
Contrasting that with properties, the top five properties are spot on, and we can see that people search by ISBN, title, the type of resource, the label, and the creator. So we may have to look at, or maybe do some more documentation about,
15:28
what's the problem with classes. Is it to do with a generic misunderstanding, or is it to do with our data model? Location: we can see that the USA comes up, then the UK and Germany. And when you look at user categories,
15:44
we can get a breakdown by academia — so Karlsruhe is there, I don't know if anybody is from there, but you've been busy — and also by government, so overall there are 350 academic and government organizations using the dataset. Finally, if I look at access:
16:02
SPARQL queries account for 29% of total requests, direct human access for 62% of total requests, and desktop browsers are the most popular at 54%. But what we notice also is a sharp increase in requests from software libraries — about 95 times more from the beginning to the end —
16:25
so we can see that there's a kind of evolving use: we seem to have gone from more manual human browsing over HTTP, thanks to the dereferencing, through to
16:47
more machine access via software libraries. And if we look at the software libraries, we can see they've got deeper and longer sessions — they look up more resources, the depth is greater than with browsers — so this seems to be, from the
17:09
beginning to the end, an evolution from the site being used as an experimental tool on the data to maybe an ecosystem of more mature applications developing. So, to conclude:
17:23
as we were hoping, we get a better understanding of how the data is used at greater levels of granularity and with more user-friendly visualization, it helps us to support the business case, to be able to continue to provide this service, and also it's going to help us
17:42
to work out where we need to develop documentation or support the users. What it's also done is inform the dialogue we have with the existing platform provider: it's given us the evidence and the confidence to ask more questions, and it's also informed the tender specification, which we have now awarded to the same provider, which is TSO, but we are
18:07
going to get a dashboard, maybe not with all the bells and whistles that this one has, but we are going to get better management information. So I'll just finish with this slide, which gives you a few links; the one at the top is a demo, if you want to have a look at that,
18:26
there's one month's worth of data there, if you want to discuss any of this, we've got an email address, but if you want to talk about the system itself, then I would recommend you talk to Luca at Fujitsu Ireland, and again more details about our other open data, where you
18:47
can download the data, and if you're an insomniac, I really recommend the collection metadata strategy, thank you. Thank you, so again we have some time for questions, I'm sure we must have
19:09
many again. Does someone have a question? I think you've done an impressive job of collecting evidence and analytics in a very difficult environment that probably many of us
19:23
can relate to. So I'm wondering what your sense is now: you mentioned that there is a case for service continuity, but how do you think things will be progressing? Yes, I mean we have re-tendered and we are going to continue providing the service. I mean, we are hoping
19:43
to develop the BNB, because as I mentioned at the beginning it's only a subset, so we are hoping to publish in the first quarter of 2017 a forthcoming publications dataset,
20:01
and that's going to be the opportunity to review the documentation and try to address some of those issues. I mean, I mentioned the fact that there was an issue with classes; there are also very few people searching by graph — we've got two graphs and nobody searches by graph — and I haven't got that in my documentation. So maybe
20:23
if the data and all the access is by novice users using it as an education tool, that's great — we have a public-good function as a national library — so it's fine. So yeah, next step is more data and then
20:46
better documentation, which echoes the keynote from this morning — that's one of the outcomes. Thank you. Does anyone else have a question? Yes — oh, thank you very much for this
21:08
really interesting talk. Can you elaborate a bit more on the libraries and software you used for this SPARQL query log analysis, and particularly whether there are parts which are open and can be
21:29
reused by others who run SPARQL endpoints? You're asking about the tool and whether it can be reused? So, software for analyzing SPARQL log files, for example detecting heavyweight and
21:51
lightweight queries and things like that? Well, I mean at the moment this is a research prototype developed by Fujitsu, so it's not open source, so in terms of reusing the
22:05
actual software, then you'd have to talk to Luca, I think. I think we have time for one more question before we move on, last chance. This is a very minor one, but what happened in October
22:29
and also in August? What happened in October? Yeah, because all the charts had these huge usage bumps there. You're asking me to remember what happened in October?
22:43
No, I was mostly thinking: did you have some publicity thing there, or did someone decide to use your services and do it in a sort of non-neat way, or something like that? Yeah,
23:04
I'm getting there. So you can see peaks and troughs, and so some of them have been linked to either when we've improved the data and done some publicity around it, so for example when we've put ISNI in the data, or when the BNB has been mentioned at a linked data event, so
23:34
if I look after today, there'll be a peak there. And some of those peaks are also due
23:41
to some people hitting the SPARQL endpoint hard, which means that we've had to put a threshold on the result size; some of it has been intentional, and some has been people writing a script that's gone pear-shaped.
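A result-size threshold of the kind mentioned can be sketched as a query rewrite that appends or lowers a LIMIT clause. The cap value and the rewrite approach are assumptions for illustration; a real endpoint would enforce the threshold server-side, not by string manipulation.

```python
import re

MAX_ROWS = 500  # assumed threshold, for illustration

def cap_query(query: str, max_rows: int = MAX_ROWS) -> str:
    """Ensure a SPARQL query ends with a LIMIT no larger than max_rows."""
    m = re.search(r"\bLIMIT\s+(\d+)\s*$", query, re.IGNORECASE)
    if m is None:
        # No LIMIT at all: append one.
        return f"{query.rstrip()} LIMIT {max_rows}"
    if int(m.group(1)) > max_rows:
        # LIMIT present but above the cap: lower it.
        return query[:m.start()] + f"LIMIT {max_rows}"
    return query
```

A rewrite like this keeps a runaway script from dragging the whole endpoint down while leaving well-behaved queries untouched.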
24:07
Thank you. So join me again in thanking Corinne for her presentation.