
Decoding of Data

Formal Metadata

Title
Decoding of Data
Number of Parts
14
License
CC Attribution 3.0 Germany: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
Chair: Joe Wass (Crossref, UK).
- Measuring knowledge mobilization for grant-funded research using bibliometric, patent and policy indicators: Ben McLeish (Altmetric, UK).
- Exploring the Landscape of Curated Publications in Biology: Maria Levchenko, Aravind Venkatesan (Europe PMC/EMBL-EBI, UK).
- Can the Impact of Grey Literature be Assessed? An Investigation of UK Government Publications Cited by Articles and Books: Matthew Bickley (University of Wolverhampton, UK)
Transcript: English (auto-generated)
It's a last-minute adjustment, but this talk was first given back at IAMA, so unfortunately you're the second audience for this, but it means you get the practiced version. So, all right, thanks first of all to Stacy Konkiel, who couldn't be here and who sends her regards. She was the primary person who put all these data together. I will go through the data that she combined in order to demonstrate some of the stuff that you can do with these metrics and with other data sets, and how they interrelate, and we will go through all of these points. I'm not going to read them all out.
I don't like it when people do that, so I'm not going to do it. Great. So our scope for the study that Stacy did was grant-funded medical research funded by Italy, Germany, France, and the United Kingdom. I belabor that because when I presented this at IAMA with exactly the same slides, someone at the end asked me why China wasn't in there; that's why. It's deliberately European, though of course the US is going to be compared later on as well. We've had a look at what you can understand on a country level about the way that knowledge moves from publication through to uses within things
like patents and so forth. Time scale: 2015 to 2017. What was the data we used? We're quite lucky at Digital Science: we have a lot of immensely interesting products, and we have a team with back-end API access into the various products, and they sit, and sometimes stand, there are lots of standing desks where we work, and they combine these data and do studies; we do research for its own sake on top of the more commercial activities. So the data in one of those systems, Dimensions, now contains 105 million publications, and it contains one and a half trillion dollars' worth of awarded funding data, broken down by country, institution, researchers, what they're actually about, money and things like that, obviously. We now have 39-40 million patents that come from IFI CLAIMS; it's a good source of innovation data: who filed it, in which country, what jurisdiction, legal status, all the crazy patent stuff that you get. We have policy, of course, which comes from Altmetric, and we of course have the Altmetric
attention scores and score data that are in there as well. So we began with all of that; we've seen that before, those are just the sources in Altmetric itself. So then Stacy wanted to have a look at knowledge mobilization: what can we do if we have a look at patents and at policy, can we sort of get a sense of how countries compare when we're looking at funded data, data for which we know who funded the study to begin with, and how those end up bleeding out, I need to find a nicer term than that, diffusing into the research landscape, into innovation. It's something that funders will want to know,
because they'll want to know have we spent our money well, can we demonstrate that our funding has actually had an impact, had inventions, and who it was, when it happened and so forth. So patent data, it's quite useful data because it's structured really beautifully, it's used very frequently for this, it does show I think an indirect economic activity,
as well as a direct invention, so it's quite useful for that. Some people are wary about the use of patents, so this is only preliminary; what I'm showing you is not something that's been peer reviewed. And the indicators we had for it are twofold: on one side, a patent can be counted as a data point in itself,
so the number of patents that have been filed in a certain region or by a certain institution or a certain end user or something like that, and then the other one as well would be the actual citations from patents, there are two of those, there's patents referring to other patents, which does happen of course, we stand on the shoulders of giants, and then of course there are citations to what's called non-patent literature,
what they mean is standard article citations that a patent makes in order to make its case. In terms of policy citations, we talk about them quite a lot, but we don't make it clear quite how it's done: we go and download all of those PDFs and then write machine-learning algorithms to read through them and detect something like, whatever that name is, I picked one that's really complicated to say. Very, very brief citations, no links, no DOI; you've got to go and do the work yourself to find out how it works, and we rely on Crossref, amongst others, to make that magic happen. So you will see, for example, survey reports, reports on when the Zika virus happened,
we started to see publications show up in policy really quickly, because you can't sit around for two years and write about it, you have to implement measures right now. So these are normally links back to those PDFs. Being used in a policy citation differs from being cited in academic literature: it is used in a more applicable way than just a general citation. I think we've all followed at least one citation back before and found that it was casually referring to the paper, that it's not actually particularly instrumental in the article,
it's perhaps there to bulk out the reference list sometimes, but with policy that does not seem to be the case, this is something where we are citing actionable things, things that will inform concrete movements, there are strategic elements to it, and things that of course are going to be affecting policy climate,
so things where there will actually be headlines potentially from it, that's a definition from Weiss in Haynes et al. So both do measure uptake of ideas, concepts, one of them technical, the other one much more, I don't want to say political, but on the way to policy, but they don't of course directly measure downstream impact,
we can't demonstrate from them that lives are now led in a healthier or longer way, or remarkably differently; that's something the OECD is working on at the moment, to try and measure that kind of thing. So that's what we had to look at. So if we have a look at trends in research productivity and collaboration, what you're looking at here is the UK, Germany, the US in the middle,
France and Italy, where we have compared the volume of what is out there, so the blue is publications, the red is researchers in R&D per million people, so the number of researchers actually active in that area, so interesting to compare the, I suppose publications per researcher population
is what you're looking at. Don't worry about taking photos of the slides, I'll make sure that they are available online afterwards, in fact I think that Stacy might even be writing something for her blog about it, and then we have a look at research productivity, so we're looking at funded publications per researcher by country,
so it gives you an idea of sort of how it breaks down, now those numbers start to change when you look at the amount of money per publication, or the amount of money per researcher, but when you just simply look at it by how many publications per researcher per country have we found that actually have a grant ID attached to them, where there is a project with money, with a timeline and all the rest of it,
so the UK is surprisingly high up there, with 20.4 funded publications per researcher, Germany with 9.1, the US a little bit lower, although, and I think we may even get to this, the per-publication funding amount in the US is very high,
so the projects might be smaller, but they are heftier when it comes to the amount of money that's sort of in there, so it's an interesting ranking, number of publications by health area, of course one of the things that we're very lucky at Digital Science to have done is to map fields of research, various different kinds of fields of research, to grants themselves, to publications themselves,
we can work out what an article is actually about, rather than inheriting the journal's field of research, which in some cases is very, very difficult to use. Nature or PLOS, for example, are variously considered mega-journals; what are those, interdisciplinary I suppose, or general, or multidisciplinary, or even worse, science,
these can't be used for any kind of serious analysis, and so we're quite lucky to be able to break down the data we have into health areas, very specific things: mental health, neurological disorders, cancer, stroke and so forth, with cancer leading the way there in the volume of publications. So when IAMA happened, I actually struggled to explain what you're looking at here,
now Josh is taking a photo, but that's just of my beautiful face, he's not trying to capture the intricate detail you can see there. So what I'll explain is this: Stacy had a look, in every country that we're talking about, Italy, Germany, US, UK, France, at
the number one institution for the data that we're looking at, the top institution, the most funded institution, and then the ten closest collaborating institutions that are connected to that, so where they show up the most times in the affiliations, and then the top ten of all of those top tens,
so you start to end up with several institutions, and then she mapped those. So in the US we're looking at the moment at Harvard University, is that a pointer? Oh yeah, cool, I look all professional now. Harvard University right there, and then you can see immediately its related institutes, so things like Broad and the university hospitals and things like that,
are very closely connected, but then you also get the second order affiliations, the top ten universities or institutions working with those institutions, and you start to see, yeah, one minute left, five minutes, that's fine, plenty of time, I'll fix the whole metrics discussion right now, so you do get a sense of that,
so the top ten overall institutions in the US, you'll notice that hardly changed at all, because the US dominates when you compare the sheer publication output and funding output, did the same thing with the UK as well, I cannot remember, oh yeah, it's MIT weirdly, MIT is the top one, and how they connect as well,
and then in France, it's the national scientific something or other, CNRS is the one, yeah, I should have abbreviated it, and the same thing for Germany, you can see that as well. This is useful in the sense that you can get a sense, from more than just publications, but from funding and from collaboration,
where perhaps a researcher may want to apply themselves, institutions that are leading the way when it comes to any particular discipline or any particular subject area, it gives them an idea of where the money is flowing to, which researchers perhaps to collaborate with or read or invite to conferences instead of me,
so she did that for Italy too, and we'll make sure that these are available so that you can sort of see it; she does have the underlying data set as well, so you can see it a bit more clearly than you can here. I'm going to skip right to the end now, because I forgot that there were minutes involved and I can't just talk forever, much as I'd love to.
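The second-order mapping just described — take the most-funded institution in a country, find its ten closest collaborators by co-affiliation, then find the top ten of each of those — can be sketched roughly like this. All institution names and co-affiliation counts here are toy placeholders, not the underlying Dimensions data:

```python
from collections import Counter

def top_collaborators(papers, institution, n=10):
    """Count how often other institutions co-appear in affiliations
    with `institution`, and return the n most frequent."""
    counts = Counter()
    for affiliations in papers:          # one paper = a set of institutions
        if institution in affiliations:
            counts.update(a for a in affiliations if a != institution)
    return [inst for inst, _ in counts.most_common(n)]

def second_order_network(papers, seed, n=10):
    """Seed institution -> its top-n collaborators -> each of THEIR top-n."""
    first = top_collaborators(papers, seed, n)
    network = {seed: first}
    for inst in first:
        network[inst] = top_collaborators(papers, inst, n)
    return network

# Toy affiliation lists standing in for grant-funded publications:
papers = [
    {"Harvard", "Broad Institute", "MGH"},
    {"Harvard", "Broad Institute"},
    {"Harvard", "MGH"},
    {"Broad Institute", "MIT"},
]
network = second_order_network(papers, "Harvard", n=2)
```

With real data the affiliation sets would come from publication records, but the shape of the result — a seed plus two rings of top collaborators — is the same structure shown on the slide.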
The recommendations that Stacy makes, preliminary ones: it seems that encouraging researchers to file patents where appropriate is a good idea. If there are inventions out there that can be commercialized and that you do consider to be original, that is already a good metric of outputs. Some countries have adopted that; China has adopted it heavily as an indicator, which then led to a bit of abuse, a lot of abandoned patents, they just sort of filed them in the same way you used to just publish as much as possible. So sensible and honest filings of patents are always a good thing. We're taking the Italian angle because we were at IAMA on the original Brexit day.
There it seemed that tech transfer is much more powerful than we thought. Three left, thank you. And then policy impact, especially in the UK, seemed lower than in the other places, so it seemed like that was a good thing to promote as well.
And then lastly, international collaboration. International collaboration works really well, not just because it's quite nice to work with people from somewhere else, it seems to be when there are international affiliations, when there are collaborators across nations, you get more citations, you get slightly more social attention as well,
you ultimately might attract, because of that, more readers and therefore potentially also more patent filings, and you can then demonstrate in your grant reports, we actually had some tech transfer effect, we had patents filed based on our ideas, this is something for which we should get continued funding, please. That's it, and thank you for bearing with me through that.
Great. The reason it is important not only for the biology community, but to people outside the research itself, is because the information becomes easier to access. So here is just one example of where curation helps to solve a problem with creating new drugs.
This, at the bottom, is a database that accumulates information about genes that are somehow related to diseases. To do so, they look at publications and extract mentions of a gene appearing together with a disease,
and then compile them in a more computable, structured way for everybody to take advantage of. And the idea behind it is also that it's not an individual publication making the impact, but without each individual publication, it wouldn't have happened.
And now a large set of information that's been compiled from different sources can make its way into new research. So here, this is just a paper on integrating interaction data to solve the cancer problem,
and they're using over 60 different curated resources that have pulled this information together, and without all of that complexity, it wouldn't be possible to do this research. So, as I mentioned, I work at the European Bioinformatics Institute, which hosts over 30 curated biological data resources.
And just to emphasize the value of curation again: they are heavily used, by over 20 million users per year, sending billions of web requests to the services. So it's definitely being picked up in a way that a single publication might not be.
Now, I work at one such database, called Europe PMC, and it is a literature database. We are developed at EMBL-EBI, supported by all these wonderful people, and partnered with PubMed Central.
And we collate not only peer-reviewed publications, but also preprints and grey literature, and one of our specialties is linking up information. So we try to link publications with things like funding, with things like author IDs, but particularly, being among all these databases, we link publications and data.
And from that, we got interested in the value of curation. We wanted to see what makes a paper curatable. Is there any link between publication quality and curation? Is that link reciprocal? Is it something that we can pull out?
We were generally interested in what I call the landscape of biocuration. And so these are the questions that we realized we needed to ask in this journey. The first one was more of a problem. How do we track curation with 30 databases just at EBI, and many, many more in different communities?
How do you collate the information? What is it about papers that makes curators pick them up? Can we kind of characterize what a curated paper is? Does curation correlate with quality, and what does quality mean in that sense?
And finally, perhaps more of use to this audience, is there connection between curation and any kind of metrics, whether traditional or alternative? So to start with the first one, to be honest, our task was sort of straightforward in some ways,
because seeing as curators read publications and extract information from them, they structurally keep the link back to the publication. So in this data record from a database called PDBe, you will see that there is a link to the primary publication that the information was extracted from.
So what we needed to do was to get all of those links from all these different data providers and make them accessible in a single format, through APIs programmatically and on the website. And we've collaborated with nearly 30 databases, but this is an open scheme, so any new providers may join as they learn about the initiative.
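Collating links from many providers into one scheme boils down to normalizing each provider's record shape and then counting links per publication. A minimal sketch — the provider field names (`pdb_id`, `fb_id`, and so on) are invented for illustration, not Europe PMC's actual schema:

```python
from collections import Counter

def normalize(provider, record):
    """Map a provider-specific curation record onto one shared scheme:
    which database, which record, and which publication it links back to.
    Field names per provider are hypothetical."""
    if provider == "PDBe":
        return {"db": "PDBe", "record_id": record["pdb_id"], "pub_id": record["pmid"]}
    if provider == "FlyBase":
        return {"db": "FlyBase", "record_id": record["fb_id"], "pub_id": record["pubmed"]}
    raise ValueError(f"no mapping for provider {provider!r}")

def citations_per_paper(links):
    """How many curated database records point at each publication."""
    return Counter(link["pub_id"] for link in links)

# Toy records from two providers, each keeping the link to its source paper:
raw = [
    ("PDBe", {"pdb_id": "1abc", "pmid": "123"}),
    ("FlyBase", {"fb_id": "FBrf001", "pubmed": "123"}),
    ("PDBe", {"pdb_id": "2xyz", "pmid": "456"}),
]
links = [normalize(p, r) for p, r in raw]
counts = citations_per_paper(links)   # paper 123 is cited by records in two databases
```

The real system serves these normalized links through Europe PMC's APIs; the point of the sketch is just that an open scheme means one `normalize` mapping per new provider, with everything downstream unchanged.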
First thing that we did was actually to integrate this into a data view of the paper. So we display the database citations on the article page, and we say how many database records have cited this article from all these different databases. But that was just the first step.
Then we wanted to look into the actual data. So the one thing that we had to do was to actually talk to curators and ask them this question, how do you select and prioritize papers? And we've done a study which I put the reference to here, so this has been done by one of my colleagues who couldn't be here with us,
but you can read the full paper and see the underlying data as well. It relied on a curator survey and a few observational studies in a workshop. So what we learned was that curation happens in a few very distinct stages, and it's the same across all of the different databases.
They decide on what they look for. Is it genes versus diseases? Is it mentions of organisms and so on? Then they have to somehow select the publications, extract the information that's relevant to the first question, and then fill out the structured data record that they actually produce.
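Those stages amount to a small pipeline: decide what to look for, select publications, extract the relevant statements, fill out a structured record. A toy sketch of that flow — the matching rule and record fields are invented, real curation is done by people, not a substring match:

```python
def curate(publications, target_terms):
    """Toy version of the curation stages described above:
    1) decide what to look for (target_terms),
    2) select publications that mention them,
    3) extract the relevant statements,
    4) fill out a structured data record."""
    records = []
    for pub in publications:
        # stage 3: pull out sentences mentioning any target term
        hits = [s for s in pub["sentences"]
                if any(t in s.lower() for t in target_terms)]
        if hits:                        # stage 2: selection
            records.append({            # stage 4: structured record
                "source_pub": pub["id"],  # the link back to the publication
                "evidence": hits,
            })
    return records

pubs = [
    {"id": "PMC1", "sentences": ["BRCA1 is linked to breast cancer.",
                                 "We used standard methods."]},
    {"id": "PMC2", "sentences": ["This review covers imaging techniques."]},
]
records = curate(pubs, target_terms=["brca1"])
```

Note that the structured record keeps `source_pub`, which is exactly the link-back that makes the later citation analysis possible.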
And it's those two questions that we managed to follow up on. So before I proceed, what I actually wanted to say is that with any data there's going to be a bit of bias. And the first one here is that curators use different search tools to look for papers,
which means that whatever revelations we draw from the data will be biased based on which tool was used, because you can't compare them throughout all of the databases if different curators use different things to find the same information.
There were also several challenges that the curators mentioned, and these are quotes from the studies. Most publications returned in a search are usually not relevant, so they have to sift through a lot of material, most of which they don't need, and it's easy to miss things; they're never confident that they've found everything relevant to them.
And that actually relates to paywalls. Not only do curators sometimes lack access to the full text they need to curate the information they're interested in, but even if they had a subscription, was the paper picked up by the indexing tool that they use? Because the information they look for is often hidden in the full text, or even in the supplementary data.
So if they haven't found it in the first place, even if they could pay for access, they just won't be looking at it. In terms of publications, then, open-access ones might be over-represented in our set. Some curators mentioned that they have a watch list of specific journals, which means there might be a journal bias,
and some of the research that's not represented in this watch list gets lost. And they're looking for very specific underreported experimental evidence. And here experimental is the key, because when we looked into article types that are of interest to curators, this large blue portion of the screen is the peer-reviewed research,
and some of them said they exclusively look for that, not paying any attention to reviews, preprints, or other types of literature. So the second question that we were aiming to ask after learning about curation practices
was: does curation actually mean quality, from a publication point of view? Our take on it is yes. Not only does curation itself have value, but if a publication has been curated, that means the publication was valuable as well.
And that comes from several sorts of ideas. One: curators stated that in order to curate, they need clear experimental details; researchers writing papers have to be really precise and very detailed in explaining what they've done. It won't do to just write some unclear instructions on the back of your hand.
This is one of the key things curators look for. They look for proper nomenclature. So again, that's very important, the description that researchers put in. Sufficient description I've mentioned already, but another important bit that can get quite lost
is that curators actually look at the data, and the data specifically, and they will reject a lot of publications that don't fit these criteria. And the reason I bring that up is that when we look at the article sections that curators are interested in,
you will see that they actually skip the introduction almost entirely, and perhaps most importantly, they focus on things like methods, figures and tables, results, and supplementary data. These are the things that can often get overlooked,
even in traditional peer review, particularly supplementary data, and there are studies around that. So curators actually pull out into the light the things that really make up the scientific findings. I was talking about curation being valuable to other users, but curation is also an indicator for curators themselves.
So we asked them if it was useful to know whether an article had been curated by someone before, and most of them said yes. And when we asked why that was important, a lot of them said validation. So they say that if it has been curated,
it means that it is good. There's also the use of structured data, checking cross-references, prioritizing, and consistency playing into this. So these have been all of the steps that we took, and we are now just at this one. The analysis continues.
This will be more questions than answers, and here are the questions that we want to explore. In terms of connecting citations and curation, we want to know do curated papers look like other publications, or is there anything special about them? Since curators don't look at the number of citations for the paper
or a journal impact factor or anything like that, does that mean they over- or under-select highly cited papers? Does curation change citation behavior? Will the fact that a paper has been curated mean that it's going to be cited more? And there are a few challenges along the way. We don't know what to take as a baseline,
what if we're just comparing apples and pears; what does a curatable but not yet curated publication look like? The data is obviously incomplete: we've got 30 providers, but there are many more. Some publication types, as I mentioned, reviews and grey literature, will be underrepresented,
and the availability of full text will make a big impact. So this is one of the last slides, where we just looked at comparing two curated databases: in green one called FlyBase, in blue one called the Protein Data Bank, and in orange papers that have been curated by both.
And we looked at annual citation rates, so each of these triple columns is a single year. Now, this data has not been statistically tested, but you can see that in several years the overlap is more highly cited than papers curated by just a single resource.
And initially we were kind of interested, is it because it signals a higher quality, or is it because there's more people who read it because there are two different databases dedicated to two different topics? This is just an example of the challenge that it is. So we've gone through a few steps along the way, but there are still questions connected to it.
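The comparison on that slide boils down to grouping papers by which resources curated them and averaging citations per year within each group. A toy version — every number here is invented, only the shape of the computation matches the slide:

```python
from statistics import mean

# paper -> (set of curating resources, citations per year); all values invented
papers = {
    "p1": ({"FlyBase"}, {2016: 3, 2017: 4}),
    "p2": ({"PDB"}, {2016: 2, 2017: 5}),
    "p3": ({"FlyBase", "PDB"}, {2016: 6, 2017: 9}),   # the "overlap" group
    "p4": ({"FlyBase"}, {2016: 1, 2017: 2}),
}

def annual_citation_rate(group, year):
    """Mean citations in `year` over papers curated by exactly `group`."""
    rates = [cites.get(year, 0)
             for curated, cites in papers.values() if curated == group]
    return mean(rates) if rates else 0.0
```

One triple column on the slide is then just the three values `annual_citation_rate({"FlyBase"}, y)`, `annual_citation_rate({"PDB"}, y)`, and `annual_citation_rate({"FlyBase", "PDB"}, y)` for a given year; whether the overlap group's higher rate signals quality or simply a larger audience is exactly the open question raised in the talk.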
I'm going to skip my personal story; if you want to hear it, come ask me. But this is sort of why it's important: for young authors who are not heavily cited, being curated could be an impact that they can't otherwise show. And what we want to transition to in Europe PMC is from this view alone to these views side by side,
from citations from articles to citations from databases as well. And with that, I get to the question slide. Okay, so my name is Matthew Bickley, I'm from the University of Wolverhampton. Some of you might know Mike Thelwall from there, or at least have heard of him. He's one of my supervisors,
so if you've got any problems with my work, speak to Mike. So my work here, I'm looking at grey literature and seeing if we can work out the impact of grey literature online in a similar way to what we do for standard citation analysis. Let's test this as well. I should have tested this first.
Okay, so just in case you're not familiar with what grey literature is, a very brief overview. It's a non-standard source of research, in contrast to literature written with the foremost intention of publication, either in a book or a journal article. So it's quite a wide umbrella term.
Anything online from Twitter to governmental white papers could be classed as grey literature. The line can be a bit fuzzy, mainly due to that "foremost intention to publish". You could write something with intent to publish and not publish it, or vice versa: it ends up getting published when that wasn't necessarily the plan when you started.
So assessing the usefulness of grey literature is actually quite difficult. There's limited research into it, into how it's used in standard literature or how it uses standard literature. As I mentioned, it's quite a big umbrella term, and that's obviously ballooned since the internet and social networks became mainstream. And the methods that I'm looking at here
will look at just one type of grey literature, but the methods I'm proposing are hopefully adaptable for other repositories as well. Search engines and databases to assess the impact of standard articles obviously exist: you've got Scopus, Google Books, et cetera. These are very difficult to use to actually track grey literature. There are ways, but we need to come up with new methods to do that. Citation count may not be the best measure, and standard citation analysis is not necessarily the best approach; there are lots of altmetrics we could possibly use, and this is just one example. Large-scale studies could be difficult
with certain search engines due to their limitation on how many queries you can put through, for example. I've mentioned about citation analysis, so I'll skip over that. One of the problems we have is due to having to manually locate or capture these grey literature articles. For what I've used, I've specifically used one repository,
but trying to automate the identification of whether something is grey literature might be a bit difficult. Obviously, Altmetric.com are here. The problem is they need something like a DOI or some other unique identifier, so I'm trying to propose slightly different, but still hopefully unique, identifiers
that could be used in a similar way to how altmetric tracks. Here's an example of the problem. If we just go to Google Books, I've just put in a query at the top there, which is just gov.uk slash government slash publications, which is where the grey literature from the government was held. I'll get back into that.
We can see here in the descriptions in bold that that part of the URL is highlighted, so we can hopefully assume that there is a reference in here to something that is grey literature. If we go into that first example, it's a book, an education book, and go into the references all the way in, and you can see highlighted in yellow, we have many references to this repository.
There are grey literature citations in here. We just need to try and work out a way of extracting them and identifying them. There are, as you can see, other governmental citations in here. This one here is not highlighted, because it doesn't have the slash publications part. We limit it to just that repository. However, the possible inclusion in future work
is to use, say, the news parts that they have on their site. My research questions and my data collection for this were to take the UK governmental grey literature repository, which was self-defined as grey literature, but hopefully we can agree it is, and look at how these documents
are actually cited in your standard publications. The first question I wanted to ask was, can these grey literature citations in standard publications actually be automatically extracted from Scopus and Google Books on a large scale, so not just manually or a small sample? Can we semi-automatically at least do this?
If we can do that, which citation search strategy or indicator is the best for trying to measure impact? There are obviously quite a few choices, and I'll go into that. Are there any disciplinary differences between the topics as defined on the UK Government website? Again, I'll go into that in a sec. The data was collected from the UK Government website.
That's the repository as it was; it still exists, but they've merged a lot of other things into it now. The citation I showed you before in yellow will obviously not change, but some newer documents will have slightly different URLs from this. This does still redirect to the new one, which is good,
but obviously the unique identifier I've used might not work anymore, which is great. We had to add some bespoke routines to Webometric Analyst, which you may not have heard of. You can find it freely available at that link there. It's written by Mike Thelwall at Wolverhampton, and he has lots and lots of tools in there
to measure all sorts of altmetrics. One of them that he wrote was a way to crawl this and to analyse this data for me. When we crawled, we were able to get quite a few different indicators we could possibly use from the data, so document title, the policy area, the URL, the date it was released. There were a few others if it's part of a collection,
but we didn't end up using those. The policy areas that we used for the topics, which I'll get a little bit more into in a sec, were defined by the UK Government. If you want to trust anything they write at the moment, that's up to you. They defined 47 quite wide areas.
For example, NHS, further education, sports and leisure; there are lots of self-defined topic areas in their repository. Again, that's changed since, and I'll get back to it in a bit. For data collection, we crawled the whole site and got every single document. Altogether, there were about 138,000.
Some of these might be duplicates, as some of the documents might be multidisciplinary, so the true total is slightly less than this, but not far off. As you can see, unsurprisingly, there's barely anything from 1945. That's not because I've just put that in there.
There is a document from 1945; it was obviously retrospectively added. There's barely anything up to the mid-2000s, around 2005, where it starts ramping up. We decided we wanted to focus on recent documents and look at years with over about 10,000 documents each. We narrowed it down to 2013 to 2017,
at the time of collecting this data. 2018 hadn't finished, so we just went with the most recent five complete years. We still ended up with almost 100,000 documents to analyse, which is quite useful. To identify grey literature citations, we needed to come up with some kind of indicator
or search strategy with high precision, high recall, and, hopefully, uniqueness. We did some pilot studies to see what might be best. Document title was the first option we ran with, and we got a lot of false matches. There is a document in the database called Ahead of the Curve,
but it was about UK motorsport. It had a subtitle after this, but the actual document was just that. If we searched in Scopus for that document title, we got almost 1300 mentions, whereas if you just went with the URL of this page, you only got one, and this was actually the only true positive.
We decided that document title was not useful, was not unique enough. We went with URL in the end. It's a bit more specific. Hopefully, it should be unique. However, we did consider using the whole URL, but then, due to issues with PDFs and line wrapping, if it was line wrapped halfway through,
then it might not find that as a citation, so then we're missing too many. Again, we did some further tests, and we found out that it was enough to chop off the first bit, the gov.uk slash government, and just go with publication slash whatever the document title was called, so there's that Ahead of the Curve option.
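The truncation and the line-wrapping workaround could be sketched like this in Python. The helper names are invented, and collapsing all whitespace before matching is a simplification of the PDF problem described, not the actual implementation.

```python
import re

def to_identifier(url):
    """Truncate a full repository URL to 'publications/<slug>', the search
    identifier described in the talk. Returns None for non-repository URLs."""
    m = re.search(r"gov\.uk/government/(publications/[\w-]+)", url)
    return m.group(1) if m else None

def found_in_pdf_text(text, identifier):
    """Collapse whitespace before searching, so an identifier wrapped across
    lines in extracted PDF text is still matched. A simplification: real
    line wrapping can also insert hyphens, which this does not handle."""
    return identifier in re.sub(r"\s+", "", text)

ident = to_identifier("https://www.gov.uk/government/publications/ahead-of-the-curve")
wrapped = "cited at gov.uk/government/publica\ntions/ahead-of-the-curve in the text"
print(ident, found_in_pdf_text(wrapped, ident))
```

Removing all whitespace can in principle create false joins between adjacent words, but for a URL-shaped identifier the risk is low, which is one reason a distinctive prefix like "publications/" helps.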
If we'd just gone with the title again, we'd have the same issue as here, but by putting "publications" in front, it did actually make it unique enough that we were getting decent positives. Okay, and then the final search strategies to actually find these citations: we decided to use both Scopus and Google Books in the same way. Using Webometric Analyst, we were able to query both. The difference is that Scopus has the reference section that we can specifically search in, which is very useful. Google Books doesn't have that, but again, Mike has built in ways of matching and filtering out results that are false positives, which I'll go into later.
So, the methodology, to go over it: we crawled documents from 2013 to 2017 and picked out the 100,000. We then took those 47 areas and tried to re-categorize them; for example, we put NHS, Public Health, and Social Care together into healthcare, trying to make it a little less fragmented. We then chopped off the URLs,
just keeping the last bit, using a combination of Excel, VBA, and Webometric Analyst to do that, and then used Webometric Analyst with the APIs for both Scopus and Google Books to actually search them. Citation counts were then calculated based on these two, and impact assessed by comparing across years
and topic areas. We also checked the precision of the results in a crude way, with a sample, just to see which source was more precise. Okay, here are the results. It's quite a busy slide, apologies for that, but they're all pretty similar: if you look at the top three cited areas across each year, it tends to be healthcare, education, and science across the board, which is not necessarily surprising. The only difference is leisure jumps in here in 2017, but as that's the most recent year (sorry, the seven's slightly cut off there), that might just be because it's too recent. Of the two lines here, the top one represents Scopus and the bottom one is Google Books. Apologies that they're both in black and white, but when I submitted this to another conference, they wanted everything in black and white, so I had to go with two shades of grey, which is funny because it's grey literature, I suppose. Anyway, if we have a look, we have these thematic differences, where healthcare, education, and science
that have the highest impact via this sort of citation analysis-esque measure. Overall, the results in Scopus were cited higher than Google Books. This is probably just down to the fact that Google Books is obviously focusing on books and Scopus is mainly focused on journal articles.
There is obviously some overlap, but the difference between them isn't huge. Although there are some statistically significant differences, both might still be useful, depending on whether you're looking at books or journal articles. The precision estimates we got from a small sample were 82% for Scopus and 71% for Google Books,
which wasn't too bad considering we were coming up with our own indicator, and much better than the one in 1,200 we had in the example I showed earlier. This method is useful: obviously I'm from the UK, and the REF is coming up in two years.
The last one was in 2014, the next one in two years. Part of the REF is a university's output; they're trying to prove that their work is impactful. If they have something that isn't necessarily written for publication, they can still show that the work they're doing has impact. It's also possibly of use for external companies,
which I've not really considered right now, but possibly will in the future. My one suggestion is that if grey literature could have DOIs, that would be great, but good luck getting that done. Obviously, if the UK government wanted it tracked better, they could generate DOIs for all their output, and it would make everything a lot easier.
If we just look at a little bit of characteristics, if we just look at the top cited articles and remove some of those false positives that were obvious, the theme seems to be more healthcare related. We've got healthcare reports, healthcare studies, statistics and annual reports and the NHS constitution.
Four of the top five most cited were actually on education, even though most of those with 60 or more citations were healthcare related. Their themes varied: there were lots of different subjects and lots of different age ranges in there. Some of the more recent developments like special educational needs
were included as well. Another sort of trendy one is e-cigarettes as well, that was in there, lots of research going into that, it's a relatively new thing. And physical activity and staying active and obesity, they were all mentioned as well. So these sort of things that aren't necessarily surprising,
but it's nice to see that that's actually what's happening. The types of publication also seem to vary: whether it's a governmental white paper, a healthcare report, or some kind of educational report, they vary within each subject area. As for the limitations of this method, this is obviously one case study,
but the methods are relatively repeatable. The crawling of the website and the slightly changed URLs might have to be handled differently, but a similar method could be used. Scopus's API is limited to 10,000 queries per week, but when we ran it, we broke it and it just carried on going, which is great; it made my work a lot easier.
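For anyone replicating this without getting lucky with the API, a quota-aware query loop is the safer pattern. This sketch is a generic illustration: the stand-in client function, the pause, and the default cap (taken from the 10,000-per-week figure mentioned above) are placeholders, not the real Scopus client or Webometric Analyst behaviour.

```python
import time

def run_queries(queries, search_fn, quota=10_000, pause=0.0):
    """Run queries through a client function, stopping cleanly at the quota
    instead of relying on the API misbehaving."""
    results = []
    for i, query in enumerate(queries):
        if i >= quota:
            break  # quota reached; resume in the next quota window
        results.append(search_fn(query))
        if pause:
            time.sleep(pause)  # stay polite between requests
    return results

fake_search = lambda q: {"query": q, "hits": 0}  # stand-in client for the demo
out = run_queries(["publications/a", "publications/b", "publications/c"],
                  fake_search, quota=2)
print(len(out))  # 2 (the third query exceeds the quota)
```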
I don't know if we'd be able to replicate that breaking, but it happened, so that's one possible limitation if you're doing a large study. Some of the merging of the UK government policy areas is a bit arbitrary. For example, the ones I mentioned earlier, schools, further education, and higher education, make a logical group, but then we've got things like housing and travel: one was housing, one was travel, and we just put them together. But those were the less impactful areas anyway. Also, generic URLs might produce many false positives, which is a problem, so you might have to sift through the data and remove those generic URLs first. The future work that I'm doing is the reverse of that.
So if you go into a grey literature document and have a look inside, some of them have reference sections; not all of them, but here's one that does. If we zoom in, we can see that this one references journal articles and books as well. Trying to identify the reference sections in grey literature is my current study, and I'll be using Publish or Perish and all sorts of other tools to do that.
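As a rough illustration of that future-work step, a naive heuristic for pulling out a reference section from extracted text might look like this. Real grey literature documents would need far more robust handling; the function and the sample lines are invented.

```python
def reference_section(lines):
    """Return the lines after a 'References' or 'Bibliography' heading,
    or an empty list if no such heading is found (a naive heuristic)."""
    for i, line in enumerate(lines):
        if line.strip().lower() in ("references", "bibliography"):
            return lines[i + 1:]
    return []

doc = [
    "Some body text about policy.",
    "References",
    "Smith, J. (2015). A journal article.",
    "Jones, A. (2016). A cited book.",
]
print(len(reference_section(doc)))  # 2
```

In practice a heading can be inline, numbered, or absent entirely, which is exactly why identifying reference sections in grey literature is a study in itself.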
Okay, thanks for listening. If you didn't understand any of that, there's an example of grey literature. That's my little joke. I avoided the 50 Shades reference, so that was probably good. If you need to contact me, there's my email. If you want data tables, there's stuff on Figshare, all sorts, and I invite you to ask any questions if you'd like.
Any questions? Why Google Books? Mostly from a suggestion from my supervisor.
But we considered just using Scopus to start with. We wanted to compare it to something, but the coverage of books in Scopus is not as large, and we wanted something in which we could identify book references as well. Google Books was our first choice, and we ran with it. We didn't pick any other book repository,
so the method should be replicable with another source that covers books, but those are the two we went with. I'm also specifically asking, well, I'm not a huge fan of Google in academia in general, but besides Google Books we also have Google Scholar,
which also searches in Google Books. Yes, we did some basic tests first, and it seemed like Google Scholar was missing a lot of the references we were actually finding in Google Books, which was weird. We thought that they would be kind of well intertwined, but we noticed there was a bit of missing data from Scholar, so that's why we went with Books.
Also, there's already an API search facility for Google Books in Webometric Analyst, which was quite attractive, because I could go straight into that method and not have to develop a new one for Google Books. From Mike's work, actually, there are a lot of false positives you'll get from Google Books due to the way it matches against the description.
However, Mike's built a routine that can actually filter out the false positives, and it's pretty accurate from his tests, so that's another reason why we went with it. Okay, thanks. No problem. Any more questions? No? I have a quick question. People who assign DOIs do it
because they care about the reliability of the citation. You made a joke about the UK government, but in truth there's a civil service in the government, but stuff does go missing. Do you think there's greater care about this and provenance information, or do you think actually it's more convenient that you can't track it?
I've not really thought about that, I suppose. There's a possibility that some of the work that they have written ends up adapted and going into some kind of published work anyway, so they might not be too fussed about this part of their work. However, just from some pilot searches that we did,
they are getting referenced, so some people care about them, even if the original author didn't intend it. So the authors might not feel that they're worth DOIs or whatever, but clearly some people do.