Day 2: The Value of Connected Systems and wrap up
Formal Metadata
Number of Parts: 8
License: CC Attribution 3.0 Germany. You may use, adapt, copy, distribute, and make the work or its content publicly available in unchanged or modified form for any legal purpose, provided that the name of the author/rights holder is given in the manner specified by them.
Identifier: 10.5446/46262 (DOI)
Transcript: English (automatically generated)
00:03
Okay. Thanks, everybody. Sorry to drag you in from coffee. We're nearly there. You'll all be pleased to hear, I'm sure, just one more main session and then one little wrap-up session. But I think this will be a good one, so it's definitely worth sticking around for. We're now going to hear three presentations on the value of connected systems.
00:23
First up, we have Joe Wass from Crossref. Joe is going to give us a presentation entitled Transparent by Design: Building Open Infrastructure for the Altmetrics Community. Joe, the floor is yours.
00:41
Hi. Yep. Apart from me messing this up, as you'll see, sorry about that.
01:09
Sorry, Claire, I've broken it again. I've kind of half-moved it to the other screen.
01:45
This feels like kind of a calm before the storm. Tell the joke. We're waiting. I only know one joke, and I couldn't possibly tell it here. Sorry, there's a special way to do the PDF.
02:05
Okay, well, there's meant to be some text on that slide. Okay. Okay, there's the text. How's that for technical stand-up?
02:21
More magic. Hi, I'm Joe. I'm from Crossref, here with Maddy, who you've heard of before, here to talk to you about Event Data. It's been quite an intense couple of days, so I'm going to stand here to try and keep myself awake. It's good to be back. I was here last year with Jennifer Kemp, and we did a slightly unconventional thing
02:43
about what it means to use metrics and some of the things you need to think about if you're going to use them in your organization. And we weren't trying to preach to the choir. We were trying to demonstrate the stuff that we were thinking about this time last year. And I was here last year with a sneak preview, and we're back here now with a beta. So it's there. It's available for you to use,
03:01
and I'd love to hear some feedback from you guys using it. So in case you don't know who we are, you've heard CrossRef quite a lot. You've heard DOIs. I know there's a couple of people who don't know exactly kind of what the deal is. We're a nonprofit. We're an association of publishers. And the main thing we do is we allow publishers to register content, and when they register it, we give it a DOI.
03:22
That's a DOI. That's one of our pet DOIs. And if you click on this, you get to the article landing page. And that's the main thing we do. We register DOIs. But more important than that, we run the infrastructure to keep them working because a link's just a link unless there's someone looking after it. So in addition to having registered content and DOIs,
03:42
we have metadata and links about that. We have links to authors via their ORCIDs. We have citations via their DOIs. We have citations to datasets via their DOIs. We have funders, grants information, clinical trial numbers. It's all links. And because... Yeah.
04:02
So we have... There's kind of two broad views of the world from our perspective. There's all of our registered content. It's weird, the difference between these two screens. There's all of our registered content, which is mostly articles, preprints, books, that kind of stuff, stuff with DOIs, and then there's the rest of the world. And we have these links, like citation links
04:21
and relations funding information. And all these links come from our registered content to something else. And we have approaching 100 million items of registered content. And because we have all this content we're looking after, or we're looking after the metadata for it, not the content, we thought we'd do a bit of an experiment a few years back, and we thought,
04:42
Altmetrics, this sounds interesting. It's all links, isn't it? A link from a tweet to an article is similar, if you kind of blur your eyes a bit, to a link from an article to a dataset or a link between two articles. So we spun up a copy of PLOS ALM, which has since been renamed Lagotto,
05:00
and we loaded all our DOIs into it, or at least a subset, and we watched as it didn't do the job, at least didn't do the job that we wanted it to do. It was great for PLOS because PLOS, it was trying to collect metrics at the article level. But it didn't work for us. What we're trying to do is create something for the whole of scholarly publishing.
05:21
We're not trying to create metrics, we're just trying to get the underlying links. So PLOS ALM wasn't the right thing for us, because it didn't collect the data in a transparent way. So we thought, yeah,
05:41
and another thing that wasn't quite the right thing with PLOS ALM was it went out to all these different APIs and said, how many tweets have you got, how many mentions? We realized what we should in fact be collecting is individual links. So here's a new thing on our chart, Event Data, and it's links from outside, non-traditional stuff, in towards our scholarly content.
06:02
So it's just links, we're just collecting more links. It's not that simple, of course, because we're looking in non-traditional places and things work slightly differently outside scholarly publishing. So for a start, if you're looking at blog authors, of course they don't use DOIs, they use article landing pages. So we couldn't just collect links from blog URLs to DOIs
06:21
because there aren't very many of them. There's quite a diverse range of platforms. There's all kinds of different blogs, there's Reddit, Twitter, all kinds of places you might find data, and they all work slightly differently and in slightly different ways. So unlike scholarly publishing, which has general rules and structured metadata and persistent identifiers that people tend to use.
06:42
None of that really applies in this new world. And there's no publisher looking after the content. If a publisher publishes an article, it's up to them to say, here's my citations, I'm going to look after the metadata for it. If we're looking at blogs, then I just put a blog out there and I don't really care about looking after it necessarily,
07:02
checking for link rot, for example. And no structured metadata, as I said. In addition to that, pages change. If I publish an article and something is wrong with it, maybe I need to issue an update or retract it. In theory, that means a new DOI, a new piece of content, metadata to represent that.
07:21
There's a process around change. Wikipedia pages, no such thing. They change all the time. You can't just say there's a citation from this Wikipedia page to this DOI because that may not be true tomorrow. It's all a bit different. So we went from thinking, what we need to maintain is a triple store, a link store, an assertion store, that kind of thing,
07:41
between all these pages and all these DOIs. That doesn't work because people aren't using DOIs, and references come and go over time. So a triple store isn't quite the right thing to do, and instead we settled on this idea of events. An event is an occasion on which we notice
08:00
that there is a link, and it's how we tell you about that. So what comprises an event? What goes into an event? So we're going to observe all kinds of different places around the web, and when we see a link from one of those places to one of our DOIs or a piece of registered content, we're going to create an event. And this approach gives rise to interesting quirks.
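To make that concrete before getting to those quirks, here is a hypothetical, hand-written sketch of roughly what a single event looks like as JSON. The field names follow my reading of the public Event Data documentation; treat the exact names and values as illustrative assumptions rather than an authoritative schema.

```python
# A hypothetical Event Data-style event, written out by hand for illustration.
# Field names follow my reading of the public documentation; all values are made up.
example_event = {
    "id": "00000000-0000-0000-0000-000000000000",                  # made-up event UUID
    "source_id": "twitter",                                        # which agent observed the link
    "subj_id": "http://twitter.com/someuser/statuses/123456789",   # the thing doing the linking
    "obj_id": "https://doi.org/10.5555/12345678",                  # the registered content linked to
    "relation_type_id": "discusses",                               # how the subject relates to the object
    "occurred_at": "2017-09-27T10:15:00Z",                         # when the link was observed to occur
    "evidence_record": "https://evidence.eventdata.crossref.org/evidence/2017-example",  # made-up trace URL
}

# The key idea: an event records one observation of one link at a point in time,
# not a running count, so it is up to the consumer to aggregate or weight them.
print(example_event["subj_id"], "->", example_event["obj_id"])
```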
08:22
For example, blogs. You may know that the Google Blogger platform will publish the same blog in different places. So you might have the same content published at different URLs. That means technically all these different URLs link to this one DOI. If you look at Twitter, for example, you might tweet two DOIs in the same tweet
08:43
if you're comparing them. That means one event isn't equivalent to one tweet. An event is every time we saw some kind of link in something. Wikipedia, of course, things change over time. And you can't say for sure, here is this reference, and it will be true forever.
09:02
So we collect all this data, and it's your choice about how you want to interpret this. Everybody, as we've seen, has different ways that they are trying to classify stuff. People may be interested in looking at certain authors or certain domains. They may be interested in looking at analysis of bots.
09:25
So if we provide data in as raw a format as possible, then you can choose how you want to interact with it. So if you want to make use of this kind of data, you have a few questions. Which data points do you want to count, and how much do you want to weight them? If you look at Twitter,
09:40
are you interested in original tweets? Are you interested in retweets? You may want to weight them different ways according to what you're trying to do. Are you interested in tweets that contain just a link? Do you care about if there's any text in there? Might that be an indicator that it's a bot account?
10:01
Spam. If you can cast your minds back, I couldn't help but notice all these Elsevier accounts on the altmetrics17 hashtag all tweeted the same thing. Spam is a problem, and it affects not only altmetrics, but it definitely affects us.
10:20
You might be wanting to do some research into spam. Sorry. If you look at Reddit, you find loads of links to the scholarly literature on Reddit, and if you follow them, some of them are full conversations
10:41
where people are debating the pros and cons of a piece of literature. Some are just links which are posted without comment. This may indicate more or less interaction with the content in question. They may be posted by bot accounts, or they may be posted by humans. You may want to look at the reputation of the person who posted it within the Reddit community
11:03
to see if that's something that's of interest to you. Wikipedia edits. This is something that we thought quite hard about how we want to represent because a citation changes over time. We came to the conclusion that we need to monitor every revision of every single Wikipedia page
11:22
and then record the reference from that Wikipedia page version to the DOIs, and this means there's a lot of data. There's a lot of churn. If you have a page which has lots of edits in it, you'll find lots of events because we found lots of page versions which all reference the same DOI. So we tell you everything we've seen, and that's a lot of data,
11:41
and it's up to you to work out what you want to do with it. We tell you everything we know, and how do we know? Well, we're trying to create something transparent in a way that we can then tell you what we've discovered and allow you to make your decisions about how you want to interact with it. So what we start with is a list of artifacts,
12:00
and an artifact is a snapshot of a piece of information that we know. Examples of these are the list of RSS feeds that we're following and the list of publisher domains, the domains we think belong to publishers. And by snapshotting those in versioned files, we can say at this given point in time, this is what we knew.
12:23
And those are consumed further down the stream of processing, and it means by the time you come to consume a piece of data that we've created, we can communicate to you the assumptions that we made, the things that we knew. And if you have a question like,
12:40
why did you collect this piece of information, or why didn't you, you can go back to the artifact and we can say, right, this is what we knew at this point in time. We produce evidence records and evidence logs. Every piece of activity that we undertake to try and match a link, we log in our evidence log, and this adds up to about a gigabyte of data per day.
13:02
So there's quite a lot of detail. It means that every API request we made to try and identify a link, we log that. Every time we were blocked by robots.txt file, we log that. Every time we saw something that we thought might turn into a link, we log that. We log successes and failures
13:20
and the whole process by which we get data from the outside and end up with events. This is quite a lot of data. All the code is open source, which means you can go back and look at the tools that we were using to curate this data, to produce these events, and you can line up the logs and the log messages
13:43
with the versions of software that we were using, which means you can explain as well as we can how we arrived at the data. Just because there's no data, it doesn't mean that activity didn't happen. These things aren't perfect. Sometimes it's possible to match a landing page back to a DOI.
14:02
Sometimes it's more difficult. We log our failures as well as our successes. Hopefully, if you've got a data set and you want to use it for research, you can at least make some attempt to quantify the parameters around that. If you find, hey, this is an interesting signal, we didn't get some data for this publisher,
14:21
you can look at the logs and say, actually, it's because the publisher was blocking the bots. So you might have seen my blog post a couple of weeks ago. We have the data coming in at the bottom. This was when I was trying to make an analogy to oil refinement. The input data comes in at the bottom, and events come out at the top. But all along the way are these intermediary products.
14:41
We have the evidence records which describe the activity that the agents did to create the data, the evidence logs which describe all the activity, and the artifacts at the same time. You may be interested in all these links, and I hope you can use them for research. But by the same token, the evidence records or the evidence logs may be the level of detail you're looking for.
15:02
So all the data in this API is completely free. It's either CC-BY or CC0, et cetera, which means hopefully you can reuse it in your research. Event Data is open for beta. I'd love to hear you guys' experience using it. It was really cool to see, is it PaperBuzz?
15:21
That wasn't forced. I didn't know much about it until yesterday. It's very cool to see it being used in that. So we're tracking a few different sources. We have approximately 100,000 Reddit events, whatever an event is, about 3 million Twitter events, 30 million Wikipedia events. We're looking at Hypothes.is annotations,
15:42
WordPress.com blogs, and Stack Exchange. That's the beginning of where we're looking, and we're going to try and broaden that out as time goes on. So the Event Data User Guide is there. It's quite exhaustive. There's a lot of detail in there, but hopefully it explains every single part of the process about how we create this stuff.
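As a rough illustration of what using the beta looks like in practice, here is one way to pull the events that point at a single DOI. The endpoint and parameter names follow my reading of the Event Data User Guide and should be treated as assumptions to check against the current documentation; the DOI and contact address are placeholders.

```python
# Minimal sketch: fetch Event Data events whose object is a given DOI.
# Endpoint and parameter names are assumptions based on the public user guide.
import requests

EVENTDATA_API = "https://api.eventdata.crossref.org/v1/events"

def events_for_doi(doi, rows=100):
    """Return the first page of events for one DOI (paging omitted for brevity)."""
    params = {
        "obj-id": doi,                  # filter on the linked-to DOI
        "rows": rows,                   # page size
        "mailto": "you@example.org",    # polite contact address (placeholder)
    }
    resp = requests.get(EVENTDATA_API, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json().get("message", {}).get("events", [])

if __name__ == "__main__":
    for ev in events_for_doi("10.5555/12345678"):
        print(ev.get("source_id"), ev.get("subj_id"), "->", ev.get("obj_id"))
```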
16:00
There's a few blog pieces on our blog, and you can contact us. Cool, thank you. Yeah, yeah. Does anyone have any questions for Joe? Sorry, that was a bit gabbled. I got quite a dry mouth. I have one, if no one else does.
16:22
So we heard a couple of times, I can't remember now if it was yesterday or today, about the issues with publisher metadata, which I imagine is going to become a bit of a problem with this service potentially as well. Like quite often, for example, we'll find DOI links that don't resolve and things like that. So is working with publishers to help them improve that going to be part of this project as well?
16:43
Not actively, I wouldn't say. We have a mission to improve the quality of metadata in general, and of course that has impacts on all kinds of things, including this. I hope this might help to kind of surface some of the metadata quality issues that we see.
17:01
Yeah. Thanks. Yeah, we'd love to see that, certainly. I think when you start to bring this stuff out like this, people are quite quick to pick up on it, so it would be awesome, whatever you guys can do there. Any other questions? No? Everyone tired? Nearly there. Okay, thanks, Joe. Right, next up, we have Jason Priem from Impact Story.
17:24
Jason is going to talk to us about Unpaywall analytics, a novel metric of article readership using tracking data from a popular web browser extension. All right, I have the temerity to try and use the clip-on mic. Does everybody hear that okay?
17:41
All right, terrific. All right, I think I'm waiting for some slides. I'm not allowed to touch the buttons anymore. Before we actually bother getting the slides up, I would like to try and start, before we get into the nitty-gritty,
18:01
with looking at the broader picture of why we're doing this kind of thing. That's what kind of gets me excited about it, gets me up in the morning, and it just is a great time to be doing this presentation at the end of the conference because I've been able to see the scope and the richness and the diversity and creativity of the work that's been presented here already. Really, really exciting to me.
18:21
Definitely fills my heart with joy. I have been involved with this for a long time now, but at the same time, while I say a long time, seven years, something like that, it makes me realize how early things are in the altmetrics world and how exciting the next seven years and the next seven years after that are going to be. If we look at the history of human knowledge,
18:44
and someone may be timing me to say how long before Jason says the history of human knowledge in the course of the talk, but in the history of human knowledge, we start off with just talking to people, right? We just share ideas with people. Eventually, it gets written down into books, and that's a really important transition. And then quite recently, we've gone from books
19:01
to being able to represent it online, right, so we can see the availability of knowledge, right? It takes a big leap up, right, when we get sort of the internet. But I think that's only the first three. There's this fourth one that we're about to see, and we're on the cusp of, and y'all in this room are helping to usher in. We in this room are helping to usher in. That's why I'm so excited in this room.
19:20
And that's the movement from a piece of knowledge as a dead thing, as a dead and fixed and static thing, like a web page where it kind of just sits there, to a network of interactions, a network of links, a network of events, right? And we've heard a lot of people talk about altmetric events over the last two days, and knowledge is that, right? Knowledge is not a static and stuffed dead thing
19:42
like a dinosaur at a museum, which might be a great thing, but knowledge is actually a series of conversations. It's a bunch of conversations happening all the time, and it's constantly moving. It's the difference between if you take, you know, basil, like dried basil, you put it on your food, it's good, it's kind of got a basil leaf flavor, but if you take live basil, like a live basil leaf, right, and you rip it right there, and you can smell all the smells,
20:01
and it's so vibrant, that's what knowledge is, right? Knowledge is this conversation in this room right now. And as these words that I say come out of my mouth, they disappear, they're gone, right? And somebody else is going to say something, disappear, it's gone, right? But in the altmetrics world, it leaves a trace. And we heard about traces in one of the other talks, how that was fantastic. I think Mom was talking about that. I've often talked about traces as, you know,
20:20
the landscape of ideas, we've got a fresh fall of snow. Right, and you walk across that snow, and all of a sudden there's footprints, where before there was nothing. No one knew about the idea, right? It comes, it goes, who knows? Maybe someone builds a house eventually and you can see it. But now we can see all the traces, all the movement. So it just makes me so excited to be here and see us at this cusp, this revolution, and not just scholarly metrics, which is great and I'm really excited about, but human knowledge.
20:42
I think 10 or 20 years from now, people are going to be looking at this conference, and conferences like this, and say, wow, they were really at the cusp of something special. Not just a new metric for tracking something. That said, guess what I'm going to present? A new metric for tracking something. Because big things come out of little things, you've got to add them all up, right? And I am really excited about this little thing
21:01
that me and Heather have been working on as a way to help expand or increase the granularity of the resolution of the scope that we have to look at this ever-changing landscape of human knowledge. Also, it puts me in mind, do you mind giving me a warning when I have like five minutes left, because otherwise,
21:20
I may just go. Thank you so much. So one of the questions that got me excited about altmetrics a while back, like before I think there was even such a thing as anyone saying altmetrics, was, wouldn't it be cool if we could see how many people were reading a particular resource? That seems like it would be super valuable. I'm always kind of curious about it. It's certainly in the world of blogs and things, right? You're always checking your readership numbers,
21:41
your Google Analytics or whatever. How many people are reading this thing? And I thought, well, that would be great. And of course, it's very difficult to get ahold of, which is partly why I got excited about looking at Twitter and other kinds of metrics like that. But I've never lost my interest, like how many people are reading this thing? And there's a couple places that we could look to try and help us find who's reading this thing, right? So there are sources of data out there.
22:00
So PLOS is probably the most notable one, but a number of other publishers let people know what the readership data is for a given article, right? So you can kind of check this metrics tab, and on the metrics tab, it says the number of readers, and sometimes it'll even break it down by days. You can kind of see this little pretty curve, like, you know, gradually the number of reads decays. It's pretty cool. I like it. But, of course, it only works if a publisher has gone to the effort
22:21
of making this little readership tab, right? And quite a few of them have not. In fact, a lot of them actively go to efforts to keep you from finding out, which cracks me up. It's so funny. They're like, no, no. We definitely wouldn't want to share that. Wait, so you're ashamed? You're literally ashamed of the number of people that are reading the things in your journal? It doesn't seem like a real advertisement
22:41
that I would want to publish there, but that's the state of play right now. So they do have this approach called COUNTER, right? Those of you all in the library world are probably very familiar with this, and it's a format of a way that publishers can report their usage to libraries. Of course, if you're not a library who's buying that journal, you don't have access to that. So those of us as researchers don't really have access. In addition, the COUNTER formats are a little antiquated.
23:01
They don't generally give you sort of, like, DOI. There's, I think, kind of changes afoot about that, but right now they don't give you DOI by DOI or article by article. So this is kind of just at the level of the journal title as a whole, which is whatever, fine, but it's not the kind of thing I would really like to see as a researcher. You can kind of do special arrangements with publishers. Again, you know, because for them, it's like, oh, it's ever so secret, right? And so, you know, the MESUR project with Johan Bollen
23:22
was, I think, leading up that project. Did terrific work. They got a big old mess of readership events. But again, it was kind of a one-time special dispensation. You've got to kind of go with your hat in your hand, like, please, please, please, could I have some readership data? And maybe you get lucky, maybe you don't. And so again, it seems like there's some real limitations there. Some people looked at Mendeley as a proxy for readership. I think that's really cool.
23:41
It's not the same as readership. A lot of people don't even know what Mendeley is, so that's not exactly the same. And finally, Crossref DOI resolutions, I think, is a really cool source. Any time Joe wants to just give us a whole bucket of those, then I will gladly take them. But right now, Crossref is not giving those out en masse for research purposes, or particularly for building tools on top of. So maybe that will happen.
24:00
I hope so. I think that would be great. In the meantime, what are we supposed to do? I want to know how many people are reading this thing. That seems like a reasonable question. Should someone give me the answer to that? Well, I've got good news for you. There's this thing called Unpaywall. It's a browser extension we made, and we think it's going to help solve this problem. So let me introduce Unpaywall to you. It's powered by something called OA-DOI. We've heard a couple people mention that, but I'm not sure how many people might be familiar with this.
24:21
OA-DOI is a database of all the papers that have a DOI. So it's 90 million papers, Crossref DOIs, not DataCite DOIs at this time. And it links to an OA version of about 15 million of those papers. That number has grown as we kind of improve the algorithm. So if there's an open access version of a paper on the internet somewhere, we will find it, and then we will give it to you via a query to our database.
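As a minimal sketch of what "a query to our database" looks like from the outside, here is how one might ask the REST API whether a DOI has an open access copy. I am using the current api.unpaywall.org endpoint (at the time of the talk the equivalent oaDOI endpoint played this role); field names follow my reading of the public docs and should be double-checked.

```python
# Minimal sketch: ask the Unpaywall/oaDOI API whether a DOI has an OA copy.
# Endpoint and field names are assumptions based on the public documentation.
import requests

def best_oa_url(doi, email="you@example.org"):
    """Return the URL of the best open access copy of a DOI, or None if closed."""
    resp = requests.get(f"https://api.unpaywall.org/v2/{doi}",
                        params={"email": email}, timeout=30)
    resp.raise_for_status()
    record = resp.json()
    if record.get("is_oa") and record.get("best_oa_location"):
        return record["best_oa_location"].get("url")
    return None

if __name__ == "__main__":
    # Any Crossref DOI works; this one is just an example.
    print(best_oa_url("10.1371/journal.pone.0000308"))
```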
24:43
At least that's what it says on the tin. Obviously we miss some from time to time, but we think we're probably doing as good a job as anybody else in the world is right now of finding open access links, links to open access articles and letting people see them. It handles a lot of traffic. I was looking at this, but I don't know. It's just kind of amazing. I thought I'd share this with you. It's between 500,000 and 2 million API requests every day.
25:03
So to put that in context, Joe helpfully informed me that the Crossref DOI resolver does about a million a day. So if you look at how many DOIs get resolved total a day, we're doing a similar amount of volume of that at OADOI. So there's a tremendous amount of traffic, a lot of usage in the system. Users include a lot of libraries.
25:21
Oops, that's the wrong one. Sorry. Oops. Yeah, library link resolvers. So link resolvers like SFX. If you're in the library world, you're probably familiar with that. Otherwise you probably won't get into it. But the upshot is it's being used by British Library, UC system, 700 libraries worldwide. It's going to be in the Web of Science, which we're really excited about. So when you see your Web of Science results,
25:40
right now it'll kind of say whether something's open access, but they're not doing a very complete job of that. And so we're going to be working with them to help them do a more complete job of that. So if it's open access anywhere on the web, you do your Web of Science search, you see it, open access, boom, provided by OADOI. You go click through and read it, which we're excited about. And a number of national funding bodies and repositories are doing research about their open access production.
26:02
But the user of OADOI that I want to focus on today is this web extension called Unpaywall. Unpaywall I can describe to you in a picture pretty easily. You use your computer to go look at a web page that has a scholarly article on it. And if you get a green tab, it's free, boom, done.
26:21
Click on the green tab, read the thing. Very simple. We didn't want to overshoot our skill level on this. We figured we'd keep it simple. So green tab means it's free. Download the web extension and you're good to go. And so far a lot of people have downloaded this web extension. We launched it in March. And we've got 100,000 users right now, active users, these people who use it at least once a week.
26:42
And as part of this, of course we have logs because Unpaywall is constantly calling OADOI. And by looking at these logs, we can see how many times is someone landing on a scholarly article and reading or viewing or whatever with Unpaywall. And we're doing 80,000 of those every day. So every day, we get 80,000 of these new usage events.
27:01
And again, to kind of compare that with other systems, I asked around, and Euan was helpful enough to point me at the Altmetric API. It handles about similar numbers, so I think between like 60 and 100,000 altmetrics events per day. So if we're thinking about volume,
27:20
that's sort of everything, right? It's like Twitter plus Wikipedia plus et cetera, et cetera. So we're looking at a pretty significant amount of volume here daily, particularly given that this is a relatively new system. And that number is continuing to grow, right, as our installed user base grows. So I want to really hammer that home: there are a lot of these events. There are a similar number of Unpaywall reads as there are all the other altmetrics events of every kind that are being collected.
27:42
So it's a large corpus. And we're excited about that. Total 4.6 million events. And right now, there's only 2 million DOIs that actually have a single event, which is a relatively, I think, small proportion. Of course, that's skewed quite heavily towards recent articles because Unpaywall is based on what people read and people read not so much stuff from 1972.
28:01
So we get a lot of recent events. So for a recent article, we have a pretty good chance of there being at least one OADOI event. People use it from all over the world. This is a pretty flat map. Obviously, the United States sort of sticks out. The UK sticks out a little bit. And the third place country is China. Those are where most of our Unpaywall users are. But we're really excited about the fact that it's pretty hard to find a country on this map
28:21
that isn't using Unpaywall. And because of that, we're able to get kind of a wide geographic distribution, particularly if you want to zoom in on one particular country. So that's the Unpaywall data. Obviously, I don't have to work very hard to link, in your minds, hopefully, the use of this Unpaywall data to the problem I set up earlier of, if only I had readership numbers.
28:40
We have readership numbers. We have readership numbers via Unpaywall, right? And we have them for all papers, not just papers in PLOS, not just papers where we went with our hat in our hand and begged some publisher, please give us the readership events. No, we got it for everything, everything. And it gets better. We don't just have counts, right? If you get a PLOS API, you say, how many people are reading this? Oh, 100, 200, 2000, whatever.
29:00
They'll give you a number. Yay, that's good. I appreciate having a number. But even better than a number, as Joe was exactly just saying just now, is an event because a read is not a count. It's a thing that happened in time, right? At a certain time, a certain person read this certain thing. And if we want to get to the next generation, right, away from the static world of the dried basil kind of approach to knowledge,
29:21
and we want to get to the exciting, vibrant world of living knowledge, we need to get those events. We need to get the events of this thing happening with this person and build this enormous network of billions and billions of interactions of every human interacting with every item every day. Now, obviously, Unpaywall's not quite there. But hopefully we're stepping in that direction, right? So the events give us, speaking specifically, fine time granularity.
29:44
So we give you hour by hour granularity. So we don't just have to say, who read this ever? We can say, who read it this hour? That's pretty handy, right? Not only that, but we also can get IP-derived things like geographic location. The IP is an address that uniquely identifies every computer. Nominally every computer. Nowadays, a lot of times, multiple computers will share a single IP.
30:02
But more or less, every computer on the internet gets an IP address. And by looking at the IP address, you can find some things about the computer, right? So if you're browsing the internet, and it says, welcome, user from Toronto, you know, Toronto, Canada, they're getting that from your IP address, right? Because they go, oh, this IP address, they got some table somewhere, and it says, oh, this particular IP address, it lives in Toronto in this particular city.
30:22
So we can get information about the geographical location from the IP address, which is really handy, if we want to find out not just that it's being read, but who is reading it.
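That "table somewhere" is essentially a GeoIP database. As a hedged illustration of the general idea (one common approach, not a description of Unpaywall's actual internals), here is how an IP-to-location lookup might work with the geoip2 library and a downloaded MaxMind GeoLite2 city database:

```python
# Illustrative only: map an IP address to a coarse location with a GeoIP database.
# Assumes a local GeoLite2-City.mmdb file; this shows the general idea, not
# Unpaywall's actual pipeline.
import geoip2.database
import geoip2.errors

def locate(ip_address, db_path="GeoLite2-City.mmdb"):
    """Return (country_code, city) for an IP address, or (None, None) if unknown."""
    with geoip2.database.Reader(db_path) as reader:
        try:
            resp = reader.city(ip_address)
        except geoip2.errors.AddressNotFoundError:
            return None, None
        return resp.country.iso_code, resp.city.name

print(locate("128.100.100.128"))  # e.g. might print ('CA', 'Toronto')
```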
30:42
And meaning requires identity, it requires intention, right? It requires those sorts of qualities. So how does this data compare to existing data sources? Things that are already, that we already now have a number of years of experience dealing with. Well, the first thing I thought was, I would love to compare it to Twitter accounts. So Heather and I, mostly Heather, sat down and kind of tried to gather a bunch of data from this.
31:06
This is based on this week. We just thought it would be kind of fun to do the most recent week of data, and we also have it for, I think some of you all remember we showed this thing called PaperBuzz. If you haven't seen it yet, check it out, paperbuzz.org. It uses unpayable data, and I'll talk about it a little bit more later.
31:20
But it uses the last week, it's trying to say what's popular in the last week. So we already had that data kind of sitting around. So let's look at the last week of data and compare how much unpayable readership a given article is getting versus the number of tweets the article is getting. And tweets, we use that mostly just because tweets tend to correlate a lot with the other, with for instance an all-metric score and for other kinds of all-metrics activities.
31:41
So it was a useful sort of benchmark. So as you can see, there's a noticeable association, right? So this is a log-log plot. Obviously everything's like actually clustered down in the bottom left corner. But if you log-log plot it, you can kind of see, oh, this is definitely an association there. As unpayable reads go up, we also see more tweets and vice versa, right? Or vice versa, right?
32:01
Like the association is there in regards to the directionality, which is pretty exciting. I was happy to see that because it lets me know, hey, this isn't just out of the blue. We don't have total noise. So the correlation is .55, for those of you all keeping track at home. Five minutes. Five, perfect, thank you. All right, but of course you all want to know who is reading the articles, right?
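For anyone who wants to reproduce that reads-versus-tweets comparison on their own data, here is a minimal sketch of the analysis as described: per-article Unpaywall read counts against tweet counts, a rank correlation, and a log-log scatter. The arrays below are made-up placeholders, not the dataset shown in the talk.

```python
# Sketch of the reads-vs-tweets comparison: rank correlation plus a log-log scatter.
# The arrays are made-up placeholders, not the data shown in the talk.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import spearmanr

unpaywall_reads = np.array([1, 2, 3, 5, 8, 20, 40, 100, 250])   # reads per article (fake)
tweet_counts    = np.array([0, 1, 1, 3, 2, 10, 15, 60, 120])    # tweets per article (fake)

rho, p = spearmanr(unpaywall_reads, tweet_counts)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")

# Log-log scatter; add 1 so articles with zero counts can still be plotted.
plt.scatter(unpaywall_reads + 1, tweet_counts + 1)
plt.xscale("log")
plt.yscale("log")
plt.xlabel("Unpaywall reads + 1")
plt.ylabel("Tweets + 1")
plt.title("Reads vs. tweets (illustrative data)")
plt.show()
```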
32:21
We got the readership. That's cool. I'm happy to see that. But I would love to dig deeper beyond just the counts, right? So as I was saying, whoops, boy, this thing really gets a little excited sometimes. We can tell if an IP is on a university campus, right? There's actually a contract with sort of a third party who has this big lookup table.
32:41
We can say, is this particular IP a university user? And we can then compare how many of the reads of this article are coming from a university campus versus someplace else. And in theory, if something is of tremendous academic interest, we would see all these nerds from the university like, oh, that's just so interesting, right?
33:01
Versus we would see all these regular people like, oh, this is great and so much fun, right? We'd see a lot of nerds reading it, not that many regular people. If, on the other hand, there's a lot of public interest, we see a lot of regular people, not so many nerds, right? So we can use that ratio to potentially see where the academic interest is. And here's what we found. So this is a categorization based on quartiles.
33:21
So I put the quartile with the most academic ratio over here. Sorry, this is the quartile that is the most academic, right? Where the ratio leans towards the academic side. This is the quartile that's the most public, where the ratio is on the public side. And these are the two quartiles in the middle. So if it looks like there are more dots in the middle, that's because that's two quartiles instead of one, right?
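A hedged sketch of that categorization step, assuming a table of per-article counts of university and non-university reads; the column names and numbers are invented for illustration, and the real pipeline is Unpaywall's, not this.

```python
# Sketch: split articles into quartiles by how "academic" their readership ratio is.
# Column names and data are invented for illustration.
import pandas as pd

df = pd.DataFrame({
    "doi":              ["10.1/a", "10.1/b", "10.1/c", "10.1/d", "10.1/e", "10.1/f", "10.1/g", "10.1/h"],
    "university_reads": [40, 5, 12, 90, 3, 25, 60, 7],
    "other_reads":      [10, 50, 12, 15, 40, 30, 20, 63],
})

# Fraction of reads coming from university IPs, per article.
df["academic_ratio"] = df["university_reads"] / (df["university_reads"] + df["other_reads"])

# Cut the ratio into quartiles: most public, two mixed quartiles, most academic.
df["group"] = pd.qcut(df["academic_ratio"], 4,
                      labels=["public", "mixed_low", "mixed_high", "academic"])

print(df[["doi", "academic_ratio", "group"]])
```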
33:43
So, unsurprisingly, we saw that the mixed group correlates with the Altmetric score. We kind of already saw that correlation together, so I'm not too surprised about that. I'm not that surprised that the public group would correlate with the tweet count. Sorry, tweet count, not Altmetric score. But I'm not surprised that the public group would correlate with tweet count, because you kind of think, in total volume, there's probably more regular non-university people on Twitter
34:06
than there are university people. But then I think it's very interesting, and again, I think kind of encouraging to see, that the academic articles, the articles that are in the most academic quartile, do not correlate so strongly, and again, this is rank order correlation, do not correlate so strongly with the tweet count.
34:24
Where presumably, we do know that Twitter use is, of course, a lot of scientists, a lot of regular people, but there's a lot of reason to think that it's a little bit more heavily balanced towards people who are outside the academy. And this would seem to bear that out. So what else can we do with this data? Well, we could sort of sum up the counts.
34:41
I think I'm really excited about the geo, so we can tell you what a particular country is reading. So we say, oh, given a country, given France, what's most interesting in France? What's disproportionately interesting in France? I'm quite interested in that, because if I'm a French research funder, I'd like to look and see what's really blowing up in France. Like, what is it we're funding that our people are particularly enjoying, because they're the ones paying for it, right? I think that's really interesting.
35:01
Where are the readers from is the other direction to take that, right? So given an article, who's reading this article? And so, in many cases, as an author, particularly in some kinds of fields, I really want the article read in a certain place. And so I can detect that, because we're familiar with this already. But what's different is that we're able to get higher resolution. We've got a few more options for how we can do that geographically.
35:23
One thing I'm really excited about is then, because we have the IPs, we can look and see, for a given person, let's say I read one article. Now I go and read this other article. Now I have a link between these two articles, right? Because I was interested in this, and I'm interested in this, too. So we've got a co-readership link between these two things.
35:42
I'm probably not going to, you know, I'm not just randomly walking around reading articles, I'm reading the kinds of things that are interesting to me. So by tracking what all of our 100,000 users are reading, and building these co-readership links, we can build a graph of what kinds of articles are similar to other articles. And this is the kind of stuff that Johan Bollen already did with the MESUR project. Beyond that, we can also start to build recommendations, right?
36:02
Maybe we can get this Amazon recommendation, like, similar users to you also bought this. So if I'm reading through Unpaywall, right, and me and Juan have both been reading similar things, I read three articles, Juan read the same three articles. We're article buddies, right? We don't even know it. But we're article match buddies. And Juan read this fourth article, I've never heard of it,
36:20
maybe I ought to read that one. Right? We're article buddies, I bet we have similar tastes in articles. So we can start doing those kinds of recommendations, we can start doing that filtering, right? And that's part of what we're trying to get at with PaperBuzz, again, check it out if you haven't, is we're trying to build a filter. And I think a lot of the end game for altmetrics has not just been about evaluation, which is great. Heard a lot of talk about evaluation. But, moving beyond evaluation, and filtering things.
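A minimal sketch of the co-readership idea under the simplest possible assumptions: given anonymized (user, DOI) read events, count how often pairs of DOIs are read by the same person. The data and the pairing scheme are illustrative, not Unpaywall's or PaperBuzz's actual recommendation code.

```python
# Sketch: build co-readership counts from anonymized (user, doi) read events.
# Data and approach are illustrative only.
from collections import defaultdict
from itertools import combinations

read_events = [  # (anonymized user id, doi) pairs, made up
    ("u1", "10.1/a"), ("u1", "10.1/b"), ("u1", "10.1/c"),
    ("u2", "10.1/a"), ("u2", "10.1/b"), ("u2", "10.1/d"),
    ("u3", "10.1/b"), ("u3", "10.1/c"),
]

# Group DOIs by reader, then count each unordered pair of co-read DOIs.
dois_by_user = defaultdict(set)
for user, doi in read_events:
    dois_by_user[user].add(doi)

co_read = defaultdict(int)
for dois in dois_by_user.values():
    for a, b in combinations(sorted(dois), 2):
        co_read[(a, b)] += 1

# Pairs co-read most often are candidates for "readers of X also read Y".
for pair, count in sorted(co_read.items(), key=lambda kv: -kv[1]):
    print(pair, count)
```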
36:41
Being able to say, what's useful for me as a reader? Not just, you're doing good, you're doing bad, I'll give you some money, or I'll have you fired instead. Let me find out what I as a reader would find fascinating. What I as a reader find edifying, valuable, and powerful to me to continue doing great research. And so we can do article validation, we can even get as far as doing some prediction about what's going to happen.
37:02
Because we do have a lot of readership data very early. So you can potentially say, oh, given this readership data, we think this is going to be, later be a citation hit, this is later going to be a Twitter hit, this is going to be this kind of hit, whatever. The main thing is it's doubling the amount of data that we have. And so we're going to be able to make more robust predictions. So we're really excited, everybody here go check out the API, it is live, it works right now.
37:21
It's really cool, every event comes back. We are a little bit concerned, obviously, about a couple issues that are not quite there. The last slide talks about a couple issues we're concerned about. Certainly, we need to know how representative Unpaywall users are of all readers. I don't know what's different about them from other people. Obviously, we do de-duplicate this data,
37:41
but de-duplication and catching bots or people who might try and game the algorithm is always a concern with metrics. So more work needs to be done on that for sure. And of course, there's also a real privacy concern, right? Because people who sign up to use a service don't necessarily want all the details of their searches out there. And one thing we've seen from the famous AOL leak and stuff is even if you release this stuff in a way that's meant to fuzz people's identity,
38:03
it's not so hard to deduce sometimes who people are. So what we've done is we've taken a pretty dramatically draconian approach to anonymity on this one. Like, we just don't report the IP address at all, we just give you the country. But maybe there are some ways to let that network information out.
38:22
We'd love to hear your thoughts. Anyway, check out the API; paperbuzz.org is where it's being reported, along with the Crossref Event Data. And I'm super out of time, thank you very much. Yes, get out. Okay, do we have any very quick questions for Jason? Yeah, Abigail?
38:46
I just have a question about this because a lot of this is based on the location data. As people use VPNs more and more because of the privacy issues, countries becoming more, as we say, intrusive in terms of what they're looking at when people are using the internet, how is that going to affect this product, this application?
39:04
It'll make it less valuable. It's a great question. You're absolutely right, and there's a number of things, not just VPNs, but also a lot of users are behind some kind of a subnet where they might have 20 or 25 or 120 users all sharing the same IP. And in that situation, their votes will get thrown out because if two people really quickly from the same building
39:24
on a university that shares the IP, for instance, both visit the same page, we're like, oh, you visited it twice, but no, it's two different people, right? So that's a problem, VPNs are a problem. All of the normal problems that come with inferring things from IP addresses will certainly apply to this. And so absolutely, that is a limit to the effectiveness.
39:42
Okay, if there are any more questions for Jason, I'm sure he'll be around for a little bit at the end if anyone wants to talk to him further. Right, last up in this session, we have Connie Young from Wiley. Connie is going to present on Between Open Science and Business Intelligence. Hello. Yes, I'm going to talk about something that's been on my mind for a while.
40:02
When we research altmetrics or citations or usage, how much of the total body of research that exists in the world do we actually cover? And related to that, how much of that might realistically go open or free? So for instance, we say, and we have seen some good presentations here today,
40:22
altmetrics can measure the impact of research on society, but is it all research? No, because a lot of it is happening in corporate labs, in the military, et cetera. If you think about what was the research that has had the most impact on German society recently,
40:40
for instance, definitely I would say it's the Volkswagen scandal. 850,000 jobs directly affected, over 11 million cars have lost their resale value, giant image disaster. And you ask yourself, where was all that published?
41:00
And it wasn't published anywhere. It was not only top secret, it was also criminal, so they couldn't even file a patent. So you start asking yourself, how much of this kind of research is out there? And I sent myself on a quest to investigate what we can find out.
41:21
Starting with what we know, my usual bread-and-butter universe is the journals in the Web of Science. Already, when you look at Crossref, you see there's a lot of other material in there, and maybe I should be looking at and investigating these more closely in future.
41:42
And something that we know about corporations is they don't publish articles, they publish patents. So I investigated how many patents exist, and apparently the volume of patents published each year is quite similar to the number of journal articles. So that is definitely also a body of research that needs closer attention than what I know our editors pay to it at the moment.
42:11
You see there's also quite a variation in the growth rates of this data, and you ask yourself, are there more researchers going to work each year,
42:23
or do they work harder, or do they split their publications up into more little pieces? So one number that would be good to know is how much a researcher actually publishes per year. If you Google that, you get one answer: two articles per year is the average, and then comes a qualifier which makes this kind of useless.
42:46
Research-active staff are expected to publish two articles per year, so you don't know when someone counts as research-active, and I tried to do my own investigation on this. I went into the Web of Science and pulled all the articles that were published between 2009 and 2016 which had an ORCID iD.
43:11
Of course, researcher identification is still in its early days, but in this data set coverage is actually quite good already.
43:21
Over 20% of the articles in the Web of Science from this period do have somebody with an ORCID iD assigned to them. Then I grouped this into two periods, the first four years and the second four years, and I looked at the average articles published between 2013 and 2016 only and got these numbers.
43:49
In total, the median for the authors who have published throughout the eight-year period
44:06
is a little over one article per year. For those who have published their first article in 2013 or later, it's less than one article in two years. The output is not quite as much as was suggested by the researcher who was quoted in Google.
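As a rough sketch of the calculation being described, assuming an extract of (ORCID iD, publication year) rows pulled from the Web of Science for 2009-2016 (the column names and toy data below are made up for illustration), the per-author output could be computed along these lines with pandas:

```python
import pandas as pd

# Hypothetical extract: one row per (ORCID iD, article), 2009-2016
df = pd.DataFrame({
    "orcid": ["A", "A", "A", "B", "B", "C"],
    "year":  [2009, 2013, 2015, 2014, 2016, 2010],
})

early  = df[df["year"].between(2009, 2012)].groupby("orcid").size()  # output in the first four years
recent = df[df["year"].between(2013, 2016)].groupby("orcid").size()  # output in the second four years

# Authors appearing in both halves vs. those whose first article is 2013 or later
in_both   = recent[recent.index.isin(early.index)]
newcomers = recent[~recent.index.isin(early.index)]

years = 4  # length of the 2013-2016 window
print("median articles/year, authors publishing in both periods:", (in_both / years).median())
print("median articles/year, authors starting in 2013 or later:", (newcomers / years).median())
```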
44:28
Here you wonder, this is Web of Science articles. Did they publish in journals which are not in Web of Science? Did they publish not in journals?
44:41
Or did they publish completely different outputs? Or did they not publish at all? Did their article get rejected, et cetera? Now, when you want to analyze this and extrapolate to corporate researchers, you immediately notice that maybe this is the wrong approach,
45:04
because corporate researchers very infrequently publish classical articles; there are a lot of other output types, and I wouldn't know how to weight these against each other, how many reports would equal one article, what the ratio would be.
45:24
So there is too much ambiguity in there. Some of these output types will really take more work than an article; if you want to put a new drug on the market, it's probably a lot more work. I actually often do the last two pieces here myself.
45:43
Somebody asks me a question: we want to publish a new journal in this and this area, how much usage do you expect? I send back a number, and as long as my prediction is correct, I don't even need a theory, and I can use all the shorthand that's used in our organization.
46:03
The point is that the output that is produced doesn't need to be self-explanatory, and it needs to be valid only in the context of the question that is currently relevant.
46:21
So actually trying to find out how many pieces of information, pieces of output, exist worldwide, apart from the fact that the corporates wouldn't give us numbers, is kind of hopeless to achieve. But we can still assume that the researchers in corporate research are just as productive as those in academia,
46:46
and we can try to find out how many researchers exist and estimate from there. And actually there, I was surprised, there are some quite good data. The best data source I found was the UNESCO Institute for Statistics.
47:01
I also tried Eurostat and the World Bank, and they all share data, and I didn't run into any contradictions. And this one is the most up-to-date and most complete. They have researcher counts for 148 countries as head counts, and for 130 as FTEs,
47:25
and what they request is really the FTE, the full-time equivalent, that is, to actually calculate the time that somebody has to spend on research rather than on other jobs.
47:43
And for over 100 countries they even have details by sector. Often these come in percentages, and the researcher counts sometimes come as a number per million inhabitants, but it is easy to do the arithmetic and calculate the actual numbers.
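The arithmetic meant here is straightforward; a small sketch of converting a "researchers per million inhabitants" rate plus sector percentages into absolute head counts might look like this (all figures below are placeholders, not real UIS values):

```python
def researchers_from_rate(per_million: float, population: int) -> float:
    """Absolute researcher count from a 'per million inhabitants' rate."""
    return per_million * population / 1_000_000

def sector_counts(total: float, sector_shares: dict) -> dict:
    """Split a total head count by sector percentages (shares sum to roughly 100)."""
    return {sector: total * share / 100 for sector, share in sector_shares.items()}

# Placeholder figures for one country -- not real UNESCO data
total = researchers_from_rate(per_million=4500, population=82_000_000)
print(round(total))  # 369000 researchers

print(sector_counts(total, {"industry": 60, "higher_ed": 25, "government": 15}))
# {'industry': 221400.0, 'higher_ed': 92250.0, 'government': 55350.0}
```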
48:01
And yes, you say we have maybe over 200 countries, but the ones in the sample here represent countries with 97% of the world population, so that's quite good. But there are some exceptions, and all these countries have more than 20 million inhabitants,
48:21
and unfortunately we know that research is being done there and it's not really safe to ignore. This is what the distribution by sector looks like, and for me it was a surprise that in most of the bigger countries, the majority of researchers work in industry. And again I've been asked, are they really researchers or do they just run a production plant or something?
48:46
The table is called "researchers"; it's not called "scientists and engineers in industry", which is a different table. So if all the countries filled out the forms correctly, these should actually be researchers.
49:03
And yes, the sector called government here I don't like too much, because it includes agencies like environmental agencies, which publish quite a lot, but also the military, which usually does not publish much. The exceptions, quite interestingly, where more researchers work in what is called here higher ed, I would say academia, are the
49:30
United Kingdom, India, Brazil, Australia, Malaysia, and many of the smaller countries. But in total we still end up with more than half of the researchers working in industry, so that was for me quite a surprise.
49:49
On the other hand, the number of researchers who work in an environment where publishing is standard practice is still large enough that I would say it makes sense to analyze them and still get some sort of meaningful results.
50:02
So in total, in the countries where we have data, I arrive at an estimate of about 9 million people. And how many would there be in total? Somewhere around 11 million is my semi-educated guess, but don't quote me on it.
50:25
So corporate researchers, what do we know about them? They tend not to take part in surveys. But with some common knowledge and with some experience from my own environment, maybe we can fill the gaps. So most definitely, the research and development department is an investment
50:44
and it must bring some return for the company, otherwise the finance department will try to get it shut down. They need to minimize the time and effort needed to produce results, and many companies invest in expensive discovery tools which are often hosted locally so that the competition cannot hack them,
51:05
so that the researchers get their information as quickly as possible and as selectively as possible. In smaller companies, just the opposite, they may not even have a library, the researcher is all by himself and may decide that it's faster and more efficient for him to start from scratch and reinvent the wheel.
51:23
When we talk here about the connected, what was it, the value of connected systems, so these I would say are disconnected and are a failure for the current system. So is there a chance that we can perhaps bring them in?
51:42
Everything must be really, really reliable. If you find some information on the internet, incorporate it in your development, and things go wrong, it can be very expensive. So information sources which have some sort of stamp of approval are very much preferred to those where the researcher has all the responsibility and has to do all the validation himself, checking whether the data that he is using
52:04
are correct and apply to his specific research question. And of course the research must somehow bring a competitive advantage for the company, either better products or more sales, better reputation, whatever,
52:20
and that of course is the reason why the research is often kept secret. So the researchers work under non-disclosure conditions. They can share internally, but they will not get anyone independent or external to review their work. Their boss may say, well done, the product sells, but there is always the doubt that it could have been done better.
52:43
And they also have no insight into their competitors' research. And coming back to the diesel example, when VW was found out, all the other car companies were asked, did you know about this? Were you involved? They all said no, no, no. But meanwhile Fiat has been found out, Mitsubishi has been found out, Daimler is being investigated.
53:04
So either they lied or they all did their own independent exhaust-manipulation research. And that, I find, is really a serious waste of resources, and something that as a consumer I could go on a crusade against.
53:21
So my other question is: these conditions appear to me as if the quality of the output might suffer from them. And I wonder, are the researchers happy with that? Especially those who have had social media throughout their academic career so far.
53:41
Facebook and Twitter are now both over 10 years old, and there are researchers now who have used them throughout their university life. They are networked up to here, they have blogged and shared and discussed, and now suddenly they are in a company. That is my vision of them.
54:00
And they struggle with a problem in their work. And don't they think: if I could only reach out to my friends on the network, I would get 10 solutions, I could select the best, and my work would be better than if I have to struggle here all by myself? The bigger companies, of course, may have enough researchers with whom they can discuss,
54:23
and there will also be boards who will approve research proposals, et cetera. But in total, I think those researchers who are used to social media may start to question these very strict confidentiality and non-disclosure rules.
54:49
So we have talked so far about the content, what about the infrastructure, all the databases and indexing systems, discovery tools, et cetera.
55:01
Angela Merkel was laughed at when she said this. She was asking Google and Facebook to reveal their algorithms. Well, I think Google, Facebook and so on are the big players in the information industry,
55:21
and as long as we allow them to send us content filtered based on who knows what, we cannot really expect companies like Mendeley or Altmetric.com to reveal their algorithms either. In fact, for any of the companies here in the information industry, if we publish anything that's really good and useful,
55:49
the most likely outcome is that Google Scholar will pick it up and rebuild it, and this may not be really good for business. But I still think that there is a chance that corporate research can become more transparent and visible,
56:09
and it would be actually a benefit for the companies. Definitely I think through sharing the quality of the output could be improved,
56:21
and redundancy could be reduced, and it would save the companies time and money. So this is a benefit not only for the companies but also for society. All this duplicate research is something I think should be campaigned against. The researchers should also be encouraged to publish more.
56:45
A good research article published in a decent peer-reviewed journal is a pretty cheap marketing instrument. If you have the patent, you can then publish, claim the topic in this area, and scare the competition away.
57:02
People who work as reviewers and editorial board members have the advantage that they see all the latest research before it is published, and they even see the research that is not published, which might also give them some ideas. We have some journals where really all the big players in a specific industry are on the board
57:28
just because they want to know what is being researched at the moment. Finally, the big corporations rely on a very good research information infrastructure.
57:45
They are willing to pay a lot for the big discovery tools, but I have also seen cases where they sponsor pieces of information infrastructure that is then free for the rest of society.
58:01
For instance, lens.org has as one of its sponsors Qualcomm, which is a mobile technology and semiconductor company. So maybe there will be more examples like that. These are my thoughts about this. My ambition here was to provide some context about the environment in which we normally move around.
58:33
I hope you found it useful. Thank you. Any questions?
58:45
You are all so overwhelmed by this whole new category of researchers that none of us thought about for two days. OK, thanks very much. OK, I think we are there. We are just going to have a wrap-up session from Mike who will share some concluding thoughts.
59:07
Stacey has an announcement. Now or later? Stacey has an announcement coming up. Now if she is around, but if she is not around, she is coming, she is coming. So a big thank you to all of our panelists.
59:30
Go Stacey. Go there and I will see if I can find my deck. So this is just a quick advertisement for tomorrow's Hack Day.
59:42
We are going to be getting together a pretty large group of us this year, which I am really excited about. Joe and I are coordinating this. And we are really trying to focus less on the stereotypical hacking, although that will definitely be taking place. We are calling tomorrow's Hack Day a do-a-thon this year,
01:00:00
because we'd like to encourage all of us who are educators and activists and advocates, as well as the hackers, the stereotypical hackers who code, to join us for tomorrow's get together. So we are going to be meeting at the Ryerson University Student Learning Center, which is where the Altmetrics 17 Conference was on Tuesday, we'll be on the fifth floor.
01:00:22
Please join us, the fun kicks off at 9.30 a.m. Thanks guys. Okay. Awesome, and I am just trying to find where my slides are, because I really helpfully tweeted, there we go, okay, so it has finally made its way over to the podium. So I did a quick presentation, this is what I've been doing.
01:00:42
No, you can go now, because I've changed my mind about what I'm gonna do, and I've changed my mind partly because I can. And I also changed my mind because of the timing. So my job at this final event is to give a few thoughts,
01:01:04
a few impressions that would ideally have come from the audience, from you. So we had three different ways of getting in touch with me and making some suggestions: what are we doing, what are we missing, what is altmetrics missing?
01:01:20
So the methodology was: you could tweet me at ask4am, you could post messages on Padlet, or alternatively you could just kind of let me hang around and absorb ideas, and I spent most of the event in that potted plant there, listening to what you've got to say, and I'm gonna draw on these three
01:01:43
pieces of data to come to some kind of conclusions. So the really exciting thing is that there were quite a number of tweets on the ask4am tag. But sadly, quite a lot of those were not actually to do with this conference, they were more to do with some kind of booty call thing that was going on.
01:02:02
But nevertheless, we did have some, yeah. We kind of asked for it, really, yeah, okay. So anyway, nevertheless, there were some tweets that came there, and I'm gonna reflect on what those tweets had to say. And on the Padlet app, which I didn't know if anybody was gonna use, but we ended up having two notes.
01:02:20
And one of the questions was, should altmetric information be archived, and the second one was, what does altmetrics actually mean? And I am gonna use these as kind of where I'm going with this conversation. That first question about whether altmetrics should be archived or not is a really great question, because it is symptomatic of where we are with altmetrics. Five, six, seven years ago,
01:02:42
when we started talking about alternative metrics, it was very much a new idea. We were just seeing whether there was any signal out there. And over the course of these years, we've been building businesses on it, we've been building strategies on it. And if we are going to be using alternative metrics and altmetrics as a way of influencing decisions
01:03:03
and measuring the effect of societal impact, then we need to be aware that we are having to build something, and if we build something, we need to build those foundations. The underlying question suggests, what are we doing with audited stability? What are we doing with verifiability?
01:03:20
What is the robustness, what is the integrity of the systems that we're building, if we are to build something on top of alternative metrics? It's a really important question. In some ways, it sounds like a dull engineering question, but it goes to the heart of some ethical and business models. So one of my solutions is, because I'm not gonna answer this,
01:03:41
I was gonna get three folk up here to talk about it, but given the time, I am not gonna do that now. What I am gonna do, I'm gonna ask Jean Liu, and I'm gonna ask Andrea of Plum, Jean of Altmetric, and I'm gonna ask Joe to write a blog post about their systems, about how they do the backups and how they manage the integrity of the systems.
01:04:00
So keep your eye on that. We'll post something in a couple of weeks' time about that on the Altmetric blog. But the next question that came up is what altmetrics means, and I think that really dives into the broader picture of where we are. And it's a really great question because it is impossible for us to answer.
01:04:20
If I was to give you a Venn diagram of all the fields and all the people and all the stakeholders who need to be involved in understanding what Altmetric means, we would be this tiny little dot in the middle of it, and we'd be surrounded by funders and philosophers and sociologists and natural language processing folk. We, as an organization, can't answer that question.
01:04:43
But fortunately, we know that there are people who do. It is the rest of the world. This is a classic interdisciplinary problem. If we are to answer the question, what does Altmetrics mean, we have to go outside of the field that we are. We have to seek relationships with other researchers. We have to expose our data. We have to seek interdisciplinary work.
01:05:01
As Antoni was saying at the very beginning of this, we need to be very aware of the work that people are doing. The issues that we are confronting, that we need to answer, are already under investigation by people who are investigating network graphs. You know, there is a literature going back to the 60s in this field.
01:05:20
This is not a new thing. And Antoni was absolutely right to key into this. So our work, what we're missing are those connections with the other fields. We need to be open with our data. We need to be open with our publications. We need to be open to other influences and other intellectual stimulation. But, okay, so there was all this stuff buzzing around in my head
01:05:43
because there's always stuff buzzing around in my head. And one of the things that I was really struck with was the presentation on some of the feminist issues in Altmetrics, the issues of privacy and also the importance of social platforms
01:06:01
in terms of promoting and developing one's own career. Now, whether it's Instagram or YouTube, these tools offer a really exciting way for young researchers to communicate scientific research and to connect with the generations of people to come.
01:06:21
But there is somewhat of a chicken and egg situation. In Altmetrics, we've built this pipeline that enables us to capture this data, commit this data, transport this data and analyse it. But unless people are taking that activity seriously and rewarding people for doing that activity, then we're never going to get into a situation
01:06:41
where it's recognised as being a way of communicating. So, in some ways, we've got this, like I said, it's a chicken and egg situation. And some of the work that we have as ambassadors for alternative metrics is to go out into the wider world and, as I said two years ago, use it like Americans use cheese.
01:07:03
Pour it. Use it. Talk about Altmetrics. Encourage people to exchange data. However, there are all these broad issues. And so I'm going to add in this truly shameless plug right now because we do have this issue about how we use metrics
01:07:21
and whether we're measuring the right kinds of things and what we're measuring, because there is a danger in metrics of thinking that we are measuring something which is an objective truth, a ground-zero truth, in the same way that we might be measuring the velocity of a mass. The reality is that we are measuring something
01:07:41
which is an action which is undertaken by reflective human beings who are capable of observing the consequences of their behaviour and shifting their behaviour in light of the consequences. So when we think about what we're doing here, we need to be very aware that there is this reflective quality about it.
01:08:03
It's not an objective ground truth. In two weeks' time, in Baltimore, there is going to be a workshop conference called Transforming Research. It is very much about reflecting on some of these interesting issues around how we use metrics
01:08:22
in terms of understanding how effective our policies are, how we change those policies, how we go about doing research evaluation. All of you here would be interested in the matters that we have been talking about at this conference, and I very much hope to see some of you there. So finally, I would like to say thank you very much to Bioscientifica.
01:08:44
This is the first time that we've had a professional conferencing organisation, and Martha and her team have been very busy here. Usually at 1 a.m., 2 a.m., 3 a.m., Kat, myself and others run around trying to figure out where the light bulbs are
01:09:02
and mopping things up and so on. In fact, we've been able to enjoy the conference, probably for the first time, in a really important way. Thank you very much to Bioscientifica for keeping us on the path. We were planning on making an announcement about the location for the Altmetric Conference next year.
01:09:22
We are not quite there. We are very close to it. It's tantalisingly close, but I'm not even going to put up a slide suggesting where we're going to be. There will be an announcement in due course over the next month, and in the meantime, let's carry on talking. Thank you.