EXPLORE: the need for an open classification system
Formal Metadata
Title | EXPLORE: the need for an open classification system
Series title |
Number of parts | 5
Author |
License | CC Attribution 3.0 Unported: You may use, adapt, and reproduce, distribute, and make the work or its content publicly available in unchanged or adapted form for any legal purpose, provided you credit the author/rights holder in the manner specified by them.
Identifiers | 10.5446/69880 (DOI)
Publisher |
Year of publication |
Language |
Content Metadata
Subject area |
Genre |
Abstract |
Transcript: English (automatically generated)
00:00
For those of you that don't know me, my name is Matt Buys. I'm the Executive Director at DataCite. Really pleased to be hosting the second webinar with the various partners in Make Data Count. This is the second of three, as I mentioned, and really important for us at DataCite in the
00:21
work that we're doing with the community and various partners, many of them you'll hear from today, and really our approach to building responsible, meaningful research data assessment as an open community. With that, I will hand over for a brief introduction,
00:43
and then we'll go into the panel. So over to Daniella. Thanks, Matt. Great. Well, hello, everyone. It's fabulous to have you all here. For those
01:03
who I don't know, my name is Daniella Lowenberg, and I'm based at California Digital Library at the University of California. In one of my roles, I run the Make Data Count initiative, and I'm thrilled to have you all here. The amount of people who have signed up for this is a great testament for how important this topic is, and so I want to give a very brief background of Make
01:23
Data Count before passing it over. Some very quick housekeeping as well. Please use the Q&A function to ask questions. We're going to try and get through as many as we can or chat them. Feel free to use the chat if you want to introduce yourself or comment, but also, as we've seen through all of our Zoom time over the last years, please be mindful of
01:43
speakers. The chat can be really distracting, and please tweet and follow up if questions aren't answered or if you want to promote it or have more community discussion, so handles are here. So, if this is your first time with Make Data Count or your 40th, Make Data Count is a
02:01
scholarly change initiative, and we're focused on the development of open research data metrics. We do a few things. So, we build and we advocate, and as such, we're made up of infrastructure organizations like DataCite, Crossref, and California Digital Library. We also contextualize and have a team of bibliometricians studying researcher behavior and data reuse,
02:22
and then our values, as you can see in the teal, are that we're rooted in open, transparent, and responsible approaches to research data metrics, which is at our core. So, to clarify again, as this can be a little confusing sometimes, Make Data Count is an overarching initiative that many organizations have plugged into. There are standards that we
02:43
and many of you have developed, like Scholix and the COUNTER Code of Practice for Research Data, that we build and use as frameworks in the initiative for open data usage and data citation infrastructure, and we see data metrics as a journey. So, standards have been set through work done in groups like RDA, FORCE11, ESIP, and others, and we have standards and
03:05
interest in these topics, so we have that community best practice, and right now we're at a pivotal moment where it's required that we focus on broad adoption of standards and on this open infrastructure, and bibliometricians are very eager to work on step three, which is contextualizing, and that's actually why we're having this webinar today,
03:23
so we're going to get into a lot of that, but the goal is for us to get to a point of understanding the reach and impact of research data across the disciplines and have responsible assessment and reward metrics for data, and so through our work over the last years, we've found and exposed a few key points that we think are hurdles the community needs to get
03:44
over for us to get to this point that we're all trying to get to, and that's what this webinar series is based on, and so last webinar, and if you missed it, check it out on YouTube, and I think someone from DataCite can put a link in chat. We focused on advanced ways to forge and use data citations non-traditionally, and that was kind of hump one that we've
04:05
identified. The second is what we're discussing today, and I am thrilled to pass on the mic to another Make Data Count PI, Stephanie Haustein, and the rest of the distinguished speakers today who are going to go over this topic, and if you're questioning how is the need for
04:20
an open classification system related to open data metrics, that's what Stephanie is here to talk about and really expose for us, so Stephanie, over to you. Thank you very much, Daniella and Matt, for the introduction and kicking us off, so I'm just going to share my slides here, so I'm here today to give you a little bit of an intro and start us off
04:44
with this amazing panel. I'm super excited to talk to these people today, and I'm sure that you also have lots of questions, so just as an overview, Daniella just said this, like, why are we talking about open classification when we're coming from a data context, right? So
05:00
within the MakeDataCount initiative, I have been the PI of the Meaningful Data Counts Research Project over the last two years that is funded by the Alfred P. Sloan Foundation, and we thought that, you know, the goal of this research was actually to generate evidence, mixed methods, so quantitative as well as qualitative, on data sharing, reuse, and citation,
05:24
with the, you know, final goal of developing evidence-based indicators for the MakeDataCount dashboard, and this all comes from the perspective that we think research data should be lifted to, like, a first-class scholarly output, and that metrics can help support researchers to show,
05:44
you know, that data is being reused and has been cited, and you can see here, two years ago, we started with this research question, and I kind of highlighted how important scholarly discipline is for us because we already knew from really seminal work by Borgman that, you know,
06:05
different disciplines define differently what data is. They also use data differently, so we were really set from the beginning that if we want to develop any kind of metrics that make sense, we need to incorporate discipline and differentiate between scholarly disciplines and
06:22
ideally also normalize indicators, so that's kind of where we came from. This is why this was so important to us, and then we started our research using, you know, DataCite as our database and kind of then quickly realized, well, we have a problem here because less than six percent of all the data sets, the 11 million data sets in DataCite, actually have any kind of
06:45
information about discipline. So, if you see here, so this is like from last week, I think, 11 million data sets in DataCite, 5.7 percent have discipline information, and we're using OECD here as, you know, one very general classification system with six larger classes, and you see in
07:05
this tree map how they're distributed, so, you know, we see lots of natural sciences, medical and health research, and some social sciences, and then, you know, even less for the other disciplines, and we also see that formally within the reference list, only 0.9 percent of these data sets have a citation, and even worse,
07:26
only 0.02 percent have a discipline and a citation, so we have a bit of a problem if we want to create metrics about data citations that are field normalized, which is kind of what got us here today, so we're thinking that, yes, we definitely need classification and field information about
07:45
data sets, but actually we need to solve this in the larger context of scholarly outputs in general, and, you know, if we look at the traditional scholarly publications, there are different solutions to this problem, so classification systems and field information usually
08:03
differs by what kind of output type we cover, so obviously, you know, coming from a library world, cataloging monographs, it's really important to have a discipline, and this is where this all comes from, but often in the scholarly communications world, journals and proceedings
08:21
are really important, but also patents, and another type of scope is usually the discipline, so we have universal classification systems, so we all know Dewey, we have the Library of Congress Classification, we also have, as I just showed you, the really general and broad OECD fields of science,
08:41
which basically narrows all academic or scholarly output down to six top hierarchy levels, we also have the Australian and New Zealand Standard Research Classification, we have something like Elsevier's All Science Journal Classification, in Montreal we use the National
09:02
Science Foundation journal level classification, Leiden has its own, and now also very exciting, OpenAlex has its own classification system, so these are usually universal because they're made for databases that cover all fields of science, and that's usually what we're looking for, that's also what we would be looking for to classify data sets, but obviously there are
09:23
lots of specialized solutions, so there are so many, I can't even list them all, but just as an example, in medicine we have Medical Subject Headings, or in economics we have the Journal of Economic Literature classification for example, that's really important in that field, then not only
09:41
scope matters, but also level of granularity, so obviously the question is like what are we actually classifying, and lots of the more traditional ones are what I'm calling kind of the collection level or the venue of where something is published that's classified, so you know most of the traditional ones look at what field is the journal in, and that's where we put all of
10:03
the articles published in the journal in the same field, of course also conferences, or also entire repositories, right, we know that in the pre-print world, and also for data it's really, there are lots of like subject specific repositories, so we can classify the entire repository into a
10:21
discipline, but more and more important I think nowadays is the item level classification, and it's obviously also more accurate and solves issues like multidisciplinary journals, particularly in, you know, open access mega journals like PLOS ONE, it doesn't really help to classify the journal, we need something on the smaller more granular level, so item levels, so
10:44
classifying articles, but also book chapters or monographs for example, and then if we look at you know how do we actually classify, the two broad strokes here are kind of like well either we already have an existing system, so we just assign our items into the system,
11:03
so it's kind of top down usually hierarchical and saying okay this thing is biological research, but very interesting also more and more they're derived classification system, so this is the bottom-up approach where we're actually looking at all output and then we're starting to cluster based on a similarity to say these are similar, so they should be classified together, so that's
11:25
usually the derived or bottom-up approach, and then the question is, well, who actually does the indexing, right, and traditionally obviously this has been intellectual, the first person to say what their output is about is usually the author, and you know we often
11:41
most of the automatic ways also like use that kind of metadata, the title of an article of a book and so on that usually describes pretty well what the content is about, and it's the author's perspective, but obviously we also have professional indexers that do that task,
12:02
particularly for higher quality indexing, and we also have the user, so more and more with the social web we have the perspective of people tagging content and thinking or showing what they think the document is about, and we have automatic approaches which you know make a lot of sense with
12:22
the mass of information that we're dealing with, so rules-based algorithms or also more recently machine learning, so this is just to show you know I'm a library and information science professor, so this is the very basic stuff of like how do we do intellectual approaches and determining the subject and aboutness is not always straightforward, so it often also represents
12:42
who says what something is about, so there are these different perspectives and you see the kind of metadata that it's reflected in, the indexer usually using controlled vocabulary and the author and especially the reader using just yeah basically free text.
13:01
For automatic approaches we have three main approaches and the classic one that I think is most commonly used is what's called document categorization, so it's basically string matching any form of text, usually the metadata of an item, the title, the abstract, sometimes the full text, from the documents to assign them to a controlled vocabulary, so this is a really classic you know
13:25
rules-based approach of saying: if the title contains cancer, this is medical research, right?
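To make the rules-based approach concrete, here is a minimal Python sketch of document categorization by keyword matching; the class labels and keyword lists are invented for illustration and do not come from any system mentioned in the talk.

```python
# Illustrative sketch of rules-based document categorization: match keywords in a
# record's title/abstract against a small controlled vocabulary of classes.
# Class labels and keyword lists are invented for illustration only.

RULES = {
    "Medical and health sciences": ["cancer", "clinical", "patient", "epidemiology"],
    "Natural sciences": ["genome", "quantum", "climate", "protein"],
    "Social sciences": ["survey", "policy", "education", "socioeconomic"],
}

def categorize(record):
    """Return every class whose keywords appear in the record's title or abstract."""
    text = f"{record.get('title', '')} {record.get('abstract', '')}".lower()
    matches = [label for label, keywords in RULES.items()
               if any(kw in text for kw in keywords)]
    return matches or ["Unclassified"]

if __name__ == "__main__":
    example = {"title": "A survey of cancer patients' access to clinical trial data"}
    print(categorize(example))  # ['Medical and health sciences', 'Social sciences']
```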
13:45
Another approach is document clustering, where basically, based on different similarity measures, and this could be citation relationships but it could be co-occurrence of terms, we generate classes and the classification system itself automatically by grouping things that are similar. The really big problem here is the labeling: we're really good at determining that things are similar, but we can't really say what they're about. The problem is also heterogeneity, that some clusters might be huge and others might just be smaller.
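By way of contrast, a minimal sketch of the derived, bottom-up idea: cluster records purely by similarity and let the groups themselves become the classes, with the labeling problem left unresolved. The titles and the choice of TF-IDF plus k-means are illustrative assumptions, not the method of any system discussed here.

```python
# Small sketch of a derived, bottom-up classification: cluster records by text
# similarity and treat each cluster as a class. Titles are invented; note that the
# resulting clusters come out unnamed, which is the labeling problem described above.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

titles = [
    "Deep learning for protein structure prediction",
    "Neural networks for protein folding",
    "Household survey of income inequality",
    "Longitudinal survey of household income",
]

X = TfidfVectorizer().fit_transform(titles)
labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(X)

for title, cluster in zip(titles, labels):
    print(cluster, title)  # two unnamed clusters; deciding what they are "about" is still manual
```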
14:05
And the instability: so with a bottom-up system, whenever you add a new item, your whole system might change, and especially with, you know, publication output over the years, that means every year there's more input and your whole classification system might change. And the third
14:23
one, which is called text categorization, and this is supervised machine learning, is basically saying we're learning the characteristics of the items in the classes from a training set and trying to apply it to, you know, the rest that we want to classify. The problem here is especially
14:42
when we talk about hierarchical classification systems, that the training set is often not big enough, and it's inadequate and especially sparse if you go down the hierarchies of the classification system.
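A minimal sketch of this third, supervised approach, assuming scikit-learn and a tiny invented training set; a real system would need far more labelled data, especially for deep hierarchies.

```python
# Minimal sketch of supervised text categorization: learn class characteristics from
# a labelled training set of titles and apply the model to unlabelled records.
# Training examples and labels are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_titles = [
    "Randomized trial of a new chemotherapy protocol",
    "Long-term survey of rural household income",
    "Measurement of neutrino oscillation parameters",
    "Hospital readmission rates after cardiac surgery",
    "Effects of minimum wage policy on employment",
    "Spectroscopy of exoplanet atmospheres",
]
train_labels = [
    "Medical and health sciences", "Social sciences", "Natural sciences",
    "Medical and health sciences", "Social sciences", "Natural sciences",
]

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_titles, train_labels)

# Classify a previously unseen record by its title.
print(model.predict(["Clinical trial of cardiac surgery outcomes"]))
```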
15:05
So, just to think about what we're doing today, exploring open classification: the really obvious application is information retrieval, and it could be for outputs, it could be for collections, it could be for people, you know, finding the right reviewer. Obviously where we're coming from is bibliometrics, so, you know, just determining what a field is, is hard; benchmarking and developing field-normalized indicators is really essential to know more and put something into context, obviously this is
15:26
also really helpful for improving existing metadata for example having a semi-automated system where you let the user for example choose from a classification system and so from my perspective what I would love to discuss today is should we even do this, should we even move forward and
15:43
think about an approach to build something open that is applicable to any scholarly output, should we mirror excellent initiatives like the Initiative for Open Citations with something like an initiative for open subjects, should we build something like ROR, like a research subject registry, and most importantly also how could we make the system really inclusive, we obviously
16:05
don't want to create another DDC where you know we have a very narrow world view of a white man in North America 200 years ago, we also want to be inclusive of any types of scholarly outputs of any discipline and also of any language and how do we actually do this to
16:24
satisfy so many community needs, one size does definitely not fit all and maybe the solution is not one system but a range to offer both broad top-down and granular bottom-up indexes, so this is kind of my little intro and I'm really looking forward to the discussions. Great thanks
16:45
so much Stephanie and it's been great I think working with you and your team from the DataCite perspective understanding you know working with other open infrastructure partners about what we can do with the community and how we make this useful for the various implications
17:03
around research assessment and so I guess continuing on that theme we'll next hear from Ludo Waltman from Leiden University to also talk a bit more about the bibliometrics and research assessment perspective so over to you Ludo. Thank you Matt, well I want to start
17:24
actually by congratulating Stephanie and the team on this really fascinating project that they are working on, sounds all very exciting but I hope to do is in just a few minutes to make a few remarks based on my own experiences with classification, my own experiences and the
17:44
experiences more broadly of our team here at the Centre for Science and Technology Studies at Leiden University. So we have I think it's fair to say we have quite a lot of experience with classification at my centre over the past few decades we have actually done a lot of
18:02
scientific projects in which we have in one way or another made use of classifications, research projects, also consultancy projects, all kinds of projects, so a lot of experience and something I did myself for instance is I made an algorithm 10 years ago an algorithm for bottom-up classification, classifying research outputs in particular articles based on citation
18:25
links. I was actually surprised to see how popular that algorithm got over the past decade, it got a lot of attention to my surprise, but we also struggle and that's
18:41
actually what I want to emphasize, so we have done this in Leiden for quite a long time now, this classification in many different ways, a bit along the lines that Stephanie just explained, but even with all this experience and all these projects that we have done I feel that we are still struggling and we still have not really found the right way of dealing with the challenge
19:01
of classification, so what I see is that classification is of course an attempt to reduce a very complex and messy world to something that seems to have a clear and easy to understand structure, but of course it's also something that is highly reductionistic, something that
19:22
in some sense doesn't really do justice to the complexity of the real world, also something that requires painful compromises to be made between different objectives that the classification typically needs to satisfy, so in that sense classification is a struggle and it is still a struggle after having done all these projects
19:43
and after having built all that experience. Quite recently actually a PhD student of mine, Philip Purnell, he looked at for instance SDG classification, so classifications of research outputs in terms of the sustainable development goals and he showed again something
20:02
which I think confirms the challenges around classification, he showed the really low level of agreement between all kinds of different classification approaches for the sustainable development goals which is quite problematic given the importance that the SDGs have nowadays,
20:20
so it's a struggle, and how to move on, that's of course the big question then, and I want to make a kind of suggestion, a suggestion that I think aligns with the topic of today's event, so it seems to me, and that's really what I learned over the past decade, it seems to me that classifications need to meet two key conditions, I call them
20:41
transparency and democracy, so transparency of a classification means in my terminology that the classification is open to inspection, so basically anyone should be able to explore the complex and typically also the messy data that underlies a particular classification, so anyone should be in the position to make their own assessment of the
21:05
suitability of a classification for a particular purpose, so that's essential, it's kind of unacceptable that we just have classification and we don't really understand what's underlying that classification, how it has been made or how particular data has been classified, that's
21:21
not acceptable. Second, democracy, so that means that the classification or that basically anyone should be able to come up with alternative classifications, so if you have a given classification anyone should be in the position to question that classification and also to actually try to come up with alternative competing classifications, enabling us as a community to
21:42
assess the robustness of conclusions that we draw, robustness with respect to the choice of a classification approach and also allowing all of us to debate the merits of different classification approaches, which I think is essential, so transparency and democracy are the two conditions that I feel classification should satisfy and that of course means that we
22:05
need to have classifications that are open and that's why I'm really pleased to see the topic of today's discussion, so I'm definitely in full support of open classification, thank you and I'm really looking forward to the discussion.
22:22
Great, thanks very much Ludo and really also some interesting and key points there around some of the I guess principles of establishing open classification systems that are really important for us to keep in mind, so moving on to the next panelist, we're going to hear from Kristi Holmes from Northwestern to talk a bit more from an ontology perspective, so over to
22:48
you Kristi. Okay great, thank you very much, I want to thank the organizers for making this webinar series possible and for creating the opportunity to talk about open classification systems, so I'm going to bring a bit of a biomedical perspective to our conversation
23:03
to highlight the importance of classification systems and as many of you might know, I have the great pleasure of both leading evaluation and continuous improvement for the Northwestern University Clinical and Translational Sciences Institute, it's a bit of a mouthful,
23:20
but we call it NUCATS, and I also am the director of Galter Health Sciences Library at Feinberg School of Medicine. The comments that I'm going to share today, I want to first give a shout out to Karen Gutzman at Galter Library and also to the NUCATS evaluation team who lead our efforts in this area and I'll be talking about topics
23:42
that reflect conversations that we've had on our team, so our team is interested in the translation of discoveries into improved human health, but to understand this process more fully, an understanding of knowledge translation itself, and also tracking how ideas move from basic science to clinical research to clinical implementation all the way to improved health and
24:05
care of our communities, is required. Data sets play a huge role in this space; in biomedical research, like many other disciplines, we see an increasing requirement for multidisciplinary teams, and with that multidisciplinary team come different perspectives, different activities by
24:24
team members and different outputs including data as well as different ways of describing and classifying that research, for our team as we're thinking about understanding knowledge translation and research impact, it's important for us to be able to add context to this space
24:42
as well as to the different outputs like data so that we can understand knowledge translation and that knowledge translation activity more carefully, this also gives us an opportunity to more carefully understand and appreciate the role that data play in translating discoveries as well as to that multi-disciplinary team and how they help to charge forward in making
25:05
meaningful outcomes in health for society, so with that I want to pass the microphone back to Matt and look forward to the rest of the conversation.
25:20
Thanks so much Kristi, we are going to pass it over to Jason, take it away. Thank you, yes I was asked to do two to four minutes, I'm going to do my best to squish a lot
25:40
of stuff into that time, so my name is Jason, I'm from a non-profit called Our Research and we recently built a tool called OpenAlex, it's a comprehensive and open directory of all scholarly products as well as authors, journals, institutions, and concepts, and it's concepts we're talking about today. I wanted to thank Stephanie in particular for really breaking down a lot of the concepts
26:05
behind concepts, I want to just hit a couple of those that were important to us, we really want to emphasize that we think this should be open, I love Ludo's point that this should be something democratic, I think it really can be democratic as long as none of it is hidden, so we think that extends to every part of the process, so the methods and the algorithm
26:25
for how the concepts were assigned, so that's something we've done, so I should mention we've done this, we've created a big list of concepts, we've got about 200 million articles, and in those 200 million articles we've assigned about 65,000 concepts, each article has like maybe five to ten concepts, and so I guess instead of saying what I think
26:47
I'll just say what we did to save a little time, so our methods and our algorithm are all open, so you can run our algorithm and do your own concept tagging on our corpus or on some other corpus, the data set itself is open so you can download the concepts, you can also download
27:04
the articles with the concept tags associated with them, there's an API that you can use to query that, there's no registration or anything, that's free, and then in a couple months there'll be a UI that you can use to click around and sort of explore that.
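For reference, a small sketch of querying the OpenAlex API for one work's concept tags; the DOI is just an example, and the exact response fields reflect the API as described around the time of this webinar, so treat the field names as an assumption.

```python
# Minimal sketch of pulling concept tags for one work from the OpenAlex API.
# The API is open and needs no registration; the DOI below is only an example.
import requests

doi = "10.7717/peerj.4375"  # example DOI of a work indexed in OpenAlex
resp = requests.get(f"https://api.openalex.org/works/https://doi.org/{doi}", timeout=30)
resp.raise_for_status()
work = resp.json()

# Each work carries several concept tags, each with a hierarchy level and a score.
for concept in work.get("concepts", []):
    print(f"{concept['display_name']} (level {concept['level']}, score {concept['score']:.2f})")
```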
27:20
Our tag properties, to cover real quick: it is an item-level tagging setup, we think that's much better than journal level for the reason Stephanie mentioned, and inclusive, so that means different articles can have multiple tags on them, I think again that reflects kind of the reality of what concepts are, it's hierarchical, so there's five or six levels of hierarchy,
27:44
starting with the root that has nine and then it kind of just expands outwards from that, it's automated and we think that's really important, because you know we have to tag about 50,000 new articles a day, and you know that's really difficult and really expensive to do manually, and it's language agnostic, so the tags themselves
28:05
are entities, those entities which I'll describe in a minute, can be described in any language, and they'll tag articles in any language, so it's possible for me to look at an article in you know let's say Turkish, which I don't read, and I can see the English language tags
28:22
associated with that, we think that it's really important for this to be something that's engaging with the community, so that's why our tags come from Wikidata, and we think that's a really cool place to get them, that's a community that's very large, very active, it's got a lot of really smart people working on it, instead of us kind of just coming up with the hierarchy, we're able to use a lot of, like I said, a lot of smart people having a lot of good
28:45
discussions, because fundamentally I think tagging or concepts has got to be a community oriented thing, I like to say tagging isn't really metadata, you know concepts aren't really metadata in the same way that the authors or title are, concepts are more argumentation right, it's more I'm making an argument, I'm making a point, well I think this is about history, no I think
29:03
it's about geography right, it's more part of the scholarly conversation than it is the metadata, and so that's why we think it's really important to connect this with the community. That being said, a lot of times you need to spend forever trying to get 100% buy-in from everybody, so that's why we kind of threw something out there right now, this is what we
29:21
think is a good approach, and we want to hear back from the community. Over time, what we'd like to do is allow people to upload their own sets of tagging to OpenAlex, or even better, they can fork the tags that we've got, so, you guys got it like 80% right, but here's mine that's a little bit better, and then when you're using OpenAlex you could potentially use this kind of
29:42
a parameter to the API or something like that, you could say hey I don't want to use the default concepts, I'm going to use this other set of concepts, this other set of tagging assertions that I like better, and that would be something you could do, and then potentially which set of concepts are better is something the community could kind of go back and forth with, and something we could sort of merge. That's it, thanks. Thanks Jason. All right, so last but
30:09
not least, we have Geoff Bilder, he's going to take it over, Crossref, Geoff. Hello, so first of all thank you to everybody, people have laid out the issues very nicely,
30:26
and all that's left for me to do is tell you a tale of woe, which is what we tried to do at Crossref, why it was ultimately stupid, how we're trying to fix it in the short term, and then how we hope to fix it in the long term. I'm Geoffrey Bilder, Director of Technology
30:45
and Research. Before this position I was head of what we call strategic initiatives, which is sort of a labs arm of Crossref, and we've developed a lot of things that have since become production things, and this is sort of the start of a tale
31:03
of exactly why you have to be careful about when you move things from labs to production. So almost everything and all of the problems and the stupidities here are ultimately my fault, and that's kind of why I'm trying to fix them. I want to tell you about... Can you share that? We can see the whole thing, yeah.
31:23
Oh, sorry, sorry, yep. So I wanted to tell you a little bit about the classification as we've applied it to what we call the REST API, and if you don't know what the REST API is, it's the main API for Crossref. It allows you to query information about our
31:41
members, about individual works, DOIs, about funders, about anybody who participates in Crossref. It gets about 500 to 700 million requests a month, so it's pretty popular and it's an upstream source for a lot of data that is fueling a lot of other systems
32:02
that we're talking about here. The REST API started as a Crossref Labs project, and we built it really to do something else originally. We built it as sort of something that we thought might serve as a back end for a tool that would allow people to cite things more
32:23
easily in blog posts, and that was its original conception, and then it kind of blew up and we did all sorts of experiments with it, one of which was to add classification data to the metadata records that we have, and we added classification metadata to the records that we had
32:44
because our members weren't providing this up front. Ideally, we would have liked them to have provided this up front, but they weren't. I don't think we even had a space for it in our schema, and so we had lots of people asking if we could apply classification data, so we conducted
33:04
an experiment, which is what our job was at the time, and what we did was we got this list of classifications from Elsevier, which you can download from their page, and it maps to ISSNs, and we mapped it to our ISSNs and then started including that metadata in our API.
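For reference, a small sketch of pulling one work from the Crossref REST API and reading the journal-level subject list being described here; the DOI is the example used in the Crossref API documentation, and not every record carries subject metadata, for the coverage reasons discussed next.

```python
# Minimal sketch of retrieving one work from the Crossref REST API and reading the
# journal-level "subject" classification. The DOI is only an example, and records
# without a mapped ISSN will simply have no subject list.
import requests

doi = "10.1037/0003-066x.59.1.29"  # example DOI from the Crossref API documentation
resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=30)
resp.raise_for_status()
message = resp.json()["message"]

print(message.get("container-title"))
print(message.get("subject", "no subject metadata for this record"))
```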
33:25
You can go off and you can look at this classification. It's pretty well documented, but when we applied it, we made a few mistakes. The first, and you can see this here, is that even though the classifications as provided by Elsevier are applied to the journal level,
33:46
to the container level, at the ISSN level, we included this metadata at the works level, and so this is misleading for a number of reasons, not the least of which is that,
34:01
of course, just because something, you know, we've got a whole bunch of things, for example, that might be published in a mega journal like PLOS One or in a very generalist journal like Nature, and so the classification doesn't necessarily apply to the item in that container. So that was the first big mistake that we made, and the second big mistake that we made was that we
34:26
didn't quite realize at the time that we were going to have all sorts of problems with overlap, or lack thereof. So there are ISSNs that we have in Crossref that don't exist in Scopus, and therefore we didn't have a classification for them, and then of course there are ISSNs in
34:41
Scopus that were not in Crossref, which didn't affect us as much, but nonetheless it's worth noting. Another problem we faced sort of more recently was what we call the ISSN race condition, which was that ISSN decided that as part of getting an ISSN, as part of the process of
35:03
getting an ISSN, a publication had to have a history of publishing first, and so from our point of view at Crossref, that meant we could no longer use the ISSN as a starting point for assigning categories, because some of our members, our new members who are just starting journals,
35:21
might not have an ISSN, and so we had to assign them what we call the container level DOI instead, and so we couldn't nab those at all. The other problem was that it only applied to serials. Crossref contains a lot more than just serials. We have monographs, we have components, data sets, all sorts of things. As I said, we applied it to the wrong resource,
35:44
and then this last one is sort of a technical problem, but it's a problem nonetheless, and that is that it's not updated accurately, so this particularly is a problem because it's cumulative, so if something changes or if something was mislabeled, we don't update that until we
36:03
re-index the entire database, so the categorizations get out of sync over time until we do a complete re-index of the database, and then finally the other and last problem was the particular categorization that we used didn't seem to be widely used elsewhere,
36:23
and so there are a whole slew of problems with the way that we implemented categorization in the API, and in the short term, we're trying to fix them in an R&D project. Esha Datta is, I know, on this call, and she's in the R&D group and is working on this as we talk,
36:43
and so what we're trying to do is we're trying to fill in the gaps. We're trying to see if we can take the information that we do have and use it as a training set to create an algorithm that allows us to at least tag at the container level the information that doesn't have
37:01
categories at the moment, and we hope to do this across our corpus so that it would apply to everything, and of course we'll make this open, and the big goal here is just to fill in gaps that I highlighted before, but we know that this is inadequate for a lot of the reasons that
37:22
lots of people have already pointed out. It's at the container level. It will be derived based on something that sort of already exists and is imperfect, and so we are looking for a longer term solution, and that's why we're interested in this conversation, why we've been talking to Stephanie and others about what we can do in the longer term. I will repeat probably something
37:44
that lots of other people have said, what we want to see is something that's cross-discipline that applies to the work level and that's open, and there's one thing that I did not list here, and that is that we would like to see it applied as far upstream as possible,
38:01
and that's not because we think that it's necessarily the most accurate place for it to happen, but we would like to see there be some categories assigned to publications as soon as they're registered with a DOI. If for nothing else than to seed a conversation and to allow
38:25
people to identify things that had ostensibly been categorized as this so that they can refine the categories or so that they can apply different categories to them if that's appropriate. So those are the things we'd like to see, and that's really our story of woe and thanks.
38:45
Thanks, Geoff. Appreciate the honesty. So bringing everyone back on screen here, we're getting a lot of questions in Q&A, but one question that Matt and I had when we were thinking about this topic with Stephanie in general, thinking about why is this important
39:02
for data metrics, make data count, our first question you all answered, which is great, which was what can we not repeat from mistakes in the past and what can we do differently? So it seems like you guys have all really pointed out that we needed to be open, cross-discipline, as Geoff says, upstream, and Stephanie went through a lot of the ways of how.
39:23
And so a question that we want to ask all of you before going to others is, this seems like a lengthy and lofty investment if we're going to create an open classification system, but thinking about maybe especially in the context of us trying to contextualize data and understand data metrics, what's the downside if we don't prioritize this big work
39:47
pretty immediately? So I'd love to hear from each of you to answer that question. Maybe so that it doesn't get awkward, just go with the speaking order. Let's do that. So Stephanie, you first. I'm going to keep it very brief because I already took up a lot
40:03
of time, but from my very specific position of somebody working on metrics, I think if we don't create an open classification system to come up with a good, reliable, field-normalized citation indicator, then all of the efforts of what we have been working on and especially people like
40:24
Ludo of making the citations open and now the abstract open, if we don't have the context of discipline, then I think there's a really big gap and it will be really hard to move away from these proprietary infrastructures where we have built up this information over decades.
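To illustrate why the discipline label matters for such an indicator, here is a small sketch of a field-normalized citation score in the spirit of the mean normalized citation score; the numbers are invented, and this is not the Make Data Count indicator itself.

```python
# Sketch of why discipline labels are needed for a field-normalized citation
# indicator: each item's citation count is divided by the mean count of its field
# (in practice also its publication year and item type). Numbers are invented.
from collections import defaultdict
from statistics import mean

records = [
    {"id": "dataset-1", "field": "Medical and health sciences", "citations": 12},
    {"id": "dataset-2", "field": "Medical and health sciences", "citations": 4},
    {"id": "dataset-3", "field": "Humanities", "citations": 2},
    {"id": "dataset-4", "field": "Humanities", "citations": 1},
]

by_field = defaultdict(list)
for r in records:
    by_field[r["field"]].append(r["citations"])
field_mean = {field: mean(counts) for field, counts in by_field.items()}

for r in records:
    score = r["citations"] / field_mean[r["field"]]
    print(r["id"], round(score, 2))
# dataset-3 (2 citations in a low-citation field) scores higher than dataset-2
# (4 citations in a high-citation field); without a field label for each item,
# no such comparison is possible.
```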
40:51
Ludo, Kristi, Jason, Geoff. Well, I think I also made clear why I feel it's really essential to have open classifications. I think I should really use
41:04
plural because of the reasons that I explained. However, there's perhaps one kind of thing I want to make a warning about. So we as bibliometricians, we tend to think about classifications indeed as a way to do, as Stephanie calls it, field normalization. It's also something
41:25
that has a long history at my center and also in my own work. There have also, for instance, been strong recommendations to do field normalization. I must say over the years, I have become a little bit more skeptical about that because yes, these things need to be
41:42
normalized in some sense, but at the same time, these normalizations are so sensitive to the choices that you make and that's kind of hidden behind the numbers. So end users are not aware of that and I struggle with that. I must say, I find it sometimes a bit uneasy to indeed work with these normalized indicators to present them as things that can be compared
42:03
across fields while we also know that it all depends very much on which classification you happen to be working with and I don't really have an answer to that, but it's kind of warning I want to make. So let's also be a little bit careful with this. Okay, I'll just jump in really quickly to say, especially I've talked a little bit about
42:24
this biomedical context, but also that it reflects so much more than what's happening on a medical research campus. Our teams represent different perspectives from our community and beyond and we aren't able to necessarily understand the context of the work and how it
42:42
relates to one another. I think there's also some challenges when we're looking across disciplines. There are synonyms that are used and it makes it very difficult for us to be able to tease apart what's happening in the biomedical context like we see it from there and recognizing actually it's the same thing in a different discipline. So I love the idea of
43:04
more open infrastructure always, but I think that this is actually a tool that can be used extensively by many different stakeholders. Yeah, and I'll jump in to say I think the other
43:21
panelists covered it pretty well. I always say that I think when it comes to classification and just anything in scholarly metadata, if we in the open world don't get it, it's going to be enclosed by those who are going to try and make a commercial enterprise and then charge rent for it. And I think we unfortunately kind of got trapped into that with a lot of the other
43:42
systems that we use in scholarly communication like citation graphs and stuff like that. And I think that we have a chance to maybe start it on the open foot now and make that the default. And I think that people 10, 20 years from now could potentially be really thankful as well. Right. So I'm clearly pro open infrastructure. And the only thing I'd point out
44:05
is that everybody benefits from open infrastructure, including closed organizations. And I think that that's ultimately the power of open infrastructure. We get a lot of people
44:22
to agree to use it because it doesn't lock them in. The thing that I didn't say is that one of the reasons I think it's important that it be open is that in order to get it adopted upstream, we need something that everybody can use without risk. And so, yeah, I think open is
44:47
important for everybody in the community. Yes, some really interesting comments. And I see a few comments coming through in both the chat and Q&A. I guess Daniella and I have spoken about a bit of
45:06
this before and continuing on this theme. I think it would be nice to hear from the panelists briefly. We've spoken a bit about what is the risk of not prioritizing this now, but
45:20
what do you see now as the next step? And I think that would be interesting to hear for the participants as what do we do as the next step towards an open classification system, in your opinion. And maybe if we just use the same order again. Yeah, so I laid out a little bit the technical problems. And I think it's also
45:45
safe to say like what Ludo said in terms of they have such a long experience of trying to classify things. And it's hard because people don't necessarily agree on what is the subject, what is something about. It also changes over time. In bibliometrics, we also have this
46:05
cited-side versus citing-side normalization. Should something be placed in where it's published or where it's used? But I think in general, I would say that the next steps are more of a community bringing the right people together rather than focusing too much on
46:23
the one technical solution at this point. Because I think we'll never create anything perfect. I think if you talk about classification, this is something you have to accept. There is not a one-size-fits-all. So I think more than just quickly coming up with a new
46:41
system, it's so much more important to get the right people together that we ensure reusability and that it applies to a lot of users. Because creating another silo or even if it's open, if it's only used or backed up by a very small community, it's also not very helpful.
47:04
And I also think that we have to have some definitions of what kind of use cases we want to focus on. Because from an indicator and metrics point of view, a classification system is something completely different than from discoverability or monitoring or evaluating
47:22
science. And I just want to say that I'm all for reusing existing systems rather than reinventing the wheel. Because I think also with the best intentions, it's really hard to come up with something good and universal. At the same time, we also know that existing systems,
47:41
if they're bottom-up or top-down, they have problems in terms of diversity. Even if we say, okay, we're looking at all scholarly literature and clustering things, well, we already know that fields like humanities and arts aren't well represented in the scholarly outputs we usually look at. So we're excluding those. We're excluding certain languages. And the
48:02
same with these traditional systems that we have, like Dewey, right, is famous for basically excluding lots of different diverse points of view. So I think from my perspective, rather than, we have to find technical solutions, but for me, the most important thing is the community and bringing together diverse points of views to ensure that this is a system
48:25
that's relevant to many and not just few. What I could add to what Stephanie just mentioned is, well, the thing that you did actually mention, Stephanie, the use case, I think that's
48:42
probably what should come first, thinking really carefully about the key use cases that you want to serve and the different use cases will require different solutions. So that's also why I emphasize the need to kind of talk about this in terms of classification, plural rather than singular. And I also think that perhaps there is this issue of different levels of granularity
49:04
that classifications can have. And I feel that especially when you go to lower levels of granularity, it's hard to have anything that resembles a kind of a generic solution. I'm a bit skeptical about that. I think that higher levels of granularity, bigger disciplines, there is something like that. There's a need for something like that.
49:26
So perhaps that's where we should start to have a kind of consistency in the way all kinds of scientific statistics are reported in policy reports globally to just make sure that these statistics reports can be compared in a better way because they rely on a shared classification.
49:47
So that's, I think, probably where I would start. Okay. I'll just jump in. I love everything that Stephanie and Ludo said. I will say that making sure that the open classification is representative of diverse perspectives is
50:03
important, but also understanding the stakeholders who can help with implementation and adoption on a practical level is important. And I think that's a role that libraries play a great role on campus with and can help to advance those conversations in addition to some of the more
50:21
specific technical aspects of the discussion. So making that an inclusive process is incredibly important. Yeah. And I'll say that my co-panelists, I think, have addressed it really well. I would add, you know, one of the things we try and do at Our Research is
50:41
try and create something that people can talk about. I think it's great to have these discussions, but I think it's also useful to have something that people can look at and say, okay, well, that's wrong or this part's right. So that's why we're hoping we're contributing a little bit to discussion and that we do have an operational open system right now. You can download all the tags. You can download all the articles. You can use API.
51:01
You can download the algorithm that we use that does the tagging. That's all open source. So the thing is kind of open soup to nuts and we really want to hear what people kind of have to say based on, like I said, open tagging system of Wikidata. We want to hear what people have to say about that. Is it good? Is it bad? How do they want to change it? And we hope that maybe something like that can provide maybe some useful examples maybe to add to the
51:25
discussion. Yeah. I mean, I'd pick up on that. I think, you know, for all of my joking about why what we did was stupid, it was useful to actually get something in front of people
51:41
to respond to. And I think probably the only, my only real regret is that we hadn't, you know, that we weren't able to respond to it earlier because we had a lot of other stuff to do. But now we're looking at it and I hope we can do something, as I said, in the short term that certainly won't be, still won't be perfect by any means and won't address a lot of
52:04
the issues raised here, but at least it will make what we have or the limited thing we have a little more consistent. Thanks so much, everyone, for sharing your perspectives on that. I know that all of you who have asked questions are probably upset that we haven't gotten to
52:21
them, which is because there are too many. So we didn't want to just try and answer one, but we have saved them and we're hoping that the panelists can follow up with you, except for if it says anonymous, we don't know who you are, but to try and answer these. And thank you to the panelists for going through and talking about this. I think, of course, maybe it's still a little confusing about why make data count, why do we care about
52:44
this? But as Stephanie points out, you know, as we're trying to get to this point, everyone wants research assessment. We all care about data now, but we can't, this is a hump. And so it actually requires, and with everything around data infrastructure, data is not a silo. So that's why we wanted to have this conversation that it's actually just,
53:02
we need to figure out this infrastructure for everything. And it happens to be something that Stephanie's team really found while doing this research around make data count. And so please keep following around this because I think there's, especially the panelists here have a lot that they want to build up on this. And so we're really excited to see where that
53:21
goes having stemmed from this and moving forward. And I'm going to share my screen and pass it over to Matt to announce the last webinar. Yeah. And I think also just to add a final comment around that, you know, there's some work for us as the community. And, you know, we know that we've heard that we want to do it in an open way. We want to do it upstream.
53:45
It must be useful for everyone. And we need to protect the interests of the community. And so doing this together is really, really important. It's been really great to hear from the panelists, the different perspectives. And we're continuing on the theme and
54:06
the next webinar will be posted in the chat. I think Paul, if you can post that, the link. And so building on from where we are here is to talk about beginning and using metadata
54:25
or metadata for meaningful data. And so a lot of work that we need to do together. If it was easy, we wouldn't all be here today. And so we really thank you all for joining. Apologies that we couldn't get to all the questions. There were some really great
54:43
questions and we will share those with the panelists and try to get back to anyone that did put in comments as well. Just to confirm, has the link for the next webinar been posted in the chat? Yes. And we didn't have fun photos of speakers. It's a little farther away. It's in May and
55:07
sorry for not showing the chat. We can put that date in there. It's in May. And it's kind of the third hump that we wanted to get to, which is another thing Stephanie's team found. First of all, there weren't subjects in DataCite. So how are we going to even figure out what
55:21
discipline these data sets are from? And then this last one is going to be about what metadata can repositories, publishers, everyone who's contributing in the metadata world, what is required for us to actually build out and understand and start to do these studies to understand research data reuse and potential assessments. So speakers to be announced soon,
55:43
please follow along. Thank you so much to everyone. We'll take a second to save all the chat and everything. Please get in touch if you have any questions and thank you again to our panelists. This was an excellent discussion. We appreciate all of you.