
To Humanities and Beyond


Formal Metadata

Title
To Humanities and Beyond
Title of Series
Number of Parts
14
Author
License
CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
Chair: Thaiane Oliveria (Universidade Federal Fluminense (UFF), Brazil). HIRMEOS and Beyond - A New Metrics Service for Books: - Sofie Wennström (Stockholm University Press, Sweden), Rowan Hatherley (Ubiquity Press, USA). - Doing Research Assessment - An Innovative Approach to Analyse Social Science Research: João Costa (Global Development Network, India). - Mapping the Policy Landscape: Euan Adie (Overton.io, UK). - Legal Citation Analysis with CourtListener and Cobaltmetrics: Luc Boruta, Damien Vannson (Thunken, USA)
Transcript: English (auto-generated)
Hello, yes, it's working. I'm sorry for this technical problem. We are starting now the panel about humanities and beyond. I would like to thank you
Sofie Wennström. No? Brian Hall. I'm sorry. Sorry, this is an old program. Okay, okay. What's your name? Are you from this table?
What's your name here? I'm so sorry. This is the previous program. Thank you so much, Joe. So I would like to thank you
Euan, Euan Adie, Luc Boruta and Damien Vannson. Thank you so much for composing this panel. I am also representing João Costa from Global Development Network.
He wasn't able to attend this conference and I will present his research. So I would like to invite Sofie. Yeah, okay. Thank you so much.
Can you hear me? Cool. While we wait for my slides, just a bit of explanation of the confusion.
Yeah, this talk was originally supposed to be done by Brian from Ubiquity Press with Sofie being the sort of backup speaker. Neither of them could attend today. So I'm the secondary backup speaker for this talk. Great, cool. All right.
Yeah, so good afternoon. I am here to talk to you about the HIRMEOS project, specifically focusing on two work packages that are related to metrics and altmetrics. As I said, I am from Ubiquity Press. I'm a software developer. My name is Rowan Hatherley.
For those of you who don't know Ubiquity Press, this is basically just an overview of who we are. We're not going to go into the full background of our organization. The important things are that we are an open source publisher and open source publishing platform.
We like to say that we're research-led. This is mostly working with our customers to find new ways to achieve our mission statement. This little guy over here is one of our customers steering the ship that is our company. Alright, we like to show this
diagram to our new customers, but it's actually quite nice for this conference as well. It shows the benefits of publishing open access. Focusing more on this little middle bit here, where we talk about, you know, career recognition and collaborations. We talk about how
citations Well, open access corresponds to high citations. We don't really talk too much about altmetrics. As we've heard earlier on today, citations tells you how well your articles are doing in a more academic-based
environment, whereas altmetrics tells you how you're engaging with the broader community, your impact beyond general academics. One last thing, because this does tie in quite nicely, is this is our customer charter. We've got an editorial board who forces us to
maintain these principles. Everything we do is entirely open access. Our software, we strive to be open source, and we have no all-inclusive bundling. What's quite nice about this is that the partners at HIRMEOS also agreed with this approach. Everything we've done, we've tried to be as open as possible. All the software we've developed during this process is now open source.
It's done in a very modular fashion, which is quite nice. Alright, so just to give a brief overview of HIRMEOS. Yeah, HIRMEOS was basically started by
OPERAS. I'm not going to go through the full acronym and what HIRMEOS stands for. But basically, it's a project that was designed to help integrate open access monographs into an open science ecosystem.
This involved developing new services to tackle challenges that are faced in humanities and social sciences. This also involves providing data links and interactions with monographs to make way for new tools
for research assessment. And finally, just integrating these services into our different partner platforms. Okay, next slide is not happening. I need to get up to my next slide, unfortunately.
It's being uncooperative.
Yes, yeah, I can't go back or forward. Great, okay, I'm back. Thanks. Okay, so this is just more of a diagrammatic overview of that process. So we've got the services we developed and they get integrated into our platforms to produce this data, which can be used elsewhere.
HIRMEOS was undertaken by, I think, five or more publishing partners and spanned across seven different work packages. Of these work packages, two are relevant to this conference and those are the two I'm going to talk about. The first
was the fifth work package, the annotations work package. This basically involved integrating a tool called Hypothesis into our platform. This allows users to create annotations on that platform
or on books and book chapters. The nice thing about this is that when users use this annotation tool to interact with books, they are effectively creating things that can become altmetrics. Part of the service is also collecting
these annotations that users made on this Hypothesis platform. This sort of moved into the WP6 package, which was metrics and altmetrics. So providing tools for publishers to measure their usage metrics
There are other metrics such as citations as well as altmetrics, so Twitter activity, Wikipedia activity, Hypothesis annotations, and a few more. Yeah, for those of you who are not familiar with...
I've done it again, I'm afraid. Sorry, I'm stuck again. I'm assuming they might have done it.
Well, while we're waiting for this, I can tell you the benefits of annotations. Alright, so generally speaking, whenever you... Not helpful, sorry. Yeah, whenever you sort of... I'm not sure if you are familiar with Disqus. It's a
little service that allows your users to kind of put comments at the bottom of an article or blog post or something. Hypothesis takes this to another level. It allows you to highlight a section of text and make comments specifically on that little piece of text.
So this allows... Well, it's used for multiple things potentially. So users and authors can use it to enrich their content. Awesome. Yeah, so any of you who can speak French, yeah, this is from one of our partner presses, Stockholm University Press.
Yeah, this is where Sofie's from. Yeah, so this is one of the books that has a lot of Hypothesis annotations on it. Someone's highlighted a section of text and explained, if I remember correctly, that LOFT is a legal entity and what that means in terms of this book.
So it's a nice tool for enriching data by either providing additional explanation of terms. You can link, obviously, well, link to supplementary data that is specific to a section of text. Authors have used it to update a bit of information. So they say this is new research
based on here. This is what we've come up with. Yeah, it can also be used in post-publication peer review. So a peer review that happens after it's gone to publication, which is quite nice. And quite often authors use it just to generally engage with their readers.
So authors ask questions about their own content and the community kind of discusses things specifically in their books. All right, so what does the metric system look like?
On an actual web page. Part of the work package, which I unfortunately forgot to mention, was this metrics widget that we've developed. So this is one of our books called The Battle for Open. The reason it's here is because it also has a lot of hypothesis annotations on it.
Unfortunately I can't show a screenshot with both the book and the actual metrics, because the metrics are a lot further down the page. So just pretend we scrolled down and we got that. Alright, so basically the way this metrics widget is structured is we show specifically how many different measures are available. So we don't have a donut saying
this is a score, therefore this is a very good or very bad, well not bad, but less engaged with an article. We just show how many measures we have available. So we're trying to be fairly neutral from that perspective. In addition we sort of have bundled our metrics and altmetrics
together. So this book has a lot of downloads from OAPEN but also 15 tweets. It's been referenced twice on WordPress.com, once on Wikipedia, and has 12 hypothesis annotations. Something I've not shown here and something that will be added
soon. Well, it's already there. The widget allows you to have an optional sort of additional information link. This is where you would link to your usage metrics over time. You know, it says 15 tweets there showing specifically what those 15 tweets are, if people want to see what they are, see when those
tweets were made. Yeah, hypothesis again shows you what those annotations are in the book, where they were made in the book, etc. It links to that bit in the book. Yeah, and the last fairly interesting thing is this little question mark here.
We were talking about, I can't remember who said it earlier on, about you take on sort of trust what people have put in certain scores they develop. In this little question mark section, it'll take you to a
definition of that metric saying exactly how we calculated it, how the data was aggregated, you know, did we consider retweets to be tweets... Yeah, Wikipedia is an interesting one to be honest. That one Wikipedia reference is actually composed of at least 10 Wikipedia pages.
Wikipedia re-releases pages as new versions in future, so we've sort of removed the duplicates before providing that score. Cool, so moving forward with this project. Yeah, people often ask, is it "Hermeos"? Is it "Hirmeos"?
And things like that. I've been told this no longer matters. HIRMEOS doesn't actually conceptually exist anymore. It's a project that happened and is no longer running. OPERAS will carry the project forward, so they will effectively take ownership of it under OPERAS
because they were the guys doing all the work. And they will be the ones who will be expanding on this HIRMEOS project. As I said, all the code that has been produced is entirely open source. Yeah, you can go to that GitHub link, you can contribute to the code, you can download the code and use it for yourselves.
And again, it's under the MIT license, which means you can develop it further for your own purposes with, I think, pretty much no limitations. Yeah, in terms of us being Ubiquity Press, different hats, we're focusing on expanding...
Yeah, this isn't easily explained. We're expanding on the list of metrics we'll be collecting. So the way the tool works, basically, is we've designed it to be modular. So we basically run a set of plugins. Each plugin collects a different metric or altmetric.
So the idea is to increase the set of plugins to collect additional information in the future. And yeah, as I said, this service effectively can be used by anyone. We are integrating it into our own sort of ecosystem, as shown here. So this is more or less a basic diagram of the different services
we at Ubiquity Press have. So we've got repositories, conferences, in addition to journals and books, etc. We've separated the metric system out as its own service, and we will be using it to collect metrics for each of those different
services. Cool. And apologies for the little thing. That was a screenshot without realizing it. Cool. If you have any questions, please let me know. Thank you.
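The modular, plugin-based design described in the talk (one plugin per metric or altmetric, with new metrics added by registering new plugins) could be sketched roughly like this. The class and plugin names, and the returned counts, are illustrative assumptions, not the actual HIRMEOS/Ubiquity Press code:

```python
# Hypothetical sketch of a plugin-based metrics collector, in the spirit of
# the modular service described in the talk. Not the real HIRMEOS API.
from abc import ABC, abstractmethod


class MetricPlugin(ABC):
    """Each plugin knows how to collect one metric or altmetric."""

    name: str

    @abstractmethod
    def collect(self, identifier: str) -> int:
        """Return the event count for the given work (e.g. a DOI)."""


class TweetPlugin(MetricPlugin):
    name = "tweets"

    def collect(self, identifier: str) -> int:
        # In a real service this would query a Twitter data source.
        return 15


class AnnotationPlugin(MetricPlugin):
    name = "hypothesis_annotations"

    def collect(self, identifier: str) -> int:
        # In a real service this would query the Hypothesis API.
        return 12


def collect_all(identifier: str, plugins: list[MetricPlugin]) -> dict[str, int]:
    """Run every registered plugin; adding a metric means adding a plugin."""
    return {p.name: p.collect(identifier) for p in plugins}


metrics = collect_all("10.xxxx/example", [TweetPlugin(), AnnotationPlugin()])
print(metrics)  # {'tweets': 15, 'hypothesis_annotations': 12}
```

The point of the design, as described, is that the collector core never changes: expanding the list of metrics is just a matter of dropping in another plugin.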
Unfortunately, João Costa was not able to attend this conference due to some health problems. So I am representing him. He sent me some notes, and I will read them during the presentation. At the end of the presentation, there is
his email. If you want to contact him, please do. The Global Development Network is a public international organization that supports high-quality,
policy-oriented social science research in developing and transition countries to promote better lives. It supports researchers with financial resources, a global network, as well as access to information, training, peer review, and mentoring.
GDN launched an innovative program to investigate the challenge of doing quality social science research in developing countries, which we are here to present, simply named Doing Research. Our guiding questions were established as we believe that contributing to a better objective
assessment of research systems for social science in developing countries will expose weaknesses and shortcomings that can be addressed through better informed national research policy.
The production, diffusion, and the use of locally-grounded social science research is key to democratic debate and planning for sustainable development. Building on the current discourses on knowledge systems, the program puts forward a full-fledged
definition of what a research system is and operationalizes it to investigate the national environment for social science research in three main dimensions: context, actors, and systemic features. The program proposes the Doing Research Assessment, a three-step
method to study the national social science research system through a context analysis, stakeholder mapping, and an indicator-based theoretical framework. The program will produce both national and global reports. All data sets will be available in open access and all outputs will be translated
into accessible outreach material to support awareness and action on social science research. The Doing Research pilot aimed to characterize, describe, and wherever possible measure the most relevant features of
the research environment across 11 countries. It was implemented by GDN between April 2014 and April 2016 with seven research teams in Africa (Cameroon, Côte d'Ivoire, Niger, South Africa), Latin America (Bolivia, Paraguay, Peru), and Asia (Bangladesh, Cambodia,
India, and Indonesia). Covering a diverse sample of countries in very different contexts and using varied research methodologies, the pilot provided rich quantitative information on the complex nature of a research environment.
Each team produced its own report on the main challenges, the key players, and how they could possibly be addressed on a long-term basis.
The Doing Research pilot was generously supported by the Bill and Melinda Gates Foundation, the Agence Française de Développement, the French Ministry of Foreign Affairs and International Development, and the Swiss Agency for Development and Cooperation.
The pilot phase also highlighted the different strategic points we needed to address in a scale-up phase. First, we must keep in mind that the final product must be flexible in order to take into account the diversity of context and of research products.
Second, the priority must be on developing a robust assessment tool before marketing it; developing a ranking is a secondary consideration. Third, the final product must serve a diversity of stakeholders: policy actors, to support the implementation of more efficient and enabling research policies; research
administrators such as deans and directors, who take decisions at the meso level; researchers themselves, to document the challenges that apply to their research environment; and international donors and capacity-building organizations, to better tailor interventions and support.
The implementation of the Doing Research assessment begins with an overall assessment of the context for doing research along economic, political, historical, and international dimensions (step one), followed by a mapping of national research actors to identify research
producers and users (step two). The context assessment and the mapping of national research actors are then used as inputs into the Doing Research assessment framework (step three), using a combination of secondary data, surveys, and interviews.
Our analysis of the general context in which research takes place is made up of four elements: economic, historical, political, and international dimensions. These are
assessed from a qualitative perspective to determine the borders of our analysis, but most importantly, they allow us to develop a contextualized reading of the subsequent steps of the DRA method. During the current stage, it became obvious that instead we should divide it into three
spheres, economic, socio-political, and international, as per the constant improvement from the interaction between the GDN team, the national teams, and the scientific advisor each one of them has.
Each team documented the context to help develop an understanding of exogenous factors that impact the research system, such as cultural specificities, the nature of the political regime, the level of human development, or the access to technology. Since the
practice of research is highly dependent on these contextual characteristics, documenting the context is critical for analyzing the indicators measured in step three of the assessment. The mapping is conducted to better identify the research actors, producers,
and users that make up the research system. It is directed at a macro level analysis as the aim is not to assess each and every university or funding agency. Instead, we identify and characterize the importance of different groups of actors and the nature of relations between them and
identify the main players within each group. This allows a more contextualized reading of framework and eventually will enable research using the framework to tailor its application to a particular type of actor. Research actors are divided into four categories
higher education institutions, government and funding agencies, industry, and civil society. These categories have subgroups. For example, higher education can be divided into public and private universities, which can be for-profit and non-profit
organizations; industry includes for-profit think tanks and consultancies; and civil society includes NGOs, opinion leaders, non-profit think tanks, and the media. Government and funding agencies is the most hybrid category. It includes
national ministries and the research councils as well as public, private, and foreign donors. Populating the framework is the final step in the implementation of the DRA. It describes the three determinants of each of the three main functions of the research system,
namely the production, diffusion, and policy uptake of the research. The framework follows a linear theory of change, which may well be a simplified version of reality,
but it is nonetheless useful for documenting the factors that enable the production, diffusion, and policy uptake of social science research. It will be populated for each country by documenting a list of indicators and aggregating those indicators into a multi-criteria composite
indicator. The three main functions of the research system are understood as follows: research production, research diffusion, and research policy uptake. The indicators, which are the core of the Doing Research assessment, are closely connected to altmetrics, and we know that this can contribute to
capturing diffusion and uptake that is not captured by science metrics.
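As a rough illustration of the aggregation step mentioned above, here is a minimal sketch of a weighted multi-criteria composite indicator. The indicator names, weights, and scores are made up for the example and are not GDN's actual methodology:

```python
# Minimal sketch: aggregate normalized indicator scores (0-1 scale) into a
# single composite value via a weighted average. Illustrative only; the
# indicator names and weights below are hypothetical, not GDN's real ones.

def composite_indicator(indicators: dict[str, float],
                        weights: dict[str, float]) -> float:
    """Weighted average of the indicator scores present in `indicators`."""
    total_weight = sum(weights[k] for k in indicators)
    return sum(indicators[k] * weights[k] for k in indicators) / total_weight


# Hypothetical "research production" sub-score built from two indicators.
production = composite_indicator(
    {"researchers_per_capita": 0.4, "publication_output": 0.7},
    {"researchers_per_capita": 0.5, "publication_output": 0.5},
)
print(round(production, 2))  # 0.55
```

Dividing by the sum of the weights that are actually used means the aggregation still behaves sensibly when an indicator is missing for a given country, which matters for cross-country frameworks with patchy data.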
This is our current stage: four teams are in the field, about to finish their implementation, and they will present their results in a couple of weeks. At the same conference,
we will organize a side event where we are trying to create a global alliance to guide the research-on-research rationale for development. Why are we doing all this? Because we can all together get all
these actors working together for the same goal while implementing their own agendas.
We look forward to working with other organizations, agencies, or interested parties to support research, especially in the field of altmetrics, as we truly think it is vital for the development of our tool and overall program, and we want to have as much
input as possible. Thank you and apologies for not being present, but I am more than available to reply by mail. Thank you.

I used to work for Altmetric. I left last year and since then I've been working more on policy, which is what I wanted to talk to you about today.
So the background is I am working on a tool for policy. I'm not going to speak so much about the tool today. What I want to do instead is give you the kind of 10,000 feet view of the kind of policy document landscape, if you like, and I expect some of it's going to be common sense and you'd be like,
yes, why are you saying that? And some of it's going to be new. I've got some data from, well, I've collected some data, I've got some numbers coming out of that data, and some of it, hopefully, some of it you'll think I'm wrong and that'll be useful because then you can come and tell me why I'm wrong and we'll both learn from that, hopefully.
So it's starting at the sort of very top level. Talking of those early AM conferences, there used to be this really annoying question that would come up. Someone would inevitably say, but what are altmetrics? What do we mean by altmetrics when we say it? Or what counts as an altmetric and this kind of thing? And it used to annoy me because, not because it wasn't a good question, because it is,
and it is an interesting conceptual thing, but because it would just end up in, you know, three hours of slightly pointless debate, you know, debating is this actually altmetrics or is it metrics more generally, yadda yadda yadda. So it saddens me a little bit to have to start with this, which is, what are policy documents?
You know, a few people this morning have mentioned about policy impact and policy documents, but everybody has a slightly different definition of what a policy document is. It turns out even in the field, in academia, when you're studying policy, there isn't necessarily a clear definition of what you mean by it. So here's a straw man that I'm using, so a policy document.
This is a document specifically written for the purpose of changing policy or practice. It's a bit of a circular argument there. But it's not, you know, it's not basic research. It's not aimed necessarily at other academics to further a field; it's aimed specifically at changing something, changing the way something is done.
So if you take that as a starter, what does that imply or what can we figure out from that? Well, the first thing is if you look at all the documents that fall under that umbrella, you find they vary greatly in scope. So if we're talking about policy documents, there's international policy. If you think about climate, international climate change agreements or trade agreements, this kind of thing, and then nationally, obviously,
you know, things that the country's doing. But then, of course, at a state level, which is less important in some places than others. If you think about states in the US where you've got state universities, their mission is to contribute to the environment of the state, and that includes state policy. And down to the city and even the street level and, you know, the parking bylaws in this small town.
I actually don't know where Three Rivers is, somewhere in the UK, but something has happened to make them care a lot about skateboarding. So, yeah, what I should add here is that, pragmatically, there are some decisions to be made,
because if you include all of this, and I've got some more thoughts on this later, there's a lot of information here, and it's not all relevant to policy altmetrics. So we have to make a pragmatic decision about what's actually most interesting to users of altmetrics. What can we get a handle on? How do we approach the problem?
Realistically here, I think the answer is that people care more about the state level and above, broadly speaking, than about the city and street level; you're really getting into the weeds at that point. Policy documents also vary widely in form. And this happens in scholarly publishing as well, right? A letter in Nature is original research; it's peer-reviewed. It's not like a letter to the editor where you're complaining about skateboarding restrictions or whatever. And equally, in the policy world, sometimes you'll see a blog post from a think tank that's actually the accumulation of many months' worth of work. It's a well-researched piece.
It's got references at the bottom. It's a report by any other name. Just because it isn't in PDF form doesn't mean you shouldn't treat it the same as a policy brief from somewhere else. And the reason for that, and it's simple when you think about it, is that these documents are designed to reach a particular audience, right? The point of this
is to change policy. We decided, like, you know, this document is to change the way something is done. And sometimes the best way to do that is by a blog post and not, you know, a 120-page PDF report. So you have to bear that in mind when designing systems to collect and process them. The culture
is very different, too. And again, some of this hopefully is common sense, but compare it to the scholarly record, where there's much more of an emphasis on making sure that a document is permanent: even if a publisher goes bust, there's still an archived copy that libraries can access. When something's got a DOI, that DOI points specifically to one version and not, you know, five or six rewrites afterwards. None of that exists in the policy world. It's rare to have a permanent record, so if a government department ceases to exist, it's not necessarily the case that someone is going to archive all that content.
Often with a think tank, the document archive goes back to the last time their website was redesigned, right? So they've got a fresh start from there, which is not necessarily something you want when you're looking at, you know, if you're a funder of a university and you're trying to look back 10, 15 years. And the citation culture is also very different.
I don't have any evidence for this, but it looks more overtly political to me. I'd love to work with someone to figure this out, you know, figure out if this is true or not. And there's certainly more self-citation in policy documents than in the scholarly content. And there's a lot of, I think, very interesting questions posed by these differences in citation culture.
And if anyone is interested in doing this research, I'd love to work with you. But yeah: does the open access status of a paper help it get cited in these policy documents? Can you detect political motivations in citations? The big example from a couple of years ago was, after Trump got into power, all these documents going missing from government websites, or being deleted from them, the ones that mention man-made climate change, for example. So can you detect that in the citations in documents as well? So, getting down to the numbers and the data: what does this look like in terms of altmetrics? Well, another thing that's very different
about policy document references, compared to scholarly references, is that they're made up of different types. Of the identifiable references in the tool that I've been working on, looking at policy documents across the spectrum (clinical guidelines, science and technology, education, economics, crime and punishment, this kind of thing), about a third of those references are to other policy documents. A third are to scholarly research, so things with a DOI or a book. About a sixth are to media outlets, which surprised me a lot. As I was saying before, there are more citations in policy documents to the New York Times, for example, than there are to The Lancet. And then there's a sixth, you know, to legislation, to patents, and other things as well.
When you're talking about tracking the impact of policy (and this is true of altmetrics broadly, we've talked about altmetric.com a lot, but it's especially true in policy), second-order citations are very important for tracing that path. So often what you'll see in policy is, I mentioned the New York Times being cited a lot,
so you'll get research that's mentioned in a newspaper story, and then it's a newspaper story that's cited as kind of evidence in the policy. Or you'll get research from a think tank that's a bit more academic and, you know, has done a bit more of a review or pulled in research from the scholarly record, and then it's that think tank report that gets picked up and cited in the government piece or acted on
at a national level. So you really need not just the straight, you know, academic paper to policy document, equals impact thing. You need the kind of full citation network to get a fuller picture.
So now I'm going to get a bit hand-wavy, and this might be where you disagree with me; I'd be interested in seeing what other people think. It's hard enough to know how many scholarly articles exist in the world. Unless somebody has an answer now? I think something like 180 million is the best
estimate I've seen. And that's in a field where we have collection librarians; there's a whole profession and science behind it. For policy documents, how many are there? How many should we be looking at to get a proper picture? Well, we can make some guesstimates and try to get in the right order of magnitude.
So if we think about, first of all, governmental sources, you know, government departments, parliamentary bodies, so things like here in the UK, the Parliament has a library and it briefs MPs. It goes off and it does academic research or pulls in research from academics. And then they synthesize it into these briefs, and that's what MPs get
before, you know, voting on the relevant topics. Legal bodies, regulatory agencies, central banks, all these kinds of things. We can very roughly say, okay, we've got 195 countries in the world; let's say each has around 20 bodies that produce documents in significant numbers. Obviously Tuvalu probably has fewer
and Germany has far more, but let's assume it balances out. So that gives us about 4,000. Then in terms of non-governmental sources, where can we look? Think tanks are a big source. There's a wide range of estimates here, but
the Global Go To Think Tank Index is the big annual ranking; it's what you see on think tank websites when they say, we're the number three most influential think tank in Southeast Asia working in economics. That comes from this index report, and it says there are about three and a half thousand think tanks worldwide. The data's a little bit dodgy, though: when you actually go and look at the rankings and follow them up, you see that out of the three and a half thousand, number 300 on the list is a place in Albania that shut down ten years ago and whose website redirects to a domain parking page. So you think, well, how can that still be this influential in 2018?
But let's take three and a half thousand, and then add NGOs, IGOs, foundations, and private contract research firms like RAND and MITRE, people who do work for the government. And so you end up with somewhere in the region of 10,000 different policy sources. So that's a tractable problem, right, tracking 10,000 different places.
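As a sanity check, that back-of-the-envelope arithmetic can be written out in a few lines. The per-country figure comes from the talk, the think-tank count from the index; the remainder for NGOs, IGOs, foundations, and contract firms is an assumed number chosen only to round out the stated total.

```python
# Rough order-of-magnitude estimate of policy sources, using the talk's
# figures (all guesstimates, not measured data).

countries = 195
bodies_per_country = 20        # departments, parliaments, regulators, courts...
governmental = countries * bodies_per_country   # roughly 4,000

think_tanks = 3500             # Global Go To Think Tank Index figure
other_sources = 2500           # assumed: NGOs, IGOs, foundations, contract firms

total_sources = governmental + think_tanks + other_sources
print(f"~{total_sources:,} policy sources")  # in the region of 10,000
```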
That's just the sources, not the documents. In terms of the actual documents, what you see is a long tail. So this chart lists different sources along the bottom, and how many documents each has got. You can see there's a bit of a concentration: a few
sources at the top have a lot of policy documents with references in them, like the WHO. You definitely can't read it from where you are, but there's GOV.UK in the UK, a lot of big aggregators, a lot of big IGOs.
So a lot of sources have tens or hundreds of documents, not thousands necessarily. It's not like a journal. And you definitely can't focus on the big sources alone. You can't be like, well, do we get a good picture just by focusing on the top 500 policy sources? And that's because obviously context is very important. The whole kind of quality over quantity thing definitely applies here as well.
If you think about the IPCC, Intergovernmental Panel on Climate Change, that is obviously very, I think that's a good place to have impact. It makes for quite a good story. But on the website, they only have 50 or 60 different reports, right? The output isn't necessarily as great in policy document terms
as a lot of other sources that might be less important, for what that's worth. So for the best estimate there, assume we stick to those definitions, you know, state level and above, 10,000 sources, this kind of thing. I think there are about
3 million useful documents. That is, if you had 3 million documents in a data set, you could probably do a pretty good job of seeing where all this policy impact happens. And then there's an order of magnitude more of filler documents: for every document that cites its sources, there'll be, you know, regulations or
memos, committee notes, this kind of thing. Those are important from a policy perspective, but not so much from an altmetrics tracking perspective. So collecting all this data is hard. It's hard collecting data from scholarly websites, and it's much, much harder getting it from policy websites.
There's no good standard for metadata, so you routinely find policy documents with, for example, no date on them on the website. That seems like quite an important thing, but it's not there in any machine-readable way: there's no metadata in the PDF, and there's no date listed on the website. And, a more technical complaint, there are certainly no meta tags,
except maybe for social media. Often the policy is hosted by places with a very low budget for IT, where there's obviously one person in the think tank responsible for updating the website. So it'll be on off-the-shelf sites like Wix, or hand-rolled HTML, this kind of thing.
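To make that scraping problem concrete, here is a minimal sketch of the kind of fallback logic it forces on you: look for machine-readable meta tags first, then scan the visible text for a date-like string. The tag names and patterns are illustrative assumptions, not a description of any real crawler.

```python
import re

# Meta tags a well-behaved site might expose (rare on policy sites).
META_DATE = re.compile(
    r'<meta[^>]+(?:name|property)=["\'](?:dc\.date|article:published_time)["\']'
    r'[^>]+content=["\']([^"\']+)["\']', re.I)

# Fallback: a human-readable date buried somewhere in the page text.
TEXT_DATE = re.compile(
    r'\b(\d{1,2}\s+(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)'
    r'[a-z]*\s+\d{4})\b')

def guess_publication_date(html: str):
    m = META_DATE.search(html)
    if m:
        return m.group(1)                 # machine-readable: best case
    m = TEXT_DATE.search(html)
    return m.group(1) if m else None      # free-text guess, or give up
```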
It's not the case that you can go to these places and get documents to process in an easy, structured way. And that's what I've been working on for the past year, trying to address some of these problems. It's not anywhere close to 10,000 sources,
but I think it is enough to get quite a good picture and to get started on some of this research. I'm going to end there, but having said that about Overton, those are the next steps if you're interested. Especially if you have a research project in mind, if any of these questions made you think, well, actually, that would be quite interesting to find out more about, please come and talk to me. I've got data; you can have it. Please do the research.
Or otherwise, yeah, come and talk during the break. Thank you. So, I'm one half of Thunken. The other half, Damien, is sitting at the back. We're a tiny text mining and web mining company based in Washington, DC. We are obsessed with identifiers and URLs; anything that is dirty, not canonical, non-standard, that's what we're interested in.
And we started working on citation tracking a couple of years ago. So yeah, the work that we do is really a partnership with a nonprofit based in the US called the Free Law Project. They have a project called CourtListener that gathers court opinions, legal opinions, from all courts in the US: local, state, federal, and a few special courts.
So we'd also like to thank Mike Lissner from the Free Law Project, who helped us make sense of the data, because we're not lawyers. And Casey, who used to work at Thunken. A few words about Cobaltmetrics, because we're one of the smallest players in the field and one of the newest projects.
We're interested in what we call web-scale citation tracking. So metrics are a sampling game. Nothing new here. Imbalanced data sets reinforce discrimination. At first I used to talk about like there's a bias in the data. Now I talk about discrimination, linguistic discrimination. You know, it gets people more interested.
And to me, it is not up to citation aggregators and altmetrics data providers to define what is citable. So you cannot have an a priori filter on the type of identifier that needs to be used, the format, the language, the publication venue, anything like that. We shouldn't be the ones doing that. Our role is to observe all citation patterns on the web.
So to me, a citation from a document identified by a DOI to another document identified by a DOI is just as good as, say, a Twitter profile linking to a porn video. That's a citation. Maybe you're never going to use it, but we need to collect it anyway so it's there if you do need it. So with Cobaltmetrics, we crawl the web to index hyperlinks and PIDs as first-class citations.
And then we have what we call the URI Transmutation API to collect citations to all versions of a document. So the API takes one URI, one PID, and returns all the URIs and the PIDs that we know identify the same documents.
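The real endpoint's shape isn't shown in the talk, but the transmutation idea can be sketched with a local alias table: every identifier maps to a canonical document, and a lookup on any one alias returns all of them. This is an illustration only, not the actual Cobaltmetrics API.

```python
# Sketch of a URI "transmutation" lookup: map every known identifier for a
# document to one canonical ID, then return the full alias set on lookup.
from collections import defaultdict

class AliasIndex:
    def __init__(self):
        self._canon = {}                   # alias -> canonical id
        self._aliases = defaultdict(set)   # canonical id -> all known aliases

    def add(self, canonical, alias):
        self._canon[canonical] = canonical
        self._canon[alias] = canonical
        self._aliases[canonical].update({canonical, alias})

    def transmute(self, uri):
        """Return every identifier known to point at the same document."""
        canonical = self._canon.get(uri)
        return sorted(self._aliases[canonical]) if canonical else [uri]

idx = AliasIndex()
idx.add("doi:10.1000/xyz123", "https://doi.org/10.1000/xyz123")
idx.add("doi:10.1000/xyz123", "https://publisher.example/articles/xyz123")
print(idx.transmute("https://doi.org/10.1000/xyz123"))
```

An unknown identifier simply comes back alone, which matches the "one URI in, all equivalent URIs out" contract described above.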
So in terms of sources, I'm only going to talk about one today. We do crawl everything from Wikipedia (Wikimedia, sorry: all projects, all languages). Same thing with Stack Exchange. We have also started crawling data from Common Crawl, but today I'm going to talk about the work we do with US
legal opinions and the data that we get from CourtListener. So in terms of scope, we get all US legal opinions from 400 different jurisdictions in the US. The data source is obviously CourtListener, the project maintained by the Free Law Project. And in terms of citation extraction, CourtListener provides all the citations from legal documents to legal documents.
And we're adding citations from legal documents to anything else, which is all URLs and PIDs mentioned in legal documents. We add that and turn it into a searchable citation index. So if you think citations in our field are complex, welcome to the legal world.
This is only part of the table of contents of the Bluebook, the uniform system of citation for legal documents, which is a good reminder that uniform doesn't mean simple. So we have to handle all those cases, and most of the time CourtListener takes care of them for us. The citations in legal documents are old-style, text-only citations, but strongly formatted.
So CourtListener turns those into URLs to CourtListener, and then we take everything else and make it machine readable. Another challenge, and this is an example from a recent opinion of the US Supreme Court: the layout of legal opinions is very narrow, so URLs tend to be split when the line breaks.
So we've started to fix URLs. In some cases, like those, it's kind of obvious which tokens you need to join. In other cases, you don't really know if a token is the last element of a slug at the end of the URL, or if it's just natural language.
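A crude version of the obvious case, rejoining a URL split by a line break, might look like the sketch below. It is a hypothetical illustration: the ambiguous slug-versus-prose cases just mentioned would defeat a heuristic this simple.

```python
import re

URL_AT_EOL = re.compile(r'https?://\S+$')     # line ends mid-URL
URL_CHARS = re.compile(r'^[\w./%#?=&~-]+')    # URL-ish continuation token

def rejoin_split_urls(lines):
    """Join a URL fragment at the end of a line with the start of the next.

    Naive: any URL-ish token on the following line is treated as a
    continuation, so prose that merely looks like a slug gets glued on too.
    """
    lines = list(lines)
    out = []
    i = 0
    while i < len(lines):
        line = lines[i]
        if URL_AT_EOL.search(line) and i + 1 < len(lines):
            cont = URL_CHARS.match(lines[i + 1])
            if cont:
                line += cont.group(0)
                lines[i + 1] = lines[i + 1][cont.end():].lstrip()
        out.append(line)
        i += 1
    return out
```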
So we've started working on that. At some point, when the system becomes good enough, we will contribute the data back to CourtListener so that it becomes available for other projects that build on the database. And the last challenge, which happens in every domain, is link rot and content drift. Nothing lasts forever on the web, and link rot in legal citations has been measured.
More than 70% of URLs in the Harvard Law Review are now dead, and more than 50% of URLs in Supreme Court opinions are now dead, too. So this is real bad. And one of the solutions, which is not so different from the solutions we know in the scholarly world, is called Perma, Perma.cc.
It's kind of like the Wayback Machine at the Internet Archive. And that's one of the things we can use to still make sense of the data once it gets old. I'm actually going super fast.
And that's where I need some inspiration. So yeah, the API is public and it is free; it is just capped at some point, because our infrastructure cannot support hundreds of thousands of queries. So from CourtListener, we have about 3 million citations from US legal opinions to anything on the web.
Primary authority, which includes opinions, statutes, rules, regulations, legislation, anything legal, accounts for 99% of the citations. Secondary authority accounts for 1% of the citations.
That's everything else: citations to dictionaries, short URLs that take you somewhere else on the web, product descriptions that have been cited in a case, that kind of thing. And that 99-to-1 split is exactly the kind of thing we're interested in with Cobaltmetrics. We're interested in the long tail, in low-frequency phenomena, the stuff that doesn't happen often.
And I had a very nice screenshot, but you can go on Cobaltmetrics yourself. For example, there are 26 DOIs, or URLs whose host is doi.org, cited in CourtListener. And we're going to continue working on that to make more sense of scientific data being cited in legal data.
And all of this is part of Cobaltmetrics, part of our citation index. (I'm looking at one slide and advancing the other.) A note on reproducibility: in Cobaltmetrics, we aggregate many different data sources, so there are many moving parts. And we do love APIs, especially streaming APIs.
But it becomes complex when you want to make sure that the data set you use is reproducible. Say you keep pulling from CourtListener, which has an API, and you start comparing, I don't know, opinions from California to opinions from Iowa. If we keep pulling data from CourtListener, the corpus changes; how do you know whether you can still compare the numbers that you got?
So what we do in Cobaltmetrics is ingest the entire data sets, so that we control when and how the data gets updated. And with the API, you can get both a fingerprint of the whole database, so you know exactly whenever anything has changed, as well as the log of all the web resources that we remixed: CourtListener and everything else that we remixed to make this work.
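This is not Cobaltmetrics' actual scheme, but the fingerprinting idea can be sketched like this: hash each ingested dump, then hash the sorted per-source digests to get one fingerprint for the whole database. If any ingested source changes, the global fingerprint changes, so two analyses can check whether they used the same snapshot.

```python
# Fingerprinting sketch: per-source digests, combined into one database-level
# fingerprint. Hypothetical scheme, for illustration only.
import hashlib

def source_digest(name: str, payload: bytes) -> str:
    """Digest of one ingested dump, bound to its source name."""
    return hashlib.sha256(name.encode() + b"\0" + payload).hexdigest()

def database_fingerprint(digests) -> str:
    """One hash over the sorted per-source digests: changes if anything does."""
    return hashlib.sha256("\n".join(sorted(digests)).encode()).hexdigest()

snapshot = {
    "courtlistener": b"...opinions dump...",   # placeholder payloads
    "wikipedia": b"...articles dump...",
}
digests = [source_digest(n, p) for n, p in snapshot.items()]
print(database_fingerprint(digests))
```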
A few words in conclusion about Cobaltmetrics in general. We're currently mostly closed source, but a few bits and pieces have been open sourced. We have recently released an open roadmap for the next data sources, the next features, and the next endpoints in the API that we're going to release.
And everything on the website is now CC BY, and most of the sources that we remix are either CC0 or CC BY. So you can reuse the data that we produce fairly freely. And yeah, that's all I have.
Question. Thank you. Sorry. Yes, please.
Less of a question, more of a comment: there are DOIs for porn videos. Just worth noting, in case anyone turned up their nose at that example. These are things which are registered, so that's just worth saying. And I think it's really important to focus on this kind of low-level recording of linkages between things. I think this approach is great.
Have you found an audience that is maybe more concerned with metrics open to ideas of talking about this kind of underlying linkage? Like, what's your general response as you talk about this kind of thing? Most of the time, people who work primarily in English,
and on works that are privileged enough to be assigned DOIs, don't really get the point. And the feedback that I usually get when we talk about Cobaltmetrics, and not only CourtListener, is: okay, so we're missing maybe 1%, tops, of the global production. So do we really need another project for that 1%? Or couldn't, you know, Altmetric come up with a new feature, or Crossref Event Data, or something?
And my answer is that maybe it's only 1%, and surely it's even less than that. But that 1% may account for 100% of my scientific production that is not picked up by solutions that focus on DOIs, or on sources that are mostly in English.
And I don't want to have the same conversation about everything: oh, publications are fine, but what about data, software, etc. So really, a citation is a citation: a web resource that links to a web resource is enough for us to collect it. And then we need to collect everything for you to find what you want to find,
because it's easier to filter out than to add back. So I don't know if that answers the question. But yeah, to answer the question: the people who are interested in what we're doing are mostly people who work on languages other than English, in the social sciences and humanities, in less privileged domains, fields, areas. And do you think the people who are part of the English-language communities,
have the same problems but don't realize them? So I think, and maybe that would be a better fit for the panel, but I think the domain has been very much saturated by a few players. And if that's the only data that you can get, well, that's the only data that you can get.
And it's good because that's the only thing you can get: it's hard to build a new infrastructure to collect citations by yourself; you cannot do that on the side. So I think, yeah, we need to explain to people that it's fine that they cannot find everything about their data in other solutions, and that maybe by adding Cobaltmetrics on top of everything, they'll find more citations and, you know,
traces of attention and impact. I'm so sorry, we are a bit late, but I would like to invite you to ask these questions to Luc during the coffee break. I'm so sorry. And we return here in 10 minutes.
Thank you.