Outside the article - tracking other research outputs
Formal Metadata
Number of Parts: 12
License: CC Attribution 3.0 Germany: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/46289 (DOI)
Transcript: English (auto-generated)
00:00
Yeah, I think altmetrics are in a really good place for us to apply to data, to help us do what we need to do. Because citation counts, yeah, they were not designed for data. With altmetrics, we are in a situation where we could possibly tweak them and make them more suitable for data. And altmetrics are more likely to be able to answer the question of how people use data
00:22
than citation counts are. Thank you. I have no idea what I'm doing with this computer. And I have no idea what I'm going to talk to you about, either, so that makes two of us. Oh, OK. Cheers, everybody.
00:40
We've broken everything else. It's a PDF. Yes, it's a PDF. Come on, Doctor. Come on, then. Thank you. I know what I am doing now. It's a bit stretched.
01:02
I didn't exactly do that. Anyhow, I'm not going to talk exactly about what you want to hear, but I hope I'm going to talk about something which is relevant when we bring up metrics for tracking something other than articles. I'm actually going to repeat quite a few things that have already been discussed. So, CERN.
01:21
Who knows what CERN is? OK, that makes it easier. So, we are about 10,000 people, 10,000 scientists. I don't know why the thing restarted here. This is not good. Just because. Sorry. All my slides have something written on the bottom, so here is something on the bottom.
01:53
Yes. Anyhow, we've taken a large number of people and a large amount of time to build a very big machine. I'm sorry, I'm never going to like this thing,
02:01
because we don't want it to do something like that; we aren't going to scroll like that. A large number of people and a large amount of time to build a very big machine with which we do physics.
02:22
And one of the results of our physics research has been, very recently, to discover this thing. Of course, this is not a very different kind of scientific output from other scientific outputs, in that what we have done after we discovered this boson is to publish a paper on the discovery
02:41
of the Higgs boson, which looks like every other scientific paper. So I'm not speaking about data yet. That's what the scientific paper looks like. On page number four, there is a small green blob, which means we have discovered the Higgs boson. Very good. There is one small thing that makes this paper look different from other papers, which is the amount
03:00
of data behind it and its pages and pages of authors. After having written this paper, there is a number from our history which matters here: 60 years. Why am I saying 60 years?
03:21
Because actually, Monday is our birthday: CERN turns 60. Some of us have been here for almost half of that time, which makes you realize that this is not so much. So these 60 years are actually relevant to what I want to say today.
03:41
60 years ago, or 61 years ago, the convention was written which gave CERN its mandate, what CERN had to do. You can see how the world looked 60 years ago: they didn't have any laptops, and somebody was smoking in the room. And by the way, the man there,
04:01
François de Rose of France, passed away earlier this year. He was 103. And he came, a few days before passing away, to give his last speech at CERN. So he had seen the entire 60-year history of CERN. What the convention says, and remember, 1953 was a very strange time in the world and in Europe,
04:21
as Europe was rebuilding itself, is that the results of CERN's experimental and theoretical work shall be published or otherwise made generally available. What does it mean for us to make our results generally available? On the one hand, it means open access to papers. Next year, we should reach 90% open access to our scientific outputs.
04:42
And we are aiming for 100% by the end of the mandate of our current Director General. But that is not this conference; that was last week. This conference is about data. What does it mean in the field of data? What does it mean to make our raw data publicly available? You think that this is what a collision at the LHC
05:01
looks like: nice, clean, with nice lines. Mostly, we get this kind of mess. And we get about 25 billion of those every second. You cannot save 25 billion of those every second. You only save about 100 of those every second on tape, which still means 400 petabytes of data on tape.
05:23
This is not the kind of data you can share. It's not the kind of data that makes any sense to cite, let alone track. What can it be used for, and useful to whom? The reason I showed you those papers before, the papers
05:41
on the theories that only got proved right later on, is that there is a creative tension in this business between having an idea and verifying it, and then checking in your data whether it makes sense or whether there is something else. If you are just thinking up new theories and new states, they remain dreams; we would just be in one universe dreaming up parallel universes,
06:00
and we could not talk to the others. But being aligned with the data that are there can help you figure out whether these theories are working or not. And this is the job of most of the community, who are theoretical physicists. What the theoretical physicists need is not something like that, not the raw events. They don't get very happy with tens of petabytes
06:21
of them, inflexible and huge. What they need is something which is very basic. They need the small green blob that I showed you in the paper before, which is this figure. And they just need this figure in a very simple, plainly numerical format. And if you print it, that's about 20 lines of text.
06:41
So from 100 petabytes on tape, you need to go to 20 lines of text. The point is, and this is very similar across disciplines, that there is no infrastructure to share these 20 lines of text. If you know somebody, they will extract the figure as text for you and you're going to use it. But this defeats the entire purpose of open sharing
07:02
of knowledge and of making our results publicly available. So we have, with our small open data team at CERN, in the last two, three years, been working very closely with the experimental collaborations and with the theory community, to understand how we could work together to help them share these results for reuse.
07:22
The tool that we have used is this one. It is called INSPIRE; INSPIRE is what Sara was mentioning. It's the digital library we operate at CERN together with partners in America and China, which serves scientific information to our discipline. So it's a closed, discipline-wide system. We have one million bibliographic records.
07:41
Half a million of them are open-access full text. We have 20 million citation triples. And we have fully disambiguated about 20,000 authors. We have 50,000 users, which means every living and active physicist in this field uses this system. And we have about two searches per second. You can do interesting science with this.
08:01
Here are the searches on INSPIRE by time of day: this is lunch time, this is night. So I can demonstrate that half of the physicists work overnight, and 10% work all night long. Nobody ever measured that.
08:21
It's true that 10% of physicists work overnight. So if everybody uses this system, you can do something interesting, which is to use this system as a platform for open data, for data sharing, for data publication, for data citation, and to try to bring about all the ideas that the speakers before me have been discussing.
08:41
Let's see how we make this work. That's what an article looks like; that's the discovery paper. It is an article, very boring. There is a button there which says "show all the 2,000-odd other authors", which you don't want to click. There are the citations. By the way, somebody before was mentioning that we don't own citations. Actually, we own our citations.
09:01
We text-mine everything, and the citations are free. We know who cites this paper, we understand citations, and we are able to count citations and allow navigation. And very recently, by the way, you might have noticed that all the other screenshots you have seen today are beautiful. The screenshots from the publishing industry
09:21
are totally beautiful pages. We have to work very hard to make this look ugly, because our community asked us to keep the look and feel that it had in 1993, because the predecessor was actually the first website in the US and the first database on the web.
09:41
And I'm so old that I have the original email in my inbox from Tim Berners-Lee, when Tim BL announced that this database existed. So it looks very ugly. And there's something new now, which is a tab over there which is called Data. If you go on the Data tab, there are a few data sets.
10:03
If you go to the record of a data set, you see something like what was shown before. There is a data set which has authors, there is a suggested citation you can use if you wish, there is a DOI, and you can get the data out.
10:21
So that's what we have done last year. What has this implied after one year? What have we observed? The first thing that we have observed is that the theorists, they like it. There's a tweet from one of these people trying to reconstruct and interpret this data, saying: the data on INSPIRE is like the day you asked for a pony and somebody gives you a unicorn.
10:45
And if there was one good reason to get a Twitter presence for INSPIRE, it was to get this tweet, of course. The second thing that has happened is that we have put a DataCite DOI on this data set. What has happened?
11:01
And we have done nothing; we just left it in the wild. A bit at a time, things have started popping up and creeping up. Every year in INSPIRE, we ingest between 50,000 and 100,000 records. We text-mine the full text to spot references, and we build our own reference networks. So we've trained our machines to recognize a citation
11:23
to a DOI of a data set that we have. And sure enough, a little bit at a time, five citations suddenly popped up for this data set. The data set has been cited five times. It's being co-cited with these records, which actually is interesting, because it lets you understand what
11:42
people read when they use the data. And without any intervention from our side, this is what a citation looks like. It looks exactly like what we had suggested as a citation: the collaboration, the data set, and then the DOI. People will do it. Fine, only five papers have done it.
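To make that concrete, here is a minimal, hypothetical sketch of the reference-mining step described above: scan the reference text of incoming papers for DOIs and keep the ones that match data sets the service hosts. The regular expression and the data-set DOI are illustrative placeholders, not INSPIRE's actual implementation.

    import re

    # Generic DOI pattern; real reference extraction is far more involved.
    DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')

    # Hypothetical DOI of a data set hosted by the service.
    KNOWN_DATASET_DOIS = {"10.7484/inspirehep.data.example"}

    def find_dataset_citations(reference_text):
        """Return the known data-set DOIs cited in a block of reference text."""
        found = {m.group(0).rstrip(".,;") for m in DOI_PATTERN.finditer(reference_text)}
        return sorted(found & KNOWN_DATASET_DOIS)

    refs = "[12] ATLAS Collaboration, dataset, DOI: 10.7484/inspirehep.data.example."
    print(find_dataset_citations(refs))  # ['10.7484/inspirehep.data.example']

A recognizer like this, run over every new record, is enough for data-set citations to accumulate without any manual intervention, which is the effect described in the talk.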
12:01
But this has happened in the wild, because all the steps that we laid out, all the pieces of the infrastructure, became so obvious as to be invisible for this kind of thing. And then something funny started arriving. People come to us and say: well, it was good that you did this with data. You know, I also have some data. I'm a small team. I don't have an LHC in my garden,
12:21
but I actually write code, and people might use this code. Can I also get one of these numbers that I can report? And we are seeing this about once a month; now it's getting to once a week, somebody coming and asking. You see, word spreads: everybody lives in this system all day long. So this is happening in the wider world.
12:43
And we have today an ad hoc solution. Somebody came and said: I have this code on GitHub, can I put it in my paper on INSPIRE? Yeah, sure. We did an intermediate step through another platform that we run at CERN, Zenodo, and we now have a record inside INSPIRE where the software appears with a DataCite
13:02
DOI, which is minted on that other platform. So those two authors can represent the entirety of their scientific contribution in a single place. And that's exactly the key thing that we are observing about getting scientists involved. I don't have much to say about scientists and what they want or what they want to do.
13:22
Yeah, it's more: how do you get scientists involved? What is a scientist? What is an author in our field? So I can introduce one author, one of the best people in this field, Professor Kyle Cranmer at NYU. Kyle, if you look there,
13:43
is three pages down the author list. But here is one interesting thing: he wrote an additional piece of software to analyze some piece of this data. And the person who wrote that additional piece of software is just one of those authors; there are 3,000 of them.
14:01
So what happens when you click on his name? You see one particular feature that we offer in INSPIRE, which is that everybody has an author profile. The difference with the author profiles that we present is that this is how people get hired in our discipline. Because if everybody uses this system,
14:21
every hiring committee uses this system. And when you look for somebody, you tend to look up the person first, and then you look at the papers. We launched this three years ago, and it's the single most used feature that we have. So this can be a powerful route to physics data citation and to building this infrastructure.
14:41
Because people think about people; we are sociable apes. We are not machines that look at data, though some of us are. Two things. The first is that we present, usually, who this person is: all the citations, all the papers that this person has written, every metric that exists
15:01
and everything you want. This goes beyond what you can get anywhere else, because we also have pre-publication material, so these numbers are higher here than elsewhere. And then the second thing is the rest of the person's output. The first step that we have taken is that, in addition to the publications of the person, we can now show all the data sets and code
15:20
and additional kinds of material that this person has produced. So we have already achieved visibility for the scientific outputs of somebody who shares those scientific outputs. The next step that we are about to take, but first we want to build up enough data sets so as not to look like we are serving special interests,
15:41
is to elevate the data to the same standing. Because there are all the citations, all the papers of this person, all the conference papers, all the books. What I can do next, and it's really a quick switch, is to add the citations to all the other stuff that this person has done: this is your software, this is your code. The moment you do that,
16:00
you have generated a closed market, demand and supply in a closed system, and everybody will share more data and cite more data, because it shows up in their profile, which is used for hiring, assessment, and analysis. And of course, once you click on the profile here,
16:22
you go to the author's profile page and find the data for them. So we have closed the loop in this particular context. Now, it may look like I have been talking about something else here. I am not talking, and I have not talked,
16:41
about how we use altmetrics to track data. I don't know how often these data have been tweeted about, blogged about, or have appeared on any other kind of platform. I do know how often the stuff is downloaded, and, I tell you, if you don't want
17:02
the system to be gamed, these numbers stay in our logs; they don't go out. So, what have I talked about here? I've talked about something the scientific community seems to want, which is: can you take data and code and make them look like the things we already count? Can we have more of the same?
17:25
And that is exactly the opposite of what we seem to be talking about here, which is how to measure stuff in a different way. In the case of our scientific community, the ask was: can we measure all this new stuff the way we used to measure the old things? This seems to be a bit of a sad note on which to close,
17:43
and therefore I'll show another picture. This guy himself, the moment we have all these scientific outputs out there, including the data, all with DOIs, finds it not very difficult to pull them all out. So it's not very difficult for him
18:02
to build an automatic widget for everything he has. He used to build an automatic widget for all the papers he has in INSPIRE, and now he can build an automatic widget, through the altmetrics providers, for all the data that he has there. So I guess that the lesson this gives us
18:20
is that not every scientific community is ready to use different ways of measuring things, and the different ways of measuring things can scare people off. But if we build a gradual ramp for things to evolve, starting from an existing mental model into a new one, then people are going to run after you to say: can I have more of the same, please?
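As a sketch of the widget idea just mentioned: once every output has a DOI, a researcher can query an altmetrics provider for each one. This assumes Altmetric's public per-DOI endpoint and uses placeholder DOIs; it is an illustration, not the speaker's actual tooling.

    import json
    import urllib.error
    import urllib.request

    # Altmetric's public API returns a JSON summary per DOI (404 if untracked).
    ALTMETRIC_API = "https://api.altmetric.com/v1/doi/"

    def attention_for(doi):
        """Fetch the altmetrics summary for one DOI; empty dict if untracked."""
        try:
            with urllib.request.urlopen(ALTMETRIC_API + doi, timeout=10) as resp:
                return json.load(resp)
        except urllib.error.HTTPError:
            return {}

    # Placeholder DOIs standing in for a researcher's papers, data, and code.
    outputs = ["10.1000/example-paper", "10.1000/example-dataset"]
    for doi in outputs:
        summary = attention_for(doi)
        print(doi, "tweets:", summary.get("cited_by_tweeters_count", 0))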
18:42
Thank you.
19:01
What this relies on is that we have a very closed world and one single system. Would all this work for social sciences, biology, and whatnot, which are much more dispersed across very different places? Well, we had nothing of this two years ago.
19:21
We had zero of this. We didn't just set up these fantastic things and announce: there they are, we're going to run our own data team and serve them. What we did was to listen to people. If you listen to what people want, you give them what they seem to need. That seems to have worked for us. Now, you don't need to be a physicist,
19:42
and you don't need to be at CERN, in order to ask people what they really want and to deliver it to them. And since we are a small platform compared to the larger players we've been discussing before, we are in an even better position to affect uptake and social trends, because we have only 50,000 people on our platform.
20:00
We are not enormous, and if anything that makes it easier. But it doesn't change what people want. There was another thing which popped up before, but that happened elsewhere. OK, so we're cutting into lunch.
20:22
And then we'll have everybody there. I think we're going to run over at the end, so we might not have time to grab everyone's questions at the end. So do grab a salad or something after this. But up next, hopefully, Kat, can you help us with that? OK. Yeah. Thank you. So let's see. So I'm probably going to come in between these two
20:42
talks, in terms of talking a little bit about metrics, but maybe not a huge amount about metrics. I'm going to talk about things from the point of view of software. And I guess first: is Simon here? Can you put your hand up? No Simon? Sorry, I was going to put an elephant figure
21:01
on this picture, and I just wanted to see if anybody here was going to be upset. OK, he's not here; it's OK. OK, very good. Thank you. Sorry. Let's see. So, the National Science Foundation: I'm in the Division of Advanced Cyberinfrastructure. And what NSF does, basically, for anyone who doesn't know,
21:23
yeah, this is amazingly small, is to help promote the progress of science and to advance the national health, prosperity, and welfare. We have about $7.2 billion that we spend. About 95% of that goes out to external entities to do research.
21:41
And a little bit of it is used internally for our process. And we fund 24% of all federally supported basic research at U.S. colleges and universities. And in many areas, we're the primary funder. In some cases, like mathematics, something like 95% of the research funding comes from us.
22:02
So this is just to review what we do. I know you can't see this very well; I can't see it very well. But basically, it's biological sciences, education, engineering, geosciences, mathematical and physical sciences, and social sciences, more or less. And the one that I left out in the middle is computer science.
22:21
And so that's where I fit in, in this third box down, which is the Division of Advanced Cyberinfrastructure. And so we think of ourselves in two roles. One is that we fit into computer science, because we're taking a lot of computer science tools. But we really try to span across all of these areas,
22:41
because the cyberinfrastructure that we're developing is something that should be of use to everybody, and its components come from all these different areas; together they build up that cyberinfrastructure. So specifically, what we do is, importantly, to help acquire and provision state-of-the-art cyberinfrastructure resources,
23:01
tools, and services; to support research and education to make the cyberinfrastructure better over time; and then to serve the community of scientists and engineers across all disciplines, to make sure that they have the infrastructure that they need to do their work. So I keep saying cyberinfrastructure. Here's the definition that I like: computing systems, data storage systems, advanced instruments,
23:22
and data repositories, visualization environments, and people all linked together by software and high-performance networks to improve research productivity and enable breakthroughs not otherwise possible. So sometimes this might be an e-science infrastructure or e-research infrastructure or something else like that. But sometimes when people say those things,
23:40
they kind of leave off the data and the software and some of the other pieces. So cyberinfrastructure is really meant to be a pretty inclusive term. Software is one of the things that I mentioned as infrastructure. And so we have a bunch of clusters in my division; I'm running the software cluster. There's also a data cluster and an HPC cluster
24:01
and a couple of others. I'm going to talk primarily about software, but I think most of the things that I'm saying also go for data in one way or another. So software is the thing that sits between the science and the computing infrastructure, and it actually makes the science possible to do
24:20
in terms of computational science. At some point I was looking through Science magazine, through a few different issues, and noticed that about half of the papers seemed to be software-intensive. And of the ones that weren't software-intensive, many had pretty large software components as well that really made them possible. I think research, and science in general, is becoming more dependent on advances in software.
24:42
Significant software development happens across NSF in a variety of domain-specific areas as well as computer science areas. There's a wide range of software types, from system software to applications, modeling, gateways, analysis, algorithms, middleware, and libraries. Software is not a one-time effort; it has to be sustained.
25:01
And it involves development and production and maintenance and all these things are people intensive as opposed to machine intensive. Software lifetimes are long as compared to hardware. They're probably short as compared to data, but that's a slightly different issue. And software often has an under-appreciated value. And so I think one of the things I want to say about this
25:21
is that for software to be sustainable, it has to become infrastructure. I don't necessarily mean that in the invisible sense; I mean it in the sense that it is relied on. Or maybe "infrastructure" is actually going a little bit too far. So, the software infrastructure projects that we have: we think of software at a few different levels.
25:44
We think of it first in terms of elements, little pieces of software that are self-contained. And we've funded about 65 of these projects over about five years. These are often one to two PI, three-year projects. We think then about software frameworks which take multiple elements and put them together.
26:03
The elements can be funded by us, they can be funded by somebody else, or they can be open source with nobody funding them; but the framework is somehow a larger component. And we've funded so far four rounds of these, about 35 projects over four years. And then, at the far end, we think of institutes. And these are the things that I would say
26:21
are really aimed more at communities and less at people. So how do we make software standardized for disciplines? How do we make sure that the kind of underlying libraries that are needed work across different software expectations? How do we educate and train people to build software and then to use software? These are the kinds of things that institutes would be looking at more than purely software development.
26:43
And as we go across this spectrum, the communities that we want to impact get larger when we're thinking about which projects we're going to fund. We also then want everything to be reusable. The slides are on SlideShare now, and there's a link that I put on Twitter, so you can look them up there.
27:01
When projects come out of this process: we have a bunch of proposals, we go through peer review, and the peer reviewers look at the software aspects of things, and at what they think the science impact is going to be. So I, in this kind of cross-NSF role, take that information, and then I try to do matchmaking
27:21
to find other areas of NSF that want to support this software. So if we have something like a bioinformatics application, what I would call a uni-disciplinary project, I go and try to find the bioinformatics program officers and ask: is this something that's important to you? And they say yes or no, and that often leads to a funding decision.
27:42
We're assuming we have a good proposal to begin with; there's something that happens before this. If I have something that's multidisciplinary, I can work with multiple program officers. In this case, I work with chemistry and materials science program officers, and if I get one that's interested, then that's good enough to go forward. If I get something that's really going across all disciplines,
28:02
like a web server or a math library, then I have to try to work with all program officers. And the answer from each program officer is usually: it's not really relevant enough for me alone, so you should figure out if you want to fund it or not. And we do that. And so, in order to actually judge the software, what all these program officers and I are doing is trying to understand or forecast the impact.
28:22
Either understand the impact that's happened in the past, or forecast the impact that's going to happen in the future. And so that's where we get into questions about how we actually do this and what the metrics are that we should be using to think about these things. So if you're thinking about somebody that's developing an open source physics simulation as an example, we've got a bunch of different things that we can look at,
28:42
from how many times it has been downloaded, to what papers have cited it. And these go from easiest to measure and least valuable, to hardest to measure and most valuable. And unfortunately, that's the situation we're in: the things that we really want to know, we don't necessarily have a good way of measuring,
29:02
and the things we can measure are not the things that are really very helpful to us necessarily. If we think about somebody that's building an open source math library, there's probably similar metrics, but the citations of that itself are less likely. Because it's not clear who would actually cite a math library. The user of a physics application that uses the math library
29:23
probably isn't thinking about which libraries are underneath. So we also have the problem that users don't necessarily download things like this either: they could be part of a distribution, they could be pre-installed on the system that somebody's running on, they could be part of a cloud image being used in an application, or it could be a service where they're not actually downloading it, they're just calling it somewhere.
29:44
So this is what we're thinking about in terms of impacts of measuring things that exist. In terms of future impacts, one of the things that we ask for in these proposals is what's the impact of your project going to be if we fund it? We don't say what the metrics should be, we let the proposer suggest what metrics they want to be evaluated against,
30:03
and we use that in a peer review process to judge, are these the right metrics? Are they going to lead to something that we're going to think is going to be successful? So that's kind of how we're handling that at this point. So as a program that we're talking about here, we work with other units to support projects that lead to software as an element of infrastructure.
30:23
One of the issues is that the amount of software that is infrastructure grows over time, and it's growing faster than our funding, which is more or less flat with inflation at this point. So the question that comes up is: how do we ensure that the software infrastructure that we need continues to appear without us actually funding it all? And the answer to this is incentives.
30:42
We want other people to want to do this, not because we're paying them but because there are other reasons for them to do it. And the question is actually: how do we do that? So we're talking in other forums, and in particular in this effort called WSSSPE, Working towards Sustainable Software for Science: Practice and Experiences, which you can Google. I picked an acronym that nobody else has picked,
31:04
so Google actually does very well with this. We've had two previous workshops, and we have another one coming up at SC14 in November, and so if people are interested in thinking about basically how some of these things happen, I'd be very happy to have more people come to this. The lessons that came out of these initial workshops, I think there's probably two lessons that I wanted to highlight.
31:26
One of them is that many of the issues that seem to occur in developing sustainable software are social issues and not technical issues, and so we need to think about this in a cultural sense more than in a technical sense. And the other is that software work is inadequately visible in the ways that count
31:42
within the reputation system that underlies science at this point. And so these are really the things that we're thinking about. So I've had a few little things in red rectangles. They've actually been these four things primarily. So we need to be able to forecast impact in order to judge software.
32:00
We have to try to make sure that people want to produce software, not just because we're paying for it. We have these problems: the issues are social in many cases as opposed to technical, and software work is not visible, not in the ways that count. And so we have a hypothesis that says that better measurement of contributions
32:21
can lead to rewards or incentives, which can lead to career paths, which can lead to willingness to join communities, which can lead to more sustainable software. And this is really what we're trying to do, and it's all driven by, it basically starts with, having better measurements. I don't know what these measurements are, necessarily. That's part of the reason I'm coming here: I want to get people that are interested in this to really think about
32:41
what the metrics should be for software. Actually, I want to just mention one quick thing based on the previous talk. There is a group of people at NCAR, the National Center for Atmospheric Research in the US, who decided they wanted to try to get people to cite things that they were producing. And so they suggested that people cite their computer system when they used it,
33:01
and cite some data sets when they used those data sets. And they then did a study, over a year-long period I think, of whether people actually did this. What they found was that about a third of the people that they could find did cite them, citing them as citations, as endnotes or footnotes. About a third of people didn't actually use the language they suggested,
33:22
but instead put something in their acknowledgement statement, saying that this is something they used. And the other third of people didn't do either of those things. They put something just in line with the text saying this. And so the NCAR folks were able to basically do a lot of text mining and find these things and determine what had happened. But they weren't really able to encourage people to do what they wanted.
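Here is a minimal, hypothetical sketch of the kind of text mining this would require: given a paper's full text, classify a mention of a resource as a formal citation, an acknowledgement, or an inline mention. The sectioning heuristic is deliberately naive, and the system name in the example is only illustrative; this is not NCAR's actual tooling.

    def classify_mention(full_text, resource):
        """Crudely classify where a resource is mentioned in a paper."""
        lower = full_text.lower()
        pos = lower.find(resource.lower())
        if pos == -1:
            return "no mention"
        # Naive sectioning: look for common section headings, if present.
        refs_start = max(lower.rfind("references"), lower.rfind("bibliography"))
        ack_start = lower.rfind("acknowledg")  # matches both spellings
        if refs_start != -1 and pos >= refs_start:
            return "formal citation"
        if ack_start != -1 and pos >= ack_start:
            return "acknowledgement"
        return "inline mention"

    paper = "... Acknowledgements: we thank the Yellowstone system. References: ..."
    print(classify_mention(paper, "Yellowstone"))  # acknowledgement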
33:41
And so I think that's an issue that we have to think about: how to actually get users to do this. So we're trying to do a few different things moving forward. As I mentioned, I'm very eager to have ideas from this audience and from others. From the point of view of NSF, we put out a Dear Colleague Letter, which is basically kind of a reminder that we have programs available that people can apply to.
34:03
And we were basically asking for people to study norms and practices for software citation and data citation in support of scientific discovery. And what came out of this is that we have six EAGER projects, as we call them, smallish projects for one to two years, that have been funded,
34:23
and three collaborative workshops. And so we'll let those go for a year, see what happens, and see if they lead to results that are big enough that we build a larger funding program out of them, or if they just lead to results where we say: OK, these are the results, and now we know what to do with them. But that's where we are there.
34:40
I'm also thinking about this as a researcher. So I'm thinking from the point of view that I would like to see a system where products are registered, and I would call products software or papers or data. When they're registered, I would like to imagine that people can create a credit map for those products, which is a weighted list of contributors. And contributors here are authors, acknowledgements, citations, anything else that's used.
35:05
I know this is difficult, and we've had a lot of discussions about whether this is really possible or not, and I'm not sure of the answer to that, but it's an idea at this point. A DOI or something like that would be an output that goes with this. And this leads to this idea that I think is actually fairly important: transitive credit.
35:22
And so there's one paper that I wrote just about this idea, and then I've written a recent paper with Arfon Smith that tries to figure out a little bit of how we might implement this as well. So the idea here is that if paper 1 provides 25% of its credit to software A, it registers that in its credit map; and if software A previously had a credit map where it provided 10% of its credit to library X,
35:43
now we can see that library X is getting credit for paper 1. And that's the thing that I'd like to be able to do through provenance, through something like this. I don't know exactly how, but all the stuff that's indirectly being used doesn't really get a fair shake at credit at this point. And I'd like to figure out ways to encourage people to do those things underneath.
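As a minimal sketch of the arithmetic, assuming credit maps are stored as weighted contributor lists whose weights sum to 1, credit can be pushed down the graph recursively. The products and weights below combine the talk's example with invented placeholders, and the sketch assumes the credit graph has no cycles; it is not a real registry.

    # Hypothetical registry of credit maps, keyed by product identifier.
    credit_maps = {
        "paper-1":    {"author-1": 0.50, "author-2": 0.25, "software-A": 0.25},
        "software-A": {"developer": 0.90, "library-X": 0.10},
        "library-X":  {"lib-author": 1.00},
    }

    def transitive_credit(product, share=1.0, totals=None):
        """Recursively push a product's share of credit down its credit map."""
        totals = {} if totals is None else totals
        for contributor, weight in credit_maps.get(product, {}).items():
            if contributor in credit_maps:   # contributor is itself a product
                transitive_credit(contributor, share * weight, totals)
            else:                            # contributor is a person
                totals[contributor] = totals.get(contributor, 0.0) + share * weight
        return totals

    print(transitive_credit("paper-1"))
    # library X receives 10% of software A's 25%, so its author ends up
    # with 2.5% of paper 1's credit, the chain described in the talk.

Note that everything indirect falls out automatically: each product only has to declare what it uses directly.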
36:04
So this helps developers show that their tools are important. There's a bunch of issues that I really don't have solutions to at this point. One is trusting the person that registers a product to do the credit map in a way that all the people who have contributed are happy with. And I think we kind of know some solutions to this, to some extent.
36:22
This is the discussion that already has to happen with paper authorship listings, and there's a process by which most authors end up being happy with the way that's done. So I think this is kind of a social thing that we can do. Technologically, how do we actually record this and how do we register it? Can we do this on top of the DOI system? Can we do it on a separate system?
36:41
How does the metadata work? Are there standards? There are a lot of questions. I'm really just bringing this up as an idea that I think is worth exploring. The community as a whole, and here I'm speaking to some extent about the people who have been coming to some of the WSSSPE events and communicating through that, are asking questions like: is there a role for non-tenure-track researchers
37:02
who produce software or data in universities? And everybody seems to think the answer should be yes. And then the question is, do universities recognize and support this? And most people would say the answer is no. So then the question that comes along is how to get universities to do this. And my feeling and the only answer that I've had from people
37:22
is that if we take the young people that believe this and we wait 30 years, they'll be the older people that will be running the universities. I think that's probably true, but I'm hoping that we can do something that's a little bit sooner. And the question is exactly what? I think actually in the UK there's probably been a little more progress in this, and maybe it's because of the funding model for universities
37:41
and that things are a little bit more centralized. What's needed to support reproducibility of science, in terms of both data and software, is again more of a question. I think there are a lot of entities that have similar interests in software and data: the research councils here, NIH and DOE in the US, and the Sloan and Moore Foundations in particular
38:01
have been doing some interesting things with trying to get universities to change the models that they're using and try some new models for supporting data. The Mozilla and Apache Foundations, for example. And I guess I didn't mention things like CrossRef and VIVO and other related efforts. There are a bunch of other things as well, such as RDA.
38:22
So I'd like to encourage some participation in this series. I see that as an opportunity to think about some of these things and talk about them. And other ideas and questions are also welcome now or in the future. I've got some email addresses there. There's a Twitter handle that I've been using as well that was on the first slide. You can find it there.
38:40
Just quickly: in the slides there are a bunch of links to different stuff that I've talked about, which people can look up later. And there are a lot of different people who've been involved in working on different parts of this and coming up with some of it. So I'll just leave that there and say thank you.
39:04
Maybe altmetrics aren't ready, or there are areas of non-traditional research outputs that aren't ready for altmetrics. And then we saw how things like a credit map can help track things better, though again, it seems to focus more on citations than on altmetrics. Does anyone have a quick question for Dan at the moment?
39:21
Yeah, Dan. Right. So the question was about the credit map and how you define its limit. And I think the thing is, because of this idea of transitive credit, you only have to worry about the things that you're directly using.
39:41
Anything that's indirect you don't worry about, because the direct usage of it has already recorded those things, in an ideal sense. So then the question is: of all the things that you're using, how far do you go? And I would say that anything you would put in a citation list, or in an acknowledgement statement, say in a paper, is the kind of thing you should be thinking about.
40:01
But this isn't a paper; who's drafting this? I'm sorry? Should these people not get credit for helping me in this way, in this presentation? Well, most of these people actually are in papers. These are the people that I would acknowledge if this were a paper, the ones at the top; and the people down at the bottom actually are paper authors.
40:21
So are you suggesting that software should be dealt with in a very similar fashion?
40:44
Yes. So I'm actually suggesting that all software and data should be dealt with in a similar fashion. And I think one of the issues that comes up with this is that it's not clear to me that it's practical to do that using the system of taking notes and then kind of going back over the notes and trying to figure out which of these things led to this paper.
41:02
So I think we really need systems that do this, particularly for electronic information like software and data. And I'm actually hopeful as we move forward with provenance systems that really do more tracking of workflows and let you kind of go backwards and forwards over different ideas that some of this can automatically be extracted from those.
41:21
One last question? With this idea of credit, if you have something really pervasive, maybe the perfect example is NumPy, it would be getting credit from everything, right?
41:44
Is that... Yes, I think that would really be good because I think that stuff like that isn't getting...
42:05
So in theory NumPy would have a credit map of its own, and if they chose to credit Python, which I assume they would, then yes. Because it keeps going down, right? It's turtles all the way down. So thank you once more to Sarah, to Salvatore, to Dan.
42:24
It's interesting that Stan is going to follow on from this after lunch, but we've already run out of time.