
Reusability Through Community-Standards, Tidy Data Formats and R Functions, Their Documentation, Packaging and Unit-Testing


Formal Metadata

Title
Reusability Through Community-Standards, Tidy Data Formats and R Functions, Their Documentation, Packaging and Unit-Testing
Part Number
5
Number of Parts
9
License
CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Transcript: English (auto-generated)
Day four or five, so we're nearing the goal. Reusability is the topic for today. And as usual, I would like to give you a short introduction here. So again, we have four central topics here concerning reusability.
So one is, of course, that metadata and data should be as accurate and detailed as possible. So one should make sure that they are actually described in a sensible way.
And this includes, of course, a clear and accessible usage license. The whole topic of licenses is a bit complicated because, of course, local laws have to be looked up and stuff like that. So we decided to move this topic to tomorrow in the morning.
So there will be a good, hopefully good, discussion about the differences between data and software licenses and why they can be really important when it comes to the FAIR principles. The next one is the issue of provenance.
Here we have the situation that provenance information is often produced, for example, by the machines you are using in your laboratories. So provenance information can, on the technical side, be added quite easily if certain specifications are followed. And also in collaboration with industry, one can say that in some disciplines we are moving toward so-called smart labs.
So imagine in the future you put your glasses on or have a small camera included, you have a microphone included, and every step you do actually in the laboratory will be recorded. And all the machines you are using, all the scientific instruments you are using,
they will be connected to each other. And so your experiment will be completely recorded. So you actually have all that provenance information in the future in an interoperable format and in one spot on one laptop, for example, in your lab.
So there are some efforts regarding this here at the University of Hanover as well, in the faculty of applied chemistry.
And we are really looking forward to what are going to be the next steps coming up in the future. So maybe you should just keep that in mind. The other thing is, of course, that the community standards have to be addressed in order for data and metadata
to be reusable. And I will show you an example of this later on. So as an institution and repository, it should be made clear that the metadata schemas are in both human- and machine-readable formats. I guess we have discussed this in the last days already. Repositories are requested to make it really easy: you should have a good user interface, and you should have an open API in order to get this relevant information. And of course, you should also offer support
when it comes to choosing a license for data and software. Here as well, there are some tools and services out there. On the downside, they are not synchronized or harmonized between the different repository providers.
So it's a bit of a free-for-all: every repository provider can decide on the licenses they offer you when, for example, you submit a data set. So there are some catches here. But there are tendencies, at least on the data side, to move toward including more of the Creative Commons licenses. Again, though, harmonization has not reached a level where we can safely say: OK, here are, let's say, five or ten standard data licenses, and if you use them, you're good to go. Sadly, we are not at that point at the moment. But we are working toward it. And yes, of course, with repositories,
there are two possibilities, basically. One is that they are generic, so it doesn't matter which discipline you come from; you can submit your data set to this generic repository anyway. And on the other hand, there are discipline-specific repositories, which have often been established
for a long time, in some cases. They contain 10,000 to 100,000 to millions of data sets. And they are very established in their disciplines. So again, here, one has to look what kind of standards
should be included. And as a scientist, you are, of course, advised to be as detailed as possible when adding data and metadata, to provide a useful context.
The purpose of the data creation is really important, as are the collection date and conditions. And it's, of course, also important to mention what state the data is in. Is it raw data? Is it processed data? Secondary data?
Is it the data that comes with a publication, and so on? Another point here, and that one actually can be very important when it comes to reusability, is that you should clearly explain what variables and parameters you measured, and also the formats
you include here. So it's not self-explanatory, and there have been some big mistakes in science already because formats were not harmonized or explained; the loss of the Mars Climate Orbiter over a mix-up of metric and imperial units is a famous case. So this is also an issue. And again, in the machine-actionable world, we do not have complete harmonization in place yet. It's also important, within a data set,
that you, of course, also cite the software you are using. You should cite it, and you should state the version you used to process the data and to visualize the data. And this is something that is not mandatorily required by most of the data repositories out there. So this is something you have to think about as a scientist yourself first, and then include this information. On the user interface side, it's often not possible to include it, even though it has been added to the DataCite metadata schema in the last years. But it's all only a recent development.
And some data repositories do not offer this option in their metadata templates, for example, to also highlight the software you used. So this is another thing to keep in mind. As a recommendation: set a license. It doesn't matter which one at first; just choose one. That's the most important thing. And of course, as a library, we tend to recommend the Creative Commons licenses, CC BY above all, because we think it best follows good scientific practice. And of course, if applicable, you should also provide additional information on any legal conditions that may apply. So if you have embargoed data, or data that has restricted access and cannot be made open,
but only the metadata can, then please provide some background and contact information, so people can get in touch with you and clear up whether the data could be accessed anyway for certain scientific purposes, or whether a cooperation project or something else could be established based on this data set. Specify your provenance information. Again, we are not in the digital age of the smart-lab world yet, but hopefully we are getting there. Until then, we would like to ask you to please specify any provenance information you may have in the metadata, and maybe also in the technical appendices or anything else you include with this kind of data set. And choose your license in a way that also covers your citation wish, or state your citation wish clearly. And again, for the last point: if there is one, please use the community standard for data archiving
and publication; if not, then explain why you used a particular parameter setting or something else instead. The last point here is that you, again, actually request that repositories in your field of study collect these details. I know, as a scientist you maybe just want your user interface and your submission fields, you want to click, submit, be done with it, and continue
in your lab with the more important part of your work. But the repositories surely need feedback. They need you to say, hey, I want my ORCID iD included with my author details. And sometimes, especially the small local institutional repositories, which have been popping up all over the place at your local universities in the past years, really need your support; they need you to tell them what they should actually be including. So this feedback is really important for them.
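The advice above about citing the software and its exact versions can be made concrete in R. A minimal sketch, using only base-R tools; `stats` is just an illustrative, always-installed package name, so substitute the packages you actually used:

```r
# Record exactly which software produced an analysis, so it can be
# cited in the data set's metadata. 'stats' is only an illustrative,
# base-R example package.
citation()               # how to cite R itself
citation("stats")        # how to cite a specific package
packageVersion("stats")  # the exact version you ran
sessionInfo()            # full snapshot: R version plus all loaded packages
```

Pasting the output of `sessionInfo()` into a data set's technical appendix is one lightweight way to preserve this provenance.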
OK. Now, before we move on to tidying data: we were asked yesterday to go into a bit more detail on how you can see, from the structure a repository provides, whether it is probably on a good way toward being FAIR and whether it should be used. So of course, scientists like to talk about their own research disciplines. So I will talk a bit about the Pangaea repository, which is a major data publisher for Earth and environmental sciences.
It's been there for more than 10 years as of now. And it's one of the, I would say, one of the largest repositories by now when it comes to Earth and environmental sciences. And as PhD students in the area of climate sciences,
we got in touch with it very early, even starting with our master degrees, for example. And here, you usually get a user account.
You can search across data sets. You can submit data sets. And you can learn some more about Pangea. And as you can see, climate science, this is also a broad field and coupled closely to the environmental sciences. So we cover here lots of different disciplines.
And the small number you can see here is the number of data sets actually included in this repository. So again, here you have a human-readable interface, nice pictures. You get to know the repository a little bit; you can click around. You can read about their background. You see the operators very clearly. They very clearly state their policies and recommendations for safeguarding good scientific practice. You can see that they receive public funding. And I think they are well in line with the FAIR guiding principles here. You get more information on interoperability. You can click on their team and see the persons behind it, and their data holdings in total, with some statistics. And, very important in climate sciences, they are a member of the World Data System. That means that when you are a researcher in that field, you will know that they are certified when they can put this logo on their website. So they are a registered repository in the World Data System, and that means they have to have an exit strategy, good data policies in place, and so on. So much for the human-readable side. Then I already did a search and looked something up.
This is the first impression you get here from this data set. So before you see anything else (and that's actually typical for climate science), you get the geographical information. You can zoom out as well, and you can see that it is actually in the Antarctic. So it's a nice sample from the Antarctic. And down here, you now get the data set at this resolution. So you see it's from Christian Bucher in 2000, and it's on the temperature and resistivity profile of a drill hole in Antarctica, and it's a supplement to a publication. This standard display here is characteristic of Pangaea, and you can get all kinds of different citation formats here.
You have your abstract, as usual. You have the project, which is also linked, described, and mentioned. You have the coverage. Again, that's so important in climate science, with the latitude, longitude, date, start time, and end.
And again, as we mentioned on the first or second day, in a machine-readable format, and so on. But overall, this is the human landing page; that's the human-readable version. Now, when you look at the source code here,
in the first part, we have this geographical map information. And then, sorry, if you scroll down, you get here, very quickly, to the metadata information,
including the mapping to Dublin Core, which is very clearly displayed. And so the metadata is completely machine readable. The identifier, for example, is included with the DOI, and the schema, of course, is identified with a URI. And then, if we move to line 51, this one here, we also get a quick introduction to how Pangaea exposes schema.org: they are using JSON-LD to, again, provide the metadata information. Here comes the abstract, just for reference. But when you go down further, you will find that there are also
types included here, which are the variables that are measured. So you can actually search in the data set for the column for depth of the sediment or rock, for example, or the temperature in rock or sediment, given in degrees Celsius.
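As a sketch, the kind of schema.org markup being described might look like the following; the names and values are illustrative, not copied from the actual Pangaea page:

```json
{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "variableMeasured": [
    { "@type": "PropertyValue",
      "name": "Depth, sediment/rock",
      "unitText": "m" },
    { "@type": "PropertyValue",
      "name": "Temperature, in rock",
      "unitText": "deg C" }
  ]
}
```

Each measured column becomes a `PropertyValue` entry, which is what makes the variables themselves, not just the citation metadata, findable by machines.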
Now, having this in JSON-LD is important because it makes not only the metadata, but the data set itself, a bit more machine readable. But this is also the limitation we have here, because these parameters vary, and they are very different when you move from data set to data set.
So some maybe mention the temperature in Celsius, and others mention it in Kelvin or Fahrenheit. And from "temperature in rock sediment" alone, you will not know, for example, whether this sample here in Antarctica would be comparable to one in the Arctic. And we will have an example of that with a small R package later on. So this is basically what I wanted to show you. And now you have to imagine actually getting those parameters out of all of the research data sets
that are published so far in all of the other repositories. So Pangaea is one of the better ones, let's say it like that, in including JSON-LD parameters. But we have many others which just focus on the standard metadata parameters, like, again, Dublin Core, where the metadata you need for citation is included, but not the information you actually need when you want to search inside of a data set.
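To see what that difference means in practice, here is a small sketch in R of reading such schema.org JSON-LD. The snippet is a made-up stand-in for what a repository landing page might return via content negotiation, and the `jsonlite` package is assumed to be installed:

```r
library(jsonlite)  # assumed installed: install.packages("jsonlite")

# A made-up schema.org/JSON-LD snippet, standing in for what a
# repository might return when you request application/ld+json.
doc <- '{
  "@type": "Dataset",
  "variableMeasured": [
    {"@type": "PropertyValue", "name": "Depth, sediment/rock", "unitText": "m"},
    {"@type": "PropertyValue", "name": "Temperature, in rock", "unitText": "deg C"}
  ]
}'

meta <- fromJSON(doc)
# With variableMeasured exposed, the variables and their units
# become searchable, not just the citation metadata:
meta$variableMeasured$name
meta$variableMeasured$unitText
```

A repository that only exposes Dublin Core would give you the title, authors, and DOI, but nothing like the `variableMeasured` table above to search on.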
So this is just one example of being FAIR, which can be more or less implemented, depending on the repository you're looking at. Now, you all maybe have used data repositories yourself in the past. And I would like to invite you to have a closer look at whether they use JSON-LD or any other format to describe not only the metadata, but also whether they try to describe the parameters which are used inside of a data set.
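One way to check this is content negotiation: asking the DOI resolver for machine-readable metadata instead of the HTML landing page. A hedged sketch in R — it assumes the `httr` package, network access, and a repository that actually serves JSON-LD; the DOI is a placeholder:

```r
# Sketch: request JSON-LD for a DOI via content negotiation.
# Assumes the 'httr' package and network access; the DOI is a placeholder.
library(httr)

resp <- GET(
  "https://doi.org/10.1594/PANGAEA.0000000",  # hypothetical DOI
  accept("application/ld+json")               # ask for JSON-LD, not HTML
)

# If the repository supports it, the body parses as structured metadata.
metadata <- content(resp, as = "parsed")
str(metadata)
```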
And whether they can be found when you, for example, access this JSON-LD via content negotiation — then you can also search inside of the data set. And yeah, maybe it would be interesting to see if there are any other repositories which have also included this, or what kind of metadata and data standards they offer. So that's basically all we wanted to show you about this. Wake up, everybody, please get active! About tidy data, one of the sayings
that you will find quite often is that tidy data sets are all alike, but every messy data set is messy in its own way. That's a variant from a Tolstoy quote about happy and unhappy families. And what is tidy data?
A different name for it is long-form data, which I presume most of you have not seen so far, because Excel and the other table-like programs usually give you this wide form — it fits on widescreen monitors, and for the human eye it is quite easy to have different columns for the same observation, for example.
for example. So this kind of transformation from this wide format to the long format in the vertical dimension makes a tidy data set, but that's not all of it. There are some rules. First of all, if you have one table,
there should only be a single type of data in it. Variables go into the columns, and one column should include exactly one variable. As you see here, the A is, for example, a measured variable. Anybody have a good guess right now what an A could be? Not a unit — some physical property, an absorbance, exactly, thank you. So an absorbance, for example: we're measuring a bunch of samples. For example, A1, we're measuring it three times. A2 is a second sample, we're measuring it three times. And A3 would be a third sample that we're measuring three times. Then putting all of these absorbance values into one column, one value per cell, makes a tidy data set, whereas the technical replicates go into the rows here.
The rows are also called observations, and at the very right, usually, of one row is the actual value that was observed. And all of the other columns, they are, as I mentioned, variables that have been measured. For example, when we stay with the absorbance example,
it could be different treatments for some kind of sample that you have collected, and in the end, there's one absorbance measurement or observation. And that, of course, in its summary here, has the outcome of having exactly one value per cell.
The column IDs, or the variable names, sorry, the column headers, they are used as the IDs. So as you can see here, in an Excel-like setting, you would maybe have a column that is called ID, for example, patient ID or sample ID, and then you would have a different ID in each row.
That is not super tidy yet. So yeah, as I mentioned, all the table-like programs nudge you towards this wide format, and this is very human-readable. But as we have taught you this week, I hope, the FAIR principles are also about machine readability for the most part, and therefore we are going to have a little exercise now
about what this little data set for some patients and two treatments actually has, what kind of structure it has. So these are right now only labels, the column labels, treatment A and treatment B, and the row labels of your patient names.
And here we have some kind of measurement values. The two tables here are exactly the same, just pivoted by 90 degrees. So what would you think are the variables here? What are the measured variables?
Actually, the names are a variable as well, so we have three variables here. We have the treatment type, basically, and the treatment type has the values A and B. We have the variable name, or person, or patient, and the values are the actual names as strings. And our observations are the third variable, which we may call result or score or measurement, whatever. So actually, a tidy version of this table would look like this. We have now many more rows, so it's longer,
and we have also some kind of repetition, of course, in the values because each person was treated twice. But the advantage here is, or one unexpected advantage maybe, that we can also notice and infer why, for example, values are missing
or what the meaning of missing values is because surely this person was treated, but for some reason, the result was not captured. So there's, as you maybe know, different types of missing data values. For example, because the data couldn't be measured for some reason, or it may have been removed
as an outlier, and if we have it in a tidy data format, we are much more easily able to notice that something has happened, and we can start inferring what has happened. So this would, as I said, be a tidy data set because each value belongs exactly to one variable and one observation.
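In R, the wide-to-long transformation described here can be sketched with the `tidyr` package (assuming it is installed; the names and values below are toy data invented for illustration):

```r
library(tidyr)

# The wide, Excel-like form: one column per treatment.
wide <- data.frame(
  person      = c("John", "Jane", "Mary"),
  treatment_a = c(NA, 4, 6),  # NA: treated, but the result was not captured
  treatment_b = c(18, 1, 7)
)

# Pivot to the tidy, long form: one row per observation,
# with the treatment type as a variable of its own.
tidy <- pivot_longer(
  wide,
  cols      = c(treatment_a, treatment_b),
  names_to  = "treatment",
  values_to = "result"
)

print(tidy)  # 6 rows: person, treatment, result
```

Note that the missing value survives the pivot as an explicit NA row, which is exactly what makes it noticeable.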
Yeah, I already talked about the conclusions about missing data. We will later get into R, and the tidyverse is a set of packages which help you work with and produce data like this; in the Python universe, pandas, for example,
can work with this kind of data as well. So it's not a dead end. It is maybe a little bit less human readable, but it is definitely more machine readable, and therefore, especially for larger data sets, in the end much more effective. There was a comment that in particular, what we named here just generally "result" could be more specific. Exactly — here you can be much more precise about what the actual result is, what it means. As we saw in the PANGAEA example, you could have a really specific parameter name here with a unit, so it would be much clearer what this number actually means. And also when you think about vocabularies and ontologies,
what this result would be called — this column header — might already be defined in your community if you stick to a standard. So here you may save yourself the trouble of finding a good name for this treatment column, but then everybody else will have to parse the information from the table caption, or from the materials-and-methods section, or from the results section — from the unstructured text, basically. So yeah, that's another advantage. Thank you. All right, and then one topic we will probably talk about more on Friday as well is how to cite things.
And maybe have a little show of hands: first, who has tried to insert a software citation in a paper? Okay, about half the people. Which reference managers did you use, and are you really happy with one of them to solve this problem? Okay, Zotero was one suggestion. Okay, several more use Zotero. Me too, there's another one. JabRef, okay. What do all of these have in common? Or were there other suggestions, sorry. Mendeley, okay, as well. Org-ref, that's Emacs Org mode, okay. Okay, could it be possible that most of the suggestions we've heard just now in the end somehow use BibTeX or are compatible with BibTeX?
Seems to be some agreement around the room, okay. So I have drawn a workflow here of where the citation metadata come from. Obviously, the author of a paper or a data set, or the developer of a software, at some point has to provide the citation metadata.
Or in some cases, it may also be generated automatically, for example from the DESCRIPTION file of an R package, or from other community standards or programming-language-specific metadata files. Then, as we have seen, some repositories expose the information quite nicely
and have linked data and nicely searchable content of the datasets even. And therefore, a user who finds a dataset, therefore can also see what the citation information is. Then most of you use some kind of reference manager. So the next step would be to import this.
Some have one-click buttons. Sometimes you can copy and paste a snippet and import it like this. Some have a DOI lookup where you just throw in an identifier like the DOI and a web service provides you with a citation metadata in the end. But in the end, it all has to be inserted into some kind of document
and then rendered according to a certain style. So because of time constraints, I would maybe not play this developer section here. So I've given a talk before about this topic when I got into it, but it's just a few minutes. And for the people on the video, they can also maybe pause and hop back to the other video for a minute.
The impression I got was that currently BibTeX — and BibLaTeX with Biber, for example, would be another option — covers most of this workflow. As you saw in the PANGAEA example, you can import a snippet from there. It is okay to write by hand, but as we will see later, R, for example, generates citation snippets for you, for R packages. And in the end, there's also a huge number of BibTeX styles that render your citations. And in between, there's a large number of tools that can either import BibTeX or export BibTeX.
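For example, base R can do this out of the box — `citation()` builds the snippet and `toBibtex()` converts it:

```r
# Ask R how to cite itself (or a package, e.g. citation("ggplot2")).
cit <- citation()
print(cit)

# Convert the citation object into a BibTeX entry
# that any BibTeX-compatible reference manager can import.
toBibtex(cit)
```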
So this one covers probably the widest range. It's not all perfect in each case, especially for software citations. There are few styles that recognize the @software citation key as an item type in itself, and thereby, for example, also render versions and other software-specific metadata correctly. But it's a chicken-and-egg problem, right? You have to start somewhere. And I would recommend, if you're publishing a software, use this @software key.
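Such an entry might look like this — a hedged sketch with invented values; the `@software` type and the `version`/`date` fields follow BibLaTeX conventions, and plain BibTeX styles will typically fall back to treating it like `@misc`:

```bibtex
@software{doe_toolbox_2018,
  author  = {Doe, Jane},
  title   = {Example Analysis Toolbox},
  version = {1.2.0},
  date    = {2018-06-01},
  doi     = {10.5281/zenodo.0000000},
  url     = {https://example.org/toolbox}
}
```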
@misc, for miscellaneous, is often used, or @manual in case you are referring to the help pages of a package, for example. And yeah, if you're knowledgeable in BibTeX, or if you know someone, maybe sit together with them and implement a little update for one style that is relevant in your field, to render a citation of a software or a data set usefully. There's also the Citation File Format (CFF), which makes it a bit nicer to supply the metadata. It is a YAML-based format, a bit more human readable than BibTeX, maybe — it has fewer brackets, for example.
But so far, the end game there is to convert into BibTeX as well. Then there's CodeMeta JSON, a format that is really just for machines. There is, for example, an R package to generate it, and as far as I know it is exposed on some websites, but I don't know any citation manager that can then also import it from there. And as we've seen, most of the import options offer BibTeX or RIS anyway. So, yeah, also, of course, if you're interested in this topic, there's a FORCE11 software citation implementation working group on GitHub, where you can read up on this discussion and where the sources of this slide are also from. Okay, then let's have a very quick demo. Do you remember that we have
put a DOI on a software recently? And I have to switch out of full-screen mode. As you can see on Zenodo — oops, yeah, now you see everything, okay? On Zenodo, we have the software item type here already, and Zotero, for example, can be integrated in the browser, and with this little code symbol here it recognizes that this is a software item that it could import. So we're going to try this, and this demo is not prepared. Oh, Zotero is not running yet. There it is. Database is loading, loading, loading. Okay. Try again. Here we are. That's the data that was just imported.
It's a computer program item type. The URL is there. The DOI has been extracted as well. Library catalog, Zenodo. Yeah, looks pretty good, right? So I would use this for a citation now, and if I then notice that the citation
is rendered a bit strangely, I may have a little correction here — for example, my name has been put into the last-name field. So it's not perfect, but it's pretty good. Yeah, and then we would like to have the PANGAEA example
in a bit more expanded form. How do you feel? I would suggest a little break before, because then you can also start up RStudio and follow along.
So whether you have this code in front of you or not,
so it's an R Markdown file, meaning a literate-programming example where you have normal prose text but also R code, and since we have seen PANGAEA landing pages already, I don't need to introduce it too much.
What I now just want to highlight is that one aspect of reusability is really in these column names. In different data sets, if you have the same column names that include, for example, also the unit, it is extremely easy to combine them, and the two data sets that I found here, for example,
are ice-core drillings from both the Arctic and the Antarctic. And as you can maybe guess, one of the measurements done with ice cores is reconstructing the temperature of the paleoclimate. So we're going to see which exact measurements
are being done and which exact column names are being used for that. But what we want to combine now is these two data sets into a single diagram to see what the temperature curve from far, far away and long, long ago has been in these two projects
or these two publications. So because the data sets are here identified with a DOI, we can already say they are pretty findable, and we can also download them just over the internet, HTTPS, that is also nicely accessible.
We will go onto one of the pages after all, because what you saw briefly when Angelina scrolled through her example was that here at the bottom you have a parameter table, and each of the parameters in this data set is named and shortly described, the unit is highlighted, and you can even search the rest of PANGAEA for other δ18O-of-water measurements, for example. That 18O is an isotope of oxygen, right? And the isotope of oxygen can be used
to reconstruct the temperature. So that's the exact measurement, but the result of this will be a temperature reconstruction. As you can see, there are over 900 other data sets that use exactly this parameter. So potentially we could combine all of them into new graphics, answering new research questions, maybe.
But the question for us is just: Antarctic, Arctic — what are the two reconstructed temperature curves that we get from this? And therefore our question is: is it also interoperable?
It looks like it is, because these parameters are being reused in several data sets. And now we want to approach this question, how should we do it? We want to get the outcome of comparing the temperature proxy measurements
in a single diagram from two different data sets. So what do we have to do when we think backwards to our present situation? We definitely have to make sure that both of the axes use the same variables. We need to check whether the units are exactly the same or whether we have a jump of a thousand
or 0.1 or something in there to convert stuff. We need to extract the values from the data sets, for which we need to know the structure first, and we need to download the data sets in a reproducible manner. So of course I could click the download button here on the website, it's somewhere here.
So I can download this as HTML or as tab-delimited text, but we want to have our R script do this for us. So the challenge — for you, that's the question — would be: should we download it by hand? Well, I already answered it, right? We shouldn't download it manually. But what do you think about this option,
writing our own little download function and, for example, putting both of the IDs into it? Who's in favor of this option, of the second option compared to the first? Oh, a small minority, okay. Can anybody think of a third option?
So we have manual download through the website. We have programming a download function. Who can think of a third example? So please raise your hands. Okay, several, and just shout it, what would you do?
Was that the same thing you all said — reusing? Exactly, so yeah, we can reuse something. So where would you try to find something? I suggested this to them once; we will see if they have already implemented the suggestion. There's a tools page somewhere.
Ah, tools for data publishers. So software provided by PANGAEA. Yeah, okay, no, they didn't yet. As far as I'm aware, PANGAEA itself does not provide something for R. However, there's a community of scientists, ropensci.org, and that's the hint. They have a really big list of nice packages. In this case, we will just search for the repository name PANGAEA — there: an R client to interact with the PANGAEA database. So it's unfortunately not listed on their official tools site, but it has been developed by other people. First lesson learned: in the R community, rOpenSci is quite a big source for nice packages. And of course CRAN, the Comprehensive R Archive Network, is the biggest one for packages. So we can look at their list, R packages sorted by data publication, for example. And as we search here for PANGAEA, there we find it as well. And rdataretriever — oh, nice. Nice, okay, thank you.
So there was a suggestion to also highlight this more general R package, which helps retrieve data from various, non-specified repositories. So pangaear is a specific one, but just as with the data repositories themselves — some are discipline-specific — it seems that there are also R packages which work across different repositories in a very general way. Yes, one more suggestion.
Yes, okay, yeah, the suggestion was also to mention RCurl, but RCurl is a very general download tool. Would it, for example, be able to resolve a DOI if you give it a DOI URL? I'm not sure. But underlying these more specialized data-retrieval packages are probably RCurl functions in the end. Okay, but let's continue with our example. So yeah, pangaear — before installing something: it's of course easy to do that.
It's just here in the Packages pane's install option: pangaear. It's just a few clicks away, but how would you find out if this package is actually useful for you, if it has the right functions for what you want to do? Remember, we want to download a data set and we have the DOIs. All of the rOpenSci packages have reference pages — or most of them, I'm not 100% sure here — that are automatically generated from the function documentation, which we will learn about later today. And yeah, there are some download options, as you can see. And here, this pg_data looks like it can take a DOI and do what we want. So, for example, pg_data — exactly, we put in the DOI and we get the data set back. Seems to be what we want.
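Put together, the step we are about to do could be sketched like this (assuming the `pangaear` package and network access; the two DOIs are placeholders for the ice-core data sets):

```r
# Sketch: fetch two PANGAEA data sets by DOI and compare their columns.
# Assumes the 'pangaear' package and network access; DOIs are placeholders.
library(pangaear)

res1 <- pg_data(doi = "10.1594/PANGAEA.0000001")  # hypothetical Greenland core
res2 <- pg_data(doi = "10.1594/PANGAEA.0000002")  # hypothetical Antarctic core

# pg_data() returns a list of results; the actual table sits in $data.
d1 <- res1[[1]]$data
d2 <- res2[[1]]$data

# Which column names do the two data sets share, character by character?
intersect(names(d1), names(d2))
```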
So have all of you installed pangaear successfully? I already have it. Please put up a red sticky if the download or installation doesn't work, and a green one if you have installed it and have loaded it, either by checking the checkbox here in RStudio or with the library(pangaear) function call. Right, yes.
Oh yeah, okay, that's a really good point. So rOpenSci is, as I mentioned, a community of scientists, and they do peer review on their packages. So CRAN has this as well — CRAN is the biggest source — while rOpenSci is especially for scientific packages, and it is also peer reviewed. Other packages that you find on GitHub may not be reviewed. So yeah, the quality is unknown, but it is possible to install things directly from GitHub in R, yes. Yeah, and for dependency management there's Packrat. It's also a tool in the R universe; we will not cover it today, no, sorry. Yeah, so ask Mateusz — he has experience with this. I just know that it exists and I haven't used it myself. Okay, good.
Then let's just get our two data sets here. As you can see, the parameter that we need to put in is the DOI. This is nicely fast, and we get a list, and in the list is another list, and then we have some metadata and we have the actual data. As you can see here, these two variables with a bunch of, well, 5,000 measurements. We can also have a look at the structure of the data; that's str(). Ah, whoops — so that's the other way to have a look at the structure. This one is a bit larger as a data set. As you can see, the age is present and the δ18O as well, but lots of other measurements.
Now, what we need to find out in particular is whether the column names here are really exactly the same, so we can automatically process them further. So, to extract the data set itself, we are going to index into the result and just extract the data part of it.
So I'm going to overwrite my initial downloads. Now we can have a look at the structure, and we have this nice table, not this list of lists and nested lists anymore. And with the names() function we can check the column names of the one data set and of the other, and to have them automatically compared for us, intersect() is what we want. And as you can see, exactly character by character,
it is the same, otherwise it would not have been in the intersection. Then let's get to the plotting. The most popular plotting library in R is probably ggplot2, where the gg stands for grammar of graphics, which is a really interesting topic in itself,
how plots are structured like sentences, and there's a certain grammar to build up plots in a logical and visually appealing way, even without customizing the styles and the fonts and the colors. So we're just going to use this. And as I mentioned yesterday in the versioning example, I still have version 2.2.1, so I'm pretty sure my example will work, but some of you may have installed version 3 in the last few days or just now. So if it doesn't work, we know that this major version jump introduced some breaking changes in relation to this example code. But it could be that my example is small enough that it slipped through that tsunami wave of ggplot2 updates. So I'm just going to load it with the library(ggplot2) call — or in RStudio you can also click the checkbox here — and ggplot(), the function, requests one data set
and the data set should be in a data-frame format, which we already have, as we could see here — it's a nicely visible table. One aspect of the grammar of graphics is the so-called mapping of the aesthetics, and the aesthetics are the x-axis, the y-axis.
Then if you have elements that can be varied in size, for example, dots have sizes, colors, and all of these things can be aesthetics. And we need to put in the variable names here that we got from the intersect check just before. So it's best to copy and paste them.
And this is the one criticism of the way PANGAEA does it: they use spaces within the variable names. So we need to use backticks. And on the German keyboard, that is Shift and then this little apostrophe key at the very top right,
and afterwards you have to press space once. So that's a backtick. It's used quite often in programming and in Markdown, for example, to denote little blocks of code — so it's not quoted, not italic, but it looks like source code in the formatting. And in R,
you need to enclose variable names containing spaces in these backticks. So we want to have the age on the x-axis and the measurement of the oxygen isotope on the y-axis. The first data set here is from Greenland, the North Greenland Ice Core Project, NGRIP — I'm not sure what the R is for; it's a reusable ice project, maybe. We're just going to color it dark green, and we want to have point geometry. As you maybe see here, we have 400 observations, so we should expect to see about 400 dark green points when we execute only this first part
of our plotting code. So there's a message that some rows have been removed because the data is missing. And there was a question in the break whether you should remove missing data or not. I would say a missing measurement also contains some information, so when in doubt, I'd rather leave it in. In the tidyverse, which ggplot2 basically belongs to, you can be pretty confident that you will be notified about missing values, but that your analysis will not run into some kind of error, because the handling of missing values is quite advanced. There are even packages to help you visualize missing data, which is also helpful in some cases. The comment was that there's a parameter in many R functions called na.rm, for NA-remove, and that can help you ensure that missing values don't enter into any kind of calculation.
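In base R, that looks like this:

```r
x <- c(1.2, NA, 3.4)

mean(x)                # NA: a single missing value propagates by default
mean(x, na.rm = TRUE)  # 2.3: missing values are dropped from the calculation

# Many summary functions (sum, sd, max, ...) take the same argument:
sum(x, na.rm = TRUE)   # 4.6
```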
So that's what I meant by saying they are handled in quite an advanced way. So we have the one data set here already. The zero here is the present time. We're going to do something with this, because we're of course drilling into the past when we're drilling an ice core. So in a minute we will also flip the axis
so that the left is the past and the right is the present. But first let's introduce our other data set here. We're using the geom_point function, and this one — the geom_point function, or all the other geom functions — can also be passed a data set as an argument. So if we execute the second geom_point call here with a different data set, it will try to use the aesthetics from before. And because the column names are character by character exactly the same, this will work. Oh, I overwrote it, I wanted to execute it, sorry. Again, we have some missing data here, but it's exactly the same kind of missing data. And there we go. I think the black here is the default — I'm not sure if Antarctica is usually colored black,
but we will leave it here. So we already have our two data sets here on the same X axis. You can already see that the ice core from Antarctica was apparently a lot longer or was measured more carefully, we don't know. Sure, okay, but that also means of course it was probably longer
unless there's compression of the ice, which is much more. Okay, yeah, good. Oh my God, I shouldn't go into the glaciology topic too much. And yeah, the last thing we wanted to do is to reverse the x scale. So there's a specific function for that, scale_x_reverse — there's probably also one for reversing the y-axis — because the age axis means going into the past. So we're going to execute this again. And there we go. We have our two different data sets remixed in a single diagram. And we can now answer other scientific questions
which we couldn't answer if we, for example, only had one of the data sets, or two data sets but in an incompatible, non-interoperable, non-reusable manner. All right. Okay, then let's have a little impromptu styling discussion for plotting. One aesthetic that we could use is called alpha — that's the alpha channel in a color value. It is, exactly, transparency or opacity, and I'm not sure whether 1 means completely transparent or completely opaque. So we're just going to use 0.5. Let's try it, because I want to override the aesthetic here.
Yeah, I need to wrap both in aesthetics, right? I know I can even — yeah, exactly, in the ggplot() call at the very top, right? Yeah, yeah. Okay, so it's always inherited downwards. So rather than putting it into both geom_point calls, better put it into the general mapping of the whole plot. This is to avoid so-called over-plotting: as you saw, it wasn't clear where, for example, the points were very dense and just overlapping.
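Putting the pieces of this demo together, the whole plotting step might look roughly like this (a sketch assuming ggplot2; the two small data frames stand in for the downloaded PANGAEA tables, and the backticked, space-containing column names are illustrative, not the real ones):

```r
library(ggplot2)

# Toy stand-ins for the two downloaded data sets; PANGAEA column names
# contain spaces and brackets, hence check.names = FALSE and the backticks.
d1 <- data.frame(`Age [ka BP]`         = c(1, 5, 10),
                 `d18O H2O [per mil]`  = c(-35, -38, -40),
                 check.names = FALSE)
d2 <- data.frame(`Age [ka BP]`         = c(2, 6, 12),
                 `d18O H2O [per mil]`  = c(-50, -52, -55),
                 check.names = FALSE)

p <- ggplot(d1, aes(x = `Age [ka BP]`, y = `d18O H2O [per mil]`)) +
  geom_point(colour = "darkgreen", alpha = 0.5) +  # first data set
  geom_point(data = d2, alpha = 0.5) +             # second data set inherits the
                                                   # aesthetics, because the
                                                   # column names match exactly
  scale_x_reverse()                                # high ages (the past) on the left

p
```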
And with 0.5 opacity, we can see that some peaks, for example, are not as over-plotted anymore — or we can even go a bit lower. I think, nine. Yeah, yes, it does work; it's just not very visible. I see it also here, but it's not exactly what I was expecting. We could look it up, for example, in the ggplot2 help. So yeah, that's what happens when you do unscripted examples.
There's no alpha here in this help. Okay, I'm in the wrong help file, so we'll just continue. More precisely, we will conclude with this example that on the technical level, using exactly the same column names in the data set files as other colleagues in your field already helps. That's the point I wanted to make here. Are you asking about the x-axis or the y-axis? The y-axis, right?
Okay, the oxygen-18 isotope measurement is on the y-axis — or the delta of it, to be precise — and that is a physical proxy measurement for calculating backwards to the temperature that was present when the snow fell down, enclosing gas bubbles, before the snow was compressed into ice. The ice's water contains oxygen, of course, and from that you can calculate backwards to the temperature.
That's the measurement here. Okay, so I mean, scientifically this is not really a super interesting example — all the climate scientists, please be nice. But we can, as I mentioned, conclude here
that there's the very technical aspect of using the same variable names that helps a lot for remixing data sets. It's a PDF. Let's download it.
Yeah, here's a very basic grammar example. So you have the data. The data is put into a geom like lines or bars or points and together with a coordinate system, this already forms the very simplest grammatical plot. But there are several more grammar aspects to it.
And yeah, the data, the mapping, and the geom are the absolutely required functions that what we focused here as well. But for example, because we flipped the scale, we used one non-required level of this grammar.
Yeah, thanks. Okay, I didn't know that this was in RStudio itself: Help, Cheat Sheets. Great. Did anybody encounter any plotting errors? Please put up a green sticky if you saw the diagram in pretty much the same way as I did,
whether alpha or not. So as you can see, not every update is a cause for concern. I'm going to replot it without the alpha now.