Reusability Through Community-Standards, Tidy Data Formats and R Functions, Their Documentation, Packaging and Unit-Testing

Video in TIB AV-Portal: Reusability Through Community-Standards, Tidy Data Formats and R Functions, Their Documentation, Packaging and Unit-Testing

Formal Metadata

CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
So we are nearing the goal, and reusability is our topic for today. As usual, I would like to give you a short introduction.
So again, we have four central topics concerning reusability. One is of course that data and metadata should be as accurate and detailed as possible, so one should make sure that they are actually described in a sensible way. This includes a clear and accessible data usage license. The whole topic of licenses is a bit complicated, because local laws have to be looked up and so on, so we decided to move this topic to the morning, where there will hopefully be a good discussion, also about the differences between data and software licenses and why that can be important when it comes to the FAIR principles.
The next one is the issue of provenance. Here we have the situation that provenance information is often recorded, for example, by the machines you are using in your laboratories. So on the technical side provenance information can be added quite easily, provided the specifications are kept. Also in collaboration with industry, one can say that in some disciplines we are moving towards the so-called smart labs. Imagine that in the future you put your glasses on, which have a small camera included, you have a microphone included, and every step you do in the laboratory is recorded. All the machines and scientific instruments you are using will be connected to each other, so your experiments will be completely recorded, and you will have all the provenance information in an interoperable format in one spot, for example on one laptop in your lab. There are some efforts regarding this, also here at the university in Hanover, concerning applied chemistry, and we are really looking forward to the next steps coming up in the future. So maybe you should just keep that in mind.
The other thing is of course that the community standards have to be addressed in order for data and metadata to be reusable, and I will show you some examples later on.
As an institution and repository, you should make the metadata schema clear in both human- and machine-readable formats; I guess we have discussed that in the last days already. The request is that repositories should make it really easy: you should have a good user interface, you should have an open API in order to get at this wealth of information, and of course you should also offer support when it comes to choosing a license for data and software. Here as well there are some tools and services out there. On the downside, licenses are not synchronized or harmonized between the different repository providers. It is a wild card: every repository provider can decide on the licenses they offer to you when you are, for example, submitting a dataset. There is a tendency, at least for data, to include more of the Creative Commons licenses, but the harmonization has not reached a level where we can safely say: OK, there are, let's say, 5 or 10 different data licenses, and if you use them you are good to go. We are not at that point at the moment, but we are looking towards it.
As for repositories, there are basically two possibilities. One is that they are generic, so it does not matter which discipline you come from; you can submit your dataset to such a generic repository anyway. On the other hand, there are discipline-specific repositories, which have often been established for a long time; in some cases they contain thousands to millions of datasets and are well established in their disciplines. So again, here one has to look at what kind of standards should be included.
As a scientist, you are of course asked to be as detailed as possible when adding metadata, to provide a useful context. The purpose of the data creation is really important, as are the collection date and conditions. It is of course also important to mention what state the data is in: is it raw data, is it processed data, secondary data, is it the data that comes with a publication, and so on.
Another point, which can be very important when it comes to reusability, is that you should clearly explain which variables and parameters you measured, and also the formats you include, because these are not self-explanatory. There have been some big mistakes in science already because formats had not been harmonized or had not been explained, so this is also an issue. Again, in the machine-actionable world we are not there yet; we do not have a complete harmonization in place.
It is also important, within a dataset, that you cite the software you are using: you should state what you used to process the data and to visualize the data. This is not mandatorily required by most of the data repositories out there, so it is something you have to think about as a scientist yourself first, and then include this information. On the user-interface side it is often possible to include it. On the metadata side, it may have been included in the metadata schema in the last years, but that is only a recent development, and some data repositories do not offer an option in their metadata templates to highlight the software used. So this is another aspect.
As a recommendation: set a license. In the first place it does not matter which one, just choose one; that is the most important thing. We tend to use the Creative Commons licenses, CC BY the most, because we think that follows good scientific practice the most.
If applicable, you should also provide additional information on legal conditions that may apply. If you have embargoed data, or data that has restricted access and cannot be made open, or only on request, then please provide some background and context information and a way to get in touch with you, in case the data could be accessed anyway for certain scientific purposes; that could be a cooperation project or something else established on the basis of the dataset.
And specify your provenance information. Again, we are not in the digital age of the smart lab yet, but hopefully we are getting there. Until then we would like to ask you to please specify any provenance information you may have in the metadata, and maybe also in technical appendices or anything else you include in this kind of dataset. Choose your license in a way that also covers how you wish to be cited, and state that citation wish clearly.
And for the last point: if there is one, please use the community standard for data archiving and publication; if not, then explain why you used a particular parameter setting or something else.
The last point here is that you actually request that repositories in your field of study collect these details. I know that as a scientist you want to have your user interface with its submission fields, and you just want to click, give the data away, be done with it, and then continue in your lab with the more important part of your work. But the repositories surely need feedback; they need things like "I want my ORCID iD included with my author details". Especially the small local institutional repositories, which have been popping up in the past years all over the place at your local universities, really need your support, and they need you to tell them what they should actually be including. So this feedback is really important for them.
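The recommendations for data submitters above (context, data state, variables with units and formats, software, license, provenance, citation) can be pictured as a small machine-readable record with a completeness check. The following is only a sketch: the field names and example values are invented for illustration and do not follow any particular repository's metadata schema.

```python
import json

# Hypothetical field names, chosen for illustration only; they do not
# correspond to a specific repository's metadata schema.
RECOMMENDED_FIELDS = [
    "purpose", "collection_date", "conditions", "data_state",
    "variables", "software", "license", "provenance", "citation",
]


def missing_fields(record):
    """Return the recommended fields that are absent or empty in record."""
    return [field for field in RECOMMENDED_FIELDS if not record.get(field)]


# Made-up example record; several fields are deliberately left empty.
record = {
    "purpose": "Example measurement campaign",
    "collection_date": "2019-06-01",
    "conditions": "laboratory, room temperature",
    "data_state": "raw",  # e.g. raw, processed, secondary, publication supplement
    "variables": [
        {"name": "temperature", "unit": "degC", "format": "decimal number"}
    ],
    "software": [],       # software used to process and visualize the data
    "license": "CC-BY-4.0",
    "provenance": "",
    "citation": "",
}

print(missing_fields(record))
print(json.dumps(record, indent=2))
```

A check like this only tells a submitter which descriptions are still missing; what a repository actually accepts depends on its own schema and submission form.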
OK, now, before we move on to tidy data: we were asked whether we could give a first assessment and go into a bit more hands-on detail with a repository. We can maybe provide some hints on how you can see, from the structure a repository provides, that it is probably on a good way towards being FAIR and maybe should be used. Of course scientists like to talk about their own research disciplines, so I will talk a bit about the PANGAEA repository, which is a major data publisher for environmental sciences.
It has been there for more than 10 years as of now, and I would say it is one of the largest repositories by now when it comes to environmental sciences. As PhD students in the area of climate sciences we got in touch with it, and some even earlier, for example during a master's degree. Usually you get a user account, you can search across datasets, you can submit datasets, and you can learn some more about PANGAEA. As you can see, climate science is also a wide field, coupled closely to the environmental sciences, so lots of different disciplines are covered here.
The small number you can see here is the number of datasets actually included in this repository. So again, here you have a human-readable interface with nice pictures, so that you get to know the repository a little bit. You can click on, you can read the background, and you see the operators. Very good: they clearly state their policies and recommendations regarding good scientific practice, you can see their public funding, and they consider themselves well in line with the FAIR guiding principles. You get more information on interoperability; you can click on the team and see the persons behind it, the data holdings, and some statistics.
They are a member, and this is very important in climate science, of the ICSU World Data System. That means that as a researcher in that field you will know that they are certified when they can put this logo on their website, as a certified repository in the World Data System. It means that they have to have an exit strategy, they have to have good data policies in place, and so on. So much for the human-readable side.
I did some research already and looked up a dataset. This is the first impression you get in PANGAEA of this dataset. Before you see anything else, and that is typical for climate science, you get the geographical information. You can zoom out as well, but you have to know that this is actually in the Antarctic, so it is a nice sample from the Antarctic. Down here you get the citation of the data: you see the author and the year 2000, that it is on temperature, a profile from the Antarctic, and that it is a supplement to a publication. This standard display is characteristic of PANGAEA.
You can get all kinds of different citation formats here. You have your abstract, as usual; you have the project, which is also linked; you have the coverage again, which is so important in climate science, with latitude, longitude, date, and start and end time. And again, as we mentioned on the first or second day, you should use this kind of machine-readable format and so on. But all this is the human landing page, so that is the human-readable version.
Now, when you look at the source code here, in the first part we still have this geographical information, and then, as you scroll on, you get very quickly to the metadata information, including the DC mapping, so Dublin Core, which is very clearly displayed. So the metadata is completely machine readable. You get the information here again: the identifier, for example, is included with the URI, and the schema, with its URI, is of course identified.
If we move to line 51, this one, we also get a quick introduction to how PANGAEA exposes its metadata. They are using JSON-LD to, again, provide the metadata information. When you go down, here comes the abstract, just for reference; but when you go further down, you see that they have also included types here, which are the variables that are measured. So you can actually search inside the datasets by column, for example for the sediment temperature, given in degrees Celsius, and this is given in JSON-LD. This is important because it makes not only the metadata but the dataset itself a bit more machine readable.
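To make the idea of variables described in JSON-LD concrete, here is a minimal sketch of how a program could pull measured variables and their units out of a landing page. The HTML snippet is invented for illustration and is not actual PANGAEA output; it assumes the page embeds a schema.org `Dataset` description with `variableMeasured` entries, which is one common way to express this.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.documents.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False


# Invented landing page for illustration; names and units are assumptions.
LANDING_PAGE = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org/",
 "@type": "Dataset",
 "name": "Example temperature profile",
 "variableMeasured": [
   {"@type": "PropertyValue", "name": "Depth", "unitText": "m"},
   {"@type": "PropertyValue", "name": "Temperature, sediment", "unitText": "deg C"}
 ]}
</script>
</head><body>human-readable landing page</body></html>"""


def measured_variables(html):
    """Return (name, unit) pairs for every variable described in JSON-LD."""
    parser = JsonLdExtractor()
    parser.feed(html)
    return [
        (variable["name"], variable.get("unitText"))
        for document in parser.documents
        for variable in document.get("variableMeasured", [])
    ]


for name, unit in measured_variables(LANDING_PAGE):
    print(name, "in", unit)
```

Because the variable names and units sit in a structured block rather than in free text, a harvester does not need to understand the layout of the human-readable page at all.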
Machine readability has a limitation here, though, because these parameters may well differ when you move from dataset to dataset. Some may state the temperature in degrees Celsius, others in kelvin or Fahrenheit, and if you only search for, say, temperature in sediment, you will not know whether this sample from the Antarctic would be comparable to one from the Arctic. We will have an example of that with a package later on. So this is basically what I wanted to show you. Now you have to imagine actually getting those parameters out of all the datasets that are published in all of the other repositories.
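The unit problem described here can be sketched in a few lines: before two temperature values from different datasets can be compared, both have to be brought to a common unit. The unit labels and the two example records below are assumptions made up for this illustration; real datasets may spell units in many other ways.

```python
def to_kelvin(value, unit):
    """Convert a temperature to kelvin; unit labels are illustrative assumptions."""
    unit = unit.strip().lower()
    if unit in ("k", "kelvin"):
        return value
    if unit in ("degc", "°c", "celsius"):
        return value + 273.15
    if unit in ("degf", "°f", "fahrenheit"):
        return (value - 32.0) * 5.0 / 9.0 + 273.15
    # An unknown unit is exactly the case where values cannot be compared safely.
    raise ValueError(f"unknown temperature unit: {unit!r}")


# Two made-up records that report the same physical quantity differently.
measurements = [
    {"dataset": "antarctic_example", "value": -1.5, "unit": "degC"},
    {"dataset": "arctic_example", "value": 271.65, "unit": "K"},
]

harmonized = [round(to_kelvin(m["value"], m["unit"]), 2) for m in measurements]
print(harmonized)
```

The conversion itself is trivial; the hard part in practice is that the unit has to be stated in the metadata in a form a program can recognize, which is the harmonization that is still missing across repositories.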
is is 1 of the the 1 of the better ones let's say like that is to include education l the power meters but we have many of us who just focus and which just focus on the standards and me data
power meters as so like like for example here like again at that clean and causal where the needed data you need for a citation is included but not the meat and not the data not the information you actually need when you want to search inside of the data set so this this and this is just 1 example of being being FAR which and which which can be a more or less implemented depending on the repository you're looking at no you all maybe have used to a data repositories and by yourself in the past and I would like to invite you to to have a closer look if they use Jason B. any other language to actually describe not only at the needed data but also give it a try to describe the depend meters they you in which I used inside an up-to-date as that and if they can be found at at when you have for example an excess you can exist this days now the will web content negotiate negotiation and then it you can also search inside of the app dataset and yet maybe it would be interesting to see if there are any order repositories on the wall by which also included or what kind of to date time and data as not stay out of 4 and so that's all I basically about this we want to show you weight of
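What searching inside such a machine-readable record could look like is sketched below in Python, using only the standard library. The record is an invented, heavily shortened stand-in written in the style of a schema.org Dataset description; the DOI, the names and the units are placeholders and not taken from any real repository entry, and `find_variables` is a hypothetical helper.

```python
import json

# Invented, shortened JSON-LD record in the style of schema.org/Dataset.
# The identifier, names and units are placeholders for illustration only.
RECORD = """
{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "identifier": "https://doi.org/10.9999/example.1",
  "name": "Example sediment core",
  "variableMeasured": [
    {"@type": "PropertyValue", "name": "Depth, sediment", "unitText": "m"},
    {"@type": "PropertyValue", "name": "Temperature, sediment", "unitText": "deg C"}
  ]
}
"""


def find_variables(record_text, keyword):
    """Return (name, unit) pairs of measured variables whose name contains keyword."""
    record = json.loads(record_text)
    hits = []
    for variable in record.get("variableMeasured", []):
        name = variable.get("name", "")
        if keyword.lower() in name.lower():
            hits.append((name, variable.get("unitText")))
    return hits


print(find_variables(RECORD, "temperature"))
```

A record that only carries citation metadata would simply return an empty list here, which is the limitation described above: without the measured variables and their units, a machine cannot tell whether two datasets are comparable.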
Now, about tidy data. One of the sayings that you will find quite often is that tidy datasets are all alike, but every messy dataset is messy in its own way. That is a variant of a Tolstoy quote about happy and unhappy families. So what is tidy data? A different name for it is long-form data, which I presume most of you have not seen so far, because in Excel and other table-like programs you usually have the wide form. On wide-screen monitors it is quite easy for the human eye to have different columns for the same observation, for example. This kind of transformation from the wide format to the long format, in the vertical dimension, makes a tidy dataset, but that is not all of it. There are some rules.
First of all, if you have one table, there should only be a single type of data in it. The variables are the columns, so one column should also include exactly one variable. As you see here, A is, for example, a measured variable. Does anybody have a good guess what A could be? Some physical property... absorbance, thank you. So take absorbance: we are measuring a bunch of samples. A1 would be the first sample, and we are measuring it three times, A2 a second sample that we are measuring three times, and A3 a third sample that we are measuring three times. Putting all of these absorbance values into one column, one value per cell, would be a tidy dataset, whereas the technical replicates would be in the rows here. The rows are also called observations, and at the very right of a row is usually the actual value that was observed. All of the other columns are, as I mentioned, variables that have been recorded. Staying with the absorbance example, they could be different treatments for some kind of sample that you have collected, and in the end there is one absorbance measurement, or observation. In summary, this has the outcome of having exactly one value per cell.
The variable names are the column headers, as you can see here. In an Excel-like setting you would maybe have a column that is called ID, for a patient ID or sample ID, and then you would have a different ID in each row; that is not super tidy yet. As I mentioned, all the table-like programs nudge you towards this wide format, and it is very human readable, but as we have taught you this week, I hope, the FAIR principles are for the most part about machine readability. Therefore we are going to have a little exercise now about what kind of structure this little dataset, for some patients and two treatments, actually has.
So these right now are the column labels, treatment A and treatment B, and the row labels are the patient names, and here we have some kind of measurement values. The two tables here are exactly the same, just pivoted by 90 degrees. So what would you think the variables are here? Which measured variables do we have?
Exactly, the names are a variable as well, so we have three variables. We have the treatment type, basically, and the treatment type has the values treatment A and treatment B. We have the variable person, or patient, whose values are the actual names as strings. And the observations are the third variable, which we may call result, or score, or measurement, or whatever.
So, actually, a tidy version of this table would look like this. It now has many more rows, so it is longer, and we of course also have some repetition in the values, because each person was treated twice. But the advantage here, or one unexpected advantage maybe, is that we can also notice and infer why, for example, values are missing, or what the meaning of a missing value is. Surely this person was treated, but for some reason the result was not captured. There may be different types of missing values: the data could not be measured for some reason, or it may have been removed as an outlier. If we have the data in a tidy format, we are much more easily able to notice that something has happened, and we can start inferring what has happened. So, as I said, this is a tidy dataset because each value belongs to exactly one variable and one observation. I talked about the conclusions about missing data, and we will later get into the tidyverse, which is a set of packages that help you work with this kind of data and help you produce data like this.
And in the Python universe, pandas, for example, can work with this kind of data as well, so it is not a dead end. It is maybe a little bit less human readable, but it is definitely more machine readable, and therefore, especially for larger datasets, in the end much more effective.
The comment here was about what we named, quite generically, "result". Exactly: you can be much more precise about what the actual result is and what it means. As we saw in the PANGAEA example, you could have a really specific parameter name here, with the unit, so it would be much clearer what this number actually means. Also think about vocabularies and ontologies: what this result should be called, this column header, might already be defined in your community, if you stick to a standard. So here you may save yourself the trouble of finding a good name, but then everybody else will have to start thinking about it, and maybe parse the information from the table caption, or from the materials and methods section, or from the results section, from the unstructured text basically. So that is an advantage. Thank you.
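The reshaping from the wide to the long form discussed in this exercise can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the patient labels and numbers are invented, `None` stands for a result that was not captured, and `to_long` is a hypothetical helper rather than a function from any package mentioned in the talk.

```python
# Wide, spreadsheet-style table: one row per patient, one column per treatment.
# Labels and numbers are invented; None marks a result that was not captured.
wide = [
    {"patient": "Patient 1", "treatment_a": None, "treatment_b": 2},
    {"patient": "Patient 2", "treatment_a": 16, "treatment_b": 11},
    {"patient": "Patient 3", "treatment_a": 3, "treatment_b": 1},
]


def to_long(rows, id_column, variable_name, value_name):
    """Turn every non-id cell of a wide table into one row of a long table."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column != id_column:
                long_rows.append({
                    id_column: row[id_column],
                    variable_name: column,
                    value_name: value,
                })
    return long_rows


tidy = to_long(wide, "patient", "treatment", "result")

# In the long form a missing value is an explicit row that can be inspected.
missing = [row for row in tidy if row["result"] is None]

for row in tidy:
    print(row)
print("missing:", missing)
```

The three variables from the exercise (patient, treatment, result) each end up in exactly one column. The value name is passed in as an argument so that a more precise header, ideally one with a unit that is already defined in a community vocabulary, can replace the generic "result".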
All right. Another topic, which we will probably talk more about on Friday as well, is how to cite stuff. A little show of hands: who is really happy with the state of software citation, or first, who has tried to include a software citation in a paper? OK, about half the people. Which reference managers do you use, and are you really happy with one of them for solving this problem? Zotero is one suggestion. Several more use Zotero. Then there is JabRef, OK. Mendeley as well, OK. Would it be possible that most of the suggestions we heard just now are, in the end, compatible with BibTeX? There seems to be some agreement. OK.
So I have drawn here the workflow of where this citation metadata comes from. Obviously the author of a paper, or of a dataset, or the developer of a software at some point has to provide the citation metadata. In some cases it may also be generated automatically, for example from the DESCRIPTION file of an R package, or from other community-standard, programming-language-specific metadata files. Then, as we have seen, some repositories expose this information quite nicely, have linked data, and even make the content of the data nicely searchable. A user finds a dataset there and can also see what the citation information is. Most of you use some kind of reference manager, so the next step would be to import it. Some offer one-click buttons, sometimes you can copy and paste a snippet and import it like this, and some do a lookup, where you just throw in an identifier like the DOI and a web service provides the citation metadata. In the end it all has to be inserted into some kind of document and then rendered according to a certain style. Because of time constraints I will skip the developer section here. I have given a talk about this topic before, when I got into it, and it is just a few minutes; the people watching the video can maybe pause and hop over to the other video for a minute.
The impression I got was that BibTeX, although there would be other options, covers most of this workflow. For example, you can import a snippet from a repository, it is OK to write by hand, and R, for example, generates citation snippets for you for packages. In the end there is also a huge number of BibTeX styles and a huge number of renderers for your citations, and in between there is a large number of tools that can either import BibTeX or export it. So this one probably covers the widest range. It is not perfect in each case, especially for software citations: there are few styles that recognize the @software citation key as an entry type in itself and thereby, for example, also render versions and other software-specific metadata correctly. But it is a chicken-and-egg problem, right? You have to start somewhere. So I want to recommend a few things. If you are publishing software, use the @software key. @misc is often used, or @manual in case you are referring to the help pages of a package, for example. And if you are knowledgeable in BibTeX, or if you know someone who is, maybe sit together with them and implement the update for one style that is relevant in your field, so that it renders a citation of a software or a dataset usefully.
There is also the Citation File Format, which makes it a bit nicer to supply the citation metadata. It is a YAML-based format, a bit more human readable than BibTeX, with maybe fewer brackets, for example, but so far the end game for it is to convert into BibTeX. There is also CodeMeta JSON; that is a format that is really just for machines. There is, for example, an R package to generate it. As far as I know it is exposed on some websites so far, but I do not know any citation manager that can also import it from there. And as we have seen, most of the import options offer BibTeX or RIS anyway.
To follow the changes in this topic, there is the FORCE11 Software Citation Implementation Working Group on GitHub, where you can read up on this discussion and where the sources of this slide also come from.
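To make the recommendation about the @software key a little more tangible, here is a minimal Python sketch that formats citation metadata as a BibTeX-style `@software` entry. Everything in it is hypothetical: the helper `software_bibtex`, the author, the title and the DOI are invented for illustration, and whether the version field is actually rendered depends on the bibliography style, as discussed above.

```python
def software_bibtex(key, meta):
    """Format a dict of citation metadata as a BibTeX-style @software entry."""
    lines = ["@software{" + key + ","]
    # Only fields that are present are written, in a fixed order.
    for field in ("author", "title", "version", "year", "doi", "url"):
        if field in meta:
            lines.append("  {} = {{{}}},".format(field, meta[field]))
    lines.append("}")
    return "\n".join(lines)


# Invented example metadata; none of it refers to a real publication.
entry = software_bibtex("doe2018", {
    "author": "Doe, Jane",
    "title": "An Example Package",
    "version": "1.0.0",
    "year": "2018",
    "doi": "10.9999/example.2",
})
print(entry)
```

Styles that do not know the `@software` type typically fall back to treating the entry like `@misc`, which is why the version and the DOI are worth including as explicit fields.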
But then let's very quickly have one more example. You remember that we put a DOI on our software recently. Let me switch out of full-screen mode. As you can see on Zenodo, everything is working there: on Zenodo we have the software item type already. Zotero, for example, can be integrated into the browser, and with this little code symbol here it recognizes that this is a software item that it could import. So we want to try this. This was not prepared, so Zotero was not running. The database is loading, we are trying again, and here we are: that is the metadata that was imported. It is a Computer Program item type, the URL is there, the DOI has been extracted as well, and the library catalog is Zenodo. So it is pretty good, right?
So I would use this for a citation, and if I then noticed that the citation renders a bit strangely, I would make a little correction here, for example because my name has been put entirely into the last name field. So it is not perfect, but good.
Then we would like to go through the PANGAEA example in a bit more expanded form. I would suggest a little break before that, because then you can also start up RStudio and follow along, whether you have this code in front of you or not. It is an R Markdown file, meaning a literate programming example where you have normal prose text but also R code. Since we have seen a PANGAEA landing page already, I do not need to introduce you to that. What I now want to highlight is that one aspect of reusability really lies in these column names in different datasets: if you have the same column names, which for example also include the unit, it is extremely easy to combine them. The two datasets that I have found for you are ice core drillings from both the Arctic and the Antarctic, and as you can maybe guess, one of the things that is done with ice core measurements is of course reconstructing the temperature of the paleoclimate. So we are going to see which exact measurements are being done and which exact column names are being used for them. What we want to do now is combine these two datasets into a single diagram, to see what the temperature curve from far, far away and long, long ago has been in these two projects, or these two publications. Because the datasets are identified with a DOI, we can already say they are pretty findable, and we can also download them just over the Internet, via HTTPS, so they are also nicely accessible.
We will go to one of the pages after all, because what you saw briefly when the previous speaker scrolled through her example was that here at the bottom you have a parameter table. Each of the parameters in this dataset is named and shortly described, the unit is highlighted, and you can even search the rest of PANGAEA for other delta-18-O measurements, from water for example. 18O is an isotope of oxygen, right, and this isotope of oxygen can be used to reconstruct the temperature. So that is the measurement, and the result of it will be a temperature reconstruction. There are over 900 other datasets that use exactly this parameter, so potentially we could combine all of them into a new graphic, maybe answering new research questions. But the question for us is just: Antarctic and Arctic, what are the two reconstructed temperature curves that we get from these two datasets?
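Why column names that carry their unit make datasets easy to compare can be illustrated with a small Python sketch. It assumes, purely for illustration, that headers follow a `Name [unit]` pattern; the header strings below are invented and the helpers `split_header` and `shared_parameters` are hypothetical.

```python
import re

# Assumed header pattern for this sketch: "Name [unit]".
HEADER = re.compile(r"^(?P<name>.+?)\s*\[(?P<unit>[^\]]+)\]$")


def split_header(header):
    """Split 'Name [unit]' into (name, unit); unit is None if there is none."""
    match = HEADER.match(header.strip())
    if match is None:
        return header.strip(), None
    return match.group("name"), match.group("unit")


def shared_parameters(headers_a, headers_b):
    """For parameters present in both tables, report both units and whether they agree."""
    units_a = dict(split_header(h) for h in headers_a)
    units_b = dict(split_header(h) for h in headers_b)
    report = {}
    for name in units_a.keys() & units_b.keys():
        report[name] = (units_a[name], units_b[name], units_a[name] == units_b[name])
    return report


# Invented headers: the age is given in thousands of years in one table only.
report = shared_parameters(
    ["Age [ka BP]", "Temperature [deg C]"],
    ["Age [a BP]", "Temperature [deg C]", "Depth [m]"],
)
for name, (unit_a, unit_b, same) in sorted(report.items()):
    print(name, unit_a, unit_b, "same unit" if same else "needs conversion")
```

In this invented case the temperature columns could be combined directly, while the age columns would differ by a factor of a thousand and need a conversion first.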
So our question is: are the datasets also interoperable? It looks like it, because these parameters are being reused in several datasets. Now we want to approach this question. How should we do it? We want to get the outcome of comparing the two temperature proxy measurements from two different datasets in a single diagram. When we think backwards from that: we definitely have to make sure that both of the axes use the same variables; we need to check whether the units are exactly the same, or whether there is a factor of a thousand or of 0.1 or something in there, and convert if necessary; we need to extract the values from the datasets, for which we need to know their structure first; and we need to download the datasets in a reproducible manner. Of course I could click the Download button here on the website, so I could download it as HTML or as tab-delimited text, but we want to have a script do this for us.
So the challenge for you is a question: should we download manually? Well, I already answered that, right, we shouldn't. What do you think about the option of writing our own little download function and, for example, putting both of the DOIs into it? Who is in favor of this second option compared to the first? A small minority, OK. Can anybody think of a third option? We have manually downloading from the website, and we have programming a download function ourselves. Who can think of a third option? Please raise your hands. Several of you just shouted it out. What would you do? Was that the same as what you said? Reusing, exactly. So we can reuse something. Where would you try to find something?
So I suggested this to them, and we will see if they have already implemented the suggestion. There is a tools page somewhere: tools for data publishers, software provided by PANGAEA. OK, no, they have not done it yet. As far as I am aware, PANGAEA itself does not provide something for R. However, there is a community of scientists, rOpenSci, and that is the hint: they have a really big list of nice packages. In this case we just search for the repository name: pangaear, a client to interact with the PANGAEA database. So it is unfortunately not listed on their official site, but it has been developed by other people. Lesson learned: first, the rOpenSci community is quite a big source for such packages, and of course CRAN, the Comprehensive R Archive Network, is the biggest one for packages. We can look at their list of R packages, sorted by date of publication for example, and just search here for pangaear, and there we find it as well. And rdataretriever as a bonus as well? Nice, OK, thank you.
So there was a suggestion to also highlight this general R package, which helps with retrieving data from several repositories, or from non-specified repositories; pangaear is a specific one. Just as some of the data repositories themselves are specific to a discipline, it seems that there are also R packages which are very general and work across different repositories. Useful. More suggestions?
Yes, OK. The suggestion was to also mention curl. RCurl is a very general download tool, which would, for example, be able to resolve a DOI if you give it one, though I am not sure. So underlying these more specialized data retrieval packages there are probably RCurl functions in the end.
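What a generic HTTP tool does with a DOI can be sketched without any repository-specific package. The Python snippet below only builds a request for the DOI resolver and asks, via content negotiation, for a BibTeX representation; it does not send anything. The DOI is an invented placeholder, `citation_request` is a hypothetical helper, and which media types a given DOI actually supports depends on the registration agency behind it.

```python
from urllib.request import Request


def citation_request(doi, media_type="application/x-bibtex"):
    """Build (but do not send) a content-negotiation request for a DOI."""
    return Request("https://doi.org/" + doi, headers={"Accept": media_type})


# Invented placeholder DOI; replace it with a real one before sending.
request = citation_request("10.9999/example.1")
print(request.full_url)
print(request.get_header("Accept"))

# Sending the request needs network access, for example:
#   from urllib.request import urlopen
#   text = urlopen(request).read().decode("utf-8")
```

Specialized retrieval packages wrap this kind of request and add the parsing of the response, which is the part worth reusing instead of rewriting.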
OK, but let's continue with our example. So, pangaear. Before installing something, and it is of course easy to do that, it is just the install.packages option, just a few clicks away, how would you find out whether this package is actually useful for you, whether it has the right functions for what you want to do? Remember, we want to download a dataset and we have the DOIs. All of the rOpenSci packages have reference pages, or most of them, I am not a hundred per cent sure, and these reference pages are automatically generated from the function documentation, which we will learn about later today. There are some download options, as you can see, and this one here, pg_data, looks like it can take a DOI and do what we want. In the example for pg_data, exactly, we put in the DOI and we get a dataset back. So that seems to be what we want.
Have all of you installed pangaear successfully? Please put up a red sticky if the download or installation does not work, and a green one if you have installed it and loaded it, either by checking the check box in RStudio or with the library function call. Yes? OK, that is a very good point. rOpenSci is, as I mentioned, a community of scientists, and they do peer review on their packages. CRAN has this as well and is the biggest source; rOpenSci is especially for scientific packages and is also peer reviewed. The packages that you find on GitHub may not be reviewed, so the quality there is unknown, but it is possible to install stuff directly from GitHub. Yes?
So for dependency management there is packrat. It is also a tool in the R universe, which we will not cover today.
I asked my colleague here whether he has experience with this; I just know that it exists and have not used it myself. Looking good. Then let's just get our two datasets. As you can see, the parameter that we need to put in is the DOI, and we get a list back. In the list is another list, and then we have some metadata and we have the actual data. As you can see here, there are these two variables with a bunch of, wow, 5 thousand measurements. We can also have a look at the structure of the data with str, so that is another way to look at the structure. This was a somewhat larger dataset: as you can see, the age is present and the delta-18-O as well, but also lots of other measurements. What we need to find out in particular is whether the column names here are really exactly the same, so that we can automatically process them further. First, to extract the dataset itself, we are going to access the download and just extract the data part of it, and I am going to overwrite my initial downloads. Now we can have a look at the structure, and we have this nice table, not this list of nested lists anymore. With the names function we can check the column names of the one and of the other, and to have them compared automatically, intersect is what we want. As you can see, character by character these are exactly the same, otherwise they would not have been intersected. Then let's get to the plotting.
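The check that was just done interactively with names and intersect can be sketched in plain Python as well. The two small tables below are invented stand-ins for the downloaded data (the column names and all values are made up), and `common_columns` and `stack` are hypothetical helpers, not functions of any package used in the session.

```python
def common_columns(rows_a, rows_b):
    """Column names present in both tables, compared character by character."""
    names_b = set(rows_b[0].keys())
    return [name for name in rows_a[0].keys() if name in names_b]


def stack(tables, columns):
    """Combine several tables into one long table, keeping only the shared columns."""
    combined = []
    for label, rows in tables.items():
        for row in rows:
            record = {"dataset": label}
            for column in columns:
                record[column] = row[column]
            combined.append(record)
    return combined


# Invented miniature tables; None marks a missing measurement.
arctic = [
    {"Age [ka BP]": 0.5, "d18O [per mil]": -35.0, "Depth [m]": 100.0},
    {"Age [ka BP]": 1.0, "d18O [per mil]": None, "Depth [m]": 150.0},
]
antarctic = [
    {"Age [ka BP]": 2.0, "d18O [per mil]": -50.0, "Sample label": "x1"},
]

shared = common_columns(arctic, antarctic)
combined = stack({"arctic": arctic, "antarctic": antarctic}, shared)
print(shared)
for record in combined:
    print(record)
```

Only columns whose names match exactly survive the comparison, so a header that differs by a single space or unit would silently drop out, which is why identical column names across datasets matter so much for this kind of remixing.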
The most popular plotting library in R is probably ggplot2; the gg stands for grammar of graphics, which is a really interesting topic in itself: how plots are structured like sentences with a certain grammar, and how to build up plots in a logical and visually appealing way, even before customizing the styles, the fonts and the colors. So we are just going to use this. As I mentioned yesterday in the versioning example, I still have version 2.2.1, so I am pretty sure my example will work, but some of you may have installed version 3 in the last few days or just now. If it does not work, we know that this major version jump introduced some breaking changes in relation to this example code. It could be that my example is small enough that it slipped through the tsunami wave of ggplot2 updates.
So don't forget to load it with the library(ggplot2) call, or in RStudio you can also click the check box here. The ggplot function requests, first, the dataset, and the dataset should be in the data frame format, which we already have, as we could see here: it is a nicely visible table. One aspect of the grammar of graphics is the so-called mapping of the aesthetics. The aesthetics are the x axis, the y axis, and then, if you have elements that can be varied, for example sizes, colors and all of these things, those can be aesthetics too. We need to put in the variable names that we got from the intersect check just before, so it is best to copy and paste them. This is my one criticism of the way PANGAEA does it: they use spaces within the variable names, so we need backticks. On a German keyboard that is Shift and then this little accent key at the very top right, and afterwards you have to press Space once. So that is a backtick. It is used quite often in programming, and in Markdown, for example, to denote little blocks of code, so that it is not quoted and not italic but looks like source code in the formatting. In R you need to enclose variable names with spaces in these backticks.
We want to have the age on the x axis and the oxygen isotope measurement on the y axis. Because the first dataset is from Greenland, from an ice core project in northern Greenland, we are just going to color it green, and we want to have a point geometry. As you see here, we have four hundred or so observations, so we should expect to see about 400 dark green points when we execute only this first part of our plotting code. There is a message that some rows have been removed because the data is missing, and this was the question in the break, whether you should remove missing data or not. I would say a missing measurement also contains some information, so when in doubt I would rather leave it in. In the tidyverse, which ggplot2 belongs to, you can be pretty confident that you will be notified about missing values, but that your analysis will not run into some kind of error, because the handling of missing values is quite advanced. There are even packages to help you visualize missing data, which is also helpful in some cases.
The comment was that there is a parameter in many R functions, called na.rm, that can help you ensure that missing values do not enter into any kind of calculation. That is what I meant by saying they are handled in quite an advanced way.
So we have the one dataset here. The 0 here is the present time, and we are going to do something about this, because we are of course drilling into the past when we are drilling an ice core. In a minute we will flip the axis so that the left is the past and the right is the present. But first let's introduce our other dataset. We are using the geom_point function, and geom_point, like all the other geom functions, can also be passed a dataset as an argument. So if we execute the second geom_point here with a different dataset, it will try to use the aesthetics from before, and because the column names are character by character exactly the same, this will work. I execute, and again we get the same message about missing data, and there we go. The black here is the default. Antarctic ice is not usually colored black, but we will leave it like this. So we already have our two datasets here on the same x axis. You can already see that the core from Antarctica was apparently a lot longer, or was measured more carefully; we don't know yet.
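The behaviour behind the na.rm comment can be mimicked in a few lines of Python. This is a sketch of the idea, not of R's implementation: `None` plays the role of a missing value, and the `mean` function and its `na_rm` argument are hypothetical names chosen to mirror the R parameter.

```python
def mean(values, na_rm=False):
    """Arithmetic mean in which a missing value (None) propagates unless na_rm is set."""
    present = [value for value in values if value is not None]
    n_missing = len(values) - len(present)
    if n_missing and not na_rm:
        # The result is unknown as long as any input is unknown.
        return None
    if not present:
        return None
    return sum(present) / len(present)


measurements = [1.0, None, 3.0]
print(mean(measurements))              # missing value propagates
print(mean(measurements, na_rm=True))  # missing value is dropped first
```

Keeping the missing value in the data and dropping it only at the point of calculation preserves the information that a measurement was attempted, which matches the advice above to leave missing data in when in doubt.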
OK, but that also means, of course, that it reaches back further, unless the compression of the ice is much stronger there. I should not go into the glaciology topic too much. The last thing we wanted to do is to reverse the x scale. There is a specific function for that, scale_x_reverse, and there is probably also one for reversing the y axis. We reverse it because the age means the past. So we are going to execute this again, and there we go: we have our two different datasets remixed in a single diagram, and we can now answer another scientific question, which we could not have if we, for example, only had one of the datasets, or two datasets but in an incompatible, non-interoperable, non-reusable form.
Yes? All right, so, in similarly impromptu style, let's have a discussion about styling the plot. One aesthetic that we could use is called alpha. That is the alpha channel in the color of a pixel, its transparency or opacity, and I am not sure whether 1 is completely transparent or completely opaque, so we are just going to use 0.5 to start. Because I want to override the aesthetic here, I need to wrap it in aes, and you put it at the very top? Yeah, OK, so it is always inherited downwards: before you put it into both point layers, better put it into the general mapping of the whole plot. This is to avoid so-called overplotting. As you saw, it was not clear, for example, where the points were very dense and where they were just somewhat overlapping, and with 0.5 opacity we can see that some peaks, for example, are not as overplotted anymore. Or we go even a bit lower. It is just not very visible. I see it here as well, but it is not exactly what I was expecting, so we would look it up, for example, in the ggplot2 help. So that is what happens when you introduce unscripted examples. There is no alpha here in this help page; I have been in the wrong help file. So we will just continue, or, more precisely, we will conclude with this example.
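Why transparency helps against overplotting can be shown with a little arithmetic, sketched here in Python. The sketch assumes the common convention that an alpha of 1 means fully opaque and 0 fully transparent, and it assumes that overlapping points of the same color are composited one over the other; `stacked_opacity` is a hypothetical helper.

```python
def stacked_opacity(alpha, n_points):
    """Combined opacity of n identical points drawn on top of each other."""
    # Each point lets a share (1 - alpha) of what lies beneath it shine through.
    return 1 - (1 - alpha) ** n_points


for n in (1, 2, 3):
    print(n, stacked_opacity(0.5, n))
```

With an alpha of 0.5, a single point appears half opaque, two overlapping points reach 0.75 and three reach 0.875, so dense regions become visibly darker than sparse ones, whereas fully opaque points look identical no matter how many lie on top of each other.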
That is on the technical level: if the dataset files already use the exact same column names as those of other colleagues in your field, that already helps. That is the point I want to make. To answer the question: the age is on the x axis, and on the y axis is the oxygen isotope 18, or the delta of it, to be precise. That is a proxy, a physical measurement from which to calculate backwards what the temperature was when the snow fell down and gas bubbles were enclosed in the ice. Of course the water contains oxygen, and you can calculate back to the temperature that prevailed before the snow was compressed to ice. That is the measurement here. OK, scientifically this is not really a super interesting example, although for the climate scientists it may be a nice one, but as I mentioned, we can conclude that this very technical aspect of using the same variable names helps a lot for remixing datasets.
As for a PDF of it: well, let's download it. Here is a very basic grammar example. We have the data, and the data is put into a geom, like lines or bars or points, and together with the coordinate system this already forms the very simplest grammatical plot. But there are even more grammar aspects to it. The data, the mappings and the geoms are the absolutely required functions, which is what we focused on here as well, but because we flipped the scale, for example, we used one non-required level of this grammar. Thanks, and note that this was in RStudio itself, under Help, Cheatsheets. Great.
Did anybody encounter any plotting errors? Or did everybody get the diagram? Put up a green sticky if you saw the diagram in pretty much the same way as I did. So, as you can see, not every update is a cause for concern. And now to replot it without the alpha.