Provenance and Social Science data - 15 Mar 2017
Formal Metadata

Number of Parts: 4
License: CC Attribution 4.0 International: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/35978 (DOI)
Transcript: English(auto-generated)
00:00
Today we're going to be speaking about provenance and social science data. You should be able to see on screen we're showing our data provenance community page; we have a data provenance interest group, and if you're interested in that you can contact us through the contacts on that page.
00:21
We have our speakers here. I'm Kate LeMay. I'm from ANDS and I'm one of the research data specialists there, and we have George Alter, Steve McEachern and Nicholas Car. We'll give each of them a little bit of an intro when we get to their point in speaking. So as I mentioned this is part of a series. Today's our first one. So I'd like to introduce Steve and Nick who will be speaking first.
00:48
So Steve is the director of the Australian Data Archive at the Australian National University. He holds a PhD in industrial relations and a graduate diploma in management information systems and has research
01:01
interests in data management and archiving, community and social attitudes surveys, new data collection methods and reproducible research methods. Steve has been involved in various professional associations in survey research and data archiving over the last 10 years and is currently chair of the executive board of the Data Documentation Initiative.
01:20
And Nick, Nicholas Car, is the data architect for Geoscience Australia (GA). In that role he designs and helps build enterprise data platforms. GA is particularly interested in the transparency and repeatability of its science and the data products it delivers. For these reasons Nick implements provenance modeling and management systems in order to represent and store information
01:43
about data lineage: what was done, who did it and what they used to do it. Prior to working at GA, Nick was an experimental scientist at CSIRO and researched metadata systems, provenance, data management and linked data. He currently co-chairs the International Research Data Alliance's Research Data Provenance Interest Group, which the
02:05
ANDS Provenance Interest Group works with, and through that and other groups assists organisations with provenance management adoption. Okay, thanks Kate. Alright, so this is a very quick introduction to Prov. So Prov is a provenance standard, and what you see on that first slide there is a
02:25
very, very simple diagram of a little provenance network and I'll discuss some of that as we go. So it's not just a frivolous diagram, it actually has some meaning. Okay, so the outline for today. So what is Prov? I'm just going to mention that very quickly and then I'm going to get to how do I actually use this thing in a couple of different ways.
02:44
So first I'll talk about modeling, then I'll talk about how do I actually manage the data once I've collected or made provenance data and then I'll talk about using Prov with other systems. So what is Prov? So Prov is a W3C recommendation. So W3C is the World Wide Web Consortium.
03:04
So it's one of the governing bodies of internet standards and they don't issue any documents called standards, they issue documents called recommendations. So Prov is a recommendation, it's its top level of standard I suppose. Other standards by the W3C are things like HTML, so I'm sure everyone's familiar with HTML at least to some extent.
03:24
Prov itself was completed in 2013 and sort of formalized by the end of that year. So it's only a couple of years old and a large number of authors were involved in Prov. There were several initiatives to make provenance standards before Prov over the last perhaps 20 years.
03:44
And many of the authors involved in those standards such as PML and OPM, I'm not going to elaborate all that. So if you're interested in those previous standards just Google them. Many of the authors involved in those initiatives were involved with Prov. So Prov really does know about those other initiatives and it's simpler than
04:01
those precursors because it's trying to do a sort of a high-level standard. It doesn't do as many of the tasks that those precursors do, but it certainly represents the very important bits that they came up with. Another thing to say about Prov is there's no version 2 planned any time soon. Why am I bringing this up now? Well, it's a pain for people to have to deal with standards and then version 2s and 3s and 4s of standards.
04:27
Prov doesn't quite operate like that and I'll explain how. It is what it is and there are ways to extend it and use it in different circumstances, but it's unlikely we're going to see any version change in the next few years I would think.
04:43
And it's seen good adoption. Prov is really the only international broad-scale provenance standard, and as a result people are happy to adopt it in lieu of really anything else. Right, so Prov is actually a collection of documents and I've just listed them there. I'm not going to go through them all in great detail, but there is
05:01
an overview document and then certain bits and pieces which are actual recommendations or standards, and additional things that just help you use the Prov family. Now the main document is the Prov-DM, the data model, and that tells you what Prov contains, how its classes operate and so on. And then there's a series of documents like an XML version of Prov, an ontology version, special notations and so on.
05:25
The only other one I'll mention is the Prov constraints, which is a list of rules that Prov-compliant chunks of data must adhere to, and that works across any formulation of Prov. There's a link there to the collection of documents.
05:40
Alright, so how do I use Prov? This is about modeling: how do I actually model something using Prov, using the core of Prov and its representation? Well, I'm starting off with some negatives. So don't do it like this. Don't take a document for something, perhaps a metadata catalogue entry, and expect to shove a bunch of information into some field within that document.
06:04
So ISO 19115 is a standard for spatial datasets and it's got a field called lineage and some people expect to take Prov provenance information and stick it in that lineage field. Don't do that. Prov doesn't let you do that and I'll explain why in a second.
06:21
So that's one thing not to do. We're not going to see a single item's metadata record containing a bunch of provenance information. We could do that, but it's not recommended. What else should I not do? So this diagram here is the class model of DCAT, the Data Catalog Vocabulary, which is a very generic metadata model.
06:42
It's used in relation to things like Dublin Core and various catalog style things and we're not going to link a dataset or any other object in DCAT or Dublin Core or other standards like that to a class of provenance information. And this is true for Steve's DDI initiative as well. We're not going to take objects in DDI and link to a provenance object
07:03
that tells you the provenance of that object. That's an anti-pattern right there. Okay, so what are we going to do? Oh, and we don't even do this using Dublin Core's provenance properties. So Dublin Core vocabulary has a property called provenance and the wording for that says use this to describe lineage and history.
07:21
Prov doesn't want you to do exactly that. What does Prov want you to do? Prov wants you to think of everything that you're interested in in terms of three general classes of objects. So in the scenario, the things that you're interested in: are they things, entities? Are they occurrences, processes, activities? Or are they causative people or organizations, which Prov calls agents?
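The three-class idea described above can be sketched in plain Python. This is a hypothetical illustration with made-up identifiers, not an official PROV library; the class and field names mirror the PROV terms mentioned in the talk:

```python
from dataclasses import dataclass, field

# The three core PROV classes, sketched as plain Python types.
@dataclass
class Entity:            # prov:Entity -- a thing (dataset, file, document)
    id: str

@dataclass
class Agent:             # prov:Agent -- a person, organisation, or system
    id: str

@dataclass
class Activity:          # prov:Activity -- a process that uses and generates entities
    id: str
    used: list = field(default_factory=list)             # prov:used
    generated: list = field(default_factory=list)        # prov:generated
    associated_with: list = field(default_factory=list)  # prov:wasAssociatedWith

# Link them together: a cleaning process used a raw dataset,
# generated a cleaned one, and was carried out by an analyst.
raw = Entity("ex:raw-survey-data")
cleaned = Entity("ex:cleaned-dataset")
analyst = Agent("ex:data-analyst")
cleaning = Activity("ex:data-cleaning",
                    used=[raw], generated=[cleaned],
                    associated_with=[analyst])
```

The point of the sketch is that provenance lives in the links between the three classes, not inside any one record.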
07:45
So Prov says model everything you know about using those three classes and then link them together. And that's what Prov is all about. So how does GA use Prov? So we often process chunks of data at GA and so we have a very simple model that's using the provenance ontology
08:01
and it looks like this. There's some process. The process generates outputs. The outputs are entities. The process itself is an activity. And then there's data and code and configuration and so on that feed into that process and those are also entities. Finally, the process and the entities might be related to a system and even a person who operates that system.
08:22
So that's the model we use. Okay, so how do I actually manage the data that I get in provenance or that I get according to Prov? Well, you can create reports. So if you go and do something, a human or a system could log what they've done and they could store that information in some kind of database according to the Prov model and then you can, it's a document database but you can query that thing.
08:45
So we often have systems that sort of send reports every time they run. And you might have a form that looks like any other metadata entry form where you fill in the details and you hit enter and that sends your provenance information. But again, it's not storing it with respect to one specific object. It's linking existing objects together.
09:02
So some dataset that is produced from another dataset is going to link those two things together. For catalog things, we can link things again: if we have a catalog that has a dataset X and a dataset Y and we want to show there's a linking, we can say dataset Y was derived from dataset X and record that information somewhere.
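The derivation link just described might look like this as a minimal sketch. The triple store and the "catalog:" identifiers are invented for illustration; a real system would use RDF and the PROV-O vocabulary:

```python
# A minimal provenance store as (subject, relation, object) triples.
provenance = set()

def record_derivation(derived: str, source: str) -> None:
    """Record that one dataset was derived from another (prov:wasDerivedFrom)."""
    provenance.add((derived, "prov:wasDerivedFrom", source))

record_derivation("catalog:dataset-Y", "catalog:dataset-X")

# Query: where did dataset Y come from?
sources = [o for (s, p, o) in provenance
           if s == "catalog:dataset-Y" and p == "prov:wasDerivedFrom"]
```

Note that the link lives in the store, not as a blob inside either dataset's own record, which matches the advice above.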
09:22
Now, data set Y may record I come from data set X, but that's just a very simple little bit of provenance information. It's not a whole blob of provenance information stored within data set Y. And we can ensure that any system that has information that is provenance information like who the creator of a data set was, does so in accordance to the Prov model.
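Re-expressing a flat "creator" field in Prov terms, as just described, might be sketched like this. The function, the attribution-identifier scheme and the example names are invented for the sketch; the property names follow the PROV-O qualified-attribution pattern:

```python
def creator_to_prov(dataset_id: str, agent_id: str) -> list:
    """Re-express a flat 'creator' field as PROV-style statements:
    the dataset is attributed to an agent whose role is 'creator'."""
    attribution = f"{dataset_id}/attribution-1"  # hypothetical identifier scheme
    return [
        (dataset_id, "prov:qualifiedAttribution", attribution),
        (attribution, "prov:agent", agent_id),
        (attribution, "prov:hadRole", "ex:creator"),
    ]

statements = creator_to_prov("catalog:dataset-X", "ex:jane-smith")
```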
09:45
So in this case, if we had a data set that had a creator, we would say the data set was associated with an agent and the agent had a role to play and that role in this case was creator. That's now a Prov expression of that relationship. And for databases, it can be very difficult. I can't explain it in depth here,
10:01
but there's many ways in which databases could store provenance or Prov-related provenance information, but they would need to be able to show that they can actually export their provenance content according to the Prov data model. You actually have to prove that if you want to say that you're compliant with the standard. Okay, so fairly quickly, how do I get Prov to work with other systems?
10:22
Well, we can fully align our system, whatever the system is. So I've used a theoretical example of metadata system X. How do I align metadata system X with Prov? I could classify all the things in metadata system X according to Prov. Prov requires a data model for metadata system X, not just encoding formats.
10:43
We can't just deal with XMLs and so on. We actually have to have a conceptual model and then we can say this class of thing in metadata system X is the same as this class of thing in Prov. Now Prov's only got a few classes, so that's usually pretty easy to do. But it will definitely prompt you to do things that you wouldn't normally do.
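A class alignment like the one described could be sketched as a simple mapping. "Metadata system X" and its class names are hypothetical, as in the talk; classes without an obvious PROV counterpart are simply skipped here:

```python
# Alignment of a hypothetical "metadata system X" with PROV:
# local classes map onto PROV's few classes; unmapped ones are skipped.
ALIGNMENT = {
    "Dataset": "prov:Entity",
    "ProcessingStep": "prov:Activity",
    "Organisation": "prov:Agent",
}

def extract_prov(records: list) -> list:
    """Keep only records whose local class aligns to a PROV class,
    re-typed with the PROV class name."""
    return [{"id": r["id"], "type": ALIGNMENT[r["type"]]}
            for r in records if r["type"] in ALIGNMENT]

records = [
    {"id": "x:ds1", "type": "Dataset"},
    {"id": "x:theme7", "type": "Theme"},          # no PROV counterpart: dropped
    {"id": "x:step3", "type": "ProcessingStep"},
]
prov_records = extract_prov(records)
```

This also illustrates the compliance point made earlier: a system claiming alignment has to be able to export valid Prov, which here is the filtered, re-typed record list.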
11:01
You may have to tease apart some of the objects that you know and love into things that Prov recognizes as different objects. You could do a partial alignment. You could take your metadata system X and only acknowledge that some of the things in that scenario are Prov-understood things. So maybe you've got a metadata model that talks about all kinds of stuff and one of the things it talks about is a dataset
11:21
and you say your dataset is the same as what Prov thinks of as an entity and maybe you ignore all the other things. You would still need to demonstrate that you could extract valid Prov out of that and not all the other stuff but that would be one way to do it and you could also link to things not in your own data model
11:40
if you also classified those things according to Prov. The last scenario you could think about is to just duplicate your obviously not as good systems and use Prov and that would require you perhaps to make either a new dataset of Prov information or a data store and put that information somewhere. And that's it. Thank you very much Nick. So we'll move on to Steve.
12:02
Nick's talked about the general Prov model that is increasingly getting used in various different spaces. I'm going to talk specifically about the various ways of thinking about provenance in what we're doing in the social sciences and particularly within the standard that we utilise
12:20
and I'm now the Director for the Data Documentation Initiative. Part of the reason we've connected these two together is we're now looking at how we can leverage the Prov standard inside DDI in point of fact so Nick and I and a group of others have been working on how we might go about this. I'm not going to touch too much on that but I'll return to that at the end.
12:41
I want to talk more generally about how we might think about provenance at different stages in the data lifecycle, different stages in the researcher and the data management experience and how we've progressed thinking about provenance over that time. Just to give a sense of what sort of things we can do already
13:02
and how can we increasingly capture a provenance in what we do. Quickly and for those who don't know, the Australian Data Archive, we've had various names over time. We're going to do a quick introduction. We've been around for a little while now based here at the Research School of Social Sciences at ANU
13:20
and our mission is to collect and preserve Australian social science data on behalf of the social science research community in Australia and internationally. We've developed a collection of over 5,000 datasets now, across over 1,500 different studies, as we call them, or projects, from lots of different sources, with lots of different provenance, from various different locations: academic, government and private sector.
13:45
As our holdings have developed, our understanding of provenance has developed alongside that. Maybe we didn't call it that at the time, but over 35 years I think that's always been underpinning a lot of what we've done, and it's helping researchers who might be the secondary users of our data
14:02
to know where did this come from, what was it used for and how might I use it in the future; that's really the emphasis there. For those who don't really know what we're talking about when I use the term data archive, we're using the definition of trusted systems out of a project done by the Social Sciences and Humanities Research Council of Canada,
14:22
kind of the equivalent in Canada for the ARC. Accessible and comprehensive service empowering researchers to locate, request, retrieve and use data resources so you've got to be able to find and understand it in a simple, seamless and cost-effective way while at the same time protecting privacy, confidentiality and intellectual property rights of those involved.
14:42
Part of why we're interested in provenance is really that last point. One is to help researchers understand where this came from, and the other is to recognise and acknowledge the intellectual property that's been developed in those resources over time. Okay, so I'm going to give a brief introduction to the DDI standard and its different flavours.
15:01
As Nick pointed out, having multiple versions is not always much fun. We're up to version 4, we're about 20 years old now so I think that's not too bad from Nick's point of view. And how we've sort of captured what we might think of as different forms of provenance over time. So I've got the website there, the ddialliance.org website
15:20
if you're interested in knowing more. You can go and explore the different versions of the standard there. So what is DDI? It's a Structured Metadata Specification developed for the community and by the community. So particularly the social science data archives that exist in most OECD countries
15:44
It's used in about 90 different countries around the world now thanks to work by the World Bank and the World Health Organisation and others. And there's two major development lines that are basically XML schemas. One's DDI Codebook and the other DDI Lifecycle which kind of correspond to version 2 and version 3 of the standard.
16:05
And I'll talk a little bit more about those in a moment. We have some other elements to it as well. Additional specifications including some control vocabularies often for things like encoding methodology, typing out data types and data capture processes
16:25
and some RDF vocabularies so that we can sort of start moving into a linked data world so you can leverage the standard, particularly the lifecycle standard into a linked data environment. The version 4 is in development at the moment.
16:41
It has been over the last couple of years and that's where the work with Nick has come on board as well. It's moving to a model-based specification so rather than being based in a particular schema we're looking to focus on the model and then its expression into various different formats. The provisional ones at this point are XML and RDF
17:02
and that includes support for provenance and process models. So we're looking at that point at how do we leverage what we know from Prov to support the provenance model within the new version of the standard. And it's managed by the DDI Alliance. So briefly on the two versions of the standard that are already in place.
17:23
So it's been around in the codebook format, which has its origins in the print codebooks that were produced by organizations like George's (ICPSR), going right back to the 1960s and 70s. So it was formalized in social science as a fairly structured way of thinking about describing data
17:40
about 40 years ago really. So the codebook version of the standard really is an after the fact description of what a dataset is about. It includes four basic sections. The document description which is describing the document that's describing the dataset. A study description.
18:00
We use the term study to describe the package of datasets that encapsulate a project. So that includes characteristics of the study itself that the DDI is describing. That includes lots of sections on authorship, citation, access conditions,
18:20
but particularly, from the point of view of provenance, we have the methodological content, data collection processes and sources, and then we also include a lot of what we call related materials: documents associated with the project that tell you something about the provenance of where it came from. This includes all the questionnaires, previous codebooks, technical reports, etc.
18:43
So from a human point of view you're starting to get into the area of thinking about provenance even though it's not really a machine actionable version of that. We also describe the files themselves, the characteristics of the physical data files, data formats, etc., the size and their structure. And then what we call variable descriptions,
19:02
the variables that are included in the data file. The simplest way of thinking about this is the columns of a tabular dataset. What does that column mean? Because in a lot of the social sciences a number is not actually a number. It represents a characteristic of some sort. For example, a five-point agree-disagree scale in a sort of a
19:23
survey question, where how you interpret the value labels becomes important. And George is going to talk about a specific project looking at how we do a lot more with the variable description and the properties of variables in a moment. So the codebook was really developed to describe things after the fact.
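The "a number is not actually a number" point can be sketched as a DDI-style variable description in miniature. The field names and the example question are invented for illustration; the idea is just that the numeric codes in a data column only acquire meaning via the codebook:

```python
# A codebook entry for one variable (hypothetical structure and content).
variable = {
    "name": "q12_trust_gov",
    "label": "Trust in the federal government",
    "values": {1: "Strongly disagree", 2: "Disagree", 3: "Neither",
               4: "Agree", 5: "Strongly agree"},
}

def decode(column: list, var: dict) -> list:
    """Translate raw numeric responses into their labelled meanings."""
    return [var["values"][code] for code in column]

labels = decode([5, 3, 1], variable)
```

Without the `values` mapping, the raw column `[5, 3, 1]` is uninterpretable, which is why the variable description carries so much of the provenance burden in social science data.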
19:42
The DDI Lifecycle model takes a more data lifecycle approach to thinking about capturing metadata and provenance. Underlying it is the model we have on screen here. This is a working model describing the different processes in the DDI framework that a dataset can go through.
20:04
Everything from conceptualizing the study in the first place through collection and processing and distribution as a sideline archiving that data and storing it away for future use and then rediscovery and analysis and repurposing into the future. So it was built with the intent of reusability
20:22
and particularly machine actionability as well. So that the metadata that's developed in a dataset can be reused in the future for the same purpose, a similar purpose or something entirely new. And in order to do that you need to be able to understand where did it come from. So embedded in that is generating metadata going forward
20:41
to be able to look backwards through the lifecycle as well. As I say, it's focused on metadata reuse. And that reuse of metadata really implies an expectation of provenance. So why DDI lifecycle? The things it can do: it's machine actionable. It's more complex.
21:00
There are 27 different schemas. It's probably overly complex, if we're being fair. It's structured and identifiable. So every metadata item is actually able to be permanently identified, managed and repurposed if that's required. It supports related standards and it supports reuse across different projects. And again, that's sort of something that George is going to touch on as well.
21:23
I'll move past this because I think there are some particular features for it that I can refer back to in the future. I want to talk very briefly about how do we think about provenance within the different versions and then pass to George to talk specifically about the projects there. So if we think about how provenance has been supported here,
21:42
The Prov model that Nick described is fundamentally a machine-actionable model. DDI codebook is not really designed for that, but it is designed, at least, to be able to describe to a human reading a catalogue entry what the provenance of this dataset was. So it includes attribution, methodology, data processing, collection,
22:01
and all the documentation we can find on what happened to the data. But it doesn't really do that in an automated way. It's really focused on a human reader: a researcher being able to come back and have a look. Similarly with variables: the question text, the variable name, and what the value labels mean are all there. DDI lifecycle is really trying to...
22:22
It was our first attempt really to look at the machine actionable problem. So can we capture this along the way? It represents, again, the information from the studies, attribution, methodology, and so forth. But particularly with variables, it's really trying to look at the reusable elements of how we might reuse questions, reuse columns of data,
22:43
and understand and reuse the basic conceptual ideas that are embedded within that. So, for example, if you've got a variable measuring employment, can I reuse that employment? Maybe the categorization that was used, the numbers that were used in the survey, and so forth. And then where we're going with DDI4,
23:01
our tagline for that is what we call DDI views, is to what extent can I actually embed a provenance model inside that framework? So now we're moving towards really recognizing the importance of provenance, both conceptually and in the physical and digital formats of data as well.
23:23
Managing codes and categories across the lifecycle. For example, managing provenance through missing values. If the value of a datum changes, how do I understand that? So we want to be able to generate this automatically: what happened at the level
23:42
of an individual datum, of a variable, or of a dataset. So we're moving progressively towards the sort of framework that Nick described, but that requires the metadata we manage to be moved forward. That's about it from me. Hi, everyone. Thanks very much to Anne and to the ADA
24:02
for inviting me to be here. What I'm going to talk about today is a project that started in October with funding from the U.S. National Science Foundation about capturing metadata during the process of data creation. So I don't think for this audience
24:23
I have to justify metadata, but the big problem that we face is how do we actually get the metadata?
24:40
It's a lot easier to describe it than it is to actually get it, most of the time. So to give you some background, I'm going to put this in the context of my home institution, which is the Inter-University Consortium for Political and Social Research located at the University of Michigan. We've been in the business of archiving social science data since 1962,
25:05
and we're an international consortium of more than 760 institutions. We were also one of the founding members of the Data Documentation Initiative Alliance, which Steve just talked about,
25:21
and we actually provide the home office for the DDI Alliance. And ICPSR has been using DDI for many years, but we're now getting to the point where we're able to build new kinds of tools that take advantage of DDI.
25:41
One of the first things that we've been doing, which we've been doing for at least 10 years, is that when you download data from ICPSR, you get with it a codebook in PDF. But the PDF is actually created from the DDI, not the other way around. So for us, the DDI is the native version of the metadata.
26:07
So what we've started to do is to take advantage of DDI to build new kinds of tools. One of the first ones we created was what's called our variable search page, where you can put in a search term
26:24
and look for questions that have been used in datasets that are like that search term. So this is an example of the results that come out of a variable search, and we are now searching over more than 4.5 million variables
26:42
in about 5,000 studies or data collections. One of the things that DDI makes possible is that we can go from this search to other characteristics of the data. So you can see here in the blue
27:03
that there are a number of things that are hyperlinked. If you click on the place I've got circled, it takes you to an online codebook, and the online codebook has a number of features, it tells you the question that was asked,
27:21
it tells you how it was coded, if the data are available online, you can go to a crosstab tool, and it also can link to an online graphing tool. And the other thing that you see on the left side of the screen is a list of the other variables in the dataset.
27:41
So you can move around in the dataset and clicking on any of those variables will bring up a display that's similar to this. Another thing you can do from our variable search screen is that if you click on these checkboxes on the left,
28:00
you can pick out a certain number of variables that you want to look at more closely, and clicking on the compare button at the top there brings you to this screen, which is a side-by-side comparison of these different variables
28:21
which come from different studies, and so you can see whether they're asking the same question, whether they're coded the same or differently, and as before, this screen is also hyperlinked to the online codebook, so you can go back and forth.
28:43
One of our more recent tools, which I think is one of the most powerful, is that you can now search for datasets that include more than one variable that you're interested in. So this is a search in using what we call
29:02
our variable relevance search that's actually in the study search rather than the variable search, where we're looking for three variables about three different things. Does the respondent read newspapers? Do they volunteer in schools? What's their race? And you can see here that the results come out
29:21
in three different columns within each study, so you can see which variables are present in each study, and as before, everything is hyperlinked to both the online codebook and the variable comparisons, so you can check on any combination of these variables
29:41
and compare them side-by-side. Well, another thing that we did, in another previous NSF project working with the American National Election Study and the General Social Survey, was to make a crosswalk of the variables that are available in those two studies.
30:04
Now, the American National Election Study started in 1948, and it's done every four years. The General Social Survey started in 1972, and it's done every two years. So we're actually going to be looking over 70 different datasets, and what we've done is created this crosswalk
30:23
where we've grouped the variables according to certain tags. We've got eight lists of tags and then 134 tags in total. The columns here, each column represents a dataset,
30:40
and there are 70 datasets. All of the variables are linked here, and I can't actually show it here, but if you hover over one of those variables, it shows you the question text for the variable. And again, you can use the checkboxes to pick out things that you want to compare
31:01
and go to the variable comparison screen. So a crosswalk like this is a tool that's actually very common. You've probably seen these before. There are two things that are different about this, though. One is that this is all keyed into the online codebook,
31:21
so you can go transparently back and forth. The other thing is that we can use this tool to crosswalk any of the 4.5 million variables in the ICPSR collection because this is drawing directly from our store of DDI metadata, and we don't have to build a separate tool for each one.
31:43
This one tool works over all of these datasets. Another thing that we did in this project was to think about how we could extend the online codebook. And so here's our online codebook that you saw before,
32:02
which has the question text and how it was coded, but this version has something new in this location here. It shows how you got to this question, and in big surveys, every respondent doesn't answer every question. There are what are often called skip patterns.
32:20
So you get asked what your marital status is, and if you're single, you go to one question. If you're married, you go to another question. Divorced people go to a third pattern. So there are different pathways through the questionnaire, and what we've done here is try to show, here's how you got to that question, which explains why some people didn't answer the question.
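Skip patterns like the marital-status routing just described can be sketched as data. This is a toy illustration; the question names and routing rules are invented, not taken from any real survey:

```python
# Hypothetical skip-pattern rules: each downstream question carries a
# predicate saying which respondents reach it. This routing logic is
# exactly the "how you got to this question" provenance the talk says
# must be captured to explain why some people have no answer recorded.
SKIP_RULES = {
    "Q2_spouse_employment": lambda r: r["marital_status"] == "married",
    "Q3_divorce_year":      lambda r: r["marital_status"] == "divorced",
}

def was_asked(question, respondent):
    """Questions with no rule are asked of everyone."""
    rule = SKIP_RULES.get(question)
    return True if rule is None else rule(respondent)

print(was_asked("Q2_spouse_employment", {"marital_status": "single"}))  # False
```

A missing value for `Q2_spouse_employment` from a single respondent is then explained by the routing, not by a refusal to answer.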
32:43
We also represented it in words down here. So we built this, and we were quite proud of ourselves for building it because this does answer the question
33:00
about who answered this question in the survey, but then we ran into a problem. So how do we know who answered the question in the survey? And the answer is that we get that information from the data providers in a PDF, and the only way we could build this demo prototype
33:22
was to have one of our staff members enter this program flow information manually into XML for one of the datasets so we could show how this works. So we showed a tool that we think is really useful,
33:41
but we reached a roadblock because we don't actually get machine-actionable metadata about this kind of information. And the problem is that when the data arrive at the archive, they don't have the question text. That's something that we at ICPSR and ADA have to type in.
34:03
They don't have the interview flow, they don't have any information about variable provenance, and variables that are created out of other variables are not documented. So the project we're working on now, which we call C2 metadata for continuous capture of metadata,
34:23
is about how do you get that. And to understand that, how we get it, you have to think about how the data are created and what happens. So first of all, the data themselves are actually born digital. People do not go around with a paper questionnaire these days.
34:40
They use these computer-assisted interviewing (CAI) programs. They're done over the telephone, or interviewers go around with a laptop or a tablet. There's no paper questionnaire. There is instead a program, and it's the program that's the metadata. So technically at the beginning,
35:02
you start with this computer-assisted interviewing system and what you get out of it is the original data set. But you can also derive from it DDI metadata in XML and there are programs, a couple of different programs,
35:23
that will take these CAI systems, the code that they run on, and turn them into XML. But what happens next? Well, what happens next is that the project that commissioned the data
35:40
is going to modify the data. There are a number of reasons for doing that. There are some things that are in the data that are purely there for administrative purposes. There are some variables that have to be changed to reduce the identifiability of individuals, some variables that need to be combined into scales or indexes.
36:02
So what they do is they write a script that's going to run in one of the major statistical packages, and the script and the data go through that software, and what comes out is a new data set.
36:21
Well, what happens to the metadata? Well, at this point, the metadata don't match the data set anymore and you would need to update the XML to fix it and nobody likes updating XML so the metadata get trashed and thrown away.
36:45
What happens then is this: after the data are revised, the metadata are recreated. We at the archive take the revised data
37:00
and extract as much metadata from it as we can so we get an extracted XML file and what about the things that went on in this script here? Well, we actually have to sit down and extract them by hand so a person has to read the script and write down what happened.
37:23
So what are we missing? Well, what we get from these statistics packages are just names, labels for variables, labels for values, and virtually no provenance information. So what we're working on is a way
37:42
that we can automate the capture of this variable transformation metadata. So our idea is this, that we're going to write software where you could take the script that was used to modify the data,
38:04
take the very same script and run it through what we're calling a script parser and pull from that the information about variable transformations and put that into a standard format which we're calling a standard data transformation language.
38:23
And then you take that information and incorporate it into the original DDI, you update the original DDI and then you've got a new version of the XML that is in sync with the revised data.
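The first step of that pipeline, reading a script and pulling out variable transformations, might be sketched like this. This is a toy regex parser for a single SPSS-style COMPUTE statement; the real project targets full parsers for four languages and a proper standard data transformation language, so the function, field names, and coverage here are purely illustrative:

```python
import re

# Toy sketch of the C2-metadata idea: turn one statistical-package
# command into a package-neutral transformation record (target variable,
# expression, and the source variables it was derived from).
def parse_compute(line):
    m = re.match(r"COMPUTE\s+(\w+)\s*=\s*(.+?)\.?\s*$", line, re.IGNORECASE)
    if m is None:
        raise ValueError("not a COMPUTE statement")
    target, expression = m.group(1), m.group(2)
    # Any identifier in the expression is treated as a source variable.
    sources = sorted(set(re.findall(r"[A-Za-z_]\w*", expression)))
    return {"target": target, "expression": expression, "sources": sources}

record = parse_compute("COMPUTE income_k = income / 1000.")
print(record)
```

A record like this is what could then be folded back into the original DDI so the XML stays in sync with the revised data.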
38:43
So this process then requires two different software tools, one that will read the script and turn it into a standard format and a second one that will update the XML and that's what we're building. So we are building tools that will work with the different software packages
39:05
and update XML. We're actually writing these parsers for scripts in four different languages SPSS, SAS, Stata and R and the reason we're doing four languages is that if you look at the column over there on the right
39:20
which is based on downloads at ICPSR in cases where the data set had all four formats you can see that there is not a single dominant format. SPSS and Stata are the most downloaded formats from ICPSR and they both have about 24%. SAS and R both have about 12%.
39:43
If we did one package we would be pleasing only a few people and we couldn't have an impact. So we're actually writing parsers for four languages. Here's something that's come out of our work that you might find interesting, and this is about why we need to have a special language
40:06
for expressing these data transformations. So here are three brief programs in SPSS, Stata and SAS that all are designed to operate on the same data
40:25
and I tried very hard to make the programs, the scripts identical and I think that I succeeded but if you run these three programs you get three different results and the key thing here is to look at the last row
40:42
that the row in which we set the minus one to be missing and in SPSS you get two missing values, in Stata and SAS one of the variables is set to a number but it's a different one in each one.
41:02
Why does this happen? Well, the reason is that in logical expressions SPSS makes the result of any expression that includes a missing value itself missing, which in most cases is treated as false.
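These divergent semantics can be simulated in miniature. This is a sketch in Python, not how any of the packages are implemented; it encodes the behaviours as described in this talk (SPSS propagates missingness through comparisons and treats the result as false, Stata treats missing as plus infinity, SAS as minus infinity):

```python
import math

# Simulate how three statistical packages evaluate "x > y" when one
# operand is missing, per the semantics described in the talk.
MISSING = None

def greater_than(x, y, dialect):
    def coerce(v):
        if v is not MISSING:
            return v
        if dialect == "stata":
            return math.inf    # Stata: missing sorts above every number
        if dialect == "sas":
            return -math.inf   # SAS: missing sorts below every number
        return None            # SPSS: the comparison itself is missing
    x, y = coerce(x), coerce(y)
    if x is None or y is None:
        return False           # SPSS: a missing result is treated as false
    return x > y

# Is "missing > 0"? Three packages, three answers.
print([greater_than(MISSING, 0, d) for d in ("spss", "stata", "sas")])  # [False, True, False]
```

The same script text thus yields different data, which is why a neutral transformation language has to make these semantics explicit.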
41:22
Stata treats a missing value as a number which is equal to infinity. SAS treats a missing value as a number which is equal to minus infinity. So both Stata and SAS actually do return a number when you have one of these comparisons so it's actually more accurate to represent the data in this way
41:45
which you wouldn't see if you just looked at the data set. So what we're doing is creating our own language, building on a language that's been created by another community, the SDMX community, called the Validation and Transformation Language,
42:01
so that we can put all three of these languages into a common core. So what are we doing and why are we doing it? So the goal of the project is to capture this metadata and automate it. If we can capture more metadata from the data creation process
42:24
we'll be able to provide much better information to researchers about what's in the data set. Automating this process we hope will make it cheaper for everyone and make it easier. And that has been one of the principles that we've tried to do here
42:43
that if we can't make it easier for the researchers they're not going to do it. So the hope here is that the software we get will make their lives easier. And here's just to acknowledge some of my partners in this
43:02
we've got partners from a couple of software firms Colectica and Metadata Technology North America, the Norwegian Center for Research Data and two of the two projects I mentioned the General Social Survey and the American National Election Study are part of the project too.
43:22
So that's my talk. Thank you very much George. We had a question that came through earlier in your talk when you were speaking about people putting variables into ICPSR and searching for them. And me has asked, when a user searches for a variable or variables
43:43
do they need to come up with the exact variable name as in the variable index? So right now what we're doing is really a text search. And when you search for variables you're searching over the variable name
44:03
and the variable label. And it can also bring up items that are in the values for the variables. But one of the problems in the social sciences is that people don't reuse questions very often.
44:21
So we don't have a tradition of reusing questions and it's very hard to find the same question in multiple datasets. The kind of search we're doing now in our question bank is frankly kind of clunky and it often misses things
44:42
and that's an issue that I'm trying to address in some other projects where we're trying to improve the way we can search over variables. Thank you very much. We've got a question for Nick as well. So Nick, we've got a question.
45:01
How widely is Prov used and what have you found to be the main challenges working with Prov noting that a V2 is not on the horizon? Is it easy to update a Prov model if a change is required? Okay, so first part first. How widely is it used? So I have a direct interest in things provenance
45:23
but aside from that I have an interest in things geospatial and I guess physical sciences data. In that community there's only one game in town. That's really Prov. But it's early days so most of the spatial, geophysical, blah blah blah, those hard physical sciences side,
45:44
they either are using their own systems or they're intending to use Prov. There's not many that are actually already using Prov but there are certainly not many that are intending to use something other than Prov. Outside of my own Geoscience Australia area, other communities that I know of, including DDI and so on,
46:02
because Prov has only been around for a few years, if people can characterize their problem in a provenance way, like they actually understand this is a provenance question as opposed to some other kind of question like an ownership or an attribution question, they fairly quickly end up at Prov.
46:22
It's certainly more widely used than any other provenance standard has ever been, and it's showing signs of being much more widely used than that, because any other initiatives in the space have been sort of swallowed up by Prov. Now, the second part of the question was what are the problems, and I've identified one already, which is that people have to know that they're asking a provenance question.
46:42
So we get a lot of questions which are synonyms for provenance questions, probably much like variable naming, where people say I'm interested in the lineage of my data or the transparency or the process flow or the ownership or attribution and those are all or could be all provenance questions. The hardest thing to work out is specifically what questions are being asked
47:06
and then if there is an existing metadata model or something in that space already, what's it doing and what's it not doing and therefore do we need provenance, a specific provenance initiative. So for instance, many metadata models have authorship, ownership,
47:22
creator information indicated in them. So if your provenance question is "I want to know datasets created by Nick", that kind of provenance question you can usually answer in other metadata systems. You have to have something a bit more complicated than that, in terms of provenance, to then think about using a provenance system.
47:40
The other thing is to move away from what I call point metadata where you've got a single thing with a bunch of properties that come from it. So a study or a document or a chunk of data with a bunch of properties, that's one way to do things, but what Prov and what other models are interested in is whole networks. Things that relate to other things. It's more complex, but it's much, much more powerful to do that.
48:02
Great. Thank you very much. So a question for George. How are sensitive variables or values controlled for during the C2 automatic capture? ICPSR has a confidentializing service on ingest. Is this process carried over to the C2 metadata project? Is this activity captured in Prov-like metadata?
48:22
So the C2 metadata model is to operate solely on the metadata, not on the data. So it doesn't really play into the issue of confidentiality.
48:40
If you're interested, in two weeks we're going to have another webinar where I am going to talk about how we manage confidential data. But in general it's rarely the case that we have to mask the metadata of a data set for confidentiality reasons. Obviously controlling the data is something else.
49:01
So we've got another question here for George. Your script parser that reads a SAS script: do researchers need to install that in their SAS package? We haven't gotten to that point yet, but probably not. Probably what we'll do, at least as a starting point, is offer it as a web service.
49:20
And what you'll do is simply export your SAS program into a text file and upload the text file to the web service and it will download a new XML file. So we've got another question here.
49:41
Does Prov support the workflow of creation and approval of provenance data, e.g. the Prov entry is proposed and has been submitted to the data custodian for approval? That's got two kind of answers to it. One is a generic Prov answer and the other one seems to be more in line with a particular repository doing a particular set of steps.
50:02
So this isn't exactly what you asked, but I'm going to answer it in a slightly different way. You can talk about the provenance of provenance, which is a bit tricky. But say you had information about the lineage or the history of a data set and you wanted to control that chunk of stuff. You could talk about that thing being a data set itself, even though it's about something else,
50:23
and manage that. And you could certainly work out how to link your data set to the data set that contains its provenance information. So you can do that. But the second part of the question, or I think the general sense of the question, is more to do with how does a specific repository do things? Does that make sense?
50:41
Does Prov support the workflow of creation and approval? In general, you can represent anything in Prov because it's really high level and it's got those three generic classes of entity, activity, and agent. And there's almost nothing in the world that I've come across that you can't decompose down into one of those three things. Is it a thing, is it a causative agent, or is it a temporal occurrence?
51:03
So in general modeling workflows, yes. And so Natasha asks, philosophical question for the whole panel, how do you think provenance relates to trust? So I'm going to just jump in very quickly and say, provenance models before Prov often had the word trust in them somewhere. And many of the motivations for provenance models were to do with trust.
51:25
We deal with trust as a goal at Geoscience Australia: to put out data and make it open and transparent. It's fundamentally a trust issue for users of that data. So they want to know how the data came to be. So that's really what provenance is about. It's about telling the history of something so you can generate trust in the data.
51:46
Then the specifics of what you put in there, you can work out, do I trust the people who created this thing? Do I trust the process that was undertaken to deal with it or transform it? Do I trust the particular chunks of code that were used? So that's the generic answer.
52:01
Then there's the sort of more specific ones, like for data in this repository, how do I trust that, even though you're telling me something about it, it's in fact true? There are also very difficult things about how do I actually trust this metadata, even if it looks like it's all correct. This data comes from God, delivered to you on a stone tablet.
52:22
I could write that down, but is it true? You have to work that out. That is not a Prov problem. You have to work out some other way of attributing a trust metric to that claim. That might be that it's digitally signed and you trust the agency that delivered it. So that's an appeal to authority. You might trust that there is enough information present for you
52:41
to understand the process enough to have confidence in it. It might link to well-known sources like open code or something like that that you trust. Or maybe there's a mechanism for you to validate certain chunks of data or calculations. So the total number was five, and you can look back to the provenance and see somewhere two plus three, you see five,
53:03
that you can calculate, and you can establish that trust directly. So I think Nick said it very well, but I'll say the same thing in fewer words. Thanks, George. Provenance is really fundamental to trust. And Nick really hit the nail on the head when he talked about transparency,
53:23
that provenance is about transparency. And in the world we live in now, even appeals to authority don't work very well anymore. And I think that for science to gain legitimacy and gain trust, we have to be transparent.
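Nick's "two plus three, you see five" example amounts to re-deriving a claimed result from its recorded inputs. Here is a minimal sketch of that idea; the record fields are invented for illustration and are not PROV terms:

```python
# Verify a claimed result directly from its provenance record: if the
# record exposes the inputs and the operation, a consumer can recompute
# the result and establish trust without appealing to authority.
def verify(prov_record):
    ops = {"sum": sum}  # illustrative: only one operation is supported
    recomputed = ops[prov_record["operation"]](prov_record["inputs"])
    return recomputed == prov_record["reported_result"]

claim = {"operation": "sum", "inputs": [2, 3], "reported_result": 5}
print(verify(claim))  # True
```

This only works when the provenance is rich enough to recompute from, which is exactly the transparency point made above.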
53:41
And that's what provenance metadata is all about. So we've reached the end of our time. And I'd just like to thank our three speakers for coming along to our ANDS Canberra office today and speaking to us about provenance and introducing lots of new acronyms to us all.
54:04
Every time I encounter anything new at ANDS, there's always more acronyms to learn. So thank you very much for coming. We have two more webinars in the social science series, so hope to see you there again.