Managing and publishing sensitive data in the Social Sciences - 29th March 2017

Video in TIB AV-Portal: Managing and publishing sensitive data in the Social Sciences - 29th March 2017

Formal Metadata

Managing and publishing sensitive data in the Social Sciences - 29th March 2017
Title of Series
Number of Parts
CC Attribution 4.0 International:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Release Date

Content Metadata

Subject Area
Managing and publishing sensitive data in the Social Sciences: 29 Mar 2017 featured:
1) Dr Steve McEachern (Director, Australian Data Archive). Steve will discuss how the Australian Data Archive manages and publishes sensitive social science data.
More about ADA:
-- The Australian Data Archive (ADA) provides a national service for the collection and preservation of digital research data, and makes these data available for secondary analysis by academic researchers and other users.
-- The ADA comprises seven sub-archives: Social Science, Historical, Indigenous, Longitudinal, Qualitative, Crime & Justice and International.
-- ADA data is free of charge to all users.
-- The archive is managed by the ADA central office, based in the ANU Centre for Social Research and Methods at the Australian National University (ANU).
2) Prof George Alter (Research Professor, ICPSR and Visiting Professor, ANU). George will share the benefit of over 50 years of experience in managing sensitive social science data at ICPSR.
More about ICPSR:
-- ICPSR (USA) maintains a data archive of more than 250,000 files of research in the social and behavioral sciences. It hosts 21 specialized collections of data in education, aging, criminal justice, substance abuse, terrorism, and other fields.
-- ICPSR collaborates with a number of funders, including U.S. statistical agencies and foundations, to create thematic collections
Good afternoon, or good morning if you're in a different time zone, everyone. Thank you for calling into our webinar today. We've got some handouts in today's webinar as well: a guide to publishing and sharing sensitive data, and a sensitive data decision tree, which is a one-page summary of the information available in our guide.
I'd just like to introduce our two guests. We've got Professor George Alter. He's a research professor in the Institute for Social Research and professor of history at the University of Michigan. His research integrates theory and methods from demography, economics and family history with historical sources to understand demographic behaviour in the past. From 2007 to 2016 he was the director of the Inter-university Consortium for Political and Social Research (ICPSR), the world's largest archive of social science data. He's been active in international efforts to promote research transparency, data sharing and secure access to confidential research data. He is currently engaged in projects to automate the capture of metadata from statistical analysis software and to compare fertility transitions in contemporary and historical populations, and we're lucky to currently have him as a visiting professor at ANU.
Dr Steve McEachern is the director of the Australian Data Archive at the Australian National University. He holds a PhD in industrial relations and a graduate diploma in management information systems, and has research interests in data management and archiving, community and social attitude surveys, new data collection methods and reproducible research methods. Steve has been involved in various professional associations in survey research and data archiving over the last 10 years and is currently chair of the Executive Board of the Data Documentation Initiative.
So firstly we're going to hand over to George, who is going to share the benefit of over 50 years of ICPSR managing sensitive social science data. Over to you, George.
Thank you. It's a pleasure to talk to you today. ICPSR, as was mentioned, has been in data archiving for 50 years, and an increasing amount of our effort has gone into devising safe ways to share data that have sensitive and confidential information.
At the heart of everything we do in terms of protecting confidential information is a part of the research process: when we ask people to provide information about themselves to us, we make a promise to them. We tell them that the benefits of the research that we're going to do are going to outweigh the risks to them, and we say that we will protect the information that they give us.
We receive a lot of data that include questions that are very sensitive. Often we're asking people about types of behavior that could cause them harm. We might be specifically asking them about criminal activities; we might be asking them about medications that they take that could affect their jobs or other things. So we have to be careful about it, and we're afraid that if the information gets out it could be used by various actors for specific purposes; it could be used in a divorce proceeding. Sometimes we interview adolescents about drug use or sexual behavior, and we promise them that their parents won't see it, and so on.
In the data archiving world we often talk about two kinds of identifiers. There are direct identifiers, which are things like names, addresses and Social Security numbers, many of which are unnecessary; but some types of direct identifiers, such as geographic locations or genetic characteristics, may actually be part of the research project. And then the most difficult problem often is the indirect identifiers, that is, characteristics of an individual that, when taken together, can identify them. We often refer to this as deductive disclosure, meaning that it's not obvious directly, but if you know enough information about a person in a data set then you can match them to something else. Frequently we're concerned that someone who knows that another person is in the survey could use that information to find them, or that there is some other external database where you could match
information from the survey and re-identify a subject. Deductive disclosure is often dependent on contextual data. If you know that a person is in a small geographic area, or you know that they're in a certain kind of institution like a hospital or a school, it makes it easier to narrow down the field over which you have to search to identify them. Unfortunately, in the social sciences contextual data have become more and more important. People now are very interested in things like the effect of neighborhood on behavior and political attitudes, or the effect of available health services on morbidity and mortality, and there are a number of different kinds of contextual data that can affect deductive disclosure. So we're in a world right now where social science researchers are increasingly using data collections that include items of information that make the subjects more identifiable. For example, people studying the effectiveness of teaching often have data sets that have characteristics of students, teachers, schools and school districts, and once you put all those things together the subjects become very identifiable.
So we at ICPSR, and I think the social science community in general, have taken up a framework for protecting confidential data that was originally developed by Felix Ritchie in the UK, which talks about ways to make data safe. I'm going to go through these points: Ritchie talks about safe data, safe projects, safe settings, safe people and safe outputs. The idea is not that any one approach solves the problem, but that you can create an overall system that draws from all of these different approaches and uses them to reinforce each other.
Safe data means taking measures that make the data less identifiable. Ideally that starts when the data are collected, so there are things that data producers can do to make their data less identifiable. One of the simplest things is to do something that masks the geography. If you're
doing interviews, it's best to do interviews in multiple locations, which adds to the anonymization of your interviewees; or if you're doing them in only one location, you should keep the information about the location as secure as possible. Once the data have been collected, research projects have been using a lot of different techniques for many years to mask the identity of individuals. One of the most common is called top-coding: if you ask subjects about their incomes, the people with the highest incomes are going to stand out in most cases, so usually you group them into a category that says people above 100 thousand dollars in income, or something like that, so that there is not just one person at the very top but a group of people, which makes them more anonymous. This list of things that I've given here, which goes from aggregation approaches to actually affecting the values, is ordered in terms of the amount of intervention that's involved. Some of the more recently developed techniques actually involve adding noise or random numbers to the data itself, which tends to make it less identifiable but also has an impact on the research that you can do with the data.
Safe projects means that the projects themselves are reviewed before access is approved. At most data repositories, when the data need to be restricted because of sensitivity, we ask the people who apply for the data to give us a research plan. That research plan can be reviewed in several different ways. The first two things are things that we do regularly at ICPSR. We ask, first of all, do you really need the confidential information to do this research project? And if you do need it, would this research plan identify individual subjects? We're not in the business of helping marketers identify people to target with marketing, so we would not accept a research plan that did that. There are also projects that actually look at the scientific merit of a research plan; to do that, though, you need to have experts in the field who can help you.
Safe settings means putting the data in places that reduce the risk that they will get out, and I'm going to talk about three approaches. The first one is data protection plans. When we distribute data that need to be protected but the level of risk is reasonably low, we often send those data to a researcher under a data protection plan and a data use agreement, which I'll come to in a couple of minutes. A data protection plan specifies how they're going to protect the data, and here's a list of things that we worry about that one of my colleagues at ICPSR made up. One
of the things we tell people is: what happens if your computer is stolen? How will the confidential data be protected? There are a number of things that people can do, like encrypting their hard disk or locking their computer in a closet when it's not being used, that can address these things. I think that data protection plans need to move to a general consideration of what it is that we're trying to protect against, and allow the users to propose alternative approaches, rather than saying you have to use this particular software or that. We have to be clear about what we're worried about.
A couple of notes about data security plans. Data security plans are often difficult, partly because of the approach that has been taken in the past, and also because researchers are not computer technicians and we're often giving them confusing information. One of the ways that I think universities, in the US at least, are going to move beyond this in the future is this: I'm seeing universities developing their own protocols, where they have different levels of security for different types of problems, and at each level they specify the kinds of measures that researchers need to take to protect data at that level of sensitivity. From my point of view as a repository director, any time the institutions provide guidance, it is a big help to us.
The other way to make the data safe by putting it in a safe setting is actually to control access. There are three main ways that repositories control access. One kind of system is what I call a remote submission and execution system, where the researcher doesn't actually get access to the data directly. They submit program code or a script for a statistical package to the data repository; the repository runs the script on the data and then sends back the results. That's a very restrictive approach, but it's very effective. More recently, however, a number of
repositories and statistical agencies have been moving to virtual data enclaves. These enclaves, which I'll illustrate in a minute, use technologies that isolate the data and provide access remotely but restrict what the user can do. The most restrictive approach is actually a physical enclave. At ICPSR we have a room in our basement that has computers that are isolated from the internet. We have certain data that are highly sensitive, and if you want to do research with them you can, but on the way into the enclave we're going to go through your pockets to make sure you're not trying to bring anything in, and on the way out we're going to go through your pockets again, and you'll be locked in there while you're working, because we want to make sure that nothing uncontrolled is removed from the enclave.
The disadvantage of a physical enclave is that you actually have to travel to Ann Arbor, Michigan to use those data, which could be expensive, and that's the reason that a number of repositories are turning to virtual data enclaves. This is a sort of sketch of what the technology looks like. What happens is that you, as a researcher, log on over the internet to a site that connects you to a virtual computer. That virtual computer has access to the data, but your desktop machine does not; you can only access the data through the virtual machine. At ICPSR we actually use this system internally for our data processing, to provide an additional level of security. So we talk about the virtual data enclave, which is the service we provide to researchers, and the secure data environment, which is where our staff work when they're working on sensitive data.
And it's a little bit of a letdown, but this is what it actually looks like. The window that's open there with the blue background is our virtual data enclave, and I've opened a window for Stata inside there; the black background is my desktop computer. If you look closely, you'll see the usual Windows icons in the corner of the blue box, and that's because when you're operating remotely in the virtual enclave you're using Windows. It looks just like Windows and acts like Windows, except that you can't get to anything on the internet; you can only get to things that we provide. For a level of security on top of that, the software that's used (we use VMware software, but there are other brands that do the same thing) essentially turns off your access to your printer and turns off your access to your hard drive or USB drive, so you cannot copy data from the virtual machine to your local machine. You can take a picture of what you see there, and because you have that capability we also restrict people with a data use agreement, and that's my next
topic: how do you make people safer? The main way that we make people safer is by making them sign data use agreements, or by providing training. The data use agreements used at ICPSR are frankly rather complicated. They consist of the research plan, as I mentioned before; we require people to get IRB approval for what they're doing; a data protection plan, which I mentioned; and then there are these additional things of behavioral rules, security pledges and an institutional signature, which I'll mention now.
So the process: if you look at the overall process of doing research, there are a number of legal agreements that get passed back and forth. It actually starts with an agreement made between the data collectors and the subjects, in which they provide the subjects with informed consent about what the research is about and what they're going to be asked, and it's only after that that the data go from the subjects to the data producers. Then the data archive, such as ICPSR or ADA, actually reaches an agreement with the data producers in which we become their delegates for distributing the data; that's another legal agreement. Then, when the data are sensitive, we actually have to get an agreement from the researcher, and these are the pieces of information we get from the researcher. In the United States our system is that the agreement is actually not with the researcher but with the researcher's institution. At ICPSR we're located at the University of Michigan, and all of our data use agreements are between the University of Michigan and some other university in most cases; there are some exceptions. So it's only after we get all of these legal agreements in place that the researcher gets the data. One of the things in
our agreements at ICPSR is a list of the types of things that we don't want people to do with the data. For example, we don't want someone to publish a cross-tabulation table where there's one cell that has one person in it, because that makes that person more identifiable. There's a list of these things; I think often we have 10 or 12 of them, which are really standard rules of thumb that statisticians have developed for controlling re-identification.
The ICPSR agreements are also, as I said, agreements between institutions, and one of the things that we require is that the institution takes responsibility for enforcing them, and that if we at ICPSR believe that something has gone wrong, the institution agrees that they will investigate this based on their own policies about scientific integrity and protecting research subjects.
DUAs are not ideal; there's actually a lot of friction in the system. Currently, in most cases, a PI needs a different data use agreement for every data set, and they don't like that. I think in the future we can reduce the cost of data use agreements by making institution-wide agreements, where the institution designates a steward who will work with researchers at that institution. There's already an example of this: the Databrary project, which is a project in developmental psychology that shares videos, has done very good work on such agreements. And my colleague, the current director of ICPSR, Margaret Levenstein, has been working on a model where a researcher who gets a data use agreement for one data set can use that to get a data use agreement for another data set, so that individuals can be certified and use that certification in other places. One of the things that I think we
need to do more about is training. A number of places, like ADA, train people who get confidential data. We've actually done some work on developing an online tutorial about disclosure risk, which we haven't yet released but which is, I think, something that should be done.
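As a rough illustration of what disclosure risk means in practice, the sketch below counts how many records share each combination of indirect identifiers and flags the combinations that fall below a threshold. It is a minimal sketch and not ICPSR's or ADA's actual tooling: the function name `risky_combinations`, the field names and the `min_count` threshold are all hypothetical choices made for this example.

```python
from collections import Counter


def risky_combinations(records, keys, min_count=2):
    """Return value combinations of `keys` shared by fewer than `min_count` records.

    `records` is a list of dicts; `keys` names the indirect identifiers to combine.
    The threshold is an illustrative assumption, not a published rule.
    """
    counts = Counter(tuple(record[key] for key in keys) for record in records)
    return {combo: n for combo, n in counts.items() if n < min_count}


# Hypothetical records loosely modelled on the school example in the talk.
records = [
    {"school": "A", "grade": 5, "role": "student"},
    {"school": "A", "grade": 5, "role": "student"},
    {"school": "A", "grade": 5, "role": "teacher"},
    {"school": "B", "grade": 6, "role": "student"},
    {"school": "B", "grade": 6, "role": "student"},
]

# Combining school, grade and role isolates the single teacher.
flagged = risky_combinations(records, ["school", "grade", "role"])
for combo, n in flagged.items():
    print(combo, n)
```

A combination held by only one record corresponds to a table cell with one person in it. A count like this only catches the simplest cases; it says nothing about contextual data or external databases that could be matched against the file.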
Finally, there's safe outputs. The last stage in the process is that the archive can review what was done with the data and remove things that are a risk to subjects. This only works if you retain control, so it doesn't work if you send the data to the researcher, but it does work if you're using one of these remote systems like remote submission or a virtual data enclave. Often this kind of checking is costly; there are some ways to automate part of it, but manual review is almost always necessary in the end.
So, a last thing about the costs and benefits. Obviously data protection has costs: modifying data affects analysis, and if you restrict access you impose burdens on researchers. Our view is that you need to weigh the costs against the risks that are involved, and there are two dimensions of risk. One dimension is: in this particular data set, what's the likelihood that an individual could be re-identified if someone tried to do it? And secondly: if that person was re-identified, what harm would result? So we think about this as a matrix where you
can see in this figure as you move up you're getting more harm as you move to the right you're increasing the probability of disclosure so if the data set is on is low on both of these things you know for example if it's an it's a national survey where a thousand people from all over the United States were interviewed and we don't know where they're from and we ask them what their favorite brand refrigerator is that kind of data we're happy to send out directly on over the web without a day-use agree with it looks simple terms of use but as we get more complex data with more
questions more sensitive questions we often will add some requirements in the form of the data use agreement to assure that they are protected and when we get
to complex data where there is a a strong possibility of the identification and where some harm would result to the subjects we in that case we often add a technology component like the virtual data on place and then there
are the really seriously risky and sensitive things my usual example of this is we have a data set at icpsr that was compiled from interviews with convicts about sexual abuse and other kinds of news and prisons and that data is very easy to identify and very sensitive and we only provide that in in
our fiscal data center so um that's the
end of my presentation. I thank you for your attention, and we'll take questions later.

Great, thank you, George. So we'll pass over to Dr Steve McEachern to give his presentation about managing sensitive data at the Australian Data Archive.

Okay. Um, so my aim today is to do what George has talked about, basically taking the Five Safes model and looking at what the situation is in the Australian case. I'll talk about the Australian Data Archive and how we support sensitive data, but I want to put it in the context of the broader framework of how we access sensitive data in Australian social sciences generally. So I'm going to talk about some of the different options that are around, sort of picking up on what George has discussed in terms of some of the alternatives available, and to demonstrate some of the different ways these are in use here in Australia. I'm really focusing more on the Five Safes model and its application in Australia than I am specifically on ADA. As I say, we are one component of the broader framework for sensitive data access here.

Um, so, just so you know, what I really wanted to cover off here is thinking about sensitive data and the Five Safes model, looking at the different frameworks for sensitive data access in Australia and where you might find them, and then how we apply the Five Safes model at ADA in particular. And then, depending on time, I might say something briefly about the data life cycle and sensitive data as we go through. So I wanted to pick up on the
particular ANDS definition here of sensitive data, and I think, within that context, most of what we deal with at ADA has at some point in its life cycle been sensitive data. More often than not it is information collected from humans, often with some degree of identifiability, at least at the point of data collection, though not necessarily at the point of distribution. So a lot of what we deal with, and this is true for a lot of social science archives, has in the past been sensitive data, but there's a difference between what we get and what we distribute. So in terms of the definition here, this is from the handout, I think it's in the handout section and available online: data that can be used to identify an individual, species, object, process or location that introduces a risk of discrimination, harm or unwanted attention. Now, we tend to think in terms of human risks more than anything else, risks to humans and to individuals, but it does apply in other cases as well. So, for example, the identification of sites of Indigenous art, in and of itself, leads to more people wanting to go and visit that location and, you know, in a sense destroy the thing that you're actually trying to protect. The more visitors they actually get, the more degraded the art itself becomes. So it doesn't just apply to, you know, human research, though that's probably our primary emphasis at ADA.

So, just to reiterate the Five Safes again: we talk about people, projects, settings, data and outputs. The reference here is at the bottom; you can look at the document that Felix Ritchie and his colleagues developed, sort of framing up the Five Safes model. What I would say about this is that it's been adopted, you know, directly by the UK Data Service, that's where it has its origin. The basic principles are applied in a lot of the social science data archives, and it has now actually been adopted by the Australian Bureau of Statistics as well, so their framework for thinking about output of different types of publications literally leverages this model. So it's quite useful as a framework for talking about this. I'm going to take a slightly different approach to George in thinking about how
we think about what we're worried about. I'm going to take, you know, the positive side. We worry about the risk of disclosure; as a researcher, what's the other side of that? Why do we need access to sensitive data? What does it provide? The National Science Foundation, four or five years ago, put out a call around how we can improve access to microdata, particularly from government sources, and it sort of highlights, you know, the sorts of things we talk about when we talk about the need for access: the sorts of research you can do. This comes from a submission from David Card, Raj Chetty and several other economists in the US and elsewhere. They were highlighting, you know, what's needed. Direct access is really the critical thing here, direct access to microdata. By microdata we mean individual information, information about individuals, line by line. Aggregate statistics, synthetic data (where you create fake people, as it were), or submission of computer programs for someone else to run really don't allow you to do the sorts of work you need to address policy questions in particular, and a lot of social policy research in particular is focused in this way. So in order to do certain things, access to this data is necessary. So how do we facilitate that, taking account of the sorts of concerns that have been raised? The other side of that is, well, how do people
expect to access it? This was an interesting blog post from a researcher based at the University of Canterbury, comparing how you access US census data versus the New Zealand census. We could say the same of the Australian census as well. In the US you can get a one percent sample of the census: you just go and download a file directly, what's called a public use microdata sample file, and, you know, those are directly available. In New Zealand there's a whole series of steps you have to go through: you might be subject to data use agreements, you might be subject to an application process, et cetera. Now, his criticism is that it should be much easier, that the US model is the appropriate one here rather than the New Zealand model. But what we're really talking about here is that both are appropriate, depending upon the sorts of detail, the sorts of identifying information involved. Both might be, you know, valid models; they just allow you to do different things. The first model really focuses on masking the data to some degree, one of the safe data models that I'm going to talk about. The other uses other aspects of the Five Safes model to address confidentiality concerns. And what
you also find is that researchers understand this, you know, that there have to be some trade-offs. The need for confidentiality is recognized and understood, and there ought to be trade-offs in return for that. So, for example, Card and his colleagues suggest, you know, a set of criteria that you could put in place for enabling one form of access to sensitive microdata. It might be, as referenced here, access through local statistical offices, through remote connections such as the virtual enclave that George talked about, and monitoring of what people are doing. If you're going to have highly sensitive data available, the trade-off for that should be appropriate monitoring. So there is a recognition, and this is just one possible approach, that access brings with it responsibilities and appropriate checks and balances. So what I want to talk about is: how has that eventuated in Australia? What do we see? So, the main models that we see here
in Australia I've broken up, broadly, into four broad areas. The one that people are probably most familiar with is the ABS, the Australian Bureau of Statistics. They have a number of systems and access methods that suit different types of safe profiles. These include what's called confidentialised unit record files, or CURFs. They have a remote access data lab, which is one of their online execution systems. They have an on-site data lab: you can go to the bowels of the ABS buildings, certainly in Canberra and in other states as well, and do on-site processing. Then they have other systems, but probably the best known is what's called TableBuilder, which is an online data aggregation tool that does safe data processing on the fly. Our emphasis at ADA is primarily on these confidentialised unit record files, so we provide unit record access and some aggregated data access as well. Then we have remote execution or remote analysis environments. I put under this model the Australian Urban Research Infrastructure Network, for geographic data access. In particular, the Secure Unified Research Environment produced by the Population Health Research Network is an example of George's remote access environment as well. And even data linkage facilities, another part of the PHRN network, fit to some degree under this secure access model, as a more extreme version of it. And then we have other ad hoc arrangements as well, things like physical secure rooms. A number of institutions have a secure space; there are a number here at ANU, for example. And then you might have other departmental arrangements as well that exist. We can probably
classify those in terms of the types of approaches that we have. So what I've done here is a very simple assessment, from, you know, "not at all" to a very strong "yes", of whether each one is addressing this safe element, from low to high. I have some question marks on some of the facilities, particularly SURE and data linkage, which is not because I don't think they can do it; it's that I don't have enough information to make an assessment there. But if you look at the different types, things like the ABS models have tended towards safe data, so the sorts of confidentialisation and anonymisation routines, data output checking and secure access models; tabulation systems are a secure access point as well. They've tended less towards safe people and safe projects, so checking of people, checking of projects. In a lot of cases there's more trust in the technology than there is in the people using the technology, which I think is a little bit problematic, given that, and I'm going to talk to this some more, there are definitely good processes in Australia for actually assessing the quality of people in particular, and to some extent of projects. So we can kind of profile these. The point I'm making here is that in different scenarios you have different alternatives for how you might make sensitive data available. There's not one solution; it's, you know, what the mix of things is, and I'll come back to that again. So, in the Australian experience, we
have a strong emphasis on safe data. We came up with the term "confidentialisation", which you probably don't see regularly anywhere else in the world; elsewhere it tends to be "anonymisation". I'm not quite sure why that's the case, but in Australia the term we tend to use is confidentialisation. The Australian Data Archive uses this model; the ABS and the Department of Social Services, for things like the Household, Income and Labour Dynamics in Australia survey, use anonymisation techniques as sort of a starting point, so you can make data safe before you release it. It has its limitations, and a good example of that is some of the data sets released into the data.gov.au environment that used anonymisation: with safe data there is the potential for it to be reverse engineered. If you haven't done the anonymisation properly, then it could be reversed and you get, you know, a safe data risk. So it has its flaws, and this is why we've tended toward looking at sort of a combination of techniques. But as George pointed out, if the risk of actually being identified is low, and particularly if the harm that comes from that is low, then, you know, it may be the case that this is sufficient. Certainly a lot of the content that we have at ADA falls into that category.

Safe settings: we do have examples here. Tabulation systems, you know, things where you can do cross-tabs or more, are fundamentally a safe settings model: people don't get access to the unit record data, they just get access to a system that produces outputs. Remote access systems, the remote access data lab, PHRN's SURE system, and a new system that the ABS are bringing on, their remote data lab, where they're making their data labs available in a virtual environment (they're in pilot stages that we're working with them on at the moment), are increasingly being used as well. There are also secure environments, as I mentioned, the data lab and secure rooms.

Safe outputs: a number of the safe settings
environments that tend to use highly sensitive data have safe output models as well. The real problem with these has been in scaling: it requires manual checking more often than not. Reviewing the output of these sorts of systems requires people, it requires time, and it's hard to automate as well. The ABS has actually invested a lot of money into automating output checking; in point of fact, their TableBuilder system is one of the best that's around. But then the remote data lab still has manual checking of outputs. So it depends on what you're trying to do, the sorts of outputs you're checking, the different sorts of outputs you're producing, as to whether you can actually automate the checking as well. The other side of this that I think will become increasingly relevant is the replication and reproducibility elements of things that come out of systems like this. How are we going to facilitate, you know, replication models within those environments? I'm not sure that question is being addressed yet.

Safe researchers and safe projects in Australia, to be frank, are considered in most models but are not really closely monitored, and this is because it's difficult: how do you follow the extent to which people do the things that they signed up to? Anyone who's involved in reporting of research outputs, ERA or anything, will know that just getting people to actually fill out the forms on what they've produced is hard; getting compliance with what they agreed to is even tougher. That said, we do have some checks and balances there. Certainly the ethics models and the codes of conduct for research do provide some degree of, you know, assurance for those that go through that sort of system, and, you know, we have some checks and balances in place, for university researchers in particular, to address these sorts of concerns. I think an emphasis
increasingly on safe researchers and projects might be one that we can leverage a bit more carefully, as I say, because of the frameworks we have in place: the Australian code of conduct and, increasingly, professional association and journal requirements around data sharing are going to put a degree of assessment on the sorts of practices we use as well. The American Economic Association, the DA-RT agenda in political science, PLOS ONE's requirements: these are assessing not only the sharing of data but also the extent to which you are paying attention to how it is shared. So that's something to be, you know, considering.

A couple I'll quickly speak to: there's the ADA model, and then sort of wrap up. So, the ADA model: our emphasis is primarily on safe data, that is, data that is anonymised. That tends to be done by the agencies or the researchers that provide data to us in advance, and we will also do some review of content as well. We'll provide recommendations back to our depositors, you know, these are the sorts of things you probably want to think about: have you included things like postcodes or occupational information? If I know someone's postcode, their occupation and their age, there's a fair chance that I could identify them in many cases, particularly in remote locations in Australia. So there are some basic checks we do. Safe people, certainly: our data access is almost all mediated. You must be identified, you must provide contact information and supervisor details, so we do some checking on safe people. And we ask for information on project descriptions, what you intend to do with the data, as well, in particular where we have more sensitive content; often that's a requirement of depositors. We don't apply, frankly, safe settings and safe outputs. That's not the space that we work in. We work with other agencies such as the ABS, so where there's access to certain sensitive content, we'll point people to other locations, you know, where you've got highly sensitive content that we don't make available. Okay, something like the
remote data lab, by contrast, focuses less on safe data. It's their virtual enclave. They don't prohibit the use of safe data practices, but they do limit them, because this is where you have highly sensitive data. There's a whole dedicated assessment process on the projects and the outputs. It's a highly safe setting, sitting at the ABS. The problem they have is the cost of establishing the system itself. They vet all the outputs, and that has costs associated with it. They have safe people: there's training for researchers in the pilot accessing the system. There is some challenge in assessing the backgrounds of people. For example, there's the need for domain experts: if you're going to fully assess people and projects, you're going to assess their domain expertise, and you need domain experts to be able to do that sort of evaluation. So the emphasis might be on: are you using appropriate techniques, are you maintaining secure facilities, you know, what do your protection plans look like? That is more the emphasis than the quality of the science; that's a much harder thing to evaluate. Safe projects: that has been used in some places. Sometimes it's required for legislative reasons, if the sensitive data release is itself dependent upon meeting a public good statement, for example. One of the questions there is that basic research itself might generate useful impact that you didn't expect. So in each case you're probably going to be putting more or less focus on different aspects of a safe environment. I guess the message we
want to put through here is that several different options are available to you for accessing sensitive data. Different models exist, and they have different, you know, mixes of the Five Safes. You can certainly incorporate safe people models. Curiously, a lot of models focus on the expectation that we have an intruder, that hackers are coming in to access the system. Actually, what tends to be the case more often than not is silly mistakes: you know, I made a mistake by leaving my laptop on the train, or leaving my USB stick in the computer lab. That's far more common. So we tend to try and profile the default options, you know, in terms of our mix of safe settings. But there are options available to you; what you have to think about is what's appropriate for the sort of data that you plan to work with. Fundamentally, the argument is
that the principles should enable the right kind of access for the data source.

Okay, thank you very much, Steve. That was a really great overview of the different ways that the five elements of the Five Safes can be mixed and used in practice. Um, I thought it was really interesting that both of you mentioned a safe location in a basement; now we've got these images of people locked up in a basement. I also wanted to note that George mentioned data masking and de-identification methods, and Steve also mentioned confidentialisation and anonymisation: similar sorts of words for similar processes. ANDS has a de-identification guide available on our website now, so if you're interested in more detail on that, we have that guide there that you can have a look at. I was also wondering about what you were talking about with the data protection plan and data use agreement, that the onus is on the institution: if someone breaks it, they need to put them through some sort of research integrity investigation or something like that. If that doesn't happen, is there any potential recourse against the university? Like, could ICPSR turn around and say, well, you didn't follow the process, you're not going to be accessing any of our data anymore?

Sure. Actually, on our website we list the levels of escalation we're going to go to, and we can certainly cut off the institution from access to ICPSR data. But what really gets people's attention is that the National Institutes of Health in the US has an Office of Human Research Protections, and if we thought that someone was breaching one of our agreements and endangering the confidentiality of research subjects, I would report them to that office. That office has a lot of power. They regularly publish the names of bad actors. What's more, they can cut off all NIH funding to universities, and they have done that in the past when they thought that protections were not in place.
So I always think of that as the nuclear option, and I know for a fact that university administrations and their trustees and regents are terrified that NIH will do something like that to them. So just waving that in front of the university compliance officer gets their attention.

Okay, excellent. And Stefan's wondering, with the Australian Data Archive's data use agreements and the people signing them: is that with the individual user or with the institution?

Primarily it has been with the individual. We have a small number of organisational agreements, but not many. Our focus has been very much on the individual rather than the organisation. Some organisations do want them, and I think this is actually more for pragmatic reasons than for compliance reasons: they will want to host content and manage access, or request access to a particular data set for all members of their organisation, for example, so it sort of just makes things easier, as it were. There are other models. So, say, the ABS model: actually the agreement is with the institution, and then individuals sign up under the institutional agreement. The Department of Social Services model is like that as well. It's interesting to see the extent to which we move in one direction or another. As I say, I think the compliance argument and monitoring haven't been all that common here in Australia. It's probably a different situation where you have government data in place, but for, you know, academic-produced data that hasn't tended to be the process.

Okay. And with George's agreements with institutions, where the main recourse is that the institution should then hold some integrity investigation, what level of recourse do you have with an agreement with an individual?

It's limited. You know, I mean, we would probably report back to the institution. For students we do have supervising arrangements. We would probably also follow, you know, sort of questions under the code of conduct for research. So, as I say, I'd make reference back to that; there is an overarching set of obligations on all those within Australian institutions, and we would pursue them, you know, in that way. One of the challenges for us, and I imagine for George as well, is just finding where you get breaches. Compliance is the hard part of things, because actually pinpointing a breach in the first place is difficult. We've had one case that I'm aware of, in my time and my predecessor's time, which goes back to the late nineties. So it's not a common occurrence that we're aware of.

Okay, excellent. So George mentioned standardised data use agreements between US institutions. Has that been formalised across a number of institutions as part of a consortium arrangement, or is
it more of an informal thing that's gaining momentum?

Well, the example I gave is the Databrary project, and they're the only ones I know that have done this in a formal way, where they get institutions to sign on as an institution, and then that covers all of the researchers at that institution. It took them a while to negotiate that and get the bugs out, but I think it's paying off for them. This is something that I think other groups like ICPSR should move to. Right now it's a big problem that about one in six of our data use agreements at ICPSR involves a negotiation between lawyers at the University of Michigan and lawyers at the other institution, so it's a major cost. I think it's one of the ways to go.

I'd second that. I mean, in Australia we have a pretty strong example, which is the Universities Australia ABS agreement, and that model facilitates a whole lot of things. It sort of enables access to the broad collection of ABS CURF data under a single arrangement that covers the universities signing up. There's a cost that comes with that as well, they're paying a fee for that, but then it covers the whole spectrum of what they can do. A challenge in some cases is where you've got multiple sources for dissemination of the content. As I say, I've had this discussion with various departments: could we establish a consistent data release agreement? Because of the different models and the different legislation, the answer tends to be that we can't all have the same set of conditions. But certainly there is, you know, something that could be done to standardise some of that, and I'd be interested to see whether the Productivity Commission report that's coming on data access will address that.

Other questions? Okay, great. So, just quickly, there's a question about whether there are any checklists or guidelines for new researchers to assess their research surveys for the level of confidentiality. So I
think that they're talking about a privacy risk assessment. Actually, we have an internal checklist. This is something we've talked about, in terms of thinking about what you need to do, but it really depends on the point of publication. We talked before about the fact that in order to do some research you need to actually have something that might be identifying, so it depends on which point in the data life cycle you're actually talking about here. When we think about release, then, as I say, we would basically apply some basic principles: these are the sorts of things that we look for. And actually we've talked about making that available, in terms of "these are the sorts of things you have to be concerned about". There is advice around that we could probably bring together. But it's the usability versus confidentiality question again. So one of the things we sometimes do is split off those things that are most likely to be identifying, and we actually release several different sets of data, so if you need that additional information you can actually access that under a separate, additional set of requirements, and possibly under a different technological setting. So, as I say, I think it depends a little bit on when in the life cycle you're talking about here. It often is useful to have as much information as you can. For example, if you're running a longitudinal study, you must have identifying information going forward in order to contact people the next time round. It depends on what you're trying to achieve.

But yeah, there are checklists on the web, and there's a literature that's been used by statistical agencies about what they can release. But that whole area is right now somewhat contentious, because the statistical agencies developed that literature largely in the age when data were released in the form of published tables. When the data are available online and you can do repetitive, iterative operations on them, you're in
a different world, and there is a separate literature that's developed from the computer science world. Anyway, it is a problem. There is guidance out there. In really complex areas, like some health care areas, doing a full assessment of the data set can be very complicated and difficult. So I think, you know, my recommendation is that people start with the basics and think about, you know, how would you identify this person, and if this information got out, what harm would it cause? Often the researchers themselves have a good sense of that from the research they're doing.

Okay, there's one last question: are the Five Safes applicable in all research disciplines, or are they specifically limited to the social sciences?

I think they're all applicable. I mean, it's interesting that we're having this discussion about social sciences, but, for example, we work a lot with health sciences, with environmental sciences and the like, and, you know, I don't see any reason why they shouldn't be applied elsewhere. That part of the question actually is, you know, more about what you think about in terms of the privacy and confidentiality risks, far more so than the topic. The topic helps you make some sort of judgment about harm in other disciplines, but yeah, the same considerations apply; they might just look a bit different.

Thank you. Thank you very much to George and Steve for our webinar today, and thank you everyone for coming.
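
Looking back at the harm-by-probability matrix George described, the way an archive might map a data set to an access tier can be sketched roughly as follows. This is an illustration only, not ICPSR's actual procedure: the three-level scale, the additive score and the cut-offs are assumptions, and only the four tier names come from the talk.

```python
from enum import IntEnum


class Level(IntEnum):
    """Assumed three-point scale for each axis of the matrix."""
    LOW = 0
    MEDIUM = 1
    HIGH = 2


# Access tiers named in the talk, ordered from least to most restrictive.
TIERS = (
    "web download with terms of use",
    "data use agreement",
    "virtual data enclave",
    "physical data center",
)


def access_tier(disclosure_probability: Level, harm: Level) -> str:
    """Pick a tier from the two risk dimensions.

    The additive score and its thresholds are hypothetical; a real
    archive would weigh these dimensions with expert judgment.
    """
    score = int(disclosure_probability) + int(harm)
    if score == 0:
        return TIERS[0]
    if score <= 2:
        return TIERS[1]
    if score == 3:
        return TIERS[2]
    return TIERS[3]


# The refrigerator-brand survey: low on both dimensions.
print(access_tier(Level.LOW, Level.LOW))
# The prison interview data: high on both dimensions.
print(access_tier(Level.HIGH, Level.HIGH))
```

The point of the sketch is the shape of the decision, with protections added as either dimension rises, rather than the specific thresholds.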