Delivering Comprehensive Research Data Management Services at University College London - March 26th 2014

Video in TIB AV-Portal: Delivering Comprehensive Research Data Management Services at University College London - March 26th 2014

Formal Metadata

CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Dr Max Wilkinson described the approach taken to establishing and developing a Research Data Service at University College London. From the development of a strategic roadmap to the design and implementation of a series of services, Dr Wilkinson discussed UCL's approach to supporting better data management by UCL researchers, their partners and collaborators.
Thank you to everyone for coming today. We're very lucky to have Max Wilkinson from the University College London Research Data Services, and he's given us the wonderful title of Delivering Comprehensive Data Management Services at London's Global University. Thank you, Max, for being with us today.

Thank you very much. Nothing if not an ambitious title. What
I'd like to talk to you about today is how I've approached this particular problem at the university in central London, and in order to do that we need to understand a little bit about the university. University College London is quite an unusual university and has come about through a variety of means. It is consistently ranked as one of the world's top ten universities, depending on which tables you look at, and its alumni list is quite rich in Nobel Prize winners. It was founded in 1826 and was the first to admit students without regard to race, religion or gender; that meant people from other countries, women, and people that were not Church of England or Catholic, or indeed had any religion at all, were allowed to join UCL as students. We have about 8,000 staff, a quarter of whom come from countries outside the United Kingdom, and, by UK university standards, quite a high student population, with over a third of these postgraduates. We have about 5,000 researchers and an annual turnover of around 900 million pounds. We're quite multidisciplinary; in fact I believe the only two disciplines that we don't provide are veterinary science, for some reason, and, because of our founding fathers, quite appropriately, theology. Large
universities have their visions that they attend to, and this is ours: it's called the UCL Grand Challenges, and more specifically for research activities, UCL Research Frontiers. They build themselves around four main areas: global health, human wellbeing, intercultural interaction and sustainable cities. Collectively these are the visions that UCL stands for. They're under review right at the moment; we have a brand new Provost and President who has come in to take us into the next decade. Specifically about research IT services: we sit, or rather we were created, within our central IT provisioning. This was a strategic decision, to build on the skills that already existed in a very large and dispersed IT service, but within that there was a dedicated unit specifically designed to attend to research IT needs. We were established in June 2012, so we're a relatively new establishment within the organization, and our brief was to support the research life cycle with regards to IT. There's a little diagram there of what we consider to be a quadrantised IT research life cycle that concerns innovation, active research, publishing that research, exploiting and reusing it, and going into the cycle again. Because we're relatively new and
because we don't generally fit with either an academic organizational unit or an IT organizational unit, we had to focus on innovation and skills. Innovation was our brief, and in order for that to happen we had to invest in skills. Investing in skills means investing in people and expertise, and we've found that this has been required in the more established areas as well as the newer areas. So, for example, our high-performance computing provision has been expanded; we have a brand new software development initiative that supports software development activity; and there is also the group that I work for, the Research Data Services. We see this as a collaborative environment, so we don't only look within our university: my brief extends to the consortia that exist in the UK and are constantly being rebuilt, but actually the wider research community is all active in this area, and we're all doing it quite differently, which to my mind is quite a refreshing approach. There's no top-down activity as yet; however, of course, that means it's quite a fragmented community and it needs to be brought together. What's fortunate in supporting this within central IT is that we can build on a lot of vendor partnerships as well. This is about technology and this is about culture; there are challenges in both, and I don't think it's an unusual mantra, one that most of the research data management community will be familiar with, that this is not necessarily a technological challenge, it's more of a culture challenge. Well, actually, I think it's a relatively equal challenge between the two, and we can talk about that later on in the discussion, where I'll show you how we've approached it at UCL. So, to my mind, a
research data service actually impacts all the quadrants of the research life cycle as we've diagrammed it. We have to support the access and innovation needed to initiate research: I think it's well established now that increasingly hypothesis-driven research is being complemented, if not supported, by more data-driven activity, and the idea of reusing data is well established in science. But the critical need as we see it right at the moment is in the undertaking of research. We see a lot of people engaged in what would be considered not very good practice. However, it's not bad practice; what it is is practice born out of necessity, because there is no alternative, and I think that is something that is sometimes lost in the research data management field. People have always been sharing data, have always been managing data; it's just that the burden has become too much, for a number of reasons, and we've taken that abstraction down as our way of approaching providing solutions and enabling much better practice. Now, you can't talk about research data unless you talk about research data sharing. Sharing is a very important part of research, and as I just said, I believe it's done and has been done for a long time. My own background, and most of my colleagues' backgrounds, are in the research domain, and most of our research would not have progressed very far unless we were able to share data from others. The second concern that we attend to is innovation and exploitation. Building on the shoulders of giants, or just simply reusing data so that you don't have to generate it again, is really the foundation of the business case as we see it: if you don't have to generate data, that saves you money in the long run. However, the other angle to this is that it's simply good practice in the research life cycle. People share publications all the time; they're enabled to do that by the publication paradigm, and actually, if you look deeply into many publishers' guidelines, they require you to provide the data that underpins your publication upon request. Now, I know as a molecular biologist that's sometimes been difficult to do; however, that paradigm shouldn't be lost, because it's one of the fundamental tenets of good research practice. So
UCL's case for a comprehensive research data service was really elicited through three main concepts. The first is that most departments and research groups have a long history of excellence, but there's a lost opportunity here for building on it; this is attending to the reuse issue. Second, UCL is highly multidisciplinary, and for a long time there have been some very concerted efforts to bridge boundaries across disciplines. I'm a very big believer in cross-disciplinary research; however, I'm also a very big believer that you have to enable this to happen, you can't force it to happen, and I believe there's a lost opportunity for cross-disciplinary research through the reuse paradigm. Third, unmanaged data sets are lost data sets; I think that really goes without saying. The number of organizations and people I go and talk to that have racks and racks of stuff sitting on disks, on tables, in drawers, everywhere, never ceases to amaze me. In fact, you can generally pick when someone moved out of a research activity by the format of the media they last have in their collection of drawers; mine were Zip disks, which I think had a life of about six months before they were superseded, and that's where mine stopped. But increasingly people are buying two-, three-, four-terabyte hard drives of retail grade and using them to provision their storage. This is just as bad as the bottom-drawer activity, as are the racks of servers that people put into their closets because they don't have an alternative. So this is a burden, an increased burden on researchers, and if we can't help with that then we've failed. They constantly talk about backup; backup, in my mind, is another consequence of having to manage your own hardware: the fight against failing USBs and hard drives, and controlling the proximity issue that I'll talk about in a second, which has to do with authentication and deciding to whom you share your data, and when. So our approach was
threefold. We started by saying we have to remove the burden from our UCL researchers and attend to this PC World conundrum, or Currys conundrum; I don't know what the hardware provider is in this country, but people can go down to PC World, which is not 200 metres from UCL, and buy 2 terabytes of external hard drive for close on 140 pounds; actually it's about half that now, this is an old slide. If we can't compete against that then we're in trouble. More important for the organization is the hysteria that's been created around compliance and policy alignment; I understand that the ARC is starting along this route as well. About three years ago the Research Councils joined together and produced some research data principles, and from those principles many of them aligned their own data sharing policies, some of which were very mature and quite well established, some of which were not. The EPSRC, the Engineering and Physical Sciences Research Council, was tasked with the strongest language, in that they wrote to all Vice-Chancellors and Provosts talking about their expectations that organizations will provide services that enable researchers to keep their data for long periods of time and share it appropriately. They used very strong language, they started to talk about timescales, and they spoke of a period of maintaining data for ten years past its last access. This created absolute hysteria across UK universities, and I think in a lot of ways it has probably been misunderstood: the Research Councils' primary goal, in the UK and most elsewhere, is to support good practice. They see this as good practice, most people understand it is good practice, but there's a big gap between what we can do right at the moment and what is required to fulfill that good practice. So there's a very rich and dynamic policy landscape, from the publishers, the funders, governments, charities, and actually UCL itself, and we'll come on to that in a second. But of course you can't remove the burden, and you can't encourage people to become compliant, unless you provide an incentive for them to do so. Some of the ways that people store their data right at the moment really speak to this, and if we can't provide an incentive framework then I don't think it's really worthwhile starting down this path at all. So UCL, and a number of other universities, are thinking about this as a two-stage journey, and that's what the next couple of slides are going to be talking about: we're talking about not only an infrastructure change but also a culture change. This bilateral conundrum is not new to anyone, least of all to you. So is it all
carrots and no sticks? Well, we're providing an offering that projects can opt in to, and they can opt out. We intend to remove the burden of managing and storing data, and to educate and consult our researchers about resiliency, how backup is managed, and how data remains stable and managed effectively in a technology environment, removing this burden of compliance. Well, it's not really all carrots at all. There are some sticks, and those sticks are compliance: Research Councils are increasingly expecting compliance, and we've been given deadlines in the UK to provide an articulation of how data should be managed, generally in the form of data management plans, but more practically in the form of infrastructures over which people can manage data. So it's not all carrots, and it's not all sticks. One of the first
things that I did when I came along to UCL, about two and a half years ago, was to say: what we need is a roadmap for how we get from where we are right at the moment, which is a rather mixed bag of service provision and intent, to the place where we want to be, which to my mind was attending to the first critical need, providing infrastructures over which people could manage their data. We've split that into six different work streams. I'm not going to talk about all of them in any great depth, but just to give an overview. The first was very important to me, because there was only myself and one other person in the organization when we came on board, and that was to put together a team and build the skills required to do this. I was asked to write a policy in partnership with our Head of Library Services, and this attended to the concerns, in the organization and elsewhere, about how to make responsibility chains available to individuals and the university to make this work; I'll talk about policy in a second. Then infrastructure: this had been talked about for a very long time in many different areas, and I was determined that we weren't going to do any more talking until we had some people in front of some infrastructure, doing stuff. I'm pleased to say that at UCL I have benefited from central support, which I consider to be absolutely essential to get the ball rolling. The stream I don't lose sleep too much about is data management; that's not so much a time bomb, more something to look forward to, a challenge for the future. I was very concerned about providing infrastructures over which people were able to manage their data, rather than trying to get them to manage their data first and then move it into an infrastructure that was resilient and safe. There is an awful lot of data that can be reused already; there are lots of registration and grant management systems and application services, so we need to think about service integration, because if there's one thing that'll turn people off, it's having to register something multiple times. That's an opportunity for us; it's not a high priority right at the moment, but it's something we have to keep an eye on. And of course, even though I enjoy the benefit of centralized support right to the very top of the university, I have to make this in some way sustainable. That doesn't necessarily mean full cost recovery, but it does mean we have to be open and honest about how much this costs, so I've actually been enjoying immersing myself in the business of procuring, managing and storing data in enterprise-grade storage facilities, and I think what we're coming out with is going to be something that's workable. However, what needs to sit across all six of these work streams is a communication activity that starts to elicit the responsibilities of individuals and roles within and across the entire institution. We have to make sure that the policy is understood and supports those roles and responsibilities, and that the infrastructure, the data management facilities, is capable of supporting them. This is the roadmap; I don't know whether it's publicly available just yet, but it will be shortly, and anyone will be able to go to our website and download both version one of our policy and this roadmap. Just quickly, these are the
people in our organization: we are a band of four, and three of us know what we're doing. This is Daniel Duggan and Alastair; they both have higher degrees in high-technology areas in a variety of disciplines, and I'm very fortunate to have a very enthusiastic team. The research data policy was the first goal on my track, and it talks about three core activities. There's responsibility: who in the university is responsible for this? There's compliance: how do we make people compliant, and what does that mean? And there's good practice, which is really about saying: we're helping you to undertake and better the practice that you want, because at the moment you're doing what you do because you have no alternative. There's also a carrot there, an incentive, in the value-added services of long-term provision of storage and accessible, citable objects. That's all contained in our research data policy, and I welcome feedback on it. There's been a rash of policies published in the UK, and no doubt much the same is happening here. Of
course, when you talk about infrastructure and go out and collect requirements, you don't get surprised by what you hear back. Researchers generally want everything: they want everything that's familiar to them, but they want more, they want it faster, and they want it shinier. They want their NFS and CIFS exports, they want their secure transfers, they want high-performance transfers, and then there's the dreaded cloud, the cloud storage that dare not speak its name; that's something I'll talk about in a second, but suffice to say, if you cannot provide these then you have failed. So you have to work out how to provide all these things while still maintaining a low administrative overhead that is manageable in a group of three. It has to be value for money; I'm very open about how much it's costing us, and that generally transpires into a cost per terabyte. And of course we're in central London: I'm fortunate in that I'm not constrained by financial constraints right at the moment, but what I am constrained by is very expensive real estate and a limited power supply. We run two high-performance, managed data centers within our footprint in Bloomsbury, and that is a very costly enterprise. So the solutions have to fit all five of these requirements, and the choices that we
made, and these were design choices. First of all, we're going to start with very, very simple service propositions: you want to start with infrastructure that's tried and trusted, and to challenge this big data hysteria. I don't buy into the big data hysteria; I buy into people having more data than they're able to handle, but for very clear reasons, and I think we can solve that. We need to look at strong abstractions, because we want to avoid locking into proprietary technology: I've procured a single composable component technology, but we are open to joining together with distributed technologies that are already embedded in the departments. UCL is a large and complex organization that has grown by acquisition, and we have to be able to attend to that. We also need to hedge our bets in migrating between storage solutions: if everything is on high-value, faster-spinning disk, that's not a very efficient way to manage storage. We look at the conventional tiering systems and see that they're very useful in business organizations, but possibly not so useful in a research data environment, where we envisage a much richer landscape in both media and storage facilities, and we have to be able to join those together whether they exist in departments or in centrally managed services. We have to be able to provide for everyone, and that's really born out of the notion that different disciplines are funded to different degrees; if you can't provide to all disciplines, which is my brief across UCL, then actually that's not working very well at all. Being an IT department, we're based on project development, but that's not a great way to be innovative; it's a great way to develop stable and operational services. So we have to split our time between project-based development and a high degree of innovative development, which is very small-scale and in-house, and one particular area where we're doing that, which I'll talk about in a second, is our cloud evaluation. So the first thing that
we started with was that we separated the concerns that people generally get mixed up. Having a live area, where data are mutable, but which attended to the critical requirement of stopping people from going down the road and buying more terabyte drives, in a private, project-level arrangement, was our first concern. The central concept was a project, not an individual: allocations are provided to projects, which have individuals, and that's very important for two reasons that we'll come to in a second. But moving away from a live environment and transferring data into an archive environment was something that had been missed in a lot of the previous work done in our organization. So what goes on there? You get a transfer of responsibility. Things that go into an archive go in there and stay there unless they're thrown away; there's no point in archiving something if you can't access it in some way, shape or form, and exploit it. But most importantly, there has to be this extension, this transfer, of responsibility: if an organization is to look after data for long periods of time, it has to be able to take responsibility for those data and make decisions on those data, and that's a departure from the way people think. So we said this would form the basis of our first two services: a live area, which was just a simple allocation, and an archive area, which guaranteed preservation for long periods of time. The second principle is that the project
was the central concept, and that's really important for two reasons. Firstly, it is bound by time: projects have a start and a finish. Well, good projects have a start and a finish; bad projects go on forever, and that's bad practice, and we don't want to do that. If you bind your allocation to a project that has a start and a finish date, you can measure your storage allocation, and data storage becomes a direct cost of the research. There is also a very rich group ethos in research around projects and groups, so when people come to me looking for an allocation, they get a project account.

It also changes people's behavior, and this proved to be one of the real eye-openers here. Instead of the normal pattern of "I am going to do my project, generate data, and right at the end have a mad run around trying to find out how to look after the data", it gets people thinking about what to do with their data after the project period right at the beginning of the project rather than right at the end. That has been a real breakthrough in changing the way people think about data management in a research environment. So data will be ready to archive. Well, I say they will be ready to archive; we'll have to see. I think there will be fewer last-minute rushes, but I can't guarantee that there won't be any.
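Because an allocation is bound to a project with fixed start and end dates, storage can be priced up front as a direct cost of the research. A minimal sketch of that idea; the per-terabyte rate here is a purely illustrative assumption, not a UCL figure:

```python
from datetime import date

def storage_cost(allocation_tb: float, start: date, end: date,
                 price_per_tb_year: float = 100.0) -> float:
    """Price a storage allocation over a project's bounded lifetime.

    A project with a start and finish date lets storage be expressed as
    a direct, chargeable cost of the research; an open-ended allocation
    cannot be costed this way. The rate is an illustrative assumption.
    """
    years = (end - start).days / 365.25
    return allocation_tb * years * price_per_tb_year

# e.g. 10 TB held for a two-year project at the illustrative rate
cost = storage_cost(10, date(2014, 1, 1), date(2016, 1, 1))
```

The point is not the arithmetic but the bound: an allocation without an end date has no finite cost that can be charged to a grant.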
The third principle was that we would use established technology to attend to the early adopters. We did this in the form of IBM's General Parallel File System: GPFS, presented from DDN's GridScaler, is the technology we are using. It's a conventional POSIX file system, and it's high-performance. The early adopters are the people who have large amounts of data and can actually teach us how to move data about, and it was very important that they were able to connect to the high-performance computing resources. This project used to be connected to HPC but was moved away in order to encourage people who have no need for HPC resources, the so-called long tail of researchers, to come on board, and that proved to be quite a good strategic decision. However, we're starting to move back together now, and anyone involved in HPC will know there's a common problem of people using scratch space for storage because there is no alternative; we'd like to supply that alternative to HPC users as well. There are many options for export here: the native file system, SMB, SCP or NFS. It's non-trivial to manage and it has quite a high administrative overhead, but it's familiar, and the early adopters were very important to get on board.
However, our fourth principle was that we required room to innovate, and we innovated by looking at the different technologies around that attended to the one real part of the big data hysteria: that data will grow in an undefined and unquantifiable way. This needed to be scalable, and generally speaking, file systems are not scalable unless you increase your administrative overhead almost exponentially. A relatively new technology at the time was object storage: a seemingly flat service that essentially consists of digital objects, each with a metadata tag and a location, sitting behind a management system that is automatic. There is a REST API, and there are sets of policies that manage how the data objects are replicated; it provides resilience much more efficiently than RAID, and it is highly scalable. There are greater-than-100-petabyte deployments out there right now. It has a low administrative overhead, and we have a native iRODS connector, which I'll talk about shortly.
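A toy illustration of the object-storage model just described: a flat namespace of digital objects, each carrying a metadata tag, with replication handled by policy rather than by the user. This is a from-scratch, in-memory sketch of the concept only, not the actual DDN product or its REST API:

```python
import hashlib

class ObjectStore:
    """Minimal flat object store: data plus a metadata tag per object,
    addressed by an opaque key rather than a file-system path."""

    def __init__(self, replicas: int = 3):
        self.replicas = replicas      # replication policy for new objects
        self._objects = {}            # key -> (data, metadata)

    def put(self, data: bytes, metadata: dict) -> str:
        # The store, not the caller, decides the key (content-derived
        # here), mirroring how object stores manage location for you.
        key = hashlib.sha256(data).hexdigest()
        self._objects[key] = (data, dict(metadata, replicas=self.replicas))
        return key

    def get(self, key: str):
        return self._objects[key]

store = ObjectStore(replicas=3)
key = store.put(b"instrument readings", {"project": "RDS-0001"})
data, meta = store.get(key)
```

There is no directory tree to administer here, which is why the administrative overhead stays flat as the deployment grows.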
For storage access, we have three main activities. There is HPC; our GridScaler system is a SAN presented as GPFS. But the most important thing for us was developing the object storage, which is essentially a NAS presentation sitting behind a highly available, high-performing virtual infrastructure that exposes native CIFS or NFS mount points and is resilient in a very efficient way. And then there is cloud. Talk to anyone and they want to use something that looks like Dropbox, because that's what they're using right now, and there are very good reasons why. They're not alone: Dropbox, Google Drive, SkyDrive, any number of third-party cloud providers. However, it's not performant, it's not cheap, and it's also not very secure. So if we can't provide an alternative, then we've failed before we even start. So we went in with this thinking
that there must actually be some quite mature clients out there over which we could build our private cloud. These cloud activities are very widely used, quite unofficially, in academia, but we found that the landscape is evolving rapidly. We've been evaluating four cloud clients and have not been particularly happy with any of them; it's still quite a mixed offering, and we don't have a solution right at the moment. Each of them has a unique selling point, which is great, but fails on a couple of other counts, so we're very keen for this to move forward rapidly.

Two things have been quite useful in driving this forward for us. One is the debacle with the NSA and GCHQ: the great thing to come out of that was that it gave people an understanding of the value of metadata, which is really great. The other thing that came out of it, particularly in organizations such as higher education where researchers are particularly paranoid, is that they much prefer private clouds to public clouds. So we're intent on providing this, and we're very keen to hear about people's experiences of the various clients, so do feel free to drop us a line. We're not yet convinced that the technology has evolved enough to provide a stable, production-level service with a low administrative overhead. The
component that makes us agnostic to vendor snake oil is iRODS, the Integrated Rule-Oriented Data System. It's a very generic policy engine that can be used to integrate multiple storage technologies; so generic, in fact, that it needs quite a lot of effort to configure. There is, however, traction building in the open-source community to support iRODS, which has been around for a long time, and there's traction in the vendor community to support it too, which we think is a great development. This helps us bridge between the different areas and storage types: the conventional object and tape storage that we control, departmental-level activities, and the archiving components that we're piloting this year. There's also a metadata store that we would like to help people use for enrichment. I don't spend too much time on metadata; I'm a big believer that people will enrich their metadata as they see value in it. This is about building standards, it's about community conventions, and it's about me, an accidental IT person, not getting involved in activities that are not appropriate for me to do. So I'm happy to provide facilities for people to do that, and we'll have to wait and see how it goes. Speaking of metadata: anything you
can do, I can do meta. My belief is that less is more, so when we register projects, or allocations, we collect a small amount of metadata that sits sensibly around the investigators, the project, the members of the project group, how long the project runs for, the funder if there is one, and a small amount of narrative. The enrichment that happens when data move from the live area to the archive area is domain-specific and therefore the responsibility of the researcher; we intend to help as much as we can, but right at the moment we're quite content that what we gather is sufficient. Now, that will probably provoke howls of laughter across the community: there is no semantic or syntactic interoperability available here, beyond the very lowest level. That is something I think we can help to change, but we can't change it on our own, and I don't think it's something that can be done at the university level either. We have to maintain a degree of independence, and anything to do with providing comprehensive metadata has to be done at a national or international level. We are very happy to be part of national and international activities, but we're not in the business of providing that at an institutional level.
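The "less is more" registration record just described can be modelled very simply. The field names below are my own illustration of the categories mentioned (investigators, members, duration, funder, narrative), not UCL's actual schema:

```python
from dataclasses import dataclass
from datetime import date
from typing import List

@dataclass
class ProjectRecord:
    """Less-is-more registration metadata: just enough to bound the
    allocation in time and attribute it to people and a funder."""
    title: str
    principal_investigator: str
    members: List[str]
    start: date
    end: date
    funder: str = ""        # optional: not every project has one
    narrative: str = ""     # short free-text description

    def duration_days(self) -> int:
        # The bounded lifetime is what makes the allocation measurable.
        return (self.end - self.start).days

rec = ProjectRecord(
    title="Imaging pipeline",
    principal_investigator="A. Researcher",
    members=["A. Researcher", "B. Student"],
    start=date(2014, 1, 1),
    end=date(2016, 12, 31),
    funder="EPSRC",
)
```

Anything richer than this, on the argument above, belongs to the researcher's own domain conventions rather than to the institutional record.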
The second part, of course, of data management is data preservation. Moving from one side of the concern to the other, this is the second service offering we're providing. We're piloting it this year and bringing people around to the idea that data preservation is not like pickling onions: there has to be a degree of planning, and there has to be a degree of intervention. I don't expect anyone to focus too long on that image down the bottom; it's a skeleton of data preservation activity, but this one, which I lifted
ruthlessly from an archival data preservation organization, a third party in the UK, is quite neat. If we look at this as a 20-year time span, each transition from grey to yellow indicates that an intervention is required. When you're monitoring media, a large tape library for example, you have to do that constantly, and tape libraries generally do that monitoring themselves. Maintenance of the hardware happens periodically as part of a maintenance schedule. Integrity checking is a much more involved process but has to happen quite regularly, as do hardware upgrades and software migrations, and then hardware migrations, as we move for example from LTO-5 to LTO-6, or LTO-6 to LTO-7. We can't ignore the fact that it takes people to do a lot of this, not all of it, but staff generally form the largest cost here. And every now and then you're completely flummoxed by a format migration, particularly if you have a format, or a user's format, that in some way locks you in. All those transitions have to be looked after when you preserve data for long periods of time.

And this is only data preservation, not digital preservation: the idea that keeping software able to render the data is just as important as preserving the data themselves is not attended to here, and would be an added cost on top. Once we start talking about that, we really start thinking about how much all this costs, and costs spiral quite quickly.
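The integrity checking mentioned above is, at its core, periodic fixity verification: recompute each archived file's checksum and compare it against the value recorded at deposit. A minimal sketch, with hypothetical file names:

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 so large archives fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_manifest(manifest: dict) -> list:
    """Return the paths whose current checksum no longer matches the
    one recorded at deposit time (suspected corruption or bit rot)."""
    return [path for path, expected in manifest.items()
            if sha256_of(path) != expected]

# Demo: record a checksum at "deposit", corrupt the file, re-verify.
archive = Path(tempfile.mkdtemp())
dataset = archive / "dataset.bin"
dataset.write_bytes(b"original payload")
manifest = {dataset: sha256_of(dataset)}
clean = verify_manifest(manifest)        # nothing has changed yet
dataset.write_bytes(b"flipped bits")
failed = verify_manifest(manifest)       # the change is now detected
```

Run on a schedule against the archive, the failed list is what triggers the repair-from-replica intervention; it is the automatable part of the staff cost described above.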
UCL being quite large, service integration is something that really plays on my mind. It's a great way of collecting metadata about the objects that you're storing, and it's a great way of enriching the scholarly record. So the idea that an institutional repository serving the open access agenda in publishing can be supported by an equivalent open access agenda in data is something I'm very keen to develop. This is about the professional record: people don't publish these articles for anything other than their professional record, so supplementing that with the data they generate is something I think we're all intent on delivering. Of course, I'm not into the idea of duplication: if data can be held somewhere else, then I'm happy for that to happen. The idea that we can reference data that were generated at UCL, that UCL has a record of generation rather than anything else, is sufficient for the service integration that I see. Business
analysis: there are four points to measure here. The financial landscape, which is quite hysterical right at the moment; responsibility, which is almost non-existent right at the moment; the continual mix-up over licensing; and the idea that when you say you have a data asset you can then go and sell it, which troubles me deeply. If there is one thing that keeps me awake at night, it's the mess and misunderstanding over licensing research data. The second thing is that long-term commitments are required for this to be sustainable, commitments so long-term that they generally outlast the employment of any of the staff members involved, including the Provost, and quite an unusual time frame for an IT department used to three-to-five-year hardware refresh cycles. So my business analysis doesn't only attend to the costs that researchers and their grant funding bodies bear, but also to the responsibility that my bosses, and their bosses, carry in this analysis. The current state of play:
we have about 40 projects. This has all been done via a soft launch, through word of mouth. That's about 200 users, a rather modest collection, and about 300 terabytes uploaded in the last six months; people are testing the water. But we are starting to develop a bit of a backlog around registration time, so we need to attend to that. I'm quite conscious that because this is new technology, you need to deploy slowly and expand the user base appropriately. That will all change when myself and the head of Library Services do a double act around all the departments and deans, talking about both the open access agenda and the data management agenda and reminding them of UCL's position on these. I believe we will then start to see an increase that will be a lot more impressive. For me, we have four main challenges that
were identified, and I don't think any of these are really new. We're very concerned about authentication: the idea of shared services, in the UK as well as here as I understand it, will not work unless you have stable trust networks and authentication mechanisms. We don't want to start our own, although we do need to find a solution. This is something we're hoping to attend to in a variety of ways; one of those is a project called Project Moonshot, run out of Janet. It's not a product per se; it's more a standard to write your authentication services to, and as with all standards, it is slow to develop and difficult to understand.

We have big concerns about networking. We're going to be flooding our network with data movement, and that's going to attract attention to our very well-developed and very extensive network. UCL has the largest private network in London, and we have license to annoy commuters by digging up roads, so anything that starts to affect that will bring the attention of some very large and powerful people. Plus, some departments are not well provisioned, and we need to identify those and attend to them as quickly as we can.

We want to provide multiple access mechanisms, and that brings an administrative overhead and a software and virtual infrastructure overhead that create challenges. Those are generally attended to by having a large team; well, I have three people who know what they're doing, and we will be needing more people shortly.

And then finally, licensing, and I'm very happy to take questions about how people see research data licensing developing. One thing we managed to get into our policy was that by default, unless a declaration is made, anything that goes into our archive will be dedicated to the public domain via a CC0 waiver. We think this is a very important development, and we're going to push to have the majority of data in there dedicated to the public domain, unless we have obligations under
statutory legislation or any sort of contractual arrangement. So, just to finish up: if
you wanted to read more about how we've implemented all of this, there are a couple of reports out there that we contributed to. Science as an Open Enterprise we contributed to both as UCL and, in my case, while I was at the British Library. More recently there was the LERU report, from the League of European Research Universities, which spoke specifically about a joined-up roadmap for research data; in there we talked about how much it costs to establish the services that we have and how we arrived at our policy decisions. And there's a rather ambitious press release from one of our vendors, DataDirect Networks, that talked about a 100-petabyte cloud to share and preserve data. I think we can safely say that we've taken a step in that direction, but I don't have the space for a hundred petabytes right at the moment; that too may change. I'm very interested to see how our services are taken up as we go.
So, if anyone wants to send questions through, I understand there's a facility for that, and we can read them out and attend to them. Otherwise, I thank you all for listening. That's my email address, or a more generic one if you don't wish to talk to me, and at the bottom there's a URL that takes you to some of our rather sparse web pages.

We have one question: you talked about changing behaviors with data management; how was this tackled at UCL, and did the library also work with you on this? The library is one of my key stakeholders. We work in central IT, and that generally puts you at the bottom of the university as far as I can make out, and I only have four people; we need reach across the university, and in order to get that, the library is our key stakeholder. So when I say that myself and the Director of Library Services are going around doing a double act, that is the beginning of our outreach layer, which has feeders into all the departments. The library is looking for new roles, and I think this is a fantastic new role for the library. Changing the behavior of researchers is going to be time-consuming, and we need help with it. The reason I say that is our experience of going out and talking to people. When you gather requirements, I'm very much of the opinion that you look at what people do and try to help them do it better, rather than trying to second-guess what they need. So we went and looked at what people did, and what they did was mess around moving data between portable hard drives and their computational resources. So we said: if you had a central facility, perhaps you could think about doing it this way, where all your data, your initial data, your abstracted data, cycle from a resilient centralized service to your local provision. Essentially, no network is ever going to beat the
performance of having your data locally. But then you feed it back, wiping the slate clean: you have a central, resilient service that holds all the products and artifacts of your research, and you can pull those across as and when required by your enquiry. It actually only took a couple of minutes before people saw that that's a much better way of at least managing the technology, and we've had a lot of people start doing that, which actually increased uptake somewhat. So that's the beginning, and only the beginning, and I'm very keen that it's recognized as only the beginning; I think there's a long road ahead.

Okay, you mentioned some data are held elsewhere, e.g. in discipline repositories; do you know how much of the UCL data is not under your management? I don't have any idea right at the moment. Anything that's held outside of UCL is likely to be represented in the archive rather than in any live environment; the live environment is for active projects rather than static data, so anything held outside UCL I would consider static and therefore represented in the archive. Duplication is not ideal, but it is sometimes required, and that's what the live area is for. I don't expect anyone to try to deposit material that exists elsewhere, but I'm open to the idea that mistakes can be made, or that there is a requirement we haven't seen yet.

How many research projects are currently using iRODS at UCL? At the moment, no one is, because we're using it as an administrative interface; we haven't fully implemented it, and when we do, that's when it will be opened up to users. We use it as a way of managing the different storage components that we have about the place, and it's not in production service just yet. We need to understand it and configure it to behave the way we wish; as with most policy engines, you have to write your own policies. Generally, the plan is
to open it up once that configuration is done. When a researcher leaves, does UCL keep an archived copy of the data? Anything that goes into UCL's archive will be considered the responsibility of UCL; unless there is a case that an individual researcher can make, it will generally remain in the archive. We don't throw things away unless we're required to, so that's the starting point. Of course, the upside for researchers is that they don't have to look after it themselves: when people archive into the UCL research data archive, they're provided with a stable object they can reference, and we can then provide DataCite DOIs or other persistent identifiers.

Is there a discovery service for the UCL archive? Not yet; that is the third service offering. I've been very determined to separate out all these concerns, and I've deliberately made discovery a third service offering. There's no point in archiving something unless you can get it out again, but discovery will follow closely behind; I'm mostly concerned right now with getting people to understand the importance of archiving data and the implications of archiving data for long periods of time. Of course, anything that goes in has to be able to come back out again.

Do librarians assist researchers with challenges other than storage? Most definitely. The librarians have very rich domain-level knowledge, and they will not only assist with storage: they will be the kingmakers in research data management, because they will be the ones able to advise on the value of metadata, and I think that is going to be a really important process. That's the degree to which we can act locally. When I talk about national initiatives, I don't know whether you can have semantic and syntactic interoperability around the world, but you can have it in your domain. I have to say I'm also a big believer that people generally don't stumble across data that they find useful; they actually have a priori knowledge of what those data may look like, and
they go looking for them. I think serendipitous discovery is quite rare; of course we have to allow for it, but I can't design against that need.

Do you have a takedown or removal policy for public data? We don't intend to store other people's public data. The data we do store are taken on the understanding that people have the rights to dedicate those data to the public domain; if it turns out that they don't, then the data will be liable for takedown, absolutely. We can't ignore requests, but requests will be looked at on an ad hoc, as-required basis; it won't be a situation where anyone can demand a takedown and we simply comply. We have quite a large data security team, and we work quite closely with them in these sorts of situations. I'm a big believer that we should start from a position of making the most data available most appropriately, and I will act on advice as to whether something needs to be taken down or not.

Has there been any dialogue about creating or maintaining stable access to a global sharing network, or does one currently exist? There's been lots of talk about a global sharing network. This happens in a number of domains, particularly astronomy or particle physics: anyone with large amounts of data or enormous facilities generally has a sharing network. But there's no global network, as far as I'm aware, for all types of data, so you'll see domain-specific examples, but you won't see a generic network; they're quite difficult to set up. I hope that answers all the questions.

Indeed, thank you very much, Max. It's
wonderful to have you here, and to actually have you here in person; I was a graduate here myself. Fantastic. Thank you once again to everyone for attending, and thank you to Max. Thank you.