Open Science and Collaborations in Digital Humanities Part 4

Video in TIB AV-Portal: Open Science and Collaborations in Digital Humanities Part 4

Formal Metadata

Title: Open Science and Collaborations in Digital Humanities Part 4
Alternative Title: Data management
License: No Open Access License. German copyright law applies. This film may be used for your own use but it may not be distributed via the internet or passed on to external parties.
Production Place: Dubrovnik, Croatia

OK, so now what we're going to do is compress the last session in with the second half of this session, well, the previous session, if that makes sense. I'll complete the data management and hand over to Jane for some public engagement content. As I said before, some of this stuff might seem commonsensical, it might be stuff that you already do, but being explicit about it helps in your planning and thinking about managing your data. So I'll talk about the data deluge, about some techniques for structuring data, how useful structures can be, and then what to do with preserving it and being sustainable with the data. Essentially you've got more data to work with. We already have the challenge of web archives and the huge corpus of texts to work with, but this is from twenty fourteen, and it was saying that data is doubling in size every two years, hence the exponential curve.
You've got more data to work with. With all of the research portals I mentioned previously, you have to have a poke around to see if there's anything useful for you and whether it's easy to access. CORE is a kind of aggregate research service that harvests all of the metadata from the institutional repositories across the UK. So this is the traditional research outputs mostly, but data is starting to appear in these. It is an aggregate search engine, and if you look at it, about three years ago it had sixty-six million open access articles; a year ago, one hundred and thirty-one million, one hundred and thirty-five million open access articles. So how you search through these as part of your research practice is getting increasingly complicated.
[unclear] is also something that's challenging to work with, whether you're working with the metadata alone or the text itself. There are different requirements, not just in obtaining this stuff but in computing, and what I guess a lot of your research is focused on is this analysis gap: how you work with data at that scale while using machine learning techniques to enrich and enhance it and grab hold of the features that you're interested in.
Why it's important, I think, to structure data is that it can get very messy very quickly. This is an example of a research project that didn't have any editorial control over its vocabulary. It just filled out metadata willy-nilly. It let research assistants use the content management system [unclear] to fill that out, and you can see they've got three hundred and forty-six and fifty-four [unclear]. And this is with a structured research project that was creating metadata, not with a research project that was mining and extracting metadata from noisy text, so already the problem exists.
The mechanisms to control those are really kind of editorial practices, or a methodology for yourselves to employ, which is restricting your vocabularies. This is part of the qualitative process, as you go through and kind of identify terms and words that make sense and reform them over time, but it's always worth keeping in mind to constrain this as much as possible, as long as it's usable for you.
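
One way to picture restricting a vocabulary is a small lookup that maps known variants onto a preferred term and sets aside anything unrecognised for editorial review. The following is only a sketch: the terms, the variants and the function names are invented for illustration and do not come from the project described in the talk.

```python
# Hypothetical controlled vocabulary: preferred term -> known variants.
# The terms are invented for illustration only.
CONTROLLED_VOCABULARY = {
    "spice": {"spices", "spyce"},
    "wool": {"wooll", "woollen goods"},
}


def build_lookup(vocabulary):
    """Invert the vocabulary so every variant maps to its preferred term."""
    lookup = {}
    for preferred, variants in vocabulary.items():
        for variant in variants | {preferred}:
            lookup[variant.casefold()] = preferred
    return lookup


def normalise_tags(raw_tags, vocabulary=CONTROLLED_VOCABULARY):
    """Return (accepted, rejected): normalised terms and tags needing review."""
    lookup = build_lookup(vocabulary)
    accepted, rejected = [], []
    for tag in raw_tags:
        # Collapse stray whitespace and ignore case before looking the tag up.
        key = " ".join(tag.split()).casefold()
        if key in lookup:
            accepted.append(lookup[key])
        else:
            rejected.append(tag)
    return accepted, rejected


if __name__ == "__main__":
    accepted, rejected = normalise_tags(["Spice", " spices ", "spise", "Wooll"])
    print("accepted:", accepted)
    print("needs review:", rejected)
```

The rejected list is the useful part: it is where the editorial decision happens, either extending the vocabulary or correcting the entry.
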
An example of how you structure: in British History Online, which was the research project I talked about previously, before the rebuild project its URL structure was this one at the top, which had source.aspx?pubid=739. Now, 739 was just a database identifier. It was designed by a person who thought about databases in terms of databases; they weren't really thinking about what that identifier meant. There is no relationship between 739 and the text itself. So when we rebuilt the site we deliberately and explicitly put a lot of effort into making the structures of the URL meaningful, not just for the content of the volume that it identifies but for the structure within the volume as well, so using the anchor in the URL to represent paragraph five, to match the paragraph identifiers in the HTML of the page. So if someone was to make a digital citation of this URL and British History Online went offline, they could still derive from the text of the URL what the actual printed source was that was used, which you can't do when you're just using numerical identifiers. So usable structures are something worth thinking about when you're building models. Traditional models are still very important and very usable, but how many people's desktops look like this?
Yeah, OK. So you like to pile your information and search for it later, rather than filing your information and then finding it later. So there are these two approaches. We're very used to the Google-centric single keyword search of the entire mass, and we expect to get back what we expect to get back; we don't really question whether or not the first ten results are everything that we could find. With the filing and the metadata management, and Jane mentioned the importance of metadata when putting your research out online, it makes it much more readily discoverable.
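
The earlier contrast between an opaque database identifier and a meaningful URL can be sketched in a few lines. Both URLs below are assumptions made up for illustration (they use `example.org`, not the real site), and `describe_url` is a hypothetical helper; the point is only how much a reader can recover from the URL text alone.

```python
from urllib.parse import parse_qs, urlparse


def describe_url(url):
    """Break a URL into the parts a reader could interpret without the site."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    return {
        "path_segments": [segment for segment in parsed.path.split("/") if segment],
        "query": {key: values[0] for key, values in query.items()},
        "paragraph": parsed.fragment or None,
    }


# Illustrative assumptions, not real addresses.
OPAQUE = "https://example.org/source.aspx?pubid=739"
MEANINGFUL = "https://example.org/survey-of-london/vol1/pp10-20#p5"

if __name__ == "__main__":
    print(describe_url(OPAQUE))
    print(describe_url(MEANINGFUL))
```

The opaque form yields only a number that means nothing outside the database, while the meaningful form still names a series, a volume, a page range and a paragraph if the site disappears.
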
And this is a diagram from Tim Berners-Lee when he was pushing for the linked open data cloud, and I know you've had a session on linked data already. The idea with this is that if you publish your outputs as PDF, it's a proprietary form and it's difficult to get into. There are open source tools that can extract text from PDFs, but it's a lot more difficult to do than if you publish your data in CSV, in XML or in just plain text. That's much more accessible for tools, much more accessible for people who know tools, and much more accessible for search engines to index. As for linked open data, I'm ambivalent about linked open data. I think it's useful; I think it's still slightly inaccessible. It's not quite the same as publishing open text or open data in a standard format. There is still a skills barrier for people to use linked open data, and that's I think one of the challenges going forward: if you're publishing linked open data, how do you make the data itself usable by people?
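
To show why plain formats are easier for tools, here is a minimal sketch that writes a few records to CSV and reads them back using only the Python standard library. The records and field names are invented for illustration.

```python
import csv
import io


def records_to_csv(records, fieldnames):
    """Serialise a list of dicts as CSV text with a header row."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def csv_to_records(text):
    """Parse CSV text back into a list of dicts (all values are strings)."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


if __name__ == "__main__":
    # Hypothetical records, for illustration only.
    records = [
        {"id": "1", "title": "Letter to the council", "year": "1547"},
        {"id": "2", "title": "Petition, with annex", "year": "1553"},
    ]
    text = records_to_csv(records, ["id", "title", "year"])
    print(text)
    print(csv_to_records(text) == records)
```

Anyone with a text editor, a spreadsheet or a scripting language can open the result, which is not true of text locked inside a page-layout format.
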
So now storage, preservation and sustainability. This stuff is probably going to sound familiar. How many people do backups of your work?
Yes? Is it Dropbox sync, or is it an actual backup? An actual backup, OK. And do you use USB disks or CD-ROMs or another hard drive? OK, good. Has anyone experienced data loss, lost a hard drive? OK. Right. Now, it might sound silly if you haven't experienced it, but when you do, you realize the importance of backups. Whether they are sync backups or additional hard drive backups, as soon as you make a copy of something you have to maintain two versions of it. So having a system in place to mechanize that, whether that is backups or keeping versions of the research data directory that you're collecting and building as part of your research, is up to you, but there are mechanisms to help with that.
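
One simple mechanism of that kind is a script that copies the working research directory into a new, timestamped folder each time it runs. This is a sketch under stated assumptions: the function names, the naming scheme and the paths are hypothetical, and a real setup would also want copies on separate media.

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path


def snapshot(research_dir, backup_root, now=None):
    """Copy research_dir to backup_root/<name>-<UTC timestamp>; return the new path."""
    research_dir = Path(research_dir)
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    target = Path(backup_root) / f"{research_dir.name}-{stamp}"
    # copytree refuses to overwrite, so an existing snapshot is never clobbered.
    shutil.copytree(research_dir, target)
    return target


def list_snapshots(research_name, backup_root):
    """Return existing snapshots, oldest first (timestamps sort lexically)."""
    return sorted(Path(backup_root).glob(f"{research_name}-*"))


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "thesis-data"  # hypothetical research directory
        source.mkdir()
        (source / "notes.txt").write_text("draft")
        print(snapshot(source, Path(tmp) / "backups"))
```

Each run leaves the earlier snapshots untouched, so there are dated copies to go back to rather than a single mirror that silently follows every mistake.
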
Hard drives also go out of style. This is a floppy disk in the University of London archive which we're unable to get the data off, because we don't have a five and a quarter inch floppy disk drive any more. And if we did have a five and a quarter inch floppy disk drive and we put the disk into a modern operating system that could read the file system, it would put all of its metadata on there and index it so that it was searchable and convenient, thereby destroying the actual bits of the disk. So there's something about backups that makes them difficult to reuse in the future, but at least you've got them for disasters. There's a whole other host of obsolete media, effectively. These are obsolete as well. [unclear]
Has anyone used cassettes to store data? You did? It took a long time to get your data off of them. I'm not sure how many cassettes would go in the archive, but again it's an obsolescence problem with the backups here. [unclear] It's a difficult problem. So yes, you have to keep your active research directory accessible to you and backed up for you as well. [unclear] It does, in time and in storage.
So the types of storage as well. A lot of these you'll obviously be familiar with. The digital repositories and archives are where you should really start to think about depositing your cleaned-up and finalized research outputs. And the UK Web Archive as well: not a place to put your research data, but perhaps the European Open Science Cloud is. And any other cloud infrastructures that you'll tap into as part of your content discovery process are ones that you should think about reusing at the end of your research life cycle as well.
We've been exploring using Archivematica and Preservica, which are digital preservation systems. Zenodo is another version of those, with an open source, I guess, metadata repository. The ones across the bottom: Solid is the distributed storage system that Tim Berners-Lee has been promoting recently. IPFS is the InterPlanetary File System, and that is also a distributed data protocol. These are fairly new technologies and they're not widely used. They're distributed technologies, so they act a bit like a torrent, so you need to at least have a host seeding your data for other people to replicate and syndicate it. But I think these sorts of technologies will start to come into play over the next five to ten years in terms of propagating research data and making it available.
And the reason I mention this is because the old technologies are going to get obsolete, as they necessarily do. So let's talk a little bit about documentation. If you're using closed formats, there's not much you can document in terms of accessing the content of the files themselves unless you've got the applications to open them. This is important, so research the standards, like TEI [unclear] and RDF, and the ontologies. There are popular ontologies; everybody can create their own ontology, but that's just creating another silo. A little bit of effort put into figuring out which standards are appropriate for the tools and the data pipelines that you're producing goes a long way.
Documentation types: essentially, as part of the research project you've got administrative, methodological, descriptive and technical documentation. Administrative documentation goes back to the research project life cycles I was talking about, and it's often a Gantt chart, where you just forward-plan the phases of your research, the things that you might be looking at. There are dependencies between these as well. Does anyone think about their research activities in this phased way? Has anyone planned ahead to the kind of second and third year? A little bit? OK. I mean, you always have to be adjusting those. You can never forward-plan too far ahead, but it is useful to look as far as you can given the information you've currently got, and just be comfortable reforming the plan, changing it completely or ditching it completely. As long as you're keeping documentation for this, you can justify your decisions about what you did at that point in time in your research project. Methodological documentation is mainly coming out in the form of Jupyter notebooks. As you're scripting, if you're just editing a text file and writing algorithms and scripts, you're not necessarily keeping the versions as you go, but you're able to do that much more easily in a Jupyter notebook. Has anyone used Jupyter notebooks? They're a web browser implementation of the programming environment, so what you can do is mix cells of code with cells of text and cells of graphics and charts. So essentially, if you think about writing your processes and explorations in a physical notebook, you can do the same thing with computational work. And these are increasingly published as well. That kind of shows people the thought process; it shows how you were interpreting each of the algorithms as you went. It's a tremendously useful way to build things up.
Descriptive documentation is more about kind of project-level documentation, file- or database-level documentation, or variable- or item-level documentation. It's not quite the same as the technical metadata; it's more to do with the context of each of these items within your collection. The UK Data Archive actually mandates some data documentation when you deposit your data. And if you're thinking about other repositories to put what you're creating and collecting in, this is where these types of documentation come in. They are additional files that you have to create, so additional work, but it's well worth it.
And some of the repositories will mandate that you include various documentation. It's a little bit like metadata, but it can be more verbose. You could just use a Word document to kind of describe what you were thinking about, your context at the time, but it's going to be really instrumental when you go back in one year's or three years' time to look at that directory of files that is six hundred gigabytes in size, to inform you what you were doing with it at the time.
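
A descriptive note of that kind can be as plain as a text file listing what is in the directory alongside a paragraph of context. The sketch below generates one; the layout of the README and the `build_readme` name are assumptions, not a format that any particular repository mandates.

```python
from pathlib import Path


def build_readme(data_dir, project, context_note):
    """Return README text: project name, free-text context, and a file inventory."""
    data_dir = Path(data_dir)
    lines = [f"Project: {project}", "", "Context:", context_note, "", "Files:"]
    files = sorted(path for path in data_dir.rglob("*") if path.is_file())
    for path in files:
        relative = path.relative_to(data_dir).as_posix()
        lines.append(f"- {relative} ({path.stat().st_size} bytes)")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "sources").mkdir()
        (root / "sources" / "speeches.csv").write_bytes(b"year,text\n")
        # Hypothetical project name and context note.
        print(build_readme(root, "Example project", "Collected in year one; not yet cleaned."))
```

The inventory is the cheap part to automate; the context paragraph is the part only the researcher can write, and the part that will matter in three years.
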
And the technical documentation: a lot of this already comes online. If you're using open standards, then the technical standards and the schema descriptions will already be available online. But they will change over time, because if you're using libraries that have an API, for example, then the API's technical documentation that is available online will be kept up to date with the latest version of those libraries, and you need to make sure that the open source software providers that you're using are maintaining a back catalogue of the versions of the APIs that your code is dependent on. If you're building something yourself, then you should put your best efforts into describing technically the architectures and the schemas and the decisions behind your data structures.
This is where standards help. That's not a standard; that's just another relational database.
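
Because online API documentation tracks the latest release, it helps to record which versions your own code actually ran against. A minimal sketch using the standard library's `importlib.metadata` follows; the package names passed in are placeholders, and `record_versions` is a hypothetical helper.

```python
import sys
from importlib import metadata


def record_versions(packages):
    """Return a dict of the Python version and the installed version of each package."""
    versions = {"python": sys.version.split()[0]}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Keep a visible marker rather than silently dropping the package.
            versions[name] = "not installed"
    return versions


if __name__ == "__main__":
    # Placeholder package names; substitute the libraries your scripts import.
    for name, version in record_versions(["numpy", "pandas"]).items():
        print(f"{name}=={version}")
```

Saving this output next to the results gives a later reader a starting point for finding the matching version of the documentation.
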
And who uses version control in their scripting or their documentation? You do? OK. Do you use it for versioning your data? Do you use it for versioning your project documentation? It can be used for all those things. You can have private repositories to keep snapshots of your working research directories. A backup is a snapshot as well, but you don't quite get this kind of replayability. With Git, for example, you can revert back to a previous state. So if you completely restructured your research data collection, say you had fifty different PDFs that you'd collected and grouped nicely together, and you restructured those, but you want to get back to that previous way of describing things, then if you had kept all of your changes in version control you could actually get back to that point, to re-inform yourself of the decisions made in the past. I'd encourage you to learn Git if you don't know it already.
So at a kind of collection-of-data level it's really useful, but the individual document level is where it really originated. It was for source code management, for software developers in teams to track who was making changes to which lines of which file, so they could coordinate their programme of work and distribute it among the people in a team. What you've actually got is this really rich line-by-line provenance model of who did what to which line, when.
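
The line-by-line comparison at the heart of that provenance model can be illustrated with Python's standard `difflib`. This is not how Git stores history, and it records neither who made a change nor when; it is only a sketch of what "which lines changed between two versions" looks like. The sample texts are invented.

```python
import difflib


def changed_lines(old_text, new_text, old_label="previous", new_label="current"):
    """Return a unified diff (list of lines) between two versions of a text."""
    diff = difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile=old_label,
        tofile=new_label,
        lineterm="",
    )
    return list(diff)


if __name__ == "__main__":
    # Invented sample versions of a small data description.
    before = "title: Letters\ncount: 50\nformat: PDF"
    after = "title: Letters\ncount: 52\nformat: PDF"
    print("\n".join(changed_lines(before, after)))
```

A version control system adds the author, the timestamp and a commit message to each such change, which is what turns a diff into provenance.
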
So it allows you to reflect on past work, and it allows you to revert to previous versions. You know, Wikipedia has version control. If you were citing a specific page of Wikipedia at a certain point in time, then your citation will say when you accessed it, but the content of the page may have changed since then. At least with Wikipedia's website you, or the person following up on your digital citation, will be able to find the precise text of the page at the time that you accessed it, which could be important.
Version control allows you to share with others. You can synchronize copies between each other, and the Git protocol kind of manages a lot of that for you. You can pull copies of the code base from someone else, make your own changes to that, and push those across to them really easily. You don't need to use USB sticks quite so much. So there's a kind of speed-up with working and sharing collaboratively like this.
It also allows you to cite the precise state. Zenodo, another infrastructure, which has been built by CERN, already has version control, so that you can cite a specific version. Not all repositories will allow you to do this, but they should.
And just a quick note on British History Online again. There was a recent article by Tim Hitchcock called Confronting the Digital, and, Jane, correct me if I'm not characterizing this correctly, what was happening in historical research publishing was that researchers were using digital resources like British History Online to access texts and primary sources from the libraries and the archives they were interested in. They would then write their academic articles and publish them, but the citations that they were using referred to the original books that they could have gone to visit in the archives themselves, when they were actually using the digital surrogate. So this article is really about the slightly disingenuous nature of using digital objects for your research but citing the original printed material itself. So if you're using digital objects, if you're using digital resources, couple that with version control and couple that with documentation to ensure that you cite the object you're literally using, rather than treating it as a surrogate of the object. We need to start redefining digital resources as primary sources in their own right.
And teach yourself Git if you haven't already. This is a link to the Programming Historian website. It's a really nice, gentle introduction to how to use Git, how to get the basics up and running, and the kind of methodologies of committing changes to your code bases or to your databases and then reverting and switching back and forth. The Programming Historian has quite a lot of other really great research articles, lots of technical articles.
So, on reproducibility: all of this kind of data management 101 stuff is really to get you thinking about making what you're collecting and what you're creating as you do your research into a kind of corpus that can be used for reproducibility. So the more open, the more radically open, you are about your methods, your mistakes, your research data and any documentation, the better.
This is where the kind of quantitative methods come into it as well. This is the Trove digital library, which has scanned Australian newspapers, historical Australian newspapers, and you can see the OCR methods that they used were absolutely terrible. They have improved since, but what they built into this was a crowdsourcing mechanism, so people come in and correct it. But if you didn't know the methods that they used to produce it and you were just using this as a search engine, you wouldn't know what you were missing. You could look at the original printed material and see what the words should be, but if you're searching for keywords and you didn't understand that there are errors in the OCR production methods, you wouldn't know to dig a little bit deeper for the sorts of articles that you're looking for.
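
To make the OCR point concrete: an exact keyword search misses misrecognised spellings, whereas a fuzzy match can surface them. The sketch below uses `difflib.get_close_matches` from the standard library; the sample text, the cutoff value and the `fuzzy_find` helper are assumptions chosen for illustration, not the method of any particular library or archive.

```python
import difflib

PUNCTUATION = ".,;:!?\"'()"


def fuzzy_find(keyword, ocr_text, cutoff=0.75):
    """Return distinct tokens in ocr_text that closely resemble keyword."""
    tokens = {token.strip(PUNCTUATION).lower() for token in ocr_text.split()}
    tokens.discard("")
    matches = difflib.get_close_matches(keyword.lower(), tokens, n=10, cutoff=cutoff)
    return sorted(matches)


if __name__ == "__main__":
    # Invented sample with two typical OCR misreadings of "parliament".
    text = "The parliarnent met today. Parliament was prorogued; the parlimnent adjourned."
    print("exact hits:", text.lower().count("parliament"))
    print("fuzzy hits:", fuzzy_find("parliament", text))
```

The cutoff is a trade-off: lower values catch more damaged words but also more false positives, so the chosen value belongs in the method documentation too.
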
And qualitative methods as well; documenting these is very difficult. This is an example from one of my colleagues' PhD theses. He actually had a pile of papers like this, which was all of the articles that he was using as part of his process. That's not his filing cabinet, but he did go through a kind of filing process, a manual organizational process of physical papers, which is the coding process in qualitative research. That's much more rapid and much easier to do with tools like Zotero [unclear], where you can just get a DOI for a paper and drop it into your Zotero database and tag and annotate it kind of dynamically as you go. [inaudible exchange with the audience] I think, as Jane mentioned, with tools like these, they inevitably get bought up by a company and you lose the kind of openness [unclear] which is what made them popular to start with. Zotero has maintained its independence so far; I hope it maintains that for longer. It's quite good, as it will sync your entire article collection, including the files if you download them as attachments, so it's quite a good way to synchronize. But again, this is a synchronization of a sub-collection of your research using a separate system.
The reason we're talking about exposing the guts of all of your research is for critical inquiry: it really allows other people to interrogate your articles at the end of this. Here is an example from a colleague who used the Parliamentary Metadata Language to run sentiment analysis across the speech texts from the UK Hansard corpus. Exactly how, and which software he used, is written in his research article, but the implementation of the software, how it implements sentiment analysis, which lexicons or dictionaries it uses to do that, is something that you then need to make a further inquiry to figure out. And then on top of that he has calculated and aggregated the sentiment across the speeches across the years, and there is nowhere in his article that gives an idea of where the source code for this might be, or the algorithms that he has used to do the statistical calculation that produced this visualisation.

So what I'm getting at here is, if you wanted to follow up with this and interrogate the methods of an article, you need to contact the researcher, ask them questions, ask for their data, see if you could get a sample of their algorithms, and talk through the methods in person if they're not explicit in the published output. And it obviously depends on the way that the research article is written as well: this one is methodological, but it doesn't go into technical methods; it goes into a kind of high-level schematic. So think about that as you're writing your article. People will come inquiring, and you can get ahead of them if you are more verbose and explicit about how you've done things and why you've done things.
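To make concrete how many undocumented choices sit behind a "sentiment per year" chart, here is a minimal sketch of one possible lexicon-based approach. It is not the colleague's method, which the talk says is not published; the tiny lexicon, the tokenisation by whitespace, the normalisation by speech length and the use of a plain mean per year are all assumptions chosen for illustration.

```python
from collections import defaultdict

# Hypothetical lexicon: which words count, and with what weight,
# is exactly the kind of detail an article would need to disclose.
LEXICON = {"good": 1, "great": 1, "support": 1,
           "bad": -1, "fail": -1, "crisis": -1}

def speech_score(text):
    """Sum of lexicon weights divided by the number of tokens."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(LEXICON.get(token, 0) for token in tokens) / len(tokens)

def mean_sentiment_by_year(speeches):
    """Aggregate (year, text) pairs into a mean score per year."""
    by_year = defaultdict(list)
    for year, text in speeches:
        by_year[year].append(speech_score(text))
    return {year: sum(scores) / len(scores)
            for year, scores in sorted(by_year.items())}

# Invented sample speeches, for illustration only.
speeches = [
    (1990, "this is a good bill"),
    (1990, "a bad bad crisis"),
    (1991, "great support"),
]

print(mean_sentiment_by_year(speeches))
```

Changing any one of those choices (a different lexicon, a median instead of a mean, weighting by speech length) changes the curve, which is why the source code and the lexicon need to be findable alongside the article.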
So all of this is really about the scientific method; it's about the reproducibility of the research that you're producing. But there is a distinction between replicating an experiment and reproducing the findings of the experiment. If you completely publish the data and you completely publish the source code, there is still an element of getting that up and running for somebody. There is a software development ecosystem; you can't just pick up any old library and start running it on your computer. There are obviously things you have to go through to get scripts running and to use open source software. So that's not quite replication of the experiment, but that's a hurdle that most people have to go through. Reproducing the findings would be figuring out exactly, from a description of what their methods were, getting the original source data, perhaps using the same library that they were using, and then reconstructing that for yourself to try to replay the experiment, rather than just re-running the entire source code.

And scrutable versus transparent kind of gets back to the critical inquiry issue. So this article is scrutable: we could inquire as much as we wanted to, to figure out from the researchers who published this how they did it, why they did it and what their findings were along the way. But if they had published all of that on a blog, on a data repository, in different forms, then that's moving towards transparency. If all of those objects are findable and discoverable, then we're heading towards transparency rather than requiring people to be scrutinising everything in a manual way. So it facilitates the reuse of your research.

And the "Confronting the Digital" article here is the one that I described previously, about citing the digital objects rather than the original source material. Jane has mentioned digital object identifiers as well, so for as many of the items that you publish as possible, if you can attach a digital object identifier to them, then you're at least improving the findability and discoverability of your results.
OK, crowdsourcing, and I'll talk quickly about this. Effectively it's outsourcing small tasks; it's piecework for annotation work. It devalues labour, and there are a lot of negative connotations associated with it, but if it's used properly and you're transparent about how you've used it and why you've used it, then it can be really useful. Obviously you can use it for machine learning annotation. So Galaxy Zoo is a crowdsourcing resource launched in 2007, and the Zooniverse as well; they've got other projects in here too.

With the Zooniverse, you may be familiar with Galaxy Zoo, the project that was launched as the very first crowdsourcing one, which was to identify astronomical features, what kind of nebula or so on was being represented in the images. I signed up for it when it first started to see how it worked, and it was fantastically boring. Once you had clicked on about five of these images there was nothing engaging in there at all, but they didn't need me to stay on doing two hundred if they got a million people doing one each and then going away; that was fine. That was the first thing they did, and they've moved much more into humanities research and the generation of text for people like us to do research with. Handwritten text recognition is improving all the time, but even a couple of years ago it was really, really poor, so the best way to create content was to get people to transcribe it for you. So the images of the documents would be presented online, and then they would ask people to volunteer to log in and transcribe and correct.

They'd triangulate between people's contributions, so you didn't need to be an expert, but if three people transcribed the same letter the same way then they would take that as a result, and if there was a difference they would wait until another person had come along to transcribe it. And it was very much pitched as crowdsourcing: here am I, a wonderful academic, I have a great idea, I don't have enough time, you people out there can do it for me. Which is where this kind of exploitative aspect that Marty was talking about comes into it. And I think some of the projects you're going to go on to talk about here are much more about this idea of sharing processes and co-creating data and working with the people who might be doing this to help us create content.
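The triangulation between volunteers described above can be sketched roughly as follows. This is not the Zooniverse's actual algorithm, which the talk does not specify; it is a minimal illustration that accepts a transcription once a required number of volunteers have typed the same text, and otherwise signals that another volunteer is needed. The function name, the whitespace normalisation and the default of three agreeing volunteers are assumptions.

```python
from collections import Counter

def consensus(transcriptions, required=3):
    """Return the agreed text once `required` volunteers match,
    or None if another volunteer is still needed."""
    if not transcriptions:
        return None
    # Collapse runs of whitespace so trivial spacing differences
    # do not count as disagreement (an assumed normalisation).
    normalised = [" ".join(t.split()) for t in transcriptions]
    text, count = Counter(normalised).most_common(1)[0]
    return text if count >= required else None

# Invented volunteer submissions for one letter.
submissions = ["Dear Sir", "Dear Sir", "Dear Sin"]
print(consensus(submissions))                 # no agreement of three yet
print(consensus(submissions + ["Dear Sir"]))  # a fourth volunteer settles it
```

Requiring exact agreement keeps the rule simple and needs no expert in the loop, at the cost of asking more volunteers to look at difficult passages.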
So one of those projects is called the Layers of London project. It's a public heritage funded project and its focus was on public engagement, and there was a creative agency that designed this web application. The design of the application was explicitly about allowing school children to contribute content and annotate content, and behind the scenes they have an editorial process making sure that what has been contributed is still relevant and quality controlled.
This is just another screenshot of what I mentioned, the Trove digitisation of the Australian newspapers again. The OCR was very unreliable initially, but they've built in these crowdsourcing mechanisms to allow people to annotate it. What I'm not sure about is the approval processes for those submissions. So anyone could go in there and annotate and correct the OCR, but you have to dig around in their website to see how they approve those changes and whether or not they are verified by a human.

Some other examples that got around crowdsourcing were Early English Books Online, EEBO-TCP, which used a double-keying process, which is essentially typing it twice, doing a diff between those two text files and checking for any differences. With those differences, you look back at the original page scans to make the correction that is appropriate. These methods are, I think they say, ninety-nine point nine percent accurate: much more costly, much more labour intensive, but you get the accuracy out of it.

The Transcribe Bentham project at University College London launched in 2010 and explicitly tried to build in a gamified mechanism for transcription, with a verified approval process for what people were contributing, and so they've got the Benthamometer. Since 2010 they've transcribed twenty-two thousand pages, and around twenty-one thousand of those, ninety-six percent, have been checked and approved by the editorial team. So part of the crowdsourcing is not just using Amazon Turk to get the labels that you want; there is a verification and validation process by people who know the content, who can verify and vouch for the labels that you're able to pay for en masse.
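The double-keying idea mentioned above, typing a page twice and comparing the two files, can be sketched with Python's standard `difflib` module. This is an illustration only, not the EEBO-TCP workflow: the function name and sample text are invented, and the comparison is line by line, whereas a real pipeline might compare at word or character level.

```python
from difflib import SequenceMatcher

def keying_disagreements(first, second):
    """Compare two independent keyings of the same page line by line.

    Returns (line_number, lines_from_first, lines_from_second) for each
    stretch where the keyings differ, so that an editor can check those
    places against the original page scan.
    """
    a, b = first.splitlines(), second.splitlines()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    disagreements = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            disagreements.append((i1 + 1, a[i1:i2], b[j1:j2]))
    return disagreements

# Invented example: the second typist keyed "ouer" for "over".
keying_one = "The quick brown fox\njumps over\nthe lazy dog"
keying_two = "The quick brown fox\njumps ouer\nthe lazy dog"

for line_no, from_first, from_second in keying_disagreements(keying_one, keying_two):
    print(line_no, from_first, from_second)
```

The comparison only flags where the two typists disagree; deciding which version is right still needs a person looking at the scan, which is where the extra cost comes from.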
The Royal Society has recently launched a website, Science in the Making, which is also an attempt to do these collaborative, crowdsourced, verified content publishing systems. There is a series of peer review mechanisms behind the scenes, in administrative user interfaces, to make sure that what people are contributing conforms not just to editorial guidelines, so it is marked up correctly, but that they're actually not just spamming the system and hoping that it gets published on the site. There is reputation management involved, not just academic quality control.
And Amazon Mechanical Turk is probably useful, but again it's diminishing the value of labour, and there's this difficulty in how to credit or attribute the people who are helping you. And if the incentive is solely financial, they're going to be trying to rack up as many labels as they can without much care for the content that they are labelling. If you need to use Mechanical Turk for the amount of annotations, you at least need to document that this is how you got them.

There's another tool, which is not crowdsourcing per se, but it's a handwriting transcription tool which has used machine learning technologies, with palaeographers, the handwriting experts, informing the annotations that were used for the machine learning models that are used for handwriting recognition. So this is a case where you do need experts to do your annotations. You can crowdsource it if you want to, but I think palaeographers are kind of difficult to crowdsource.
And the sorts of examples that come up with this: the British Library released a whole bunch of its images on to Flickr, and the grey tags at the top were, I think, yes, human-generated tags. Once this collection of images was released on to Flickr, the human-generated tags were then used with an image classifier to generate the blue tags, which are the SherlockNet tags. Someone has labelled this as a wildebeest, and a gnu, but the machine has come back and called it a dog. So using fairly accurate human tags still gives errors in the machine classification task. It's difficult to scrutinise the reason that these have come out; I think the best way to do that is to be open about the models and the annotation tools that you've used so that people could attempt to reproduce it.

Here is another example: the koala has been tagged automatically as a dog as well, but someone has misspelt koala, K-U-A-L-A. So there is a simple verification process here that you won't get with Amazon Turk, for example. I think you might need to prescribe quite specific guidelines for the sorts of tags that they use, but again you need to put effort into making sure that your annotations are correct.
That is the importance of the peer review process.
I think that's it for crowdsourcing. So it's a useful technique, but you need to build in project management methods, your peer review processes and your verification.
And we were going to get you to do an exercise now. We had listed various crowdsourcing platforms, and we were going to get you to have a go and see what it's like to be on the user's side, and how the different interfaces affect the quality of the information they are going to get from people. But I think, given that we were overrunning after lunch, we won't do that. We'll provide a list of sites so you can go and have a look. There are various ones: for transcribing text, for marking up architectural plans to identify the shape of a building, that kind of thing; there are lots of different forms. But we won't do that now. We'll skip that and go straight on to the next session, which is a little bit about public engagement, and then we're going to hand back to you for the last exercise of the day, with an elevator pitch, which I will explain a little bit later.

We've talked a lot today, and you will have heard through talking with your supervisors, that one of the really important aspects of any of these European projects is to show the wider value for society and to communicate the importance of your work to specialist publics, but also to a general public and people who know nothing about your work at all. So the National Co-ordinating Centre for Public Engagement in the UK describes that kind of activity as the myriad of ways in which the activity and benefits of higher education and research can be shared with the public. Engagement is by definition a two-way process, so that comes back to some of the crowdsourcing things that Marty was talking about, involving interaction and listening, with the goal of generating mutual benefit. So when it works well, it's not just about you telling someone what you do and then thinking, "I've done my public engagement", and going away. Ideally you'll learn from their responses to your work as well, and they will have learned something from you that they can take away, and it will help them to think differently about something. So it's not transmission of information; it's the exchange of knowledge, and different knowledge from different publics. And it's very hard to get right. I've been in lots of public engagement events, I'm sure Martin has the same, where an academic comes along, stands at the front, gives what is effectively an academic lecture and then leaves after a few questions. That's not public engagement, and that's not an effective way to do it, however good the speaker is.
And I think it's important to think about the different audiences; we talked already about who you are trying to get to, because what good public engagement looks like is going to vary enormously depending on the groups of people that you're talking to. So this is a chart, again from the same body in the UK, which charts different kinds of audiences and stakeholders for the research that you're doing. So they have the public sector, civil society, the general public, business, and various subsets of that including social enterprise, cultural institutions, community, the locality and non-governmental organisations, depending on the area you're working in. And I would say that probably at least two or three of these groups particularly fit the project here: business and creative industry would be interested in the kind of work that you're doing, and you need to think differently about the best way to get them to listen to what you want to tell them.

And you may all know about the European Researchers' Night. I think this shows how well embedded this is in these EU funding programmes: there's an annual European Researchers' Night, which is designed to showcase the types of research that are being funded by the European Commission. I don't know if any of you have got plans to be involved this year; it's come quite early in your research, but it's on the twenty-eighth of September, so there may well be something going on in your own institution. It's a good idea to go along and see what other people are doing, what works for you, and what you might be able to learn from how they presented it. And again, if someone is doing something you think is really great, just copy it; it's good practice, nobody minds.

In my own institution, the University of London, we run an annual national festival of the humanities called Being Human, which involves institutions across the country; there are normally about three or four hundred different institutions taking part. And we give it a theme every year so that there's something to hook people and get them interested, and this year it's "Discoveries and Secrets", and that's running in November. We've done a number of activities over the years for public engagement around digital research, mostly using web archives.
In the very first year of the project, which had a digital theme to it, what it means to be human in a digital age, we presented the web archive, and it was pretty new; it was pretty new to know what web archives were. Almost nobody who came along to the event had the first idea what web archives were, and we had a lot of discussions along the lines of, "Wow, so you archive the whole web?" "Well, not the whole web, but quite a bit of it, and this is how it works and this is how it's different from the live web."

Just by the nature of those kinds of programmes, we had quite a lot of retired people come along who were not that familiar even with the live web, but it was amazing how quickly they got interested when you could show them, for example, "this is the website of your village church ten years ago". There was something in it for them that they could recognise and get a handle on; they could immediately see the importance and value of this, and got a greater understanding of why it was important to preserve digital materials for the future. And I think we found it easier to explain to that group of people than to academics, including a lot of historians, what the value of this was, whereas members of the public would say, "This is great, and can I see the company I used to work for ten years ago?" So it's finding that angle that is going to get people to remember what you're talking about.
We also, in the following year, in partnership with our library, ran a computer and video game arcade event, using computer games that have been archived and then emulated on the Internet Archive. We had six or seven games that we set up running. It was a bit of a challenge for us because they had to be emulated, and the Internet Archive does not tell you what the new keystrokes that replace the old operations are, so you had to go by trial and error and go through everything and say, "OK, Ctrl-Alt means that." It's a bit difficult, especially if you've never played the games before, which again speaks to that documentation and also a digital ethnography. If you had got someone in who was a real expert in using one of these things, they could have said immediately, "OK, that key does this, this action", and so on, but all that's missing, so we had to reconstruct it.

So we got people in to play these games, a really broad variety of audiences. We had masters students along who were doing computer design, and they were very interested in the old aesthetic. We had quite a lot of school children coming in with their parents, and the parents could reminisce about playing the games when they were young, and it was new to the children as well; and again, imparting an idea about the value of these as important items of cultural heritage. We also tried to make it look a little bit like an arcade, as much as we could, so we even had a couple of disco balls going to get the lighting right, and we searched online for what I would call material culture, the posters and magazines and so on from the time that talked about these, and we printed all of those out and put them around so that people could see the context for them as well.

And the most popular game was one of the older ones, which is an American educational game called The Oregon Trail. It was designed as one of those pathway games where you make a decision and that sends you down one route and then another, and you see how successful you would have been at navigating across to the west of America if you were a pioneer. I think I made it to the end once, but my entire family drowned on the way. But, you know, hey, these things happen. You might not have enough food, or you didn't have enough bullets with you, that kind of thing. And we had a group of masters students who spent a good two hours with this really, really two-dimensional game, working furiously to get to the end, and getting an understanding of what internet culture was like then and how you could use these materials for education. So it can work really well, and it's great fun as well, having people in and getting really excited about the kind of stuff that you work with and interact with.

For this year's event, this is going to be connected up to Cleopatra, so it will have Cleopatra branding. We're doing something called "Our Mutual Friend, the Machine", Our Mutual Friend being a Dickens novel. And where we're running it, we're not making people come to a university. For people who haven't gone to university, universities can be very intimidating. Setting foot over a threshold when you think it's not a place for you is a real challenge, so we want to take things out to people. I mean, you'll probably get lots of middle-class people coming anyway, because it's in a bookshop in the middle of London, but it's still an attempt to do that, and you can hold things in libraries or community centres, that kind of thing. So we're going to hold it in the bar in the basement of a bookshop on Tottenham Court Road in central London. Marty has been exploring computer-generated poetry from the nineteenth-century corpus from Project Gutenberg, and we're going to allow visitors to contribute seed words and see what poetry they get back from that. And also we're going to have an actual poet who is going to come along, and we're going to have what we call a poetry slam, which is a contest between the machine-generated poetry and the actual poet, and see which people prefer. And then get them to think about authorship: what does it mean when a poem has been produced by machine learning? Is the author the person who created the algorithm, or the algorithm, and what do you do about the data it's trained on? So what does authorship mean; that's the kind of academic aspect that we want people to experiment with.

And this is one of the poems that Marty has generated, whether you're the author or the algorithm is the author, from the poems of Robert Browning. It sort of works: "was are insured social be the rest here on these crowded benches now sullied the throbbing train or take the maintenance free freeze the way and the cloud the cloud know i'm a church and i might charge we're too closely links today to clients with modest kindness on the crowd". It's a little bit clunky. I wouldn't have repeated "the cloud, the cloud" myself, but there you go. We found it was much, much better with poetry than with prose. You can immediately tell that prose has not been produced by a human, because the grammar is just too complicated and it breaks down really quickly and it's not logical, whereas poetry doesn't need to be logical. In fact it's obscure or ambiguous by nature, so we can have some fun with that and really get people thinking about machine learning.

Oh yes, that's exactly what we're going to do, absolutely. We're going to give these poems to people and ask them, "Who do you think wrote this? Do you think it was a poet, or do you think it was artificial intelligence?" And one of my colleagues, when we were talking about this, immediately said, "Well, if AI writes poetry, is it interested in reading it?" And you think, oh, it's a complicated philosophical question which we probably don't want to get into. Does AI want to read poetry, or are we sort of force-feeding the algorithms with our data? If it can write it, does it want to read it? So you get the interesting conversations that you might not expect when you do this kind of work.

And so you can, I think, make really quite challenging research accessible if you really think about it and have got a little bit of resource behind it, and it's really good fun to do as well. So, bearing in mind that idea of making your research accessible to people, what we'd like you to spend the rest of this session doing is to come up with an elevator pitch for your research.
I don't know if you've heard about the elevator pitch; it's quite an American business sort of thing. But it's described as a short description of an idea, product or company that explains the concept in a way that any listener can understand in a short period of time. This description typically explains who the thing is for, what it does, why it's needed and how it will get done. So we'd like you to spend ten or fifteen minutes thinking about your own research, and if you were standing in front of, for example, a group of sixteen-to-eighteen-year-olds, how would you describe your research, why it's important and why they should be interested in it, and then present it to everybody in one to two minutes. We're allowing a long elevator ride, a couple of minutes. So have a think about the way that you would explain it to a group, not just of non-specialists, but of people in the last stage of secondary education. How would you get your project across to them, particularly why it should matter to them, why it should be interesting. So with no specialist language. We'll give you ten to fifteen minutes and then ask you to come up and present to everybody.



