Open Science and Collaborations in Digital Humanities Part 1

Video in TIB AV-Portal: Open Science and Collaborations in Digital Humanities Part 1

Formal Metadata

Title
Open Science and Collaborations in Digital Humanities Part 1
Part Number
1
Number of Parts
4
License
No Open Access License:
German copyright law applies. This film may be used for your own use but it may not be distributed via the internet or passed on to external parties.
Release Date
2019
Language
English
Production Year
2019
Production Place
Dubrovnik, Croatia
[Inaudible remarks as the session begins.]

Thank you very much, everybody. My name's Jane Winters, I'm Professor of Digital Humanities at the University of London, and I'm going to introduce us both, because I've got the microphone, for the recording. Marty here is technical lead for digital humanities, also at the University of London. Now you know who we are, but we don't know who you are, so it would be great if we could just go around the room: if you could introduce yourself and say where you're from, so that we can get a sense of who everybody is. We're going to get you to talk about your research later on today, so just your name and where you're from would be great. Start with Daniela, because I do know you.

[Round of introductions, largely inaudible.]

Great, welcome everybody. We're going to give you a crash course in digital humanities today. This session is mainly about quantitative and qualitative methods, but we'll start off just talking a little bit about what digital humanities is, because not even people who work in digital humanities can really agree about that.
We'll sort through some of the differences, then move on to qualitative and quantitative methods, and look at data context and provenance, which is something that's very important in digital humanities research, and then have a practical exercise using some of the historical data held by HathiTrust in the US. The later session will dive a little deeper into things like data formats, annotation, language structures and so on.

OK, so what is digital humanities? The best description of it I've been able to find is actually the one on Wikipedia, but you'll see from this that it is not short. It's a very capacious definition, and it includes pretty much whatever you want to include: if you're doing humanities research and it's got a digital element to it, arguably you are doing digital humanities, and I think that's why it's so difficult to define. It's basically an area of scholarly activity at the intersection of computing or digital technologies and the disciplines of the humanities, and it includes the systematic use of digital resources as well as the reflection on their application. I think that sentence is really important: it's not just using digital tools and methods, it's thinking about what that does for humanities research generally. How does it change the way that historians or linguists work, for example?
And this is just to show you that digital humanities is a pretty new concept. It used to be called humanities computing, and that goes back a lot further, but on this Ngram, which is run on the UK Web Archive, the two terms cross over only in 2009. So people were still talking about humanities computing a decade ago, and then suddenly digital humanities took over as a new way of thinking about doing digital research in the humanities.
Just to show you quite how much people argue about what digital humanities means, there is a website called whatisdigitalhumanities.com. It refreshes itself every time you reload the URL and comes up with a completely different definition, and there are hundreds of them in there, including the claim that digital humanities doesn't actually exist as a discipline. So that's the whole spectrum of people talking about what digital humanities is, and I'd encourage you to go and have a look at that so you can see the kinds of definitions people come up with. All of that's not very helpful when someone asks me to explain what it is that I do as a researcher.

So, just a little bit of history. I'm a historian, so that's always my starting point: to think about how we got here. The beginnings of humanities computing are usually described as springing from the work of the Jesuit priest Roberto Busa, who began to create an index of words in the writings of Thomas Aquinas in 1949 and had a whole team of punch card operators doing that work for him. That's why I haven't got a picture of Busa; I've got a picture of all the women who did the punch card work, who have been rather neglected in the histories of it. But actually it turns out that the literature professor Josephine Miles got there first, or just about, in terms of thinking about this work. So I think the takeaway for humanities people is that the story of one individual suddenly changing everything is never really the way to go: it's always a team effort, and there are more people involved than you might otherwise think.
So, some things that digital humanities might include. Most people won't do all of these; anyone working in the field will probably do three or four of them. Textual analysis is really at the heart of it: that's where it started, it's where most people still focus their attention, and thinking about how we analyse text is absolutely core for this group as well. Also relevant are new media studies and multimedia, which are increasingly becoming something people work with. Then spatial analysis, which is a very big area: on pretty much any digital humanities project, at some point a member of the team will have said, can we have a map for this, how can we put this on a map? Creating digital materials: thinking about digitisation itself, not just as a mechanical process that happens in a library, but as something that involves academic choices about the texts you include, the methods you use, and so on. A lot about scholarly communication using digital methods, and we'll talk a little bit about that when we're looking at engagement later on today. Using digital tools in teaching is something humanities researchers are really interested in: exploring how that changes the way people learn and the things you can do in the classroom. And digital ethnography: that's not just studying Twitter data, for example, but talking to people about what they think they're doing when they use social media, observing how they do that, and asking what that means for us as researchers looking at that data, what you have to take account of. Increasingly we also have to look at extended reality, VR and AR and so on, and how researchers can analyse that in any kind of quantitative way, and how we can preserve it for future analysis.
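Textual analysis, described above as the heart of the field, can start as simply as counting words in a corpus. A minimal sketch (the tokeniser and sample sentence here are purely illustrative, not part of any project mentioned in the talk):

```python
from collections import Counter
import re

def word_frequencies(text, top_n=5):
    """Naively tokenise a text and return the most common words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens).most_common(top_n)

sample = ("It was the best of times, it was the worst of times, "
          "it was the age of wisdom, it was the age of foolishness.")
print(word_frequencies(sample, top_n=3))
```

Real projects would add stopword removal, lemmatisation and much larger corpora, but the underlying move, converting text into countable units, is the same.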
Just to give you an example of a digital humanities project or two: they range from the very large to the very small; again, it's a really capacious definition of what falls under DH. This is a fantastic project called the Digital Panopticon, tracing London convicts in Britain and Australia. It brings together a whole range of historical sources to look at what happened to people who were sentenced to transportation or hanging or another form of punishment in Britain, predominantly in the eighteenth and nineteenth centuries, and their journey through the criminal justice system. That was not possible to do without bringing together these different sources, because the records of their arrest, their court proceedings, and what happened to them afterwards are held in libraries and archives across the world, predominantly in Britain and Australia but in other places too. You could get bits of the picture, but you couldn't see the whole journey, and that's what this project has done: it has visualised the results and found some really exciting new things that we didn't realise before.

So this represents what happened to people who were sentenced to death, and it turns out that almost none of them were executed. If you'd stopped with the court records, you would have thought, wow, this is terrible, executing so many people; but predominantly they were being transported, or had a shorter sentence, or in some cases were even let off altogether. That's really at the heart of digital humanities: bringing together different sources and analysing them to come up with new findings that we wouldn't have known about otherwise.
This is another big project, which has just started at the British Library in the UK, called Living with Machines. The aim of this project is to use all of the digitised newspapers that have been created to investigate people's relationships with machines in the Industrial Revolution: transport, factories and so on. How were these things described? Did people give any agency to the machines? How did they talk about the effect industrialisation was having on their lives? And it's doing that at really large scale, using a mixture of data scientists, computer scientists and humanities researchers, so it's the big collaborative project that tends to characterise DH. At the other end of the spectrum you have something like this site, produced by the Museum of London in the UK for the anniversary of the Great Fire of London, which burned large parts of the city. This is aimed at children: it allows them to reconstruct the Great Fire using Minecraft, there's a game, and there are teacher resources. So this is about teaching and public engagement, and using digital methods for that. You've got really high-level research, but you've also got this: using digital tools to get people interested in research and telling them something they don't already know. It's great fun, with brilliant sound effects, and you can reconstruct how the fire spread, that kind of thing.
So that's just a brief overview; we'll try to put a little more detail on it, but pretty much everything falls under DH if you want it to. I'll pass to Marty to talk a little bit more about qualitative and quantitative methods.

So, a lot of those projects are the end result of a development process or research programme, and a lot of design iterations go into that. Agile development is all about iterations, research is all about iterations, and the graphic design process is about going around and around in circles.
I just want to quickly go through a graphic designer's video description of how he designs a logo, working with his material, because I think this is a qualitative method as used in the graphic design industry. He starts by sketching in a sketchbook: lots of little diagrams of the logos he's conceiving in his mind, then variations on those. He then transcribes those into vector form and starts to put more definition into them. He goes through more versions and more iterations in the graphic design software, so what he did on paper he's now doing on the computer, and he starts to get more concrete, more defined, more final with his design. He keeps replicating little objects, keeps iterating on the design, realises he's going down the wrong path, goes back to his original sketches, which is back to a different medium he used in the past, to inform his next version on the computer. So he's going back and forth between different mediums, he's created multiple versions of logos, and he's referenced old versions and brought them back into his final design. This is a very fluid, very dynamic, very undefined process: you figure it out as you go. I like the look of that, this is working, this isn't working. It's a very qualitative method; it's difficult to prescribe, but it's very intuitive and very exploratory.

On the other side, research projects tend to be presented as very linear: you pose questions, produce designs, collect the data, perform analysis, and then present your results.
And I've noticed at conferences that a lot of what's presented is exactly that: a very linear explanation of the entire process you went through previously. But that is not really what you did; it's how you present what you did, and so you're presenting it in a linear fashion. TrustRank is an example: if you take a look at it, they basically took the PageRank algorithm and informed it with a seed of trusted sources, to come up with a different metric for trusting and ranking websites and documents. But again, it's presented as a five-point linear explanation of the research project, which distils it down to something that it really wasn't; there was a lot more to the research behind the scenes. So a lot of digital humanities is about explaining your methods as you go and documenting your methods as you go, not just documenting your outputs.

I'll just go through quickly the definitions of quantitative and qualitative; you probably have a sense of what they are already. Quantitative is about the phenomenon, about the facts: it's about converting the phenomena, the observables you're interested in, into data, and it's about hypotheses, confirming your hypothesis, making deductions, et cetera. The qualitative process is really about the experience as a source of truth. So there's a distinction here between facts and truth: when we're looking for truth we're looking for the context, the social values, the way that people perceive and explain the situations around them; we're not measuring those people. It's about meaning, exploring that meaning and capturing it somehow. And these two methods you can bring together in what's called mixed methods, which combines them in rigorous ways.

So, quantitative is empirical: it's data driven. It's about converting phenomena to measurable objects, which is what you might call a unit of analysis, and you can define that in many different ways. It brings in statistical techniques once you've got more and more data to analyse, so you're dealing with numerical information and you get into probabilities. What you're doing is collecting data to answer specific questions about your hypothesis, so it's very narrow, and the method continues to narrow itself down. It has descriptive and experimental types: a descriptive study measures the subject once, while an experimental study measures the subjects, conducts an experiment, and then sees what the changes were afterwards. So the benefits of it are that it answers the what, the when and the where.
It gives statistically significant results, so it's a post-positivist epistemology. It's the lowest cost, because computing technology has automated it and it takes less time, but it is limited: it answers only the research question that you pose, so you are going to get one answer out of it, and it doesn't answer the why. The methods involved in capturing data like this are surveys, questionnaires, online polls and machine reading, which is a lot of what you will be working with: not just corpuses and files but databases and linguistic corpora, with statistical programming built on top of that.
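The TrustRank idea mentioned a moment ago, PageRank with its teleport probability concentrated on a seed of trusted pages, can be sketched in a few lines. This is a toy illustration, not the published algorithm: the graph, the seed choice and the parameters are all invented, and it assumes every page has at least one outgoing link.

```python
def trust_rank(links, seeds, damping=0.85, iterations=50):
    """Power-iterate a PageRank variant whose teleport vector is
    concentrated on trusted seed pages (the TrustRank idea).

    links: dict mapping each page to a list of pages it links to.
    seeds: set of pages assumed trustworthy a priori.
    """
    pages = sorted(links)
    # Teleport mass goes only to the trusted seeds, not uniformly.
    trust = {p: (1.0 / len(seeds) if p in seeds else 0.0) for p in pages}
    scores = dict(trust)
    for _ in range(iterations):
        scores = {
            p: (1 - damping) * trust[p]
               + damping * sum(scores[q] / len(links[q])
                               for q in pages if p in links[q])
            for p in pages
        }
    return scores

# Toy web graph: a trusted three-page cycle plus a self-linking spam page.
links = {"a": ["b"], "b": ["c"], "c": ["a"], "spam": ["spam"]}
scores = trust_rank(links, seeds={"a"})
print(sorted(scores, key=scores.get, reverse=True))  # spam ranks last
```

Because the spam page is reachable only from itself and receives no teleport mass, its score stays at zero, while the seeded cycle shares all the rank.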
Qualitative, on the other hand, is asking "is this right?", not "is this true?". So there's a pragmatic element to it: it's about the experience and the observed, which is about the subject, so it's a subjectivist epistemology. You're collecting subjective and thematic information, and it involves the coding of that information in order to interpret it, group it together and make sense of it. You're dealing constantly with a changing situational frame, so it's quite difficult to capture the same information a second time around. If you're thinking about interviews, for example: you could interview someone, then interview them again next week, and they're going to give you different answers to the same questions, so there's not really a fact there; it's opinion. These methods are phenomenological, capturing the personal experiences of the subject; ethnographic, looking at how people fit into a social or cultural context; and concerned with reception, the way people interpret information or recall it repeatedly. If someone had a traumatic experience, for example, and they recall that experience multiple times, they start to re-form that memory, so you need to be aware that the context of memory recall is changing what they're saying as they recall it. That's the sort of thing you don't have to worry about with quantitative measurements.
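The "coding" step described above, labelling pieces of subjective material and grouping them so themes can be interpreted together, can be sketched very simply. The excerpts and code labels below are invented for illustration only:

```python
from collections import defaultdict

# Invented interview excerpts, each hand-labelled with one or more codes.
coded_excerpts = [
    ("I never trust the scanned version", ["trust", "digitisation"]),
    ("The archive visit felt like detective work", ["experience"]),
    ("OCR errors make me re-check everything", ["trust", "digitisation"]),
]

# Group excerpts by code so each theme can be read and interpreted together.
themes = defaultdict(list)
for quote, codes in coded_excerpts:
    for code in codes:
        themes[code].append(quote)

print({code: len(quotes) for code, quotes in themes.items()})
```

In practice the labelling itself is the slow, interpretive, human part; software only helps with the bookkeeping once codes have been assigned.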
So this is really providing the why, and it's about gathering personal information; we'll talk this afternoon about anonymisation and the GDPR requirements for those sorts of things. It deals with the opinions of people, it takes a lot more time to capture data, as I mentioned, and it doesn't yield statistical measures of empirical significance, so it's not probabilistic, and it's much more expensive to conduct. The ways it's conducted are really interviews; focus groups; case studies, which are reports from having conducted interviews and focus groups; ethnographic research, which Jane mentioned earlier; discourse analysis, which is looking across a corpus of literature and interpreting the themes and the modes of argumentation written down in that corpus; observing people; and looking through secondary data, such as diaries or written accounts of the past. A lot of those are handwritten, so there's already the technical challenge of getting them into some kind of computational text before you can do anything with them, so really a lot of this kind of research can just be reading these texts and looking at them.

[Audience question, inaudible.]

OK, I guess that would be the difference between a survey and a poll: if you're polling, you'll be asking for more of a one-to-five measure on something, whereas in a survey you'd be asking for a descriptive response to the questions you ask.

Mixed methods is about combining both of these, and it's explanation driven. There's a term here called abduction: so there's induction, deduction and abduction, and abduction is hypothesising an imagined result and then reasoning backwards from it, which runs in the opposite direction from induction and deduction as analytical methods.
Using both of these gives you the benefits of the statistical analysis alongside the benefits of the coding and thematic interpretation, which takes time but is a very interesting process: to label, relabel, classify and taxonomise your content. You can converge the two studies, and you can do that in really different ways. You could run the qualitative and quantitative methods simultaneously and bounce what you find from one off the other. Or you could do them sequentially, which is the most common way: you could do a qualitative study first, and what you find from that could inform the questions you explore in your quantitative experiment; or you could go the other way around, starting more open-ended and broad, asking exploratory questions, and from what you gather forming more precise questions that you then go and measure with the quantitative approach. This adds to costs, because you're combining the automated quantitative measure with the more expensive, time-consuming qualitative approach, but you get to bounce off each approach: you'll discover things that you would have cut yourself off from by narrowing the research question with a single approach on its own. It really gives you more room for exploration, to conduct that design process I showed you before.
And this is a nice little diagram that summarises them quite neatly; really it's the one in the middle, using mixed methods, that's most useful where possible. Now I'll hand back to Jane to talk about context.
Thanks. Right, this is a debate that I think most humanities researchers really like to have: how do we create our own data? A lot of humanities researchers will say that they don't have research data, they have text; but once you arrange or structure it in some way, you have started to turn it into a dataset. That might be a transcription of a manuscript, or a database of names and places extracted from other material. Most of the data that humanities people work with is derived from primary sources: newspapers, books, letters, social media sites; it's taken from somewhere else, and often from multiple somewhere elses. Most historians like me rarely work with a single source of data. You'll be bringing together multiple sources and trying to interrogate across them, and they might be held by different libraries, archives or commercial organisations, come from different periods, and be in different formats. Some of it might be digital-ready; some of it, as Marty said, you might have to digitise yourself, either from handwritten text or from printed text. So getting to that data preparation stage takes quite some time. And, as you might guess, when you're combining digitised and born-digital material, knowing how each was produced is really important in helping you do that effectively; otherwise you flatten out too many of the distinctions between the data types, and you end up coming to conclusions that are probably not going to be right. We're going to talk about open publication and licences later, but this is a particular challenge when you don't own your own data, and often humanists will not have the rights to publish their datasets, because they've been collected from different people and carry all sorts of different copyright restrictions. To give an example, art historians struggle with this a lot: they can publish text, but it's far too expensive to publish the images they've been working with, so that's a real challenge in working openly when you're working in digital humanities.
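The source-combining step described above, mapping records from differently structured exports onto one shared schema before any analysis, can be sketched as follows. Every record, field name and format here is invented for illustration; real archival exports are far messier.

```python
import csv
import io
import json

# Two hypothetical exports of similar material: a CSV from one archive
# and a JSON list from another, with different field names for the
# same underlying concepts.
csv_export = "name,year\nOld Bailey proceedings,1801\nConvict register,1820\n"
json_export = '[{"title": "Shipping list", "date": "1815"}]'

def normalise(record, title_key, year_key):
    """Map one record onto a single shared schema."""
    return {"title": record[title_key], "year": int(record[year_key])}

records = [normalise(r, "name", "year")
           for r in csv.DictReader(io.StringIO(csv_export))]
records += [normalise(r, "title", "date") for r in json.loads(json_export)]
records.sort(key=lambda r: r["year"])
print([r["title"] for r in records])
```

The point of the sketch is that the normalisation step is where the interpretive decisions live: which fields are "the same", how dates are reconciled, and what distinctions get flattened out.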
I think the fundamental principle of humanities research is that you can't interpret your data properly if you don't know where it came from, who produced it, how it was produced and why it was produced. Just getting a dataset that you don't know anything about is of limited value, and the importance of those different elements will vary depending on the kind of research questions you want to ask. For example, you may very well not be able to know why something was produced, because that's rather a subjective piece of information, particularly the further back in time you go; but if you've got an idea of who it was, how it was done and where it came from, then that can really help you start to interpret the material you're working with.
Actually, on the "why", there is a huge amount of digital humanities research into people's motivations, particularly work with Wikipedia: the motivations of people who get involved in creating that data are a really interesting area of study in their own right.
Provenance is quite similar to context, in the sense of "where does it come from", but in an archival context, working with archives digital and otherwise, it has a very specific meaning. This quotation is from the Society of American Archivists: provenance is "a fundamental principle of archives, referring to the individual, family, or organization that created or received the items in a collection". So: who produced this, where did it come from, and where has it ended up? And there's an important corollary: the principle of provenance dictates that records of different origins be kept separate to preserve their context. Archivists work explicitly to keep separate the data that we as researchers then want to put back together again, and they may have reorganised it according to archival principles to reflect its provenance. That's not very helpful for researchers who want to interpret it in different ways and along different axes. All of this is going on behind the data before you can even start to work with it.
This is a really nice quotation, I think, if we want to define what we mean by context: it "pertains simultaneously to physical arrangements, social relationships, situational definitions, temporal moments and distinct locales". Where did it happen? Who were the people involved, and how did they relate to each other? What was the environment they were working in, and what were the tools they had available to them? When did it happen, and over how long a period of time, and would that have changed the way they were working? As Marty said, if you come back to people who are recollecting things, time changes that. A lot of effort is spent theorising about this, and we don't always know the answers, but having an awareness that these are things you might want to consider is very important to digital humanities. And that context is liable, as social scientists describe it, to collapse, particularly in relation to born-digital data. The idea of context collapse is that people, information and norms from one context seep into another, so you can't really tell whether the way you're reading something is the way it was intended to be read when it was produced, because you've just got the data without its context. The specific term "context collapse" was coined by danah boyd in the early 2000s, but the problem of missing context and provenance runs right through digital humanities research.
Again, coming back to the Digital Panopticon project: they have a very large section on the website which describes exactly what records they have been using, how they were digitised, where they've come from and when they were produced, because that's the sort of starting point a digital humanities researcher is going to need.
I don't just want to use the search engine; I want to know what's in there, who decided what would be included, and where it's come from. So, just for the criminal registers from 1791 to 1892, you get the origins and contents, but also the strengths and limitations: what you can and can't find here.
And how it was digitised: again, the method of digitisation is something they document in detail.
That context is not very often visible, actually; most of the big digital projects go out of their way to hide it, because they want to give you a nice search experience. They want you to have a Google-like search box, and you'll get an answer that is meaningful, but it may not be representative of what's in there, and it may be hiding some of the problems with the underlying data. This is the archive of British newspapers at the British Library, and my immediate problem with it is this: they're talking about newspapers, which are physical objects, lots of different things bound together, received and read in a particular way, yet they're talking about pages. They say they've got twenty-five million pages, but that's not what the original newspaper was like; that's already a decision to break it down to a page-level item, which changes the way you think about and use that material. Dig a little deeper and they say they've got nearly four thousand issues
within a year range, and so on, for one particular title, but it's still changing that physical object and obscuring how it would have been used, and that's something that I, as a researcher, want to be aware of. Then there's the digitisation itself and the things you need to know about it. This example is from the Internet Archive, a suffragette sketch of modern life from the late nineteenth or early twentieth century. That's what somebody looking at it sees, and you can read the page image perfectly. But when you click through to the underlying OCR, the name of the author and the title, large parts of it, go awry. You can't find things easily, and if you do any kind of analysis of frequently occurring terms, that's all going to be wrong. You might find five instances of "suffrage" in here when there are maybe fifty, and you just can't find the rest. When the workings of the digitisation are hidden, you really get misleading results from this kind of data. It's to the Internet Archive's credit that they do let you download the OCR output, so you can see what the problems are and change the research questions you're asking as a result. And that brings me on to HathiTrust, which is one of the big projects in the US; you may know about it as a kind of alternative to Google Books in some ways.
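To make the OCR-undercounting point concrete, here is a small Python sketch; the garbled tokens and the similarity threshold are invented for illustration, not taken from any real corpus.

```python
from difflib import SequenceMatcher

# Toy OCR output: "suffrage" occurs four times, but three of the
# occurrences have been garbled in typical OCR fashion.
ocr_text = "the suffrage question ... snffrage ... sufirage ... suftrage"
tokens = ocr_text.split()

def is_near_match(token, target, threshold=0.8):
    """Count a token as a hit if it is sufficiently similar to the target."""
    return SequenceMatcher(None, token, target).ratio() >= threshold

exact = sum(1 for t in tokens if t == "suffrage")
fuzzy = sum(1 for t in tokens if is_near_match(t, "suffrage"))

print(exact, fuzzy)  # exact matching finds only 1 of the 4 occurrences
```

Fuzzy matching recovers the garbled hits here, but in practice it also introduces false positives, which is exactly why you need to know how the text was digitised before trusting any frequency counts.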
So: Google digitised a lot of material for the Google Books corpus, scanning millions of books and running them through OCR, and then made the Ngram Viewer, which is what most people will be familiar with, the word-frequency viewer for the Google Books corpus. But all of the scanned material and all of the OCR material was also handed off to HathiTrust, which is the custodian of this digital collection, and what they've been doing with it is providing computational infrastructure, APIs and extracted word features, so that people can conduct literary analysis and learn how to use extracted word features from a really large collection of literature. Part of my reason for explaining this is to give you hands-on experience of the consequences of breaking books down into pages, as Jane said, and then thinking through what that delineation of words means, and trying to look for interpretations across the corpus. HathiTrust, then, is the custodian of the Google Books scans: it's a big corpus. I don't think the metadata covers the nineteenth century especially well, but it does contain nineteenth-century literature. As I said, they provide data-analysis infrastructure, and the HathiTrust Research Center has produced what they call the Extracted Features data set, which is for "non-consumptive use". What they've done is take all of the OCR text from all of the books, collect the metadata for the books, decompose the OCR sentences on each of the pages, and produce word frequencies for each page of each book. So there's an index for every page, there's bibliographic metadata for every book, and there's a Bookworm ngram viewer to slice, dice and interrogate it, so there are lots of ways of using this. For the extracted features they have Python and R libraries, and API
access to download the extracted features for a specific book, or for collections of books. You have to burrow around on their website a little bit, but you can construct a collection, say of a genre you might be interested in, and get all of the IDs of the
books and request and download their extracted features. Some of the books are in copyright and some are out of copyright, so working with HathiTrust to get past those sorts of data-acquisition problems is something I think you'll become more and more familiar with, and not just with HathiTrust but whenever you're acquiring your own data. Like I said, they've got bibliographic metadata, so good things like the title, the publication date, the language, the imprint, the rights attribution, the page counts, and the physical page sizes of the books that were scanned. The data is provided as JSON. We're not going to ask you to work directly with the JSON; I just want to give you a little explanation of the structure behind it, so you get an idea of the provenance of the data you're going to work with, at a very high level. As you can see, there is the volume ID at the top.
Then there's the metadata subsection. There's the schema version they're using, which is not really going to be that useful for literary analysis; the date these files were created; the rights attribution, which is their own code; and the source institution, which is mostly American universities, so there's a huge proportion of English-language material in this corpus, but there is enough metadata to slice down to individual sub-languages and sub-topics, for instance if you need to become more familiar with the domain of the content you're looking at. They have part-of-speech tags, and they have begin- and end-character counts for each of the lines. So this is quite rich metadata describing the OCR'd text of these pages, page by page and volume by volume, and there's a lot of quantitative analysis that could be done on top of it. What the "non-consumptive use" framing does is get them past copyright: they're not releasing the full sentences, so you can't get interesting context out of this, but you can get word frequencies. The reason they have been allowed to release these data sets at all is that you can't recreate the original copyrighted material from them, which kind of sucks, but it does still allow you to do quite a lot of things. So, for example, here is a page, sequence 33, that has 273 tokens on 36 lines.
There's nothing in the page header for this one: they've actually marked up whether a line is a header or a footer, or whether it sits in a column at the side, so with this metadata you can start to build up a profile of the kinds of pages, if you want to do an analysis across the corpus. The body text of the page again contains the 273 tokens. So this is a kind of deconstruction of the typographical form of the page.
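As a rough illustration of that structure, the sketch below parses a simplified, invented JSON fragment in the spirit of the Extracted Features schema (the real schema has more fields and somewhat different names) and collapses the per-page, per-part-of-speech counts into plain word frequencies.

```python
import json
from collections import Counter

# Hypothetical fragment: per-page token counts keyed by part-of-speech
# tag, with no running text, in the spirit of "non-consumptive" data.
volume_json = """
{
  "id": "example.0001",
  "metadata": {"title": "An Example Volume", "pubDate": "1859", "language": "eng"},
  "pages": [
    {"seq": 33, "tokenCount": 273,
     "body": {"tokenPosCount": {"nature": {"NN": 5}, "species": {"NN": 3}}}},
    {"seq": 34, "tokenCount": 251,
     "body": {"tokenPosCount": {"nature": {"NN": 2}, "selection": {"NN": 4}}}}
  ]
}
"""
volume = json.loads(volume_json)

# Collapse the part-of-speech breakdown into one Counter per page.
page_counts = {}
for page in volume["pages"]:
    counts = Counter()
    for token, pos_counts in page["body"]["tokenPosCount"].items():
        counts[token] += sum(pos_counts.values())
    page_counts[page["seq"]] = counts

print(page_counts[33]["nature"])                       # 5
print(sum(c["nature"] for c in page_counts.values()))  # 7
```

Even without any sentences, these per-page counts are enough for frequency trends, distant reading and topic modelling, which is exactly the trade-off the non-consumptive licence is built around.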
This can be used for things like distant reading, word similarity and topic models, because you've got the raw words and their frequencies, as well as emotion analysis and visual structure. For example, there are five copies of On the Origin of Species in this collection, and someone who analysed them noticed two trends: books from 1929 onwards are on average about a quarter of an inch taller and an eighth of an inch wider than the books published before, so the page sizes changed as printing technologies evolved, and the font size increased to fill that space up. [Audience question.] Yes, you do have the words on the pages, so you could count them, just without context. That's right: this corpus was built to get around some of the copyright restrictions on releasing the full text. So yes, it's useful to know the provenance here too, the reasons they've made this available and the restrictions under which they were required to make it available, and people are already identifying uses it can be put to. Someone has done topic modelling; topic modelling is always a little opaque, and we don't always know whether the topics can be interpreted, but they have plotted the topics over the course of the volume. You can see the top topic, pertaining to "custom house office survey official general", appearing perhaps in the first chapter of the volume but then nowhere in the rest. These kinds of diachronic presentations of topics across a text can give you a sense of what the text is about and how its topics change over its course, although it's difficult to interpret. The one at the bottom seems pretty persistent the entire way through; it's about "nature, life, character, mind, change, states". This is The Scarlet Letter.
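A minimal sketch of how such a diachronic topic profile can be computed from per-page word counts. The page counts are invented, and the two hand-picked word sets stand in for topics that a real model such as LDA would learn; only the plotting-friendly arithmetic is the point here.

```python
from collections import Counter

# Invented per-page word counts for a short three-page "volume".
pages = [
    Counter({"custom": 6, "house": 5, "survey": 3, "nature": 1}),
    Counter({"nature": 4, "life": 3, "character": 2}),
    Counter({"nature": 5, "mind": 2, "change": 2, "life": 1}),
]

# Hand-picked word sets standing in for learned topics.
topics = {
    "custom-house": {"custom", "house", "survey", "official"},
    "nature-life": {"nature", "life", "character", "mind", "change"},
}

def topic_share(page, words):
    """Fraction of a page's tokens belonging to the topic's word set."""
    return sum(page[w] for w in words) / sum(page.values())

profiles = {name: [round(topic_share(p, words), 2) for p in pages]
            for name, words in topics.items()}
print(profiles)
```

Plotting each list against page sequence gives exactly the kind of picture described above: one topic concentrated in the opening pages, another persistent throughout.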
So here is the practical. We'd like you to form into groups, maybe three groups if that makes sense, and load up this website, the Bookworm, which is the ngram-style viewer across this HathiTrust Extracted Features data set.
It can be a little bit slow, so just take your time loading it up. Once you've formed into groups, what we'd like you to do is find three to five queries that relate to your project. Discuss as a group what you would like to focus on, then for those three to five queries use the filters to narrow down into sub-collections and sub-languages, and look for queries with multiple keywords that let you explore how a word's use has changed over time. Does that make sense? [Audience discussion. In answer to a question about doing this at scale: at that scale, not really; it starts to become difficult, but you can do it with individual texts, where you literally have to read them.]
[The remainder of the discussion is inaudible.]
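The kind of trend query the Bookworm practical asks for can be sketched offline as hits normalised per million tokens per year; all the numbers below are invented for illustration.

```python
# Hypothetical corpus counts: total tokens published per year, and raw
# occurrences of each keyword in that year (invented numbers).
yearly_totals = {1880: 1_000_000, 1900: 2_000_000, 1920: 4_000_000}
yearly_hits = {
    "suffrage": {1880: 120, 1900: 600, 1920: 2400},
    "telegraph": {1880: 900, 1900: 1400, 1920: 1600},
}

def relative_frequency(word):
    """Hits per million tokens: the usual ngram-viewer normalisation."""
    return {year: yearly_hits[word][year] / yearly_totals[year] * 1_000_000
            for year in yearly_totals}

for word in yearly_hits:
    print(word, relative_frequency(word))
```

Normalisation is the whole point: "telegraph" rises in raw counts here but falls relative to the growing corpus, which is why comparing raw hit counts across years misleads.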