Open Science and Collaborations in Digital Humanities Part 2
Formal Metadata
Part Number: 2
Number of Parts: 4
License: No Open Access License: German copyright law applies. This film may be used for your own use but it may not be distributed via the internet or passed on to external parties.
Identifiers: 10.5446/46154 (DOI)
Production Year: 2019
Production Place: Dubrovnik, Croatia
Transcript: English (auto-generated)
00:05
Okay, so welcome back, we're going to get into the next session now, yeah, again on qualitative and quantitative methods but more about data formats, annotations and units, language structures and then frames of meaning and there'll be a
00:21
practical at the end again similar to the previous one but just building on that about meaning change in words. So I want to kind of go through layers of abstraction, hopefully it'll make sense once I get towards the end. The ladder of abstraction is really, you're working with data but it has to be stored somewhere and ultimately you write applications that consume it and so the
00:45
ladder of abstraction is really, we don't use hard disks quite so much anymore, but these are the technologies and the formats and the standards that have come before. Does anybody know what this is on the slide? Yeah, this is a core memory, so this is about 1 kilobyte of memory stored in about the
01:04
1950s, that's how they stored it, put electrical impulses through each of the arrays on the side and it would polarize a magnet that was wrapped around the intersection of the cables and that's how you stored a bit. Yeah, so we're kind of doing that now with natural language processing but just not in physical hardware, well almost in physical hardware and the
01:23
whole point of this is to recognize that obsolete power corrupts obsoletely, so at every layer of this abstraction that I'm going to go through now, you've got standards and formats and technologies changing all the way and the impact of these changes is, well what is the impact of these changes on
01:40
meaning? Who knows? Ted Nelson's my hero. So the next level up is the encoding layer, a lot of you will be familiar with these theories, we kind of started with Morse code which is really dots and dashes and signals and sounds being transmitted over single wires, we've built that up quite substantially these days. The next layer up from that is formats, file
02:03
formats, I'm sure you're all familiar with our standard file formats, yeah, but we don't really consider these as meaningful changes, we just use them as a utility, we don't consider what the impact of the format changes are on the meaning of the text that we use. Building on formats, we then get into
02:22
schemas and ontologies and I know you've done some sessions on the semantic web and the knowledge graph and things like this, this is a simple relational database, but the principles are the same, you're using formats and then building more structures, more meaningful structures on top of that and ultimately we end up with applications, this is what we do all of these layers of representation for and when you're working with text
02:44
you're really kind of working at the encoding, formats and schemas layers, yeah, is everybody familiar with these technologies, these formats? Yeah, again they always change, they'll always change and the best way to guard against that change is to recognize that the change is always going to happen,
03:04
standards help as well. Originally it was the US ASCII code chart, I'm not sure if anybody's seen the 1972 version but it was very constrained, yeah, it didn't allow very many non-English characters, that was extended obviously to extended ASCII, and extended again into Unicode which
03:27
is what we all work with today, I'm not sure if the data that you end up processing is Unicode proper when you work with it or you have to go through transformation processes, they will affect character changes and interpretations. On top of that though, there's a code point in the Unicode
03:43
standard, 9731, which is the little glyph for the snowman, and on top of that we've got font changes with operating systems as well and I'm not really sure what the impact of this is on meaning change, so I think I rendered the same code point on the Mac and took a screenshot over here, this might be the Mac, the Mac
04:01
tends to have more colors on it, I can't remember which operating system that was as well but I had no choice over the typography or the iconography that was being used for the representation of this code point and for me it still looks like a snowman but if this is a more meaningful glyph, if it's a character for example and it's being presented in a slightly different font format and you don't have control over that at the
04:23
application layer, then I'm not really sure how to measure the meaning change of that, so this is another layer of abstraction for consideration and with some of the Twitter work that we've been doing, we were extracting multibyte emoji and kind of emailing those back and forth amongst the project group and we'd extract it over here, you can see there's two skin toned
04:44
thumbs ups, yep, when I copied this string that I had extracted using Python from UTF-8 strings and I pasted it into my Mac mail application and I emailed it to Jane who opened it up in the Microsoft Word application and replied to me, Microsoft Word interpreted the bit level coding and stripped out
05:06
the multibyte UTF emoji codes and basically stripped out these little squares on the side which are the skin coloration characters, now that's just a bug in a piece of core software in Microsoft's Outlook software that was
05:22
probably written in the 80s to some old text standard, but these days you could probably interpret that as a form of racism if it's stripping skin color off your multibyte encodings, and so this is where it's kind of a slippery territory between bugs and interpretation of meaning and the social contexts that are implied from those.
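A minimal Python sketch of the encodings being described here; the stripping step is a hypothetical reconstruction of the bug, not Outlook's actual code:

```python
import unicodedata

snowman = chr(9731)                      # code point 9731 = U+2603 SNOWMAN
print(snowman, hex(ord(snowman)))        # ☃ 0x2603

# A skin-toned thumbs up is two code points: the base emoji plus a
# modifier in the range U+1F3FB..U+1F3FF.
thumbs_up = "\U0001F44D\U0001F3FD"
for ch in thumbs_up:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# Naively dropping the modifier range, as the buggy client apparently did,
# silently removes the skin tone and changes the social meaning.
stripped = "".join(ch for ch in thumbs_up if not 0x1F3FB <= ord(ch) <= 0x1F3FF)
print(stripped)                          # plain thumbs up again
```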
05:43
Going up again to the next level, into the formats: everybody is probably quite familiar with XML, yeah. This is not the same document, but the JavaScript Object Notation here is a serialization of the same data. Serialization is not really considered a meaningful change, it's more just a structural change. I'm not really sure if
06:05
serializations have an impact on meaning change does anyone have any examples where that has happened no I think the most meaningful change that a serialization change will have is on the skills and the tools that you
06:21
need to learn in order to work with the text to work with the data so it's not so much a meaning change of the content itself but a meaningful change in that you need to learn a whole new set of tools and applications in order to work with the data. TSV tab separated formats again there's lots of libraries
06:41
and scripts that allow you to work with these, and this is a simpler data structure, it's a tabular data structure, you could use commas instead of tabs for example to split things out, but again you need to learn a whole new set of tools and techniques and libraries to load these up and iterate through the structures and split them. So it's not meaning change, but it's a meaningful change in the methods and tools and techniques that you need to use to work with this data.
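As a small illustration of that point, the "same" record serialized two ways; what changes is the tooling, not the content (the record itself is invented):

```python
import csv
import io
import json

json_text = '{"title": "Ulysses", "year": "1922"}'
tsv_text = "title\tyear\nUlysses\t1922\n"

record_a = json.loads(json_text)                                        # JSON tooling
record_b = next(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))  # TSV tooling

print(record_a == record_b)   # True: identical content, different skills needed
```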
07:03
And RDF, well, RDF is not really a format, it's more of a schema. Standards and schemas help us
07:21
learn a specific tool and technique and software library to work more rapidly with additional data of those formats so we're all familiar with the HTML standard we're able to work with web pages quite rapidly and if you're working with web archives being familiar with the tags available in the HTML is one way of extracting data what you're not going to be quite so
07:44
familiar with is how a specific web page used that tag, which is I guess a context to consider. RDF, again, you've probably been exposed to RDF and the knowledge graph, it's kind of difficult to get your head around for most people
08:03
that aren't computational in their thinking so yeah anyway I guess who works with triples in turtle format no who knows what triples in turtle
08:24
format are yeah okay sorry yeah but if a serialization is representing the same
08:45
information and it's more difficult to utilize, then you're kind of siloing off your data based on its serialization.
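For anyone who hasn't met triples, a minimal Turtle sketch using the rdflib Python library (pip install rdflib); the subject URI is a made-up placeholder:

```python
from rdflib import Graph

ttl = """
@prefix dc: <http://purl.org/dc/elements/1.1/> .
<http://example.org/book/1> dc:title "A Portrait of the Artist as a Young Man" .
"""

g = Graph()
g.parse(data=ttl, format="turtle")
for subject, predicate, obj in g:
    print(subject, predicate, obj)   # one triple: subject, predicate, object
```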
09:00
These are also not human readable. RDF is not human readable; this is RDF/XML, it's very verbose, there's a lot of attributes in the XML that link it through to the ontologies that each of the field values are supposed to represent. It's not human readable. What they've done in digital humanities is create the Text Encoding Initiative, which was much more focused on lean markup, much more
09:22
focused on the content rather than machine verbosity, if that makes sense. And so for example the TEI spec has a section for prose tags and you can see the tags used in the XML here are very lean, there's only two characters to represent a clause, an s tag to represent a sentence, so it's kind of
09:43
very lean tags used for syntax markup so you can get quite into the text without being overwhelmed by the metadata. There's another kind of sub-module in the TEI standard for marking up verses, so there's annotations
10:01
here for the type of verse, so there's a sonnet or a quatrain or a tercet for different types of verse, and an l tag for each line of text. So what I'm getting at here is that RDF is used for machines, it's very verbose and it's not really human readable, but if you're writing scripts that will link things together then it makes a lot of sense because you're attaching identifiers to ontologies to relate pieces of data.
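A small sketch of the lean verse markup just described, parsed with Python's standard library; the two-line fragment and its tagging are our own example, not taken from the TEI guidelines:

```python
import xml.etree.ElementTree as ET

# lg = line group (here typed as a quatrain), l = a single verse line.
tei = """<lg type="quatrain">
  <l>Shall I compare thee to a summer's day?</l>
  <l>Thou art more lovely and more temperate:</l>
</lg>"""

for line in ET.fromstring(tei).findall("l"):
    print(line.text)
```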
10:23
TEI was much more about the kind of interpretation and the markup of the content as well, yeah, and
10:41
what happened I guess with TEI over the years is that it was an XML based standard but it expanded to include all sorts of modules for all sorts of specialist literary interpretations, and it's mostly English based features, so simple analytic mechanisms, or there's some linking and segmentation tags that can be used, and each of these modules within the TEI
11:04
kind of work a little bit like ontologies do in that you borrow tags from one module and the other and you put them together into the markup of the document that's interesting for you at the time it's also been used I guess
11:22
for spatial arrangements as well and some tools have been built on top of it. This is for I guess genetic markup of texts where you can annotate annotations, yeah, so you can see the boxes over the side here are actually marking up with TEI tags a handwritten correction and then a
11:43
correction to the handwritten correction and they contain the coordinates of the position of these marks on the page which is useful for some kind of analysis mostly it's the benefit of the researcher going through the thinking process of the author who was handwriting these things and
12:01
correcting their document as part of the kind of revisionary process why corrections said certain things why they were marked in certain positions so it's more difficult to see how markup like this could be used computationally but it's very useful for the thinking and the qualitative process to go
12:20
through the interpretation of the document when you've got very specific tags and standards to help constrain the way that you're interpreting and digitizing texts
12:42
the third person that corrected this, so you've got the author, you might have the first annotator and the second annotator and this helps you to pull apart the different contributions and what stage in the process they were made and who was making them which is I guess a deconstruction of the group effort that was involved in
13:02
the production of texts so with multi-modular standards with ontologies that you can borrow bits and pieces from which forms of annotation do you pick and choose and which units do you then pick to mark up and to annotate so I'm going to try to go into a little bit more linguistics now
13:24
we'll talk about measurement units, formats and frames and some of the literary techniques that have been used in digital literature analysis. Matthew Jockers has recently released a book and this is his plot of the fluctuation in sentiment across the plot trajectory of Joyce's A Portrait of the Artist as a Young Man
13:47
and I think this is simple sentiment analysis, it's just doing token matching, it's not doing any grammar parsing but you can see that a graph like this is not very good, I mean it's measured but it's difficult to interpret
14:01
what he's looking at here is called syuzhet, the linear progression of a narrative, which helps to understand the manner in which the author presents events to the reader, so this is over the course of the story arc, and he was speaking I think to somebody from CERN who suggested that he perform a Fourier transformation on that data which transforms it into a plot like this which looks like it's more meaningful
14:27
it gives you a sense that there's some positive things happening at the start there's a disaster in the middle of the story and then everything ends up happy at the end what a Fourier transformation also does is bring everything for multiple volumes into the same kind of narrative time
14:42
so from 0 to 100, so regardless of the length of the volume, which the previous diagram shows you, this is like 5,000 pages, this is kind of bringing it into a percentile form so that you can then start to compare the narratives across multiple volumes. So what this has done is transform the measurement and format it for comparison at a higher level of interpretation.
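A rough sketch of that kind of Fourier smoothing and percentile resampling (not Jockers' actual syuzhet code; the sentiment values here are random stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
sentiment = rng.standard_normal(5000)    # stand-in for per-sentence scores

spectrum = np.fft.rfft(sentiment)
spectrum[3:] = 0                         # low-pass: keep a few components
smooth = np.fft.irfft(spectrum, n=len(sentiment))

# Resample onto a 0-100 "narrative time" axis so volumes of any length
# can be compared on the same footing.
narrative_time = np.linspace(0, 100, 101)
resampled = np.interp(narrative_time,
                      np.linspace(0, 100, len(smooth)), smooth)
print(resampled.shape)                   # (101,)
```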
15:06
he did this in February 2015 and in April 2015 he was criticized for changing the meaning of the interpretation of the text and so what he's done here is then plot the Fourier transformation in green, the rolling mean in blue
15:24
and the original sentiment points in black behind the scenes, and you can see the Fourier transformation peaks here at about 70, but that peak is probably influenced by the narrative time around 80, 85, so it's actually shifted narrative time through this Fourier transformation
15:44
and while it might seem at a higher level more useful to compare multiple texts across with the Fourier transformed version of the literary narrative it's actually kind of shifting things back and forth in time when you do that and at a lower level of interpretation when you're reading a volume or reading a text more closely
16:03
perhaps it makes more sense in contextual terms to know that it happened towards the end of the volume rather than say the chapter before so then kind of units, I assume everybody's familiar with word vectors
16:24
yeah, so another way of chunking up texts, thinking about the layers of abstraction building on top of this, Word2Vec came out, I think about 2013, which used skipgrams, so building on top of individual word matrices
16:40
it's now kind of taking a word and looking for the contexts, the bigrams next to it, starting to build up another vector representation on top of that, and then taking that to the next extreme is BERT. Has everybody had a play with BERT? Yeah, BERT is a new architecture, Bidirectional Encoder Representations from Transformers
17:03
it's a transformer architecture that is essentially building across the n-gram, skipgram, bigram kind of idea in that, instead of just looking for word contexts, it drops out words in a sentence and then builds probabilistic models on top of that
17:21
and it does it not just with every other or every second word it starts to build further on top of that and dropping out the next sentence and filling that with a random sentence afterwards and building a probability distribution across whether or not the sentence that follows it should follow it probabilistically or not and then you can kind of feed it in seed text
17:42
and it'll use this probabilistic matrix to start to generate text but again all of this is theoretically based off of word tokenization and word frequencies I'm not sure, I haven't looked into the details of the BERT architecture whether it's using phonemes or morphemes or any other grammatical structures in the way that it's representing text contexts probabilistically
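A toy skip-gram example with the gensim library (pip install gensim, 4.x API); the two-sentence corpus is obviously far too small to learn anything real:

```python
from gensim.models import Word2Vec

sentences = [["the", "queen", "ruled", "the", "land"],
             ["the", "king", "ruled", "the", "land"]]
model = Word2Vec(sentences, vector_size=50, window=2, sg=1, min_count=1)

print(model.wv["queen"][:5])                  # first 5 of 50 dimensions
print(model.wv.similarity("queen", "king"))   # cosine similarity
```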
18:05
so I think the idea here is that we're just kind of building abstractions upon abstractions and starting to lose some of the more meaningfully grounded linguistic contexts so getting back to the higher level of word frequencies
18:20
this is another HathiTrust example from Ted Underwood's book Distant Horizons which is about distant readings and here he's plotted the frequency of color terms in a random sample of fiction and he's noticed that after 1800 the frequency of color words is starting to increase and he's had a bit of a hunch about whether or not it represented the decline
18:42
in third-person narration in fiction and it was kind of shifting from third-person narration to a more descriptive way of writing, and he looked for another study and found one from Stanford in 2012 from Heuser and Le-Khac which was looking at the rising frequency of concrete adjectives in 19th century novels
19:04
so colors, names, body parts, not just colors but other related descriptive terms used for physical descriptions and he gathered not just their data but the HathiTrust data and plotted a longer time period and then decomposed it down into genres
19:21
so you can start to use this as a correlation metric for his hunch in the previous diagram that the way that literature was changing after 1800 was starting to become more descriptive and what he found here is in the fiction you can see that the frequency of the color words is rising a lot more than the frequency of the color words in biography
19:41
which suggests that the genre and the form of writing in fiction is changing over time and this is just with very high level word frequency analysis and kind of taking two studies and correlating them together to form an interpretation so what we don't know about this is what the annotation unit was
20:02
how has he plotted these he's obviously not using web pages but he could be going down to the paragraph level in terms of his aggregations I'm pretty sure it's word frequencies but is he using say the HathiTrust data and word frequencies by page I'm not really sure if that would have too much impact on how he's built up
20:21
the frequency analysis across this. But another decision that you need to make about analyzing your data is: what is a document, what's your definition of a document? Is it going to be the entire book, is it going to be just a single tweet, is it going to be say just a chapter of a book? The impact of these decisions on the pipeline that you build for analysis is kind of difficult to measure in some ways.
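A hedged sketch of that kind of frequency analysis, with both the corpus and the color lexicon invented; the real studies work over millions of volumes:

```python
COLORS = {"red", "blue", "green", "white", "black", "grey"}
corpus_by_year = {
    1790: "the white house stood silent",
    1850: "red and blue and green light fell",
}

for year, text in sorted(corpus_by_year.items()):
    tokens = text.split()
    share = sum(token in COLORS for token in tokens) / len(tokens)
    print(year, f"{share:.2f}")   # share of tokens that are color words
```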
20:43
So, language structures: I'll talk a little bit about kind of internal and external models. Who's heard of psycholinguistics?
21:04
yeah okay who's heard of sociolinguistics yeah okay there's sort of a rough correlation here between internal quantitative and external qualitative yeah it's not direct but that's what I'm trying to draw the connection between
21:22
so psycho language models, psycholinguistics, is really about the psychology and the science of the interpretation of language, yeah. It looks at how long-term and short-term memories affect our understanding and interpretation of the words that we hear, it looks at the way
21:41
that words are encoded, the phonemes and the morphemes that we hear and whether or not they have priming effects. It can also look at the kinds of glitches that the human brain exhibits, so there's things I think called spoonerisms. Has anyone heard of what a spoonerism is? Okay, I've got an example, so they're kind of slips of the tongue, yeah, if you say the dear
22:04
old queen or you say the queer old dean you've actually done a segment switch between the d and the q on dean and queen yeah so people will express this without any prescriptive form of language this is just the way that we naturally speak so there's word switching that happens so
22:22
the rules of word formation can be uttered as the words of rule formation, so people exhibit these sorts of expressive errors which give a bit of an insight into the way that the brain is actually doing language processing. Another one is morpheme switching, so you mean to say I'd forgotten about that, but the morphemes attach to the wrong words when it comes out
22:45
yeah so these little kind of examples that humans express give a bit of an insight into the linguistic structure and this is the sorts of things that psycholinguistic language modeling is looking to interpret yeah and I guess a lot of this is difficult to grab onto
23:04
socio-language models, they're looking at the languages themselves, the dialects, the regional influences over time of the literature and the verses and the songs that get transmitted between them, the ethnography of the language use and how that affects the
23:20
culture of the people that use them. So this is an example from a project from the digital humanities conference in 2019, which was just a few months ago, on digital folkloristics, and they were looking at the kind of northern regions' charms and songs and the way that these kinds of cultures were transmitted, the information was transmitted, and part of the analysis that
23:43
they did is that they were well the epic songs they realized that the personal pronouns which are often used as stop words and dropped out of standard analytical pipelines for simple kinds of processes actually had a fairly marked impact on the categorization of the epic songs in the
24:01
region and those correlated to different places, so what we've got: Ingrian, South Estonian and Finnish charms were identified by personal pronouns as a feature in the language. So dropping stop words out is usually used as a way to expedite or to simplify some of the analytical processing, but
24:23
when you're thinking about social contexts, it's actually really useful to know if someone's expressing it from the first person or the third person, and you can then draw conclusions from that as well.
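A small sketch of that trade-off, using NLTK's stop word list (pip install nltk, then nltk.download("stopwords") once); the whitelist of pronouns is our own addition:

```python
from nltk.corpus import stopwords

tokens = ["i", "sing", "of", "my", "homeland"]
sw = set(stopwords.words("english"))

print([t for t in tokens if t not in sw])
# ['sing', 'homeland'] -- the pronouns "i" and "my" are gone

keep = {"i", "my", "we", "our"}   # whitelist the diagnostic pronouns
print([t for t in tokens if t not in sw or t in keep])
# ['i', 'sing', 'my', 'homeland']
```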
24:46
Speaking then more generically about the linguistic models: the way you chunk things up can have an impact on the way that your scripts and your results are interpreted. Stemming, for example, obviously affects the meaning of words, but it's a computational technique to simplify your data representations.
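For example, a minimal stemming sketch with NLTK's PorterStemmer, showing how related forms do and don't collapse:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print([stemmer.stem(w) for w in ["running", "ran", "runs", "runner"]])
# ['run', 'ran', 'run', 'runner'] -- meaning is flattened, and not always consistently
```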
25:01
You then have kind of multilingual models to work with. There are no standards that I'm aware of yet, I might be wrong, but the standards for reuse of data models across multilingual software tools are only starting to accrete. You guys might be familiar with more; I've used spaCy, which is a Python natural language processing library, and
25:24
they're very good about exposing through easy downloads their multilingual models but again i think well they've only got one multi-language model otherwise they've got individual language models and i'm not too sure whether or not they're kind of providing all of
25:41
the deep linguistic features that you might be used to using, in which case, downloading the French model here, if it's just word embeddings, you're going to need to produce an additional model for the types of analysis that you want to do in French. So there's a massive fracture of the models that you need to learn to use and the software tools that you bring together into your natural language processing pipelines.
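A sketch of what loading a language-specific spaCy model looks like (each model must be downloaded first, e.g. python -m spacy download fr_core_news_sm; the sentence is our own example):

```python
import spacy

nlp_fr = spacy.load("fr_core_news_sm")
doc = nlp_fr("Je crois que je deviens fou.")
for token in doc:
    print(token.text, token.pos_, token.lemma_)
```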
26:03
So these are some of the models that are available. Has anybody used many of these? No? GermaNet is a German language
26:21
equivalent of WordNet, and WordNet is used quite extensively. The Linguistic Data Consortium has quite a lot of data on their repositories that you can surf through and browse around; you'd need to check what the provenance of all of that data is before you start to use it, I guess that's the challenge of doing linguistic research
26:42
I guess what I'm trying to say here is that there's a lot of slippery standards that are starting to come into general use, but part of your research process should be to evaluate what standards are available for the sorts of tasks that you're planning to do and try to adopt the standards so that others can reuse your research later on or reuse the
27:03
models that you produce later on i guess back on linguistic production is i found this blog post dimensions of dialogue by joel simon and we're talking now about i guess semantic spaces and meaning alignments and what he's done is used collaborative machine learning he
27:20
calls it collaborative but it was a variant of adversarial machine learning which is two machine models one producing examples one critiquing examples and they both kind of learn from each other in this competitive way he's called it collaborative rather than competitive exactly how he's implemented it i'm not sure but what he's done is use these networks to create emergent
27:43
language isolates which are these small vector you know pictorial representations of what these words mean and he started to see in language isolates that similar concepts are producing similar images so they're grounded in a lexicon an english lexicon but they don't have the semantic or linguistic context they've just got the lexicon and they're starting to bring out even
28:05
with these generic methods similarities in the word form representations that are produced there's another technique another neural network model that's
28:21
well, that's called hierarchical temporal memory, and there's a company, which I haven't written on the slide, called Numenta, and they've built their hierarchical temporal memory neural network model based on the neuroscience of the structures of the neocortex of the human brain
28:40
so they've been looking at the way that the neurons are structured into columns, which they call cortical columns, and there are six layers in these cortical columns, and they've built this neural model to represent the way that information flows across between these neural columns in the neocortex, the human neocortex, which is a slightly more advanced
29:01
way of representing neurons. What they've ended up doing is, well, they sold this technology to a company called cortical.io, which has produced what it calls semantic fingerprints, which is again another representation of the semantic spaces of these words, so you can
29:20
see over here, the example they've got where mice is, the fingerprint has a bit of a cluster down on the side here, rat does as well, but they've both got a cluster at the top called mammals, and so if you're doing bit-level comparisons you're starting to get, it's not quite a vector comparison, but a pictorial bit-level representation of the kinds of semantic relationships between words.
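A sketch of a bit-level overlap in the spirit of those fingerprints; the bit positions are invented, not real cortical.io output:

```python
mice = {3, 17, 42, 101, 200}   # indices of set bits in one "fingerprint"
rat = {3, 17, 55, 101, 300}

overlap = len(mice & rat) / len(mice | rat)   # Jaccard similarity on the bits
print(f"overlap: {overlap:.2f}")              # 0.43
```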
29:42
So there are lots of semantic meaning changes in words. A lot of these I don't know the meaning of myself, but I do know that these are the sorts of language interpretations that we are just
30:02
struggling to get a grasp on with computational linguistics so uh homonyms and translations for example and there's an example in germany where it was a bit of a meme where they started to print postcards and sell merchandise but under the idea of i think i spider which is a literal
30:22
text translation, so if you, I don't speak German, but if you translate the words directly it says I believe I spider, but actually it's more of a colloquialism, you know, the literal word translation was quite funny to native speakers but completely lost on other cultures. I've got a few more examples here: my English is under all pig means your English is really
30:44
bad. So yeah, how do you represent those sorts of meaning changes with semantic frames or with cortical learning algorithms? We're still kind of grasping at these sorts of things. There's an example from Robert Hongrad called Smoke and Mirrors, and what they've done with this
31:04
is take a vocabulary of words from a previous research project which were deemed to represent vagueness in English, and they essentially did word frequency analysis across corpora of, I think, some literature, but some speeches from Putin to NATO and from speeches
31:26
from nato to putin and they plotted the vagueness over time so diachronically they crunched these corpora up and they noticed that over time putin's language was being more and more vague and nato's language was getting more and more concrete but how you interpret
31:42
that i guess is related similarly to the exercises we did earlier which is at that word frequency level yeah so what they've done is do word frequency analysis to see that the lexicon of vague terms is increasing over the speech time but again that's just at the word level there's no kind of syntactic understanding that's going on there then at the iconographic level we're starting
32:06
to see weird things happen online so in the same way i showed you the snowman font representation of meaning change this came up on twitter a few months ago so why is the save button on microsoft excel represented by the picture of a vending machine and so people who don't know
32:20
what floppy disks are are starting to lose what this icon means, and so we've got this social interpretation occurring because our obsolete technology is disappearing from the minds of society, and so that's another tricky thing to grab onto. An even trickier one would be doublespeak, so this says I'm currently economically inactive
32:45
due to being offered an early retirement opportunity as a result of my previous employer's human resources redundancy elimination initiative, which is basically, you know, I'm unemployed because the company was firing people and I got fired. So those two sentences mean the same thing
33:01
how you would deconstruct and represent those meanings in order to detect the same thing in different texts is a pretty complicated problem i'm just raising problems but what's happening with all of these layers of abstraction and the evolution of these techniques and methods
33:21
and technologies is that we're starting to use combined methods. So we started with n-gram frequencies; we go through to collocations, you know, the likelihood of co-occurrences occurring; we've got word embeddings from about 2013; and then some of those kind of neocortical grid cell algorithms might be other methods to use for an alternate representation to combine.
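As a reminder of what a collocation measure looks like, a pointwise mutual information sketch over invented counts:

```python
import math

# c_xy = bigram occurrences, c_x and c_y = unigram occurrences, n = total tokens.
n, c_xy, c_x, c_y = 10_000, 30, 120, 200

pmi = math.log2((c_xy / n) / ((c_x / n) * (c_y / n)))
print(f"PMI: {pmi:.2f}")   # positive means the pair co-occurs more than chance
```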
33:43
This example is called Disentangling a Trinity, it's another presentation from the digital humanities conference in Utrecht a few months ago, and they were using those three methods here to analyze and crunch the Dutch newspaper corpus
34:07
to look for network relationships and meaning relationships between the words modernity civilization and europe yeah and so with each of those diachronic segments they've plotted out
34:20
how each of these words related to each parts of the newspapers over time and so that was a combination of these three different methods that they used so yeah i guess the more methods that you can produce and the more open you can be about how you're using those in your
34:40
interpretation the better. So I'll pass over to Jane, I'm just going to hold it I think
35:02
okay, um, historians are very very interested in concepts over time, it's that temporal aspect which is something that we spend a lot of time thinking about. There is even, believe it or not, a Society for the History of Concepts, which has a very big international conference every year, so it's a big deal. And so it's over time, and the meanings of words
35:23
and concepts can change, and there are lots of different factors working alongside each other to affect this, so different social and cultural contexts and within and between different languages, some of which Marty was just talking about, that's the kind of work that group is dealing with, these very difficult and multilingual contexts, and it's not over the sort
35:45
of extended 200 year period of time that we've been talking about earlier this morning, but language changes very very rapidly now, partly because of social media and the exposure to lots of different contexts and cultures. There was a wonderful project funded by the Digging
36:01
into Data scheme a few years ago which looked at the spread of new words on social media and found very interesting national and specifically cultural variations in how words spread in North America, and where they came from, and which ones were successful and which ones weren't. And it's particularly difficult to unpick all of this because when a new meaning emerges
36:26
you don't lose the old one they coexist for long periods of time and you may end up with meanings always coexisting or some of them may start to fall out of usage and the the rate at which that happens varies wildly and it may be a generational thing the example we're going to
36:46
use a bit later on is sick which people my age would still never use to mean good but younger people in the uk would use that to mean something that's really good so some groups never get this new meaning and others use that all the time and start to lose
37:04
the old meanings so it's much easier to identify neologisms those new words the first occurrence of a term than it is to trace those shifts in meaning between words over time there is a really good project at the university of glasgow called the historical thesaurus of
37:23
english which has been looking at mapping these changes and identifying the points at which meanings started to shift and they found a very very complicated picture this is for the word sick as i mentioned and it finds 47 different meanings and usages of that term within its
37:44
corpus wildly differing some of them are to do with mental health as you would expect physical health around inferiority and lacking understanding some of those are meanings that wouldn't mean anything to us anymore because they've been lost they're a historical
38:02
meaning but all of these exist alongside each other complicating whatever analysis of a text that you're going to do and incidentally the people who make the most use of this resource are actually historical novelists because they want to write using words that are appropriate
38:22
for the period that they're setting their research in so this is kind of broken out of the university space and has become something that gets used much more widely in popular culture i think it's very interesting that even people writing fiction are starting to get interested in how meanings change and trying to make sure that the dialogue that they're
38:42
using is appropriate for the period that they're working in and another project at the UK web archive and again they started out being interested in tracking changes of concepts over time using that historical thesaurus as a way to try and identify this and you might see
39:05
from the description of the project up there that they abandoned that approach fairly quickly because it was just too complicated to do even with a 15-year period of data so they started to look at polarity rather than getting into those more subtle differences so is it a negative
39:22
or positive concept, rather than how has this word changed, where is it being used, what context is it appearing in. Some of you looked at this earlier, it's the n-gram viewer that sits over the UK Web Archive, and this is to show really that we had all that kind of complicated mess and
39:40
difficult to interpret graphs earlier when we were looking at the n-grams, but this is fairly straightforward, and it's to do with tracing the first occurrence of a term in this corpus, and a colleague at the British Library, Jason Webber, came up with steampunk as a good example that would work well for data from this period. It was first coined by the
40:04
science fiction writer and his surname is Jeter I can't remember his first name in 1987 so it's very new work and it appears in 1996 which is the first date that we've got in our corpus here and there are only 16 instances of it in more than 866 000 records so and all of those pretty
40:27
much all of those, it was K.W. Jeter, that was his name, almost all of those are on university websites. So there are 16 examples of this new term and they're appearing on university websites, and then as you get through to the later period it really starts to take off in 2007
40:43
and peaks towards the end of our period as you'd expect as it starts to become really mainstream and fiction and media and people are using it more commonly so that you can see that something's interesting this started to enter people's vocabularies at this point and it became really popular at this point what you can't say is there are only 16
41:04
occurrences of this in 1995, 1996, or, you know, there's more going on there, but you can get a sense of a trend, you can't really rely on the figures that are emerging. And what's this one, I can't read it, oh yes, this is that sick that I was showing you earlier with its 47
41:25
different meanings you can see that there are fluctuations in how often it appears but you have absolutely no idea which versions of that word you're looking at here so it that's really not telling you anything at all other than this word appears roughly this number of times
41:42
but you can't read anything meaningful into that linguistically. Um, you can start to dig down a little bit more, we didn't talk about this earlier, but it is a way of penetrating some of that n-gram obscuring of what's going on, that you can identify parts of speech. These
42:02
are the different parts of speech that the Google Ngram Viewer will let you look at, so adjectives, adverbs, pronouns and so on. Again, you don't know how they've decided that this is an adjective or a pronoun, but still it gets you some part of the way to getting to that complexity of language. And it's worth mentioning here that something most of you are probably used to, excluding
42:21
stop words from any kind of textual analysis that you're doing: for some humanities research the stop words might be exactly what you want to look at, to identify who's the author of a particular text, because it may be somebody that uses a particular basic grammar formulation a lot. So a choice that you make to strip stop words out because that simplifies the processes
42:43
might actually complicate the work of a researcher who's interested in precisely that aspect of the text, so all of those decisions really start to play into this. So if you add the parts of speech qualification to the Google Ngram Viewer, um, this is for the word tackle which is the example
43:01
they give as being a very straightforward example to disambiguate, so you've got tackle as a verb and tackle as a noun. But tackle as a verb of course doesn't just have one meaning, it can mean tackling somebody in a football match or it can mean tackling a job, and I'm sure there are various other interpretations of that as well. So even the example that's given as a very simple way of getting through some of this linguistic complexity really isn't doing quite that.
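The same disambiguation can be sketched with NLTK's part-of-speech tagger (pip install nltk and nltk.download("averaged_perceptron_tagger") once); exact tags depend on the tagger:

```python
import nltk

print(nltk.pos_tag("they tackle the problem".split()))
# "tackle" should come out as a verb tag here (e.g. VBP)
print(nltk.pos_tag("the fishing tackle broke".split()))
# and as a noun (NN) here
```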
43:22
Um, this idea about using appropriate language for the period that you're interested in studying is something that's been picked up by a lot of researchers. This is an American academic called
43:42
Ben Schmidt, who has spent a lot of time studying the scripts for television programs to see how many anachronisms there are in them. Um, how many of you know the British television program Downton Abbey? Yeah, it's a sort of early 20th century period drama, and he's um gone through
44:03
all of the scripts for the series um covering the years 1912 to 1921 and found 34 phrases that would not have been used at the time that have been used in the scripts uh some of these are wonderful um things that you think sound very period specific staff luncheon which is a very
44:23
kind of formal old-fashioned sounding phrase but that was not in use at the time uh things like dress fittings uh wartime marriage uh want grandchildren realistic prospect those are all things that have been used by the authors that nobody at the time as far as we know from our
44:40
analysis of digitized um contemporary material would ever have used and he goes on to find another 26 phrases that would have been possibly used in the 1910s but only really rarely and they're much much more common now so again they that's disproportionate use for an early
45:01
period and those are things like likely outcome hospital costs off limits overall charge the basics those kinds of things so um i think this is a really interesting way of well potentially humiliating script writers for not having done their research properly um but it's it's a you know an interesting approach to using these methods to tell us something about the
45:24
way people communicated and how we should think about that and represent that. And the final example of research using these sorts of methods is an English literature professor in the UK called Martin Eve, who's done some amazing work on the novel Cloud
45:40
atlas by the author david mitchell which was also made into a film he's just published a book on that which he calls close reading with computers which is providing um using these quantitative methods that are often used to look at a really large corpus to look in great detail at a single text which you would also be able to work
46:00
with qualitatively but try and identify what quantitative methods can get you when you're working with a much smaller text and this particular novel has sort of multiple fictional characters contributing different chapters to it and one of those is written in the style of a 19th century biography so that was what he wanted to look at how appropriate is the language
46:24
that the author has used here to reconstruct this apparent 19th century material, and it turns out that it's very very good. In the whole chapter there are only three words that would not have been used during the period, which is from 1851 to 1910, and they are spillage
46:42
which dates from 1934, latino which is from 1946, and lazy eye which didn't start to be used until 1960. So the author only got that wrong three times, but I think more interesting from a humanities perspective is that the author has overused racist and colonial terms
47:03
compared to how often they would have been used in the 19th century so the author has had a sense of literature from that period as using lots of colonial empire racist related terms and has used those and they are appropriate for the age but he's used them far more than
47:23
contemporary writers would have done. So that's an interesting, different way of looking at it: it's not just about term occurrence but about the type of term and how often you're using them and what that means for studying language as well. So this research is being used at a huge scale but also to get different insights into much smaller texts and to complement
47:46
qualitative research processes as well and again that moving backwards and forwards between the two is something that we do in digital humanities a lot so um so those are just some examples of how the techniques that marty was talking about and we try to get to that complexity that's
48:03
involved here so i'll pass over to you for the the last bit marty okay so what we want to do is build on what we did earlier using the engram viewer again i know there were a lot of problems with it but this time we want to introduce word senses so if you go to ht.ac.uk it's the
48:27
historical thesaurus yeah and what we'd like you to do is kind of form groups different groups from last time if you can please yeah we want you to kind of mingle with each other and pick
48:41
some words look them up in the historical thesaurus look at the senses of the use of the word over time and then try to find trends to those effects in the hathi trust engram viewer um you can use the engram viewer with the google engram viewer
49:01
with the parts of speech separation as well so that's another point of comparison yeah in fact that probably is a necessary point of comparison does that make sense yeah okay so this morning we looked for trends in the hathi trust engram viewer
49:21
yeah now what we'd like you to do is pick a couple of words look them up in the historical thesaurus look at the senses of the word and how they've changed in the thesaurus over time and then try to find those trends in the word frequencies yeah the example that i gave marty
49:40
this morning was the word hysterical which um at one point was applied very much to women's health and women's mental health problems in particular and then moved more broadly into mainstream psychiatry and then has also has this meaning of something being hysterically funny so and those are very um time bounded shifts in the meaning of that term so there'll be other
50:03
other words that you're aware of or you search for things and see how often they might have changed or had those very large shifts in meaning and see if you can actually identify that from the tools that are made available for this kind of analysis so we'll give you
50:22
about 10 15 minutes rearrange tables yeah mingle with people you haven't mingled with before and we'll come around and talk with you as it goes on yeah