Exploring GLAM data (with Jupyter notebooks) - Sept 18
Formal Metadata

Title: Exploring GLAM data (with Jupyter notebooks)
Title of Series: Tech Talk
Number of Parts: 19
Author: 0000-0001-7956-4498 (ORCID)
License: CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/42941 (DOI)
Transcript: English (auto-generated)
00:00
So, thanks for the opportunity to come and talk today about something that I've been really excited about for the last six months or so. It's a great chance to play around with some stuff. My slides today are at the link there, and if you follow that link, you'll find that you're actually loading up a live computing environment and will be
00:24
able to do stuff as we go along today, if you want to play along. So, let's get stuck into it. Let's get this on the screen, okay. Now, this is a live computing environment, and we're actually going to run some live code during this presentation. If you want to play along, it's pretty simple: it's
00:46
just a matter of clicking the little play button when you get to one of these cells, so that it runs that little bit of code. What I'm using is Jupyter notebooks, and you might all be familiar with Jupyter notebooks;
01:08
you've probably seen them before, so this might be a bit old hat to you. In fact, I won't actually talk about what a Jupyter notebook is; I'm just going to show you what a Jupyter notebook can do, in the context
01:20
of cultural heritage data. So, let's start by getting some data from Trove. I'm assuming you all know what Trove is; if you don't, you're in trouble. To get some data from Trove, I'm just going to do what I said and click on that little play button. It sets some parameters, goes off, makes a request to the Trove API and brings us back some data.
01:42
In this case, we're getting facets showing us the total number of digitised newspaper articles published in each state, out of Trove's 200 million digitised newspaper articles. Okay, so we've got some data; what can we do with it?
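A minimal sketch of the kind of request that cell makes, assuming Trove's v2 API and a placeholder key (TROVE_API_KEY); the JSON layout shown is what the API returns when a single facet is requested:

```python
# Ask Trove's newspaper zone for no records (n=0) but include the
# "state" facet, which gives per-state article counts.
import requests

params = {
    "q": " ",                # a blank query matches everything
    "zone": "newspaper",
    "facet": "state",
    "n": 0,                  # facets only, no individual records
    "encoding": "json",
    "key": "TROVE_API_KEY",  # placeholder: use your own key
}
response = requests.get("https://api.trove.nla.gov.au/v2/result", params=params)
data = response.json()

# JSON layout for a single requested facet
facets = data["response"]["zone"][0]["facets"]["facet"]["term"]
for term in facets:
    print(term["display"], term["count"])
```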
02:01
Well, let's make a map. So, again, I'm just going to run this cell, and what we're going to get, if we go to the next slide so we can actually show it, is a choropleth map. There it is, and it again shows us the number of newspaper results per state.
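One way a map like that might be built, as a sketch: it assumes folium, the facet results from the request above loaded into a dataframe, and a hypothetical GeoJSON file of Australian state boundaries (aus_states.geojson):

```python
import folium
import pandas as pd

# Facet results from the previous request, as a dataframe
df = pd.DataFrame(
    [(t["display"], int(t["count"])) for t in facets],
    columns=["state", "total"],
)

fmap = folium.Map(location=[-28, 134], zoom_start=4)
folium.Choropleth(
    geo_data="aus_states.geojson",  # hypothetical boundaries file
    data=df,
    columns=["state", "total"],
    key_on="feature.properties.STATE_NAME",  # depends on your GeoJSON's properties
    fill_color="YlGnBu",
).add_to(fmap)
fmap  # in a notebook, the last expression renders the map inline
```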
02:27
So, this is a really simple example of the power of Jupyter notebooks for this sort of exploration of cultural heritage sources. I can make those requests, get the data back, and do something with it, and we can play with it and start to analyse it, all live within the comfort of your own browser.
02:44
And it's able to do that because we're actually sitting on top of a live computing environment. One of the nice things about this is that we're not just limited to what I've put into this notebook.
03:00
So, if I now just go backwards, hitting Shift-Space to go back through these cells, back to here: these are all editable. So, instead of the blank query which gave us everything, I can search for camels and run that again,
03:23
make our map again, and have a different map. So, not only are these live computing environments which can interact with real data sources, but you can edit them and change them.
03:40
So, you can actually use them as a tool for exploration yourself, and that's really the theme of what I'm talking about today. Okay, so what's Jupyter? Well, Jupyter is this presentation: this presentation is a Jupyter notebook. It's using a particular plug-in which enables it to be presented as a series of slides, but underneath it's just a Jupyter notebook.
04:01
As you've seen, it's editable: in any of these cells, you can just click and change any of the values. It's shareable, obviously. And indeed, this live version is running on a service called MyBinder, to which you basically send a GitHub link to your notebook.
04:22
It then spins up the computing environment that you need in order to run that notebook, a Docker instance, and provides that back to you. So, it's really handy for things like workshops: you can create a notebook, people can just log on, and they can start playing around with it in that live, customised computing environment.
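For reference, a MyBinder launch link follows this general pattern (the placeholders are illustrative):

```
https://mybinder.org/v2/gh/<user>/<repository>/<branch>?filepath=<notebook.ipynb>
```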
04:44
Why am I interested in Jupyter in the context of cultural heritage collections? Well, I've been playing around with various ways of working with cultural heritage data for a long time, really seriously hacking collections for the past 10 years,
05:00
sharing a lot of code and examples, creating various tools and applications, and doing lots of workshops and things like that. But I've always been a bit frustrated by our ability to really encourage and support people's own exploration. I mean, you can present tools, and they take you a certain amount of the way, but creating an environment where people are encouraged to actually start poking around inside the code
05:22
and go a bit further and see where it takes them has been a lot more difficult, and that's what really interests me about Jupyter. In the sort of stuff that I do, the sort of examples that I create, I've been focused on two particular issues in relation to GLAM collections.
05:40
That's GLAM: galleries, libraries, archives and museums. The two issues are the challenge of abundance and the illusion of completeness. What I'm going to do today is explore those two facets in the context of Jupyter notebooks, trying a variety of experiments and seeing where they go.
06:03
So, first of all, the challenge of abundance. We can say to Trove: Trove, tell me how many newspaper articles you have about influenza. There's a little bit of code which will do that, and if we just run it, it tells me that there are 1,614,300 digitised newspaper articles about influenza.
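A sketch of that sort of cell, again assuming the v2 API and a placeholder key:

```python
# Get the total number of matching articles without retrieving any records
import requests

params = {
    "q": "influenza",
    "zone": "newspaper",
    "n": 0,
    "encoding": "json",
    "key": "TROVE_API_KEY",  # placeholder
}
data = requests.get("https://api.trove.nla.gov.au/v2/result", params=params).json()
total = int(data["response"]["zone"][0]["records"]["total"])
print(f"{total:,} digitised newspaper articles")
```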
06:29
And that's a pretty common result when you type something into Trove's digitised newspapers. Of course, it's a huge, incredibly rich collection for many types of research. But it's also a bit overwhelming.
06:40
How do we make sense of that volume of material? What does it mean that there are 1.6 million results? So, let's start thinking about how we can drill down into that results set. We could, for example, just do a quick query which shows us...
07:19
This is just breaking the results down by category, and it's mostly advertising, actually.
07:26
Presumably remedies relating to influenza; a significant number of articles, anyway. Let's just keep playing around with the possibilities. So, let's think about how we could look at this as change over time, the number of articles over time.
07:42
And actually, I've got a tool specifically for that: a thing called QueryPic. It's been around for a long time now; I think I created the first version back in 2011. In fact, the first version predated the Trove API, and it just sort of screen-scraped data out of Trove.
08:01
But what it does is simply show you the number of results matching your query for each year. So, it shows you the whole of your results set: instead of seeing a list of the first 20 results, you get everything displayed over time, and you can start to explore that and drill down by clicking on a point. And that's good.
08:21
QueryPic is quite well used, and it's actually been cited in a number of published articles and books where people have used it in their research. Again, though, the frustration is that it takes you to a certain point, but then it's hard to know where to go. How do you follow that through? How do you continue your explorations?
08:42
So, what we can do, again using this notebook, is start to build our own version of QueryPic, a transparent version which actually exposes its workings to us. This is basically the same sort of code that's sitting behind QueryPic. And in this case, we're going to look at influenza from 1880 through to 1940.
09:08
So, again, I'm just running this code, and it's making a request for each decade in that period. That's why it's actually six API requests that it's making, bringing the data back and showing us that we can then get the facets by year.
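A sketch of that loop, under the same assumptions as before: Trove's year facet works within a single decade, with decades encoded as the first three digits of the year (so 188 is the 1880s):

```python
import requests

API_URL = "https://api.trove.nla.gov.au/v2/result"
year_totals = {}

# One request per decade from the 1880s to the 1930s: six in total
for decade in range(188, 194):
    params = {
        "q": "influenza",
        "zone": "newspaper",
        "facet": "year",
        "l-decade": str(decade),   # limit the query to this decade
        "n": 0,
        "encoding": "json",
        "key": "TROVE_API_KEY",    # placeholder
    }
    data = requests.get(API_URL, params=params).json()
    for term in data["response"]["zone"][0]["facets"]["facet"]["term"]:
        year_totals[int(term["display"])] = int(term["count"])
```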
09:25
So, we can take that. We can run that one to make our chart,
09:43
and then we can just display our chart here. So, we see the number of articles over time. It's just the same as QueryPic, but in a notebook form that we can play around with and edit. Now, okay, that's interesting, but then we might be looking at this chart and wondering.
10:00
We might say: okay, but how do we know there just weren't more newspapers published in 1919? How do we interpret that peak in 1919? Well, one thing we could do is try dividing the number of results by the total number of articles published each year. And it's just a matter of making another API request.
10:21
So, now we're getting the total number of articles for each year, and we're just going to divide the number of results from our influenza query by that total for each year. Okay, we've got another chart.
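The normalisation step might look something like this sketch: it assumes two dicts keyed by year, like the year_totals dict above, with the second built from the same harvest run with a blank query:

```python
import pandas as pd

def plot_proportion(query_totals, all_totals):
    """Divide per-year counts for a query by per-year counts for everything.

    Both arguments are dicts keyed by year; all_totals comes from
    repeating the facet harvest with a blank query.
    """
    df = pd.DataFrame({"matches": pd.Series(query_totals),
                       "total": pd.Series(all_totals)})
    df["proportion"] = df["matches"] / df["total"]
    return df["proportion"].plot()  # proportion of each year's articles that match
```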
10:41
And we see here it's slightly different. The peak in the 1890s is more significant than it was in the earlier chart, but clearly, as we all know, there was a significant influenza epidemic in 1918, and that is a real feature of the chart. So, it's not just that there are more articles.
11:02
There is something real that we're looking at there. So, let's focus in on that period, 1917 to 1919. In this case, what we're going to do is make a number of requests,
11:25
basically one per month, across 1918 and 1919. And we're asking it to show us, this time in the facets, the titles of the newspapers in which these articles were published. Once we've got those titles, what we're going to do is
11:42
match them up with another dataset which has geolocated those titles, so it has positions for where those newspapers were published, and put all our results on a map. And all the code is here, so I'm not hiding anything. This is all the code that is munging that data together.
12:01
It's bringing in the location data, and it's using Pandas, a library which is heavily used for manipulating tabular data, to link the tables together. So, if I go here, I'm just going to make the map. And finally, I'm going to show the map.
12:30
And so, we've got an animated heat map, which takes us through that period, 1918 to 1919.
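A sketch of how the merging and animation might be wired up, assuming a titles_df dataframe from the monthly facet harvest (columns month, title_id, count) and a hypothetical locations.csv mapping each title_id to coordinates; the animation uses folium's HeatMapWithTime plugin:

```python
import pandas as pd
import folium
from folium.plugins import HeatMapWithTime

def monthly_heatmap(titles_df, locations_csv="locations.csv"):
    """Animate per-month article counts on a map of newspaper locations."""
    locations = pd.read_csv(locations_csv)  # columns: title_id, latitude, longitude
    merged = pd.merge(titles_df, locations, on="title_id")
    # HeatMapWithTime expects point weights between 0 and 1
    merged["weight"] = merged["count"] / merged["count"].max()
    months = sorted(merged["month"].unique())
    heat_data = [
        merged.loc[merged["month"] == m,
                   ["latitude", "longitude", "weight"]].values.tolist()
        for m in months
    ]
    fmap = folium.Map(location=[-28, 134], zoom_start=4)
    HeatMapWithTime(heat_data, index=[str(m) for m in months]).add_to(fmap)
    return fmap
```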
12:40
So, we started with a particular question around influenza. We saw the full-scale results, those 1.6 million results, and then we started to drill down to see what we could find out about them. And that's all just within this notebook, just those bits of code that I've been showing. There's no magic behind the scenes here.
13:03
So, that takes us to a certain point. We've been working there, obviously, with a small number of API queries, just getting faceted data out, which gives us summaries of the material. But, you know, we get to a point where we're going to want to dig a bit deeper than that. We're actually going to want to pull out
13:21
the data relating to those newspaper articles and start to explore it in depth. And we can do that as well. So, this is a full notebook here, not in its slideshow form.
13:41
A number of years ago, again back around 2011 or so, I created a tool to do just that: to harvest metadata from newspaper articles in Trove into big datasets, so that you could get, you know, 5,000, 10,000, 20,000, 50,000 newspaper articles,
14:01
which you could then analyse in the tool of your choice. And that's been through a number of revisions over the years. At the moment, it's a Python command-line tool. It's pretty easy to use, but it still has a barrier, in that you have to get a Python environment set up
14:20
and you have to install the tool and use the command line, which can be a big barrier for people. But once again, what's cool about the notebooks is that I can actually run that command-line tool from within a Jupyter notebook, again within the comfort of your browser. So, this is another live notebook, and I can show you that the Trove harvester
14:42
is sitting there behind it. So, that's the command-line tool there. I've set it up with an API key and with a basic query, which in this case is cyclone and rag, limited to the decade of the 1910s.
15:01
So, all I'm going to do is run the harvester, and if the Trove API behaves, it's going to harvest about 300 newspaper articles. In this case, obviously, I've kept it to a fairly small sample, but using this notebook I have harvested tens of thousands of newspaper articles.
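In a notebook, a cell beginning with an exclamation mark runs as a shell command, so the harvester can be invoked directly; the invocation below is illustrative only, and the exact command name and flags should be checked against the harvester's own documentation:

```python
# "!" runs the rest of the line in the shell. The query is just a Trove
# search URL; a flag like --text would also save the OCRd article text.
# (Illustrative: check the tool's docs for the real arguments.)
!troveharvester start "https://trove.nla.gov.au/newspaper/result?q=cyclone" TROVE_API_KEY --text
```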
15:22
Now, we're harvesting the metadata, the basic publication information from all those articles, but we're also harvesting their full text. So, once the harvest is finished, we can start to do stuff which explores both that metadata,
15:41
the top-level information, and digs down into the text files themselves. And that's just about done, I think. You know something's done when that little asterisk turns into a number. There it goes. So, now, because we're running, as in this case, on MyBinder, which is, as you know,
16:01
a cloud-hosted service, you'll want to download the results. So, you can just run this cell, and it zips up all the results into a zip file. Then I can run the next cell, which gives me a nice download link; I can just click on that and it'll download the results. So, I can use this page as a way of harvesting thousands of newspaper articles from Trove, downloading the results as a spreadsheet along with all those little text files.
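The zip-and-download step can be as simple as this sketch (the harvest directory name is illustrative):

```python
import shutil
from IPython.display import FileLink

# Bundle the harvest directory into harvest.zip, then display a
# clickable link that the notebook server will serve for download
shutil.make_archive("harvest", "zip", "data/harvest")  # illustrative path
FileLink("harvest.zip")
```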
16:23
And then we can open up another notebook which gives you some hints for how you might like to start exploring that data. So, in this notebook, we're going to quickly open the last harvest.
16:43
We can do things like show the newspapers that are represented most often within that set; this is just working on the spreadsheet. We can obviously look at the articles over time within that set, so we can drill down. We can make a simple word cloud.
17:02
Again, that's just using the titles of the articles. So, these are just hints for exploration.
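A sketch of those first explorations, assuming the harvested metadata has been saved as a CSV (the filename and column names are illustrative):

```python
import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud

df = pd.read_csv("results.csv")  # the harvested metadata (illustrative filename)

# Which newspapers are represented most often in the harvest?
print(df["newspaper_title"].value_counts().head(10))

# A quick word cloud built from the article titles
cloud = WordCloud(width=800, height=400).generate(" ".join(df["title"].astype(str)))
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```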
17:21
The idea of putting this notebook together is really just to give people some idea of how they can start to ask questions about that data. And you can actually go further: there's another notebook which enables you to work on the text of those individual files. So, there are now 300-odd little text files with the OCRd content from those newspaper articles,
17:41
and you can start to feed them through text-analysis programs to look for patterns and frequencies. There's a little thing here which enables you to do some TF-IDF analysis, to find the most significant words within each of those articles. So, there are all sorts of ways you can start to explore them.
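As a sketch of that kind of TF-IDF pass over the harvested text files, using scikit-learn (the directory name is illustrative):

```python
from pathlib import Path
from sklearn.feature_extraction.text import TfidfVectorizer

# Load the OCRd text files saved by the harvester
files = sorted(Path("data/harvest/text").glob("*.txt"))  # illustrative path
texts = [f.read_text(encoding="utf-8") for f in files]

vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
tfidf = vectorizer.fit_transform(texts)
terms = vectorizer.get_feature_names_out()

# The ten highest-weighted terms in the first article
row = tfidf[0].toarray().flatten()
for i in row.argsort()[::-1][:10]:
    print(terms[i], round(row[i], 3))
```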
18:01
There are similar sorts of things that I've been putting together. So, there's a harvester for RecordSearch, which is the National Archives of Australia's database, and it can pull out the full metadata for a whole series. I've created some sample datasets using it. So, this is a series in the National Archives
18:21
relating to the White Australia Policy, and I've just sort of filled out this notebook to give a summary of each of these series. You can then download the CSV file which has the data in it. So, you get a little summary, you get a little chart so you can see the date range, and there's a little link somewhere
18:42
to download the CSV file. So, again, that's using the notebooks as a form of delivering data. Okay, just quickly: why Jupyter? Well, I think it's really cool that you've got everything that you need in the browser, and that makes it great for workshops.
19:00
Anybody who's had to do stuff in university computer labs knows how difficult that is. It enables you to ask questions of GLAM data and follow where they go, to start big and then zoom in. And of course, you can rinse and repeat: you can start with a notebook, and you can go back, edit it, change it. You know, people often ask me, how do I learn to code?
19:22
And my most common answer is: well, you get somebody else's code, copy and paste it, and fiddle with it until it's broken; then you try to figure out why it broke, and you fix it. You go through that process, and it's really facilitated by the notebooks, in that you can just edit, change and try things without the worry of breaking anything seriously.
19:45
Okay, so the other question that I've looked at is the illusion of completeness. I'm just going to do another quick Trove query here, and this is showing us all of the newspaper articles over time.
20:09
Now, if you could actually speak back, I would ask you what that peak there in 1915 represents, and normally when I ask that, people say: it's the war, you know,
20:21
weren't there more newspaper articles published during the First World War? And the answer is no: that peak represents funding. In the lead-up to the centenary of World War I, money was invested in the digitisation of World War I-era newspapers. So that peak doesn't actually represent anything about the history; it's just an artefact of the
20:43
policies behind Trove's digitisation. And that's really important for people to understand when they start working with these sorts of collections: they are constructed, through processes of selection, through the implementation of policy, through funding, all sorts of ways in which these collections get created.
21:02
And I think it's a basic principle that we should be subjecting these sorts of things, APIs and CSV files and collection data, to the same sort of critical analysis that we would apply to a collection of primary sources in print form. The thing is, it's harder to do; but again, Jupyter notebooks give us the opportunity to start playing around
21:20
with these sorts of things in a form that we can easily share, that other people can learn from, so we can all start to understand what's going on behind the interface. And I won't go through it now, because I'm at the end of my time, but there's a notebook there in which I spent some time last week looking at Te Papa's new collection API, which is really great.
21:42
They've got really rich data there, but there's also some unexpected stuff, which you only find out about once you start digging through, making a few requests and going down through the facets. So why Jupyter? It's not just about working with the content of the data itself; it's the ability to ask questions about the systems
22:01
and the technologies and the policies that construct these things that we're using. And the real fun part is that ability not just to tell, but to show: to give people the opportunity to learn from these sorts of examples and to do real work while they're learning. You know, you can go back through this notebook and plug in your own research topics,
22:21
the things which interest you, and see where they take you. Okay, so the work that I'm doing is sitting in this repository on GitHub. I'm in the process of reorganising everything at the moment, so it's a bit of a mess, but feel free to jump in and try any of the notebooks.
22:42
Most of them will have links which allow you to open them up in Binder, so you can play around with them live, as with this one. And feel free to add, in the issues on any of the repositories on GitHub, any requests or additions that you might like to see.
23:00
And thanks very much.