RICardo and GeoPolHist: Exploring trade relations between the geopolitical entities of the world from c. 1800 to 1938
Formal Metadata
License: CC Attribution 2.0 Belgium. You are free to use, adapt, and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose, as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifier: 10.5446/61912 (DOI)
FOSDEM 2023, talk 422 of 542
Transcript: English (auto-generated)
00:11
I'll start a little bit early, before the 2 p.m. slot. I'm wearing this t-shirt because I'm one of the dev room managers here in this room,
00:21
and I'm taking over a talk slot that has been cancelled. We were supposed to hear a talk by Maria Arias de Reina, who couldn't make it today, unfortunately. She was supposed to talk about data flowing in the right way, which is a talk about a tool called CAUTO,
00:44
which implements data workflows with a low-code, no-code approach. This is what it looks like. Of course, I can't talk about this tool because I don't know it. It actually looks pretty cool. So, we are very sorry for Maria not being here,
01:02
and we hope we can host her next time. So, I will speak about a project, a research project, called Ricardo in the digital humanities, which I've done with...
01:21
Oops, sorry. Are we going for it? Yes, I worked with Béatrice Dedinger from Sciences Po's Centre d'histoire. She is a historian, and I am Paul Girard. I am from a company called OuestWare, a small agency specialising in developing data-oriented web applications,
01:42
and we do work a lot with researchers. Today, I'm here to talk about how a collaboration between a researcher, Béatrice, and a data engineer, myself, can be fostered by data control systems. By data control systems, I mean making sure we care about the data we use in the research
02:02
through documentation, version control, and also quality control. So, the research is about the history of trade. Together with Béatrice, we built a database of trade flows between nations,
02:22
well, between geopolitical entities in the world in the 19th century. The main data tells us how much money, in different currencies, was exchanged between different geopolitical entities in the 19th century.
02:42
We know imports and exports, and we know this with a bilateral view, which means that we know the trade from France to the UK, for instance, and the reverse too, from two different sources, which makes it quite a nightmare to deal with, but still. So, this is basically the main publication we already achieved.
03:04
So, we started by releasing in 2016-17 a web exploration tool, which I'll show you, and also a paper about how we built this database. And then, in 2021, we released a new database called GeoPolHist,
03:23
which is basically a data set that tries to track geopolitical entities. I'm using this weird word because we have countries, of course, but we also have entities that are part of countries, and we also have trade with entities that were colonies at that time,
03:41
and all kinds of weird political statuses. Because of that, we built this GeoPolHist database, where we tried to track which geopolitical entities were controlled by which other one over time. And recently, we released a new version of the database, adding 230,000 new trade flows.
04:04
This release of new data came about because Béatrice actually discovered a new archival book about trade, and this massive update needed a tool to make sure we can actually release data that is cleaned and structured the way we want it,
04:23
without having to deal with all those kinds of issues manually. I will speak about that a little bit later. So, this is what the main website looks like. It's a web application where you can explore, basically, the trade of the world in the 19th century.
04:42
So, we have different kinds of visualizations. I will not go through all of them, because I don't want to focus too much on this today. If you have questions about this, we can go back to it later. We also have this website, GeoPolHist, that allows you to explore the political evolution of the
05:05
sovereignty links between the different entities. I'll show you a little bit what it's like just afterwards, I think. So, just to be totally honest,
05:20
this slide is actually something I already presented at another conference. I wanted to speak about the visual data exploration tools we made and the frictionless data integration. This is the main point I want to speak about today, point two. And the third point was how we can actually analyze
05:41
heterogeneous data in the long run, like one century of data. My main point: I will try to convince you that using frictionless data integration is a very nice and important tool to foster the long-lasting collaboration between Béatrice, the historian, and me, the data engineer.
06:03
So, about collaboration, I just put a link to a talk I gave a few years ago on this specific subject. So, visual data exploration: I will go quickly over this part to focus more on the second part. Our main objective here is to propose, basically, a tool,
06:24
a set of interactive data visualizations on the web that allows researchers, or anyone exploring this, to change points of view on the data, looking at, for instance, the total trade of the world,
06:41
then focusing on one specific country, then on one specific currency, and to be able to add all the different ways to look at the data in the same tool. We also like to offer visual documentation, like visualization is a very nice and important tool to spot issues,
07:01
or surprises, or errors in the data, and to unfold the complexity of the whole phenomenon. So this is, for instance, the world view, so we are able to retrace the world trade in a century, but as you can see there is more than one curve, so we have different ways we can calculate that actually from the data.
07:23
We can, for instance, take the work of some researchers who did re-estimations of this total trade by correcting sources and all that kind of stuff. That's one way to do it; that's the green curve in this visualization. But we can also sum all the totals we have in our data.
07:44
This is the yellow one. So here, the yellow curve sums all the flows we have, and the red one sums only the totals that were in the archive books. And it's not the same thing. If you sum what we have, or if you take the sum that was done at the time,
08:01
you don't have the same results. Welcome to the nightmare of dealing with archive data. In this visualization, for instance, we are focusing on one country, Denmark, and we can trace the trade of this specific country over the long run.
08:22
And we can also visualize, so here is Germany on the right, not only the total trade, but also the bilateral trade between Germany and its trade partners over a long period.
08:41
Okay, so this is the main objective. So GeoPolHist here, for instance: when we talk about Germany, what are we talking about? We are talking about a geopolitical entity that had different statuses over time, as you can see here. And then you can see on the bottom line which other geopolitical entities
09:03
were actually part of Germany through time. Because sometimes we have trade with only Saxony or Waldeck, and we want to know eventually if those entities are part of another one.
09:21
So, frictionless data integration. We are using Data Package from Frictionless Data, from the Open Knowledge Foundation. There is actually a talk by Evgeny from the Frictionless team later today in the online part of our room.
09:42
He'll talk about the new tools around Data Package, and I'm very interested in that. But I will talk about what I've done myself. For this project, the first thing we do is version the data. We put the data as CSVs into a version control system
10:03
like Git, simply. Here it's on GitHub. And you can track, just the same way we do with code, who changed which data, when, and why. Here, for instance, Béatrice corrected a typo in a flow number,
10:20
adding a comma at the right place, and we have the commit message here. This is very important for keeping track of what's going on, because we have hundreds, even thousands, of files like that. It's also very important, when issues happen, to know where they come from.
10:42
So, Data Package. Data Package is a formalism: basically a JSON file in which you describe the data you have, adding documentation. The first benefit of using Data Package is to document your data set, to make it easier for other people to understand what you are doing.
11:02
And it's very important for publication at the end, Open Science. So here we have the title of the project. We have the license, the contributors. That's also very important to have. And then we describe resources. Resources can be seen as basically a data table.
11:20
Here, for instance, we have RICentities. For each resource, which is a CSV here, we describe the fields we have in the table. So we know that the RICname field is a string, it's unique, and it is required. It's really like a relational database schema, the same spirit, but in a JSON format.
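As a rough illustration of the formalism (only the RICentities resource and the RICname field come from the talk; all other names, paths, and values are made up, and the real descriptor is much larger), a Data Package descriptor looks like this:

```json
{
  "name": "ricardo-data",
  "title": "RICardo trade data",
  "licenses": [{ "name": "CC-BY-2.0-BE" }],
  "contributors": [{ "title": "Béatrice Dedinger" }, { "title": "Paul Girard" }],
  "resources": [
    {
      "name": "RICentities",
      "path": "data/RICentities.csv",
      "schema": {
        "fields": [
          {
            "name": "RICname",
            "type": "string",
            "constraints": { "required": true, "unique": true }
          },
          { "name": "type", "type": "string" }
        ],
        "primaryKey": "RICname"
      }
    }
  ]
}
```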
11:41
The first reason to do that, as I said, is documentation. The second reason is to control your data, driving data validation. If you have a data package described like that, you can then use a Python library called frictionless.
12:01
It will check each data line you have against the schema you wrote, and if a line doesn't respect it, it will provide you with an error report. For instance, here I have a foreign key error because the modified currency year is not known in the table
12:22
that is supposed to contain this data. It's a very nice way to work: we get new data, and then we check where we stand against what we want to achieve at the end, which is to respect the data package formalism we wrote. That's very cool for data engineers.
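To give a flavor of what this validation pass does, here is a simplified plain-Python sketch (this is not the frictionless library's real API; the schema, field names, and values are made up): each row is checked against the declared field types and constraints, plus known foreign-key values, and the failures are collected into a report.

```python
def validate_rows(rows, schema, foreign_keys):
    """Check each row against a schema dict and known foreign-key values;
    return a list of (row_number, field, message) errors.
    A simplified sketch of what a data-package validator does."""
    errors = []
    for i, row in enumerate(rows, start=1):
        for field, spec in schema.items():
            value = row.get(field, "")
            if spec.get("required") and value == "":
                errors.append((i, field, "missing required value"))
            elif spec.get("type") == "integer":
                try:
                    int(value)
                except ValueError:
                    errors.append((i, field, "not an integer"))
        # Foreign keys: the value must exist in the referenced table.
        for field, known in foreign_keys.items():
            if row.get(field) not in known:
                errors.append((i, field, "foreign key violation"))
    return errors

schema = {"year": {"type": "integer", "required": True},
          "unit": {"type": "integer"}}
fks = {"currency": {"franc", "pound", "dollar"}}
rows = [{"year": "1838", "unit": "1", "currency": "franc"},
        {"year": "mille huit cent", "unit": "1", "currency": "thaler"}]
report = validate_rows(rows, schema, fks)
# The second row fails twice: a spelled-out year and an unknown currency.
```

The real library does much more (type coercion, detailed error metadata, reports per resource), but the shape of the output, a list of located errors, is the part the curation interface described later builds on.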
12:43
But as I said, our goal is that Béatrice and I work together: when she enters new data into the system, she, as a historian, has to take the decisions on how to interpret the data that were in the archive.
13:03
I can't. That's not my job; I have neither the responsibility nor the skills to do it. So we need something that allows her to correct and update the incoming data into the data package format. And she can't use command-line tools and Python scripts
13:20
and that kind of stuff. So we need a tool here to let humanist researchers, in this case, but people in general, interact with the data flow through something other than a too-technical interface. So we developed a specific web application that helps Béatrice integrate new data
13:44
by using the data package as a validation system. All of this is done in JavaScript; there is also a JavaScript library for Data Package. These are basically the steps. The idea is that Béatrice uploads a data spreadsheet,
14:01
a new data file, a transcription of a new archival book she found. The tool first checks the spreadsheet format: do we have all the columns we want, and so on. If it's correct, it then goes through all the data points, collecting all the errors and grouping them
14:23
to propose a curation interface where she can correct all those issues through a form. We tried to develop something that makes this tedious, long process
14:40
as easy and as fast as we could for Béatrice to go through. At the very end, the tool commits the data into a Git repository and pushes it to the GitHub repository. All of that is done in a serverless web application, which means I didn't have to introduce
15:05
the Git command line to Béatrice either. The tool does that for her. So this is what it looks like: a React web application. Here we have the schema validation summary, where we see, for the fields,
15:20
the different fields for which we have errors and the kind of error we have. At the end we have the error overview, which says how many rows have an issue. For instance, here in the source column, we have two different invalid values that impact 169 rows.
15:40
The idea is to correct this whole group of 169 rows with only one edit. Once we have all those errors, the workflow with this tool is to go through the error groups one by one. The web application will generate a form
16:03
with inputs to help Béatrice decide. In this example, we have a value for partner. A partner is a trade partner, so it's a geopolitical entity. Here it's in French: 'Île de Ceylan'. We use English-based vocabularies
16:21
to name the partners in the rest of the data, so we need to decide what 'Île de Ceylan' is in our vocabulary. And this is where we have a search input, where Béatrice can search for Ceylan, which is listed in our vocabulary with the French name in parentheses. And once she chose that,
16:41
the tool will correct this column and put the data in the right place, to make sure we translate 'Île de Ceylan' to the vocabulary entry. At the end, once she has gone through the whole process, we have a summary here explaining
17:00
all the corrections she made. In the first line, for instance, a year was misspelled; that kind of thing. We change the source name and everything is fixed. So once all the errors have been corrected through the data form I just showed you,
17:22
then she can move on to the last step, which is to publish this new data, which is now valid (we know it's valid because we control it with the data package), into the GitHub repository.
17:41
And this is basically how the React web application prepares the data (I could go into details about what that means later) and makes it possible for Béatrice to take the right decisions
18:02
to adapt the raw data into the data package we worked with. So I have a little bit more time. So this is the analysis. Maybe I can try to demo a little bit the tool live.
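The error-grouping idea mentioned earlier, correcting a whole group of rows such as the 169 rows sharing one invalid source value with a single edit, can be sketched in plain Python; the column names and values here are hypothetical, and the real tool does this in JavaScript:

```python
from collections import defaultdict

def group_errors(errors):
    """Group validation errors by (column, invalid value) so that one
    correction can be applied to every affected row at once."""
    groups = defaultdict(list)
    for row_idx, column, bad_value in errors:
        groups[(column, bad_value)].append(row_idx)
    return groups

def apply_fix(rows, group_rows, column, new_value):
    # One edit fixes every row in the group.
    for idx in group_rows:
        rows[idx][column] = new_value

rows = [{"partner": "Ile de Ceylan"},
        {"partner": "Ile de Ceylan"},
        {"partner": "Australie"}]
errors = [(0, "partner", "Ile de Ceylan"),
          (1, "partner", "Ile de Ceylan"),
          (2, "partner", "Australie")]
groups = group_errors(errors)
# Correcting the group once updates both affected rows.
apply_fix(rows, groups[("partner", "Ile de Ceylan")], "partner", "Ceylon")
```

This is why the curation form shows one input per error group rather than one per row: the number of decisions Béatrice has to make scales with the number of distinct invalid values, not with the number of rows.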
18:25
So the very important thing is that it's a serverless web application. Here it's localhost on my laptop, but actually it's hosted on GitHub directly. So what is the... Oh, yeah, the médialab. Actually, a lot of this work has been done
18:41
by my previous employer. So congrats to them, too, because they contributed to that work, too, a lot. So, ta-da. Okay, this is a tool. It's hosted on github.io,
19:01
because it doesn't need any server. The whole login process with GitHub is done through a personal access token, a specific long key that you generate in your GitHub account once and then use as a login mechanism. So this is what it looks like. Once I am logged in,
19:22
I can fetch the data from GitHub to make sure I have the last version of the data before adding new things. Then here I can prepare the little file here, which normally should have some errors. So the first thing here you see, like, this green message here on the bottom
19:41
says that this CSV file is valid structure-wise, meaning the columns are good, which is a good first step. And then these are all the errors I have in my file. This is a nice step,
20:01
because you want an overview of what kind of mess you are in before starting the process; if there are far too many errors, maybe you want to do it later, or ask for help. Once you've seen that, you can start. This is basically all the things we have to do, and this is the first one.
20:21
I can move to the next error, or go back, even though I haven't corrected it yet. And here I say, like, okay, so the value, commercial A, I don't know, whatever. This character is not actually a unit, because the unit should be an integer.
20:41
Yeah, it's true. So it's better with, oops, sorry, it's better with a one. And I can confirm this fix. And now we're good: unit is one. Now I move to the next one. You see here, I am at two of nine. The whole process tries to be as smooth as we can make it, so as soon as I fix it,
21:00
So here I have a year written out in French, 'mille huit cent trente-huit', but we want that to be an integer; I don't want the spelled-out version of the year. So, 1838 as a number, and as soon as I confirm the fix, I move to the next error. So, we try to make that process
21:20
as seamless as possible. So here I have a source. So this is a foreign key. So in the source, in the data table, the source is actually, it's a key that is referring, which is referring to the source table. And say, like, so here basically, foreign key source violation. So it means that this sort doesn't exist in our system. So here I have two choices.
21:43
Okay, normally I should... oh yeah, sorry. Trinidad, no. So, whatever. I can search for it and find it, in which case
22:00
the edit will only replace the key. Or I can create a new item. Here you can see I'm creating a new source, because sometimes the source doesn't exist yet. You see this form is much, much longer, because here I'm creating a new line in the source table. I will not do that because it's too long.
22:20
I will just... give me something, please. And that will make it, okay. And so on and so forth. Again, we have an issue, sorry, this example is a little bit off. Okay, here it's Trinidad and Tobago, a geopolitical entity. It's misspelled.
22:40
Trinidad and Tobago, without the extra letter. And we're good. 'Australie', with an 'e' at the end, is not correct; it's Australia. Yep.
23:01
Sorry, it's very long. Yeah, whatever. Dollar. Let's say it's a scrap. Don't do that, right? Okay. Ah.
23:21
Ah. Okay, so you see, an important point: we are based on the Data Package specification, using foreign keys and so on, but we had to add specific forms for our case. So the application here is not generic. You can't just plug in a new data package
23:41
of your own with your data; it will not work, because for UX and UI reasons we had to make specific cases where the forms are not exactly what the data package describes. It was too complex to make it fully generic, but with more work that could be achieved
24:01
maybe at some point. And the talk from Evgeny this afternoon will touch a little bit on that kind of thing. So here we are; I'll stop the demo here. Just to finish: why do we do all this? Because we want to analyze trade
24:21
in the long run. We have lots of trade values, as you can see, and lots of trading entities, really too many. This is a visual documentation where we depict the different kinds of sources we use in the database. And at the end we try to do something like
24:41
that, where here we have the trade of the world in 1890. Each node here, each circle, is a geopolitical entity, and the links between them are the trade of that year. So it's total trade.
25:01
We could choose import or export here; I just summed them up. The important part is that the color is based on the type of geopolitical entity we have. This orange-yellow color is what we call GPH entities, that is,
25:20
geopolitical entities we know. Mainly countries but not always. In green those are colonial areas. So it's not a colony. It's not a country which is a colony. It's like French Africa. It's like, we know it's a colony, but we don't know which one exactly. Like here for instance we have European Russia,
25:41
which is a part of Russia, its European part. This is what we find in the archive books, so we can't really decide what it means exactly. So we have this gap: very heterogeneous data, very difficult to interpret, but we still try to do a quantification
26:03
and analyze it, with networks like this, on top of this very complex and rich data set. I think I'm done with what I wanted to share with you today. We can move to questions if you have any.
26:30
Yes, please. I had a title which was Development of a Specific Web Application to Integrate New Data.
26:40
What was the conversation with your historian like? How did this happen? How did you plan it? Yeah, so the question is how Béatrice and I ended up
27:01
deciding to do that. Basically, the whole point is collaboration, because we worked without that tool for a very long time, and the process was that we had to meet in the same room. I was writing the scripts, checking the data, editing the data, because editing data in a spreadsheet
27:20
in a way that doesn't mess up your numbers and everything is actually not easy. We were working together on that. It was necessary, and actually very nice, because we had to exchange: she was explaining to me why she was taking a given decision or not, and I was just entering the data. But at some point,
27:41
we had so much more data to add that this process basically couldn't scale. So we had to find something else, so that she could do the process on her own and I would intervene once the data is in the GitHub repository, checking it myself with quantification
28:01
and scripts and everything, because you always need to check everything many times, and then it makes the whole process much smoother. Yes?
28:32
Sorry, I don't get it. Can you rephrase it?
28:46
So the question is whether it would be beneficial or possible to commit the data to GitHub before checking it. Yes and no. The reason we don't do that
29:01
is, first, that I need Béatrice to take the decisions documenting the raw data, to make it compatible with all the nice visualizations I showed you. She needs to take those decisions; she needs to do it. That's why we put the data into GitHub after she has done this work
29:22
of data curation. We could host the data as a raw file first and do the rest later, and that kind of thing, but we would still need a web interface that lets Béatrice, the historian, take the decisions.
29:41
So, no. Any more? Yep. Yes, so this tool I'm using here is actually brand new.
30:00
It's Gephi, but on the web. We are working on this with my company OuestWare, and we are very close to releasing it. It's basically the same thing as Gephi, but a lighter, web version. It's already there, but you shouldn't go yet, because it's not live yet.