Revamping OpenRefine
Formal Metadata

Title: Revamping OpenRefine
Number of Parts: 490
License: CC Attribution 2.0 Belgium: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/46925 (DOI)
Transcript: English (auto-generated)
00:12
Okay, I think we can start now. So hi everyone, my name is Antonin Delpeuch, and I work on OpenRefine.
00:20
So I'm going to start with the usual show of hands: who has used OpenRefine before? Wow, pretty good. Okay, but I was expecting people not to be completely familiar with the tool, so I'm
00:44
going to do a quick demo just to make sure you have an idea of what the tool looks like. And then I want to tell you a bit about what we're trying to do with this tool, how we want to improve it and revamp it, and hopefully you'll also have some ideas about how
01:01
we can make that better. So let me start with a quick demo. Can you see all right? Yeah, I think it's not too bad. So OpenRefine is what people tend to call an extract, transform, load system. So basically the idea is you have data in some format, in some data store, and you
01:21
want to load it into your system, transform it, massage it into a different format, very often fix some issues in the data, and then push it to another format, to another database that has other constraints on the data. And it's a web-based tool, but it runs locally, so you need to install it, and all
01:43
your data is on your computer. It accepts all sorts of input formats. I'm just going to use a CSV here to show you how it works. So I just put my CSV in the tool. Oh, okay, it's not really meant to be used with that much zoom.
02:03
I might try to zoom out a bit. I can create a project with this data, and this is what it looks like. So this dataset is about filming locations in Paris. So basically every time you want to shoot a film in Paris, you have to ask for permission, and they keep a record of that so you can then get this information, and this is what
02:23
it is about. So you have the title of the movie, the director, and then all sorts of other things about this film. And I'll just show you a few things you can do in the tool to clean up this data before transferring it to another data store.
02:41
So for instance, a very popular feature of the tool is what we call clustering. So if you take this director column here, you can say cluster and edit. And the idea here is that it looks at all the values in this column, and it's going to look for things that could be duplicates.
03:01
And you have various ways to do that. You have a fingerprinting method, which is the default one here, but you can also try to look for things that are near each other in terms of edit distance. So if you play a bit with the settings, you can discover values in your dataset that might mean the same thing, and which could be errors in the dataset.
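To make that concrete, here is a minimal sketch of the fingerprint keying idea in Java (hypothetical code, not the tool's actual implementation): two values that normalize to the same key are proposed as one cluster.

```java
import java.util.Arrays;
import java.util.TreeSet;

// A minimal sketch of fingerprint keying: values that normalize to the
// same key are proposed together as a cluster of probable duplicates.
public class FingerprintSketch {
    static String fingerprint(String value) {
        String normalized = value.trim()
            .toLowerCase()
            .replaceAll("\\p{Punct}", "");  // drop punctuation
        // sort and deduplicate tokens so word order and repeats don't matter
        TreeSet<String> tokens = new TreeSet<>(Arrays.asList(normalized.split("\\s+")));
        return String.join(" ", tokens);
    }

    public static void main(String[] args) {
        // both normalize to "spielberg steven", so they cluster together
        System.out.println(fingerprint("Spielberg, Steven"));
        System.out.println(fingerprint("Steven  Spielberg"));
    }
}
```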
03:22
So here you can review these. So you can guess, OK, these are probably the same thing, so I want to merge them to this value. And then maybe this one is not actually a true duplicate. Maybe these are two different people, so I want to leave that one out and merge the rest.
03:41
And then you can click here, and it just does the replacement in the entire table. And the idea is that any operation you apply is applied uniformly to all rows in your dataset. So it's a bit more principled than Excel, which is very often the tool people come from before using OpenRefine.
04:04
So this is one thing you can do. Another sort of thing that I really like is called reconciliation. So you have this column here of titles, and if you say that you want to reconcile it, you can select another database online that you want to match this column against.
04:24
So you have names, and you want to get unique identifiers for these films. For instance, you could want to take IMDb IDs or all sorts of other identifiers for these films. And so here you can just pick the database you want to match against.
04:41
So I could use some of the ones I have here, or I could add another service. If a database implements the API that OpenRefine expects to do this matching, you can use that. You just need to add the address of the database here. So it's really user-friendly. You just need to know one URL, and then you can do data matching against that database.
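As a rough illustration, a single reconciliation query could be sent like this (a hedged sketch: the wikidata.reconci.link endpoint and the batch key q0 are assumptions; Q11424 is Wikidata's "film" type and P57 its "director" property).

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of one reconciliation query against a service that
// implements the API OpenRefine expects.
public class ReconSketch {
    public static void main(String[] args) throws Exception {
        String queries = """
            {"q0": {"query": "Inception",
                    "type": "Q11424",
                    "properties": [{"pid": "P57", "v": "Christopher Nolan"}]}}""";
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("https://wikidata.reconci.link/en/api"))
            .header("Content-Type", "application/x-www-form-urlencoded")
            .POST(HttpRequest.BodyPublishers.ofString(
                "queries=" + URLEncoder.encode(queries, StandardCharsets.UTF_8)))
            .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        // The service answers with ranked candidates per query key, e.g.
        // {"q0": {"result": [{"id": "Q25188", "name": "Inception",
        //                     "score": 100.0, "match": true}, ...]}}
        System.out.println(response.body());
    }
}
```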
05:06
So I'm just going to use Wikidata here. Wikidata is this knowledge base created by the Wikimedia Foundation, and it has data about all sorts of topics, including films. So this is what it looks like.
05:21
You can configure the matching process in many ways. You can select a type that you want to restrict the search to, so you know that these are films, so you only want to match these with entities about films. And then you can also use other columns from the table to refine the matching.
05:41
So for instance, you can say that the director of the movie is actually a very good indicator that you want to keep. So you can say, I want to match that to the director property in Wikidata. And so that's going to do fuzzy matching, not just on names, not just in titles of
06:00
movies, but also including the director name. And I'm not going to do that here because it would take quite a while for these 3,000 rows. So I've already done it before. I'm just going to show you what it looks like. So this is what it looks like afterwards. For each cell, you have candidates, which are entities from the target database that
06:25
they could correspond to. So you can just review these manually, or you can also use some heuristics to make that reviewing a bit more principled and also a bit more time efficient. So if I click on this, for instance, I get to the Wikidata item for this film.
06:56
So these are two interesting features that you can use to do data cleaning.
07:01
And one thing I haven't shown you so far is the facets on the left hand side. So these are a bit like in a search engine when you have summaries of the values in some columns, and you can use these to filter down the rows you can see to a particular subset that you are interested in.
07:22
So by clicking on this matched value here, I selected all the rows where the matching was reasonably confident, and so we already have valid links to Wikidata in these cells. I don't have to review the candidates here. And now with this filter applied on the left hand side, every operation I do on the table
07:45
is only going to be applied to these rows. So that lets you build conditional workflows, which can be quite advanced. You can combine facets together to say, okay, if this value is in this range, and if that
08:01
string is equal to that, then I want to do this operation. And all this is completely visual; it's just this very simple UI that people can get accustomed to quite easily.
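Conceptually, it works like this (a minimal sketch with hypothetical types, not OpenRefine's actual API): the operation only touches the rows that pass every active facet filter.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A minimal sketch of the facet idea: an operation is applied only to the
// rows selected by the currently active filters.
public class FacetSketch {
    record Row(Map<String, String> cells) {}

    public static void main(String[] args) {
        List<Row> rows = List.of(
            new Row(new HashMap<>(Map.of("judgment", "matched", "status", ""))),
            new Row(new HashMap<>(Map.of("judgment", "none", "status", ""))));

        rows.stream()
            .filter(r -> "matched".equals(r.cells().get("judgment"))) // facet selection
            .forEach(r -> r.cells().put("status", "reviewed"));       // the operation

        rows.forEach(r -> System.out.println(r.cells()));
    }
}
```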
08:23
So for instance, one thing I could do here is just take this column and fetch more information from the database and add it to my local dataset. So Wikidata stores all sorts of information about these films, and I can fetch that just by clicking on the attributes I like. So for instance, the genre of the film, I can fetch that, or the IMDb ID. So this is just a preview of what you would get in the table.
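The request behind this data-extension step is roughly the following shape (a hedged sketch; on Wikidata, P136 is "genre" and P345 is "IMDb ID").

```java
public class ExtendSketch {
    public static void main(String[] args) {
        // Hypothetical sketch of a data-extension request body: for already
        // matched ids, ask the service for extra property values. It is
        // POSTed as extend=<json> to the same reconciliation endpoint, and
        // the reply contains one row of values per id.
        String extendQuery = """
            {"ids": ["Q25188"],
             "properties": [{"id": "P136"}, {"id": "P345"}]}""";
        System.out.println(extendQuery);
    }
}
```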
08:42
I can again press OK, and this is going to do it for the entire table. You can see it's a process that would take a little while. So that's basically a very, very broad overview of what the tool does.
09:01
It's really popular in many communities: we have a lot of journalists using it, librarians, researchers in a lot of different fields, digital humanities quite a lot, the Wikimedia movement, and of course many people we don't know about,
09:23
because the tool is just open source and you can use it wherever. So let me tell you a bit now about what we do on this tool, the work we're trying to do. So it's a project that started off 10 years ago now.
09:42
It was initially called Freebase Gridworks because it was made by the company who ran Freebase, which was a sort of crowdsourced knowledge graph, sort of structured Wikipedia in a sense. And very quickly the company got bought up by Google, so the tool got renamed into Google Refine.
10:04
And after a while, Google decided to stop running Freebase, so they also stopped supporting the tool. And they therefore converted the tool into a GitHub project. It was already open source, but then they sort of gave the product to the community and renamed it to remove the Google branding.
10:22
And since then, the project ran like that as a GitHub project, without much structure around it. And just last year, we finally joined a fiscal sponsor to actually provide some structure around the project and also to manage funding around it.
10:42
So what I want to stress is that the success of the tool is mostly inherited from the first few years of investment, when it was supported by really big software houses with professional software engineers and really clever people building it.
11:02
And not that much happened afterwards, so we still have the challenge of taking this tool, this heritage that we have, and turning it into a viable open source project that can run on its own, really. Because quite a lot of things have been done since 2013, of course,
11:24
but comparatively, we're still resting on the success of the initial tool, really. So what can we do to attract contributors? What can we do to make this a sustainable project? These are very standard recipes that apply to a lot of projects, of course.
11:44
So we tried to reach out to neighboring communities. So because the tool was originally built for Freebase, we migrated it to Wikidata to be able to tap into the community of the entire Wikimedia movement. So this is just a workshop in Amsterdam
12:02
where we trained people to use OpenRefine for that. We also had a grant from the Google News Initiative to improve OpenRefine for data journalism, and that was also very useful. We had meetings with people from newsrooms in the US to understand better what their needs are.
12:21
We improved the localization of the tool quite a lot, also by making it easier for people to contribute translations. And for that, we used a tool called Weblate. It's a web interface for contributing translations, and it's really, really good. It creates this sort of engaging page where you can showcase the translation effort to contributors.
12:41
And in our experience, it really brought a lot of contributors to the project. And people really feel like they own the project: they don't just contribute a few translations without feeling part of the core team. They really get involved quite heavily. So I can't recommend it highly enough.
13:01
We also started a W3C community group to standardize the API that underpins the reconciliation feature I showed you. So to have OpenRefine talk to databases to do this matching, we use an API which was designed by the initial designers of the tool.
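For illustration, a service describes itself to OpenRefine with a small manifest roughly like this (a hedged sketch in the spirit of the spec drafts; check the published spec for the authoritative field list).

```java
public class ManifestSketch {
    public static void main(String[] args) {
        // Sketch of a service manifest: the one JSON document served at the
        // service URL that tells the client how to talk to the database and
        // how to turn a matched id back into a web page.
        String manifest = """
            {"versions": ["0.2"],
             "name": "Wikidata (en)",
             "identifierSpace": "http://www.wikidata.org/entity/",
             "schemaSpace": "http://www.wikidata.org/prop/direct/",
             "view": {"url": "https://www.wikidata.org/wiki/{{id}}"}}""";
        System.out.println(manifest);
    }
}
```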
13:21
And it was not well documented. You basically had to reverse engineer the tool every time you wanted to implement it for a new database. So we're trying to bring that to a better level and document it properly. I can just show you a quick demo of what the specs look like now.
13:42
It looks like properly nice W3C specs. And a lot of people who got involved in the group also started contributing to the tool itself, so it's great. And we created a steering committee of high-profile people around the project who know the ecosystem very well
14:01
and can help us find vendors and projects to partner with, and other things like that. So that's really, really early stages, but it's already very useful. And this year we want to apply to Google Summer of Code and Outreachy.
14:20
And the deadlines are just about now. So if you're also thinking about that, it's time to rush a bit. So we're not really sure what to expect from that, but hopefully it's going to bring us nice contributors. We've also done quite a lot of things on the technical side: we tried to revamp the architecture of the tool, because it's quite old
14:41
and sometimes the age of the tool is felt in the development processes. So we had to migrate the build system and get rid of non-free dependencies, which, as was mentioned earlier, we did together with the data packages integration in the same go.
15:02
And all sorts of other things. We still have a very old web framework that we rely on, which is completely unmaintained; we're probably the only users left. It's really crazy. So we still have to migrate out of that.
15:20
And in 2020, we have pretty exciting plans. So we want to migrate the data processing back-end to Spark. So at the moment, all the data in a project is held in memory. So it makes it really hard to scale the tool to larger data sets.
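The direction could look roughly like this (a hedged sketch, not the actual migration code): rows become a lazily evaluated, partitioned dataset instead of an in-memory list.

```java
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Hypothetical sketch of backing a project with a Spark dataset, so rows
// are processed lazily and in partitions rather than held in memory.
public class SparkSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("refine-sketch").master("local[*]").getOrCreate();
        // "locations.csv" is a stand-in for a project's data
        Dataset<Row> project = spark.read().option("header", "true")
            .csv("locations.csv");
        // a facet-like filter compiles to a lazy, distributed computation
        long matched = project.filter(col("Director").isNotNull()).count();
        System.out.println(matched);
        spark.stop();
    }
}
```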
15:41
And it's a big blocker for a lot of users. So we run regular user surveys to check on the needs of our users. And the main issue people have is scalability. So we want to improve that. And we also thought it would be good to have documentation
16:01
because for now we only have a GitHub Wiki, which is better than nothing, but really, really not up to the standards that it should be. So we're trying to work on that. It's also early stages and we're still not sure what framework we should use for that. So if you have any idea about what sort of documentation platform we can use,
16:23
get in touch with us. We're still not completely sure. And that work is supported by a grant from CZI, from the Essential Open Source Software for Science program, which started a few months ago. And they have another funding cycle at the moment. So if you think your project could be eligible,
16:42
there's still a few days to apply to the second funding cycle in this grant program. So do give it a go, because it's really worth it. And I have quite a lot of open questions about how we can better take care of this really, really nice heritage we have
17:00
of this really interesting project. So because we're trying to do all these changes to revamp the architecture, we have to break quite a lot of integrations with extensions because OpenRefine has an extension mechanism so you can add new features on your own.
17:21
And revamping the stack means breaking compatibility with these extensions. How can we do that in a better way? How can we make sure we're not putting too much strain on the developers of these extensions? If you have any idea, let us know. And also a very, very important question I think is
17:43
how do you manage which issues you tackle in the core team and which ones do you leave sort of open as potential hooks to bring new contributors in? Because of course we can't do everything; some issues are strategic, and for others, if you don't work on them,
18:01
people will come and do them. But sometimes it's quite hard to tell which ones are which. And also if you're familiar with the tool and have any ideas about features that are absolute blockers or things we should do better, do let us know, because we're really trying to get that right.
18:23
And that's it for me. So I think we probably have time for questions now.
18:47
So the question is about adding new fuzzy matching algorithms in the clustering feature that I showed. So recently we made that part of the tool extensible. So from now on extensions can define
19:03
new algorithms like this. So you don't have to patch the tool itself directly. And we're really keen to have more in the tool itself. I mean there's no problem with that. It's just not clear for us which ones are needed. We don't have much feedback about what is lacking.
19:21
So yeah, if you have any particular example in mind I'm keen to hear about that. Can you talk about how to have more contributors? My question is like, do you have in your organization something like a community manager
19:41
or a developer evangelist? Someone whose job would be to help people get into the code or turn issues into pull requests? So the question is, do we have someone in charge in the team to do developer evangelism and bring in new contributors?
20:02
Right now not really. Although some people do a lot of work in this regard. We're still a very small team and no one is actually working permanently on the tool so far because we've had some grants but nothing permanent. So I sort of see it a bit as my duty
20:23
as maintainer to ease the onboarding of new contributors. It'd be great to have someone dedicated to that. The feature that I would like to see
20:42
is that when I'm using OpenRefine, I'm constantly exporting data to CSV and then checking it into Git to preserve history. I know OpenRefine has its own history, so it also relates to Frictionless Data and to what I talked about. The integration of all this workflow
21:01
would be very interesting to have. To be able, in a single click, just to commit your history to Git. Yes, so the question is can we... You want to track the changes as the data changes, because data is precious. You can easily make mistakes here. You've got some safety within this stack,
21:22
but it's always better to add something. Right, so if I understand correctly, the question is how can we integrate the embedded history that we have in OpenRefine with external tools like Git or other pipeline systems. We're also really keen to work on this. Actually, some work has started this year already
21:41
to make that easier. Just to make sure people understand: this history, which is basically the list of operations you've applied, can be represented declaratively as a JSON blob, which can then be reapplied to other projects that have the same structure. It already gives you some reproducibility
22:01
for your workflows. It's really popular with people who come from Excel, who don't have any way to do that in a simple way. Basically you're programming without knowing it, because you're just doing things graphically, and this gives you the workflow at the end.
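For illustration, an extracted history is a JSON array of operations roughly like this (a hedged sketch; field names follow OpenRefine's exported JSON, but details vary between versions).

```java
public class HistorySketch {
    public static void main(String[] args) {
        // Sketch of an extracted operation history: reapplying this blob on
        // another project with a "Director" column replays the same edit.
        String history = """
            [{"op": "core/mass-edit",
              "columnName": "Director",
              "expression": "value",
              "edits": [{"from": ["Luc Besson "], "to": "Luc Besson"}],
              "description": "Mass edit cells in column Director"}]""";
        System.out.println(history);
    }
}
```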
22:21
What we're trying to do is expose the operations in OpenRefine as a Java library that you could reuse in other settings. That would be a start, because at the moment it's really hard to do even that. Then we'd like to have integrations with as many other tools as possible. We've been thinking about adding support
22:40
for other expression languages so you could drop to R or drop to another language in the middle of your transformation and that would give you some of that. It's still not quite clear to me what it would look like in practice. Martin, do we see users turning into contributors?
23:11
Yes, I'm one of them, really. I started out just needing the tool and got roped into working on it gradually. Yes, I think that's a natural route
23:22
although it's quite hard, because it appeals mostly to non-technical users, people who don't feel like they're developers or that they could contribute to the tool. But still, it happens. It's all about getting the message out that you can contribute even if you're not a Java programmer, and that you have many ways to do that.
23:43
Yes, it's happening slowly. I'm curious why you think that the wiki is not proper documentation, and what advantages do you see in other systems? The question is why GitHub wikis are
24:00
not appropriate for documentation. One thing I'd love to have is localization of the documentation, because we have a very diverse user base and very often English is a hurdle. Also versioning: to be able to say, okay, this is the documentation for 3.0, and 4.0 has different documentation.
24:22
And also just the layout is a bit weird and not very easy to deal with. It's not super easy to read. Yeah, things like that. There are a lot of features in documentation systems that you don't have in GitHub wikis. My question is about licensing,
24:42
propagating license information, and it also applies to Frictionless Data. So you've got a dataset, maybe with some parts covered by a license, and you modify them. It would be really, really useful to track those license obligations,
25:01
and record the contributors, so that downstream users can then cite it properly. Right, that would be great. So the question is can we add provenance tracking for licensing in the tool? I'm sort of working on this this year, by first making it possible to check
25:21
which columns a particular column was derived from. And that should hopefully make that sort of thing possible in the future. But it's still very much a way off.