
Data-Transformation on Historical Data Using the RDF Data Cube Vocabulary


Formal Metadata

Title
Data-Transformation on Historical Data Using the RDF Data Cube Vocabulary
Number of Parts
16
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
This work describes how XML-based TEI documents containing statistical data can be normalized, converted, and enriched using the RDF Data Cube Vocabulary. In particular, we focus on a real-world statistical data set, namely the statistics of the German Reich around the year 1880, which is available in the TEI format. The data is embedded in complex structured tables, which are relatively easy for humans to understand but, due to their varying structural properties and differing table layouts, are not suitable for automated processing and data analysis without heavy pre-processing. Therefore, the complex structured tables must be validated, modified, and transformed until they fit the standardized multi-dimensional data structure, the data cube. This work focuses especially on the transformations necessary to normalize the structure of the tables. Validation and cleaning steps, resolving row and column spans, and reordering slices are among the available transformations. By combining existing transformations, compound operators are implemented which can handle specific, complex problems. The identification of structural similarities or properties can be used to automatically suggest sequences of transformations. A second focus is on the advantages of using the RDF Data Cube Vocabulary. In addition, a research prototype was implemented to execute the workflow and convert the statistical data into data cubes.
Transcript: English (auto-generated)
Thank you for the introduction. My name is Sebastian. Today I'm going to talk about three things: first, a data transformation process; second, the historical data we are working on; and third, the vocabulary we are using in this context.
Here is the outline: first I'm going to motivate this work and give you a little insight into why we are doing this in the project. Secondly, I'm going to talk about the vocabulary in more detail and a little bit about the dataset, so you can actually see what the data looks like. Then I'm going to define the problem setting, which is determined by the properties of the dataset, and I'm going to show you our approach and how the workflow was implemented.
And finally, I will conclude with the contributions.
So what motivates our work? First of all, we are working on a statistical and historical data source, the statistics of the German Reich, which we have access to in a digitized format. This is data about ships traveling in the 1880s, for example, or about the import and export of goods in the German Reich. We want to access this data and know what is in there, and we are doing all of this in the context of the E-XS project.
So we want to access the data, recommend it as cultural long-tail data, and do some analytics on it, like visualizations or more sophisticated analyses. But before we can do this, we need a process to access the data and actually do data integration, and therefore we have to do heavy data cleaning, transformation, and data fusion. So far this is a theoretical approach; to make it more practical, I'll show you some of the data.
It's statistical data, so you see we have lots of numerical values, and we also have labels which tell us what this data is actually about. On the left you can see, for example, a geographic hierarchy, and at the top you can already see that these tables are not too simple, so there could be some problems with how to integrate and analyze this data. The target structure of our data looks like this: you have seen the source, and this is where we want to go. Naively you would say, I look at a numerical fact, collect every label which is referenced by this value, and reorder everything into a nice table, and then I have a normalized version on which to do the data analysis. So here, for example, you have your values: in the first column there are only the numerical facts, and then you have the geographical information and the time information, in a very normalized and approachable way.
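To make that target structure concrete, here is a minimal sketch in Java of how one such normalized row could be modeled; the field names are illustrative assumptions, not taken from the prototype.

```java
// A minimal sketch: one row of the normalized target table, holding a
// numeric fact together with every dimension label that describes it.
public record Observation(
        String region,   // geographic dimension, e.g. a place from the hierarchy
        String period,   // time dimension, e.g. "1880"
        String category, // what the fact refers to, e.g. a type of goods
        double value     // the numeric fact itself
) {}
```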
But before we go into the actual transformation, let's talk about data cubes and the vocabulary. A data cube is a multidimensional data structure known from data warehousing, and it is specifically built for analytical operations. Maybe you want to do aggregations, combine data from different years, drill into the data, or get a more aggregated view. Take the sales cube over here: your numerical facts are, for example, your sales in a specific year, and the dimensions define what this numerical value means; you need to know which client you sold your product to, what the product was, and when you sold it. So you have a key-value relationship between all the dimension values and your fact value. This is only the visualization; you can have more than three dimensions, but you can keep this picture in your head, because this is the representation we are targeting.
And what is the RDF Data Cube Vocabulary? It is, of course, an RDF-based vocabulary which models this data cube in RDF, so you can reuse these concepts in the RDF world. This also brings us interlinking, disambiguation, and lots of other advantages you don't have in classical data warehousing, so you can profit from this concept here. The vocabulary defines what dimensions and measures are, and you can construct hierarchies, but for now, when we talk about the data cube, you can simply think about the ordered, structured table I've shown you before.
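As a rough illustration of what the vocabulary looks like in practice, the following Apache Jena sketch builds a single qb:Observation; the example namespace, dimension property names, and values are placeholders, not the talk's actual dataset.

```java
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.vocabulary.RDF;

public class CubeSketch {
    public static void main(String[] args) {
        String QB = "http://purl.org/linked-data/cube#";   // RDF Data Cube namespace
        String EX = "http://example.org/reich-stats/";     // placeholder namespace

        Model model = ModelFactory.createDefaultModel();
        model.setNsPrefix("qb", QB);
        model.setNsPrefix("ex", EX);

        // One observation: a numeric fact plus the dimension values that identify it.
        Resource obs = model.createResource(EX + "obs1")
                .addProperty(RDF.type, model.createResource(QB + "Observation"))
                .addProperty(model.createProperty(EX, "refPeriod"), "1880")
                .addProperty(model.createProperty(EX, "region"), "Example region")
                .addProperty(model.createProperty(EX, "goods"), "Example product");
        obs.addLiteral(model.createProperty(EX, "value"), 42.0);

        model.write(System.out, "TURTLE");   // serialize for inspection
    }
}
```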
Again, an example. This one, you would say, doesn't look too difficult to interpret: you just take your numerical values, put them in a column, and see which dimensions describe your values. But it gets more difficult. This, for example, is the one you've seen before. There are lots of missing values, hierarchies on one side, and very complex, interdependent dimensions. At the top, for example, these dimensions actually form a sentence, and you can't just reorder them, because then you would lose the context of the cells. They highly depend on each other, so it's not just "okay, this is this value, I just reorder it", because afterwards you don't have that structure anymore.
Luckily, we don't have to work on images of this data; we have the data in the TEI format, which is an XML-based format, and we also have annotations on it: we know what is data and what is a label, which helps in the conversion process.
This looks like a very simple example, but it actually has problems too. Let's first have a look at the data. Here, for example, you see a row which contains a sum; it sums up all values in the rows above it. At first this doesn't look bad, but in the target structure you can't have it: all the data has to be at the same granularity, and when you later aggregate the data yourself, the sum must not already be in there, because otherwise your computation wouldn't be correct anymore. Let's look at another sample. Here are repeat symbols, so the data depends on the cell above, and if you do your reordering, this reference is broken. And here we have a composition problem: the files are organized so that one document corresponds to one page of the book, but the tables are larger than one page, so we have to concatenate lots of these pages. In the books there are references, and headers or chapter titles are repeated just so you know where you are, but when you integrate the data, you have to resolve all these dependencies inside the table. And there are lots more issues like this.
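To give an idea of how fine-granular such a fix can be, here is a minimal sketch, not the prototype's actual code, of a transformation that resolves repeat symbols by copying down the value from the cell above; the marker string is an assumption.

```java
import java.util.List;

// Hypothetical transformation: wherever a cell contains a repeat symbol,
// copy the value from the cell directly above it, so each row becomes
// self-contained before any reordering breaks the reference.
public final class ResolveRepeatSymbols {

    private static final String REPEAT_MARKER = "\"";   // assumed repeat symbol

    public static void apply(List<List<String>> table) {
        for (int row = 1; row < table.size(); row++) {
            List<String> current = table.get(row);
            List<String> previous = table.get(row - 1);
            for (int col = 0; col < current.size() && col < previous.size(); col++) {
                if (REPEAT_MARKER.equals(current.get(col))) {
                    current.set(col, previous.get(col));
                }
            }
        }
    }
}
```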
There are lots of very small structural properties which have to be resolved before they break anything in your final target, and these are not even all the possible mistakes or errors in this data; this is a very simple table. So an algorithm that converts this to the target structure in one go could get very, very complex, because it would have to consider every one of these problems. And not only the data itself can be problematic, but also the structure of the tables: labels could be wrong, structural dependencies could be wrong, there are row and column spans which have to be resolved, and if there is even a minimal error in your source data, it makes your table crooked and all your span resolution would be wrong. So you have to do a lot of checking, not only of the data itself, but also of the structure.
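As an example of such a structural check, here is a small sketch, again not the prototype's code, that verifies the table is rectangular after the spans have been expanded, so a shifted row is caught before any reordering.

```java
import java.util.List;

// Hypothetical sanity check: after row and column spans are expanded,
// every row must have the same number of cells; a single missing or
// surplus cell would shift all later columns and corrupt the mapping
// between facts and dimension labels.
public final class CheckRectangularShape {

    public static void validate(List<List<String>> table) {
        if (table.isEmpty()) {
            return;
        }
        int expected = table.get(0).size();
        for (int row = 1; row < table.size(); row++) {
            int actual = table.get(row).size();
            if (actual != expected) {
                throw new IllegalStateException(
                        "Row " + row + " has " + actual + " cells, expected " + expected);
            }
        }
    }
}
```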
So let's define the problem setting. Our data is spread over multiple files, where one file corresponds to one page in the actual book. In this structure it is unusable for sophisticated data analysis: you can't apply standard algorithms to it without heavy normalization first, so you have to resolve this complex structure into a normalized one. There is also dirty and faulty data, and you have to check the structure and the annotations, because they might introduce further errors. The dataset is also very large, and there are lots of really small problems which add up to a really high amount; you don't know in advance which of them will occur in a given table or how they relate to each other. And you have to resolve them in a specific order, because fixing one problem could generate two new ones or could break the resolution of another.
So how can we approach this? First of all, we have chosen the RDF Data Cube Vocabulary as our target format, because it gives us this normalized view of the data, which can serve as input for data analysis; we can also interlink existing resources, and we are able to merge data that comes from different data sets and feed it into more sophisticated analytics.
The next point is that we've developed an incremental workflow. This means we are not trying to look at a table, build one huge algorithm which fixes all of its problems, then get to the next data set, which is about some other data, and do the same thing again. The approach is to build a set of very fine-granular transformations, where one algorithm fixes exactly one problem in this data set or in this table. So you have a large set of these transformations, and the next step is to decide how to order them and in which sequence to apply them to the data set to get to the normalized version. For this, a prototype was implemented which can do exactly that, and it also has a not very beautiful, but functional, graphical user interface. You can select and configure which transformations should be applied to the input data, and you get an HTML preview of every step you have done. So you see, okay, this row probably must be deleted; you select the corresponding transformation, click on the row, and it is deleted. In the next step you see that this row is gone and there is another problem you can fix, and iteratively you work on the table until all problems, or hopefully all problems, are fixed and you can export it.
So what does the actual workflow look like? First of all, the prototype is capable of loading a complete link group.
A link group is the set of all files that belong to one logical table, which might be split over several pages, so we load them all at once and parse all the files into an internal data format. From then on we no longer depend on the TEI format and do all the transformations on Java objects, so the prototype is also capable of importing other formats, for example HTML tables; other importers could be written for that.
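How such an importer abstraction might look is sketched below; the type and method names are hypothetical, not taken from the prototype.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;

// Minimal stand-in for the internal table model (assumed): rows of cell
// strings plus an identifier of the source page.
record Table(String sourceId, List<List<String>> rows) {}

// Hypothetical importer interface: the TEI importer is one implementation,
// and an HTML-table importer could be added by implementing the same
// interface, since all transformations operate on Table objects only.
interface TableImporter {

    /** Parses a single source file into the internal representation. */
    Table importFile(Path file) throws IOException;

    /** Loads a whole link group, i.e. all pages belonging to one logical table. */
    List<Table> importLinkGroup(List<Path> files) throws IOException;
}
```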
The next step is merging all the single tables into one large one, and then the real work begins. You look at your preview and see, okay, maybe the merging is wrong, so you have to fix some spans or similar issues until the alignment of all the data is correct, and then you can choose the transformations and apply them. In every step, the HTML visualization is produced to give you instant feedback. Finally, when you're done, you can export this into your RDF data cube, and then we are able to reintegrate it into the E-XS workflows and, for example, do recommendations on it. It is also possible to export to other data formats: if you do not want to work with RDF data or have specific needs, you can write another export mechanism here.
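Put together, the workflow could be sketched roughly like this; it is an illustration under the assumption that tables are lists of rows and each transformation maps a table to a table, not the prototype's actual code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Rough sketch of the incremental workflow: merge the pages of one link
// group into a single table, apply the user-chosen sequence of
// fine-granular transformations in order, then hand the result to an
// exporter (e.g. the RDF Data Cube exporter).
public final class TransformationPipeline {

    public static List<List<String>> run(
            List<List<List<String>>> pages,                   // one parsed table per page
            List<UnaryOperator<List<List<String>>>> steps) {  // ordered transformations

        // 1. Merge all page tables of the link group into one large table.
        List<List<String>> table = new ArrayList<>();
        for (List<List<String>> page : pages) {
            table.addAll(page);
        }

        // 2. Apply each transformation in the configured order; the prototype
        //    renders an HTML preview after every step for instant feedback.
        for (UnaryOperator<List<List<String>>> step : steps) {
            table = step.apply(table);
        }

        return table;   // 3. Ready for export, e.g. as an RDF Data Cube.
    }
}
```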
So let's look at some transformations. Everything revolves around the normalization process from the beginning: you look at your facts and try to find all the dimensions which are relevant for them. You have to reach that step, but first you have to restructure your data to be able to perform it. So your temporary target is a normalized version of your table which you can then reorder, and mainly you get there by introducing redundancy into the data. The source data is very precise and very dense, and usually relatively easy for humans to understand, but for the computer it's a different story. So you normalize the data by introducing redundancy, which is no problem later on, because the analytics side has other formats and schemas prepared for that; for our case, redundancy is the way to go.
Most transformations actually deal with pre-normalization steps. There are over 30 of them, and most of them do sanity checks or data cleaning: they try to fix the structure, they introduce redundant cells to split up column spans and row spans, and they delete rows, for example headers that are still in there. The second step is the normalization itself. It can be simple, but there could also be multiple cubes in one table, or a remaining horizontal or vertical partitioning of your data, for example because of repeating headers, and there are more complex normalization algorithms which can deal with that as well. And lastly, there are post-normalization transformations. You can think of it like this: now I have a normalized, column-based table, and here I can add or merge columns. Merging is relevant, for example, if there are hierarchies in your data and you want to resolve them by combining them into a single cell. You can also add headers: when you have lots of products in the cells of one column, you need to describe it, to say, okay, this column is "product" and my cells are its instances. And then you can also find a disambiguation for it, which means linking it to an existing resource, for example in DBpedia. This can have additional benefits, like being able to merge some of the data.
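Two of these post-normalization steps could look roughly like the following sketch; the separator, labels, and link table are illustrative assumptions.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical post-normalization helpers: collapsing a hierarchy into a
// single cell value, and attaching an external URI (e.g. a DBpedia
// resource) to a label for disambiguation.
public final class PostNormalization {

    // Combine hierarchy levels, e.g. a region and its sub-region, into one value.
    public static String mergeHierarchy(List<String> levels) {
        return String.join(" / ", levels);
    }

    // Look up an external resource URI for a label; unknown labels stay unlinked.
    public static Optional<String> disambiguate(String label, Map<String, String> linkTable) {
        return Optional.ofNullable(linkTable.get(label));
    }
}
```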
And you can also add metadata like provenance information.
There are also advanced transformations. I told you there are lots of very simple ones, for example deleting a row, and we are able to combine such simple transformations into more complex ones in order to fix more complex problems: finding errors and then automatically fixing them with such a compound transformation. For example, every repeat symbol can be resolved, or every sum row or column can be deleted; we look for where there is a sum and then automatically apply the already implemented transformation.
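A compound transformation of that kind might be sketched as follows: detect rows whose label cell marks a sum and remove them, reusing the simple row-deletion step. The marker words are assumptions, not taken from the source tables.

```java
import java.util.List;

// Hypothetical compound transformation: find every row whose label cell
// indicates a sum and delete it, so no pre-aggregated values remain in
// the normalized table.
public final class DeleteSumRows {

    // Assumed markers for sum rows in the source tables.
    private static final List<String> SUM_MARKERS = List.of("Summe", "Zusammen");

    public static void apply(List<List<String>> table) {
        table.removeIf(row -> !row.isEmpty() && isSumLabel(row.get(0)));
    }

    private static boolean isSumLabel(String cell) {
        return SUM_MARKERS.stream().anyMatch(cell::contains);
    }
}
```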
Also, to assist the user, transformation suggestions are implemented in the graphical user interface: if you find characters which tend to be the cause of a problem or an error, you can suggest to the user, okay, in this cell you have this character, maybe you want to delete the cell or make some structural modification to it. This is also a step towards automation. For now this has to be a manual process, because the structure is too complex for a fully automated one, but by scanning the layout and the structure you can determine whether there might be a problem and then suggest to the user, okay, look at this specific point, maybe this is an error you have to fix.
That brings me to the contributions. We have implemented a modular workflow for a data integration process. We have defined a relatively large set of granular transformations, which can also be combined into more complex ones, and the prototype is usable with this data source as well as with other data sources. The good thing is that many of these link groups are very similar to each other, so you can take an existing chain of transformations and apply it to another one. For example, the imports not of the year 1880 but of the year 1881: the structure differs only minimally, so you have different structural errors, but you can reuse your transformation sets with little modification. We have also lifted and enriched historical statistical data by linking the integrated headers into the Linked Data cloud, and it is now ready for visualization and analysis. Our current data set contains more than 30,000 files, and we've converted more than 10% of it. In this process, we've created ten of these conversion chains, whereby five of them are very similar because we used similar input tables, and the others are different, to see whether we can reapply the implemented transformations to new data sets.
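Reusing a chain could then be as simple as copying the stored transformation sequence and swapping out the few steps that differ, as in this small sketch (illustrative only, not the prototype's code).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Illustrative helper: copy an existing conversion chain and replace a
// single step, e.g. to adapt the 1880 chain to the slightly different
// structure of the 1881 tables.
public final class ChainReuse {

    public static List<UnaryOperator<List<List<String>>>> adapt(
            List<UnaryOperator<List<List<String>>>> chain,
            int indexToReplace,
            UnaryOperator<List<List<String>>> replacement) {

        List<UnaryOperator<List<List<String>>>> adapted = new ArrayList<>(chain);
        adapted.set(indexToReplace, replacement);
        return adapted;
    }
}
```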
Thank you very much.