Enterprise-Wide Metadata Management
Formal Metadata
Title:
Subtitle:
Title of Series:
Number of Parts: 30
Author:
License: CC Attribution 4.0 International: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/53708 (DOI)
Publisher:
Release Date:
Language:
Content Metadata
Subject Area:
Genre:
Abstract:
Keywords:
Transcript: English (auto-generated)
00:00
Okay, so hello, everyone. My name is Rebecca Eichler, and I am a PhD student at the University of Stuttgart. Today I will be presenting the content of my paper on enterprise-wide metadata management. Specifically, we wrote about the current state and challenges in metadata management, all based on
00:26
an industry case. I will start with a quick motivation for why we need enterprise-wide metadata management and what it is actually about. What you see here is a variety of data sources that you would have inside a company, and its
00:45
employees. Today, as you all know, companies are collecting all kinds of data, and they do this because they hope to derive some kind of value from it. They could be doing this, for example, through an employee like Bob
01:01
here, who performs data analysis on a few of the data sources and thereby derives some kind of insight. The problem today is that employees only know about very few of the data sources that are available to
01:20
them. This limits the value that we can actually extract from our data. That is why, today, we are trying to make the data we have available to all of our employees, so far as this is compliant, obviously. If we do this, we have more
01:40
people working on more of our data, and thereby, for example, we can generate more insights. Now, to share data in this way, we need enterprise-wide metadata management. Right now, metadata is collected within the scope of single systems. For example, in the data lake context, you could have metadata on
02:04
what data is in the lake, who is allowed to access it, what the quality is, what the lineage information is, and so on. Now, to make the data available, we need to make the metadata available first. So we need to bring this metadata together in some
02:23
kind of centralized approach. This is what we mean when we talk about enterprise-wide metadata management: it encompasses the metadata management initiatives of the single systems and brings all of it together. We went ahead and interviewed a globally active manufacturer. This
02:45
is a company that is active in various sectors, such as mobility and the industrial sector. They have a global manufacturing network, with all kinds of data coming out of this context, and they are currently striving to become a
03:02
data-driven Industry 4.0 company. We interviewed them to find out what metadata management they are conducting, what their goals are, what their challenges are, and so on. That is what I am going to talk about today. First, I will tell you about the goals that companies have today, how they are put into
03:22
practice, and what challenges they face. We also conducted a literature review and a tool review to see whether these challenges are covered, and we found a few research gaps, which I will present later. I will close this talk with a quick summary and conclusion. Okay, so:
03:46
the current state in metadata management. The first thing we need to understand is that metadata management is an enabler for data management, so we first need to know the data management goal. As I explained earlier, one of
04:05
the very big topics in enterprises today is the sharing of data, because this promotes the extraction of data value. The company we interviewed explicitly said: we want to share our data freely and
04:22
efficiently. Freely means with as many employees as possible, and efficiently means with as little effort as possible. What we also have to consider when talking about data sharing is that it always has two sides: one side provisions the data, and the other side accesses and uses it. Okay, now, to enable data sharing, we need data
04:49
transparency, and this is our big data management goal. Data transparency is all about the ability to actually find the data that you have, to then be able to
05:02
understand it. And once you have understood your data, you want to be able to access it. That is what data transparency is about. But data transparency is still somewhat vague, so how do you achieve it? The company we interviewed defined four
05:24
metadata management sub-goals for this. The first one is the creation of a data inventory: we need to know what data we have in our company. The second one is about creating coherent and shared semantics. If we have
05:43
department A sharing data with department B, department B must be able to understand the terms used; they must be able to understand the data, to avoid misunderstandings, false insights, and so on. For that, we need shared semantics. The third sub-goal is about creating a common structural
06:05
description of data. Today, it is very difficult to understand data based only on its models and to reuse it; this is the issue addressed with sub-goal three. The fourth sub-goal is about creating a
06:20
common data asset description, that is, a general description of the data that we have. Okay, those were the goals, and now it is interesting to see how they can actually be put into practice. Sub-goal one can be implemented through a data catalog. This is a metadata
06:44
management tool whose main feature is an inventory: we can register all, or many, of the data sources that we have and collect additional metadata on them. Examples of this are Alation, Collibra Data Catalog, Informatica Data
07:05
Catalog, and so on. The second sub-goal can be achieved through a business glossary. This is also a metadata management tool; often, catalogs also contain such a component. And this is basically
07:21
just a list of terms, abbreviations, synonyms, and term relations; this is what we define in a business glossary. Now, the third sub-goal is more complex, and the company that we interviewed went ahead and
07:43
addressed it through semantic modeling: they introduced a meta model. To create this common structural description, they created a meta model with various abstraction levels. On the first abstraction level, for
08:02
example, they defined a business object. On the second, lower level, they would say: we have a machine, and machine is an instance of a business object. On the next lower level, they would say: a drill bench is an instance of a machine, and they would connect these instances. Now, if we have data models, and in these models we are using a drill bench or
08:24
something of that sort, or a machine, we can interconnect them with these definition instances. By doing this, they bring together all of their data models, add this layer of understanding to them, and create a kind of knowledge graph. This is their approach for
08:41
creating a common structural description of the data. And the last sub-goal can be obtained through a metadata standard like Dublin Core; you can think of this as a list of attributes that you collect on your data sets. However, the company
09:02
that we interviewed had not actually implemented this part yet, and therefore we did not go into detail on it. Okay, those were the goals and how they can be put into practice. Now I would like to tell you more about the challenges that companies face today. There are several challenges, but we looked
09:21
at three in particular. The first one relates to metadata management in data lakes. There is a lot of literature on metadata management in data lakes, for example on metadata management systems like Constance, GEMMS, GOODS, and so on. Still, there are open questions. For
09:46
example, data lakes are known for turning into data swamps. A data swamp is a lake in which the contained data is unfit for use, usually because of missing metadata. Now, it is still unclear what tasks
10:04
exactly are required to prevent the creation of a data swamp. An example task would be data quality management, or collecting lineage information, or something of that sort. Literature talks about this,
10:20
but it varies, and it is not clear what you actually have to do: what are the minimum necessary tasks? Also, for these single tasks there are open questions, like: what metadata do I actually need to collect in this context? They say you need to do data quality management, but we do not know what metadata we
10:42
need to collect within the data lake for this quality information. Also, which tools, protocols, and standards work well in data lakes? And how can I then take the metadata that I collected and later integrate it into my enterprise-wide landscape?
11:01
These are open questions. Similarly: how do these tasks actually differ in data lakes? Data lakes are different from the other source systems that we have. So the question is: how do I actually perform metadata management in the lake? And how does, for
11:20
example, data quality management differ in it? I need to know this to collect the right metadata. The second challenge that we looked at was the selection and composition of metadata management tools. There is also a lot of literature on various tool types. I already
11:40
mentioned data catalogs and business glossaries, but there are more: data marketplaces, data hubs, data dictionaries, and so on. Now, there are very few comparisons of these tool types. We have a few blog articles that compare subgroups of these tools, and we also
12:01
have lists of actual tool instances, like compiled lists of tools and vendors. What we do not have is a comprehensive overview of the tool types, along with a categorization and differentiation of them. For example, how is a data hub different from a
12:22
data marketplace? Is it a synonym? Is it a subtype? How do they actually relate, and how do they work together? This is what we need to know when building an enterprise-wide tool landscape. Also, what are the building blocks of these tools? For example, a data marketplace often contains a data catalog, and a data
12:42
catalog often contains a business glossary. So they contain each other, and we need to know what the building blocks are and how they work together in order to know which tools to select. Likewise, it would be really interesting to know which building blocks are actually needed to conduct comprehensive metadata
13:04
management. These are also questions that are not sufficiently addressed in the literature. The third challenge we looked at concerns data marketplaces for internal
13:21
use. As I said earlier, we are using data catalogs and business glossaries in companies today. This helps us to find the data and understand the data, but it does not actually help us to access or provision data. This is why we need a data marketplace. Data marketplaces
13:43
are platforms for the exchange of data, but currently they are mainly used for the exchange of data between companies. We are looking at data marketplaces in the internal context.
14:00
There is a lot of work on data marketplaces, but usually in the external context; we hardly have anything on how data marketplaces can be used within an enterprise. This is why we are missing detailed concepts and solutions, with architectural proposals, definitions of the functional scope, and so on, for the
14:24
internal context. Just to name an example of how they could differ: compliance is a huge issue in the internal context, more so than externally. If I am a company sharing one specific data set with another company,
14:40
then I have checked that this is all compliant. But if I am trying to share all of the data I have with all of my employees, I need much more refined compliance regulations within this marketplace. Okay, so these were a few of the challenges, and we see that there are issues in the context of metadata management.
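[Editor's note: to make the compliance point concrete, here is a minimal sketch of the kind of rule check an internal data marketplace might apply before granting an employee access to a data set. It is purely illustrative; the classes, clearance levels, and rules are assumptions, not the interviewed company's actual system.]

```python
# Hypothetical sketch: a compliance gate for an internal data marketplace.
# All names and rules are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Dataset:
    name: str
    classification: str          # e.g. "public", "internal", "confidential"
    contains_personal_data: bool

@dataclass
class Employee:
    name: str
    department: str
    clearance: str               # same scale as dataset classification

CLEARANCE_LEVELS = ["public", "internal", "confidential"]

def may_access(employee: Employee, dataset: Dataset) -> bool:
    """Return True if the request passes the (illustrative) compliance rules."""
    # Rule 1: the employee's clearance must cover the data classification.
    if CLEARANCE_LEVELS.index(employee.clearance) < CLEARANCE_LEVELS.index(dataset.classification):
        return False
    # Rule 2: personal data is only shared with departments that need it.
    if dataset.contains_personal_data and employee.department not in {"HR", "Legal"}:
        return False
    return True

machine_logs = Dataset("drill-bench-logs", "internal", contains_personal_data=False)
bob = Employee("Bob", "Analytics", clearance="internal")
print(may_access(bob, machine_logs))  # True: internal data, no personal data
```

Sharing a single vetted data set externally needs one such check; sharing everything with everyone internally requires rules like these to run for every employee/data-set pair, which is why refined compliance machinery matters in the internal case.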
15:03
I am going to sum up quickly what we learned today. I talked about the current data and metadata management goals, which are mainly about data sharing and data transparency, and I gave you an insight into the challenges. Data marketplaces are
15:23
shown on the right here, because we need them in addition to the other sub-goals. The other two challenges are more cross-sectional: metadata management for data lakes, and the selection and composition of metadata management tools. These are shown in blue because we conducted a literature and tool review,
15:43
and we found that they are not sufficiently covered and therefore constitute research gaps. This is what I wrote about in the paper. Thank you for your attention.
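[Editor's note: as an illustration of the meta model with abstraction levels described in the talk (business object, machine, drill bench), here is a tiny hypothetical sketch of the "instance of" chain as graph edges. The data structure and function names are assumptions for illustration, not the company's implementation.]

```python
# Hypothetical sketch: the talk's abstraction levels as "instance of" edges.
# Linking a data model element (e.g. a "drill bench" column) to this chain
# adds the shared layer of understanding, forming a small knowledge graph.
instance_of = {
    "machine": "business object",   # level 2 is an instance of level 1
    "drill bench": "machine",       # level 3 is an instance of level 2
}

def definition_chain(term: str) -> list[str]:
    """Walk the "instance of" edges from a term up to its top-level definition."""
    chain = [term]
    while chain[-1] in instance_of:
        chain.append(instance_of[chain[-1]])
    return chain

print(definition_chain("drill bench"))  # ['drill bench', 'machine', 'business object']
```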