Data Warehousing and Data Mining Techniques - Introduction (28.10.2010)
Formal Metadata
Part: 1
Number of parts: 13
License: CC Attribution - NonCommercial 3.0 Germany: You may use, modify, and reproduce the work or its content in unmodified or modified form for any legal and non-commercial purpose, and distribute and make it publicly available, provided you credit the author/rights holder in the manner specified by them.
Identifier: 10.5446/325 (DOI)
Production year: 2010
Production location: Braunschweig
Transcript: English (automatically generated)
00:01
Hello everyone. It may come as a surprise, but I hope it doesn't: this lecture will be held entirely in English. So if that is a problem for somebody, raise your hands now, or don't say anything at all.
00:31
So the topic is data warehousing and data mining techniques, and there is a reason why it's in English. First of all, because it's important for you to know proper English and to be able to grasp concepts even in English.
00:44
Second, of all the topics in the database curriculum, this is the easiest one. So its concepts are really easy to grasp even in English, and most of the technical terms are English anyway. And third, and the actual reason, is that this lecture is part of an international course of study:
01:08
the Master of Information Technology and Information Systems. That is a master's course that is held jointly with the universities of Hanover, Clausthal and Göttingen.
01:21
And so we are very proud to be part of this curriculum, and I hope this helps you to get some of the concepts we have here in English. It's a good exercise for later job applications anyway.
01:40
First of all, this is a lecture that is very important for computer scientists, since it's databases, sure. But it's a lecture that is very important for business computer scientists too, so enterprise information systems, business information systems. If that is your course of studies, you should definitely know about data warehousing, because it's the reality out there in almost every organization.
02:05
Who's from business information systems? One, two, three. One, two, three? Oh, that's not too many. Interesting. Good. So tell your colleagues about it. This is really important to know.
02:22
Anyway, here is what we will be doing: I will start with some organizational issues and then go directly into the lecture. The lecture runs from the 28th of October until the beginning of February, every time from 3 p.m. to quarter past five.
02:45
So that makes three lecture hours if you count correctly. And the interesting thing is why do we do this in such a large block with a short intermediate break? The answer is because we try to integrate everything. We try to get the exercises within the lecture, within the coursework.
03:07
And have some of the background, some stories about it, some historical remarks on how it came to be and what is important and what is not, disguised as a kind of detour where you can just lean back, relax and listen to the things that are happening.
03:26
This is not really for thinking. This is not really for understanding. But it's rather meant to be background information for you so that you can see some of the issues that are involved. We will also have the exercises like the solution for the exercises integrated into the lecture and discussion about homework.
03:49
There will be homework for I think every two weeks or something? Every week. Every week, oh dear. So there will be homework every week now.
04:02
And finally there will be exams, oral exams. And we used to have as a prerequisite for these exams that you have to get 50% of the total available homework score. We cannot do that anymore due to some regulations of the modules; the ministry told us that it's not possible to have two issues in one exam.
04:33
So having 50% of the total score and doing the oral exam would be two parts and this is no longer a viable option.
04:45
So I still very strongly advise you to try to get the 50%. It's not difficult. Most of the exercises are really, you know, repeating what you have read on the slides; understanding, thinking about it a little, which you will have to do for your exams anyway.
05:01
So it's a good exercise for the exam and everybody who's regularly done the homework will have no problems whatsoever during the exam. Okay? So please, I really want to encourage you, try to top up the lectures with some of the homework at least.
05:21
The credit points for this lecture are four or five, depending on which Studienordnung you study under. There is no English word for Studienordnung (study regulations), so this is a typical German term that gets exported to other countries.
05:42
The basic point is: if you have the old one, then it's four credits. If you have the new one, it's five credits. But it's not automatically five credits; you have to change your course of study. If you make sure you do that, then you get five credits.
06:01
Not a problem here. And the next part of the introduction is always why you should be here. So what is the interesting part of this lecture? What are the things you will learn and what is the knowledge you will gain from this lecture? And the interesting thing is that if you have bad business decisions, as everybody can tell you, bad management decisions,
06:31
your organization will perform suboptimally, and that is to say the least. There have been some outright disasters in industry.
06:44
For example, the crash of Lehman Brothers. That was the result of very bad strategic decisions. And they were not inventing these decisions on a whim, going: let's do something that ruins the whole economy today.
07:02
But they thought they were doing the right thing. How did they come to think this? And that is where data warehousing comes into the game. The data warehouse is one of the major sources of information from which management draws the grounds for strategic decisions.
07:24
And that really goes through all areas of organizations, be it companies or the public sector. There are a lot of things to do. The interesting part, from the more technical data warehousing perspective, is: what is the data that your organization owns?
07:48
And on the other hand, what is called online analytical processing, or OLAP for short. This is how this data is kind of analyzed, how you get the information that we really want to have.
08:04
And the information is not out in the open. I mean, I can use a database, put in some SQL query, and get back everything that's in the database. But most of the information that is really important for strategic decisions is hidden somewhere in the data.
08:21
Or it can be extracted by putting some pieces of data together in a very complex manner. And this is what you have to do. You have to see the data from a different angle. You have to look at it in a different way. So what about specific products that I sell? Is it worthwhile selling these products, producing these products, or is it kind of like not taking off?
08:46
What about the clients? Do I have clients that never pay? What about the time? So the Christmas market or something that is very important for most companies. You know, they will have an increase in sales during Christmas time. Or the geographical area.
09:03
Does it really pay off to sell fridges to the Arctic Circle? Probably not. Okay? This is data that is not open, that is not out in the open, but data that is rather somewhere hidden in your organizational data,
09:21
in the sales data, in the customers, in whatever. So what you have to do is get statistics over all this big amount of data, put these statistics on a nice Excel slide,
09:40
and bring it to the next board meeting. Obviously the sales in North America are not worthwhile. That helps you to get your point across. Because who can argue with statistics? Only somebody who understands statistics.
10:00
And that is not management. So that is basically the first point. And the second point is that statistics, having good statistics, is always the basic building block of making good predictions. So you can, to some degree, predict future developments.
10:25
Are the sales going up or down? Is the stock market at a turning point? Stuff like that. You will not always be right. But you get the idea. And that is what this course is basically about. This is what you should know.
10:41
Because it is interesting. But on the other hand, it is not only interesting; we all love databases and we all love being paid for what we do. And if you go to some of these job sites, Monster or whatever it is called, you will find a lot of job offers for technology analysts, data analysts,
11:10
where you see: okay, five to ten years of SQL Server experience, working with .NET architectures, data warehouse experience, OLAP analysis services, experience with data marts, staging, metadata,
11:23
extraction, transformation, loading experience, things like that. And that will pay you about 150,000 bucks a year. And you find loads of these. So there is really a market; there is really a profile in demand: data analyst.
11:42
And this is what you can get if you follow this course. And don't come too late for this course. I also want to point out some literature that I like very much and that is kind of standard in this area.
12:02
So for data warehousing and data mining, there are some very good textbooks. And probably the most well-known is the so-called Data Warehousing Bible or the Inmon Bible by William Inmon, Building the Data Warehouse. Very good book that basically deals with all the aspects of data warehousing,
12:24
where you really get to know what you are doing, why you are doing it, what a data warehouse should do or shouldn't do. And there is a second one along the same lines, the Data Warehouse Toolkit, that is more on a technical level by Ralph Kimball. These two are very good books for a general view on data warehousing.
12:46
I also want to point out one German book, Data-Warehouse-Systeme by Andreas Bauer and Holger Günzel. It may be a little bit superficial, but it's kind of nice. And if you want something in German to accompany the course,
13:01
to read about some of the concepts in German, that might help you. It's kind of a nice book. If we go into more technical directions, specifically for the online analytical processing part, then I can very warmly recommend the Data Warehouse ETL Toolkit,
13:22
which is kind of like a second volume for the Data Warehouse Toolkit, again by Ralph Kimball, and which specifically concerns the extraction, transformation, loading process of the data warehouse: so how it's really built up. Not how it's used, but how it's really built up. All of how to get the data into the warehouse from production databases,
13:44
how to clean it, how to normalize it, and all that kind of stuff is in that book. Very, very well written. As for the analytical processing part, there's a very good one by Erik Thomsen, OLAP Solutions, and one by Robert Wrembel, Data Warehouses and OLAP.
14:05
Interesting books; I can warmly recommend them. And basically that covers most of what we do here in this lecture concerning the warehouse. Yes, that was too much.
14:46
So what are we going to do today? Today I want to give you a basic introduction of the whole problem. So what is a data warehouse? How do you use a data warehouse and what do you use it for?
15:00
And then we'll go briefly into the life cycle, into the phases that a warehouse needs to go through to fulfill its full potential. Basically, a data warehouse is a very big database. So we all know about databases.
15:22
I mean, we all had Relational Databases 1. Who had other database courses, except for the basic one? And the others only had Relational Databases 1?
15:44
Probably, yes, no. Who didn't have databases at all? Yes? So everybody had at least Relational Databases 1. Well, that will suffice for all purposes. Basically, we can see a data warehouse as a very large database.
16:04
And now, of course, the question occurs: is every large database then a data warehouse? And the answer is no, that's not true. So we have the typical relational systems, the Oracle systems or whatever, you know. They are not data warehouses, they are just large.
16:24
And every data warehouse is also large, but it has some characteristics that are totally different from databases and we will go into them in just a minute. If I say large, I mean terabytes.
16:41
I mean really, really large. MySQL is not a fit. 800 gigabytes at least, but most really have several terabytes. And of course, several terabytes always imply that doing that on one server is a very difficult thing.
17:02
Most of the data is globally distributed for security reasons. What if your computing center burns down? All the data of your organization is lost. Very bad idea. And on the other hand, because the data is needed in different facilities,
17:23
so usually big companies own a couple of facilities and all will need to access the data. You can well imagine that this may cause severe bottlenecks if the whole database runs on a single server. So mostly, data warehouses are distributed.
17:43
Well, then is every distributed database a data warehouse? No, that's not true either. A data warehouse is something really specific. It's a collective data repository. And that means that all the operational data
18:02
that is produced by your company is stored in the data warehouse at any point in time. So you have a complete history of your data in the data warehouse.
18:20
You don't change it. You know, in normal databases, as large as they may be, you update data, you put new data in, you delete data. You don't do that with data warehouses. You keep all the data with a timestamp. Okay? And why do you do that? Why do you need the whole history of your company?
18:45
Well, the interesting part is that if you want to do proper statistics, they will have to cover some time to see trends emerging, to make predictions for the future, to see if the models that you used for getting your predictions
19:02
are true, are valid. Then you, of course, need a historical perspective on the data. That is what you're doing. The process of how you get your operational data into your data warehouse, because that's not a single system, that is several systems,
19:22
is called ETL or extract, transform, load. And that means that from the normal databases you have in your company,
19:41
you have some kind of a process that will be investigated during the next couple of lectures, that cleans the data, that normalizes the data, that works a little bit with the data to get more information out of it, and then it is put into the central data warehouse.
20:03
That is the real big database; those others are the normal databases, the ones that are covered by Relational Databases 1 and 2. The warehouse is the one that we're dealing with in this lecture, in this course.
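To make that concrete, here is a minimal SQL sketch of a single load step; all table and column names (stg_sales, dw_sales_fact, and so on) are hypothetical, and a real pipeline would use dedicated ETL tooling rather than one statement:

```sql
-- Hypothetical ETL load step: extract rows from a staging copy of the
-- operational system, transform them into a consistent representation,
-- and append them to the warehouse fact table with a load timestamp.
INSERT INTO dw_sales_fact (product_id, store_id, sale_date, quantity, revenue, loaded_at)
SELECT s.product_id,
       s.store_id,
       CAST(s.sold_on AS DATE),    -- normalize the date format
       s.quantity,
       s.quantity * s.unit_price,  -- derive revenue during the transform
       CURRENT_TIMESTAMP           -- loads are timestamped, never overwritten
FROM stg_sales s
WHERE s.quantity > 0;              -- simple cleaning: drop obviously broken rows
```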
20:22
And then you can do some analytics on that, covering so-called data marts. We will do that during the fourth or fifth lecture, something like that. And then comes the interesting part that will be covered at the end of the lecture,
20:44
like the last four or five lectures: the analytics part. So how do you do OLAP on the one hand? And how do you do data mining? What are the algorithms that get at the hidden information?
21:02
How do I know which products to put together on the same shelf in my supermarket so people will buy more? These are the interesting parts. And we will cover these algorithms, and you will know about them.
21:21
Well, if you own a supermarket at some stage, you're able to make sure that everything is in order and that the cheese is very close to the red wine, because they're always bought together. This is basically what we're doing. And this is how the data flows.
21:40
The data flows from the transactional systems, from the everyday use systems where all the production data, the customer data, the financial data, all the kind of data that is collected by your organization is hosted and is processed. The data flows with help of the ETL process into the data warehouse,
22:06
which is just like a real world warehouse. You have lots of shelves and you just stuff the data in. And then at some point somebody goes through this data warehouse and visualizes some connections, does statistics, makes predictions
22:23
or just creates reports for management, you know: was it a good year, was it a bad year? All this can be done with a data warehouse. Why don't we use databases for that? I said a data warehouse is basically a very large database, which is true.
22:46
Well, if we compare it to a database: I said that the data in a database does not have any historical dimension. It's changed, and then there may be a change date, but that's about it.
23:01
You don't know about the old values or the history of some data item. You do in a data warehouse, because you never change anything. And that is of course interesting if you make strategic or tactical decisions, where you need to know some of the development to predict what is coming
23:25
or what is good, what is bad. It also implies that for management decision you only have a very small number of transactions because management board meetings take like most of the time of the day
23:42
and they discuss three, four, five things that have to be prepared by analyzing the data warehouse. If you're selling things, products, or have customer contact, that will be three, four, five things you do per second
24:02
because you're selling thousands and thousands of your products. And that is of course a very interesting thing. You have a very small number of transactions, but those transactions may be arbitrarily complex because they need a lot of data to sift through.
24:22
They may need a lot of different angles to look at the data: maybe grouped by certain countries, grouped by certain time spans, or the development over a certain time span. So these are usually very long running transactions.
24:41
And they don't have to be online all the time. But you get your analytics and then the system is kind of free to use for other people or for the ETL process or whatever. Whereas a database that is used for operational purposes has to be online 24 by 7
25:01
because you cannot afford not to sell something because your system is down somehow. You need to be available. So if I compare them directly to each other, the online transaction processing that is basically the traditional database
25:23
and the data warehouse on the other hand, then the business focus is that the database is operational. It's really for sales data. You just put that in whenever a unit is sold. And that happens very often.
25:40
The data warehouse is for tactical, strategic decisions. It happens very rarely that you need those. The number of transactions is very large in a normal database, very small in a data warehouse. How many strategic decisions do I make a day?
26:02
How many units of product do I sell a day? Totally different. The transaction time is short in databases. I mean, how long can it take to update some record? And very long in data warehouses.
26:20
How long does it take to group everything by country and who bought what and why and the time he needed and blah, blah, blah. Lots of joins, lots of orderings, lots of aggregation and some other data functions. So this is basically the major differences
26:43
that we see between databases and data warehouses. Again, saving.
27:02
Kind of annoying, but I have no idea if that can be... Ah, we should use an ETL process for that. Okay, come on, this is ridiculous.
27:21
Some definitions of data warehousing, or what the experts say; I talked about the Kimball book and the Inmon Bible just a minute ago. So Ralph Kimball says it's basically a copy of transaction data that is specifically structured for query and analysis. That's a very simple definition.
27:40
You have the transaction data, the business data that is operational. You just copy that into the data warehouse, and you copy it in a very clever way so you can use it for later analysis. So it's normalized and all kinds of things. Inmon is a little bit more to the point.
28:01
Basically he says that a data warehouse is a subject-oriented, integrated, non-volatile, time-variant collection of data in support of management decisions. What do these four points mean? Well, subject-oriented means it's organized in a way
28:23
that all the data is about the major focus of your company. So if you're selling computers, then the focus of your data in the data warehouse should be the individual computer that you're selling.
28:42
And all the data that is related to it should be grouped around the central product, the customer, the cost that you have in production, the orders that you have, warranty claims, all the accounting data that you need
29:01
that is grouped around the central entity of your data warehouse. That means subject-oriented. To give you a short example, if we take the customer as a subject in the data warehouse,
29:21
then we could have base customer data from different periods and the activity that all these customers did. So we have the base data like addresses and maybe some demographic data, the age, the gender,
29:41
or if you buy something at some of the big markets like Saturn or MediaMarkt or something like that, they probably will ask you what your postal code, your zip code, is. Has anyone ever experienced that, that when paying for something they ask:
30:00
what is your zip code? Did it make you think or did you just hand it over? What are they doing with your zip code? I mean, it's totally uninteresting for the transaction, isn't it? You buy it. It doesn't matter where you come from.
30:20
Unless you had an idea? Exactly, exactly. So they analyze with the zip code where the customers come from,
30:42
so what the basic area that a market serves is. And of course that is very important information for what? I mean, I sell the stuff so I couldn't care less where the customers come from.
31:06
advertisement, yes, right. So if a large number of my customers come from a certain area, then I should probably invest in some ad campaigns in other areas that I want to address, yes.
31:44
That's right. So basically you can find out which markets are performing well and which are kind of underperformers and you can always define areas that are obviously in need of some new markets. Where should I put up the next market?
32:01
Well, just order your customers by the distances they have to travel, and the customers who are furthest away and quite large in number obviously need a new MediaMarkt. And that's a strategic decision. Okay? A strategic decision that is not obvious from the sales data.
32:26
That is only obvious after you ask people for their zip codes when selling things. And all that information really lands in one big data warehouse that is focused around the customer. And usually there are lots of tables in a data warehouse,
32:44
a hundred tables or whatever, and they are all related. They are all concerned with the object of desire: the customer or the product or whatever it may be. Whatever you focus on, that is the center of your data warehouse.
33:01
That means subject-oriented. Second, it has to be integrated, Inmon tells us. So the data warehouse contains all the information from multiple transactional systems that are spread throughout your organization.
33:23
That means of course that all the data that gets into the data warehouse has to be made consistent across all of these base systems. Otherwise you will have inconsistencies in your data warehouse. So every inconsistency that you experience in the underlying systems
33:42
will be directly moved into your data warehouse. Which is not a good thing if you are thinking about strategic decisions. Because, well, that's the crap in, crap out principle. If I have bad data, I get bad decisions.
34:02
And of course we want to deal with that somehow. So all the data from the underlying systems has to be made consistent. And that means integrated. You integrate the data from the different operational systems.
34:20
So whatever you have: you can have maybe the gender, or some measurements, or some conflicting keys or something like that. What if one system stores the data as male, female, spelled out? Other systems just use M and F for male and female. Some use 0 and 1.
34:43
Some use, well, woman, man, whatever it is. How do you get that into the data warehouse? Well, you have to think about one common format that fits all. Just decide on one.
35:01
And then, while taking the information out of the other systems, just transform it. That's actually the transform in extraction, transformation, loading; that's one part of the ETL process. Transform your data so that it is in a consistent representation.
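As a sketch of that transform step, assuming hypothetical staging columns, the gender example boils down to one CASE mapping applied while copying the data; which code means what is a convention you decide once:

```sql
-- Map the conflicting source encodings (M/F, 0/1, male/female, ...)
-- onto one common format while extracting from the staging area.
SELECT c.customer_id,
       CASE UPPER(TRIM(c.gender_raw))
           WHEN 'M'      THEN 'male'
           WHEN 'MALE'   THEN 'male'
           WHEN 'MAN'    THEN 'male'
           WHEN '0'      THEN 'male'    -- assumed source convention: 0 = male
           WHEN 'F'      THEN 'female'
           WHEN 'FEMALE' THEN 'female'
           WHEN 'WOMAN'  THEN 'female'
           WHEN '1'      THEN 'female'  -- assumed source convention: 1 = female
           ELSE 'unknown'               -- if in doubt, better say nothing at all
       END AS gender
FROM stg_customer c;
```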
35:25
But of course that's not all. What happens if you have some customer buying a product and the sales department says he is male and the warranty department that has some claims says she is female?
35:44
Same customer. Can happen. None of the underlying sources will notice. How should they? They're their own worlds. But your database, your data warehouse will notice because there are two entries.
36:02
One saying she is female, the other saying he is male. Obviously impossible. So you have to clean your data. You have to decide which of the alternatives is true. Or better, say nothing at all: gender unknown.
36:21
That is basically the integration part. Then we have non-volatile. That refers to the historical dimension of a data warehouse. So the data in the data warehouse is written but not updated. Not deleted. You don't get anything out of a data warehouse. You just put it in.
36:41
It's not like a real world warehouse, you know, like where you trade goods, where you store goods and then sell them off at some point. Here the analogy breaks to some degree. You just stuff data in. And you keep it. At some point you might delete data that is older than ten years
37:02
because nobody is interested anymore or stuff like that. But still, you don't do the regular update for everything that happens. Somebody buys a computer. You note that in the data warehouse somebody returns the computer because he didn't really want it. You note that in the data warehouse too. No, you don't.
37:23
You just keep the old records and you put some new records into it. So you put a new record, it was returned and then it was sold again. So you have three different records for the same machine. In the product database you will have a single record that has been updated for the new customer
37:42
who eventually bought the item. If some changes occur, a new snapshot record is written, saying exactly that: something was returned, or something is no longer valid. But you note that as a new record;
38:01
you don't delete the old record or update the old record. That means non-volatile. It's basically stable; it's just growing. And it's definitely time-variant. That is, the changes to the data warehouse are tracked and recorded, so you have a history over time. You can see how some entity changed.
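A minimal sketch of the difference, with hypothetical tables: the operational system updates its single record in place, while the warehouse only ever appends timestamped snapshot records:

```sql
-- Operational database: one record per machine, updated in place.
UPDATE product
SET status = 'sold', customer_id = 4712
WHERE serial_no = 'PC-123';

-- Data warehouse: never update, just append a new snapshot per event.
INSERT INTO dw_product_history (serial_no, status, customer_id, valid_from)
VALUES ('PC-123', 'sold',     4711, TIMESTAMP '2010-09-01 10:15:00'),
       ('PC-123', 'returned', 4711, TIMESTAMP '2010-09-08 16:40:00'),
       ('PC-123', 'sold',     4712, TIMESTAMP '2010-09-12 11:05:00');
-- Three records for the same machine: the full history is preserved.
```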
38:22
You can see how some entity changed. So if your central concern are customers, for example, then you will see all the things a customer bought during his life cycle with your company. And you will see, oh, this is a good customer because he buys every five weeks.
38:43
This is a bad customer because he bought once and never came back and that was three years ago or something like that. So maybe he needs another advertisement because he's interesting still. Maybe the other customer who buys every five weeks
39:00
needs a customer card so he can have big savings when shopping with us. So for all these kinds of things, for all this customer processing, for customer relationship management, you need the historical data. What was the development of a customer? Is the customer happy?
39:21
Well, if he buys repeatedly, he's probably happy. If he used to buy repeatedly but then at some point suddenly stopped, he may be unhappy. He may have moved places, you know, like you never know. But still, there's some reason that this had to stop.
39:40
And you can find out about that. Of course, you have different time horizons. So if I look at customers, there will be probably a couple of years that are interesting for me. If I look at products, the sales, I will not cover them in years. I will rather cover them in quarters or even in days when I say, you know,
40:02
what happened today with my company? So as far as the freshness of the data is concerned, you rather need a large timeframe, five years, ten years,
40:22
that you might consider keeping the data for, because predictions for the next year that you make on the grounds of three months are worthless. You cannot predict with a very short timeframe. You will probably need five to ten years.
40:41
In operational systems that is somewhat different. I mean, product data: once a product is sold, it goes to the warranty department and maybe comes back at some point, but basically for production it's not interesting anymore. It's not interesting what was produced 30 days ago.
41:03
Okay? Well, so let's come towards a general definition. A data warehouse is a repository of all the organization's electronically stored data. Whatever the organization needs,
41:21
whatever the organization is about is stored in the data warehouse and it's specifically designed to facilitate reporting and analysis. So we need a normal database for the storage capabilities.
41:41
We need a model that somehow groups the data around a certain subject. We need a lot of storage space because the data that we are collecting has a 10 years time horizon. We need ways to extract reports
42:01
and to analyze the data and we need ways to prepare the data in our organization for being analyzed and giving good quality reports. That is basically what data warehousing is all about.
42:22
And of course you usually do that on custom made hardware that is specifically built for this function. You will not have a data warehouse running alongside your transactional systems or a data warehouse running on some computer that somebody works with in some area
42:42
because if the data is destroyed by some unfortunate events that would be a very heavy blow for your organization. And you very often have database management systems typically Oracle or Teradata, Microsoft SQL Server, IBM DB2
43:02
which are basically relational underlying systems for the actual storage part. You retain the data for long periods of time, so you have to think about backups and maybe nearline or offline storage, because you have large storage devices.
43:23
You have to consolidate the data that you get from a variety of sources and you have to figure out what the data model of your organization is. So what should the central entity be? Is it the product? Is it the customer? Is it the earnings? Is it the sales?
43:43
What are you really interested in? How does the organizational data fit to this subject? That is something that a warehouse architect has to think about first. So let's go to the first detail and see a short use case of some real world data warehouse.
44:07
Yeah, so we've seen what the experts say about it. We've seen what Kimball or Inmon say about the data warehouse but I still didn't find it intuitive enough.
44:20
So I said I will search for some use case to see what people who've worked with them in big companies say about them. So I found a joke actually, I don't know if it's real. It seems some guy from C-base wrote it over the internet about how they sold their first data warehouse.
44:43
So it's about Walmart. In the 90s, Walmart started selling a lot of products. They were pretty big, had a lot of data. So they said: okay, what do we do with this data? We need a database, we need to store it somewhere, otherwise we are not really efficient. So they called Sybase, a
45:02
database technology vendor, and said: look, we want to track our sales. We need something from you. Sell us some relational database systems. The Sybase consultants gladly obliged, and Walmart paid a huge amount of money for it.
45:21
Sure, with the underlying hardware architecture, as Tilo said, you don't usually use a small computer for it so you really need a cluster for something like this. They bought something from Sun with a lot of computing power. Everything was great for the first few months.
45:47
But then they had a promotion, and they said: okay, we'll be offering some Colgate toothpaste as a promotion in some small towns for a few days. After the promotion, the management asked: how did the promotion go?
46:05
How were the sales? Were they good? Were they good compared with normal sales days? Were they good compared with last year's sales during the promotion? So they called the technical guy. He wrote some nice SQL statement involving, of course,
46:22
the product, involving the city, so the city should be small because we sold this toothpaste only in smaller cities. And the time constraint, how were the sales yesterday? How were the sales compared to the last year's promotion?
46:40
How were the sales compared to last summer? And whatever analytical query you can imagine, yeah? What happened when they executed this? This is a pretty big select. It's not a problem if you have, I don't know, 100 rows of data. This is nothing. Imagine they sold a lot of toothpaste.
47:02
You have to do six-way joins, right? This is a lot of data that needs to be extracted from the database, to be joined together, to be calculated. It seems that it took about 20 minutes for the query to execute.
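The transcript doesn't give the actual statement, but a hedged reconstruction of such a promotion query over a hypothetical normalized schema might look like this; every name and date below is invented for illustration:

```sql
-- Compare promotion-period sales of one product in small cities
-- against the same period of the previous year: several joins plus
-- aggregation over a huge sales table.
SELECT ci.city_name,
       SUM(CASE WHEN t.sale_date BETWEEN DATE '1995-06-01' AND DATE '1995-06-03'
                THEN s.quantity ELSE 0 END) AS promo_sales,
       SUM(CASE WHEN t.sale_date BETWEEN DATE '1994-06-01' AND DATE '1994-06-03'
                THEN s.quantity ELSE 0 END) AS last_year_sales
FROM sales s
JOIN product  p  ON s.product_id = p.product_id
JOIN store    st ON s.store_id   = st.store_id
JOIN city     ci ON st.city_id   = ci.city_id
JOIN time_dim t  ON s.time_id    = t.time_id
WHERE p.name LIKE 'Colgate%'
  AND ci.population < 50000
GROUP BY ci.city_name;
```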
47:21
Well, the manager had time. It was not a problem. The only issue was that after five minutes, he started receiving calls from all over the cash registers, all over the Walmart shops. They started saying, okay, I can't sell anything. The cash register freezes. Nothing happens. The database is out of order.
47:41
Nothing functions anymore. So sure, what could he do? He just called Chase. And he said, look, I bought your system. I typed my toothpaste query in, and everything breaks down. I can't sell my toothpaste anymore. I can't sell anything anymore, right?
48:02
And then the Sybase guy said: yeah, look, you have a big query. It takes 20 minutes. What we've sold you is a transactional system. This means you have a 20-minute transaction. In these 20 minutes, no other transaction can use the database.
48:21
So the registers can't update the database. They can't operate any sales for 20 minutes, right? A sale is an update on the database. This doesn't work. Well, of course, the Walmart technical department was not happy about it.
48:42
They said: I don't care. I want this to work. I want my 20-minute query, but I also want to do some sales in this period of time, right? Well, the Sybase support answer was: with this system, you can't do anything about it.
49:02
This is pessimistic locking. So you can't execute such queries and expect that during your transaction another transaction can get in. This would break the ACID rules; the isolation of the transactions, for example, would not be respected anymore. So you can't do it.
49:20
You need something else. You need a data warehouse. You can input your query there and find out your answer in 20 minutes, and on the other system, the transactional system, you can do your sales as before. So this is how Sybase sold their first data warehouse to Walmart.
49:46
We can already imagine a bit better what things look like. So we have these transactional processing systems, the ones with the cash registers; we use them for operational data. We operate a sale, we update the database, cash register.
50:05
You see it every day, probably, in real life and everywhere. This is not a data warehouse, obviously. A data warehouse is fit for analytical processing: the big query you have seen, promotions, comparative analysis between what the state is today and what the state was last year.
50:25
How much did we sell last year in those cities? This helps for resource planning, for budgeting, for marketing as we've previously seen. So it is decision oriented. It's something else.
50:41
As major properties, we can distinguish between the updates. So the operational databases, the relational databases in transaction processing are mostly fit for updates. This doesn't happen in the field of data warehouses.
51:03
You have the first step of the ETL, the loading process, but the rest is mostly reads. You do a lot of selects, you do a lot of comparative analysis. In the transactional field, you have a lot of small transactions.
51:23
They take under a second. You do an update: I've made a sale, the scanner reads the product and then updates the database. This does not happen in the data warehouse; there you have large queries. The size: from megabytes to terabytes. Sure, you can have big OLTP databases.
51:44
You can, for example, after each month, store the data in archive. You can't do this in data warehouses. If you have only the last month of your data, you can't do any prediction. You need years of data.
52:01
This leads to petabyte databases. One example: Walmart had, in '97, a data warehouse of about 50 terabytes. eBay announced last year that they had four petabytes.
52:21
This is how it grows. You have raw data in OLTP systems; you have summarized, aggregated data in data warehouses. I'm not interested in how much this specific client bought at 12 o'clock in the afternoon in some city or other.
52:41
I'm interested in how much we sold today in North America. I'm interested in aggregated amounts of data. It's for a different purpose; it's for decision making. Of course, the data doesn't have to be the very latest state of the data.
53:02
It doesn't have to be the data as of when the registers close today at six o'clock. I'm not really interested in that kind of precision; I'm happy if I only have the data up to last week, but for the last five years. So it might be slightly out of date. At least I have enough data to make my predictions on it.
53:26
You all probably know how the data is stored. In normal relational database systems, you have the third normal form or some other way to avoid redundancy in the data.
53:46
So then you may have your invoice table, you may have your customer table, some status, product, sales, and some other tables which are linked together through different keys.
54:02
In the data warehouse, we don't really care about normalization. We allow some redundancy. What is interesting for us is that this fact table, which contains the interesting data for me, my sales, can be related directly to the customer.
54:25
So if I want a certain customer, I just need to do one join. If I'm going for a time period, it doesn't matter if it's a year, a week, or a day,
54:40
so different granularities: I just need one join. Although, here there is a lot of redundancy, so I repeat myself. But storage hardware is cheap, it doesn't matter. I just don't want to wait a week.
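A minimal sketch of such a star-schema query, with hypothetical fact and dimension tables: whatever granularity you ask for, the denormalized time dimension is reached in a single join:

```sql
-- Star schema: the fact table joins directly to each dimension; the
-- redundant, denormalized date_dim answers day, week, and year queries alike.
SELECT d.fiscal_year,
       c.region,
       SUM(f.revenue) AS total_revenue
FROM sales_fact f
JOIN date_dim     d ON f.date_key     = d.date_key      -- one join, any time granularity
JOIN customer_dim c ON f.customer_key = c.customer_key  -- one join for customer attributes
GROUP BY d.fiscal_year, c.region;
```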
55:04
Some other basic insights are how do I do this, on which hardware do I do this. Should it be a separate machine, separate installation? Of course, this is the way to go. Of course, I should have different hardware.
55:21
If I want an operational system and I want a data warehouse, they should be separated. You can do it also on the same machine. But then you really need some computing power. You really need to make sure that the machine on which you store both of them has enough power to solve both of the tasks.
55:41
You can say: okay, the Sybase guys were stupid; it was the beginning, they didn't know what they were talking about. Maybe pessimistic locking is not the solution; maybe working with something like optimistic locking would allow us to use the same system. It works. But it works for, hopefully, a one-terabyte database.
56:03
Afterwards, it doesn't function anymore. And this is exactly because they work differently. If we consider the hardware utilization of the operational system, we observe the cash register behavior, right? Making sales, continuously making sales; the CPU load looks like this.
56:24
It's almost always at its peak. The data warehouse had three big analytical queries today; during those, it was at its peak. If you take these two hardware utilization curves and put one over the other,
56:42
what do you think would happen here, here, and here? It wouldn't work. They wouldn't be able to make sales. The hardware wouldn't be able to manage both of them. And I think this is kind of the story I wanted to tell you
57:02
about how data warehouses work in practice. So let's go on with our lecture, yes? Sound is up. In the last part of the lecture, we want to show you some applications of data warehouses.
57:25
So what are these typical queries that would need a data warehouse and online analytical processing for answering? And typical queries are, for example, if you focus on some certain unit
57:45
and on some certain time span, how much did they sell, okay? You can do that with all kinds of different intervals, with all kinds of aggregations, so in a month, in a week, in a year, for different units.
58:04
And then: what were the combined sales for the first quarter? So you aggregate over queries that already use aggregations. And what you see now is that the six-way join that the Walmart CIO had in mind
58:21
is exactly what you need. If you put that in plain SQL over a normalized SQL data schema, then you would need a lot of joins, a lot of aggregations, a lot of subqueries that may be correlated with the outer query. And for every database analyst, for every database administrator,
58:42
or somebody who has set up a database and controlled a database at some point, this spells disaster, because those queries are virtually impossible to optimize properly. And then you have these 20-minute monsters that will run forever and block your system.
59:04
So apart from the complex formulation of the query, it really needs complex joins, multiple scans over the whole dataset for performing the different aggregations, and this is very time consuming. So first you have a very big chance of getting the query wrong,
59:24
so it doesn't really say what you want it to say, and once you notice that or once you get the error from the database, you will have lost a sizable amount of time. And then even if you do it right on the first shot,
59:41
it will still need a sizable amount of time anyway. What can a data warehouse do to answer these queries in a more efficient way, more quickly? And the idea is basically that you do not rely on normalization.
01:00:01
but that you rearrange the data and pre-aggregate some data that is always important. If I'm interested in sales for a day, I might be interested in sales for a week, for a quarter, for a month, for a year.
01:00:20
So time is something that is naturally ordered or naturally aggregated. I cannot predict every exact query, but I can point out that some aggregations might be more probable than others. Consider for example geographic locations.
01:00:40
Is it very probable that somebody will ask a query combining the sales of Braunschweig and Kiel and Lübeck? Well, he might be interested in some old Hanseatic city thing or whatever.
01:01:02
Yes, the query exists, but it's probably not a very interesting point of view. On the other hand, having all your markets in Lower Saxony or in Northern Germany intuitively would make more sense because that is a strategic decision that you can really think about.
01:01:24
You know, like, okay, we need more markets in Lower Saxony or we need to put some advertising campaigns in the big cities of somewhere. So we know already some of the very probable aggregations that will happen and what we do in the data warehouse is we pre-aggregate them.
01:01:45
Here for example, the time period. If the basic period is a day, then we can for example see the fiscal week, the fiscal period, the fiscal year. Those are typical periods of time that occur in accounting.
01:02:02
If somebody has to do a tax return, the fiscal year is all that is important. So while collecting the days, I can already, after the year is done, pre-compute the fiscal year. And whoever does a tax return or whatever statement can rely on this pre-aggregated data.
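As a sketch of this pre-aggregation, reusing the hypothetical star schema from above: once the year is over, the fiscal-year totals are computed once and stored, and later queries hit the summary table directly (a materialized view would be the more common mechanism; syntax varies by system):

```sql
-- Pre-compute the fiscal-year aggregate once, instead of re-scanning
-- years of daily facts for every report.
CREATE TABLE sales_by_fiscal_year AS
SELECT d.fiscal_year,
       f.product_key,
       SUM(f.revenue) AS total_revenue
FROM sales_fact f
JOIN date_dim d ON f.date_key = d.date_key
GROUP BY d.fiscal_year, f.product_key;

-- A "direct hit": no aggregation at query time.
SELECT total_revenue
FROM sales_by_fiscal_year
WHERE fiscal_year = 2009 AND product_key = 42;
```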
01:02:26
Okay? This is the basic idea. This is basically the trick that data warehouses use. And the concept is called a cube. So you don't have just the table of the data, but you have different dimensions, like in 3D,
01:02:46
you know, that enlarge the table into a cube with different time spans, where every field in your database can be pre-aggregated along the dimension of time,
01:03:01
or along geographic dimensions or along customer groups or people who live in big cities or people who are male or people who are female or something like that. And this leads to the very typical star schema that we will discuss next lecture in detail.
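Many SQL systems (PostgreSQL, Oracle, SQL Server, for example) support this cube idea directly with GROUP BY CUBE, which computes all combinations of the listed dimensions in one statement; a sketch over the same hypothetical schema:

```sql
-- GROUP BY CUBE produces subtotals for every combination of the dimensions:
-- (year, region), (year only), (region only), and the grand total.
SELECT d.fiscal_year,
       c.region,
       SUM(f.revenue) AS revenue
FROM sales_fact f
JOIN date_dim     d ON f.date_key     = d.date_key
JOIN customer_dim c ON f.customer_key = c.customer_key
GROUP BY CUBE (d.fiscal_year, c.region);
```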
01:03:25
We really have the time key, which is connected to a whole time table with different degrees of pre-aggregation. And now posing a query on one of these time spans may lead to a direct hit.
01:03:43
I want all the data of the fiscal year? No need to aggregate it; just take it from the pre-computed data. That's quick. Still, if I need to aggregate Lübeck and Kiel and Braunschweig, I can do that, because I have all the data for the day and for the city.
01:04:04
But again, then I have a six-fold join. In a database, it would be deadly to pose such queries; on a data warehouse, it's still possible. Good. The data warehouse is basically the repository for the analytics that are put on top of it.
01:04:30
So what happens is that all these tools, data mining tools, visualization tools, work on the data that is provided by the data warehouse.
01:04:40
And the most renowned is online analytical processing. There are a lot of tools for that, and every company has its own tool set that works on the very same data. The second thing that is very common is called KDD, knowledge discovery in databases, which uses different data mining processes or data mining algorithms on the data
01:05:02
to detect hidden connections between some things. So what if you find out that a certain customer group always buys a certain product? You can do targeted advertisements, things like that.
01:05:20
You can make big sales at points where these people are around. And you can set up some, well, maybe special advertising campaigns for people who don't buy a certain product. So everything that you don't know about your customers, your sales, your products,
01:05:41
but that you can find out by analyzing what is happening in the stores, is interesting for you, is interesting for strategic decisions. Then we have, of course, data visualization and reporting, which is very interesting for getting ideas into management. You know, if I can show how something behaves or if I can predict how something behaves,
01:06:02
then my related comments or my proposals for management decisions will of course be much more believable and much more obvious for the management. So the chance of succeeding will be higher.
01:06:21
This is why all these consulting companies, McKinsey or you name them, always have these wonderful colorful slides with all kinds of graphs, you know, how things behave and how the world works, because it makes them credible. And this is something that we can do too.
01:06:42
OLAP, the Online Analytical Processing, is as I said a form of information processing and what it needs is basically timely information, so the information has to be accessible, it has to be accurate. As I said, we need to clean data, we need to normalize data,
01:07:01
we need to transform data sometimes, so that it is understandable. And, well, timely, yes: for a management decision it can't take a week to get the data, obviously. But still, it's not like at the cash register, where you put in the customer card
01:07:20
and then it should go, okay, here's the amount, here's your bill and move on. And not, well, let's wait for the server, nice weather today, isn't it? Oh, now it comes. This is not a good customer experience, so you need to be quick, you need to be on the spot.
01:07:42
It means seconds. In OLAP, minutes and hours are not unheard of, and sometimes it may even take a week if it's a very important and very complex thing to derive, okay. And OLAP comes in several flavors, some of which we will discuss in depth.
01:08:03
There's ROLAP, there's DOLAP, there's MOLAP, there's WOLAP, there's HOLAP. So there are a lot of things you can do with OLAP, and I will not go through all the acronyms. For example, ROLAP is OLAP on relational databases; MOLAP is for multidimensional databases, and so on.
01:08:24
So these are different flavors that are somehow concerned about how the data is stored and how the data is reported. We will go through some of them in time. Second thing, KDD, data mining. The basic idea is to find mathematical models, statistical models of the data in question.
01:08:45
And a model always summarizes what is happening. And a model can be used to predict trends, which is a very important thing. So if you have a database like with customers here where you get the zip code for everybody who bought something
01:09:03
and maybe the gender and the income and all kinds of demographic data, you might find that somebody who has children rather buys a minivan than a sports car or a coupé or something else, which might not come as too much of a surprise.
01:09:28
But then there are many things that are very surprising indeed. So especially if you go shopping and you have a retail market like grocery shopping,
01:09:41
what stuff is bought together? What things are typically sold together? Besides the obvious, you know, spaghetti and tomatoes, you will find some things that you would never have thought possible. For example, wine is very often sold with diapers.
01:10:07
It's because these people stay at home and rather have a bottle of wine there than go out to bars. If you think about it, it makes sense. You wouldn't have thought about it beforehand, but you can dig it out of your data. And that is what data mining does.
01:10:22
So all the important and all the hidden stuff that is inside your data is there if you just go looking for it. And with that you can have a family special, two bottles of wine with every extra diaper pack that you buy, or something like that, you know, and everybody will be happy.
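As a minimal sketch of what such basket analysis boils down to, counting support and confidence over shopping baskets, something like the following would do; the baskets and item names are made-up examples, not real sales data:

```python
# Minimal association-rule sketch: support and confidence over shopping baskets.
# The baskets are illustrative assumptions, not real sales data.
baskets = [
    {"wine", "diapers", "bread"},
    {"wine", "diapers"},
    {"spaghetti", "tomatoes"},
    {"wine", "cheese"},
]

def support(itemset):
    """Fraction of all baskets that contain every item of the itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """Of the baskets containing the antecedent, the share also holding the consequent."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"diapers", "wine"}))        # 0.5 -> the pair is frequent
print(confidence({"diapers"}, {"wine"}))   # 1.0 -> "diapers => wine" holds in this toy data
```

Real miners such as Apriori organize exactly this counting so that it stays feasible for millions of baskets.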
01:10:48
As I said before, if you can state something like SELECT * FROM customer WHERE total_spend > 100 euro,
01:11:00
you can build some models that, for example, say: if somebody buys a minivan and the age is larger than 35, then he has a family, he probably has a good job, and the total spending will be rather large. If somebody is male and comes from Braunschweig, then he might have money
01:11:25
because this is a zip code of the Eastern Ring. And there are all the, I don't know, professors and Volkswagen managers and teachers and whoever lives there, you know, like they might have money
01:11:45
for some reason or the other. Well, not the professors, but the Volkswagen managers at least. These are typical rules that you can deduce using your data.
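A toy sketch of such hand-written rules; the attribute names, the zip code, and the cutoffs are invented for illustration:

```python
# Rule set in the spirit of "minivan and age over 35 implies high spending".
# All attributes, values, and thresholds here are illustrative assumptions.
def likely_high_spender(customer: dict) -> bool:
    if customer.get("car") == "minivan" and customer.get("age", 0) > 35:
        return True  # probably a family and a steady job
    if customer.get("sex") == "m" and customer.get("zip") == "38104":
        return True  # hypothetical zip code for the east side of Braunschweig
    return False

customers = [
    {"name": "A", "car": "minivan", "age": 42},
    {"name": "B", "car": "coupe", "age": 28, "sex": "m", "zip": "38104"},
    {"name": "C", "car": "coupe", "age": 30},
]
print([c["name"] for c in customers if likely_high_spender(c)])  # ['A', 'B']
```

In practice you would not hard-code such rules; a classifier or an association miner derives them from the warehouse data, which is exactly the point here.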
01:12:05
And knowing these rules is a business advantage, because your competitors may not know them, and then you can target these customers. You can target the interesting market segments, and how to do market segmentation will also be part of this lecture. The questions you can answer with that are: which products or customers are the most profitable,
01:12:23
what markets, what outlets have sold the most over the last years. And the decisions you can take on the grounds of these results are: where should you open more shops, where should you close down shops, which customers should be targeted for promotions, and so on. So you need some reason to do something.
01:12:43
Should you increase production, should you decrease production? Producing more is good but not if the market is saturated. How do you find out whether the market is saturated? You can do market research or you can look at your own data. What happens? If the sales stay on the same level for some time, the market may be saturated already.
01:13:10
Second thing is who is the user? The data warehouse is online analytical processing so it's kind of like a business cockpit.
01:13:20
But still, it's not the managers that do it. It's too complex for the managers. What you need are analysts. You need decision support analysts for the specific questions that the management may raise. Say the management raises the question of what we should do strategically.
01:13:43
Should we go into the far east market? The decision support analyst goes down to the data, drills down to everything that could be implied by the data or not.
01:14:00
You never know what's in it. You define the information that is needed. You discover the information that is needed, with every tool that is applicable. And then you have all the nice charts, all the nice visualizations, that help you to get a well-founded decision for the management.
01:14:24
Then the management may follow you or not. Well, if you're a successful decision support analyst, your word will count for something. And the problem is really that even if you're a decision support specialist,
01:14:48
you will go through different levels of analysis. You will very probably not pick the right query or the right algorithm on your first try.
01:15:02
You will certainly look at the data, find out how to cluster it, find out what makes sense, what makes less sense and then draw some conclusions from that. And with every conclusion you will find out that you really want something different,
01:15:23
something more specific, or something altogether different because you've bet on the wrong horse at some point. And in the end you will find what you're looking for, hopefully. This is typical of explorative analysis.
01:15:41
You explore the data, you look at some aspects of the data just to see what you're really looking for. So you're building up your mindset which is somehow steered by the data that you find. And the more you dig for something, in the end you will probably find it.
01:16:02
That's basically how you're working. And the typical explorative line is this, now I know what is possible and from that I can deduce what I really want to have. But if I don't see what I want, I don't see what I need.
01:16:25
So that's basically what you do and this is exactly how the data warehouse is actually built. If you have a decision support system, working with the data is totally different from working with the database.
01:16:42
In the database you have the crisp SQL statement, and it gives you what you ask for, not more and not less. In the decision support system you detect trends, you detect connections. And you as the support specialist are in control of which path to follow, which information to ignore, what is relevant, what is irrelevant.
01:17:13
And this is basically what you do. It's like requirements engineering, only that you don't know all the requirements at first, but factor in new requirements every time you learn something new.
01:17:29
If the thing that you learned is something important, factor it in as a new requirement. If the thing is irrelevant, just leave it out, stick with the old thing, and explore somewhere else.
01:17:42
Okay? This is basically who the user is. And for the lifecycle of the data warehouse itself, there's a system development lifecycle that is very renowned. It starts of course with the design. And the design is a typical software engineering requirements analysis.
01:18:02
You talk to the users, you talk to the management. What are they interested in? What are their primary goals? What is the subject the data should be centered on? You see what systems are there, what data sources do you have? So what do you have to factor into your data warehouse?
01:18:23
Then you look at the key performance indicators. So what really is interesting? Is it really the money you earn? Is it your customer base that is the important thing? Is it the products you want to sell, the products you want to market? What are the key performance indicators?
01:18:42
And then you find out how the management works, you know, you try to find out how they make decisions, how they come to conclusions. And this is a process you have to support. It's not always the correct process, but you know, you can't change humans too much.
01:19:02
If somebody is a good manager, then probably his way of managing is successful. And you can't rush in like McKinsey and say, oh, everything has to change. I mean, there was the great failure of the ISO 9000 norms, you know.
01:19:22
The people came rushing in, auditing, auditing, auditing, and everything has to change. And after the audit, nothing worked anymore because nobody felt comfortable with it, nobody knew what to do. Well, it's more efficient. Yes, it's more efficient, but it doesn't work.
01:19:41
And that is kind of the problem here. Try to map the decision-making process underlying the information needs, and then finally design the schema. So you have to talk to a lot of people before you're ready to define the schema. And that is really necessary. So there is no standard way. If somebody tells you there's a standard way, just use
01:20:05
this data and this data, and this table and this table, and that's it, then it's wrong. It doesn't work that way. Okay, the next thing is to get to a prototype. And the prototype has to constrain and in some cases reframe the end user requirements.
01:20:26
So, of course, everybody wants a system that makes totally wonderful decisions that are 100% correct and will grant large, large benefits for everything they do.
01:20:40
That doesn't work either, you know. You just have to constrain it a little to what is possible and what is sensible. And sometimes you can work with people and say, well, I know you want that, but it's not really what you need, let me tell you. And I can change it a little. Then try deploying it.
01:21:01
Rollout process is a very big problem in most companies, especially in big companies, because you have to, again, talk with a lot of people. You have to train people working with that. And you have to be very patient before people start using the thing in the intended way.
01:21:21
And sometimes you have to listen to people because they may have reasons for not using it in the intended way. So that is a good thing. Maybe that is a change that you should work with. And after the deployment phase is over, it's day-to-day operation. And what is needed in day-to-day operation, you have to
01:21:41
control the data warehouse, you have to monitor the extraction transformation loading process, you have to see what happens in your data warehouse and control the whole thing.
01:22:00
Then, basically, you need to enhance what you're doing. As I said, all these different parts of the life cycle are connected to each other, even if you have the initial requirements. The deployment process will tell you what you need.
01:22:24
Can people work with that? Are people comfortable working with that? For every step from the design to the prototype, you have to feed back some information. From the prototype to the deployment, you have to feed back some information and so on.
01:22:41
For every step, you really have to feed back information into the step before until, in the end, you are again at the design phase, where you rearrange the logical schema, get new data, get new data sources, or find new ways of exploring the data that you have.
01:23:06
The classical software design life cycle and the data warehouse system design life cycle are a little bit different, because in the classical domain, the requirements are broken down to all the components of the system.
01:23:28
You know what the system is going to do, which is not true for a data warehouse. In the data warehouse, what you do with the data depends on what you find out from the data.
01:23:45
So what else you need in terms of information sources depends on what you explored before. And it's kind of linear, just reversed in a way. You don't start with the requirements and then build the program, build the database, and run the application on top of it.
01:24:06
But you start with a data warehouse where you put everything together and then you use some algorithms to find out what the data is about and how the data is connected. And from that, you derive certain rules and you derive certain possibilities of using the thing.
01:24:27
And this is basically what the requirements are about, what it should use. And then it comes back to, okay, the requirements are I need new data sources that should be part of the data warehouse. Or I should normalize the data in a different way.
01:24:41
Maybe the subject that I chose was wrong. And back it goes to designing the data warehouse. It's kind of reversed from the classical top-down approach. So the classical life cycle starts with the requirements: gathering, analysis, design phase, programming, testing, integration, implementation.
01:25:05
That's quite normal. That's what we all know from software engineering. Once you install a data warehouse or once you deploy a data warehouse in any organization, you will probably start differently. You will first implement it.
01:25:23
After having the basic infrastructure, you need to integrate the data because it comes from different sources. You have to test for bias, what is in there. Are you taking data that is incomplete or data that is inconsistent with what you have?
01:25:42
You have to program against the data. You have to find out what algorithms you want to run, what interesting structures in the data you want to discover. Then you design the decision support system on top of that and then you look at the results of the decision support system.
01:26:04
And you try to figure out whether what they predict is good and viable or is unhelpful. Depending on whether it's unhelpful or really already good, you understand what the requirements actually are.
01:26:22
And you go back to implementing the warehouse. Okay? That's a little bit different from what we already know. Some people call the system development life cycle of data warehouses cycle life development system, which is kind of like just the thing reversed, which is kind of a bad joke.
01:26:48
But some computer scientists like it. The development cycle of a data warehouse is usually data driven. Whereas the normal software engineering development life cycle is systems driven.
01:27:06
In software engineering, you design a system that does something. In data warehousing, the data tells you what you need. What you find in the data, what you explore in the data, sets the requirements for what you actually need and what is to be implemented.
01:27:27
Once the data is integrated and tested, you can write programs and the results show what correlations are in the data. What hidden connections are in the data. Because that is a trick, the connections are hidden in the data.
01:27:42
If you had known them upfront, you could have had requirements. But you don't know them. You have to find them. And once you find them and once you understand them, you have to make adjustments to the design of the whole system. And then the cycle starts all over, which is why it's often called a spiral technology.
01:28:06
So you go around in circles, and your data warehouse or your decision support system becomes better and better, and the decisions become of higher quality. But it takes some time. It's an explorative thing.
01:28:21
And finally, you have to operate the data warehouse. You have to do the everyday job, which can be broken down into the monitoring. So you need to see that all the systems are running, that everything is up. Then this magic extraction transformation loading process that will be a major part of the next two weeks.
01:28:47
And finally, the analyzing phase where you really work with the data. And what happens in monitoring is basically that you do a normal surveillance of the data sources.
01:29:01
You find out whether they are productive, whether they are responding, whether they are giving you the data that you want. You find out which data modifications are in the operational systems. How can they be reflected in the data warehouse? And this basically sets the stage for the next steps.
01:29:23
The monitoring techniques can be active mechanisms within the basic systems. So it can be event condition action rules. So if a payment of over 10,000 euro is recorded, transfer it to an economy account or something like that.
01:29:42
Or it can be on every update, do something. And what you do for the replication is basically you take snapshots of the operational data. Which is basically a view, for example, Oracle does that. Or you immediately replicate data.
01:30:02
So IBM, for example, the DB2 products always use direct replication and write the data that is important, or the updated data, into a different table. And this table is then transported into the data warehouse.
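A minimal sketch of such an event-condition-action rule as a database trigger that captures changes into a separate table, here in SQLite; the table names, the threshold, and the schema are assumptions:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE payments(id INTEGER PRIMARY KEY, amount REAL);
CREATE TABLE change_log(payment_id INTEGER, amount REAL);  -- staging table for the warehouse

-- Event: a new payment. Condition: over 10,000 euro. Action: record it for transfer.
CREATE TRIGGER big_payment AFTER INSERT ON payments
WHEN NEW.amount > 10000
BEGIN
    INSERT INTO change_log VALUES (NEW.id, NEW.amount);
END;
""")
con.execute("INSERT INTO payments(amount) VALUES (250)")
con.execute("INSERT INTO payments(amount) VALUES (25000)")
print(con.execute("SELECT * FROM change_log").fetchall())  # [(2, 25000.0)]
```

The change_log table plays the role of the "different table" mentioned above: the ETL process later reads and empties it instead of scanning the operational tables.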
01:30:20
There are also protocol-based mechanisms where you just have logs of what happened to the operational data. And you use these logs for updating the data warehouse. Problem is that this protocol format, this log format, may be ambiguous. So it's sometimes hard to see what really happened.
01:30:43
And, of course, there are some application-managed mechanisms, which are basically very hard to implement for legacy systems. So if you have some old system, you need an application that knows exactly what the software does and how to get the data out of there.
01:31:03
But it also can be done. And if you have to integrate a couple of systems, then data comparison or time stamping can be done: who did the update first, is it the same update, what happened in the update. But that's very handcrafted. Then comes the extraction step, in which you take all the data that you need to put into the data
01:31:27
warehouse from the operational system, whether it be from the view or from the logs or wherever you got it from. And usually this data is quite a lot, because if you take every single update and put it into the data warehouse,
01:31:45
then it will put a large stress on the data warehouse and it will put a large stress on the operational system because the operational system has to perform the update or has to perform moving the data. This is of course not what you want, but you want something that overnight maybe
01:32:03
from 12 to 1 where it's kind of like getting very quiet in production anyway, you want to do that work all in a batch. And then the operational system is kind of ready for the next day and that's it.
01:32:21
Or between Christmas and New Year, so typical periods of inactivity. This is what you use for the extraction process where you take the data from the operational system. But of course you have to be sure that you pick certain ways of making sure that everything really gets into the data warehouse at some point.
01:32:44
You can do that either periodically. For example, if you have weather information or stock market information for brokers, then you need to update it, I don't know, the stock market probably every second or every five seconds,
01:33:04
weather information probably every hour. If it's the marital status of people, updating it once a year should be enough for most of us. There are very few people who marry twice, thrice, or four times a year.
01:33:26
That is what you have to think about. You can also do it actively, so on request, new item comes in, so please update. Or event driven, so if something strange happens, immediately make a snapshot.
01:33:44
For example, big sale or something like that, immediately make a snapshot, put it into the data warehouse. Or the Christmas run, you know like where everybody buys the last Christmas presents, probably shorter update periods are needed during the Christmas time or pre-Christmas sales than for the rest of the year.
01:34:06
Or of course you can do it immediate, so whenever something happens, report it. This is a very rare case. It's really only applicable in the financial sector if you have stock market new.
01:34:22
If something starts to slump, you should sell immediately, not in an hour when it's down on its all-time low. So basically immediate is only for time-critical applications, for real-time applications that you need. On the other hand, it's usually done periodically or after a certain number of transactions that run through.
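As a sketch of such a policy, here is a buffer that ships changes either after a fixed period or after a certain number of transactions, whichever comes first; the threshold, the period, and the flush target are assumptions:

```python
import time

class ExtractionBuffer:
    """Collects operational updates and flushes them to the warehouse staging area
    either periodically or after N transactions, whichever comes first."""
    def __init__(self, flush, max_txns=1000, period_s=24 * 3600):
        self.flush = flush            # callable that ships a batch to staging
        self.max_txns = max_txns
        self.period_s = period_s
        self.buffer = []
        self.last_flush = time.time()

    def record(self, txn):
        self.buffer.append(txn)
        overdue = time.time() - self.last_flush >= self.period_s
        if overdue or len(self.buffer) >= self.max_txns:
            self.flush(self.buffer)   # e.g. write a CSV for the nightly bulk loader
            self.buffer = []
            self.last_flush = time.time()

buf = ExtractionBuffer(flush=lambda batch: print(f"shipping {len(batch)} rows"), max_txns=3)
for sale in [{"item": "wine"}, {"item": "diapers"}, {"item": "bread"}]:
    buf.record(sale)                  # the third record triggers the flush
```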
01:34:49
This of course also depends on the hardware and the software used for the data warehouse and the data source. If you're still under the limit, then you can do it immediately, obviously.
01:35:03
But if your system is stressed, doing it immediately is a bad thing and you should do it periodically every night when the system is under less stress. So it depends on your software and hardware capabilities, what you find in the company. The transformation process is basically about adapting the data and finding out how it relates to data in the other information sources.
01:35:29
Also the data quality is a big concern here. So how to get the data consistent? Can it be that somebody lives at a certain address?
01:35:41
So everybody knows these forms that you find all over the internet where you have to register for something. What kind of address do you type there if it's not an important system? You type something like hoo-hoo. How should the system know that it's not a real address, but that I don't want to tell my address to some system that starts sending me spam mail?
01:36:09
Do you want to put that into the data warehouse? Probably not. You want real addresses. So build some rules for what real addresses look like. Real addresses are, say, street names followed by some number for the house.
01:36:27
And simple rules can very easily find mistakes in the data or bad-quality data. Then you need to integrate the data.
01:36:42
So if you have different data sources, they might store things differently. For example, one source encodes buy and sell as 1 and 2, another just as B and S. You have to carry these into the same format. It doesn't matter what the representation format of the source or the target system is.
01:37:04
But you have to decide for one, and you have to transform into it. Then, how do you deal with keys? Make sure that a key is really unique. Make sure that if you have a foreign key, the referenced record actually exists in the referenced system.
01:37:22
What about the data types? Is the data of the right type? Stuff like that. So pretty simple stuff that can severely hamper the data quality if you don't do it. Normalization: do you write Michael Schumacher, or Michael comma Schumacher, or Schumacher Michael, or Schumacher comma Michael?
01:37:49
So you have to find some way to store it. It doesn't matter what it is but in the target system it should be the same.
01:38:00
Date handling: the typical American date or the typical European, or at least Central European, date. Decide for one. It doesn't matter which one, but keep it the same over the whole system. Measurements: inches to centimeters, or something like that, happen very often.
01:38:24
Spacecraft have failed for having the wrong measures. So whether something drops a couple of centimeters close to orbit or a couple of inches can be fatal.
01:38:41
It's proven to be. Make sure you handle that. Calculated values: something including the value-added tax, or something excluding the value-added tax, is basically the same price, just with the tax added or not.
01:39:02
Multiply by the value-added tax rate and it's the same. Decide for one. Same with aggregation: have daily information pre-aggregated into weeks, pre-aggregated into months, pre-aggregated into years, because you might need that at some point.
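A small sketch of what this harmonization looks like in code; the two source conventions, the encoding map, and the 19% VAT rate are assumptions for illustration:

```python
from datetime import datetime

# Illustrative raw records from two sources with different conventions.
source_a = {"side": "B", "date": "10/28/2010", "price_net": 100.0}    # US date, net price
source_b = {"side": "1", "date": "28.10.2010", "price_gross": 119.0}  # German date, gross

SIDE_MAP = {"B": "buy", "S": "sell", "1": "buy", "2": "sell"}  # assumed source encodings
VAT = 0.19  # assumed value-added tax rate

def normalize(rec, date_fmt, gross=False):
    """Carry one source record into the warehouse's single common format."""
    return {
        "side": SIDE_MAP[rec["side"]],
        "date": datetime.strptime(rec["date"], date_fmt).date().isoformat(),
        # Decide for one convention: store gross prices everywhere.
        "price_gross": rec["price_gross"] if gross else round(rec["price_net"] * (1 + VAT), 2),
    }

print(normalize(source_a, "%m/%d/%Y"))            # {'side': 'buy', 'date': '2010-10-28', 'price_gross': 119.0}
print(normalize(source_b, "%d.%m.%Y", gross=True))
```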
01:39:24
And finally cleaning. So consistency checks can be very easy like the delivery date should be after the order date. Nothing can be delivered before it is ordered.
01:39:44
Or if you have some missing values, there's no address field for some customer. You have null values. How do you deal with that? Do you exclude the customer or do you say well let's try to find out what the address is? Maybe some other system has a valid address.
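A minimal sketch of two such cleaning rules over invented order records; a real cleaning step would report the violations to a data steward rather than just print them:

```python
from datetime import date

orders = [
    {"id": 1, "ordered": date(2010, 10, 1), "delivered": date(2010, 10, 5), "address": "Mühlenpfordtstr. 23"},
    {"id": 2, "ordered": date(2010, 10, 9), "delivered": date(2010, 10, 2), "address": "hoo-hoo"},  # impossible dates
    {"id": 3, "ordered": date(2010, 10, 3), "delivered": date(2010, 10, 4), "address": None},       # missing value
]

def problems(order):
    """Collect consistency violations instead of silently dropping the record."""
    found = []
    if order["delivered"] < order["ordered"]:
        found.append("delivered before ordered")
    addr = order["address"]
    # Crude plausibility rule: a real address should contain a house number somewhere.
    if addr is None or not any(ch.isdigit() for ch in addr):
        found.append("implausible or missing address")
    return found

for o in orders:
    print(o["id"], problems(o))  # 1: clean, 2: both rules fire, 3: address rule fires
```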
01:40:02
All things that have to do with data cleaning and I can tell you that data cleaning is a very challenging topic. And we will also deal with that in the next couple of lectures. It's not as easy as it sounds. Finally the last step of the ETL process is the loading step. So what you do is basically you take it offline.
01:40:22
You make it during the night or during the weekend when the system is not under stress. And then you batch everything together in a single batch and you let it run. Because most databases have high performance loaders. Whereas the individual update is quite a costly operation.
01:40:47
You basically split between the initial load, which is the initialization of the data warehouse, and then periodical loads to keep the data warehouse updated. Of course the batches for keeping the data warehouse updated, all the data that accumulated over one day or one week,
01:41:05
is much smaller than the initial load. So if the initial load takes a week or a month it doesn't matter. But then the everyday work should be fairly quick. And for this reason because the initial loading can be very big,
01:41:22
there's usually bulk loaders that will take all the information that is in the source systems and put them in some format that can be worked on very quickly. So usually it's something like comma separated values that are just flushed into the system.
01:41:44
And the actual loading of the rest of the data, of the new data, is of a, well, aggregated nature. So you need some partitioning, you need some incremental refreshing of the data,
01:42:00
and the increments are usually fairly small. So that can be done during the night or on the weekend.
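A rough sketch of that split, using SQLite as a stand-in for a warehouse with a bulk loader; the file name and the schema are assumptions:

```python
import csv, sqlite3

con = sqlite3.connect("warehouse.db")
con.execute("CREATE TABLE IF NOT EXISTS sales(store TEXT, item TEXT, qty INTEGER)")

def bulk_load(csv_path):
    """Initial load: flush a whole CSV export of a source system in one batch."""
    with open(csv_path, newline="") as f:
        rows = [(r["store"], r["item"], int(r["qty"])) for r in csv.DictReader(f)]
    with con:  # one transaction for the whole batch, far cheaper than row-by-row updates
        con.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

def incremental_load(increment):
    """Periodic load: only the small batch that accumulated since the last run."""
    with con:
        con.executemany("INSERT INTO sales VALUES (?, ?, ?)", increment)

# bulk_load("sales_full_export.csv")         # initialization, may run for hours
# incremental_load([("BS-1", "iPhone", 3)])  # nightly increment, fairly small
```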
01:42:24
Well, then you have to analyze the data. The data access: what do you need? So, how many iPhones were sold in Braunschweig stores in the last three calendar weeks of 2008? You can do that in OLTP systems, but it takes a lot of time. If I have the information pre-arranged and pre-aggregated by weeks, and by areas like the Braunschweig stores, and probably by companies, only the T-Mobile ones,
01:42:49
then I can do it very quickly. So the district should be interesting, the aggregation of time should be interesting,
01:43:00
everything that can be pre-aggregated in a sensible, in a semantically meaningful manner should be pre-aggregated because it will save you a lot of time during the online processing phase. And basically it's called online analytical processing, which means now and directly in front of me,
01:43:24
not offline analytical processing, which means I come back tomorrow and want a report. It very often amounts to that, but still it shouldn't. Really pre-aggregate wherever it's possible.
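A small sketch of such pre-aggregation for the iPhone question from before; the table layout and the figures are invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales(day TEXT, district TEXT, item TEXT, qty INTEGER)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    ("2008-12-08", "Braunschweig", "iPhone", 2),
    ("2008-12-09", "Braunschweig", "iPhone", 3),
    ("2008-12-15", "Hannover", "iPhone", 1),
])

# Materialize the weekly aggregate once, so OLAP queries don't rescan the raw rows.
con.execute("""
CREATE TABLE weekly_sales AS
SELECT strftime('%Y-%W', day) AS week, district, item, SUM(qty) AS qty
FROM sales GROUP BY week, district, item
""")
# "How many iPhones were sold in Braunschweig in a given week?" is now a cheap lookup.
print(con.execute(
    "SELECT week, qty FROM weekly_sales WHERE district = 'Braunschweig' AND item = 'iPhone'"
).fetchall())
```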
01:43:40
Well, the way you do that is basically with a multi-dimensional data model. So you split a dimension like time into multiple reference frames: this is the day, this is the week, this is the month, this is the year. The same with geographical regions: this is the city, this is the state, this is the country, this is the area or continent or whatever.
01:44:07
And then you have certain operations, which we will also discuss in detail, like roll-up, drill-down, slice-and-dice, and rotation. You don't have to understand all these terms right now.
01:44:21
We will come to that when we go through all of OLAP. Basically, if you have this multi-dimensional data, you just slice through it to get the right information at the right aggregation level.
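A small sketch of roll-up and slicing on a tiny cube, here using pandas; the figures and the dimension levels are invented:

```python
import pandas as pd

# Tiny cube: two dimensions (time, geography) and one measure (sales).
df = pd.DataFrame({
    "month": ["2010-01", "2010-01", "2010-02", "2010-02"],
    "city":  ["Braunschweig", "Hannover", "Braunschweig", "Hannover"],
    "state": ["Niedersachsen"] * 4,
    "sales": [10, 7, 12, 5],
})

# Roll-up: aggregate the city level away, up to the state level.
rollup = df.groupby(["month", "state"])["sales"].sum()

# Slice: fix one dimension value (city = Braunschweig) and keep the rest.
slice_bs = df[df["city"] == "Braunschweig"].pivot_table(
    index="month", values="sales", aggfunc="sum")

print(rollup)
print(slice_bs)
```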
01:44:43
And of course you have the data mining step, where there can be hidden patterns that you need to find. Basically, knowledge discovery in databases: people who buy wine also buy diapers very often, something like that. And of course the prediction: I sold so many units of some product during the last four years; how many am I going to sell in two years?
01:45:06
Okay, so trend channels, things like that. And it's very useful for answering like how do sales evolve, what happens in the near future.
01:45:20
And the techniques that you use for that are basically clustering, classification, regression, association rule learning. There are lots of them. We will discuss some basic algorithms that already do quite a lot.
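As a minimal sketch of the prediction part, a linear trend fit over invented yearly sales, extrapolated two years ahead; real trend analysis would of course also model seasonality:

```python
import numpy as np

years = np.array([2007, 2008, 2009, 2010])
units = np.array([1200, 1350, 1500, 1700])  # illustrative sales history

# Fit a linear trend: units ~ slope * year + intercept.
slope, intercept = np.polyfit(years, units, deg=1)
forecast_2012 = slope * 2012 + intercept
print(round(forecast_2012))  # naive extrapolation two years out
```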
01:45:42
So this is basically everything I wanted to say to introduce data warehousing. Next lecture we will talk about the architecture of a data warehouse. What are the basic architectures that you will find in practice? What are the storage models? How do you store the data? What layers do you have in the data, what aggregation layers? And of course a little bit about the middleware that you need for making the connection from your operative systems to the data warehouse.
01:46:05
Any questions? Concerning data warehouses? No? Then see you next time!