
Data Modeling (Logical & Physical Models) (11.11.2010)


Formal Metadata

Title
Data Modeling (Logical & Physical Models) (11.11.2010)
Title of Series
Part Number
3
Number of Parts
13
Author
Contributors
License
CC Attribution - NonCommercial 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language
Producer
Production Year: 2011
Production Place: Braunschweig

Content Metadata

Subject Area
Genre
Abstract
In this course, we examine the aspects of building, maintaining and operating data warehouses, and give an insight into the main knowledge discovery techniques. The course deals with basic issues like storage of the data, execution of the analytical queries and data mining procedures. The course will be taught completely in English. The general structure of the course is: typical DW use case scenarios; basic architecture of DW; data modelling on a conceptual, logical and physical level; multidimensional E/R modelling; cubes, dimensions, measures; query processing, OLAP queries (OLAP vs. OLTP), roll-up, drill-down, slice, dice, pivot; MOLAP, ROLAP, HOLAP; SQL99 OLAP operators, MDX; snowflake, star and starflake schemas for relational storage; multidimensional physical storage (linearization); DW indexing as a search optimization means: R-Trees, UB-Trees, bitmap indexes; other optimization procedures: data partitioning, star join optimization, materialized views; ETL; association rule mining, sequence patterns, time series; classification: decision trees, naive Bayes classification, SVM; cluster analysis: k-means, hierarchical clustering, agglomerative clustering, outlier analysis.
Transcript: English (auto-generated)
It's my pleasure to welcome all to the lecture Data Warehousing and Online Analytical Processing. And today we will go into the modeling domain. So last week we were talking about, well, data warehouses at large and how the storage structure and the architecture of such warehouses can be done.
We were also talking about a little bit how to model these entities and the relationships especially with respect to, well, aggregation levels. So something could be on a daily schedule, on a weekly schedule, on a yearly schedule or something like that.
And how to denote that in normal entity relationship modeling or in UML. This week we want to continue with the data modeling. So we want to go a little bit deeper into the issues that are kind of interesting.
Nice try. Good. We want to go a little bit deeper into the modeling issues that we find. And the basic idea is there are two kinds of modeling that are needed. There's logical modeling. So how can the logical model be distinguished from the normal logical models
as we have them in relational databases? And the second one is about the physical modeling: how to build the actual tables, and how to store the data on disk so that accesses can be done very fast.
This is what we want to do today. So what is a logical model? A logical model deals with the structure of the data, the relationships of the data. And as we have said, in the center of all databases, all data warehouses,
is the subject that you as a business leader of tomorrow are interested in, as an analyst to say the least. And you define these kinds of subjects.
What is it? Is it the customer? Is it the sales? Is it what you gain, revenues or whatever? What you are interested in. And then you create the facts and the dimensions for each subject that we want to know about. If I talk about sales, what would be appropriate facts for that?
The price of things or how many items did I sell? What products did I sell? These are all facts. And then we have the dimensions for all these facts. So how many things did I sell?
When did I sell them? Time dimension. Where did I sell them? Geographical dimension. What product categories did I sell? Product dimension. So these are the dimensions that go with it. And of course we can have all the dimensions in different granularities.
Again, this depends on the focus of your warehouse and what is interesting. The logical structure of the multidimensional model, as we showed last lecture already, then works as follows. Basically, you have the subjects that you are interested in.
Sales or the prices or inventory or something like that. And for these you have facts. So you have the amount. You have the unit price. You have your stock, what is in stock and stuff like that.
And all these facts of the subject tables apply to some dimensions. So for example, for the sales it's interesting. To whom something was sold.
Where something was sold. What was sold. And of course when was it sold. For each of these questions you have a dimension.
The time dimension. The client dimension. The article dimension. And for each of these dimensions you have different aggregation levels. So to whom did you sell something? Well, to a certain client which is basically single person.
The client lives in some city. So I sold something to New York or to customers in New York. The city might belong to a district, the district to the state of New York.
Region, country, different levels of aggregation. From the individual customer. From Mr. Muller. Up to all the customers in Germany. Or all the customers in Europe. Depending on what I want to know.
Same goes for where was something sold. From the individual store, here middle in Braunschweig or something. To all the stores in Germany. All the stores in Europe. What was sold? From the individual product that I have.
I sold chair or whatever. To I sold furniture. I sold, I don't know, whatever it is. House products or something like that. Different levels of aggregation that you can determine. And there's no way to get that automatically somehow.
I mean if you have some dimensions like time, the granularity will be quite general. I mean it's kind of clear that time will somehow be distinguished in minutes, hours, days, months, years.
It's quite obvious. You might need something different sometimes. So for example some people will rather have the calendar year. Other people might have fiscal years. That are not from the first of January to the 31st of December. But rather from March till the end of February.
Because it's just a fiscal year and tax return issues. You can have that on a different scale if you want to. And the application also has to determine what the right scale is. For example if I have a stock company.
So I'm selling stocks or I'm dealing with stocks. Then seconds would be a good time unit. Because seconds in the market, in the stock market, can mean a lot. A year is probably interesting for investors and hedge funds and stuff like that.
But still you would go from second to year. What if I have a shopping mall or something? Is the exact second something was sold really interesting?
Very probably not. You would say, well, probably hourly if at all. Rather on a daily basis probably. At the end of the day you total up the turnover and see what you sold. Make the inventory. So that depends on your application and there's no right and wrong.
There's only I'm interested in this subject. Therefore I want to see these facts. And the aggregation dimensions are like that. Depending on the analyst. This is logical design. The dimensions are entities that are chosen for a specific purpose in the analysis.
So what I want to know is for example an article in terms of the sales. What was sold. In terms of individual purchases.
So who bought what together. Or in terms of the price. Same dimension. Same kind of articles. So it can be used to define more than one cube. And usually in the dimension I have a hierarchy.
Might be a bigger hierarchy with lots of levels. Might be just a single level or two levels or three levels. It only depends on the actual application that you're dealing with. The dimension hierarchies are always organized in classification levels.
And you have different dependencies in the classification level. So for example several days make a month. Several months make a quarter. Several quarters make a year. Several days may also make a week. But then a week may not make a month.
So for example a week could span over two months. It's difficult to put them into some perspective. What do you do with weeks that start in February and end in March? You don't know. So you just have an alternative way.
You have one hierarchy over here. And you have the other hierarchy over here. Same dimension. Same kind of information but different form of aggregation. So that can happen. And the idea is already one idea that we know from relational databases.
That is a functional dependency. An attribute is functionally dependent on some other attribute if, for all the values in the domain of that attribute,
there exists exactly one value, and that is the interesting point, in the other domain that corresponds to it. So for every single day I can find exactly one month of which the day is part.
The first of May. Obviously there's only one May it can belong to. It can never be in April or in November because it's the first of May. Also, if I have this May day in a certain year, the year is clear as well.
For every date that I have here, it's one specific date. The first of May of, I don't know, some year. This is one instance of the day.
It becomes immediately clear what month it is, what quarter it is, what year it is. It's somehow encoded in the date. Because you have the day, you have the month and you have the year here.
And this is a functional dependency. It's immediately clear from knowing the date what year it was. There's just a single solution. Good? And then you have the classification schemas where you say a classification schema is a semi-ordered set of classification levels.
Why semi-ordered? Nobody?
Well, because you can have alternatives. I can have a day, a month, a year. I can have a day, a week, a year. Well, probably not even a year because a week can stretch several years.
So, tough luck. Might well be. Well, that's the basic point of semi-order. Not everything is kind of in a hierarchy, but there can also be different branches in the hierarchy.
And the smallest element, dk0, is the smallest classification level, the smallest granularity that we have. If we record data on a daily basis,
there's no way that we can find out the exact time of the day that something was taken. For this we would need the hour or time stamp. But we have to decide for one. At some point we have to say, this is a granularity I want to start with and this is a granularity I want to build up on it.
And if you have a fully ordered set of classification level, this is called a path. So, for example here, from day to month to quarter to year, that is a path. From day to week is a different path.
But they are kind of ordered along this classification level. And we can have two classification levels here, but still we have this single element. This is the smallest element that is common to all paths.
The branching, in this case, happens in the smallest element, could of course also happen further up the tree. So at some point it can just branch and we have different paths through the classification hierarchy. Well, so if we have such a path in the hierarchy,
then, to define the classification hierarchy, we can say that the classification hierarchy is a balanced tree where the nodes are the domains of all the different levels,
unified with the 'all', which means everything aggregated. Okay? And the edges in the path are the functional dependencies that we have.
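To pin that down a little, here is a small formal sketch; the notation is just one common way of writing it, not necessarily the notation from the slides. A classification level A functionally determines a level B if every value of A is assigned to exactly one value of B:

    A \to B \;\Longleftrightarrow\; \forall a \in \mathrm{dom}(A)\ \exists!\, b \in \mathrm{dom}(B) \text{ assigned to } a

    \text{Day} \to \text{Month} \to \text{Quarter} \to \text{Year}, \qquad \text{Day} \to \text{Week}, \qquad \text{every level} \to \text{All}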
Once I have a path, I can always say what is the month, what is the year, and stuff like that. Good? Everybody knows about dimensions? It's not too difficult, is it? Good. Let's make an example. We have a classification hierarchy from the product dimension, for example.
We have the article, the product group, the product family, and the product category. What are the dimensions that we can have and what is the classification hierarchy? Obviously, it starts with the article and this is the lowest part, the leaves of the hierarchy,
because that means there's no smaller granularity. There's no part of an article of some product, no spare parts or something like that; it's just not modeled in the system.
The smallest we can have is an article, so the TS-56 or the TR-34. These are just articles. And what type of article are they? To which product group do they belong? Well, they are video recorders.
And besides video recorders, my shop can also sell camcorders and digital cameras and whatnot. Everything can live in this thing. So this second category in my dimension
reflects the parent nodes of leaf nodes. Good. A video recorder can be from the video department so the product family goes here and the video also has all the camcorders
and has all the audio equipment and all the TVs and stuff like that. Then we have the product category. The product category for video, audio, TV is electronics. And then we have what we just said,
which we unify, or union, with the 'all', which means that we put all the categories together in one single aggregation class. It doesn't matter whether it's electronics or clothes or whatever, it's all the articles that we sell, all aggregated together.
Okay? We always want that. This is the root node of our classification hierarchy. And now I can look at the hierarchy at every depth that I want to. If I want to have information by category,
I can do that. If I want to have information by product, by article, I can do that. In the data warehouse, we will pre-aggregate all those values and the analytics can just take these values that are pre-aggregated, that are pre-computed, okay?
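As a rough sketch of what such pre-aggregation amounts to in SQL terms; the table and column names used here (sales, article_dim, product_category, turnover) are made up for illustration and are not the lecture's actual schema:

    SELECT d.product_category,
           SUM(s.turnover) AS total_turnover              -- one pre-computed value per category
    FROM   sales s
           JOIN article_dim d ON d.article_id = s.article_id
    GROUP  BY d.product_category;

The same pattern with GROUP BY on product_family or on the article itself gives the other levels of the hierarchy.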
Quite an easy concept that is very sensible. Now, the second thing that we need to know about is a cube. What is a cube? A cube is basically like the Rubik's cube. You all know the Rubik's cube that you can turn
and that has different cells in it of different colors. It's kind of similar here because every part of this Rubik's cube is a single cell and these different cells contain one or more measures
with respect to the dimension that make up a cube. So a cube schema consists of a granularity and a set of measures. For each granularity that we have, we have a cell.
So each article gets a cell. Each product category gets a cell. Each product family gets a cell. Each cell for the article gets the price as a measure.
Each product family gets the aggregated prices as a measure. Each product category gets the aggregated prices as a measure, okay? So the cube is the set of all the cube cells
where all the domains of the different granularities, of the different dimensions are crossed with the domains of the measures. So for every product type, I have the price.
For every product family, still have the price and so on. Clear what a cube is? Good. Let's make an example. The coordinates of the cell are the classification nodes corresponding to the cell.
So let's say we are interested in sales, okay? And we say that the measure for sales is the turnover, okay? How much did we sell? The turnover. This is the measure. Good.
So by what dimensions is this measure, the turnover, specified? We can have the turnover per client. This is our who. We can have the turnover per store, our where.
We can have the turnover per article, what? And the turnover per day, when? Okay? This is what we record. Per article, per day, per store, per client.
And since we have the granularities here as the lowest granularity, our cube will also contain the aggregation steps of everything that we took. Okay?
So the dimensions that we have are the supplier, the city, the quarter and the product, for example. And I go there and say, well, here is my product dimension. For every computer, for every audio, for every video device, for every telephone, I build a cell.
The time dimension, yeah, so this was the when dimension, time. Could be quarters. This was the product dimension, our what product. And this is the where dimension, location.
And then we have the supplier dimension. I can't have four dimensional cubes on the slide, so I will just do many of them. This is kind of this dimension here, supplier one, supplier two, supplier three.
And for each cross product that we have, a specific product in a specific quarter at a specific location for a specific supplier, we get a single cell showing us the measure that we are interested in, the turnover.
So I know, what does this measure mean?
What is the basic statement that this cell makes? Yeah? Berlin, yes?
Exactly, it tells us that a specific supplier, supplier one, because that's a cube, sells over 900 computers in the fourth quarter in Berlin. And this is how we fill the cells with information.
That's easy. Well, this was three dimensions and basically the fourth was put together since I had several cubes of that type next to each other. In terms of storing that, these are basically arrays.
And a multi-dimensional cube is just a big matrix that can have multiple dimensions. And we can put any number of dimensions that we want into it, which is basically what is meant by a cuboid.
And again, we can have different aggregation levels over that. The highest aggregation level that we can have is everything. We want to aggregate everything. We want the total turnover of our company.
Very often called the apex cuboid. All the measures in our cube are summed up and make the apex.
And then we can go down and always aggregate only one dimension. So, for example, we can aggregate all the cities together. We can aggregate all the items together.
Or we can aggregate all the years together. And the other two are basically the information that we have. We can have the cities and the items, the cities and the years, the items and the years, or we can have the city, the item and the year. So basically how does this cube look like?
How many dimensions does it have?
This cube down here. What does it look like? Well, it's obviously exactly one of these cubes.
Why is that? Because it has an item, it has a year, and it has a city.
All three dimensions are present. And then I fill it up with the cells. My three-dimensional drawing is not too impressive, I grant you that. So basically what comes out is one of these cubes
like we've just seen with all the different cells in it. So now let's aggregate one of the dimensions, of the three dimensions.
So we are only interested in city and item. And we aggregate over all the years. How does this cube look like?
Do you know what it is? It's just a grid, it's a spreadsheet, exactly. Because you have the item, and you have the city, and that's it.
And the values, the turnover values, for all the years are aggregated. So from this cube I can just learn how many items of what type I sold in what city.
Over. Over all the years. Okay, same happens here. How many items aggregated, so all kinds of items, did I sell in some city per year?
And finally, how many items did I sell per year, specific products, independently of where I sold them? Okay, I always aggregate one dimension. Good. Then let's move to the next level.
How does this cube look like? Yes? It's a list, exactly. Because we just have the city, the location. So, Braunschweig and Hanover and, I don't know, Göttingen.
And for every point we just say how big the turnover was, okay? Over all the years, over all the products, in this particular city.
And finally, how does the apex look like?
It's a single number, exactly. This is the number of how many products I sold worldwide over all the years. Just a single number. Different aggregation levels. Depending on what I want to know, I need the different aggregation levels.
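A hedged way to picture these cuboids in SQL, assuming a hypothetical base table sales(item, city, year, turnover):

    SELECT SUM(turnover) FROM sales;                                   -- apex: one single number
    SELECT city, SUM(turnover) FROM sales GROUP BY city;               -- the list: one value per city
    SELECT city, item, SUM(turnover) FROM sales GROUP BY city, item;   -- the spreadsheet: years aggregated away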
So, for example, if my query deals with North Africa or the US, aggregating all the locations is not a good idea. In that case, I would be in this part of the cuboids.
I can use any of them depending on what else do I want to know, okay? But I need the location information to cut out everything that is North Africa or everything that is the US. Because in this part of the cube, I just cannot distinguish between the US and Europe and North Africa and anything.
The information is aggregated, okay? This is the interesting point in working with that. So, the full data cube that I have is a lattice of cuboids of different dimensionality.
Starting with the full cross product, so every dimension that I have counts for the dimensionality.
If I have three dimensions, it's a three-dimensional cube. If I have 15 dimensions, it's a 15-dimensional cube. And ending where everything is aggregated into a single value: the count, that's the measure, that's the turnover, okay?
So it has different aggregation levels, that's right. But the aggregation levels depend on the dimensions that reflect the turnover, okay? So from the individual values here, where I basically see all the different items that are sold,
everything by product, by city, by year, up to I see what I sold independently of product, year, location, everything, okay?
Good. The problem is that the more dimensions you have, the more complicated it gets. Because this was three dimensions, let's do four dimensions. Well, then we have the time, the item, the location, the supplier, which basically allows for four different aggregations.
Because I can aggregate either one of these, okay? I don't care for the time, I will take item, location, supplier, okay?
I don't care for the product, I take time, location, supplier. I don't care for the location, I take time, item, supplier. I don't care for the supplier, time, item, location. Four possibilities. Now, take two out of each. Time, item, location and time, item, supplier will both result in time and item.
This is a cuboid that I can derive from both of them. On one hand, I aggregate all the suppliers. In the other case, I aggregate all the locations, okay?
Two ways to get to the same result, because it's the same two-dimensional cube. Good? And so on and so on, and then I get the time and the item and the location, which will yield the time, and so on and so on.
Clear? You see this lattice is growing. It's growing amazingly fast. Now we have to store all of that, obviously. So we see these cubes are very practical for query processing, because they contain the pre-aggregated results for different queries.
And depending on what the query is, whether the query has a time component, I need one of the time cubes.
If the query has a location component, I need one of the location cubes. If the query has a time and a location component, I need one of the cubes having time and location in it. Aha! So now there we go.
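Just to give a feeling for how fast this lattice grows, here is the standard counting argument (not something computed on the slides): with n dimensions and no classification hierarchies there are 2^n cuboids, because each dimension is either kept or aggregated away; if dimension i additionally has L_i hierarchy levels, the count becomes

    \text{number of cuboids} \;=\; \prod_{i=1}^{n} (L_i + 1)

which is already a good hint why one would not want to materialize the complete lattice.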
And on this multidimensional model, we can have different basic operations. And the classical operations that we know from databases, from relational database systems, is the selection on one hand and the projection in the other case.
But we have some more. We have the cube join, the sum and the aggregation. We'll come to that in a minute. Let's start with the multidimensional selection. It's basically specified exactly like in the case of databases. The selection on a cube that has some dimensions and some measures, by some predicate, is defined as the selection of
all the cells from the cube where the predicate applies.
And the variables have to be classification levels, which functionally depend on the classification levels of the granularity. I choose the classification level and the different measures that I get for each cell. In each cell, there are several measures and these appear in different classification levels.
So I could say something like, I want the sales. Sales has different measures. For example, the price and the location where it was sold and blah, blah, blah, blah, blah.
These are the measures of sales. This is what I'm interested in, in the query. Now I select something and by selecting, I fix one dimension to a certain value.
So the product group, which defines my aggregation level in the dimension products, is for example set to video. Which means for the evaluation of this query, for the evaluation of the selection, what cubes can I take?
I can take all the cubes that contain the article dimension, the item dimension, because otherwise I couldn't talk about product groups.
Yes? And I need the measures for all the other things, but they can be aggregated. If I'm only interested in the product group, I will take obviously one of the item cubes here.
Here I can select the product group and everything else is aggregated. So this would be the right cube to do our selection on. Understood? Good.
Okay, the projection works similarly. The projection basically projects some of the measures that we are interested in. So if I have for example the sales, it might have two measures, turnover and sold items. However, I might not be interested in both of them.
So I project one. Or I might be interested in both of them. So I project two. Okay? So what I do is I make a projection, for example on turnover and sold items.
So I want to see both items, both measures of the sales cube. Okay? And then I can define what I want to see, for example the price. Along the price dimension, please show me the turnover and sold items measures from the sales subject.
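A minimal sketch of what selection and projection look like if such a cube is stored relationally; the table sales_by_group(product_group, turnover, sold_items) and its columns are purely illustrative assumptions:

    SELECT product_group, turnover, sold_items   -- projection on the two measures
    FROM   sales_by_group
    WHERE  product_group = 'Video';              -- selection fixes one dimension value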
It's exactly the same like we had in relational databases. So if I have different cubes, joint operations can be performed.
For example, I could be interested in sales, but I could also be interested in prices. Because I want to know what revenue I'm creating. And revenue is sold units times the price. I need both for calculating the revenues.
Okay? So what I do, I do a join. For example, if I want to know the turnover that I have and I don't have it as a measure now, I could just calculate it as unit price times units sold.
And for this I take two cubes, sales cube and the price cube. One contains the unit sold, the other contains the unit price. And they can only be joined if there is the same granularity. Because if one calculates in, I don't know, product groups,
so I have only prices for product groups. And the other one wants to say, well, I have prices for different products. I sold different products. Then I just cannot aggregate them because I don't have the price for the individual product,
but just for the product group. So to perform the join, I need them to be on the same level of granularity. I could aggregate the one with the smaller granularity, so for example all the products into product groups and then join them.
That would work again. So we need the same granularity and then the join basically is I make the cube of the new granularity, the granularity that is both of them, and just combine the measures.
So both measures are part of the new cube in the specific granularity. This is a cube join. Everybody understood what a cube join is? Basically I'm building one new cube from the previous two cubes that contains both measures.
No more, no less. Good? When the granularities are different, we have to aggregate the one with the smaller granularity to the needed granularity.
So for example if you have sales times inventory and we say that the sales are aggregated by day, article, store, and client, but the inventory is only considered by month, article, store, client, then obviously we cannot put them together
because what is the individual item that we can join? What is the information that we can join? In one cube we have all the days, in the other cube we have all the months. So what we basically need to do is
Yes? Exactly. We have to look at the sales table and aggregate all the days into month, and then we can join them. This is the way to go. So basically we cut out this edge and build a new one here.
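A rough sketch of such a cube join in SQL, assuming two hypothetical tables daily_sales(day, month, article_id, store_id, client_id, units_sold) and monthly_inventory(month, article_id, store_id, client_id, stock); the names are invented for illustration only:

    SELECT s.month, s.article_id, s.store_id, s.client_id,
           s.units_sold, i.stock                      -- both measures end up in the new cube
    FROM  (SELECT month, article_id, store_id, client_id,
                  SUM(units_sold) AS units_sold       -- first roll the days up to months
           FROM   daily_sales
           GROUP  BY month, article_id, store_id, client_id) s
          JOIN monthly_inventory i
            ON  i.month      = s.month
            AND i.article_id = s.article_id
            AND i.store_id   = s.store_id
            AND i.client_id  = s.client_id;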
This is the basic idea behind that. Not too difficult. So and then we have the aggregations. The aggregation is the most important operation for all the OLAP stuff that we're doing,
and we have a whole bunch of them. So in SQL we know the sum, the average, the count, the minimum, the maximum. These are the typical aggregation functions that are built into SQL. There are a lot more, there are some statistical functions,
there's something like the mean, the mode, standard deviations and stuff like that. These already exist in today's SQL dialects, and this is what is also used in OLAP basically. So most of the time these basic functions do the trick.
A little bit of arithmetic operation, so times and divided by and stuff like that, yes. But basically this does the trick. So what we could for example do, we could do, I take the sales cube, and I want the sum of all the product groups, all the cities and all the month.
I would just sum it up for every month, for every city, for every product group, independently of the items. No, not independently of the items, independently of the supplier in this case.
This is basically this type of aggregation, it's basically how I build the higher order cubes. And if I leave one out, I will go another level higher in the lattice and move to the sum of all sales with respect to the city and the month,
not looking at the product groups, not looking at the suppliers aggregated. That's basically what I do. Well, one of the ideas of course is that you have to compare the granularities to see whether they are really the same or which one is finer.
And we say that a granularity is finer than some other granularity if, for all the classification levels in the coarser cube, there is a level in the finer cube that can be uniquely assigned to it.
So if I have a month granularity in one cube and a day granularity in another cube,
I have a functional dependency from every day to a month. So for every entry in the month cube, it is clear what every day in the day cube belongs to. So the day cube is finer than the coarser month cube.
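Written out a bit more formally (again my own shorthand, not the slide notation): a granularity G_1 is finer than a granularity G_2 if every classification level in G_2 is functionally determined by some level in G_1:

    G_1 \preceq G_2 \;\Longleftrightarrow\; \forall\, g_2 \in G_2\ \ \exists\, g_1 \in G_1 : g_1 \to g_2

For the example, {Day, Article, Store, Client} is finer than {Month, Article, Store, Client}, because Day → Month and every other level trivially determines itself.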
Pretty obvious, but I mean this data warehousing, most of the stuff here is obvious. So that's kind of nice to see. And with that we go into the detour and look at some of the classification schemas,
cube schemas and stuff like that in practical terms. Good, so we've seen how the logical modeling looks like. We've seen that we have these classification schemas, we have cube schemas, classification hierarchies. All this information starts with building the data warehouse.
So someone does an analysis and says, okay, the classification schema should look like this. We should build such cubes because this is told by our business analyst. These are the processes which are involved in our business. The problem is these classification schemas, these hierarchies, they evolve with time.
Requirements change, data changes also, so data warehouses also need to change. Let's see a practical example of how this can happen.
So I will take our previous example, Saturn. They sell a lot of electronics and they decided to build a data warehouse around 2003. So the first classification hierarchy, they had looked something like this for the mobile phones.
I will simplify it and just take mobile phones to get a better feeling of what happens here. We have mobile phones and they classified it as GSM capable or 3G capable. Then the corresponding products under this classification.
This was fine until 2008. The classification was enough for what products they had in their database and data warehouse. In 2008, the 3G devices became affordable.
The hardware became cheap enough for the producers to decide, okay, we'll all go to 3G. It's the new way. So O2 decided, you know what, our old GSM handy device should also be 3G capable. So then our old schema is not reflecting the reality anymore.
We don't have a connection for the new data between GSM and the O2 with modified hardware which is now capable of 3G. So the data has evolved with it. We need to evolve also our classification hierarchy.
Further, of course, later the new iPhone came on the market. The 4G iPhone. It's a new category of products which Saturn had to have in their data warehouse. For this, they also had to modify their classification hierarchy.
Of course, all these modifications have to keep track of how the data has evolved. If I wouldn't be able to keep track of what happens with my data, I wouldn't be able to know what was happening three years ago.
At the same time, I need to store the new modifications of new products coming to market. So I need a versioning system which allows evolving data to be also stored. We observe that this is very, very dependent on the time, right?
Products evolve with time, data warehouses grow old too, they evolve with time. So why not use timestamps to mark something like this? This was one of the most used solutions. Just use timestamps in a validity matrix to know how our classification hierarchy evolves.
This is what we are doing also with this example. From the beginning of the data warehouse, we know that mobile phones are classified as GSM until some point in time which has not been reached.
The same happens for the products, so this transitively goes further. However, for this product from O2, we know that they have produced GSM XDA's only until the seventh month of 2008.
So we put here an end timestamp for our validity schema. We say that after this date, this product belongs to category 3G. Pretty simple to do. The same can be done also with growing our classification hierarchy with new elements.
So here I've just inserted a new element which is valid only from above this date, the April of 2010, with the corresponding objects. Whatever it may come will be inserted here.
As I've said, something like this can be well stored in a validity matrix. Such a matrix can be represented as shown here with rows as parent nodes and columns as child nodes. Under this assumption, we will have the mobile phone node which has GSM and 3G and 4G as children with their validity dates.
We will then have the further description of the hierarchy with the GSM sub-node. Again, with the products, with the XDA, with the limited validity between 2003 and 2008.
And of course, new nodes with the corresponding children. I think it's pretty intuitive now.
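One way such a validity matrix could be stored relationally is sketched below; the table and column names, and the concrete dates, are my own illustration of the Saturn example rather than the lecture's actual schema:

    CREATE TABLE classification_edge (
        parent     VARCHAR(50),   -- e.g. 'Mobile phones', 'GSM', '3G'
        child      VARCHAR(50),   -- e.g. 'GSM', 'O2 XDA'
        valid_from DATE,          -- start of the edge's validity
        valid_to   DATE           -- NULL = still valid (open end)
    );

    -- the O2 XDA moving from the GSM node to the 3G node in mid-2008:
    INSERT INTO classification_edge VALUES ('GSM', 'O2 XDA', DATE '2003-01-01', DATE '2008-07-31');
    INSERT INTO classification_edge VALUES ('3G',  'O2 XDA', DATE '2008-08-01', NULL);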
Of course, it may be that after a while, GSM devices will never be produced again. And I won't have them in stock anymore. I won't offer them for sale anymore. I could just delete the node in the classification schema, right? Well, not quite. If I delete them, it leads to information loss.
I cannot do queries anymore which regards those kinds of devices. Although, as we have said, this information remains in the data warehouse. You don't delete information from the data warehouse, you just input information in there, right? So, deleting is surely not an option.
The solution to such problems is, again, modifying the classification hierarchy. As I've said, one can simulate this behavior and still allow queries like as is now compared to as was yesterday.
Or I can also do some analysis like as is versus as is. My schema has evolved. I have a new schema today. I want to analyze data from the last year. I can still say, okay, I don't care about this relation here in the past.
I just want to use the last schema because it's faster. I want to analyze the dates with the most actual schema, do a comparative analysis, and I don't care about this evolution. This is fully allowed by the validity matrix.
At the same time, I win in flexibility. If I say I'm interested in this evolving product, I'm interested in seeing how I've sold with regard to this evolution, then I can perform as is versus as was and say for the products after a certain date,
I will take the schema with the modification according to the validity matrix. For the products before a certain date, I will take the old schema. Pretty easy. The same thing I can do also by considering an old schema where I'm interested only in the past
and seeing how I was doing in 2004 or at the beginning. Different possibilities, as I've said, such a validity matrix offers high flexibility. Another common type of querying is like versus like,
where I'm actually interested only in those objects in the classification hierarchy, which are not evolving, the objects which are pretty stable. In this case are Nokia and Blackberry Bold. I just want to see how am I doing on these not evolving products, this is a possibility to do.
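As a hedged sketch of what an 'as was' query could look like against such a validity table, with an invented sales table sales(article, sale_date, turnover) and the classification_edge table from the sketch above:

    SELECT e.parent AS product_group,
           SUM(s.turnover) AS total_turnover
    FROM   sales s
           JOIN classification_edge e
             ON  e.child = s.article
             AND s.sale_date BETWEEN e.valid_from
                                 AND COALESCE(e.valid_to, DATE '9999-12-31')
    GROUP  BY e.parent;
    -- "as is" would instead join only the currently valid edges (valid_to IS NULL).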
Of course, not only the classification hierarchy evolves; another possibility is for the schema to evolve. We remember that we had not only this 'all' and then the 3G and GSM and 4G and so on, but we also have different classes which give our granularity.
We have our product group, we have product family, up to the 'all' class. Of course, this can also be modified. I want to be able to allow schema evolution as well.
I want to be able to allow schema versioning as well. Let's take an example so that this becomes clear. We have here different cubes, from sales to purchases and price, and one dimension, the article dimension.
During time, I observed that actually product family is not queried at all by my business analyst. They are actually not interested in it. But when I build my data warehouse, if the schema would look like this with product family in it,
I would still need to calculate and aggregate everything for the product family. You remember, the lattice was pretty big, so I would perform a lot of unnecessary operations. So I say, okay, the needs have changed. The business analysts don't want product family anymore, so I need to modify the schema. What can I do?
Should I just delete it? If I just delete it, of course, I lose information, because anybody who wants to query the pre-aggregated data which I already have in the data warehouse from the past won't be able to query it anymore. So what can I do?
I need to adapt the schema. The solution is well known from software engineering: I can do schema evolution, again with these timestamps. And the main advantage of something like this is that,
if I allow schema evolution, I have only one schema. I'm able to do fast querying and, at the same time, fast loading processes, but I lose flexibility because I have only one schema, namely the evolved one.
I can't query into the past. Of course, the other solution is, similar to SVN, to use versioning. I don't allow any data loss. I just say, okay, until some time ago I wanted to have product family.
And my data warehouse has a lot of data which is categorized and aggregated with product family inside. But beginning with this year, I'm not interested in product family anymore.
So I will save time and, at the same time, still allow queries into the past. Right? So the main advantage of schema versioning is that I'm allowing comparative analysis with the past. I'm allowing people to ask, how was it with the product family last year?
I can't compare the product family with today, because the product family level is not there anymore, but I can still compare the products. In order to do this, I would have to do additional calculations, which could be done on the fly. So the flexibility comes, of course, with a trade-off: more running time, right?
And this is about everything I wanted to say about the logical model. We can now further go to the physical model. Exactly.
So now that everybody has understood the logical model, with different kinds of cubes that offer different dimensions and aggregate all the other dimensions, how do we get that down to a physical model? How do we store the actual data? Because obviously we don't have cube stores that somehow store the different cells.
And the first idea that comes to mind when storing data is a database. We all love databases, so why don't we just store the data in a database? And this is exactly what the whole point is about.
So we have to set up a database environment somehow and store the data. And then we can go on to physical performance tuning, like indexing the data,
partitioning the data, distributing it over different servers, materializing views for aggregations and so on. This is what we want to do next lecture; next lecture we'll go into the ideas of indexing, partitioning, materialization and so on. Today we want to see what different possibilities there are to store our actual cube data.
And this is basically the storage architecture. I want to discuss some of them and just show some popular notions on what is good and what is bad. There are basically two kinds of implementation that one can think of.
The first one is obviously relational, because we have relational databases and we love relational databases. And there are different kinds of schemas that are prepared for that, basically the snowflake schema and the star schema; we'll go into that in a minute. Or we could just stick to the multidimensional idea of the cube and store it as a big, big matrix.
So, simply put, we store the cells of the matrix in a big array in storage and address every cell. Two possibilities, both with advantages and disadvantages.
So let's start with the relational model. If we remember what the aim of the relational model, of relational databases, was: it was basically that all the data is stored in a non-redundant way,
is safe and complete, and you can ask queries about it, and so on. So what we want is no loss of semantic knowledge, so all the classification hierarchies and things like that.
We want a translation from the multidimensional queries to SQL, because SQL is how you query relational databases. So for all the things that we had about queries last time, there should be a transformation, an automatic transformation, into SQL,
and these SQL queries have to be optimized somehow so that they can be evaluated quite quickly. So this is basically an efficiency point, and of course we also have the extraction, transformation, loading (ETL) process.
So when new data arrives, it has to be put into the tables. It has to be split somehow. Every cell goes somewhere and it has to be clear where it goes and that in a quick fashion. So if you want to go from the multidimensional model to the relational model,
you have to think about what representations there are for the cubes, for the dimensions, for the classification hierarchies and for the individual attributes. Because what we have on the one hand is the complex cube.
What we have on the other hand, in the relational database, is tables as we know them from Relational Databases 1. How does it work?
What can we identify? What can we put together? A basic idea, leaving out the classification hierarchies for a minute, is: well, I have a cube that has certain dimensions, and for each specific combination of dimension values we have values for the facts,
the aggregated values for the facts given those dimensions. So why don't we see the whole cube as one single table, and in this table we have the dimensions as columns, as attributes,
and we have the facts as attributes. And since the dimensions identify each cell, they form the primary key of the table, and the fact attributes are just the values of the facts for this specific tuple of dimension values.
they are just the values for the facts for this specific tuple of dimensions. So basically what I do is I build a big table, the cube table that has a number of dimensions,
D1, D2, D3 and here are the different tuples that are just different arrangements of the dimensions. For example, the Berlin stores in the year 2000
for the product group electronics or something like that. This is one of these tuples. They identify a cell, thus they make the primary key.
And for every such tuple, we have the measures, the facts assigned. So why don't we just put them right here, fact one, fact two and so on, fact n
and we just get them into the tuple. Then each such tuple in the table is basically a cell in my cube. Okay, and each attribute on the one side reflects a dimension
and each attribute on the other side reflects a measure, and this is exactly how it's done. This is called a fact table, okay?
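As a minimal sketch of what such a fact table could look like in SQL (the table and column names are illustrative, not the lecture's actual schema):

CREATE TABLE sales_fact (
  day_id      INT,            -- dimension: time, at day granularity
  store_id    INT,            -- dimension: geography, at store granularity
  product_id  INT,            -- dimension: product, at article granularity
  sales_count INT,            -- measure / fact
  revenue     DECIMAL(10,2),  -- measure / fact
  PRIMARY KEY (day_id, store_id, product_id)  -- the dimensions identify the cell
);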
This is the most important table of all. So it's quite easy; I mean, we could have thought of that ourselves. Still, I just said to leave the classification hierarchies out for a minute. What do we do with the classification hierarchies?
It's kind of difficult, no? Because now, for the different days, for the different parts here, consider the dimensions again for a moment. So we have D1 and D2.
Then for all the different dimensions, some of these things can be grouped into different classification hierarchies. But of course this is not part of the table. What do we do? Any ideas? No ideas?
How can we state, for example, if we have the day as a dimension,
time as a dimension and we have the day granularity in this column that several of these days belong to a certain month?
Yes, you could split the table, but what would you split? Which dimension? You would need exponentially many tables for doing all the different splits. Bad idea. Better idea?
Well, basically, consider this. The items here in the table are cells, okay?
So every row in the table is a single cell. It has a certain day, it has a certain location, it has a certain whatever, and it has a certain turnover, certain price or whatever it is, okay? If I consider the day and I want to see the month of this day,
does the month change for different tuples with the same day? Not really, does it? So the dimensions are kind of fixed.
The same day will always be in the same month and the same months will always be in the same year, okay? So what can we do? Yes?
Very well, and this is exactly how it's done. So if you had been around 20 years earlier, you would have invented data warehousing. It's a very simple idea. What you do is basically take the whole cube, okay, with all the cells, and every cell can be uniquely identified
by the product dimension, by the geographic dimension, by the time dimension, and has a measure in it. The measure goes here, and the dimension information, which article, which store, which point in time, goes here. This is our cube table, but only for the day,
only for the store, only for the article, the lowest granularity level. So what we now do to also introduce the classification levels is put in a foreign key relationship
to a fixed set of tables describing the dimension. So I put in tables that describe which day belongs to which month and which month belongs to which year, or which product belongs to which product group and which product group belongs to which product family,
and build a foreign key relationship, a one-to-n relationship: for every day the month is fixed, but a month can contain many days. A one-to-n relationship, okay, for each cell in the fact table
to the entries in the upper table. The foreign key idea is the basic trick here, because all the different tables, all the dimension tables that you have,
stay the same always. They do not change. But the facts in the fact table may change, and they are connected to a certain dimension via this foreign key. This is what is called a snowflake schema, and why is it called a snowflake?
Because all the facts are in the center. This is where the fact table is, okay? And then the foreign key relationships that I introduce go into dimension tables. The dimension tables may even have
different aggregation levels, okay? They are all dimension tables but this fact table contains the day. This dimension table tells me which day belongs to which month. This dimension table tells me which month belongs to which year
and since they are normalized I don't store too much, okay? What does it look like? In the middle we have the fact table, with the lowest granularity IDs.
This is a specific product sold on a specific day in a specific store and this is the fact sales value and this is the fact revenue value, okay? So on this day in this store of this product
I sold five items for a revenue of $50 or something like that. This is what the table says. And now for all these dimensions, so these are the dimensions, these are the facts, the measures. For all these dimensions,
for each of these dimensions I have foreign key relationship to the dimension tables that describe the granularity levels of this dimension. So for example, if I have the day here, there is a 1 to n relationship
to the dimension of time and the dimension of time says well, a day belongs to some week, a day belongs to some month, a month belongs to some quarter,
a quarter belongs to some year, and so on. Okay? These are normalized relationships so it's a minimum amount of information that I can store. I have to say at some point that the 1st of May until the 31st of May are days of May.
But I don't have to say it for every single May day that is in the fact table. If I say it once and for all, and just add this foreign key relationship and say, yeah, this is a day of May, then everything is in order. And I will do the same for all the other things.
So for example, the product goes into the product dimension and the location goes into the location dimension, okay, with different aggregation levels. And yes, the facts are stored in the fact table, at the lowest granularity.
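A minimal sketch of such normalized dimension tables in SQL, again with purely illustrative names, reusing the sales_fact table from above:

CREATE TABLE dim_year  (year_id  INT PRIMARY KEY, year_no  INT);
CREATE TABLE dim_month (month_id INT PRIMARY KEY, month_no INT,
                        year_id  INT REFERENCES dim_year(year_id));
CREATE TABLE dim_day   (day_id   INT PRIMARY KEY, calendar_date DATE,
                        month_id INT REFERENCES dim_month(month_id));
-- the fact table only references the lowest level of each dimension
ALTER TABLE sales_fact ADD FOREIGN KEY (day_id) REFERENCES dim_day(day_id);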
The information in the dimension table is only interesting for the aggregation that you need. So if I want to have something aggregated not by day but by month, I would have to find out for all the days
in the fact table that have a foreign key relationship to the month that I'm interested in. These I will take out of the fact table. I would select them from the fact table, aggregate them and can answer my query. Okay?
This answer is not stored anywhere. I have the minimum of information in this schema; I have to compute it at query time. Okay? And how do I compute it? Well, I join the sales fact table with the day table,
and I join the day table with the month table, and then I get the information.
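In SQL terms, with the illustrative tables from above, such a month-level aggregation might look roughly like this:

SELECT m.year_id, m.month_no, SUM(f.revenue) AS monthly_revenue
FROM sales_fact f
JOIN dim_day   d ON f.day_id   = d.day_id    -- fact -> day
JOIN dim_month m ON d.month_id = m.month_id  -- day  -> month
GROUP BY m.year_id, m.month_no;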
That's right. It's a fully normalized schema, and it involves a lot of joins. The more dimensions I have, the more joins I will have. The deeper the classification hierarchy in each dimension,
the more joins I will have. So it grows, definitely. Yes, that's right.
We will come to that in a minute. Good. So what are the advantages of the snowflake schema?
The size of the dimension tables, as Silvio just pointed out, is heavily reduced because they are normalized. I don't have to state for every day that occurs in my fact table to which month it belongs. I have a dimension table
that says once and for all, the 15th of November 1975 belonged to November. Or the video recorder belongs to the product group of video. Okay? Once and for all.
So every time this occurs in the fact table, I just have a foreign key to the dimension table. The only thing growing here is the fact table, and the queries will run fast,
modulo the joins. And the normalization pays off if the dimension is very sparse, so most measures corresponding to the dimension have no data, and the dimension has a long list of attributes which may be queried. If it's not sparse,
or if the dimensions are very short in terms of their attributes, it will not pay off, because then you have huge joins, many joins, which will make the system slower. This is roughly the idea
of snowflake schemas. So the disadvantage is that the fact table is really responsible for 90% of the storage requirement. It grows, because it contains all the facts with the foreign keys. And the dimension tables do not grow, because
they are fixed once and for all. And if I normalize the dimensions, I really need multiple joins, although the dimensions are not big anyway. So what do I actually gain? Yes, I do have
fewer problems with consistency, I can make changes more easily or faster. But since they are very small tables anyway, what is the point? I can control the redundancy efficiently, I update several rows
instead of one, but it doesn't really matter. So the normalization of the dimension tables is questionable, because it leads to a large number of tables, and this large number of tables needs a large number of joins. We all know from databases that joins are always the critical part
that may result in performance deterioration. So if you have a query that connects the product category with year and country, what would you have to do? Well, let's look at it.
I want the year, which is in the time dimension, and I want the country, which is in the store dimension. What do I have to do? The year is over here, the country is over here. One join,
two joins, three joins, four joins, and already five tables are involved. Then on the store side: one join, two joins,
oh, and we could also go over the week, sorry, so two joins, three joins. Okay, one, two, three, four, five, six, seven joins are needed,
and one, two, three, four, five, six, seven, eight tables are involved. Oh, and we also need the product category, that's right, so it gets even worse.
One more join, one more join, one more join. Very good, ten joins. And three more tables makes eleven,
all eleven tables are involved. That does not perform. Worst performance ever. What can we do?
Yes? Yeah. So if you pay attention, you see that all these tables around here in the dimension area are normalized. I don't store redundant information.
Every day is just stored once, every month is just stored once, every year is just stored once, every country is just stored once, every region is just stored once, but with a foreign key relation to express the one to n relationship.
What if we put them together in one single table, where for every store we note the state and the region
and the country, and this is my new location table with the tuples in it.
Okay. This is just a single big table, not four tables that need to be joined. It's denormalized, yes, because if two stores are in the same city, they obviously are in the same state,
they obviously are in the same region, they obviously are in the same country. So it is redundant information, but if the dimension tables are small, that is not a problem. That was one of our observations.
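A sketch of such a denormalized location dimension, again with illustrative names:

CREATE TABLE dim_location (
  store_id INT PRIMARY KEY,   -- lowest granularity level
  city     VARCHAR(50),
  state    VARCHAR(50),       -- higher levels are repeated in every row
  region   VARCHAR(50),
  country  VARCHAR(50)
);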
Also, this combined, materialized table would not be too big. Question?
The join itself costs time, you say? Yes, exactly.
The expensive joins are this one here, and this one here, and this one here. Why is that? Because the sales table is amazingly big, 50 billion records. If you want to join it with the store, the product,
or the day, you have to join 50 billion records against however many days, products and stores there are. You get an intermediate result which is probably not 50 billion anymore,
but maybe 5 billion, because I have 10 product categories. And now I go with this 5-billion-row table from the product to the product group. Yeah? Do the join. Again, a very time-consuming task. If you have a very clever
join order optimization algorithm that says, I do this small join first, then that one,
and the final join that I do is the big one, that would be okay, that would not be a problem. But the idea was to go one step further and to say, well, if I do these small joins anyway, and very often, why don't I just materialize
the result of this join in a single table? And I only have to do this join. Okay? This is the basic idea and this is exactly what leads us to the star schema. I use a denormalized schema for all the dimensions. I have the fact table in the middle
and the dimension tables around them like a star.
And the basic idea is, well, these are the tables containing all the information, product ID, product group, product category, time ID, day, week, month, quarter, year,
store, state, region, country and so on. Okay? All denormalized information. And the only joins that I now have to do are these joins. For all different queries on all different aggregation levels.
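Assuming denormalized dimension tables dim_product, dim_time and dim_location, each keyed by the lowest granularity level (illustrative names, not the lecture's exact schema), the category-by-year-by-country query from before now needs only one join per dimension:

SELECT p.product_category, t.year_no, l.country, SUM(f.revenue) AS revenue
FROM sales_fact f
JOIN dim_product  p ON f.product_id = p.product_id
JOIN dim_time     t ON f.day_id     = t.day_id
JOIN dim_location l ON f.store_id   = l.store_id
GROUP BY p.product_category, t.year_no, l.country;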
And of course there's a trade-off because if these tables grow very big and I'm not very interested in the product category or the year or the country then I carry all that information that makes my dimension tables bigger
in every query. Bad idea. But if the dimension tables stay small then I can afford that and save on the joins and the join order optimization. Good idea. And this is basically
the discussion of star schema versus snowflake schema. So the advantages of the star schema are that it improves query performance for often-used data. You have fewer tables, you have a simple structure. And with regard to every dimension you have
very efficient query processing, because you just need a single join for every dimension. The disadvantages are that if the dimensions carry a lot of information, or if the dimensions are very deep, so they have many levels of granularity, the resulting table
will be very big. And this resulting table will be part of every query on that dimension. You see, in the star schema I just have this single dimension table,
while in the snowflake I may have different granularity tables for each dimension. What if I'm only interested in the first one? Very often, 99% of the queries just touch the first granularity level.
Perfect for the snowflake schema, isn't it? The star table, in contrast, has many, many rows, and scanning it costs time
for every query, even though you don't need all of it. This snowflake table over here is very small, so using it for 99% of the queries is very efficient. So there's no right or wrong here. It's not that you always take the star schema and you're fine.
Or that you always take the snowflake schema and you're fine. It just doesn't work that way. Look at your query load, look at how your dimensions are built, and look at how big the materialized dimension tables will be in the end. This is basically what is done. So in the snowflake
the structure of the classifications is expressed in the table schemas, and the fact and dimension tables are normalized. This is important. In the star schema the entire classification hierarchy is expressed in just a single table, and the normalization
in the dimensions is broken, because they are denormalized. Okay? That's basically how it works. And if you have big dimensions, a star schema is a very bad idea. If you have many queries
on very coarse, highly aggregated granularity levels, a snowflake schema is a bad idea. If you have big dimensions and many queries
on high levels of aggregation, then it depends. At some point the star schema outweighs the snowflake schema, or vice versa. And you as an administrator, you as an analyst and builder of the data warehouse,
have to consider what is better and what is worse. That is basically what I'm saying here. Okay? So, for example, compare a star schema and a snowflake schema. In the snowflake we have the product group
to product category table and the product category to description table. Then we can join the product groups and the product categories and we have all the information there is, and we just have seven tuples in there.
Or we could put it into a star schema, so we have the product ID, the product group and the product category in one table, and we can easily see that in terms of tuples we are better off.
We only have four tuples. But if you count the cells, then we have the product, the product description, the brand, the product group, the product group description, the product category and the product category description.
Very long rows in the table. Whereas here, in the snowflake, we have the minimum amount of information that is possible. Storage versus evaluation speed. Okay? And we can see here,
so for example the information that mobile phones belong to electronics we have that three times. One stupid update and you will immediately have an inconsistency. Okay? Your decision.
You are the administrators. You build the data warehouse. Well, the problem is really: at what point, as an administrator, should you change? And there's no right or wrong, as I said; there's just a set of heuristics that you can follow.
So, when a typical query asks for coarse granularity, it's good to have it pre-materialized. When the volume of data in the dimension table is relatively low, it's good to have it materialized. When modifications of the classifications are rare compared to the insertion of new facts,
you can materialize it. In those cases you should move from snowflake to star schema. These heuristics work both ways, so if you are stuck with a star schema and your dimensions explode, because you're taking on all kinds of products,
you're Amazon or something like that, you started off with books and now you're selling everything from groceries to furniture to cars or whatever, your dimension grows and your dimension table grows. Bad idea, go back to snowflake.
Or if the volume of the data in the dimension table is high, go back to the snowflake schema. Actually, the decision about what you take,
snowflake or star schema, depends on what you need: whether you want fast query processing or have to control your storage. And most of the time people rely on some mixture, because some dimensions may be very sparse, a very good idea to normalize them, and some dimensions may be very dense,
a good idea to keep them in a star schema. And this is what is called a starflake schema: some of the dimensions are normalized, some others are denormalized, and both are part of the data warehouse. Okay, short break,
and then we go to the multidimensional models. Okay, we will reconvene at five to five; let's take it from here. So basically, as I said, in a
starflake schema the decision that you make for every dimension depends on the query frequency, the number of dimension elements, the number of classification levels, so how big the table will be, and on the materialization of the aggregates, just like you do it in normal databases. And in practice, of course, you have several measures,
several fact tables that are assigned to the dimensions. You might have one on the sales, the receipts, the revenue, whatever; for whatever subject you are interested in, you will have a different fact table,
and it may use the same dimensions but have different measures, different facts in it. So each of these fact tables is connected to a subset of the possible dimensions, and this is what is called a galaxy: different stars are put together into one galaxy, where different fact tables may or may not connect to the same dimension tables.
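A tiny sketch of such a galaxy, reusing the illustrative tables from before: a second fact table that shares dim_time and dim_product with sales_fact but carries its own measures and one extra dimension:

CREATE TABLE purchases_fact (
  day_id        INT REFERENCES dim_time(day_id),
  product_id    INT REFERENCES dim_product(product_id),
  supplier_id   INT,             -- a dimension the sales fact table does not use
  purchase_cost DECIMAL(10,2),   -- its own measure
  PRIMARY KEY (day_id, product_id, supplier_id)
);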
Okay. Then there are other schemas in data warehouses that are not used very often; the most popular, the most
often used relational schemas are the star and the snowflake schema. There are also fact constellations, which have pre-calculated aggregates, very heavy on the materialization but very quick for querying, and the so-called factless fact tables, where you don't have any measures,
where you don't have any facts, but just record which combinations of dimension values occurred. You can use that for event tracking, for example, which is done very often. So, what we have seen is that the relational model as such might have some disadvantages,
because you do have obvious transformation steps for how to get from a cube to a star or snowflake schema, and you can also automatically build all the queries with the foreign key relationships.
And once you have the query with its aggregation level, it's clear how it is transformed, or translated, into the right join hierarchy, and that can be done automatically by the application; that's not a problem.
But still, you have to translate it into that representation, and do you really want to see the SQL queries that come out of that, especially if you have a snowflake schema? What if we need 10 joins in a single SQL statement? For the user this is definitely not the way to go;
it has to be hidden by the interface, it has to be handled by the interface. So, if I abstract it away behind an interface anyway while using a relational database, I could also store the data multidimensionally: not use a database at all, but just use arrays.
Take the matrix of the cube as the basic storage scheme, and then use arrays in the addressable storage space for storing the values. So the elementary data structures
are again the cubes and the dimensions: each cube has a set of dimensions and a set of measures, but now all the values for the dimensions and measures are not put together in a tuple in some relation, but in an array,
directly in storage, all ordered by some criterion and physically linearized. Okay, so for example, if you have a two-dimensional cube, the first dimension has five values, the second dimension has four values,
then you have five times four, so twenty cube cells. How do you store them? Well, you start at one cell, you store the five cells of a row, and then you hop to the next row and again store its five cells,
and so on. So you have to decide which one of the dimensions you use for ordering the data. For example, if you have the linear storage scheme that I've shown here,
then you store all the time values right after each other: first all five values for the first time value, then all five values for the second, and so on (in the example on the slide these are the days of February). So every time value is stored linearized, one behind the other, and the different products
that you see here are basically just adjacent to each other. So what you can do is point at any of the rows and then just walk along the row until you get to the product that you want, and if you put an index
on top of it that can directly go to all the different rows, you will have an index on D2, which is basically the time dimension. You have to traverse this index
to get to the correct row, and then for the values of D1, since you cannot expect them to be ordered, you can either use an index there or you have to just scan through all the products that are there.
So, if you search linearly in such a cube, what happens? Basically, if you have a query like
jackets sold in February, you go to the 1st of February
and then run through all the different cells. Okay, and if you are interested in one special day in February, for example the 14th, if you want to access
this very cell, you have to look through the different array parts until you get to the 14th cell. So this is one day in February; it's part of the linear array, so it's no longer two-dimensional
but basically one-dimensional in storage. It just looks like one, two, three, four, five, six, seven and so on; this is the first week of February,
up to here the second week of February, and so on. Is it clear what the example means?
So basically, every row in that table corresponds to a certain value in the time dimension, and every column in that table
refers to a certain product. And I can either store all the time items after each other, or I can store the products after each other; I have to decide for one of the two, it doesn't really matter. So, a generalization is:
if I have a cube with n dimensions and m measures of certain data types, then the index of a cube cell with coordinates x1 to xn can be linearized as follows.
I take the x1 value, plus the x2 value minus one times the cardinality of the first dimension, because for each step in the second dimension I have to hop over a whole row of the first dimension.
Then, for the third dimension, each step has to hop over the cardinality of the first dimension times the cardinality of the second dimension, and so on.
So the steps are getting larger; that's the basic idea, and this is the formula that comes out. With it you can compute, for every coordinate vector, the unique position where the cell is stored.
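Written out (assuming 1-based coordinates x_i and dimension cardinalities |D_i|, which is how the lecture phrases it), the linearization rule is:

\mathrm{index}(x_1,\dots,x_n) = x_1 + (x_2 - 1)\,|D_1| + (x_3 - 1)\,|D_1||D_2| + \dots + (x_n - 1)\prod_{i=1}^{n-1}|D_i|

As a quick check against the small 5-by-4 example: the cell with coordinates (x1, x2) = (3, 2) sits at position 3 + (2 - 1) * 5 = 8 of the linear array.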
Clear? Good. So the order of the dimensions in the cube definition has a certain influence on how fast we are. Because what happens now if we ask, say, about the jackets sold in February,
or about all the products sold on one day? What if we want to access all the products of one time value?
The computation is very easy, because we calculate the position of the first item and then just go to the adjacent cells,
and that's it; we traversed five cells. But what if we want to know about all the pens there are?
That's a little bit more difficult, because we hop to the first one, then traverse all those cells we can't use, get to the second,
traverse all those we can't use, get to the third, traverse all those we can't use, and get to the last one. So we traverse sixteen records for that. Bad idea. If we had ordered it by product,
it would have been a good idea. So the type of query that we expect to come most often should be the key criterion for choosing the kind of linearization. Okay, and what can you do about it, what is the solution?
Well, basically what you can do is use caching. If you get some information into the cache, then you can rely on that information later. This is basically done by the operating system, and a multidimensional
database has to make sure that the blocks it stores, wants to access, and that future queries may work on, stay in the cache. A normal database system does that; a relational database system has its own database cache,
which is part of the normal cache, the normal query cache, and as soon as somebody wants to touch any of these records, the database system blocks it: no, you cannot touch these records, this is the database cache. If you do that on a file system with a multidimensional database, just linear storage,
then the operating system might damage your caching strategy. Still, caching can do quite a lot to solve the problem. Alright, so you have a measure for every cell,
basically a lot of measures for every cell, and the array storage is amazingly efficient because everything is stored just once, right behind each other. On the other hand, if you have complex operations and large cubes, you have to jump around, you have to traverse a lot of cells. What could we do?
Well, we could again apply a trick: we could first use a level of indexes that points us to the right place in the cube, and then use the actual array where the measures
are stored. So the first level contains all the indexes, the second the data cells that are stored in blocks. And what we can do is use indexes, trees, bitmaps and things like that, we will go into that a little bit deeper next week, plus compression
and physical partitioning, all these tricks, on the first level, and that points us to the data cells on the second level. So the computations where we have to jump are done in the smaller first level, on the indexes, and the actual data access is done on the second level. That is something you could do.
Now, if you have a sparse cube, the thing is kind of difficult, because all the cells of the cube in a linearization have to be stored, even if there was no measure or only a null value. Why is that? Well, because you have to compute the jumps,
and if you have no idea whether a cell is simply not stored or not, you cannot predict how many jumps you would have to make. So you have to keep the structure intact, and that is basically one of the problems
with linear storage: if you have sparse cubes, which sometimes happens, you will write a lot of null values and you will have to traverse a lot of null values. Not a very good idea. So what you can do is
again use an index on top of that and just make bigger jumps over null values that follow each other, where you say, oh, I don't have any data for February, so I will just put a jump pointer here and jump over the whole thing to the next block. So the
upper layer again holds the possible combinations of the sparse dimensions, and the lower layer may hold the dense dimensions. So, if we see it like that, then we can use it, for example, for a marketing campaign: we are interested in the customer dimension,
but not all customers, certain customers, just a couple of them. And basically, we always have the time, the geography and the product, which is the lower level, for which we have the full cubes,
and we put a layer on top, for each cube, we put on top the aggregated information for the customer, the aggregated measure for the marketing campaign. See how it feels. So, this is basically the customer and the marketing campaign,
just two small measures, very sparse, and the densely populated time product geography. And they are connected, again, by a foreign key. Okay? Can be done. And this brings us to our last detour today, where we can see how this is done in practice.
Okay, so we've spoken about the physical modeling. We've started with a logical model, which spoke about cubes. This corresponds to the multidimensional paradigm, which can be then translated either relational through star or snowflake schema,
or through arrays to the lower layer of the physical modeling. Well, based on these kinds of storage possibilities, Inmon speaks in his book about different ways,
which have actually become standard today, of doing OLAP. The first one, if we consider how things evolved historically, was desktop OLAP. It was actually just the beginning: the idea was to take a piece of a data mart and do all the computation on the desktop.
Not a very great idea. The next solution, which actually evolved from the multidimensional paradigm was the MOLAP, which actually says that on a physical level, the data is stored in arrays,
as we have previously seen with linearization and so on, and pre-aggregated information. At this point in time, there were already very powerful companies selling relational technologies. IBM, for example, or Oracle were already on the market. And they said, you know what? This new multidimensional story,
we can also have a piece of this market. Why not adapt the multidimensional paradigm to relational structures? We have our databases, they are proven technology, we have so many clients, so why not just invent ROLAP?
The technology is already there; we just do the mapping between the paradigm and what we have. And with time, both of these technologies, ROLAP and MOLAP, have evolved into what today is called hybrid OLAP,
which is in use in most of the big data warehouses. So, as I've said, desktop OLAP was the first solution. It was based on downloading the hypercube from the central data repository, and the computation was done on the client.
Well, of course, this is pretty cheap, but you are not really able to perform great analysis, because you are limited by what you have on your desktop. So you can't do, say, comparative analysis at higher granularity over long periods of time.
Well, the next idea was the multidimensional physical layer. From the architecture point of view, the data is stored multidimensionally. You have a multidimensional database server.
You have an interface which is actually quite intuitive. The idea is that the paradigm is then, through linearization, translated directly into the storage. So no complicated queries or anything are needed between the server and the data.
And then you have the client level, where the operations have to be transparent from what happens in the server. So, everything here is just clicks from the business analyst who says, I want to rotate this cube. I want to see this dimension further. I want to dig into the data. This is transmitted to the server.
The MOLAP server translates this, as I've said, quite intuitively through linearization, gets the data, computes everything and shows it through the display interface to the user. All the computation, everything that is pre-aggregated, is performed during the construction.
So, therefore, during the loading step. It's not only the first loading step, where you, of course, aggregate a lot of data, but also during your regular loading step. Each week, you may want to update your data warehouse. This is when it happens. The data is aggregated.
The advantage is, of course, that due to this aggregation step, you have fast data retrieval. You can query over the year and you don't need the joins. You just pull the data. If we remember from the first part of the logical modeling,
our main cube would look something like this, the rest being pre-aggregated data. This upper layer is data aggregated on two dimensions, this one here on another two dimensions,
this data here is aggregated on one dimension, and finally we have the overall aggregation over all three dimensions in this example, the apex, as we have described it in the cuboid lattice.
Everything is pre-generated. This can become quite complicated, as I've said. So, you should imagine that every step on the way has to be pre-calculated at loading time. A lot of additional information has to be stored.
A lot of time has to be spent. So, this is clearly a disadvantage for the loading time. The cube can be derived from a large amount of data; it only contains, however, summary-level information.
An example of this kind of growth due to the aggregation: an input file of 200 megabytes, which I've actually seen for this open source project, generates 5 gigabytes of data just through this pre-calculation and storage.
So, imagine: if 200 megabytes expand to 5 gigabytes, how would it look if I had an input of 50 gigabytes of raw data for one week? I think there were only four dimensions, just to give you a feeling.
Another impediment and disadvantage is that this software is often proprietary. Of course, the big relational vendors Oracle, IBM and Microsoft have all bought the first products which came on the market.
So, Cognos was bought by IBM, Essbase by Oracle; Microsoft has developed their own system. And actually, Palo is the only open source, multidimensional data warehouse solution which I know of right now. They are living just from consulting.
So, they are giving their software away for free, but you won't be able to use it until you hire a specialist who will tell you how to use it. Good. The next step was relational OLAP. Here, things are pretty clear: a star or snowflake schema for storing the information
following the multidimensional paradigm, and a ROLAP server which translates the operations the user inputs into SQL and gets the data from the database server.
Of course, because the presentation level is a multidimensional presentation layer, this transformation here is not trivial anymore. It has to be adapted, and this adaptation comes, of course, at a cost.
It has to be done through SQL queries or through a standard like MDX to express the needs of the user. So we use special designs, but we still suffer from performance issues.
For these things we will, in the next lectures, pay special attention to indexes and optimization possibilities. This transformation, as I said, costs a lot of time, so we can use bitmap indexes or R-trees or whatever other multidimensional indexes we have available.
Of course, the great advantage is that it's proven technology. We've tested it thoroughly in relational databases, we know it works, and, as Tilo said, it does its own caching. It's pretty solid, robust technology. A very big advantage. Another important fact is that it allows
the storage of large quantities of data, which, in the case of multidimensional databases, is not really possible. MOLAP vendors recommend their technology for up to 50 or 100 gigabytes,
while ROLAP technology can easily go up to 5 or 6 terabytes, and people already speak about technology supporting petabytes. The disadvantage here is clearly the missing functionality in the query language.
And such products are offered again by the big vendors, for example Microsoft, or Siebel Analytics, which now belongs to Oracle. Mondrian is an interesting open source solution; we will maybe test it in our further lectures. Okay, so based on the OLAP needs,
we can differentiate between MOLAP and ROLAP. Both offer a multidimensional view. Of course, MOLAP offers great query performance because it has a lot of pre-aggregated data; ROLAP has to achieve this through materialized views and other indexing mechanisms, which cost.
MOLAP also offers analytical flexibility, because the presentation is actually the way the data is stored. However, when it comes to real-time data access, that's not possible in MOLAP; I don't have access to it. I would have access if I implemented a ROLAP server
which stores the relational data together with the pre-aggregated data and just bypasses the aggregated data to get to the real-time data. So ROLAP can do this, MOLAP can't. Again, when it comes to large volumes of data, a big minus for MOLAP:
100 gigabytes is not nearly enough for a data warehouse. From the information systems management point of view, a MOLAP system is easy to develop; it's not as complex from the technology point of view as ROLAP.
However, the structure is more easily maintained in ROLAP. In MOLAP there is a lot of proprietary technology; there are systems where the linearization is not there anymore as we know it, they've made this jump strategy a bit different to optimize it
for some typical queries and so on, so it's not intuitive anymore. When it comes to the maintenance of the aggregated data, it's easy to maintain in MOLAP, and hard to do in ROLAP, because it has to be done explicitly with materialized views.
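For illustration, in a ROLAP store such a pre-aggregate would typically be kept as a materialized view over the fact and dimension tables. A rough sketch with the illustrative names from before (the exact syntax varies by vendor):

CREATE MATERIALIZED VIEW monthly_sales_by_group AS
SELECT t.year_no, t.month_no, p.product_group, SUM(f.revenue) AS revenue
FROM sales_fact f
JOIN dim_time    t ON f.day_id     = t.day_id
JOIN dim_product p ON f.product_id = p.product_id
GROUP BY t.year_no, t.month_no, p.product_group;
-- this view has to be refreshed explicitly (or on a schedule) after each load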
Well, the difficulty in MOLAP is how to perform the aggregation during the loading process; that is the main problem. If we, however, consider these two technologies combined, we can speak about the hybrid version, where the detailed information is stored in relational databases,
again respecting the multidimensional paradigm, and the aggregated data is stored in multidimensional databases, be it a proprietary system or not. Another solution would be to perform hybrid OLAP
by making use of different partitioning possibilities, either vertical or horizontal. With vertical partitioning we could, for example, keep the aggregations in MOLAP for fast query performance
and still allow detailed queries against the detailed data through ROLAP. Or we could perform horizontal partitioning, where we split along a dimension; one classical choice is the time dimension, storing, say, the recent data pre-aggregated with MOLAP.
Again, this comes from heuristics: store the older data in ROLAP. Well, actually, this is something which Inmon has recommended, but I've seen on the forums that many people say, well, it depends. If you actually need higher aggregation levels for the older data,
because you compare data from three years ago at the year level and the recent data at a finer aggregation, then you need to do exactly the opposite. But still, it's a combination of both MOLAP and ROLAP on the time dimension; it depends on your needs.
Another solution would be to divide your usage between MOLAP and ROLAP along dense and sparse data: divide it into sub-regions and say, okay, where I have dense data I'm going to do ROLAP, and where I have sparse data I'm going to allow some more aggregation and store it in MOLAP.
Everything depends on your queries and how the data is built. As a short conclusion, I just want to remind you that ROLAP is built on the star and snowflake schemas and combinations of them. It's great for detailed, so smaller granularity, data
and large data volumes, terabytes if not even more. In MOLAP, the cube structures are stored in arrays, and it's good for small volumes of data. So actually, if you like, you could store a small data mart
multidimensionally, for example. But the best solution is to use hybrid OLAP based on these heuristics. Good. As a short summary of our lecture, what I would like you to remember
is that the logical model involves cubes and dimensions. Dimensions are expressed through hierarchies and classification levels. Regarding the physical level, the physical modeling, it can be implemented relationally, for example through the star schema.
This is great for query performance, for data which is often used. It offers fewer tables.
This leads to smaller numbers of joins. Again, a simple structure, great, but it's a trade-off: if I have big dimensions, the star schema is not the solution. In these cases it can lead to high overhead due to the big dimensions and the large amount of redundant data.
Another possibility is the snowflake schema, where I have normalized my dimensions. This, again, leads to a large number of tables, and the dimensions then have to be joined during querying.
This is not great when I have to join them with the huge fact table. The physical modeling can also be done directly multidimensionally, expressed through array storage. What is important here is to know how to perform the linearization.
So actually, you transform the multidimensional paradigm into a single linear array. The most interesting problems are the order of dimensions. How do I choose? What's my first dimension?
Is it time? Am I doing queries over time? Is it products? Are those my primary query purpose? Well, the solution is to use caching, and here I need to be careful that the operating system does not mess with the cache of the multidimensional database system.
Another problem is dense cubes versus sparse cubes, and for this reason one can use two-level storage, with indexes that point to the corresponding data and allow jumps in the case of sparse cubes, for example. And not to forget: based on this physical modeling level,
one can speak about multidimensional OLAP and relational OLAP, but what is actually done in practice is combining both of them into a hybrid model. Next time, we'll be speaking about optimization strategies,
about indexes, namely about bitmap indexes, R-trees and hash indexes. Are there any questions about today's lecture?
Then, I would like to thank you for your attention and we'll see you next time.