
Build the DW, ETL (09.12.2010)


Formal Metadata

Title
Build the DW, ETL (09.12.2010)
Title of Series
Part Number
7
Number of Parts
13
Author
Contributors
License
CC Attribution - NonCommercial 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language
Producer
Production Year
2011
Production Place
Braunschweig

Content Metadata

Subject Area
Genre
Abstract
In this course, we examine the aspects of building, maintaining and operating data warehouses, and give an insight into the main knowledge discovery techniques. The course deals with basic issues like storage of the data, execution of analytical queries and data mining procedures. The course will be taught completely in English. The general structure of the course is:
- Typical DW use case scenarios
- Basic architecture of a DW
- Data modelling on a conceptual, logical and physical level
- Multidimensional E/R modelling
- Cubes, dimensions, measures
- Query processing, OLAP queries (OLAP vs OLTP), roll-up, drill down, slice, dice, pivot
- MOLAP, ROLAP, HOLAP
- SQL99 OLAP operators, MDX
- Snowflake, star and starflake schemas for relational storage
- Multimedia physical storage (linearization)
- DW indexing as a means of search optimization: R-Trees, UB-Trees, Bitmap indexes
- Other optimization procedures: data partitioning, star join optimization, materialized views
- ETL
- Association rule mining, sequence patterns, time series
- Classification: decision trees, naive Bayes classification, SVM
- Cluster analysis: k-means, hierarchical clustering, agglomerative clustering, outlier analysis
Transcript: English (auto-generated)
Okay, so we should probably begin. Unfortunately, Professor Balke can't come today, or at least not for the first part of the lecture; maybe he will join us at a later point. Until then, I will try to replace him.
So let's start with the lecture. Let's first briefly recall what we discussed last week. We introduced OLAP operations and started with the classical operations
one has to perform when doing analytical processing, like roll-up, which comes in two flavors, hierarchical and dimensional; dimensional can again be seen as one type of hierarchical roll-up where you go over the top of your granularity. Then you have drill-downs, and the main point
here is that in order to perform drill down you actually need the data, you can't drill at a lower granularity than the data that you have, you can simulate such things, so if you don't have the data in your pre-aggregated multidimensional model, then you can simulate
it with just taking it from hopefully your relational model in real time, but this takes time. Then you have the classical slice, dice and pivot operations, in SQL you can simulate
them with WHERE clauses, as we've discussed last week. Between this graphical user interface, which allows us to perform these OLAP operations, and the data, we again have query languages. In the case of data warehousing we have spoken about SQL 99, the evolved part of SQL, which is already able to perform
the roll-up, drill-down and cube operators, and we've also spoken about MDX, a query language which was first introduced by Microsoft and which, due to their dominant position on the
market, has also become the de facto standard among multidimensional query languages. Of course this is used for the case where the data is physically stored multidimensionally,
through arrays, the MOLAP case, but it has also been ported to relational online analytical processing systems through a mapping onto SQL 99. Good, now that we've spoken about the query languages and about how the data
is stored, either on a logical level or on a physical level, we can go on and speak about how the data warehouse is actually built. We will begin today with what the project of building a data warehouse looks like, we
will discuss in more detail the central process of building the data warehouse which we've spoken about throughout the course, the extract, transform, load operations, and we will conclude
the lecture by speaking about the metadata repository, which improves the quality of the data in our data warehouse. Okay, so what's the deal with building a data warehouse, how does such a project differ from a normal software development project?
Well, it's a pretty complex project, it involves a large number of activities and it has a pretty different structure than the classical development projects.
It has a project manager, who does project management stuff as in any project, so he plans the resources, he manages the budget, the timelines and so on. It has an architect; in the case of the data warehouses, he has to be aware of the methods
which should be allowed on the data warehouse, of the concepts which the data warehouse needs to store, and of course he must be aware of how the data warehouse should be modeled.
So conceptual modeling is a task of the data warehouse architect. Here we are speaking about multidimensional entity relationships and multidimensional UML, we've spoken about in I think lecture five or so. Then we will speak about the data warehouse miner, this is the person who is aware of
mining algorithms, who is aware of strategies of how to extract data from the data warehouse. This falls into the category of non-standard analysis, how one can perform such analysis
but not through the GUI, rather through data mining algorithms, and of course about the concepts which need to be mined. Both the architect and the miner need a domain expert. We've said that the data warehouse has to store what the company needs, so it has to be
subject focused; this is what we need the domain expert for. He will tell us, okay, you need the time granularity to also consider the fiscal year. Such information is provided by the domain expert, so he is our domain knowledge source.
Then you have the classical developer, he will actually perform the maintenance and management of the metadata, so will implement the storage of the metadata in the metadata repository, will perform the tasks for the extract, transform and load process.
And then we have the user, who sits in front of our analytical GUIs and performs the operations we've spoken about last week.
So we already recognize some roles which are not classical from normal project management and project development. Of course we have to speak also about choosing the right software.
There are many possibilities, many vendors, technology providers but the first choice we have to agree upon is how should our physical data be stored. Should it be multidimensionally stored, should it relationally be stored?
This is a decision which has to be taken across all the roles we have discussed, but primarily with the person who has the domain knowledge and who will tell us, okay, our company needs a data warehouse which goes up to 50 gigabytes, because this is the amount of data we actually store on a regular basis, or we need something bigger and therefore
we should go relational. Another software choice is what tools should we use. ETL is performed through different tools from different vendors. They are usually independent of the vendors for data warehousing solutions as well as
from the vendors which offer database management systems for the operational environment. So they are a different animal. This has to be established at the beginning. Another choice, the analysis tools with what am I going to perform the data mining
over the data warehouse. What is going to ensure my reporting functionality which I also need for the end user. Another piece of software is the repository. It's not regularly used in all the data warehouses.
However, it is an important component when I'm dealing with a corporation which spreads around many countries and I immediately need a very powerful metadata management to follow the data through my company and see where did it start from, where did it go to.
For something like this, I need the repositories. You can imagine them as simple as volumes. When it comes to the hardware choice, of course, the solution will be distributed.
You are going to speak about storage area networks, something which is more current, more performance-capable for today's needs. Or probably you would need something cheaper, for example network-attached storage, where the network layer imposes some lag, you have the computers as nodes,
but you still have a high capacity of storage. Processing power. We've said that there is always a trade-off between storage and processing. Much of the information is pre-aggregated. However, during the ETL process,
this pre-aggregation step costs us a lot of processing power. This is why multi-CPU systems are the way to go. Failure tolerance. The data is very important. If I'm going to lose the data, I can't do any analysis anymore.
So data replication, RAID, mirroring, backup strategies. Most of the data warehouse vendors have a contract partner, something like Limelight Networks, which replicates the data around the globe to avoid catastrophes or something like this. The best example is the World Trade Center, what happened there.
Just imagine that the bank would have had all their clients just in the World Trade Center servers. The bank would have been ruined after the disaster. This doesn't happen due to the fact that failure tolerance must be considered from the beginning.
Of course, other factors like the cost, the memory bandwidth, the transfer rates, the latency have to be taken into consideration. How does the project timeline look like? It's a prototype-like project, so we start with a proof of concept.
We establish the technical infrastructure, the hardware, the software. We perform the first extract, transform, load process as a prototype. And we perform our first analysis and
reporting as told by our domain expert and prototype user. So we just select two, three users, which will be typical for the users of the data warehouse in our company, and say, what do you need? What do you expect from this data warehouse? What are the analyses you would like to perform?
This happens in the proof of concept. We build a prototype. Then there's a controlled release, much like in software development. You have a beta release and you choose some beta testers. This is exactly what happens here. You build small data marts and say, okay, let's see how it would go for
the marketing department. How would it function if the marketing department would try to perform analytics on the data warehouse? And of course, when the project reaches maturity, the third phase is the general availability where the whole company can have access to the data
warehouse and the data warehouse grows so that it can respect the requirements of the whole company. But the central process of the data warehouse project is the extract, transform, load process, and this is exactly what we will discuss today.
What is the extract, transform, load process? It's actually three combined functions which are used to pull data out of the operational data system, to get it to the staging area,
to transform it, make it conform throughout our company. It's like I'm getting data from America, Europe, Asia. I'm putting it together. I try to make it consistent, to increase its quality,
to merge and find duplicates, and then to load it into the data warehouse. So it also covers migrating the data and converting it to the proper format. As with any process in the data warehouse,
we can also ask ourselves when should we perform ETL? Well, we can perform it periodically and this is actually the most used technique. It depends then from the company strategy, one can perform it every night, every week, every few weeks.
There can be refresh policies set by administrators or based on user needs or traffic. For example, holidays are the best example for data warehousing refresh. The time for refreshing the data warehouse during the winter holidays,
for example, for sales or marketing, is decreased. One can have a regular interval of a week, but when winter comes, in the November and December sales months, it is shortened to a daily basis. Of course, there are also event-based solutions where the boss says,
we have signed a new contract with another company, which is going to supply us these products. I want, for every thousand sales of products coming from this company, a refresh of my data warehouse so I can perform new analyses.
Of course, there is also the case where I have a data warehouse for analyzing the stock market or something like this, which is pretty rare for data warehouse technologies. We speak about real-time extract, transform, load,
where every update has to be reflected in the data warehouse. But this actually is a very big problem, and it's not that easy to solve for a large amount of data. Okay, so ETL, I've said that ETL is used to integrate heterogeneous systems.
The company's subsidiary from United States may use DB2, Europe may use Oracle. So I need to integrate this from different types of sources which use different types of technologies. So ETL then has to be able to cope with this heterogeneity.
Other challenges are that it has to get the data from the source to the target as fast as possible. Let's imagine the case where I want to perform this update, this ETL process on a nightly basis.
Then it has to be performed from three o'clock in the night until six o'clock in the morning, or seven o'clock. If it takes longer than this, then it's useless. It has to cope with recovery from failure. If I have this limited window of time and the ETL process fails,
I'm not very happy if it starts all over again. So it has to be able to continue from where it left. The balance which is vital here is between keeping the data in the memory
and writing it on physical disks or files. When failing, if I have the data in memory that is lost, I have to recover everything, it takes more time. If I've written it in a file, then I can continue from where I've lost track.
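To make this restartability concrete, here is a minimal sketch in Python, under the assumption that the batch is just an iterable of records and that progress is remembered in a hypothetical checkpoint file on disk rather than only in memory:

```python
import json
import os

CHECKPOINT_FILE = "etl_checkpoint.json"  # hypothetical bookkeeping file

def load_checkpoint():
    # Resume position: how many records were already processed safely.
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return json.load(f)["rows_done"]
    return 0

def save_checkpoint(rows_done):
    with open(CHECKPOINT_FILE, "w") as f:
        json.dump({"rows_done": rows_done}, f)

def run_batch(records, process, checkpoint_every=10_000):
    start = load_checkpoint()
    for i, record in enumerate(records):
        if i < start:
            continue                     # already handled before the failure
        process(record)
        if (i + 1) % checkpoint_every == 0:
            save_checkpoint(i + 1)       # persist progress to disk, not memory
    save_checkpoint(0)                   # batch finished, reset for the next run
```

After a crash, re-running the same batch skips everything up to the last checkpoint instead of starting all over again.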
We've spoken at the beginning about the staging area. The staging area is the main area where the data is stored after being extracted from the operational data system. Of course, the staging area is not the place for the end users.
The data suffers here a process of transformation. So no reporting, modification, or access from other than people inside the project is allowed in the staging area. Maybe you can remember the analogy Tilo made with the restaurant where the guests are not allowed in the kitchen.
This is exactly what happens with the staging area. Okay, so how does it look like? We have our data in the operational environment.
Let's say Oracle is in Europe. Let's say the financial system is in the US. We can consider this kind of data, for example, for Volkswagen: we can imagine that the human resources system, Oracle, is in Wolfsburg, and that the financial system with its sales
is in North America, and so on. We, however, need to perform the extraction from these heterogeneous systems, and therefore we build wrappers for each of these technologies. Wrappers extract the data through queries, for
example, store it in our staging area. This can be anything from flat files to normal databases. It can also be, of course, a proprietary database management system.
Then this data has to be processed. It undergoes a process of transformation through data cleaning and integration. And then it will be loaded into the data warehouse.
So the data starts here, and then it's here. This is the big picture of the ETL, got it? Not that difficult.
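A minimal sketch of such a wrapper in Python, assuming the source is reachable through the built-in sqlite3 module and the staging target is a flat CSV file (the table and column names are made up for illustration):

```python
import csv
import sqlite3

def extract_to_staging(source_db_path, staging_csv_path):
    """Pull rows out of one operational source and dump them into the staging area."""
    connection = sqlite3.connect(source_db_path)
    cursor = connection.execute(
        "SELECT sale_id, product_id, quantity, sale_date FROM sales"  # hypothetical table
    )
    with open(staging_csv_path, "w", newline="") as staging_file:
        writer = csv.writer(staging_file)
        writer.writerow([column[0] for column in cursor.description])  # header row
        writer.writerows(cursor)                                       # data rows
    connection.close()

# One wrapper per source technology; all of them write into the same staging area.
# extract_to_staging("europe_hr_mirror.db", "staging/sales_europe.csv")
```

In a real project there would be one such wrapper per source technology (DB2, Oracle, flat exports), all producing a common staging format.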
Let's speak a bit about the structures in the staging area. What kind of information do we store in the staging area and how is it actually stored? We have normal files, text files, for example, XML files, and relational tables.
You may ask, why flat files? Why should we store data in flat files? Well, actually, the advantage of using flat files is that the operating system is faster when accessing flat files than compared to the database management system when accessing data.
Due to the additional layer, the logical layer in the database management system, the access time increases. Another big advantage is that operations like sorting, merging, deleting, and replacing benefit from this
faster access and are therefore faster outside the database management system. So processing flat files through something like Perl, Python, or other scripts results in faster performance.
Of course, the disadvantages are clear. There is no updating concept when working with flat files. So one should, again, ask himself, what am I going to store in the flat file? Do I need updating? Do I need lookups?
There's no indexing in flat files. If I'm going to perform a lot of searches, then forget the flat files. But nobody said that I can't use the flat files for something else where I have an advantage on it. So again, someone needs to look over the data and say, what should I use for this kind of data?
Should it be a flat file? Should it be something else? Again, a very important advantage when using flat files is that you all know when you are out of electricity, what does the operating system do?
It loses everything which it had in the memory. Thankfully, it usually stores data, temporary data, in a dump file. Dump files are flat files. If you work with flat files, you are better protected against such failures.
As previously stated, if sorting is your goal, then flat files are, again, a very good choice, since flat files can be sorted faster than relational tables.
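A minimal sketch of this kind of script-level processing, sorting and filtering a staging flat file without going through a database (the file name and column layout are assumptions):

```python
import csv

def sort_and_filter(staging_csv, output_csv, min_quantity=1):
    """Sort a flat staging file by sale_date and drop obviously empty sales."""
    with open(staging_csv, newline="") as f:
        reader = csv.DictReader(f)
        rows = [r for r in reader if int(r["quantity"]) >= min_quantity]
        fieldnames = reader.fieldnames
    rows.sort(key=lambda r: r["sale_date"])       # ISO dates sort correctly as strings
    with open(output_csv, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

# On very large files one would stream through the file, or fall back to the
# operating system's sort and grep utilities, instead of loading everything into memory.
```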
Filtering using grep functionality is another classical example for flat files, or replacing text strings, again due to the faster access through the operating system. The other possibility was using XML data.
It's a common format for both the input and the output of the ETL process. We will later on see that it is also a very important way of solving the transformation problem when integrating schemas.
Different schemas have to be integrated and therefore we need a common language of these different schemas which pass through XML. It is generally not used for persistent staging because it does not benefit from
either of the two advantages of the databases and of the flat files. So it's a temporary solution. The classical mechanisms we're using XML data are the XML schema, the XPAT and the XQuery search languages and
the transformation, the XSLT language. And of course, we also have in the staging area the relational databases. This is the best solution when I don't have any proprietary ETL tools.
So I'm going to do everything by hand. And trust me, it can be done also by hand. It just means a lot of know-how. You have to know what you are doing and of course, a lot of work. But when doing this by hand, then relational databases are the best way to go,
although it's not the most efficient one. The most notable advantages when working with relational databases are the fact that the metadata are clear. You have the column names, you have the schema, you have the types, you have the lengths, you have the relations between the tables.
So all this is given. Of course, normalized staging is also an advantage because you have the relations between different tables, so we can impose different constraints when working on improving the data quality.
Then there's the advantage of the structured query language, which you don't have in the case of flat files. You have it in the case of XML, but as I've said, the XML is just a temporary solution. However, the main disadvantage, which I've already mentioned,
is that if I have a time constraint, as in the case of the data warehouses, which need to be updated in a short time window, then I have to use flat files. I have to go on sorting the information in text files just through the operating system.
I can't use the advantages which relational databases provide us with. About the design of the staging area: how does it all begin? Someone has to analyze the data, get a feeling for how the tables look,
and build a report based on the volumetric worksheet of each table. This actually means that each table has to be estimated based on its future usage in the staging area. For example, in the initial load process,
when I'm going to build the first prototype of the data warehouse, I estimate this amount of rows. Actually, this can be already calculated for the initial load. It can't be calculated for continuous load, but an average can be built. So this type of volumetric worksheet can therefore give me
a feeling about what I'm going to store in my staging area, and what kind of file that should be. Should that be a relational database? Should that be a flat file or an XML file? For example, an important piece of information is the row length.
Here, I can think about the block size, what kind of hard drives I want to buy. So it's related already to the hardware infrastructure. How many rows I am expecting in this table? How many bytes?
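A minimal sketch of such a volumetric estimate, with made-up row counts and row lengths:

```python
# Hypothetical volumetric worksheet: rows expected in the initial load,
# average daily growth, and row length in bytes for each staging table.
volumetrics = {
    "sales":     {"initial_rows": 5_000_000, "rows_per_day": 20_000, "row_bytes": 120},
    "customers": {"initial_rows":   300_000, "rows_per_day":    200, "row_bytes": 350},
}

def estimate_size(table, days=365):
    v = volumetrics[table]
    total_rows = v["initial_rows"] + v["rows_per_day"] * days
    total_bytes = total_rows * v["row_bytes"]
    return total_rows, total_bytes / 1024 ** 3   # rows and gigabytes

for table in volumetrics:
    rows, gib = estimate_size(table)
    print(f"{table}: ~{rows:,} rows after one year, ~{gib:.1f} GiB")
```

The initial load can be calculated exactly; the ongoing growth is only an average, but it is enough to decide between flat files and relational staging tables.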
All this information put together tells me what kind of files I'm going to choose for my staging area. Okay, let's go to the next process, the way ETL functions, and we'll start with the data extraction.
As I've said, the data extraction actually ports the data from the operational environment into the staging area. This can be performed either through a set of scripts which provide exporting functionality, or through external programs. It is not commonly expected that the operational environment
offers something like export. It's actually only expected when you buy the data warehouse as well as the operational system from the same technology vendor. Then you can say, okay, they are working together anyway, so I expect that this one exports something in a format
which the further process can read. The other solution, through external programs, is to perform the extraction through additional tools on the operational source. If I am going to perform the extraction through export,
then the most sensible way is to avoid a proprietary format and go through a text file. If I am going to perform the extraction through external tools, then the most notable solution is to use an intermediary database.
I then can use, of course, the advantages of the relational databases. The steps of performing extraction are quite trivial. Of course, when we are speaking about building a data warehouse, we have the initial extraction, the big extraction at the beginning.
For this, one has to perform a logical map. One has to know what should I extract from where. Again, domain knowledge comes into place. The person who knows about the business processes of the company will tell me, okay, for us as a, I don't know, service vendor,
the clients are important, whatever subject-oriented processes the company has, and through such information, one can decide what are the tables and databases important for performing extraction.
The second type of extraction is the ongoing extraction: the updates, as we call them in regular databases. Here we can just input new data into the data warehouse, we can notify the data warehouse about changed data, or, if specifically noted, we can notify the data warehouse
also about deleted data. We've previously, in the previous lectures, spoken about the fact that data warehouses should not lose data. We are not allowed to delete information in the data warehouse. This is why deleted data has a special state when performing extraction.
Actually, for example, when I've typed in something wrong, the most used process of deleting information from the data warehouse is just to negate that information.
For example, I've sold, I don't know, ten monitors of some kind. In fact, I've sold seven, so I add a correcting record of minus three, and I come from ten back to seven, just to negate the operation.
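A tiny illustration of this cancellation idea on a toy fact table, using the monitor numbers from above:

```python
# The wrong fact stays in the warehouse; a negating record corrects it.
sales_fact = [
    {"product": "monitor", "quantity": 10},   # originally loaded, but wrong
    {"product": "monitor", "quantity": -3},   # cancellation record
]

total_sold = sum(row["quantity"] for row in sales_fact)
print(total_sold)   # 7, the correct figure, without deleting any history
```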
Okay, so the logical map connects the operational system to the final data. But building the logical map implies discovering the data sources, so discovering the source of my data.
For this, an analysis has to be performed, a discovery process has to take place where I'm collecting the documents from the operational system, and I identify the databases, the tables, the relations, the cardinality, the keys, and the data types which I'm interested in.
In this step of building the logical map, I also have to pay attention to anomalies, because something like inputting null values in the data warehouse can destroy the entire process.
We've spoken about aggregated values. When building aggregated values with an operator, like for example the CUBE operator in SQL 99, it will generate null values. Those are actually aggregated null values, and they have to be differentiated from the null values which come from the operational system.
A classical example is when North America doesn't have a field and Europe has a corresponding attribute; when merging the data, the data coming from North America will have null in that attribute. What should we understand through that null value?
These kinds of anomalies have to be treated early in the data extraction process, and the most dangerous case is when the null value comes in a foreign key, so the join would be a problem. A classical solution for this is to perform an outer join
on the tables which I need to join, and if the null value comes in just a normal attribute, then I have to consider again asking the domain specialist how the company would like to treat those cases. A classical solution is to just give a special meaning to one flag value.
If the data is numeric, just say, okay, we don't have minus one as a value in that attribute field, so we'll just input minus one, and during our entire process we'll flag it as if it were a null. Something like this.
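A minimal sketch of this null handling, assuming -1 has been agreed upon as the sentinel for a hypothetical numeric attribute and that the records arrive as dictionaries:

```python
NULL_SENTINEL = -1   # business decision: -1 never occurs as a real value here

def replace_nulls(record, numeric_attribute="discount"):
    """Replace a missing numeric attribute by the agreed sentinel and flag it."""
    if record.get(numeric_attribute) is None:
        record[numeric_attribute] = NULL_SENTINEL
        record["discount_was_null"] = True    # keep track of the substitution
    else:
        record["discount_was_null"] = False
    return record

print(replace_nulls({"sale_id": 1, "discount": None}))
print(replace_nulls({"sale_id": 2, "discount": 5}))
```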
We've spoken about the initial extraction, the second part of it is the ongoing extraction, so data needs to be maintained in the data warehouse. Of course, this happens on a regular basis, but the main question here is, how do we detect what we actually need to update in the data warehouse?
It would be stupid just to throw away the whole information and perform the initial extraction again. The amount of information needed to repeat this would be very big, and as I've said, I also wouldn't gain anything, because I would lose everything which reflected the state of the data throughout time.
So the only possibility then to perform this is to perform ongoing extraction by detecting the changes. How can I detect the changes in data warehousing? There are three main possibilities. One is by using a special column called audit columns.
The other one is by using the log functionality of classical databases. Two strategies are the scraping of the logs and the sniffing of the logs. And the third possibility is a process of elimination. We'll discuss each of them right now.
So the idea of using the audit columns is just by adding timestamps to the records. After the initial extraction, I have the timestamp of the extraction. After modifying the data in the operational environment,
the timestamp changes with the last operation performed on the data. When I then want to perform this update, I just check timestamps and I see when was this record in the data warehouse loaded, then what's the last timestamp when an operation has been performed on this data?
Well, today. It's clear that the data has been changed, due to the different timestamps, so it has to be updated also in the data warehouse by introducing the new information. I've said the other possibility would be the one using logs.
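Before turning to the log-based approaches, here is a minimal sketch of the audit-column check just described, with sqlite3 standing in for the operational source and made-up table and column names:

```python
import sqlite3

def extract_changed_rows(source_db_path, last_extraction_time):
    """Pick up only the rows touched since the previous ETL run."""
    connection = sqlite3.connect(source_db_path)
    cursor = connection.execute(
        # last_modified is the audit column maintained by the operational system
        "SELECT * FROM customers WHERE last_modified > ?",
        (last_extraction_time,),
    )
    changed = cursor.fetchall()
    connection.close()
    return changed

# changed = extract_changed_rows("crm_mirror.db", "2010-12-08 03:00:00")
```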
A classical solution is log scraping. The idea is to use the redo log of the operational database and, at a certain time point, for example during the night, to find the transactions which affect the tables which the ETL is interested in.
ETL has a set of tables, again related to the business process of the company, where he's extracting continuously the data from. When these tables appear in the redo log with certain transactions,
I just take those transactions and input them into my data warehouse as modifications of the data. The problem when using such an approach is when the log gets full. What happens then is that the log is emptied automatically
by the database management system, and in such cases our log scraper is not aware of what happened in between, between the last scraping and the moment it was emptied. It can detect that the log was emptied, but it can't know what happened in between.
So a small amount of data can be lost. The other possibility would be log sniffing. This is a polling solution where all the databases and tables which are interesting for our ETL process are polled and watched
at a smaller time granularity, again through a sniffer, which then watches the transformations performed on these tables and reflects the modifications in the data warehouse. Actually, in order to solve the problem of the log being emptied,
a mixed solution between these two has to be used. For example, when the log has a low degree of occupancy, then I can live with larger intervals between sniffing or scraping.
When it gets full and reaches its capacity, I have to fear that it will be emptied. So the time interval between scrapes of the log will be lowered, will be smaller, so that I can decrease the amount of data which I'm going to lose, if I'm going to lose any data at all.
And the bulletproof solution for detecting changes without losing anything is the process of elimination. Of course, this costs a lot of time. Because the idea here is to compare versions of the extraction. This process preserves the previous or the last extraction copy
and compares it with the state of the interesting tables in the operational environment. This comparison, as you can imagine, is an all-against-all. So I compare everything from the operational environment with the last state of the extraction.
So this will take a lot of time. The great advantage, impossible to miss any data. Another interesting advantage, it can detect the deleted rows because I can see what was present in the last extraction and what is not there anymore in the operational environment.
So I can detect also deleted rows. So as I've said, if the records get deleted or overwritten in the operational environment through such a method, I can detect it.
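A minimal sketch of this comparison, assuming both the previous extraction and the current operational state are available as dictionaries keyed by the primary key:

```python
def diff_snapshots(previous, current):
    """Compare the last extraction with the current source state record by record."""
    inserted = {key: row for key, row in current.items() if key not in previous}
    deleted  = {key: row for key, row in previous.items() if key not in current}
    updated  = {key: row for key, row in current.items()
                if key in previous and previous[key] != row}
    return inserted, updated, deleted

previous = {1: ("Alice", "Braunschweig"), 2: ("Bob", "Paris")}
current  = {1: ("Alice", "Hannover"),     3: ("Carol", "Berlin")}
print(diff_snapshots(previous, current))
# ({3: ('Carol', 'Berlin')}, {1: ('Alice', 'Hannover')}, {2: ('Bob', 'Paris')})
```

Nothing is missed, and deletions fall out for free, but the full comparison is what makes this the slowest of the three strategies.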
If, however, the errors have been already loaded into the data warehouse, the corrections that are made are not to delete the record itself, but to add new records which correct the error through cancellation. So I add, for example, negative records which cancel the bad record
and then I add the correct value because I don't want to lose any information on the step, on the process. Okay, we've spoken about the data extraction, but what about the data transformation?
The data transformation basically uses rules, or lookup tables, or combinations with other data like data dictionaries, or semantically related data like ontologies.
The major steps in data transformation are data cleaning and data integration. Both of them involve manual work, so neither of these two sub-processes is perfectly automatic. There are some solutions which come from artificial intelligence;
things like pattern recognition, clustering, and supervised and unsupervised learning methods fit here quite well to support data cleaning and schema integration. However, as I've said, this is, even today, not a completely automatic process.
Actually, there are a lot of consulting companies which live just from these two sub-processes.
They say, we'll create you filters, patterns and whatever you need so you can get clean data in the data warehouse. We can perform the data integration for you by writing new screens that raise flags when there's a problem. Of course, at the end of this, there is the human being
with two displays, with two records and two buttons, one red and one green, and it says, okay, are these two records similar? Are they the same? Are they different? If the flag raised by the consultant's program has identified these two rows as duplicates,
then a human has to confirm or to reject this hypothesis and press on the green or on the red button. And don't laugh, this is how it's done. So data cleaning is quite a lot of manual work. We'll speak, of course, just about the semi-automatic
and automatic approaches that can be implemented. Okay, so we found out that the data can be dirty. How should the clean data look like? Well, the characteristics clean data should have are correctness.
One example of correct data concerns a store. We know the store is located in Braunschweig. However, in the address, after the street, it is reported that the store is in Paris. Clearly, of the two sources for this information,
one of them is wrong or there are different types of rows which have been joined together. So correctness is a vital characteristic when speaking about data quality. Another one is unambiguity.
So I have to make sure and be sure that the values of the data have just one meaning and cannot be misinterpreted. Consistency. A big problem is, again, when we are speaking about fields
which can be reduced to flags like male, female. In the database from North America, you will probably find M for male, F for female. In Europe, you will maybe find the complete word and in Asia, you will probably find zero and one. You have to get this information together
because actually what this information tells me is the same thing. So the value which I'm going to store in order to load into my data warehouse has to be consistent. Another characteristic is the completeness. We've just spoken about the null values.
We should not allow null values when loading into the data warehouse. We should reduce them, either through a business decision where we say, okay, this value should be replaced with something, or by any other means. The data cleaning engine actually produces three deliverables
which will later be stored in the metadata repository. The first deliverable is the profiling result. We've spoken about it at the beginning. What this is is just a description of the schema definitions, the business objects, the domains I'm interested in
when building the data warehouse, and the data sources. So it becomes, again, technical: table definitions, rules for the data, value rules. Again, for example, how many characters the name should have, the fact that it's a character field
and so on. Everything is important to be stored in the metadata and everything, every piece of information like this is discovered during this profiling process. The second deliverable is the error event table.
This is itself already organized as a classical data warehousing schema, namely a star schema, where each error is logged. We have a screen dimension, a date dimension and a table dimension around the error events. The screen dimension reflects the filters which have been used and against which the data has failed the test, so which kinds of tests have failed. Everything is then logged in the error event fact; this is our fact table, and it helps in identifying where I should raise flags for someone to look over my data. The last deliverable of the data cleaning process
is the audit dimension, which describes what the quality of my data actually is. The context for the audit dimension is usually the fact table. So when I'm going to load my facts, I need to know how each fact complies with the screens I have defined to monitor the quality of my data: for example, completeness when it comes to text or integer columns, or the number of screens it has failed. A screen can be something like checking whether a name corresponds to a dictionary: I define a dictionary with typical names from the United States, I check my client database against that dictionary, and if a row fails this test,
this is going to be registered in the audit dimension. The core of the cleaning engine actually performs the following tasks. It breaks the data into atomic units. This is especially useful when speaking about the address.
A classical example is a field where a client has to input his address in an online form. Sometimes he is asked to break that information into separate fields, sometimes not. Either way, the data cleaning engine has to recognize something like the name of the street, the house number, the city, the zip code and the country. Everything is then split into different attributes and stored as such. Standardization I've already spoken about when I gave an example of data cleaning problems.
So we have to reach an agreement about how we reflect these kinds of notations in our data warehouse. Are we going to store male/female, M/F, or 0/1? This has to be decided during data cleaning.
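To make this concrete, here is a minimal Python sketch of these two steps, mapping the different gender encodings onto one canonical value and splitting a free-text address into atomic attributes. The canonical codes, the 0/1 assignment and the address format are assumptions for illustration, not rules from the lecture or from any particular tool.

```python
import re

# Assumed mapping: which canonical code to use, and whether 0 means male or female,
# is a design decision that has to be confirmed against the source documentation.
GENDER_CODES = {"m": "M", "male": "M", "0": "M",
                "f": "F", "female": "F", "1": "F"}

def standardize_gender(raw):
    """Map the different source encodings (M/F, male/female, 0/1) to one canonical value."""
    value = str(raw).strip().lower()
    return GENDER_CODES.get(value)      # None signals a value that must be flagged for review

# Assumed address format: "<street name> <number>, <zip> <city>"
ADDRESS_PATTERN = re.compile(
    r"^(?P<street>.+?)\s+(?P<number>\d+\w*),\s*(?P<zip>\d{5})\s+(?P<city>.+)$")

def split_address(raw):
    """Break a free-text address field into atomic attributes."""
    match = ADDRESS_PATTERN.match(raw.strip())
    return match.groupdict() if match else None   # None means: needs manual inspection

print(standardize_gender("female"))
print(split_address("Muehlenpfordtstrasse 23, 38106 Braunschweig"))
```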
Verification is a semantic step. A nice example: one has a zip code, and the question is, does it correspond to the city which was given by the client? Is it correct? Or maybe the rows have been misinterpreted: maybe I have put together two rows which actually don't belong together. I can find this out through verification. Verification is also a sub-step of matching. When bringing rows together, we can find duplicate data through matching: we can identify to what extent two records are actually the same record, differing in just a few letters. The classical types of enforcement performed on the columns in the database
are things like checking whether the incoming data contains the values I'm expecting, or whether it fits the expected type. Another such enforcement is about the null values I've spoken about: I'm not allowed to load null values, because this will lead to further problems when aggregating. Then there are numeric values: if I know that the life expectancy of humans certainly does not go above 150 years, then I should probably take care of customers who have supposedly been born before 1800-something. This is a clear case of column enforcement over the date of birth. The same goes for name or character columns where the address is just one letter: a clear case of an error, of a problem in my data. And the list goes on. For example, I can define patterns myself; this is well known from the case of emails or telephone numbers. If the range of valid values I would have to define is too large, I can instead define negative value lists, which are hopefully shorter, and say: when the data comes from this list, raise a flag and record that the data quality check failed, so it goes into the audit with a problem. Spellcheckers can also be used; the classical spellcheckers you use in Word also work on textual data.
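As a rough illustration of such column screens, here is a small Python sketch: each screen is just a named predicate over a row, and a row is checked against all of them. The column names, the 150-year bound and the email pattern are assumptions for illustration, not the exact rules of any real cleaning engine.

```python
import re
from datetime import date

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")   # simplistic pattern, for illustration only

# Each screen is a (name, predicate) pair over a row dictionary.
SCREENS = [
    ("name_not_null",   lambda r: r.get("name") not in (None, "")),
    ("name_length_ok",  lambda r: r.get("name") is not None and len(r["name"]) > 1),
    ("birth_date_sane", lambda r: r.get("birth_date") is None
                                  or r["birth_date"].year > date.today().year - 150),
    ("email_pattern",   lambda r: r.get("email") is None or bool(EMAIL_RE.match(r["email"]))),
]

def run_screens(row):
    """Return the names of all screens this row fails."""
    return [name for name, check in SCREENS if not check(row)]

row = {"name": "X", "birth_date": date(1799, 1, 1), "email": "not-an-email"}
print(run_screens(row))   # ['name_length_ok', 'birth_date_sane', 'email_pattern']
```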
Other checks concern structure enforcement, for example how the columns relate to each other. The important task here is to check the primary and foreign keys to detect how the tables relate to one another, and how I can then split aggregated groups of fields into different groups of fields, for example through foreign keys or something else
if I'm going to build mini-dimensions. We've spoken last week about the demographics mini-dimension. Something like this is going to split the address, because I'm not going to need the street; I'm going to need only the zip code, or maybe the city, if I'm doing a marketing analysis on where my clients come from. So I'm going to split this field into more attributes, then split the client dimension into a mini-dimension and the main dimension, and connect them to the fact table through foreign key relations.
Other quality checks are related to data and value rule enforcement. A classical situation is where I have a customer database with only companies in it, and then the same customer is simultaneously registered as a limited company (I think in German this is the GmbH, right?) and as a corporation. This status cannot be: I define the table as holding information about corporations, and then I see that the value is actually 'limited'. This is a contradiction in terms, and this kind of problem can be found through data and value rule enforcement. Such filters can also include statistical checks. Classical cases are when names have to be checked. One can find in the database a male customer with the name Sue or Mary; it might be correct, but it would be better that a human looks over it, because Sue is not a classical name for male humans. It's an anomaly, if you want. Something like this can again be checked through data and value rules. We'll also speak about anomalies in a short while.
Let's see an overall sketch of how data cleaning is actually performed. We have the staged data and we have the deliverables of the data cleaning process. After profiling the data quality, we start by defining the filters which we need to check our data, the so-called screens. These screens are then executed on the data which we have staged in flat files, a database, XML or whatever. Quality measures result from executing the screens over the staged data, and error events for the fact tables are generated. These error events produce metadata which has to be stored; it goes into the audit dimension, and the ETL process will continue provided that no errors from the error event fact appear on the list of fatal errors. If there are errors which are considered fatal by our business process analyst, then everything stops; otherwise, the process continues. Of course, to make the process run faster, there are also ways of running not just screen by screen but of defining waves which work on the same database or table, an aggregated way to use the same source with all the screens at once. So I have the data and I have 10 screens working on that data; I'm going to run those screens as a wave. Is it clear how the process works?
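A minimal sketch of this orchestration, assuming in-memory rows and hand-written screens: all screens run over the staged data in one wave, every failure is logged as an error event with its screen, date and table, and the process only continues if no fatal screen has fired. Function and field names are invented for illustration.

```python
from datetime import date

def run_wave(table_name, rows, screens, fatal_screens):
    """Run all screens over the staged rows in one pass (a 'wave'), collect error events,
    and return a go/no-go decision for the rest of the ETL process."""
    error_events = []                            # rows of the error event fact table
    for row_id, row in enumerate(rows):
        for screen_name, check in screens:
            if not check(row):
                error_events.append({            # screen / date / table play the dimension roles
                    "screen": screen_name,
                    "date": date.today(),
                    "table": table_name,
                    "row_id": row_id,
                })
    fatal = [e for e in error_events if e["screen"] in fatal_screens]
    return error_events, len(fatal) == 0

screens = [("amount_not_null", lambda r: r.get("amount") is not None)]
events, may_continue = run_wave("sales_stage",
                                [{"amount": 10}, {"amount": None}],
                                screens, fatal_screens={"amount_not_null"})
print(len(events), may_continue)   # 1 False
```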
Everybody sleeping? Yeah, good. A fair question which could come right now is: is the data really that dirty? Do we really need these mechanisms? Is it necessary to speak so much about data cleaning? Well, someone from a Canadian company came to visit our institute last week, and they said they now develop only products for cleaning data. They used to have data warehousing solutions, but they don't care about them anymore, because that market is already occupied by big companies. A very good way of sustaining their company is developing just data cleaning tools. It's a way of living, and it also shows that the data in big companies is actually dirty.
And so we come to our first detour, but we should probably make a short break first. And yeah, we'll continue at ten past four, yeah? Okay, so let's continue.
For the detour, I've prepared a small example of garbage in, garbage out. I wanted to download something from MySQL, so I tested how good their interface is for inputting new clients. Of course, I just inputted what any user would give in order to get the software. I wanted to write something fast, so I did. I knew that they would probably use some kind of pattern-based email address filter, so I just wrote something looking like an email. The password, names that don't look like names, whatever came to mind: of course, all of this was accepted as valid. The first hurdle was that they require a confirmation of the email address, but as you all know, there are these ten-minute emails one can use. Just Google for ten-minute emails, build yourself an email address which is valid for only ten minutes, receive the email from MySQL, click on it, and then you have an account; you use it, you download the software, and then you're gone. What happens on the other side, however? MySQL keeps this information as a client record with supposedly very important data to mine: for example, someone with this name, from this postal code, from Cambodia, who's a hacker and has a company with 5,000 employees and an interest in their software. Is it important for data mining? Is it of high quality? Actually, it's not. It's crap, and following the principle crap in, crap out, they can't really get great results by allowing something like this to go into their analytics.
We'll look at some approaches of identifying such data through the data cleaning process. There are, of course, open source solutions and non-open source solutions.
Most of them are of a semi-automatic nature. From the open source side, I would like to mention DataCleaner, which is quite easy to use and does very good work, and the Talend Open Profiler. As for the non-open-source tools, all the big vendors have bought themselves small companies which started out performing data quality checks. I've taken some screenshots of DataCleaner, just to show you how it can work for simple tasks. For example, for email addresses you can define your own regular expressions; you can test the regular expressions you have developed with some inputs and see what passes and what doesn't. You can also use already-defined regular expressions from their library, and so on. One good example is the range of categories they cover, so many types of data for which they have defined validity checks through regular expressions; an example here for date and time. A very important task when performing cleaning, as I've already mentioned, is anomaly detection.
That means data which doesn't really fit the picture. One example I've already used was a record which says that a male is named Mary. It's kind of strange, so a human should probably look over it and see through the metadata where the data came from. Maybe it resulted from a match of two records where a mistake was made in interpreting a male/female flag encoded as zero and one, and this resulted in a male with this strange name.
So anomaly detection is quite an interesting and important field when speaking about data quality. This one with the name is not actually the best example; the most interesting cases for anomaly detection are large sales numbers, distributions of data, and so on. For example, if I'm going to investigate anomalies in my sales numbers for cars in different cities, I can do a select, report my counts per city, and I can already see that something might be wrong with Bremen or Berlin, because compared with all the other cities the deviation is quite high. It can't be that I've sold only two here when in Hanover I've sold so much. Maybe it's true, maybe I've just opened a store there and didn't sell anything, but I should look for an explanation; I should see whether something is really wrong with those numbers. Of course, you can't spend all your life performing anomaly checks, and as I've said, data warehouses are usually pretty big.
You have fact tables with hundreds of millions of records. What can you do? The easiest solution is to perform data sampling: just take a part of the data, you don't need to take everything. A classical solution is to divide the data into 1,000 buckets and choose one record from each of them. Another solution is to add a random number as a column: add a new column to all of the rows and take the 1,000 rows with the smallest random values. Since they are random, I can be more confident that I've picked fairly across the records. Why am I so interested in picking fairly and uniformly? Why shouldn't I just choose a range of dates, say, give me just the last month and I'll search for anomalies there? The biggest problem is that anomalies usually happen temporarily, on a time basis. This is why I have to take random picks in order to hopefully detect something through data sampling. Of course, the sampling ratio also has to be chosen wisely.
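Here is a small Python sketch of the two sampling variants just mentioned, under the simplifying assumption that the rows fit in memory; in a real warehouse the same idea would be expressed in SQL or through the sampling facilities of the DBMS.

```python
import random

def random_column_sample(rows, n=1000, seed=42):
    """Attach a random number to every row (the 'random column') and keep the n smallest."""
    rng = random.Random(seed)
    keyed = sorted((rng.random(), i) for i in range(len(rows)))   # sort by the random column
    return [rows[i] for _, i in keyed[:n]]

def bucket_sample(rows, buckets=1000):
    """Divide the data into buckets and take one row from each bucket."""
    step = max(1, len(rows) // buckets)
    return rows[::step][:buckets]

data = list(range(100_000))                       # stand-in for staged rows
print(len(random_column_sample(data, n=5)), len(bucket_sample(data, buckets=5)))
```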
I'm not going to detect any anomaly if out of 100 million records I choose only two, right? Thank you very much, what can I possibly see in that? Another approach is data profiling. Profiling of data is very commonly used when dealing with strings.
Names are a typical case, and I've brought here an example where I check the names from my database against a dictionary.
The dictionary was generated by someone, so again manual work. I just check the names and can see, for example, that 47 records from my database have names which do not match anything in the dictionary. Then I can look through these 47 and see what's wrong. Why aren't those names there? Are they maybe Spanish names, while I've checked against US names? Are they valid names or just junk? Either way, this reduces by a sizable amount the records I need to check manually, which is exactly the purpose.
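A minimal sketch of such a dictionary check; the dictionary contents and the column are of course made up here, and in practice the dictionary is the manually maintained part.

```python
def profile_names(names, dictionary):
    """Return the names that do not appear in the reference dictionary."""
    known = {n.lower() for n in dictionary}
    return [name for name in names if name.lower() not in known]

us_first_names = {"John", "Mary", "Susan", "Robert"}          # illustrative dictionary
customers = ["John", "Maria", "Xx7", "Robert", "Giannis"]
print(profile_names(customers, us_first_names))
# ['Maria', 'Xx7', 'Giannis'] -> only these need a manual look
```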
Data profiling goes, however, even further and tries to work with distributions. The simplest approach is to take a closer look at values which don't fit a certain distribution.
I'm probably going to look at data which is bigger or smaller than the average plus or minus three times the standard deviation, you know, the three-sigma rule. Everything falling outside such bounds is quite exclusive: it is either an anomaly, or it is correct data which nevertheless doesn't quite fit the picture and should be investigated more closely.
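Here is a small sketch of the three-sigma check applied to the car-sales-per-city example from before; the counts are invented so that one city clearly falls outside the band.

```python
from statistics import mean, stdev

def three_sigma_outliers(counts):
    """Flag entries whose value lies outside mean +/- 3 * standard deviation."""
    values = list(counts.values())
    mu, sigma = mean(values), stdev(values)
    low, high = mu - 3 * sigma, mu + 3 * sigma
    return {city: c for city, c in counts.items() if c < low or c > high}

sales_per_city = {"Hannover": 480, "Braunschweig": 455, "Wolfsburg": 510, "Kassel": 470,
                  "Goettingen": 495, "Kiel": 500, "Luebeck": 465, "Mainz": 485,
                  "Trier": 475, "Ulm": 490, "Jena": 505, "Bremen": 2}
print(three_sigma_outliers(sales_per_city))   # with these made-up numbers, Bremen is flagged
```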
When talking about keys, they should have a flat distribution, and I should make sure of that. Otherwise, I may have had problems when creating the foreign keys, for example, or the primary key constraints if the data comes from flat files. Another classical distribution is the Zipfian distribution when it comes to sold products.
Usually, companies sell far more cheap products than expensive ones. If you compare the sales of cars with the sales of pencils, the numbers are pretty clear; you get a Zipf-like distribution when you order products this way.
And when you are a car dealer and you sell both Volkswagens and Porsches, and you suddenly see that you basically sell three Volkswagens a month and ten Porsches, then either you have rich clients and you live in Beverly Hills,
or something is wrong with your data warehouse, right? So the distribution is a nice mirror to look into to see what happens with your data. Of course, there are many more distributions you can investigate and fit over your data to see which one actually fits: Pareto, Poisson, the S-shaped distribution, and so on. There are also statistical software packages for this; the most used is SPSS, which is not an open source solution. Throughout our detours I usually mention what is open source and what is not, in case you want to try things at home, and there are free packages that can be used for something like this as well. Okay, we've now spoken about the data cleaning part of data transformation.
We will continue with the second part of data transformation, namely integration. The need for integration comes from the fact that in the operational environment we have different tables and different sources, but the target is a data warehouse which has a different schema: either a star or a snowflake schema, or a schema derived from these two. In such cases I need to integrate the data I have from the operational environment, and during this process I need to resolve all the differences between the perspectives on the data in the operational environment. I also have to be careful not to keep any redundancy.
Let's take a closer look at the integration of schemas. This process actually involves four steps. First comes the pre-integration analysis, where an analysis is performed to see which schemas are to be integrated. The schemas to be integrated are then compared and made to conform with each other, so that I can use data from two sources which is actually the same data. For example, the sales of products from America and the sales of products from Europe are simply sales for me in the data warehouse; however, the schemas might differ, for instance the price might be called cost in one of the tables as the attribute name. Such things need to be recognized as conflicts, understood, resolved, made conform, merged and restructured into a unifying schema, and only then can the data be loaded into the data warehouse. Of course, integration is also a continuous process which needs refinement, because when a modification to the schema of one of my sources is performed, the merging process has to adapt so that I can get the data from the new schema.
And for this reason, this has to be performed on a continuous basis. Okay, so we've said that we start with the pre-integration analysis. Through this analysis, we identify schemas with conflicts.
So two schemas which hold the same type of data, which should be brought together. We identify conflicts, like for example, cost versus price, through comparisons of schemas. We build a list of conflicts, and then we try to resolve them.
We'll see that this is also largely manual work. We then try to conform the schemas into a unifying one: we modify the schemas, integrate them, and then merge and restructure the data into the final integrated schema.
So this is what the whole process looks like. The pre-integration analysis has to choose a certain strategy. It is often the case that the number of tables in my sources is so large that I first need to perform a pre-modularization step: I have to perform an analysis and see what belongs together and where I should look more closely. So I perform a modularization when the choice is too large for me, and then I take a closer look. Of course, you can't integrate everything. The operational data system usually comprises a lot of tables, and you are not going to integrate everything into the data warehouse; it is not even desirable to do so, and as I've said, the ETL process is also limited in time. But there are different strategies according to how you can modularize and
how many tables you have to integrate. One can differentiate between two families of strategies: the binary approach and the n-ary approach. With the binary approach, one starts with two tables, integrates them, adds another one, performs the next integration step, and so on, a second step, a third step, until I have all the tables from all my sources. Another solution is balanced integration, where I choose to split the process into two or more integrating sub-processes: I perform an integration here, another integration there, and so on. Then there is big bang integration; it can be difficult, but if you have only a few tables and a few sources to integrate, it is also a valid approach. And finally there are heterogeneous solutions where you can have it all: the binary approach combined with some smaller big bangs of three or more tables, and so on. This all depends on what the operational system looks like. The conflicts we were speaking about can occur at different levels. One of them is the table-to-table conflict.
Let's consider a table called sales, and a German table called Verkäufer. How does the data warehouse know that the information stored in these two tables should actually be brought together? Well, probably by using a translator; that's valid, and it can be used. Identifying such conflicts can be done using dictionaries, translators, and synonym or homonym lists; WordNet, for example, is a source for mapping two terms to the same meaning. Another type of table-to-table conflict is the structure conflict: one structure is more comprehensive, one has more attributes than the other, but the data is actually almost the same, apart from, say, an attribute which can be derived from somewhere else. I can imagine a client table where I have the name and the address, but the address is just one string field, and another client table where the address is split into separate attributes; this is such a conflict. Integrity condition conflicts, for example different data types or data lengths, can also lead to table-to-table conflicts. Attribute conflicts caused by naming can again be solved through dictionaries and so on. There are also default value conflicts: in one table the default value is zero, in the other it is a space, just a space; a conflict which needs to be resolved. Integrity conditions play a role here as well, and there are mixed conflicts that combine table structure conflicts and attribute conflicts. All of this needs to be resolved by the integration process. The goal of this process is to achieve a compatible schema where the data can be stored together, unified, and
then loaded into the data warehouse. However, in order to achieve such a schema from different types of sources, manual integration is usually required. What do you do with semantic conflicts, as I've said, for example, cost and price? What if both of them appear in sales tables and are meant to be the same thing? Syntactically, they are very different; semantically, they are related, if not the same thing. A computer can't connect these two concepts that easily. Maybe by using technologies like artificial intelligence or latent semantic analysis, one could reach this conclusion through additional information and say, okay, they are semantically related, so I can bridge this data together.
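A hedged sketch of such a semi-automatic proposal step: a hand-made synonym table stands in for the ontology or latent-semantics component, and plain string similarity catches small syntactic differences. The synonym pairs and the threshold are assumptions for illustration; the human decision afterwards is exactly the part that cannot be removed.

```python
from difflib import SequenceMatcher

# Hand-made synonym table playing the role of the ontology / latent-semantics component.
SYNONYMS = {frozenset({"cost", "price"}), frozenset({"client", "customer"})}

def attributes_match(a, b, threshold=0.8):
    a, b = a.lower(), b.lower()
    if a == b or frozenset({a, b}) in SYNONYMS:
        return True                                              # exact or known semantic match
    return SequenceMatcher(None, a, b).ratio() >= threshold      # small syntactic difference

def propose_matches(source_attrs, target_attrs):
    """Semi-automatic step: propose attribute pairs; a human still confirms or rejects each one."""
    return [(s, t) for s in source_attrs for t in target_attrs if attributes_match(s, t)]

print(propose_matches(["prod_id", "cost", "customer"],
                      ["prod_id", "price", "client_name"]))
# [('prod_id', 'prod_id'), ('cost', 'price')]
```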
Other things which need to be done manually are the renaming of entities and attributes, in cases where the difference is not a small syntactic one like a missing letter or two. Cardinalities and functionalities, as well as different data types, also fall into the same category. For a date, maybe the application which uses the table in the operational system for North America uses timestamps, while in Europe the DBMS uses the date type. This has to be brought together. What I want to stress here is that the core problem when performing schema integration is the semantic component of integrating the data. This is why it results in manual work: matching parts of the schemas can't be done completely automatically, because bridging this information can't be achieved with 100% certainty
only through algorithms. Of course, there are semi-automatic approaches. As I've mentioned, there are solutions which lead you to decisions of the form: I, as an algorithm, propose that these two entities are semantically similar, because through latent semantics, or as a distance in a domain-knowledge ontology, I've seen that the distance between them is quite small. However, a human has to be there and decide: yes, it's the same thing, or no, it's not. Doing this fully automatically is error prone, and even doing it with humans can result in errors through misunderstanding. The matching process can be semi-automated through labels:
I look at the name of the table and the names of the attributes and try, for example, to use an ontology, and I see that in my ontology the distance between price and cost is quite small. They are used in the same field and, moreover, they are quite close to each other in meaning; I could then solve something like this by matching them semantically. I could go even further and perform instance-based matching: look at the data and see how it is distributed, and say, okay, it seems that the data from these two sources has a very similar distribution, if not the same, which might lead to the conclusion that they are the same thing. Another solution is structure-based matching, where I come from the structural side and ask: what kind of attributes does this structure define? What do my attributes look like in the other one? Which elements am I going to store? What are the depths of my hierarchies, the relations, the foreign key relations, and so on? If they look the same, then I can reason from this and say, okay, I should probably perform a schema matching here, because they are the same thing. An example of query-driven schema mapping is the following:
we have a source and a target which match at a high level, and we need to perform the low-level mapping. The idea is that this low-level data, let's consider it as sales, sales for North America and sales for Europe, should be brought together; only the table on one side is different from the table stored on the other. I have an example here of how the schema matching is actually performed: here I have a product ID, there just an ID; the product is the same; here a group ID, so a foreign key; there I have the description of the group and the category. They are the same thing, in two different regions, stored in similar schemas. Through a query, I want to transform the data from the first schema so that I can query it like the second one, yeah? This is what I actually need to do here: I want to abstract from the schema itself, because I can detect that the two match, and map the data from the first source onto the schema I've proposed in the second part, yeah? Well, the task is not that easy.
So as I've said, I have a high-level mapping, and I want to generate a query or a set of queries that transforms my data from the left side and integrates it and makes it conform to the target schema on the right side. The problem which arises with something like this is generating the correct query: how do I generate the query or queries, and how can I be sure that these queries perform the correct mappings? And to make it more complicated, I have to guarantee that the transformed data corresponds to the target schema, yeah? These are the problems we need to answer. Of course, solutions can come from industry, and they say: we'll do the schema integration for you. One such example is AquaLogic from BEA Systems.
They have been owned, I think, for two years now by Oracle. If you look at their previous website through the web archive, you can see they were promising that mappings and transformations can be designed in an easy-to-use graphical user interface, with 200 functions in a library. For complex mappings and transformations, and this is the key part, architects and developers can bypass the GUI. So if things get more complicated than just syntactic matching, you need consultants: BEA consultants, they are the architects and developers that bypass the GUI, yeah? And they use XQuery, which is an open standard for querying XML, to define or edit the services. So if you take a closer look at what BEA actually does: what can be solved syntactically, they do semi-automatically; for what can't, you need to buy their consultants, who will do it for you without the GUI, the graphical user interface.
This is how they make money with it. What do they actually support, what does their tool do? They have a data transformation tool that transforms binary data into XML and XML back into binary data, so the transformation is actually performed on XML. They get the data from your source into XML, perform some modifications and transformations there, and then convert it back into binary data. So essentially they build converters around an XML-to-XML transformation: transform the data from an application-specific format to XML, transform that XML into another general schema which they can read, and then, after the transformation has been done manually, transform it back to binary; so, just converters. Of course, I didn't have the software to make an example with it, but I've used something similar, which will come up in the detour on the next slides. The alternative to using a consultant is to do it yourself.
You can use open source projects: data mappers, XML engines, XML editors, XPath and XQuery. There are also specialized tools for it; CloverETL and Altova MapForce are just two examples. I've brought an example from Altova MapForce, which, as they say, works on the same idea as the BEA integrator. The idea here is to use XSLT between the temporary XML results. They offer a nice editor which loads the source XML, and I then specify what my XML should look like in the target format. By connecting the fields which are the same thing and can be mapped, it's just a matter of clicking here, and here, and here. Through my clicks, Altova generates a transformation, an XSLT file, which transforms the loaded XML source file into a destination XML after I've done my manual work in the editor. Then a converter to binary is used so that the result can be loaded into a database management system. So the secret is that most of the work still has to be done manually in the editor: they just port you from one XML to another, but the links are established by you.
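For a do-it-yourself flavour of the same idea, here is a small Python sketch that plays the role of such a generated transformation: it renames source XML elements to a target schema according to a hand-written field map. The element names and the mapping are invented; in the tools above, this mapping is exactly what you produce by clicking in the editor, and the generated artifact would be XSLT rather than Python.

```python
import xml.etree.ElementTree as ET

# Invented mapping: it stands in for the links you would draw by hand in the editor.
FIELD_MAP = {"ID": "product_id", "cost": "price", "group": "group_id"}

SOURCE = """<sales_eu>
  <sale><ID>17</ID><cost>9.99</cost><group>3</group></sale>
  <sale><ID>18</ID><cost>4.50</cost><group>3</group></sale>
</sales_eu>"""

def map_to_target(source_xml):
    """Rename source elements to the target schema, the way a generated XSLT would."""
    target = ET.Element("sales")                      # assumed target root element
    for sale in ET.fromstring(source_xml):
        row = ET.SubElement(target, "sale")
        for child in sale:
            mapped = ET.SubElement(row, FIELD_MAP.get(child.tag, child.tag))
            mapped.text = child.text
    return ET.tostring(target, encoding="unicode")

print(map_to_target(SOURCE))
```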
Okay, let's go to the last part of our ETL process, namely the loading. Loading can also be broken down into parts: the initial load and the continuous load over time, because I need to keep updating my data warehouse. The biggest issue of loading, especially of the initial load, is the huge volume of data: we're speaking of large amounts of data which I need to load in a small time window. A big problem here are the indexes and the aggregated tables. Loading has to consider that we may have materialized views, or that we may be using MOLAP; everything has to be handled at this stage. Aggregations may have to be recalculated, and we have spoken about strategies for updating materialized views two lectures ago;
the indexes, likewise, are a problem which loading has to deal with in this phase; we'll speak about that. Another issue loading has to deal with is recovery. If I start the loading process and then something fails, I don't want to restart it from scratch, because restarting raises the question of what I have already loaded into the data warehouse; hopefully I won't introduce duplicates, so I would have to compare, and so on. Recovering gracefully is therefore a very important task of the loading tools.
For the initial load, one has to establish what the dimension tables are, and create and assign the keys between the dimension tables and the fact tables. One must ensure that when a new dimension is loaded, it is conformed for all the records which are also loaded into the fact table. And of course, I have to be careful to write the dimensions in the proper format: if I have physical multidimensional storage or relational multidimensional storage, I have to take care of the snowflake or star schema accordingly. As for the fact tables, I've said that they're usually pretty big.
The solution is to use some bulk load utility. The secret behind bulk loading is that it doesn't perform the classical inserts row by row. Most databases and all data warehouse systems support bulk loading; when it is used, they disregard the constraints and checks that database management systems normally perform, because these take time. So when an administrator says, I'm using the bulk loading functionality on this big chunk of data, the management system knows: okay, I won't do any checks, any constraints, anything; I'll assume you know what you're doing and that the data is correct, so that you can finish faster. That's the difference between this and classical inserts. Of course, the best solution for
performing something like this is to do it distributed, in parallel. As for classical loading tools: if you search on Google for data transformation services, you will come up with a list of tools suitable for something like this. There is the bulk copy utility bcp, a classical, well-known and pretty old tool for this, and there is SQL*Loader; these are the classical contenders for initial loading. When it comes to loading over time, of course,
this has to be scheduled: when should loading be done? Here we always have a tradeoff between how complete our data is and how fresh it is. You are not going to see a data warehouse which performs yearly loading, because then the decisions you take on a monthly basis are affected. But you are also not going to see many data warehouses which reach hourly loading granularity. Of course, loading is a most sensitive step, because this is the transition between the staging area and our data warehouse. So I have to make sure that I'm not creating duplicates, that I'm not loading things twice, for example if the loading process breaks, and I have to be sure that I'm not amplifying inconsistencies through multiple loads.
These are things that the data warehouse developer has to take care of when proceeding with the loading step. There are also some best practices for this ongoing loading of facts. For example, I'm going to separate the updates from the inserts, so new data from data which is already there and just gets updated. I've said something about the indexes, which need special treatment. The solution is to drop the indexes which are not required for the updates, load the updates, and then drop all the indexes for the inserts. Otherwise, when I perform the inserts, I also need to update the index structures, and since there are usually a lot of them, this costs a lot of time. I don't want to do this on a fact table which has millions of records, so I drop them and then construct the indexes one more time afterwards. I have already said in the indexing lecture that reconstructing indexes is also a good opportunity for building the optimal geometry for R-trees, if you remember the minimum bounding rectangles and the heuristics for avoiding dead space. When rebuilding the indexes, you of course build them with the optimal geometry and the least dead space; this is the moment to do it.
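As a sketch of this drop, bulk-load, rebuild pattern, assuming a generic DB-API connection and hypothetical table, index and helper names; the actual bulk step would be the DBMS's own utility (for example COPY, SQL*Loader or bcp) and is only hinted at here.

```python
def reload_fact_table(conn, data_file):
    """Drop indexes, bulk-load the fact table, then rebuild the indexes once."""
    cur = conn.cursor()
    indexes = {   # hypothetical index names and DDL; syntax details vary by DBMS
        "idx_sales_date":    "CREATE INDEX idx_sales_date ON fact_sales (date_key)",
        "idx_sales_product": "CREATE INDEX idx_sales_product ON fact_sales (product_key)",
    }
    for name in indexes:                    # 1. drop indexes so the inserts stay cheap
        cur.execute(f"DROP INDEX {name}")
    bulk_load(cur, "fact_sales", data_file) # 2. hypothetical wrapper around the DBMS bulk utility
    for ddl in indexes.values():            # 3. rebuild each index once, with optimal geometry
        cur.execute(ddl)
    conn.commit()
```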
After you've performed the ETL, you have the data in the data warehouse. But in the last few years, a lot of data warehousing solution companies have been speaking more and more about the metadata. What should one do with metadata when speaking about the data warehouse? Well, metadata is actually an important part when considering the quality of the data, because it tells us what we have actually stored in our data warehouse and where it comes from. More specifically, the repository which holds the metadata stores the intermediate products we've spoken about when performing extraction, transformation and loading. We've always said we perform an analysis to see what the operational system databases look like, from table names to attributes, data types, lengths, and so on.
All of this is stored in the metadata, together with the audit errors which have been registered through the screens and so on; everything is here. As distinct types, we have the metadata which comes from the source system, the ODS, the metadata which comes from the data staging area, and the metadata of the database management system itself, the actual warehouse metadata. For the source specification, we are talking about the source logical schemas, and also about ownership descriptions: who owned the data which we have now ported into our data warehouse. This is data about the data stored in the data warehouse: how often was it updated? What kind of access methods were used? Was it, for example, stored directly in the operating system, or accessed through a database management system? Was it a relational table, and so on? Other metadata coming from the source system describe how the data was processed, how the job was scheduled, and how the data was extracted, so what kind of extraction code was needed to get it out of the operational data system. From the data staging area we have metadata like the scheduling and the types of files we have used, and the intermediate results we have obtained. We have the dimension tables, of course, with the surrogate keys we have assigned between the dimension tables and the fact tables. We have the transformations which the data has undergone from the beginning of its lifetime in the ODS until it reached the data warehouse, the aggregations it went through, the mappings we have performed, the loading information, the aggregation definitions, the audit, the errors. All of this is stored in the metadata.
And I've brought here an example of what the metadata for a simple cube looks like. It is stored in XML; the cube itself is just a star schema, but the corresponding metadata, which resulted from prototyping the data warehouse, is also stored in the metadata repository. Of course, it has a name, it describes the cube, it has dimensions with hierarchies and levels of granularity, the table where each of these is stored, and it has the fact reference, yeah, so it's pretty intuitive.
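For illustration, here is a made-up fragment in that spirit, parsed with Python's standard XML library; the element names mirror what the slide describes (cube, dimensions, hierarchies, levels, fact reference) but do not follow any real metadata standard.

```python
import xml.etree.ElementTree as ET

# Invented metadata fragment: names and attributes are illustrative only.
CUBE_METADATA = """<cube name="sales_cube" factTable="fact_sales">
  <dimension name="time" table="dim_time">
    <hierarchy name="calendar" levels="year,quarter,month,day"/>
  </dimension>
  <dimension name="product" table="dim_product">
    <hierarchy name="product" levels="category,group,article"/>
  </dimension>
</cube>"""

cube = ET.fromstring(CUBE_METADATA)
print(cube.get("name"), "->", cube.get("factTable"))
for dim in cube.findall("dimension"):
    for hierarchy in dim.findall("hierarchy"):
        print(dim.get("name"), dim.get("table"), hierarchy.get("levels"))
```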
This is what you have in the metadata. You might ask yourself, why should I store all this? Think of when you add a new data mart for a department, and that department says, okay, I need this and this and this. Then you would have to rely on the table names alone to know what is actually in the data warehouse. Well, that is not enough: you have to see where the data that is now in those tables or structures came from, and you can get this from the metadata, yeah?
As a short summary, what I would like you to remember from today's lecture is that the data warehouse project is usually a complex task.
It can have from 500 up to 1,000 sub-tasks. It involves choosing the hardware and the software, and establishing a timeline. The timeline usually comprises three phases: building a very specific prototype with just one data mart in mind, then trying it on a set of beta testers, and then making it available to the whole company. The most important process when building the data warehouse is the extract-transform-load process, which involves establishing the storage structure in the staging area and the strategies to use for extraction. How should I decide, for ongoing extraction, what has changed? For example, sniffing or scraping of logs, yeah? And what are the problems of scraping and sniffing? For example, the issue that the log will probably be full and emptied by the database management system at some point. We have spoken about the transformation sub-process, which involves assuring better data quality; of course, the major problem here is the semantics. The semantics is also problematic when performing integration, which likewise results in manual work assisted by semi-automatic approaches. We've spoken about the last sub-process of the ETL, the loading: the main issues, such as how it deals with indexes, and the solution of loading everything into the fact table as a bulk. And we've spoken about the metadata as a means to describe the information which we have in the data warehouse; worth mentioning is that it comprises all the intermediate results of the ETL process and helps in understanding how to use the data warehouse, eventually for building and supporting new data marts. Next lecture, we will start with business intelligence.
We will deepen our discussion of data warehousing and OLAP by going into data mining, and how we can derive information by analyzing the data.
We'll speak about the basic principles of data mining, and we'll start with a concrete example, association rule mining: a classical kind of data warehouse analysis which is popularly called market basket analysis. So, what have people bought together? Like wine and cheese, or stuff like that. Actually, Real, for example, is doing something like this on their data warehouse just to see how to stock their shelves: where to put the wine, where to put the cheese. They should be together, because when the client comes in, he buys cheese and says, hm, that would go well with wine, and he buys more. Stuff like this: market basket analysis, association rule mining. That's it for today, we'll see you next time.