
Indexes (18.11.2010)


Formal Metadata

Title
Indexes (18.11.2010)
Part
4
Number of Parts
13
License
CC Attribution - NonCommercial 3.0 Germany:
You may use and modify the work or its content for any legal, non-commercial purpose, and reproduce, distribute and make it publicly available in unchanged or modified form, provided you name the author/rights holder in the manner specified by them.
Production Year
2011
Production Place
Braunschweig

Content Metadata

Abstract
In this course, we examine the aspects of building, maintaining and operating data warehouses, and give an insight into the main knowledge discovery techniques. The course deals with basic issues like storage of the data, execution of analytical queries and data mining procedures. The course will be taught completely in English. The general structure of the course is: typical DW use case scenarios; basic DW architecture; data modelling on a conceptual, logical and physical level; multidimensional E/R modelling; cubes, dimensions, measures; query processing, OLAP queries (OLAP vs. OLTP), roll-up, drill-down, slice, dice, pivot; MOLAP, ROLAP, HOLAP; SQL:1999 OLAP operators, MDX; snowflake, star and starflake schemas for relational storage; multidimensional physical storage (linearization); DW indexing as a means of search optimization: R-trees, UB-trees, bitmap indexes; other optimization procedures: data partitioning, star join optimization, materialized views; ETL; association rule mining, sequence patterns, time series; classification: decision trees, naive Bayes classification, SVM; cluster analysis: k-means, hierarchical clustering, agglomerative clustering, outlier analysis.
Transcript: English (automatically generated)
Hello everyone, I welcome you today to the data warehousing and data mining lecture. Professor Balke could not come today, so we will do this lecture together. As a bit of a summary, we spoke last week about the last part of modeling.
We went through the logical models, we spoke about the cubes, the dimensions, the hierarchies, and how one can use the classification levels to model the data warehouse. We then spoke about the storage of the multidimensional paradigm,
either relationally through the star or the snowflake schema, or directly multidimensionally by performing array storage. We have spoken about linearization,
and then about the problems which arise through linearization. Maybe you remember that the order of the dimensions is decisive. If I linearize by considering the time dimension first, and then I run a query based on the product, then I will have to jump between blocks.
And this means a lot of reads. Reads are expensive, so a very important thing to consider is: how do I store my data when I perform linearization? These decisions are usually heuristics based, so
I have to look at what my users ask, what their queries are, how they look, and decide from that how the ordering of the dimensions should be performed.
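To make that recap point concrete, here is a minimal sketch of row-major linearization (the dimension sizes are just the illustrative figures used later in the lecture; nothing here is from a real system):

```python
# Minimal sketch: row-major linearization of a 3-dimensional cube.
# The sizes are hypothetical; the point is only that the last dimension
# varies fastest, so cells that differ only in the last dimension are
# adjacent on disk, while cells that differ in the first dimension are far apart.
DIM_SIZES = (24, 30, 10_000)              # e.g. (month, location, product)

def linearize(cell, sizes=DIM_SIZES):
    """Map a multidimensional cell index to a one-dimensional array offset."""
    offset = 0
    for coordinate, size in zip(cell, sizes):
        offset = offset * size + coordinate
    return offset

# Same product, same location, consecutive months: 300,000 cells apart ...
print(linearize((0, 0, 42)), linearize((1, 0, 42)))   # 42 and 300042
# ... but two products in the same month and location are direct neighbours.
print(linearize((0, 0, 42)), linearize((0, 0, 43)))   # 42 and 43
```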
This week, we will continue with the physical level. We will speak about some optimization methods, namely indexes. Okay, so we'll start with the tree-based indexes, and then we'll continue with something which is typically used in data warehouses, namely bitmap indexes.
Good, why should we use indexes? The question is, or can be asked, why not? And I've brought an example just to show how important they are. A usual scenario in data warehouses is that you have tables which are pretty big.
A 100 gigabyte table is not something unusual. Our hard drives today can read somewhere from 50 up to 70 megabytes per second, if you speak about a 7,200 rotations
per minute hard drive, as you probably have at home. If you speak about SCSI interfaces or something more powerful with 10,000 rotations, then maybe you can reach a real 100 megabytes per second, up to 300. I don't know, let's assume you can.
If you can reach these speeds, and you are in the situation that you need to do a full table scan, you still need 17 minutes just to go through the data. This is a lot of time: no processing, no nothing, you just read the data. 17 minutes to read the data can't be affordable
in a scenario like data warehouses. And if you consider that you then have constraints, like on the dimensions: you restrict the product dimension to the Bosch 500 something, and then the region to the granularity of country and the time by month,
then you've already reduced the amount of interesting data to some 1.4 kilobytes. So what you're actually interested in, when you have 30 locations, 10,000 products and 24 months, is reading 1.4 kilobytes.
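As a quick back-of-the-envelope check of those figures (a sketch using the ballpark numbers from the example, assuming purely sequential reads):

```python
# Rough sanity check of the figures in the example (ballpark values from
# the lecture, not measurements).
table_size = 100 * 1024**3          # a 100 GB fact table
read_speed = 100 * 1024**2          # optimistic 100 MB/s sequential read

full_scan = table_size / read_speed
print(full_scan / 60)               # ~17 minutes just to scan the table

interesting = 1.4 * 1024            # the ~1.4 KB the restricted query really needs
print(interesting / read_speed)     # essentially instantaneous
```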
And for this, you have to wait 17 minutes to read the whole 100 gigabytes. It's kind of stupid, right? So why read this amount of data just to see what happens in those 1.4 kilobytes? For this reason, people have... oh, this has never happened to me before.
It's okay when it happens to Tilo, he just speaks further, but this never happened to me. Go away... uh-huh, it went away, unbelievable.
We're hoping that our nice hardware will save everything, and then we can go on with the lecture.
So now that you've also had a preview of what I'm going to say, what about the indexes? How do they come into the story?
Well, they help me concentrate my search where I'm actually interested in. For example, let's consider that my 100 giga data is concentrated here in this space. What I'm interested in is actually in this area here,
where I'm speaking about this product between these months. Now, for this information, I have to read all this. Doesn't make sense.
If I use primary indexes, and of course, I can use a primary index on just one criterion, let's say time, I can concentrate my search in this interval. It's a primary index, the data is stored on the hard drive,
sorted just like this, so I spare myself reading these parts. If I'm going further, and I declare also some secondary indexes, like for example, I index also by product, then I can,
again, limit my area of search. Well, I still do some reading which I actually wouldn't need to.
The perfect situation would be to cut this out and to be able to read only what I'm interested in. Of course, you can achieve this when you have good heuristics in your logs and see, okay, people are interested in querying about products,
dimensions with these granularities, in time dimensions with these granularities. And I don't know, in geographical dimension with these granularities. Then I introduce one primary index. I introduce then some secondary indexes, and everything's great. I know that in data warehouses, there are not so many updates.
There are only loadings, so I have to construct my indexes on loading. So I don't have to care about how many indexes I actually insert. This is a nice solution. So we already see that indexes are an interesting way to go
in speeding up queries in databases. Who of you has heard about B-trees, about trees in indexes in general? Hm, that's not good. The rest of you should read the slides of the Relational Databases lecture,
which are on our website. They explain the basics of indexes. They are not the purpose of this lecture. Here we will speak about index structure, which are fit for data warehouses. They are a version of the basic indexes, so
they are constructed on things like B-trees. But you should first read those ones too before the exam or before even the next lecture, so you can better understand what happens with indexes. You should start with B-trees.
They are the basics of indexes in databases. And I will just say a few words about them so you can follow me further. B-trees offer the possibility of storing sorted data and optimizing search times, but they also consider that in OLTP,
so in normal transactional databases, people need insertion and deletion. It's not like in data warehouses. In real life, people are going through the checkout counters, they are buying stuff, so the database needs to be kept up to date, yeah?
This happens with insertion. If I have an index, then I need in OLTP for it to be able to react to insertion and deletion. This is the case of B-trees, they can do this. And the basic structure is a tree with nodes. Each node has a key value, like for
example, in the case of primary indexes, the ID of a record. And pointers, a left pointer and a right pointer, pointing to two child nodes, which have the property that they are ordered.
So every ID in the child node referred to by the left pointer is smaller than the key value in the father. Every ID in the child node referred to by the right pointer, this one here, is bigger than the father's key, yeah?
With this kind of information, I can do fast querying, fast searching. Such a B-tree looks something like this, okay?
So I have a root, this is my root, I have IDs, two, six, seven. I have left pointers, one smaller than two. Everything which is in the child node referred through the right arrow
is bigger than two, but smaller than six. The same goes for seven, everything which is bigger is referred to the right arrow. Everything which is smaller is referred to the left pointer. And then you have pointers to the data on a disk.
So for example, I also have this nice pointer here to know where I can find my record. And this helps because what does the system do? It reads blocks from the secondary drive. And the assumption which saves a lot of seek time
on the hard drive is exactly this. That when I'm performing queries, when I'm searching, I'm doing a range query, and I'm going to read two. The next thing I'm going to read is three. Because if I've done my index on the time dimension, and one, two, and three are days, and I'm going to say what are the sales for
the first month, then I don't want to jump from one to five. Read this block once, then read this block for five, then come back to two. Read this block again from the hard drive. Because reading from the hard drive costs time.
So if I'm going to read optimally, then I need to read one and have two and three in the cache. Keep it there, use it, and then load the next block. This is the idea with B-trees, and this allows for search in logarithmic cost since it is a self-balancing tree.
It's a tree balanced by height. We have n nodes, so the search cost is logarithmic by nature. All the information about B-trees you need, you can find in the Relational Databases lecture, so you can understand better what happens with them and how they balance themselves on insertion and deletion.
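As a small reminder of how such a lookup walks down the tree, here is a minimal, simplified sketch (class and function names are made up for illustration; a real B-tree node packs many keys into one disk block):

```python
# Minimal, simplified sketch of a B-tree style lookup (hypothetical classes).
class Node:
    def __init__(self, keys, children=None, record_ptrs=None):
        self.keys = keys                          # sorted key values in this node
        self.children = children or []            # child subtrees (empty in a leaf)
        self.record_ptrs = record_ptrs or []      # pointers to records on disk

def search(node, key):
    # Find the first position in the sorted node whose key is >= the search key.
    i = 0
    while i < len(node.keys) and key > node.keys[i]:
        i += 1
    if i < len(node.keys) and node.keys[i] == key:
        return node.record_ptrs[i]                # found: pointer to the data block
    if not node.children:                         # reached a leaf without a match
        return None
    return search(node.children[i], key)          # descend; depth is O(log n)
```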
Self-balancing is a very important factor, so please read the slides. Good, in data warehouses we have one problem: B-trees are great for one dimension, but we have more dimensions. I have just named three of them: time, geography, and products.
What do we do if we have ten dimensions? So we need some other kind of indexes which are able to handle multidimensional data. This is the challenge which data warehouses pose to indexes. The basic idea in multidimensional indexes and
multidimensional trees is that they have to consider that products are represented in space as clusters along certain dimensions. For example, we have here three clusters.
They can actually be organized into two parent clusters and again into one big root cluster. Therefore, we need to define a cell geometry which is able to describe these clusters.
Before, we didn't need a geometry: we had just one dimension, we had points, we had a block, that was our geometry. It was a line. In space, it's not like this. In space, if you have two dimensions, you have a rectangle, for example; three dimensions, a cube; more dimensions, a hypercube, and so on.
Good, what results from here is a hierarchical structure. It's something like a root here and children in the shape of a tree. The criteria for producing such indexes,
I've mentioned already one, it's, for example, clustering, which makes sense, I'm going to cluster my objects in space based on their dimensions and I'm going to say, okay, if I take time for the first dimension, I'm going to expect that this product here
is near a product which was sold the next day if I take sales as measure. And their categories, they are both mobile phones. I'm going to expect that these two are clustered together by the product dimension and by the time dimension in the same cluster.
Because they are similar. Then we can differentiate between indexes which allow overlapping of clusters, because these clusters have a gray area between them, or disjoint ones, which say: I don't care, okay, one point may belong to the other cluster, I don't want to be that exact.
I will take it that this belongs here, it goes into the noise, I don't care. Another criterion is whether the tree is balanced or unbalanced. Of course, if I'm going for performance in search,
I want the tree to be balanced. I don't want to have a branch which is fully degenerated, so that I have a tree like this: a root here, then a node here, a node here, a node here, a node here, because this resembles a full scan. I'm starting from here, I don't have anything on the right side, so I have to jump here.
I don't save anything, I'm not pruning, it's not good, right? It doesn't help. So I also have to consider whether the indexes should be balanced or unbalanced. This results in costs at insertion time, as we'll see. Some other interesting factors are: where do I store my data?
We've seen that in B-trees, we have the possibility to store our objects. Also in the nodes, we have these pointers directly on the hard drive. So when I read the root node, if I found my information there, I don't need to go to the leaf nodes to find my data.
I just go through the pointer and find the data on the hard drive. The other possibility would be, as in the case of B+ trees, that the tree structure is used only for navigation and I find my data only in the leaves. Some are better for some scenarios, some are better for others.
It depends on what kind of system you need. And then indexes differ also by, as I've said, geometry. You can have spheres, you can have cubes in multidimensions, hyperspheres, hypercubes. The first version of multidimensional trees we are going to speak about,
and it's of vital importance for data warehouses, are the R-trees. I suppose some of you have already heard about R-trees in the Multimedia Databases lecture, right? Who has heard about R-trees?
Besides you both? Okay, okay, we'll speak about them today. So, the idea of R-trees is that they are able to go up to about ten dimensions, which is enough for small data warehouses,
to represent time, geography, product, and whatever else they would need, and index on these dimensions. Of course, there are variations which are a bit better; they go up to 20 dimensions. What I want to mention is, you might say, well,
why should we bother with indexing only ten dimensions? What should I do if I have a data warehouse with a hundred dimensions? Well, actually these indexes perform pretty well. They have been evaluated, however, on random data,
which is uniformly distributed. And in such cases you can't really cluster much, because it's randomly generated, a uniform distribution. And such a distribution is actually unfair: R-trees cannot benefit from the real nature of the data, where in sales you have,
as I've said, products which are actually similar. So therefore, they are pretty near to each other in the space. So if you have here the ten and twenty dimensions, this shouldn't scare you. This is just a possibility to compare it with the performance of other indexes. But actually, if you would perform the tests on real data and
not randomly generated data, you can easily reach 50 dimensions as well. Good. The index structure: what it basically allows are inserts,
updates, deletes. They are possible because I need to be able to load my data. I'm not going to recreate the index each week when I'm going to load new data in the data warehouse. And the basic structure is formed through the data pages.
Those are the leaf nodes. They point to where the data is located on the secondary storage. The directory pages, those are internal nodes. They help for navigational purposes. I want to orient myself to finding these leaf nodes.
I'm going to use directory pages; I will refer to them as internal nodes, I think it's more intuitive. And then there's the geometry, which I previously spoke about. In the case of R-trees, they use rectangles, or minimum bounding rectangles.
Minimum bounding boxes is another name for them in three-dimensional space, and hypercubes in higher-dimensional spaces. For today, I'll speak about bounding rectangles; we'll show our examples in two dimensions so that it's easier for you to follow. Good, let's take an example of what such an R-tree can look like.
So I have the root node. The root node covers the whole two-dimensional space I have.
Let's consider that I have here the dimension of time and here products.
Then, as I've said, everything is hierarchically structured. The data in space would look something like this. These are my points in space based on their coordinates. On the time and product.
Well, of course, there are more of them everywhere in the space. I did not represent all of them. Now, from the root navigating to the left, I have the first child, which is comprised of other children.
All are represented through their geometry, since they are internal nodes: they have a minimum bounding rectangle. What this means is that this rectangle is not bigger than the minimum it needs to be
to comprise all the points it has to cover. This means that if I have three points to cover, R5 should not look something like this, because I lose precision, I lose space. I don't have anything here; this is dead space.
And if I do some query which involves that space there, I would need to look into R5. This I don't need to do if I use the minimum bounding rectangle: I know for certain that my information is not there if my search does not intersect the minimum bounding rectangle of R5.
Okay, and the leaf nodes have pointers directly to the data on the hard drive which are our records. So practically it is easy to navigate through such a tree. I'm going to go into the root.
I'm going to compare my search with the root and say, okay, where am I searching? Am I searching on the time dimension, on data from the last year? If my search is somewhere here, I can already say: sorry, the data is not stored anywhere.
I'm just reading the root. I'm not going to read all the data from here, from downstairs. And I say, I don't have the data. I'm just doing an access. This is cheap. This is great. This is what we want to obtain with R trees. Good.
The characteristics of R-trees are that they work with clustering. This is why they don't have that great a performance on randomly distributed data. They allow overlapping; we will discuss this, because it leads to some inefficiency problems while searching. They are balanced in height.
This allows for great search performance. And the objects are stored in the leaves. So I need to go through the leaves to reach my object. If somewhere on the way, I already don't find internal nodes which intersect with my search, I can already say, sorry, pal,
the data is not in the database, so what you're searching is not here. This is great for early stops. And the geometry, as I said, is the minimum bounding rectangle. Further conditions on the R tree make it more efficient, namely that it, well, of course, the root has to have at least two children.
If it would have only one children, then that would be the new root. It wouldn't make sense to have a degenerated tree. The fact that internal nodes have to have a minimum and the maximum limit is of great help. Because otherwise, I would put nodes which have just one element.
So they are underpopulated. This means that the tree will be bigger, because then I'll have a lot of nodes with just one object inside or one pointer. And then I would have to read a lot of these blocks to reach, again, my data.
So this lower limit is important for not growing the tree unnecessarily. Of course, we also have an upper limit. The upper limit is established by the size of our blocks on the hard drive. If I want to be optimal, I want that when I perform a read,
I'm not going to read more than one block: I want to read one block per node. This is why I have an upper limit. As a heuristic recommendation, the lower limit is half of the upper limit, so I can ensure the occupancy level.
Each entry in this R-tree is a pair of this form: I have an interval description I and the child pointer. The interval description defines the minimum bounding rectangle. So if I have three dimensions, I will have the interval for
the first dimension, with its start coordinate and its end coordinate; this is one interval.
Then I'll have the interval for the second dimension, with its start and its end, and then the last pair for the third dimension. This is my I information. If I have two axes, what the I actually is,
are these two points that describe a minimum bounding rectangle for me. And the rest is the pointer to the child.
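Put differently, an entry is nothing more than one (start, end) interval per dimension plus a child pointer. A minimal sketch of that representation (illustrative names only, not any particular system's API):

```python
# Minimal sketch of R-tree entries: an MBR is one (low, high) interval per
# dimension, and an internal entry pairs that MBR with a child pointer.
from dataclasses import dataclass, field

@dataclass
class Entry:
    mbr: tuple        # the I information, e.g. ((t_low, t_high), (p_low, p_high))
    child: "Node"     # pointer to the child node described by this MBR

@dataclass
class LeafEntry:
    mbr: tuple        # bounding box of the stored object (for a point, low == high)
    tuple_id: int     # identifies the record on secondary storage

@dataclass
class Node:
    entries: list = field(default_factory=list)
    is_leaf: bool = False
```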
Good, this was the internal node. If we speak about the leaf nodes, the leaf nodes have tuple IDs through which I can identify my records.
What is important to mention is that, as I've said, the tree is self-balanced. So the leaves are on the same level. I don't have any degeneration as I've previously mentioned. So if I would have leaves, then all the leaves are, for
example, on the third level. This is very important for search performance. The basic operations you can do with L-Trees are, as expected, the search, which is most important. But you can also do inserts, updates, deletes. And then there's a new operation, the splitting, which we actually
need in order to fulfill our minimum and maximum limits. I've previously said that a node allows for a minimum of m, small m nodes, has a minimum capacity, and the big M nodes' elements for a maximum capacity.
This leads to splitting when I'm going to insert a node, which then leads to increasing over the m capacity. I need to cut the node in two. We'll discuss about this. Good, let's talk about the search,
which as I've said is the most important operation. As expected, it starts from the root and it performs recursively towards the leaves. At first, one path is selected. It is investigated for intersection with the range query.
If intersection has been found, then we go further into the child of that node. If not, we can prune that path and go on another randomly selected path. Let's see an example so that it's better to understand.
So I have my search query. My search query looks something like this. What does it actually mean? It means that I'm interested in mobile phones. Let's say this is the product dimension which regards mobile phones,
which were sold between April and June. I can describe this in my two dimensional space through S. So I start with S with its coordinates in the two dimensional space.
And I compare against the coordinates of the root. Coordinates of the root are here. I compare for intersection. And of course, I can see that S fully is included in the root.
So I say okay, I have information about the data. I just now have to look at the three children the root has to see where is actually to be located. And I compare it with R1. R1 is described by these coordinates, this interval.
I compare S against these coordinates. And clearly this is just analytical geometry, so to say: with just one formula I can decide whether S is or is not contained in R1. So I can say, okay, S is not there. It doesn't make sense to look into R4, R6, R5.
I can spare myself the effort. So I'm not going to read these three blocks. Or if they are bigger, maybe they represent, as in the data warehouses case, hundreds of blocks or thousands of blocks. So I'm sparing myself a lot.
The same is happening to R3. My search query is not in R3. So I'm going to prune this path also. When I do the test with R2, I can see that S is included in R2. So I have to follow all its children to find out
in which of the children is actually S included. What are my products which I'm interested in? I then have to compare with R7, R8, and R9. I observe it's not in R7. I observe it's not in R9, but it is in R8.
I go to the leaf nodes, and so I can reach my data saved on the hard drive and return it to the user: look, these are the objects you have asked for. If I had done a full search, I would have had to compare with all 12 nodes.
As I said, this is a happy case; I'm presenting a small example here. But what I'm actually pruning here is a lot, because on the same hierarchy level I'm not going to have three nodes in reality, but thousands of nodes. And if I choose two out of thousands, which are hierarchically organized into other thousands of nodes, it's a big deal.
So in order to read and to reach the objects I'm interested in, I have to read the root. I have to compare against R1. I have to compare against R2. I have to compare against R3.
Now, you would say probably, if I've compared against R2, and I've seen that S is included in R2, why do I compare with R3? Why do you think? Isn't this enough to compare with just R2?
And say, okay, I don't care about R3, I've just found out that it's in R2. So why do I still need to ask R3 what its coordinates are, and whether I'm interested in R3 or not? What do you think? Yes, yeah.
What would I do if my search query looked like this? Then I need to go into R2, but for this object here I need to go also into R3. I said that R-trees allow overlapping. This is exactly why I need to check more than just one confirmed inclusion.
And I need to go further. Well, this, of course, increases the search time. It's a disadvantage; we'll speak about it. Because of this, there's no guarantee for good performance. It has the advantages of B-trees,
but it is, however, not great, because due to overlapping I have to make sure that I traverse all candidate paths. I go into R1, I can prune something, I can say, okay, I can cut R1. I can go further to R2. But I can't exclude R3.
I can't say I don't care about R3. This is a big disadvantage in searching. Good. For you to look at home, I've also described a bit the algorithm. So you consider a search rectangle. The search rectangle is compared for inclusion or intersection with the internal nodes,
which help for navigation. And at the end, we reach the leaf nodes where the data is. And we return the leaf nodes which are intersected with our search and return the correct sets. I think it's pretty intuitive.
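A minimal sketch of that search, reusing the illustrative entry layout from the earlier sketch: test for intersection, descend on a hit, prune otherwise.

```python
# Minimal sketch of an R-tree range query: descend only into entries whose
# MBR intersects the search rectangle, prune everything else.
def intersects(a, b):
    # Two boxes intersect iff their intervals overlap in every dimension.
    return all(lo1 <= hi2 and lo2 <= hi1
               for (lo1, hi1), (lo2, hi2) in zip(a, b))

def range_search(node, query_mbr, results):
    for entry in node.entries:
        if not intersects(entry.mbr, query_mbr):
            continue                              # prune this whole subtree
        if node.is_leaf:
            results.append(entry.tuple_id)        # qualifying record on disk
        else:
            range_search(entry.child, query_mbr, results)
    return results
```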
Everybody has understood what happens with the search? Okay, let's come to a bit more difficult problem. We want to insert. This means that we are doing loading. We have a new object, some new product, which was sold yesterday.
I have to do the weekly loading in the data warehouse. So I need to update my index. I'm going to insert the new information in the index. And for this, I follow the rule of B-trees, which says that only leaf nodes are allowed to store pointers to the data.
So I search myself a leaf where I will insert my data. The heuristics behind choosing the leaf is that I need to find the leaf which grows the least to include my new object.
I don't want to create unused space; I don't want to include dead space in my geometrical structure. This is the first problem: heuristics for choosing the leaf. The second is to be careful not to go above M.
If I go above M, I need to perform splitting. It's an overflow, so I need to cut the node in two. And this can lead to cascading splits up to the root. We'll see.
As I've said, there can be a case of overflow, which leads to dividing or splitting the nodes. The goal of splitting is again described by heuristics: my goal is to minimize the overlap.
As we've seen, if I minimize the overlap, I optimize the search. I'm not going into more branches, so I limit myself to going through one branch. Minimizing overlap, great idea. The second heuristic, I must stick to my minimum expansion,
also with the splits, also with the overflow, and not include more space than what I exactly need. Otherwise, I might find intersection where it actually isn't and read blocks on the hard drive, which I shouldn't.
Good. Of course, also the root can be touched by division through this cascading node split. And in this case, of course, also the root can be split into different nodes. Let's see an example of how insertion can actually go.
We have the previous example, we want to insert a new product, which was sold at a certain day. There are different possibilities. I can say I'm going to expand R7 as a minimum bounding rectangle to include my new node,
or I'm going to expand R9, produce some overlapping, and include the new node in R9. Of course, you can't get both right:
you should minimize the expansion space, but you should also avoid the overlapping as much as possible. In such cases, you just have to apply heuristics, decide how your clusters look, and say, okay, I have two choices. Either I take more storage space into consideration
and allow overlapping and use some advanced version of R-trees to keep the query performance great, or I say, okay, I produce some dead space here,
because the queries that I have on these dimensions are pretty big anyway, so this doesn't bother me. It depends on the data. Good. So as I've said, the possibility is to grow the node
which needs the smallest volume increase. This argues for keeping the minimum bounding rectangles as small as possible, and keeping dead space as small as possible.
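A minimal sketch of that least-enlargement choice (a Guttman-style heuristic; the MBR representation and names are the illustrative ones from before):

```python
# Minimal sketch of the least-enlargement heuristic: pick the child whose MBR
# has to grow the least to cover the new object; break ties by the smaller
# resulting area, i.e. prefer less dead space.
def area(mbr):
    result = 1.0
    for low, high in mbr:
        result *= (high - low)
    return result

def enlarged(mbr, new_mbr):
    # Smallest box covering both the existing MBR and the new object.
    return tuple((min(lo1, lo2), max(hi1, hi2))
                 for (lo1, hi1), (lo2, hi2) in zip(mbr, new_mbr))

def choose_subtree(entries, new_mbr):
    def cost(entry):
        grown = enlarged(entry.mbr, new_mbr)
        return (area(grown) - area(entry.mbr),    # primary: least enlargement
                area(entry.mbr))                  # tie-break: smaller rectangle
    return min(entries, key=cost)
```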
This is a very nice approach, which is actually also used in practice. Okay, what do we do if we have overflow? Let's consider that we have the upper limit of each node set to three,
and R7 already contains three elements. I'm going to make R7 bigger, because for some reason it is more advantageous for me,
but now I'll have a fourth node, a fourth child, which goes beyond my limit. This is the case of overflow, where I need to split R7. Splitting R7 is again a heuristic problem,
because I can perform good splits or bad splits. Again, for each of the splits I have to use my previous heuristics:
keep the covered area small and the overlapping again as small as possible. Let's consider this case, where I have a root. This is my root, and I've just inserted this node here.
Again, my M is three. Now I have four nodes. How do I split? What do I do? Well, there are a thousand possibilities. I can say that I do a split like this, and now the M condition is respected,
because I have two nodes here, two nodes here, two children here. It's great. Only that, if I have a query which looks something like this,
what happens? I have to read the root, and I see intersection with the root. Then I have to read R1, I see intersection with R1.
Then I have to read R2, I don't see intersection with R2, so I can prune these nodes here. However, because I've expanded R1 above, so I've taken the wrong choice to expand the minimum bounding rectangle to a big area and did not consider the heuristics,
I now observe that I'm actually in the dead space. There is nothing here; all the interesting information would have been here. So I've just performed reading operations which actually don't make sense. If, however, I had allowed overlapping
but had expanded only to the smallest area, then I have my R1 here and R2 here. I read the root, I do one compare, then I compare with R1 and R2,
but I go into neither of them, because I observe that it actually intersects neither R1 nor R2. It's dead space, this one I don't consider. So this one would have been the better split.
Good. Deciding how to choose your splits is actually not that simple. There are a lot of possibilities, and you can't take the classical approach of checking all the splits; it's impossible.
You might do it for three objects, you might do it for ten objects, but if you have to do it in a space with a lot of dimensions, then you have to calculate volumes and perform all the possible splits, all the combinations. Bad idea. There are two heuristic approaches, one with quadratic cost and one with linear cost.
And this is exactly what we will discuss next. Okay. So the idea of the quadratic-cost solution to the overflow problem is that I don't consider the hierarchy anymore. I have to reassign my minimum bounding rectangles by treating the objects individually and choosing starting points.
I have to reassign my maximum bounding rectangles by choosing the objects as originally and by choosing starting points. Since it's quadratic, probably you also have considered
that the possibility would be to choose a node, compare it with all of them and so on. This is not the best possibility. It costs a lot of time. Again, considering volume increase and calculating this for all the combinations, not a solution.
After choosing two starting points, more or less random starting points, and calculating just for these two the smallest increase in volume, I can assign the remaining objects to these two starting points. So I can say, okay, I've chosen these two.
I see they are far from each other and then I reassign the points in space. I say, okay, I want to insert this point here. This is closer to this one as opposed to this one. So my new node will be this one. Then I want to insert, I don't know, maybe this one and so on.
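The slide describes this with two chosen starting points; the classical quadratic heuristic from the original R-tree paper picks as seeds the pair of entries that would waste the most space together. A minimal sketch of that seed picking (illustrative only):

```python
# Minimal sketch of quadratic seed picking: take as seeds the pair of entries
# whose combined bounding box would waste the most dead space; those two
# clearly should not stay in the same node.
from itertools import combinations

def area(mbr):
    result = 1.0
    for low, high in mbr:
        result *= (high - low)
    return result

def cover(a, b):
    return tuple((min(lo1, lo2), max(hi1, hi2))
                 for (lo1, hi1), (lo2, hi2) in zip(a, b))

def pick_seeds_quadratic(mbrs):
    def wasted_space(pair):
        a, b = pair
        return area(cover(a, b)) - area(a) - area(b)
    return max(combinations(mbrs, 2), key=wasted_space)   # O(M^2) pairs

# The remaining entries are then assigned one by one to whichever seed group
# needs the smaller enlargement, just like in the insertion heuristic above.
```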
Good. The other possibility would be to solve the overflow problem with linear cost. Of course, this is not as good as the quadratic-cost one; the other one is more precise. However, this is a good approximation of it with much cheaper calculation times.
The idea here is that I again need to establish first what my starting rectangles are. But I'm not going to test all candidate rectangles to start with, as in the quadratic-cost case. Instead, I'm going to go straight to the rectangles which have this property:
The rectangle which has the highest minimum and the rectangle which has the smallest maximum. I then determine the coordinates of these two, establish the distance between these two,
and set them as my starting point. Of course, this has to be done in a multi-dimensional space. I've done it in two dimensions and so that it is clear. Let's take this example.
The highest minimum, I've said. Let's go along the x and y directions. In the x direction, I look for the highest minimum. Well, a has the lowest minimum,
it's somewhere here. b has its minimum somewhere here, c here, d somewhere here. e has the highest minimum on x; it's this coordinate here. Again on x, I then look for the smallest maximum.
Intuitively, this is the one from a. b is somewhere here, c is here, e is the highest, d is here. So then I take this coordinate and I calculate the distance between these two. The distance is 5; the whole extent is 14.
In order to compare it with the distance on the y axis, I just need to normalize it; otherwise I wouldn't be able to compare apples with oranges. The same has to be performed on the y axis. So then, again, the highest minimum is in this case by d,
and the lowest maximum is in this case by c. These points here are at a distance of 8 out of 13.
Again, I perform normalization and choose as the best split the one with the greater normalized distance. In this case, that is c and d on the y axis.
They are my new parent nodes, and the rest I try to insert. For example, for b I will have to calculate the distance from b to c and to d, and establish which of c and d would grow more if it had to include b.
So probably c would have to grow something like this to include b. So you can see all this space here is my cost in order to cover b also.
The same thing I need to calculate for d; it would probably look something like this. Bad idea to draw it with the same color; I don't have any other colors. Again,
the cost would be this one. And then I need to compare the two costs and say, okay, this is my decision: I'm going to expand, for example, c, because it's not as expensive as expanding d. These are heuristics which can solve this overflow problem just by choosing the starting
points optimally, or at least as well as possible. Again, the idea is to have the largest distance between the two starting points, and then to pay the minimal effort to include the rest of the nodes. This means I'm going to expand as little as possible in order to include something new in the existing structure.
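A minimal sketch of this linear seed picking (same illustrative MBR representation as before, one (low, high) interval per dimension):

```python
# Minimal sketch of linear seed picking: per dimension, take the entry with
# the highest low side and the entry with the lowest high side, normalize
# their separation by the total width of that dimension, and pick the
# dimension with the greatest normalized separation; those two are the seeds.
def pick_seeds_linear(mbrs, num_dims):
    best = None
    for d in range(num_dims):
        highest_low = max(mbrs, key=lambda m: m[d][0])
        lowest_high = min(mbrs, key=lambda m: m[d][1])
        width = max(m[d][1] for m in mbrs) - min(m[d][0] for m in mbrs)
        separation = (highest_low[d][0] - lowest_high[d][1]) / width if width else 0.0
        if best is None or separation > best[0]:
            best = (separation, highest_low, lowest_high)
    return best[1], best[2]           # the two seed rectangles
```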
Good, so we've discussed the insertion. What about the deletion? With deletion, we have exactly the opposite problem.
Again, we have to delete in the leaf node. But we have to be careful to consider the case where we produce underpopulated nodes. In this case, we do condensing. So we unite two nodes.
And this is actually a combination of what we've done previously. So we actually delete the two nodes and try to reinsert the objects with the heuristics described above. So I'm trying again not to create too much overlapping, and I'm trying not to produce too much dead space by expanding too much,
not too much when I could have found another one to expand a bit less. If the root remains with one child (I've said that R-trees don't allow a root with only one child), the child becomes the new root.
It's only normal to delete the old root, because it doesn't bring any more information. Good, let's also perform a delete. We had in R9 two objects in this minimum bounding rectangle.
Our m is two. We delete this one here. So R9 will remain with just one object, which doesn't fulfill this condition.
So what do I do? I've just said it: I delete the whole R9, keep the one which didn't get deleted by the user, and say, okay, I need to insert this again.
So I transform the deletion into an insertion operation. And then I ask, where should this now go? Either into R7 or into R8. I apply the heuristics and decide to expand, for example, R8. So deletion is a clear case of transforming it into an insertion.
Good, if a record is updated, again I can make use of the previously introduced operations: I can perform a delete of the node which is to be updated and then reinsert it. That is the best way to keep the structure of the R-tree optimal.
Otherwise, each time I would perform an update, I would have to condense to minimize the space of a minimum bounding rectangle and then decide which to grow. Actually, it's easier to delete and then reinsert.
What's the cost? Well, the cost of traversing the R-tree, as I've stated, depends on the overlap and on the dead space. Let's say that I'm searching for these products right here.
I'm going to compare against the root. I decide that it overlaps with A, since these are the coordinates of A. I decide it overlaps with B, since these are the coordinates of B.
And of course, it doesn't overlap with C; as I said, I compare this with the coordinates of C, so I can safely prune it. I have overlapping between A and B. So if I choose randomly and pick B as the path to follow,
I would come here and I would observe that actually, look, in B everything is dead. Actually, G is what I would have been interested in. So in B, I can't go into any other child, because H, I, J and K are not intersecting my search space.
So then I have to go back to A. And for this search, due to the overlap between A and B on this level in the hierarchy, I read one, two, three, four, and then I read eight more,
which I actually shouldn't have to do if I knew that A and B don't overlap. So why are R-trees inefficient? They allow overlapping between neighboring minimum bounding rectangles.
The natural evolution for such a problem is not to allow overlapping. And this is exactly what R+ trees do. They say: I don't care how the nodes look, I don't want any overlapping.
MBRs on the same level should not overlap. Of course, this leads to storing the same data multiple times in the leaves; we'll see an example of that. However, it is always a
trade-off: I store identical information several times in the leaves, but I win performance while searching. And this is exactly the difference between R+ trees and R-trees. On the same example that I've previously given,
if I didn't allow overlapping between A and B, then I would need another child here, let's call it P. For example, A, P and B, as well as C. None of them overlap, but G is found both in P,
because part of it is a child of P, and in A, because part of it is a child of A. So because overlaps are not permitted, G has to be copied as a son of both A and P,
resulting in higher storage space. So what does this actually bring? Great search performance. It actually leads to a 50% improvement in search performance compared to R-trees.
The drawback is that when I perform splits, the nodes can become underoccupied through the many splits, because I'm not able to optimize the splits anymore: I'm not able to allow overlapping, I have to keep them separated. So I might do suboptimal splits in order to fulfill this condition.
Therefore, low occupancy in the nodes; I have to introduce P to split B and A, for example. This leads to a higher degree of degeneration. What that means is that if I have my data warehouse, and I have an R+ tree index on it,
once every two months I need to take one night to take the warehouse offline and say: this night I will reconstruct the R+ tree. I will delete the old index, because it's fully degenerated and its performance is not that good anymore, and reconstruct the whole index.
This is actually also done in practical applications. Of course, there are more variations of R-trees: R*-trees and X-trees, for example, and some also use other geometries.
There are S-trees which use spheres, and so on. The advantage of R*-trees is that the algorithm has some improvements. And the X-trees, for example, this is quite interesting:
X-trees say, well, R+ trees have this problem of degenerating, and maybe when they degenerate too far, I can't allow myself to rebuild the index right now. I have a special situation, I don't know, it's the holidays and we have a lot of sales, and I want to do some analysis to know what I should actually stock for the holidays.
So I use X-trees, which say: this degenerated stuff I just pack into one node and I'll do sequential search on it anyway. So I divide the data which has to be searched sequentially from the rest, which still has a structure, until I can rebuild my trees.
So I practically treat objects a bit differently: I say this stuff has to be reconstructed, I'll treat it as sequential, and the rest keeps its structure. Well, with these kinds of improvements, both X-trees and R*-trees go up to 20 dimensions even on randomly generated data.
And probably before going further, we should take a bit of a break. What do you say, about 10 minutes, until 20 past 4? Great.
OK, so let's continue. We've spoken about R-trees, and before starting with them I said that there are also B-trees and variations of them, like B+ trees, B*-trees and so on, and that they are not so good for data warehouses because they consider only one dimension. So they are great for OLTP, but not so great for data warehousing. Well, there is still a possibility to use B-trees, and this is due to the fact that in the literature people have introduced a method to transform multidimensional data into one dimension. This is called the Z-curve. And by combining the Z-curve with the old B-trees, one obtains Universal B-trees, UB-trees, which are able to index multidimensional data as well.
You should just regard this Z-curve as what happens in the linearization for multidimensional physical storage, the array-based storage, where we were also performing a linearization. The same happens with the Z-curve; it is just a bit smarter, because it considers that data which is clustered together in space is similar and should be stored near each other, in the same block. This can be pictured in our multidimensional space as regions like this one. For example, this is a block which I store on the hard drive, and the Z-curve which goes through my data looks, as expected, exactly like a Z, and so on. What do I observe? When going through my data stored in a certain block, the data which is clustered together, and which is therefore similar and interesting for me to read just once and scan through completely, can be read off directly from this Z-curve by starting here and following it continuously until this point here. So the Z-curve does not traverse areas which are not of interest to us, and it does not mix up data when we consider clustering that is aware of all dimensions.
This Z-curve can then have keys like 1, 2, 3, 4 and so on, and these keys will be used in B-trees and their variations to store the data as one-dimensional. And then again we have the nodes as in the B-trees, with a pointer to the left, a pointer to the right and a value here. Well, if I put a 1 here it's pretty dumb, because then I can't use the left pointer, so I have here 1, and 2, and 3, and so on. And then I have the navigational part here and the data part here: the usual B-tree, with the advantage that I have transformed the data from multidimensional to one-dimensional through the Z-curve. As I've said, the second element of UB-trees is the Z-regions, which are what actually gives the efficiency when processing range queries over more than one dimension. Think of products which, if you want, are sold in consecutive months — products, okay, let's put the product dimension here — and which belong to the same category, like cell phones. Right? So I want to have them near each other. You would probably find it strange that 17 lies so far away on the Z-curve. Actually it's not far, if you consider the coordinates on the two axes. You can see that they are near each other on the product axis, so they should be clustered from the product dimension's point of view into a nearby area, even if the time, for example, differs: this one was sold in this month, that one was sold in that month. So okay, they are one step apart in time, but they are close on the product dimension.
This is the kind of property that Z-curves can fulfill. Okay, what are the Z-regions? These nicely colored regions that I've intuitively drawn here are our Z-regions. Each one is covered by an interval on the Z-curve. Again, the interval goes from here up to here, and this interval can of course be defined by two thresholds, A and B. So a Z-region can be described on the Z-curve by two coordinates, a start and an end. The secret behind UB-trees is that each Z-region maps exactly onto one block of secondary storage, so that when I read products which are near each other, similar to each other, I just read one block. With most hard drives this costs me around 8 milliseconds — seek time plus read time.
If I were to do it as before, I would need to read this part, this block here, for the first part: so one block read here, two block reads here, three block reads here, and so on. With UB-trees I would have done just one block read. Good. Where does the advantage come from? How do I get to the coordinates? The secret is to use numeric bases and to transform everything into binary.
For example, if I want to get to this element here, 51, I just take its projection on the x and y axes, the two dimensions that I have: I see it is the fifth element on the y axis and the fourth on the x axis. I then translate this into binary, and this is a property of the Z shape in the Z-curve: I can calculate my Z-value just by interleaving the two dimensions. So I take a one from here, then a one from here, a zero from here, then a zero from here, a one from here, then a zero from here. If you convert this to decimal, it is 50. Well, I've started labelling from one, so if the labels had started from zero, this cell would have been exactly 50; otherwise, as usual, it's index plus one. This is the property of the Z-value, and the same happens here.
For example, the cell labelled 16 — which is actually 15, that is 16 minus one, in binary — gives me after de-interleaving exactly three and three. And when I do bit shifting on that value in binary, the next value is 17. Do you understand what the advantage is here? What do I gain by bit interleaving?
I construct my Z-curve just by doing shifts, shifts of numbers derived from their coordinates. I take the coordinates of the points, of the objects in my space, on these dimensions: I project them on one dimension, project them on the other, transform the coordinates into binary, interleave them, and find out where they are situated on the Z-curve. This is great, I just do shifts. You can also calculate it, as I've said, through the interleaving, and you see exactly how you can move through this space.
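A minimal Python sketch of this interleaving (not from the slides; the function name is mine). It assumes 3-bit coordinates, that the y bit is taken first, and that the values 4 and 5 from the spoken example are used directly, which is what reproduces the 50/51 label:

def z_value(x, y, bits=3):
    # Interleave the bits of y and x, most significant first,
    # to obtain the position on the Z-curve.
    z = 0
    for i in range(bits - 1, -1, -1):
        z = (z << 1) | ((y >> i) & 1)
        z = (z << 1) | ((x >> i) & 1)
    return z

print(z_value(4, 5))  # 50 -- the cell labelled 51, since labels start at 1
print(z_value(3, 3))  # 15 -- the cell labelled 16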
Well, so what are the advantages of the Z-value? We reduce dimensionality, we can use the values we get from the Z-curve as keys in B-trees, and by using a variation of the B-trees, the B*-trees, I can again use fill degrees, like the small m and the big M in the R-trees, of 50 percent, resulting in a guaranteed number of node accesses to locate my key — logarithmic in the size of the data. For this nice formula here, I suggest you look at the Relational Databases II lecture. It explains how to calculate the maximum number of node accesses in balanced trees. It is logarithmic, and it depends, of course, on how many records you have and on how big your nodes are — how much information you can store in one block on the hard drive.
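Just as a hedged reminder of the shape of that bound (the exact formula is on the Relational Databases II slides): for a balanced tree holding n keys, in which every node except the root has at least half of the maximum number M of children, the number of node accesses h for a point lookup is roughly bounded by

h \le 1 + \log_{\lceil M/2 \rceil} \frac{n+1}{2}

so it grows only logarithmically with the number of records and shrinks as the block size, and with it M, grows.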
So, if we have these properties for the Z-values, then we can put them to great use for range queries, which are exactly what we are interested in for data warehouses. So, how does it look? We have again our range query, this one here, given by two coordinates, as usual. The algorithm first transforms these coordinates into Z-addresses. It's simple: it takes the x and y projections of the start corner of the rectangle, and the same for the end corner, and says: aha, okay, here is four, here is three. I convert that to binary, I perform the interleaving, and then I say: okay, on the Z-curve this starts exactly here, at 15. Where do I stop on the Z-curve? I stop at whatever that was — probably 58, as it looks like — exactly here.
So then, just based on the Z curve, I already know where to start looking on the hard drive.
And I say: I'm going to read from the hard drive, starting from this Z-region here, because I'm interested in 15. So I will read the whole green block, I will return this piece of information, and I will return this piece of information. For the next element, which is over here, I have to jump on the Z-curve — we'll come to jumps in a second. So, first I identify 15 and 16 as fulfilling the predicate on the Z-curve, just by knowing the start point and the Z-regions. And then I have to somehow calculate that the next jump point is 27, without reading any more blocks from the hard drive. There is a catch here with the jump point: it has to be fast to compute. If I lose time computing the jump point, this isn't efficient anymore. So I have to make sure that the jump point can be calculated quickly from the Z-region, so that it doesn't end up costing more than simply reading everything from the hard drive.
Good. We'll get to the jump point, but the algorithm simply repeats these two steps: seek on the Z-curve and test the predicate of your search, whether it is true or false, using only the Z-addresses and the Z-curve, without reading from the hard drive; and only when the predicate is true, actually access the hard drive and read. As I've said, calculating the jump point is the critical part: if it takes too long, I no longer have any advantage over reading everything from secondary storage. How do I calculate the jump point? Again with coordinates and bit operations: I perform the interleaving and I see where the next start point is that fulfills my predicate on the Z-curve. Oops — I thought I had an example; I don't, but this is the whole idea. I just perform the interleaving, move further along the Z-curve, and never read a region which doesn't fulfill the predicate. I move along the Z-curve with bit shifting, following the Z shape, and that's it.
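What can be shown compactly is the predicate test itself: de-interleave a Z-value back into its coordinates and check them against the query rectangle, all without touching the disk. A small Python sketch, reusing the bit order assumed above; the efficient computation of the next jump point itself (often called BIGMIN in the UB-tree literature) is deliberately left out:

def coordinates(z, bits=3):
    # Inverse of the interleaving: split a Z-value back into (x, y).
    x = y = 0
    for i in range(bits - 1, -1, -1):
        y = (y << 1) | ((z >> (2 * i + 1)) & 1)
        x = (x << 1) | ((z >> (2 * i)) & 1)
    return x, y

def fulfills_predicate(z, x_lo, x_hi, y_lo, y_hi, bits=3):
    # True when the Z-value lies inside the query rectangle --
    # decided purely from the Z-value, no disk access needed.
    x, y = coordinates(z, bits)
    return x_lo <= x <= x_hi and y_lo <= y <= y_hi

print(coordinates(50))                      # (4, 5)
print(fulfills_predicate(50, 3, 5, 2, 5))   # True for a box covering x in 3..5, y in 2..5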
It's pretty easy. What I would insist on for the homework, so that you understand these UB-trees better, is that you see how you can actually move on the Z-curve and how you can calculate positions through interleaving, de-interleaving and shifting — so, bit operations on such a two-dimensional space. Good. And so we reach the second part of our lecture, the bitmap indexes. I've mentioned in the introduction that bitmap indexes are practically only seen in data warehouses. They are not multidimensional indexes, but they are very helpful for indexing a certain type of data — like, for example, gender: male or female. If I have attributes whose values come from a small set with two or three possible values, bitmap indexes are great for indexing something like that. However, they are not used in OLTP, in normal transactional systems, because updating and inserting is costly, so it is not a great idea to do it very often.
So this is why they are mostly used in data warehouses only. Let's see how they look. Consider a simple sales relation with an ID, a shop and the sum of sales. So we have the shop Saturn with a purchase sum of 150 and so on, then again Saturn, Real three times, and Peek & Cloppenburg — I think that is what this stands for. I can index this with bitmap indexes as follows. I build an index over the value and the record number, so over the shop name and the position. And then I have the value Peek & Cloppenburg, which is the third of my six records. So the cardinality of my records is six, and among them this record is the third, so I just put a one there. Easy, right? I do the same with Real: Real is the second, the fourth and the last. And then I do the same for Saturn, which is the first and the fifth. Has everybody followed so far? Pretty easy. Good. Let's describe this a bit more formally.
So what actually is a bitmap index? It's a collection of bit vectors. What we've just seen here are called bit vectors: bit vector one, two and three. Each value is described by one bit vector, so the number of bit vectors equals the number of distinct values I have. I had three bit vectors, which means I had three distinct shops: Real, Saturn and Peek & Cloppenburg. Had I had one more — Lidl or something like that — I would have had four bit vectors. So I have the number of distinct values. Then I have the cardinality of the relation, and this corresponds to the length of each bit vector: the number of records in my relation is the length of each vector, because I need to put a one wherever the value holds for the record and a zero where it doesn't. So each vector covers the whole table. Again: the bit vector for a value has a one at the position of every record in the table that carries this value, and a zero otherwise. So if Saturn were the third record, the bit vector for Saturn would have a one at the third position. Pretty intuitive. What this actually gives us is a mapping between the record numbers and the record addresses.
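As a small Python sketch of this definition — the concrete row order is read off the bit vectors on the slide, so treat it as an assumption:

def build_bitmap_index(values):
    # One bit vector per distinct value; position i is set to 1 when
    # the i-th record carries that value, 0 otherwise.
    n = len(values)
    index = {}
    for i, v in enumerate(values):
        index.setdefault(v, [0] * n)[i] = 1
    return index

shops = ["Saturn", "Real", "Peek & Cloppenburg", "Real", "Saturn", "Real"]
for shop, vector in build_bitmap_index(shops).items():
    print(shop, vector)
# Saturn [1, 0, 0, 0, 1, 0]
# Real [0, 1, 0, 1, 0, 1]
# Peek & Cloppenburg [0, 0, 1, 0, 0, 0]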
A very important requirement is that these record numbers must not be modified. Even if I delete something, I still have to keep the numbers fixed; otherwise I would have to destroy my whole index and rebuild it. A similar operation in databases is logical deletion: when you perform a logical deletion, you delete something from the database just by marking it as dead, but you do not rebuild the primary index, you just continue with it. Good. So let's assume that the cardinality of the table is n, and that attribute A — one attribute, say the shop — has m distinct values in the table. Then the size of the bitmap index for attribute A is this product here.
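Written out — with n the cardinality of the table and m the number of distinct values of attribute A — the uncompressed size is

\text{size}(A) = n \cdot m \ \text{bits}

so for the running example, 6 records times 3 distinct shops gives 18 bits.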
So we can calculate how big our bitmap indexes will be. A significant number of zeros appears when I have a lot of distinct values: if I have five shops among six records, I have a lot of zeros, right? It's only logical. But if I have two shops among six records, I'll have a lot of ones, because I only have two distinct values and they repeat themselves. This is exactly what is exploited when you have set-valued attributes like gender, female or male: they repeat a lot. Record one — male or female? Female. Record two, male. Record three, again female. So m is very small and I have a lot of ones, and this is great for encoding the information. This is where the power comes from: I have the opportunity to compress with run-length encoding or whatever algorithm lets me compress binary data based on long runs of zeros or ones, continuous zeros or ones. There is Huffman encoding, there are a lot of possibilities. Oracle, actually, uses its own compression algorithm.
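A toy run-length encoder in Python, only to illustrate why long runs compress well; the schemes actually shipped by the vendors (Oracle's own byte-aligned code, word-aligned codes in other systems) are more elaborate than this sketch:

def run_length_encode(bits):
    # Collapse consecutive equal bits into (bit, run length) pairs.
    runs = []
    for b in bits:
        if runs and runs[-1][0] == b:
            runs[-1][1] += 1
        else:
            runs.append([b, 1])
    return runs

print(run_length_encode([0, 0, 0, 0, 1, 0, 0, 0]))  # [[0, 4], [1, 1], [0, 3]]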
Bitmap indexes are implemented in Oracle, in DB2 and in all the other solutions for data warehousing. What do we do if we want to modify something? As I said, the most important rule is that record numbers must not be changed; otherwise I have to destroy my whole index. Let's assume, for example, that I want to delete this record. I just mark it as deleted, logically deleted. I can also physically delete it from my database, but the numbering stays at six — it doesn't become five, nothing is renumbered. Renumbering would be bad, because then I would have to shift everything after it by one position, and you can imagine what that means for a realistic number of records: a lot of unnecessary work, so it makes no sense. We just keep position five and say: okay, Saturn doesn't have a one in its bit vector at position five anymore, but a zero. This is the only secret when it comes to handling modifications in bitmap indexes.
Now I want to insert something. If I want to insert, it's pretty intuitive: I insert something into the database; this is how my bitmap index looked before and this is how it looks afterwards. What happens is that I add a new bit vector, because the value is distinct from what I already have, and I note that it sits at the seventh position in my table. What I still need to do, however, is add a new position to all my bit vectors, to reflect the new cardinality. I've said that each bit vector is exactly as long as the cardinality of the table, and as my cardinality is now seven, each bit vector now has seven elements. You can already see that this can be costly, and this is why bitmap indexes are not used in OLTP. In a data warehouse this doesn't happen on a daily basis, but only when I'm doing loading, for example. Otherwise I would be modifying the bit vectors all day long while sales come in, which makes no sense. Good. What happens if I update something which is already there?
For example, I change the shop name of — I think that was Real — to Rewe. When I do this, I just set the Real vector to zero at position two, to say it doesn't appear there anymore. Pretty intuitive. And I insert Rewe as a new value. The cardinality is not changed; it stays at six, because an update doesn't modify the cardinality.
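A minimal Python sketch of these three maintenance operations on the toy index built earlier (helper names are mine, positions are 0-based); the key point is that record numbers are never reused or shifted:

def logical_delete(index, position):
    # Logical deletion: the record number stays, we only clear
    # the record's bit in every vector.
    for vector in index.values():
        vector[position] = 0

def insert(index, value):
    # Insertion: every existing vector grows by one position to match
    # the new cardinality; a new vector appears for an unseen value.
    for vector in index.values():
        vector.append(0)
    length = len(next(iter(index.values()))) if index else 1
    index.setdefault(value, [0] * length)[-1] = 1

def update(index, position, old_value, new_value):
    # Update: clear the bit in the old value's vector, set it in the
    # new value's vector; the cardinality does not change.
    index[old_value][position] = 0
    index.setdefault(new_value, [0] * len(index[old_value]))[position] = 1

So, how do we perform selection?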
The great advantage here, just as with the Z-regions and the Z-curve, is that I can do it with bit operations. Our hardware is optimized for bit operations anyway; in assembler these are really fast. So I'm going to use AND and OR on the bits to do selections. For example, if I want to find out how much I've spent in these two shops, I take from the bitmap index the bit vector of Saturn and the bit vector of Peek & Cloppenburg and say: okay, I'm interested in the OR of the two, because I want to sum what I've spent here with what I've spent there. And then I go through it: I've spent something at Saturn, not at Peek & Cloppenburg, so I'm still interested in this record; not interested here; interested here and here. And this is how I decide that I want to return the first, the third and the fifth record. Pretty intuitive, right? So: one, two, three, four, five, six — I just return the positions with a one.
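A sketch of such a selection in Python, ORing two bit vectors of the toy index and returning the 1-based record numbers that end up with a one:

def select_any_of(index, values):
    # OR the bit vectors of the requested values and report the
    # positions (1-based, as in the lecture) that are set.
    n = len(next(iter(index.values())))
    result = [0] * n
    for v in values:
        for i, bit in enumerate(index[v]):
            result[i] |= bit
    return [i + 1 for i, bit in enumerate(result) if bit]

index = {
    "Saturn": [1, 0, 0, 0, 1, 0],
    "Real": [0, 1, 0, 1, 0, 1],
    "Peek & Cloppenburg": [0, 0, 1, 0, 0, 0],
}
print(select_any_of(index, ["Saturn", "Peek & Cloppenburg"]))  # [1, 3, 5]

Packing the vectors into machine words instead of Python lists is what turns this into a handful of hardware AND/OR instructions in a real system.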
Easy. Has everybody understood how to perform selection? This works great when the selectivity is high, so when I have a lot of ones in the vectors. If I were to do this on a primary key, each vector would contain a single one and otherwise only zeros — one, zero, zero, zero, zero — because that is how a primary key looks, and that's bad; it makes no sense. So I have to make sure that I build bitmap indexes on dimensions whose values repeat a lot; it only makes sense to use them there. Like, for example, gender: male, female. Good. So the greatest advantage is that the operations are directly supported by hardware, so they are cheap. The disadvantage shows up when I have a lot of values which don't repeat themselves — like, for example, on the time dimension. Say I'm interested in the birthday of my clients and I want to index it with bitmap indexes: I end up with 365 vectors, one for each day, each with a one probably somewhere on the diagonal or something like that. This is bad. However, the time dimension is sometimes exactly what I want to index with bitmap indexes, and if I want to do that, there is a variation for it, called multi-component bitmap indexes.
Another problem arises if I want to query large ranges of the data. I don't want to read each of the vectors — for Saturn, for Real and for Rewe — and go through them one by one, as if I had no index at all. So if I want to solve a range query, I can do it with range-encoded bitmap indexes. So we have two variations, and we'll talk about them right now. The multi-component bitmap indexes, as I said, are great for indexing the time dimension. Again, this uses the principle of transforming the number into a different base.
Pretty easy. I have a month attribute. It varies between 0 and 11, January up to December. And this attribute can be transformed with a simple function like this.
This is called an encoding with the bases <3, 4>, where I ask: how do I cover 12 values, how can I range from 0 to 11? I can do it by multiplying 4 with a component y, which ranges from 0 to 2, and then adding a component z, which ranges from 0 to 3. So the first value would be a y of 0 and a z of 0: y = 0, z = 0 is January, because I multiply 4 by 0 and then add 0. January. Easy. If I'm going for February, z is 1. If I'm going for March, z is 2. If I'm going for April, the fourth month, y stays 0 and z becomes 3. Easy. So then I can encode everything, from such long bit vectors — here for the fifth month — down to vectors like these, where I simply apply my nice formula and say: I'm interested in 4 multiplied by this value for y, plus this value for z. These values are taken from the definitions of y and z. So what does A(1, 1) mean? It means that I use a y of 1 — remember, y varies from 0 to 2, so its possible values are 0, 1 and 2 — and for z again a 1. I plug them into the formula and come out with my month. Is it clear how to calculate something like this? Please ask if you're not sure or have doubts or questions.
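A small Python sketch of this <3, 4> encoding, assuming 0-based month indexes (January = 0) as on the slide; the function names are mine:

def encode_month(month):
    # month = 4*y + z with y in 0..2 and z in 0..3.
    assert 0 <= month <= 11
    return divmod(month, 4)      # returns (y, z)

def decode_month(y, z):
    return 4 * y + z

print(encode_month(0))       # (0, 0) -> January
print(encode_month(3))       # (0, 3) -> April
print(decode_month(1, 1))    # 5 -- the A(1, 1) example

A point query then touches exactly one bit vector per component, which is where the two read operations mentioned just below come from.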
Everybody with me? Questions? Nobody? Okay.
The advantage of using such multi-component bitmap indexes is that if I have to index 100 days, I'm not going to store 100 bit vectors; I transform it into a base of 10 by 10, so I store 20. I have already saved a lot of search time, right? I have 20 vectors to check instead of 100 — a great improvement. A read access for a point query now needs two read operations, because I need both components to make sure which day it actually is. I'll come back to this with the range-encoded bitmap indexes, because for a range query the advantage of plain bitmap indexes is not there anymore.
I want, for example, to list the persons born between March and August. So what do I naturally do? I look at the month vectors and look for the ones that have a one between March and August. So I read the first one, and so on. With plain bitmap indexes the idea would be: read this vector and say, well, nobody here; read this one and say, aha, this person has to be returned; read this one, nobody here; this one has to be returned; nobody here, nobody here. I've read one, two, three, four, five, six bit vectors for a single range query — not great. So someone had another idea: why not do it differently? Why not set the bits to one in all vectors for months greater than or equal to the given value?
For example, if this person, person one, was born in June, then from June onwards I no longer write a zero but a one, and fill the rest with ones. And you'd ask: what does this bring me? How does it help? It's easy. Look, I run the same query on a bitmap index transformed this way — it is optimized for range queries. So I have exactly this, with ones from the birth month up to the end. For example, this person here was born in January, so everything is filled with ones, yeah? Then I interpret my query for persons born between March and August as: persons who did not yet exist in February, but who existed in August.
What does this bring me? I read February and say: I'm interested in the persons who were not yet born — this one, this one; this one not, because he was already born; this one and this one. And then I say: I'm also interested in the persons who were already born in August. This one was not yet born in February but was born by August, so I should return it, because it was born between March and August. Do you understand? A pretty simple principle. The same here: was he born by August? I don't care that he was born exactly in April, and I don't have to read that; I just see that by August he was already born, so I return him. I look at this one and say: uh-oh, he was born before March, because in February he was already there, yeah? So I don't have to check anything else for him. For the next one I check here and see: okay, born by August — I have to return this one too. And although this one wasn't born by February, he was also not yet born by August; he was born later, in December, so he'll have a birthday soon, and I don't return him. What's the catch? I've read just the February vector and the August vector of the bitmap index.
So I read just two vectors. That's great, right? Instead of six reads I've spared myself four and done only two. Of course, if I have only one limit — if I want to run a query like persons born in March or later — I read just the vector corresponding to February: the persons who were not yet born in February. So then I read only one. Easy. If I want a point query, say all the persons born in March and in March only — that's the 3 here — then I need two reads: the February vector and the March vector, to make sure the person was already born by March but not yet in February. So again, two reads.
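A Python sketch of range-encoded bit vectors and of the two-read range query; the birth months are made up purely for illustration:

def range_encoded(birth_months, num_months=12):
    # Vector m has a 1 for every person born in month m or earlier,
    # i.e. a person that "already exists" by month m. Months are 0-based.
    return [[1 if b <= m else 0 for b in birth_months] for m in range(num_months)]

def born_between(vectors, first, last):
    # Born between first and last (inclusive): already born by `last`
    # AND not yet born by `first - 1`. Two vector reads, no matter how
    # wide the range is. (For first == 0 only the `last` vector is needed.)
    by_last = vectors[last]
    by_before = vectors[first - 1]
    return [i + 1 for i, (a, b) in enumerate(zip(by_last, by_before)) if a and not b]

births = [0, 3, 1, 7, 5, 11]                       # Jan, Apr, Feb, Aug, Jun, Dec
print(born_between(range_encoded(births), 2, 7))   # March..August -> [2, 4, 5]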
The big advantage is that this is great for indexing dimensions, like the time dimension, as I said. Fully indexing dimensions with B-trees, which are great in the OLTP case, would lead to several times more data, so it might lead to larger dimensions not fitting into memory together with their index. And if I index a dimension, I would be very glad if my index fits in memory, so that I don't need all the swapping and caching — reading part of the index, going through it, reading the next part and so on. B-trees don't fit in memory because they produce a large amount of data; they do much more than bitmap indexes, because they are also built for inserting and deleting and so on, which is something I don't care about in a data warehouse. Bitmap indexes, in contrast, are compact enough to fit in memory, and with them I can do whatever I need for the dimensions. The best thing about bitmap indexes is that they reduce the response time for large classes of ad-hoc queries on such tables, and they bring very high performance gains even on modest CPU hardware. So you can even consider doing this locally for a data mart — a bitmap index there is a great idea. Now that we've reached the end of our lecture, let me tell you what is actually important from what we've discussed today. B-trees: not fit for multidimensional data, not a great idea. R-trees are the solution for something like this; they use minimum bounding rectangles as their geometry.
Of course, there are also other possibilities. The operations an R-tree allows are selection; insertion, which can lead to overflow problems and therefore to node splitting; and deletion, which can lead to the need of compacting nodes, or even of removing the root when it is left with a single child. R-trees are inefficient because they allow overlapping between neighboring minimum bounding rectangles. Therefore, the solution for data with a lot of overlap would be to use a kind of tree which increases performance by not allowing overlapping, at the price of more storage. It's a trade-off: you say, okay, I don't care, I have some redundancy, I store this object in both parts, but in exchange I gain query performance. Another solution, which brings B-trees back into play, is using the Z-curve and Z-regions to transform multidimensional data into one dimension. The great thing here is that I can calculate, on this one dimension described by the Z-curve, where I should jump when intersecting with the selection, and I can do this nicely with bit operations, which are optimal to compute on our hardware. And the last topic of our lecture today was bitmap indexes. Bitmap indexes are great for attributes with set-like values, as in this example, where the values repeat very often — so not distinct values like a primary key; it wouldn't make sense to build a bitmap index on a primary key. The operations are efficient.
Again, they are bit operations, directly supported by hardware. And when I have the problem of too much storage being used, as for example with the time dimension, I can transform my bitmap index into one over a different, multi-component base. Or, if I perform a lot of range queries — as is the case, and practically all systems offer this — I use range-encoded bitmap indexes, replacing the zeros after a certain point, after a certain event on the time axis, with ones. Then, instead of the many reads otherwise needed, I can do just two of them. Good. In the next lecture we'll discuss optimization: we'll speak about partitioning, data partitioning, join-order optimization in the case of the star and snowflake schema, and about materialized views, which are actually at the core of data warehouse optimization. Are there any questions about what we've discussed today? Then I would like to thank you for your attention, and we will see each other next week.