
Revealing the SQL Server Magic


Formal Metadata

Title
Revealing the SQL Server Magic
Series Title
Number of Parts
110
Author
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You may use, copy, distribute, and make the work or content publicly available for any legal, non-commercial purpose, in unchanged or modified form, provided you credit the author/rights holder in the manner they specify and pass on the work or content, including in modified form, only under the terms of this license.
Identifiers
Publisher
Publication Year
Language

Content Metadata

Subject Area
Genre
Abstract
Think SQL Server is magical? You're right! However, there's some sense to the magic, and that's what I'll show you in this extremely deep dive session.
Transcript: English (automatically generated)
I'm kind of excited to see what you will get from this. So once you start zoning out and nodding off, that's the point where I'm going to pick up the pace and make it a bit harder. I usually give this warning at the beginning: this session is not meant to teach, it's meant to inspire you. You're not meant to learn all of what I'm saying here today. You're just meant to pick up a bit of it
and get an idea of what's happening. So don't fret if there's something you don't understand. I'm going to be up here afterwards. Just enjoy the show, and you've been warned. So no complaints afterwards. This is where I usually start spending 10 minutes about who I am. Suffice to say, I'm the tech lead of a small company
in Denmark, and I have no fancy titles. I'm not an MVP. I'm not an MCM. I'm nobody. So why am I here? About a year ago, I started an open source project just as a hobby. And today, it's ended up with a standalone parser for SQL Server data files written in C#
with a payload of about 256 kilobytes. And it'll parse about 98% of the AdventureWorks database that shipped with SQL Server 2008 R2. And it's open source. So you can get the source. You can use it. It's completely open. I want to show a quick example of how
we usually query SQL Server. We would create a single connection, a command, do a reader, and then loop over all of the rows, very simply. Now with OrcaMDF, it's kind of different. You just instantiate the database. You give it the paths of the database files. If you have multiple, just put them in any order.
OrcaMDF will parse them and figure out which is which. You create a data scanner, and you tell it to scan the Persons table. And then you can print out the rows if you want. Those are standard data rows. You can just loop over them. You can also do predicates. So you can say: where the field Age, which is of type short, is less than 40.
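In code, that usage looks roughly like the sketch below. Treat it as an approximation: the namespace, the table and column names, and the exact class and method names (Database, DataScanner, ScanTable, Field<T>) are reconstructed from memory rather than copied from the OrcaMDF source.

```csharp
using System;
using System.Linq;
using OrcaMDF.Core.Engine;   // namespace is approximate

// Open the data file(s) directly; no running SQL Server instance is involved.
using (var db = new Database(new[] { @"C:\Data\AdventureWorks2008R2.mdf" }))
{
    var scanner = new DataScanner(db);

    // Scan every row in the Persons table; rows come back as plain row objects.
    foreach (var row in scanner.ScanTable("Persons"))
        Console.WriteLine(row.Field<string>("Name"));

    // Predicates are just LINQ, e.g. everyone whose Age (a short) is below 40.
    var youngsters = scanner.ScanTable("Persons")
                            .Where(r => r.Field<short>("Age") < 40)
                            .ToList();
}
```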
So it has full LINQ support, so you can just query it. It'll also scan indexes. You just create an index scanner, and you can scan the CX_PersonID index on the Persons table. And once again, you just get standard data rows with index data in them. And you can even query DMVs, dynamic management views,
which contain a lot of metadata in SQL Server. And those are a bit interesting, because they're not persisted to disk. So they're actually simulated live. And I'll show you later on how that works. And I have a quick demo. I'm not going to show you the code actually running, because there's quite a lot of it. But I also created a small GUI for it, since there were like two downloads on GitHub.
So I wanted somebody other than me to be able to try this. I've got a standard AdventureWorks database. And just to show that SQL Server is not running, I rename it. I can do it. It's not locked. It's a normal file. And I can open it up. And OrcaMDF Studio will find it, show me all of the tables.
And I can do a select from the product table. And I've got the product data. I can sort it. I can query it. I can set filters in it. And it just gives me the data. I can also go and query the DMVs. So if we want to see all the tables, I can just query it. Or if we want to go a level deeper, we can look at the base tables that SQL Server uses
with the actual internal metadata that OrcaMDF parses. So you can download OrcaMDF Studio. It works today. It's completely open source. And it uses OrcaMDF under the hood. So in SQL Server, we have a number of data files.
Just to keep it simple, a database is a single file. And a file is just one big array of these 8 kilobyte chunks, known as pages. And each of these pages has a 96-byte header and an 8,096-byte body. And everything in SQL Server is stored in this page format. So everything, metadata, your indexes, your data,
whatever you have, everything is stored on these pages. At the end of that body in the page, we have what's known as the slot array. And the slot array defines the logical order of the rows on these pages. So if we read it backwards, we see that row zero is stored up here.
We'll read the next one. And it just shows us where all of the rows are stored. They're not necessarily stored in physical order on the page. So we have to read the slot array to find out where the data for a given row is on this page. And the records that I just showed you, a record is the same as a tuple, it's a row. It's the actual row of data that's stored in the data pages.
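Before moving on to the records themselves, here is a small sketch of reading that slot array backwards from the end of a page (the slot count comes from the 96-byte header, which is covered later):

```csharp
using System;

// Each slot array entry is a 2-byte offset, stored from the end of the 8 KB page
// backwards: slot 0 occupies the last two bytes, slot 1 the two bytes before it, ...
static int[] ReadSlotArray(byte[] page, int slotCount)
{
    const int pageSize = 8192;
    var offsets = new int[slotCount];
    for (int slot = 0; slot < slotCount; slot++)
        offsets[slot] = BitConverter.ToUInt16(page, pageSize - 2 * (slot + 1));
    return offsets;  // offsets[n] = byte offset within the page where record n starts
}
```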
These records store the data, they store the indexes, they store the source code for your stored procedures and your views. Everything is stored as records on pages in data files. And all of these records are stored in a format known as the FixedVar format. I'll go a bit more in depth on this in just a moment.
It's got a couple of status bytes that show a bit about what's stored in this record. Then there are two bytes that point to the null bitmap, which I'll also go into detail on in just a bit. Right after that pointer, we have all of the fixed length data. So this is where your integers are stored, your char(10)s, all of your data types that have a fixed length,
they're stored in this portion of the record. Then we've got the null bitmap. First we've got a count of how many columns are covered by the null bitmap, and then we have the actual null bitmap. And I'll tell you what it is in just a bit. And finally we have the variable length section of the record. So this is where all of your variable length data is stored. Those are your strings, your vardecimals,
whatever we don't know the size of up front. So the first status byte naturally has eight bits. It's got a single bit that tells the version; this is always zero in SQL Server 2008 and up. We've got the record type: is this a data record? Is it an index record? There are three bits to tell what kind of record it is.
It might not have a null bitmap, and if it doesn't, it will tell us that there isn't a null bitmap in this record. It also might not have any variable length columns. And in that case, it'll also tell us that you shouldn't expect there to be a variable length data section. Finally, SQL Server might add some versioning information. So if you're using snapshot isolation
and it stores an earlier version in tempdb, it adds this 14-byte structure to the record, and this bit will be set in that case. And finally, we have a single unused bit, which is kind of weird when you look at the next status byte, because only one bit of that is used. So why wouldn't they just compress it all into one byte?
I haven't gotten an answer from them, but most likely it is for adding features in the future. If they ever wanted to expand this with just one extra bit of information, they would have to expand every row in your database with one byte, and that would cause fragmentation all over the place and basically be impossible for them. By wasting one byte per row,
they retain the possibility of adding features later on. And this is guesswork. There's a lot of guesswork today because I don't work for Microsoft. I don't have the source code. I've played with a lot of it, and I have some guesstimates, but I might be lying to you, so be warned. So null bitmap is some bytes
that track whether columns are null or not. So every column that you have in your record will have a bit in the null bitmap, and if that bit is set, it means the column has a null value. So if you have eight columns, we will use eight bits, a single byte. If you have nine columns, we're gonna use nine bits, which will take up two bytes space.
So in that regard, you actually get some columns for free. So you have to divide the column count by eight, rounding up, to find out how many bytes are used to store the null bitmap. And usually the null bitmap is always present on data pages, on data records, but there are a few exceptions where it's not present. And generally, you never care about this, but once you start parsing this stuff,
you're gonna run into all sorts of weird combinations and you discover some funky stuff, and you'll have a field day looking through it. Also, null bitmap, it'll use one byte even if you have just one column. So we have seven bits that are not used, and you would guess that those would probably be set to zero, so they're not null, but they're just garbage.
When SQL Server creates this null bitmap, it just takes the bytes from memory somewhere, flips the bits it needs, and just pushes it down to disk. So you might have some data that just doesn't make sense. The variable length offset array is the last part of the record. We've got the first 10 bytes, which is the fixed length data.
Then we've got two bytes that tell us how many variable length data columns we have in this record. And then for each of these, we have a two byte pointer that tells us where the data for that column stops in the record, because we know the data starts right after those pointers. And if we know where the data stops, we can find the data for that given column. So we can just find the actual length of data
for each of these variable length data columns in our record. I've got a small sample. This is a normal record. We've got a couple of status bytes. We've got the fixed length pointer, the null bitmap pointer. Since there's no fixed length data in this table, we only have a varchar, it starts right here.
So we've got the null bitmap. We've got one variable length column. It ends at byte index 16, so the last data byte is at index 15. And if we look at this, we know it starts here and it ends here. So these are our bytes. And they correspond to the value that's been inserted into the table. So this is a very simple example of how the variable length data section works.
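As a rough sketch of what parsing that layout can look like, here is a simplified walk over a single record. It skips forwarded records, the versioning tag and other special cases, and the helper name is made up; it just mirrors the layout described above.

```csharp
using System;

static void WalkRecord(byte[] record)
{
    // Two status bytes, then a two-byte offset pointing past the fixed-length data,
    // i.e. to the column count that precedes the null bitmap.
    short nullBitmapPointer = BitConverter.ToInt16(record, 2);

    // Null bitmap: a two-byte column count followed by ceil(columns / 8) bitmap bytes.
    short columnCount = BitConverter.ToInt16(record, nullBitmapPointer);
    int nullBitmapBytes = (columnCount + 7) / 8;

    // Variable-length section: a two-byte count of variable-length columns,
    // then one two-byte end offset per column, then the data itself.
    int varSection = nullBitmapPointer + 2 + nullBitmapBytes;
    short varColumnCount = BitConverter.ToInt16(record, varSection);

    int dataStart = varSection + 2 + 2 * varColumnCount;
    for (int i = 0; i < varColumnCount; i++)
    {
        int end = BitConverter.ToUInt16(record, varSection + 2 + 2 * i);
        Console.WriteLine($"Variable-length column {i}: bytes {dataStart} to {end - 1}");
        dataStart = end;
    }
}
```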
And this format is used in a lot of different places in SQL Server, so it's pretty important. And we'll be looking at a lot more of those. So one common misconception is that null columns take up no space; for fixed length columns, the data will always be stored on the page. So if you have 10 nullable
integers and all of those are null, they will take up 40 bytes in your record. Even if they're null, they will be stored on disk. For variable length data types, that is not the case. They will not be stored on disk, but they will retain that two byte pointer in the variable length data section. So they will use the two bytes no matter what.
And if you only have a single variable length data column, you will also use those two bytes to track how many variable length data columns we have in our record. So there's a lot of wasted space even if these values are null. The one exception is for trailing columns. So if you have a new column that you're adding that is nullable,
SQL Server will actually just see in the metadata there's an integer at the very end, but there's no data on disk, so it must be null. And that is the only case where you can actually have a non-persistent nullable fixed length data type. In 2012, you can do the same with default values. So if you add an integer that is not nullable
but has a default value, SQL Server will register that in metadata and actually not have to touch disk. So that can save some space, and more importantly, you can add this integer column to your terabyte database without blowing everything up. Otherwise, it would have to touch each of these records on disk. So in all versions prior to 2012, it will have to touch disk
once you add a fixed length column to your table. So when I started parsing this, one of the first things I needed to parse was the header in the file, in those pages. So those 96 bytes, I needed to parse those. The problem is there's absolutely no documentation out there. There's some great books written, but none of them touch the page header
outside of just saying these 96 bytes are the header. So I tweeted out, and Kimberly Tripp, one of the greatest SQL Server gurus, wrote me back: well, basically you can use time, patience, and this undocumented command called DBCC PAGE, which I'll show you in just a moment. So let me show you how I found out
what the header format is. I've got this small tool that I wrote that just converts between different number systems. So I can write decimal, a binary, a hex, whatever I want, and it'll just show me the different values. It's also open source. It's also on GitHub on my account, very simple. So what I'll do is just to create a normal database.
This is just an empty database. I'll create a table, insert a couple of rows. It doesn't really matter what it is. We just need a table. And then I'll use another undocumented command called DBCC IND. What this does is to tell us the page IDs of the pages that are used for this table. In this case, the one we're interested in is page 147,
which is the data page for this table. And by using DBCC page on page 147, we can see the actual bytes that are stored on disk for this page. We've got all of the bytes here. Most of them are zeros, given that there's nothing on the page. More importantly, what it shows us up here is the header.
So it tells us that we've got three records in this page. And if we look at the header down here, the first 96 bytes, we can start looking for those values. We just saw there was a value of three. And is there a three? There's a three right there. So we've got a single value of three that we know
because DBCC page tells us. And there's only one place in the header where the value of three is stored. So obviously that must be it. So we've just found one of these fields. We've also got a value of 8,057. And if we convert that, we should look for 79 1F. And we've got it right there.
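The conversion trick is nothing more than writing the known value out in little-endian hex and searching for that byte pattern; a tiny sketch:

```csharp
using System;

// 8,057 is 0x1F79, which a little-endian machine lays out on disk as 79 1F.
short value = 8057;
Console.WriteLine(BitConverter.ToString(BitConverter.GetBytes(value)));  // prints "79-1F"
```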
So we can start just looking at these values that DBCC page tells us are stored and just look at the bytes. Where are they actually stored? Now the problem is there's also a header version one. And if we look at the bytes down there, we've got a one right there.
We've got one right there. We've got one. There are lots of ones right here. There's two of them right there. So we don't really know which one of these is the one. Anyone have a suggestion? So what I then did was to find the actual file where it's stored. I'll just copy its path.
And then I will shut down SQL Server. While doing that, I'll open a hex editor. Open the file itself. And we know that this is page 147. And we know that each page is 8,192 bytes long. So 147 times 8,192 is the offset in the file where this page is stored. So if I go to that offset in my hex editor,
what we see are the actual bytes. And right now we've got a lot of zeros right here. So let's just do this. And this is where DBAs start to cry when I show this. And you shouldn't do this. So I've just entered some mumbo jumbo into my database.
I'm going to restart SQL Server. And I'll reconnect to my database. And if we do a select star from our table, I've just corrupted our database. So SQL Server can't even read the table anymore because it doesn't know what to do with the data. And once it does that, it disconnects you from the database.
So I'll just reconnect to it. Now what's interesting is that DBCC PAGE will actually still parse whatever it can. It'll do a best-effort parse. And if we look at it now, we can see that it points to the next page, which is page ID 13,330. And obviously there's no such page.
So it's got a weird value. We can also see that there's currently 30,806 rows in this page. And that would be equal to it having about a quarter of a byte per row, which doesn't make sense. But if we look at that 30,806, we should look for the hex values 56, 78.
And if we go down here, we see 56, 78. And you can kind of guess how this goes on. You can just sit there and substitute these values, trying to look them up. And slowly but surely you're going to lose your girlfriend, but you're going to get the header format down at some point.
And fortunately the format is the same, no matter the page type. So what you'll end up with once you do this is a file like this, which is the whole header, just written in C# with all of the fields, the byte indexes of all the fields, and then finally code that just reads them in. So at this point, I've got the whole header format spec'd out.
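To give a feel for what that ends up looking like, here is a heavily trimmed sketch. The field names mirror what DBCC PAGE prints (m_type, m_slotCnt, m_freeCnt, m_nextPage); the byte offsets below are illustrative stand-ins rather than the verified spec, which lives in the OrcaMDF source.

```csharp
using System;

// Trimmed-down sketch of a reverse-engineered parser for the 96-byte page header.
class PageHeader
{
    public byte Type;        // m_type: 1 = data page, 2 = index page, ...
    public short SlotCount;  // m_slotCnt: number of records on the page
    public short FreeCount;  // m_freeCnt: free bytes left on the page
    public int NextPageId;   // m_nextPage: next page in the chain (0 = none)

    public static PageHeader Parse(byte[] page)
    {
        // Offsets were found exactly as described: compare DBCC PAGE's header
        // output with the raw bytes. Treat these particular numbers as placeholders.
        return new PageHeader
        {
            Type       = page[1],
            SlotCount  = BitConverter.ToInt16(page, 22),
            FreeCount  = BitConverter.ToInt16(page, 28),
            NextPageId = BitConverter.ToInt32(page, 16),
        };
    }
}
```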
And you won't find this in Google, except for these source files. So this is completely undocumented, but now you've got it. So that's one step down. But, yes, usually I speak very fast. This is based on an eight hour pre-con, but shoot.
Yes, I couldn't in this case, given that I completely shut down SQL Server and just wrote into the file that was completely offline.
When I get data back, like when I've restarted SQL Server, yes. Oh, if I use OrcaMDF. OrcaMDF is completely read only. So you can't connect to a database that's live, given that SQL Server locks it, but if you do a volume shadow copy of the file,
you can read it through OrcaMDF, and there's no impact. It's completely read only at this point. The thing is, you can't read the data file while SQL Server's using it, because SQL Server locks it. But if you take a volume shadow copy of it, so basically a point in time snapshot of the file,
you can read that one, and the real file will still work. So it's just a point in time snapshot that you're reading of. Yes. Exactly. Yes, so the database has MDF files, data files,
and it has log files. I only read data files at this point. So I wanna do log files, but I gotta get rid of the puny remains of my life before I start doing that.
So at this point I've got the header down, but there's still a lot of data stored on the pages. So if we look at different data types, we basically have two different types. We have the fixed length types, and we have the variable length data types. These are the two different parts of the FixedVar record format. The fixed length types are bit, char, integer, decimal, date. All of these always have the same length.
The variable length data types are all of your string types. You also have the XML type in there, anything that has variable length. And there's also this little bugger called sql_variant, which can be anything, and you generally don't wanna use it, so try to hide it away. If you're using it, shame on you. There's rarely a good reason for it.
I'm not gonna go into all of the data types. I do not have the time for it, but there's one of them that's quite interesting, vardecimal, which is a decimal stored in a compressed format where it only takes up as many bytes as it needs. So it's a fixed length data type that's been made into a variable length data type.
Once you look at it, when you define it, you define the scale and the precision. You define how many digits you want, and where the decimal point should be placed. So internally, decimals are just stored as one big, humongous value. It doesn't know anything about precision or decimal points. But that nine just defines at what position the decimal point should be placed in this big, humongous number.
So we're storing one, two, three, four, five, six, point seven, eight, nine. And what SQL Server actually does is to store these digits in groups of three. And if you have a group of three, you need to represent the numbers from zero through 999. And we can do that using 10 bits, which represents the numbers from zero through 1,023.
So we're basically wasting 24 extra values. But if you look at the different sizes they could have chosen, do you want to do 10 bits for three digits? Do you want to do 14 bits for four digits? There's a bigger waste with those, and 10 is pretty much the most optimal size they could have used. Obviously, I couldn't care less about this.
It was just a fun little exercise to try and reason about why they would choose that size specifically. So what SQL Server then does is to store only the groups of digits that actually have values. All of those trailing zero groups will just be truncated because we don't need them; the metadata tells us to expect two more groups of zeros, and we save those 20 bits by not having to store those extra digits on disk.
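As a back-of-the-envelope sketch of that digit-group idea (ignoring the sign, the exponent and the rest of the real on-disk format), the packing and truncating works out like this:

```csharp
using System;
using System.Collections.Generic;

// Pack decimal digits into groups of three (each group fits in 10 bits: 0..999),
// dropping trailing all-zero groups the way the talk describes.
static List<ushort> PackDigitGroups(string digits)
{
    int paddedLength = (digits.Length + 2) / 3 * 3;   // round up to a multiple of three
    digits = digits.PadRight(paddedLength, '0');

    var groups = new List<ushort>();
    for (int i = 0; i < digits.Length; i += 3)
        groups.Add(ushort.Parse(digits.Substring(i, 3)));

    // Trailing zero groups aren't stored; the metadata remembers how many were dropped.
    while (groups.Count > 0 && groups[^1] == 0)
        groups.RemoveAt(groups.Count - 1);
    return groups;
}

// 123456.789 as a decimal(18,9) is the integer 123456789000000 plus a scale of 9;
// the two trailing "000" groups vanish, saving those 20 bits on disk.
Console.WriteLine(string.Join(" ", PackDigitGroups("123456789000000")));  // 123 456 789
```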
I've got a very lengthy blog post on this in case you're really bored some day. So, looking at the variable length data types, we've got two kinds of those. We've got SLOBs and LOBs. LOBs are large objects, and SLOBs, I kid you not, are small large objects.
Those are the official definitions of them. So SLOBs, those are varchar(x), nvarchar(x), and varbinary(x), x being from one to 8,000. If you want to have a larger value, you need to use the LOB types like text, ntext, image. Those are the classic ones. And you have the newer ones, the max types: varchar(max), nvarchar(max), and so forth.
And let me just show you how they work. I have got an empty database. I created a table with two varchar(8000) columns. And I will insert a single row in there with two values, each of them having 3,000 characters. And did I do that? I think I did. So using DBCC IND, we've got a data page, page ID 148.
And using DBCC PAGE, we can find the bytes. And seeing as you can all read this already, I'm just gonna go quickly. I've got the status bytes, the fixed length pointer, two columns, none of them are null. We've got two variable length columns. And these are the pointers that we want to look at. So C50B and 7D17, those are the pointers
that point to the end of the data in the record. And if we try and convert some of those values, we've got C50B, that's 3,013, meaning we've got 13 bytes of overhead, then we've got 3,000 characters. So the data ends at 3,013. Looking at the other one, we've got 7D17, 6,013.
So 3,013 plus 3,000 extra characters of data. Very simple. So let's try and insert a record with two 5,000 character columns. Now we've got the data page 161. If we take a look at it, we've got the exact same format. And we have got 95, 13.
So 5,013, 13 bytes of overhead, 5,000 bytes of data. The other one is AD, 93. So that ends at byte index 37,805, even though the page is only eight kilobytes. So something is amiss. What I found out is that to represent this number, we have got 16 bits,
but we really only need 15 bits because it can't be any more than 8,192. And 15 bits is plenty for that. So we've got a single bit to spare. And it seems that if you just flip the very last bit, you get a value of 5,037, which is much more sensible. It's still not the data that we're looking for, but at least it's a valid value.
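A minimal sketch of that decoding step, using the value from the example (37,805 is 0x93AD, and clearing the top bit gives 0x13AD, i.e. 5,037):

```csharp
using System;

// 16 bits are stored, but a page offset never needs more than 15 of them,
// so the spare top bit means "this is a complex column pointer, not inline data".
ushort raw = 0x93AD;                          // the 37,805 from the example
bool isComplexColumn = (raw & 0x8000) != 0;   // true
int endOffset = raw & 0x7FFF;                 // 0x13AD = 5,037
Console.WriteLine($"{isComplexColumn}, {endOffset}");
```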
So what they use that very last bit for is to indicate that this is not the data you're looking for. This is a complex column that is going to show you where the data is. And they use these for row overflow, which means, in this case, you've got two 5,000 character columns. They can't be stored on a single page. So one of those columns is pushed off to another page,
and a pointer is left behind. If we look at the pointer, DBCC PAGE will actually show you that pointer and the values in it. So I looked at it, and I saw it had an update sequence value, which corresponded to the bytes in red. It had some pointers, which is the blue stuff.
And then it had a timestamp with this value down here, which kind of corresponds to what's stored on disk. The only thing is that this is stored in little endian, so the byte order is reversed compared to how it's displayed. And we'd need those four zero bytes right here, but they are not stored on disk. And I spent way too much time just trying to figure out what I was doing wrong
until I cried on Twitter. And Paul Randal, another excellent SQL Server guru, who worked on the team back in the day, told me: well, they're just not stored on disk. And you'll run into a lot of these situations where there's just no clear answer; there are some implementation details that you'll have to just guess about, or be lucky that they'll answer you on Twitter, or find somebody.
But generally, you just have to guess. So anyways, I now knew the format of that pointer structure. Looking on disk, we've got the ID of this complex column, which is two. We've got a field for the level. I can match these up to what DBCC PAGE tells me is up there.
We have got the timestamp as well. Interestingly enough, there's a field called unused. I'm not sure whether it's used or not, but it's called unused. I don't know what it's for. More importantly, we've got the size of the data. We've got the page ID, the file ID, and the slot ID on that page. So using this pointer, we can find out where the data is.
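Laid out as a structure, the pointer fields just described look roughly like this. The field order and widths follow the talk; they are approximations, and a couple of header bytes are glossed over, so the widths below don't necessarily add up to the full 24 bytes.

```csharp
// Sketch of the 24-byte row-overflow pointer (blob inline root) described above.
class RowOverflowPointer
{
    public byte ComplexColumnId;  // 2 = row-overflow pointer
    public byte Level;
    public ushort Unused;         // named "unused" by DBCC PAGE; purpose unknown
    public uint Timestamp;        // only the low four bytes are persisted on disk
    public uint Length;           // size of the off-row data
    public int PageId;            // where the blob fragment record lives
    public short FileId;
    public short SlotId;          // slot of the blob fragment record on that page
}
```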
And if we then look up that slot, that record where it's stored at, we will find a record known as a blob fragment record. It's got a couple of status bytes that tell us what kind of record this is, a blob fragment record. A fixed length pointer that points to the very end of the record. So all of the data is stored in this part, which
is the fixed length part. And what's in here is a special structure that stores a timestamp that matches what's in the pointer. So it can match up the two and check that they were written at the same time. We've got an ID that tells what kind of structure this is. And finally, we've got the data itself, which we can read. Once you row overflow, these will
be put on pages known as text mix pages. And if you have multiple columns that are overflowing from the same table, they may be stored on the same overflow pages, provided there's space for it. So it's not going to allocate a complete 8 kilobyte page, even if you overflow like 2 kilobytes of data. It'll just push in more of those. So once it pushes off row, it will only
do so if your data is larger than 24 bytes, given that the pointer itself is 24 bytes. So it doesn't make sense to overflow anything smaller than that. Once it does so, it'll leave behind what's known as a blob inline root pointer, which is this 24 byte structure that I just showed you. It'll push it off to a blob fragment record on a text mix page.
And one issue is that performance can be difficult to predict, because you're inserting a record with 6,000 characters of data. And you expect it to be inserted in row. So you've got a previous page, a new page, a new page. What you don't know is that there's already a record on there with 4,000 bytes of data. So your record is going to be pushed off somewhere else.
When the disk needs to read your data, it's going to go like this, just trying to find your data all over the hard drive. So it could be very difficult to predict what kind of performance you're going to get, because you'll basically get fragmented reads, because some of your data is pushed off row. Those were the slobs, the small, large objects.
Now I'll show you the large objects, varchar(max). So these are the new kinds of large objects. I will just need to disconnect this guy and connect this guy, just like this. So I'll create an empty database. We've got a table. I've got a Java and a Python column,
just to put in some dummy data. We're not going to use it, just to take up some space. And a varchar(max). Then I'll just insert the value ABCD. Look at page 148. And what we'll see at the end is that SQL Server tells us through DBCC PAGE that this is stored in a blob inline data structure.
And that's not actually a structure; it's just the data stored in row. It just kind of tells us that it's a structure. If we instead insert 5,000 characters and look at the page, what we'll see is a blob inline root, because we've already got 5,000 characters of data through the char column, so it has to row overflow it.
And this is the exact same thing as we saw just before. It simply inserts a blob inline root pointer that points to a different page. And if we insert 16 kilobytes of data and look at the page, what we'll see is, again, a blob inline root. Now we just have two pointers. So that blob inline root can actually have multiple pointers
that just tells us first you need to get the first part of data here, and then you need to get the next part here. And we just follow all of these pointers. If we instead try and insert 48 kilobytes and look at the page, what we will see is, once again, a blob inline root.
But now there's only one pointer, even though we inserted more data. That doesn't really make sense. So let's try and look at page 148 that it points to. If we look at that guy, what we will find is another structure, in this case known as an internal. And that one has a number of pointers. So now we have a blob inline root pointer
that points to another structure known as an internal, which then points to where you're going to find your data. So we need to follow this whole structure. So blob inline data: once it fits on the page, that is the structure we're going to use. Just the data in row, completely normal. When it doesn't fit in row, we're going to overflow it and insert a blob inline root pointer.
And once it overflows, we're going to use some different records to actually store the data. I showed you the blob fragment record before, and what I'm going to show you now is a series of different blob structures where it actually stores the data. And all of these are stored in completely normal records.
I've got a couple of status bytes and the fixed length pointer. So when you look at these next structures, just remember that we've got four extra bytes, which is the overhead in the record, but we're not going to look at those now. So one of the structures we're going to see is a type 3 data structure. And this is where your data is actually stored. So once it overflows, it'll put that on a data record
and you can find your data. And the data record has got an eight byte blob ID, also known as the timestamp, and a two byte type field that tells us what kind of structure we're looking at, which for data is 3. And if you look at data, we've got 8,096 bytes in the body in total.
We need two bytes for the slot pointers. We've got 8,094 bytes. And we've got 10 bytes of overhead here, and we've got four bytes of overhead in the record itself. So we're down to 8,080 bytes left on the page. But the most SQL Server will actually store is 8,040. And the reason for that is the SQL Server may need to add this 14 byte versioning structure,
and it may need to add other stuff while rebuilding indexes and doing other maintenance operations. So just to ensure that it has space, it won't fill out the page completely. So the internal has a couple of fields as well. It is type two, same structure as we saw before. It stores how many pointers do I have right now? This one has 19.
It stores a max links field, which is weird because it says 501, but it won't store more than 500. So I have no idea what it's used for. It doesn't really make sense. If we look at it, it's got a blob ID, timestamp, just as the other types. It's got a two byte type indicator that tells us this is an internal.
It's got a max link, current links, and level. So if you have multiple of these pointing to each other, a tree is going to be built, and it's going to say this is level one, this is level two, and so forth. And finally, it's got an array of pointers that tells this is where you're going to find the actual data. So to sum up, if the data fits in the record, it'll just store blob inline data in the record.
If you've got less than 40 kilobytes, it'll store a blob inline root that points to different data lob structures. More than 40 kilobytes, it's going to store a blob inline root that points to an internal that points to a series of these data lob structures. More than that, up to four megs of data, well, more than four megs of data,
you'll have a blob inline root that points to multiple internals that point to a lot of these data lob structures. More than 16 megabytes, you'll have a blob inline root that points to an internal, that points to another internal, that points to data. And can anyone guess what happens then? We don't need any more, because if we have just four pointers in that blob inline root that points to 500 internals, that in turn point to 500 other internals,
we can store up to 7.5 gigabytes of data. And given that the max size is 2 gigs, we have plenty at this point. So we don't need any more levels than that. However, you might see some different variations of this, but as long as you know how to parse these individual pointers, you don't really care whether you're following like 17
different pointers. You will find your data eventually. Those are the max types. Does anyone want to repeat it just quickly? No? Any questions? Excellent. I see brains melting.
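The capacity claim is easy to sanity-check with a bit of arithmetic, using the figures from above (roughly four pointers in the blob inline root, up to 500 links per internal, and about 8,040 usable bytes per data page):

```csharp
using System;

// Four root pointers, two levels of internals with up to 500 links each,
// ~8,040 usable bytes per data page:
long maxBytes = 4L * 500 * 500 * 8040;
Console.WriteLine(maxBytes);   // 8,040,000,000 bytes, roughly 7.5 GB
// Far more than the 2 GB cap of the max types, so no deeper tree is ever needed.
```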
So let's look at the classic lob types. Those were the new ones, the smartly designed ones. Let's look at the old ones. Has anyone worked on Sybase, on their team perhaps? I think there are remnants in here. So the classic lob types don't store any data in row, ever. They always push it off row.
You can actually enable those new types to do the same by setting a table option called 'large value types out of row'. If you do that, the new lob types will actually act completely like the old types. So don't do it unless you have a really good reason. What it does is to leave behind a complex column pointer. So just as we saw before.
But this time, it's known as a text pointer. A text pointer has an 8-byte timestamp. It's got a page ID, file ID, and slot ID. So pretty much like the blob inline root, just only 16 bytes, and a bit more to the point. This is the page pointer. If we look at that, if we store less than or equal to 64 bytes of data, it will put down
a structure known as a small root. And the interesting thing is, no matter how much data you store, as long as it's less than 65 bytes, it's going to store 84 bytes on disk. So the small root can store up to 64 bytes of data. If you only store 24 bytes, you've got 40 bytes of garbage on disk.
And that took me some time to figure out as well, because I was looking at this structure and I had way too much data. Couldn't make sense of it. I thought it was important. But it's just garbage. So no matter what, 84 bytes. If it doesn't fit in those 64 bytes, it's going to insert what's known as a large root Yukon structure, which is kind of like a blob inline root or an internal structure that just
points to other places. It's got a header, and then it's got an array of these pointers. And the large root Yukon structure, just like data and internal, has a type; it's type 5. This one won't be less than 84 bytes. So once again, if you've got a header that takes up 20 bytes, and you've got a single
pointer that takes up 12 bytes, you're up to 32 bytes, if that is all you need, it's going to take up 84 bytes no matter what. So this is just a lot of garbage data. And this is a general pattern with those old types. There's a lot of garbage in there that just doesn't make sense. So for the classic lob types, less than 65 bytes of data, it will be stored on the small
root. Less than 40 kilobytes of data, you'll point to a large root Yukon that then points to the data lobs. More than 40 kilobytes, we will have a text pointer pointing to a large root Yukon that then points to an internal, the exact same internal we saw just before, that then points to data lobs. And if you have more than that, you'll have a text pointer pointing to a large root
Yukon, pointing to an internal, pointing to an internal, pointing to data lobs. And that is all we need, because once again, we have those, in this case, three levels; we just have this large root Yukon for whatever reason, but we have all the space we need. So an interesting thing to observe is that if we store null in either the text, which is the classic lob type, varchar(max), the new one, or varchar(x), all of those are
going to store zero bytes, because it's null. If we store zero to 64 bytes of data, let's just say this is zero bytes of data, we're storing an empty string, varchar(x) is going to store zero bytes on disk, varchar(max) is going to store zero bytes on disk, and the text type is going to store 16 bytes
of text pointer, pointing to an 84 byte structure that doesn't contain any data, so you're wasting 100 bytes to store an empty string with the text type. Do not use these classic types if you can avoid it. If you're storing 65 to 8,000 bytes of data, the varchar(x) will usually store just that data. It might have a 24 byte blob inline root pointer, and it might have 14 bytes of overhead
for the record it's pointing to. Same for varchar(max). The text will store a 16 byte text pointer, pointing to an 84 byte large root Yukon, pointing to another page, another record with 14 bytes of overhead, and then you have the data. So obviously, there's a lot of overhead in this type.
Once you store more than 8 kilobytes, which we can't do in varchar(x), varchar(max) will generally store whatever is needed for the tree and the data, plus the 24 byte blob inline root pointer. The text type will store pretty much the same tree, using internals, using data lobs, but it'll have the 16 byte text pointer and the large root Yukon.
So the more data you're storing, the smaller the difference actually is. But generally, there's no case where the classic lob types win over the new ones, so you don't want to use them. There's also some difference on the performance side. I've got a one hour session on that, if anyone is bored tonight. Otherwise, there's a blog post by another clever guy who has done a lot of testing
on this. Generally, don't use it. One interesting thing is, I've mentioned some different structures here. I've mentioned small root, internal, data, large root Yukon. If you look at the types of those, small root is 0, internal is 2, data is 3. I kind of wondered, what is type 1, because there's a gap in there.
What is type 4, and are there any more types? And if we look at these different lob structures, generally you've got some kind of header. But more interestingly, you've got a two byte value that says: what kind of structure is this? So I shut down SQL Server again, opened up the file in the hex editor, and changed
those two bytes. I didn't change anything else, just those two bytes. I changed them to a 1, a 4, and a 6, and it came up with some different names. Obviously, I corrupted the database each time, but DBCC PAGE was kind enough to give me the names of these structures. We've got a large root. We've got a large root Shiloh. Shiloh was the code name for SQL Server 2000.
We've got a super large root. I have no idea what it is. I've never seen it in the wild. What I'm thinking is, at some point, pre-SQL Server 2000, they probably had this large root structure. Then in SQL Server 2000, they needed to change something, so they added the large root Shiloh. In 2005, code name Yukon, they needed to change it again, so they added large root
Yukon. And super large root is probably pre-SQL Server 2000, SQL Server 7, maybe even Sybase. I have no idea how old it is. And once I got up to 8, it began giving me null and invalid. 7, interestingly, was neither null nor invalid, but it had an empty name.
I'm not sure what it is. I'm not sure if they're used. You won't see these types in SQL Server 2005 and 2008+, but they are probably out there. So just to sum it up, if you have less than 8,000 characters, the max types are pretty much the same as the x types. Logically, they're stored slightly differently, but on disk, it's the same.
More than 8,000, you start building a whole tree. And if you're using classic text types, there's a lot of legacy cruft in there. There's a lot of overhead, so don't use them. Any questions before I speed up, since I'm running short on time?
So SQL Server stores stuff either as a clustered index or a heap. Clustered indexes and heaps are where data is actually stored. Primarily the only difference is that clustered indexes guarantee the order of your data, and the data is stored in a B-tree, whereas a heap just puts your data wherever there's space, and there's no tree structure.
And once it does this, it also keeps track of this logically in what's known as allocation units. So if you have an object which is either a heap or a clustered index or a non-clustered index, you have an object. Each of those will have a number of partitions. At least one. It may have more. And each of those partitions will have what's known as an allocation unit.
It will have the heap or B-tree allocation unit, also known as the HoBT allocation unit. And this is your in-row data. This is where your row is actually stored. If you have row overflow, it will be stored in the SLOB, the row-overflow, allocation unit. And if you have the classic text types, they'll be stored in the LOB allocation unit type, just as varchar(max) is also stored in the LOB allocation unit.
If we have a B-tree, we've got a root page somewhere. We just need to follow it all the way down to the very first leaf-level page. And since these are stored in a doubly-linked list, we can just follow that linked list of pages and just read all the data. A heap is kind of different, because it's not stored in a tree.
But it uses what's known as an IAM page, which is a tracking page that keeps track: these pages are mine, this is where my data is. So if we can just find that first IAM page and follow the links to the pages, and eventually a link to the next IAM page that has links to the actual pages, we can find all the data. So generally, we need to find that root page, or we need to find that first IAM page.
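Following the leaf-level page chain is the easy part once that first page is known; here is a sketch that reuses the PageHeader idea from the header section, with a placeholder file path and page ID:

```csharp
using System;
using System.IO;

// Walk the leaf level by following each page's next-page pointer in the header.
using var file = File.OpenRead(@"C:\Data\MyDatabase.mdf");  // placeholder path
int pageId = 147;                                           // placeholder: first leaf page ID
var page = new byte[8192];

while (pageId != 0)                        // 0 terminates the chain (simplified)
{
    file.Seek(pageId * 8192L, SeekOrigin.Begin);
    file.Read(page, 0, 8192);
    var header = PageHeader.Parse(page);   // the sketch from the header section
    // ... read the slot array and parse the records on this page ...
    pageId = header.NextPageId;
}
```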
So data is never just stored in pages on their own; logically, it's grouped into what's known as an extent, which is a group of eight pages, 64 kilobytes of data. And they come in two flavors. You've got the mixed extent, where different objects, different tables, different indexes,
store pages in the same logical group of eight. And the first eight pages of any object will be stored in mixed extents, just to save space, whereas from that point on, it'll allocate eight new pages at a time, a complete, uniformly dedicated extent for that object, once it needs more pages.
And the way it keeps track of this is through some allocation pages. We've got the global allocation map that keeps track, overall: is this extent of eight pages allocated or not, is it in use or not. We've got the SGAM, the shared global allocation map. It keeps track: is this used for mixed pages, are different objects in here, and
is there free space available. And each of these pages has those 8,192 bytes to work with, so it can track 63,904 extents, and 63,904 extents times 64 kilobytes is just about four gigabytes. After that, we've got a new GAM page, a new SGAM page, and we've got a new GAM interval.
So the data file is split up in these four gigabyte intervals that we need to track. We've also got the IAM page, which keeps track of which extents are uniformly allocated to a single allocation unit. So if we have a heap and it has an extent with data in, that IAM page will keep track
that these eight pages, this extent, is owned by this object, uniformly. And the header of this IAM page, I'm not going to go into that. More importantly, there's an array of single-page pointers, since the first eight pages for a given object are allocated from mixed extents. Those are shared with different objects, so we need single-page pointers.
After that, we just need pointers to the extents where the data is stored. And if we put this up in a table, the GAM bit overall just keeps track, is this extent in use or not? If it's one, it means that it's not in use. So basically one says it's available.
The SGAM will have a one if it's a mixed extent and there's free space. If there's no free space but it's mixed, it will be zero. And finally, the IAM bit says: is this uniformly dedicated to this allocation unit? And if we just combine all of these, we get the different states that the extents could be in, and we get a series of invalid states that don't make sense.
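A simplified sketch of turning those three bits into an extent state, mirroring the combinations just listed (the real decoding has a bit more nuance than this):

```csharp
// GAM bit:  1 = extent not in use (available)
// SGAM bit: 1 = mixed extent with free pages
// IAM bit:  1 = extent uniformly dedicated to this IAM's allocation unit
static string DescribeExtent(bool gamBit, bool sgamBit, bool iamBit)
{
    if (gamBit && iamBit)  return "invalid: can't be dedicated while not in use";
    if (gamBit)            return "available";
    if (iamBit)            return "uniform extent owned by this allocation unit";
    if (sgamBit)           return "mixed extent with free pages";
    return "in use: mixed and full, or owned by another allocation unit";
}
```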
It can't be uniformly allocated while it's not in use. So using this table, we can find out, for a specific extent, what's stored in it and what its state is. There's also another allocation page type called a PFS page, Page Free Space, which doesn't store a bitmap tracking extents but a byte map tracking pages.
So every page within its range of 8,088 pages will use a single byte that keeps track of how much free space there is. Are there any ghost records that need to be deleted physically from disk? Is it an IAM page, a mixed page, an allocated page, and so forth? PFS pages aren't that interesting in this case.
It's the other allocation pages that we really need to find the data. So quickly, an MDF file. The very first page is the file header. This is the very first thing OrcaMDF parses, to find out: is this your primary data file, is it a secondary data file, and where does it fit into the sequence of data files?
We've got the first PFS page, GAM page, SGAM page, and we've got some legacy pages that are unused. We've got a diff map page that tracks which extents have been modified since the last differential backup, so SQL Server knows these are the ones it needs to back up. We've got an ML map page that keeps track of which extents have been modified by a minimally logged operation since the last backup, once again used for backup purposes, and then
at some point it just repeats itself with a new PFS page, new GAM page, new SGAM page. More interestingly, page 9 is the boot page, and I will show you this in just a moment. Now comes the fun stuff, or funnier stuff.
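As a quick reference, the fixed pages at the start of the data file, as just described, are:

    Page 0    - file header
    Page 1    - first PFS page
    Page 2    - first GAM page
    Page 3    - first SGAM page
    Pages 4-5 - legacy, unused pages
    Page 6    - DIFF map (differential changed map)
    Page 7    - ML map (minimally logged / bulk changed map)
    Page 9    - boot page (primary data file only)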
So now we know how to parse it, but we need to find it. So we need the schema to parse it. We need to know what we expect to find. And more importantly, we need the pointer to that very first page so we know where to begin scanning. So let's see how we could do it.
I'll create an empty database, and I'll create a table; it doesn't really matter what the structure is, we just need a table. I'll insert a single record, and if we now look up this table in the sys.tables and sys.columns views, what we'll get is the whole structure of it. So we've got the names of the columns, the IDs, the types, which we can look up and find
that these are integers or varchars, whatever they are. So using this, we've got the schema. If we run DBCC IND, we get the page IDs, which is what we need. So we need a way to replicate what DBCC IND does. If we look up our table in the sys.tables view, we get an object ID.
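A minimal sketch of that demo; the database, table, and column names are made up for illustration:

    CREATE DATABASE SchemaDemo;
    GO
    USE SchemaDemo;
    CREATE TABLE MyTable (ID int, Name varchar(50));
    INSERT INTO MyTable VALUES (1, 'Hello');
    GO
    -- The schema: column names, IDs and types
    SELECT c.name, c.column_id, c.system_type_id, c.max_length
    FROM sys.columns c
    JOIN sys.tables t ON t.object_id = c.object_id
    WHERE t.name = 'MyTable';

    -- The pages allocated to the table (undocumented command)
    DBCC IND('SchemaDemo', 'MyTable', -1);

    -- And the object ID we'll chain on next
    SELECT object_id FROM sys.tables WHERE name = 'MyTable';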
And if we take that object ID and look up in sys.partitions, we get a partition ID. If we then take that partition ID and look up in sys.system_internals_allocation_units, we get an allocation unit ID. More importantly, we get a pointer to that very first page.
Using this, we've now got the pointer to the page, and we know how to parse it. We're done. Almost. So sys.tables to find our object ID, sys.partitions to find our partition ID, and sys.system_internals_allocation_units to find the allocation unit ID, and that pointer to the very first page.
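Chained together, and assuming the table from the sketch above, replicating what DBCC IND told us might look roughly like this:

    SELECT au.allocation_unit_id,
           au.first_page, au.root_page, au.first_iam_page   -- 6-byte file:page pointers
    FROM sys.tables t
    JOIN sys.partitions p
        ON p.object_id = t.object_id
    JOIN sys.system_internals_allocation_units au
        ON au.container_id = p.partition_id
    WHERE t.name = 'MyTable';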
Problem is that these are dynamic management views, and they're not stored on disk. So we need to look up in these to find the pages that we need to read. But we can't read these DMVs unless SQL Server's running, and that kind of defeats the purpose. So there's a problem here. But. There's always a but.
We've got an empty database. And if we use sp_helptext on the sys.tables view, it's actually going to give us the source code of it. If we look at this, what we see is that sys.tables at some point does a select from sys.objects$. So sys.objects$ looks interesting. We can do the same for sys.columns, and we will see it looks up in sys.syscolpars.
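For instance, sp_helptext just prints the view definitions:

    EXEC sp_helptext 'sys.tables';
    EXEC sp_helptext 'sys.columns';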
So we've got some more, kind of more internal tables here. So let's try and do a SELECT * FROM sys.objects$. It doesn't exist. So SQL Server will hide all of this from you, because you're not supposed to look at it. But there's a special way you can connect to SQL Server.
If you just connect again, and you prefix the server name with ADMIN:, and you've enabled what's known as the dedicated administrator connection, you can get some special privileges. One of them is that it allows you to query the base tables. So now we're querying sys.objects$, and we're getting all of these table names. And a lot of these are the internal base tables that are not returned by sys.objects normally.
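A sketch of that, assuming the dedicated administrator connection has been enabled and you reconnect with the ADMIN: prefix (for example ADMIN:localhost):

    -- From a normal connection: allow remote DAC connections if needed
    EXEC sp_configure 'remote admin connections', 1;
    RECONFIGURE;
    GO
    -- From a connection made via ADMIN:<servername>, the hidden base tables appear
    SELECT * FROM sys.objects$ ORDER BY name;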
We can also look up sys.syscolpars, and so forth. So let's try and look up sys.objects$ in sys.objects$, since we need the object ID to move on. Problem is, there's no result. sys.objects$ doesn't exist. If you look at the source code of sys.objects$,
you'll find that it's actually a hidden view. It's not really an object. What it's based on is sys.sysschobjs. And we can look up sys.sysschobjs in sys.objects$. And using that, we can now look up sys.sysschobjs in sys.sysschobjs, and we get the very base table where this metadata is stored. So this is the lowest level you can get: the row that says this object exists.
So we've got an object ID of 34. Using sp_help, we can also get the schema of this table, which is fixed, at least for a given SQL Server version. So it's just a matter of running this once, hard-coding it, and we've got the schema for sys.sysschobjs.
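Over the DAC, that lookup and the schema dump might look like this (a sketch):

    -- The row describing sys.sysschobjs itself; id comes back as 34
    SELECT id, name FROM sys.sysschobjs WHERE name = 'sysschobjs';
    -- Its schema, which is fixed for a given SQL Server version
    EXEC sp_help 'sys.sysschobjs';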
So we found the base tables. They've got a lot of weird data. I'm going to go really fast now. We can only query them through the dedicated administrator connection, and they've got a lot of weird columns that don't really match anything. So sys.sysschobjs is where we start. We've got an empty database, and we can look up sys.sysschobjs in sys.sysschobjs. We get an object ID of 34.
Using that, we can look up in sys.syscolpars to get the columns of sys.sysschobjs. Using that object ID, we can also look up in sys.sysrowsets, which is the base table behind sys.partitions, and we get the partition ID. And finally, we can look up in sys.sysallocunits using that partition ID, and we have found that very first page. So everything is good.
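Spelled out as DAC queries, the chain might look roughly like this; the base-table column names are from memory (SQL Server 2008 R2) and may differ between versions:

    -- Columns of sys.sysschobjs (object ID 34)
    SELECT name, colid, xtype, length FROM sys.syscolpars WHERE id = 34;

    -- Its rowset (the base table behind sys.partitions)
    SELECT rowsetid FROM sys.sysrowsets WHERE idmajor = 34;

    -- Its allocation units, including the first-page pointers
    SELECT auid, type, pgfirst, pgroot, pgfirstiam
    FROM sys.sysallocunits
    WHERE ownerid IN (SELECT rowsetid FROM sys.sysrowsets WHERE idmajor = 34);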
Now we have the base tables, but the problem is we still need that very first pointer to that first base table. So how do we find that one? If we do DBCC PAGE on page 9, which is the boot page, what we will find is that at some point there is a field known as dbi_firstSysIndexes
that points to page ID hex 10, which is page 16. So if we look at that page, we are actually going to find the sys.sysallocunits base table. Since we've got a fixed location for the boot page, we can now consistently find sys.sysallocunits, and we can parse sys.sysallocunits. The question is, can we use that?
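That lookup is just a DBCC PAGE dump of the boot page; the field shows up in the output:

    DBCC TRACEON(3604);
    DBCC PAGE('TestDB', 1, 9, 3);   -- file 1, page 9: the boot page
    -- Look for dbi_firstSysIndexes in the dump; it points at page (1:16),
    -- the first page of sys.sysallocunits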
This is the data we've got now. And if we hard-code a special allocation unit ID, 327,680, we will find a special allocation unit that is owned by the sys.sysrowsets object. So using that hard-coded allocation unit ID, we can now find the partition data.
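Expressed as a DAC query instead of offline parsing, that first hard-coded lookup would be roughly:

    -- 327,680 is the fixed allocation unit ID owned by sys.sysrowsets
    SELECT * FROM sys.sysallocunits WHERE auid = 327680;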
And using another hard-coded object ID of 34, we can find the allocation unit for the sys.sysschobjs table. So using just those two hard-coded values, we can now find sys.sysschobjs and move on from there. Once we've got sys.sysschobjs, we can look up sys.syscolpars in that one,
and we can look up the partitions, and we can look up the allocation units, and finally we've got the data for sys.syscolpars, which is the schema. Now we've got the schema, we've got the object, we've got the partitions, and we've got the allocation units. We've got everything. The one thing that remains is that when you look at the source code of this, there are a lot of weird internal functions.
In this case, this is Microsoft code, they are doing a select from sys.sysrscols, which is the base table, aliased to c, and then they are doing an OUTER APPLY on something known as OPENROWSET(TABLE RSCPROP, ...), and they pass the ti field of this table into it. And if you look up here,
all of these values stem from that very ti value. I googled this, and if you google it today, all you will find is me whining in a blog post that there are no Google results about it. So I couldn't really find anything existing. So what I did was I created a table with a lot of different data types in different configurations.
Then I did a select on the sys.system_internals_partition_columns view, joined that with the base table sys.sysrscols, and joined it with the sys.types view that tells us all of the different data types. And I got a result like this, which tells us that for the binary type, we've got a ti value of 12,973,
and these are all of the values that we need to extract out of that value. If we look at binary, and we convert that to hex, we've got two 0 bytes, and then we've got 32 AD. If we look at the type ID 173, and we convert that to hex, we get the hex code AD, and AD seems to match the very last byte.
So doing a simple bit mask, we can get that value out of the ti column, so we can get the system type ID. We also need the max length field, with a decimal value of 50, hex value 32, which is stored right next to it. So doing a simple bit mask and shifting it out, we can extract that value from the ti field as well.
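A sketch of the discovery query described above; the base-table column names (rsid, rscolid, ti) are from memory and the table name is made up:

    -- Run over the DAC against a table created with many different column types
    SELECT t.name AS type_name, rc.ti, pc.system_type_id, pc.max_length
    FROM sys.system_internals_partition_columns pc
    JOIN sys.partitions p ON p.partition_id = pc.partition_id
    JOIN sys.sysrscols rc ON rc.rsid = p.partition_id
                         AND rc.rscolid = pc.partition_column_id
    JOIN sys.types t ON t.user_type_id = pc.system_type_id   -- base types only
    WHERE p.object_id = OBJECT_ID('TypeSampleTable');

And for the binary example, the extraction itself is just a mask and a shift:

    -- ti = 12973 = 0x32AD for binary(50)
    SELECT 12973 & 255 AS system_type_id,   -- 173 = 0xAD = binary
           12973 / 256 AS max_length;       -- 50  = 0x32, the next byte up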
All is good. Except each of these different data types stores its data differently in the ti field. So you have to go through each of these data types and figure out how they store the values. And I have no idea why they didn't just store this in different columns. That would make my life so much easier. I have no idea why they do it. But I went through this for all of the different types,
and what I ended up with was a parser. So this is called the ti parser: it takes in a value, and it just gives you back all of these different values for the different data types. And I've got a whole bunch of unit tests for it, and it kind of works. And it actually does work. And with about five minutes to go,
I think that was my mark. Did anyone not get anything of what I said? Simple stuff, right? Anyone have any questions?
Once again? In SQL Server? Yeah. So I started out doing this for a conference in March last year. I just parsed a single page. I was doing an internal session. I wanted something special.
And I don't know what happened, but a year later I had this project. And right now it parses about 98% of the database format. I even parse compression. I parse sparse vectors, sparse columns. I've got a lot of special types parsed. There's not really a lot more to do. I need some minor stuff here and there, and they've just added columnstore indexes.
They add some new, different stuff that I might parse in SQL Server 2012, but really there's not that much that I'm missing from it. I have some special stuff like XML fields. I can parse the data out, but I can't actually convert it into the XML data. So there's some data types that I don't support yet, but I can get you the bytes that are stored on disk.
Well, a year is a long time, but it doesn't actually take that long. Once you just get the page format down, you get the header, and you just link it up, and magically, all of a sudden, you can scan a table. Once you get to that point, it's a very nice feeling. So I spent about a year on it as a hobby next to my job.
Thank you. Where would Microsoft let me use it? Or where would I like to use it? So I recently had a database server that wouldn't start. I applied a Windows update.
I tried to start it. It didn't work. I got some weird error in the event log. My first thought was that the master database might have been corrupted, so I opened the master database in OrcaMDF and verified that it was fine. And from that point on, I could just exclude that; that wasn't the issue. I found the other issue later on. It was a permissions issue. This is a very simple case where I actually used it.
Another thing is, did anyone notice that I didn't type any username or password for the AdventureWorks database? Unless you're using encryption, your data is open, completely open. You can read any data from any database using this tool without logging in. If it's encrypted, I can't do it yet. But you might also do it if you have a database
and you don't have logins for it. You might do it if you want to learn a lot of stuff about the internals just by looking at the source code. Technically, you could deploy this on your smartphone and read a SQL Server database with a 56-kilobyte DLL. I don't know why you'd want to do it, but you could.
I don't really have any practical use examples. This is like playing football. It's just fun. I don't... Yeah, so contained databases. The primary difference is that users are stored in the actual database. And I haven't looked at it yet,
but they're probably just stored in some kind of metadata table in a page. So I don't really care about the users because I just read the data. But if I wanted to parse the users, I would have to find that metadata table. And I know that the passwords are encrypted, so I can't get that out. But I don't think there's a lot of difference in the actual format. The format is the same, same pages, same structure and everything.
So that's just more metadata tables. Yes and no. So will data structures change when we go to SSDs?
If you were to create SQL Server from the bottom up today and you optimized it for SSDs, sure, you'd definitely make some different choices. SQL Server is heavily invested in the format they've got right now. They most likely won't change it. I can't see them changing anything major anytime soon.
The major difference is that a lot of stuff that you used to do, like defragmenting your tables, you don't really need to do anymore. It's still the same format, but you don't care about fragmentation. You care about filling your pages so you don't have a lot of free space on them, but you don't care about the sequence they're stored in, because with SSDs, random reads are essentially free.
So that's something that doesn't matter anymore, something you might not want to optimize for anymore. I don't see the format changing right now, but for a brand-new database engine, sure. Unless there's anything else, I'll be up here. Thank you for attending.