Data Encoding Schemes

Formal Metadata

Title
Data Encoding Schemes
Subtitle
Scales & Measurements
Title of Series
Number of Parts
29
License
CC Attribution - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this
Production Place: Ottawa, Canada

Transcript: English (auto-generated)
I'm probably best known for having put 10 years of my life into the ANSI standards committee and writing books about it. At some point I was an honest FORTRAN programmer years ago and then the minute I started doing SQL I was never allowed to do anything else. Oh the pain, oh the agony, but it's been a
good living. However I also, at one point in my life, was a statistician and actually worked with data. I think this is probably one of the things that us geeky programmer types forget about, where we become more interested in the
hardware, the indexing, the mechanics of our trade, and don't go back to the fundamentals of what we're supposed to be doing, which is working with data. And we don't get a good theory or any kind of background on it or any of the tools. You're just sort of left to figure it out on your own. So I have two manias.
This is essentially two talks based on one of my books, which if you buy I will be able to pay my mortgage. This is important. We're now getting the words "data science", but frankly data science, from what I've seen, looks an
awful lot like statistics with a thyroid condition and a six-digit paycheck. So maybe that's a good thing, but we really need to go back to the fundamentals of what data is, how we represent it, why there's different forms of it. In particular, anybody remember Donald Knuth, the older people?
Yeah: the Knuth, the whole Knuth, and nothing but the Knuth. His Art of Computer Programming is still one of the classics. I thought all eight volumes were out, plus a bunch of codicils. It's about this much space on your shelf, and he is probably the greatest computer
science guy we've ever had. He quotes everybody. Dijkstra quoted nobody. That's how you knew your indexing was good, because Knuth was encyclopedic. However, his first published work dealing with data was in
Mad Magazine when he was in high school. Look it up. It's for real. The Potrzebie System of Weights and Measures. It's a parody of the metric system. The illustrations are by Wally Wood. If anybody else is a Mad Magazine or a comic book fan, you'll know that name, and it's based on the
You have to be into particularly New York Jewish Yiddish humor, which was Kurtzman's thing, the editor of Mad Magazine, and the metric system to get it. The bad news about this, while it's good for a laugh in a humor
magazine, you run into this kind of crap in the real world when people invent their own systems of measurement in data processing systems. Let's get a couple of terms for a measurement out of the way. The range of a
measurement is how much area it covers, essentially, in the space you're trying to measure. If I'm doing it with a gun and a bullseye that I'm trying to hit, the range would be how big the target is. Some things are appropriate for one size. Some things are appropriate for other sizes. What's
the joke? Close enough is very good for horseshoes and hand grenades, but not so good for surgery. Granularity is how many divisions I have on my target. How many units of
measure do I have? That one is one of my main areas, too. I used to work for a state highway department. In the U.S., we use feet. Yes, there are now only two non-metric countries left on Earth, and I live in one of them. Myanmar, or whoever the heck it was, finally went metric, so it's just us
and Liberia. If you want to, you can carry out a calculation in decimal feet to several decimal places, but nobody has ever poured asphalt for a road or concrete for a bridge using a micrometer. But they will publish
calculations down to a ten-thousandth of a foot of asphalt. Ooh, can you see that? Is anybody actually measuring that? No, it's ridiculous. Precision is how repeatable the measurement is. If I take it over and
over, do I get pretty... I expect some errors to accumulate, but how close do I get? In the case of a gun shooting at a bullseye, how tight is my shot group? A tight group would be very precise. If it's sort of spread out all over the place, like I'm using a shotgun, it's not so precise. Accuracy has to
do with how close it is to the truth. How close do I get to the bullseye? Notice that precision and accuracy are not the same thing. A target with lots of rings has high granularity, maybe to the point that it's meaningless.
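The bullseye picture can be put into numbers. The sketch below is not from the talk. It assumes shots are recorded as (x, y) offsets from the bullseye, and it uses two made-up measures: `bias`, how far the centre of the shot group sits from the bullseye (standing in for accuracy), and `spread`, the average distance of the shots from their own centre (standing in for precision).

```python
from math import hypot
from statistics import mean

def accuracy_and_precision(shots, bullseye=(0.0, 0.0)):
    """Return (bias, spread) for a list of (x, y) shot positions.

    bias   = distance from the centre of the shot group to the bullseye
             (small bias means accurate)
    spread = mean distance of each shot from the group's own centre
             (small spread means precise)
    """
    cx = mean(x for x, _ in shots)
    cy = mean(y for _, y in shots)
    bias = hypot(cx - bullseye[0], cy - bullseye[1])
    spread = mean(hypot(x - cx, y - cy) for x, y in shots)
    return bias, spread

# Hypothetical data: good barrel, sights off. Tight group, high and to the left.
tight_but_off = [(-3.0, 4.0), (-3.1, 4.1), (-2.9, 3.9), (-3.0, 4.0)]

# Hypothetical data: good sights, loose barrel. Centred, but scattered.
centred_but_loose = [(2.0, 0.0), (-2.0, 0.0), (0.0, 2.0), (0.0, -2.0)]

for name, shots in [("tight_but_off", tight_but_off),
                    ("centred_but_loose", centred_but_loose)]:
    bias, spread = accuracy_and_precision(shots)
    print(f"{name}: bias={bias:.2f} spread={spread:.2f}")
```

The first group has a small spread and a large bias (precise, not accurate); the second has essentially no bias and a large spread (accurate on average, not precise).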
A tight cluster of shots is precise, but not necessarily accurate: I've got a really good gun barrel, but my sights are a little bit off, so I'm always high and to the left or something when I shoot. The other way around, the scope's good, the barrel's loose, and I sort of get in the general
neighborhood. There's also the concept of a zero point on a scale, on a measurement. It's where the scale starts. Where do I start measuring? Sometimes there isn't one, but it's a useful concept. This is not
necessarily a numeric zero, but it's where your scale starts. There's also a metric function, which is basically the triangle inequality, if anyone remembers that one from high school algebra, as a property. If I've got a function with metric properties in my scale, I can do
calculations with them, and they can be meaningful. What's sort of funny is scales and measurements actually didn't come in until the 1940s as a science. You would think that somebody before that in statistics or something would
have come up with it, but no, it was a little late in coming. So what's the simplest scale I can use to measure something? The nominal scale. Take your values, assign a name to them. In fact, a lot of people don't even like to count this as a kind of scale. That name, by the way, can be a tag number, a
character string, or technically can be a symbol. We don't like to use symbols very much. They're a bitch to put into a computer, they don't transport very well, and they're not always obvious. The advantage of using a symbol
to measure something, to name something, is that it's language independent, completely language independent. If you want to look at some really beautiful examples of that with a really elaborate system, get a book on Renaissance artwork. The stone cutters in Italy, who were working for Michelangelo and
all those guys, were pretty much illiterate. They all had individual marks. Each of the sculptors had an individual mark that he put on the pieces of marble he was getting. Then little tags and other systems off of those marks showed the father, the son, the grandson, in given families.
They look like alchemist symbols. Be grateful that you do not have to put those in any database you're ever going to work with. They're just a nice little piece of artwork, and that's it. I can't do any calculations on a
nominal scale. About all I can do with a nominal scale is ask, are you Fred Jones? Is this your name? The crudest form of a scale. Also one we use an awful lot, and we do a bad job of it by the way. Naming things inside
databases could be better. Let's be kind about it. If I represent it as character strings or numbers, I can order them. But I'll be doing an ordering on the symbols, on the representation, rather than on the meaning. These are individuals. There's no grouping or category yet. I told you this was the
simplest way to do it. The next level up is a categorical scale where I've got a group, a property, a category with a name. Sets. I'm back to just a simple bunch of sets. Fido is my dog. Dogs are mammals. I can do set operations.
That's it. Union, intersection, all that good stuff. Categories are important, and how you, when we get to the encodings, how you work the categorical scale is a little trickier than most people think.
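A small sketch of those set operations, using Python's built-in `set` type. The category contents and the helper `categories_of` are invented for illustration and are not from the talk.

```python
# Hypothetical categories, each represented as a plain Python set of names.
dogs = {"Fido", "Rex"}
mammals = dogs | {"Tom the cat", "platypus", "echidna"}   # union
egg_layers = {"platypus", "echidna", "robin"}

# Membership is the basic question a categorical scale can answer.
fido_is_mammal = "Fido" in mammals

# "Dogs are mammals": every dog is also a member of the larger category.
dogs_are_mammals = dogs <= mammals                         # subset test

# Categories may overlap: some mammals are also egg layers.
egg_laying_mammals = mammals & egg_layers                  # intersection

def categories_of(item, categories):
    """Return the sorted names of all categories that contain item."""
    return sorted(name for name, members in categories.items() if item in members)

scheme = {"dog": dogs, "mammal": mammals, "egg layer": egg_layers}

print(categories_of("Fido", scheme))
print(categories_of("platypus", scheme))
print(categories_of("Martian", scheme))   # fits nowhere: an empty list
```

An item that lands in several categories, or in none, is exactly the design decision the scheme has to make up front.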
Problem is with these categories, can I have them overlap or not? What happens when I get something that's weird? Any Robin Williams fans, or is he pretty much an American thing? Okay, I thought he would be more, would be international,
but one of his routines when he was doing nightclub stand-up consisted of God smoking a joint and deciding, I'm gonna make a platypus. That'll fuck up Darwin. How do you classify a platypus? Is he the only guy in his category? No, it turns
out there's actually about four other egg-laying mammals. The others are echidnas. Of course, they're in Australia. All the weird stuff goes to Australia. What do I do when I've got something that just doesn't fit? Say, a Martian. I can make a new category. Okay, then I have to have allowed for new categories in my
scheme. Or I can put it in a miscellaneous category. That's a good way to screw up things, because miscellaneous winds up being so mixed that you can't do any meaningful work with it. Everything you forgot about, it's like your garage. It gets put in the category and piles up until somebody comes and cleans it out, or you can just pretend
it doesn't exist and exclude it. Now, the other question with a categorical scale is, can I actually see individual members or are they simply members of a group? It's worth telling people apart. It is not worth numbering grains of sand.
The idea of a commodity in a categorical scale is a little hard for people to get their heads around. Absolute scale is just a count on the set. Finally, I can do some math. I can add and subtract numbers. The
elements in my groups have to be interchangeable. It's a dozen eggs. It's not egg one, egg two, egg three. It's a dozen eggs. We tend to give names to these units: a dozen, a gross, a quire, a ream. Oh, I'm sorry, which ream in
the paper industry? There's about three of them. 500 versus 450 sheets of paper of certain sizes. My favorite, of course, the six-pack. It's not a drinking problem. It's a solution. What was funny was in England when they went over to
their decimalization way back when, one of the dairies there put out a 10 pack of eggs. I'm sorry, English-speaking countries, traditionally, yeah, metric eggs. Ooh, big promotion. English-speaking countries have a love of dozens for eggs.
It didn't sell, never mind that the cost per egg in those 10-packs was actually cheaper than it had been in the original dozen packs. For people, it was just too strange. All right, if you went to a beer store, would you like to buy a
six-pack or a five-pack, even if the cost per beer was cheaper? Wouldn't it just seem somehow wrong to bring home a five-pack for the kids? Those traditional units really get locked in. Okay, obviously, they've got a zero point
on this scale, the empty set. An empty egg carton is where things start counting. Ordinal scale. I'm going to put an order on something. No operations, just comparisons, just a sequence. No origin, no zero point. Anybody else have
to take a geology class in college? Yes, we've got at least one geology victim. The only nice thing about geology, as far as I was concerned, was they gave you an axe, and you go out and hit rocks with it. The rest of
it, I have never had to use anything I learned in my freshman geology class at any time. It did not help me pour concrete for a driveway or anything. Absolutely useless. But when we went out to the field to whack at rocks with our geologist's pick, we got a box of samples called a Mohs scale, and it
was mineral samples in a tin compartment box, and what you would do is you would take your sample and scratch it on the various elements in the Mohs
scale, and you could say, well this is harder than talc, but softer than gypsum, by what could scratch what. Strictly a comparison. It is a quick easy way for someone running around in a pair of shorts with a pickaxe and this box of rocks in a field getting poison ivy, because he has to do this for
his freshman science credits to carry things. The real way to do hardness would have been the Rockwell scale that's used in manufacturing for steel and other metals, but we didn't have that. Oh, by the way, they never gave you a diamond in your Mohs scale. Usually there was a piece of really hard steel
in there, so technically it wasn't, but you know, what did you expect when you went to the school bookstore to get it? But just a comparison, just a linear ordering. Oh yeah, a little thing about ordinal scales, they're not required to be transitive. Ever play Scissor Paper Rock or some of the
other games? What's the one on Big Bang Theory? Spock, yeah. Have you seen the t-shirt with the non-transitive ordering of those things? Okay, we really hate non-transitive relationships. We want a
transitive relationship. We want it properly, tightly, well ordered. You can't make calculations or anything much off of a non-transitive scale. Non-transitive scales are also the tool for fixing elections
when you have more than two candidates. Look up Arrow's paradox and that it's impossible to get a fair voting system if you don't have two people and a tight ordering. Rank scales are sort of a
tightening on ordinal scales. There's an origin point, they're well ordered, they're guaranteed to be well ordered. Military ranks are of course the obvious one for that. Can't do any operations on them. I cannot take three privates, put them together, and make a sergeant. The ordering still stands. If
you shoot your sergeant, you still have to take orders from your captain. The transitive ordering is tight. We like those. I might not be able to do much math on them, but I can sort them. Sorting is good. Interval scales are
really what, when you say scale and measurement to people, this is what they think of. There's a natural ordering to the unit. I don't have any origin point, but arithmetic makes sense because of my units. It's uniform in its dimension. Most common interval scale you use? Calendar. Your common unit
is a day. Regardless of how you cut up your year or group your days together, you've got a common unit, the day. You guys might not get much of this, but for some reason among Christian fundamentalists in the United States,
there is a belief that God made the seven-day week. No, actually the Hebrews did. The Romans had a 10-day week. Parts of Africa had 10-day weeks. How do you cut up your units? Completely arbitrary, but I've got a metric function. I can
add and subtract, and I've got a linear ordering. I can't divide two days by each other. Christmas divided by Thanksgiving doesn't mean anything. One of my favorite t-shirts right now is, on a scale from 1 to 10, what color is your favorite letter of the alphabet? I show that to people when they're
starting to do stupid stuff with their data, and what's funny, when you ask somebody that, they'll stop and think about it. They will try to answer you. It sounds like it ought to be a real question, and if you've seen what they've been doing, you understand why they think it's a real question. Now, the
intervals on these scales do not have to be the same size. In fact, log and exponential scales are a lot more common than you think because you're a human being. Your sensory input, and a lot of things you do for judgment
off of sensory input, go on an exponential scale. My favorite is the Richter scale for earthquakes. Each time a Richter number goes up one, it's 10 times the magnitude of the previous unit. When you adjust volume on a stereo,
it doesn't go up linearly in the amplifier. It goes up, I believe it's something to the 0.3 power, 1.3 power. At any rate, it's not linear. It amps up, and a lot of stuff is exponential. Decibels are another; they go
up by powers of 10. Now, having lived through a 7.8 earthquake years ago in Los Angeles, I appreciate the Richter scale more than I did before, when you look out your window and you see a bridge collapse. Now, ratio scales are
sort of the ultimate, and when you say scale or measurement to someone, this is what they think of. I've got a natural origin of some kind, a zero point. The scale's got strong ordering, and the unit is uniform in its dimension. Length, width, height, all the things that you would
use commercially are ratios. It's called a ratio scale because everything is expressed off of a single unit as either a fraction or a multiple. Remember the Potrzebie system of weights and measures, or the metric system:
powers of 10, multiples of 10 or fractions of tenths. Nice, handy, easy to work with. Oh yeah, and we've got this number system that the Hindu Arabs invented for us. I often wondered about the Hindu Arabs when I was a kid: we used Hindu-Arabic numerals, and I had never met a Hindu
Arab. Now, why are the classifications of scales important? Because if I'm trying to convert between scales, they have to be of the same type for the conversions to make sense. If I do a nominal to a nominal scale, it's a
mapping. One-to-one mapping preferably. Okay, we've lost picture, falling
asleep, and it's back, and it's a little bit ahead. Okay, so nominal scales, one-to-one mapping. Gee, since we're in Canada, French-English
dictionaries. Hmm, so you can get to a one-to-one for at least some of the terms. Ordinal scales: a monotonic function that preserves the ordering. Not necessarily the same values on each scale, but I want to preserve
that ordering. Well, that's why we call it an ordinal scale. Say, the value of Western and Chinese chess pieces. That's not really a good one. Maybe the dates on calendars. Rank-to-rank scales: a monotonic function that preserves
the ordering, though it might not always be a good match. Army-to-navy ranks, the equivalents there. In particular, I don't know if this is still true, the U.S. Army used to consider warrant officers to be officers and gave them officers' privileges. The British Army considered them to be enlisted, and they didn't
get officers' privileges. There would be various cut points, but it's a mapping. Interval scales: a linear function that also shifts the origin point. Celsius to Fahrenheit is times
9 over 5, plus 32. Did I get that right? I just remember zero's freezing, a hundred's boiling, 25 is a little colder than I'd like it, and 30 to 35 is
comfortable. I live in Texas, and I keep my house set at 80 Fahrenheit. Yes, it's a little unusual, but you lose a lot of heat through your scalp. Ratio scales: a constant multiplier. Kilograms to pounds, about 2.2. Exact
conversion, that's why we like ratio scales. They're easy to work with. Interval scales, okay I got to do a little math, but ratio scales, simple multiplication. Now, derived units, the concept of a primary unit. This goes
over to the metric system, the Système International, and the ISO standard 2955 is where they've got all of the official definitions for the derived metric units. Kilometers per hour, square meters: there can be all kinds of
combinations on different scales. Some of them will not make sense, but pretty much you can put any two primary units together, multiply them, and get something that's meaningful. If you ever look at the definition of a
pascal as a unit of pressure, there's a little more multiplying and dividing in there than you might like. But, okay, gross general statement: in the database, if I have a derived unit, I would really rather do the
calculation in the database from the simplest, most primary units I can store. This is a generalization. That way, if I need to do something else with them, I don't have to try and pull out the primary units to get them again. And multiplication is cheap. Computers are real good at this computing thing.
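A rough sketch of the conversion rules and of deriving a unit from stored primaries. It is not from the talk: Python stands in for the database, the function names, the row layout (`distance_m`, `time_s`) and the sample values are made up, and the pounds-per-kilogram constant is an approximation.

```python
def celsius_to_fahrenheit(c):
    """Interval scale: linear function (multiply by 9/5, shift the origin by 32)."""
    return c * 9 / 5 + 32

KG_TO_LB = 2.20462  # approximate pounds per kilogram

def kg_to_lb(kg):
    """Ratio scale: a constant multiplier, so zero maps to zero."""
    return kg * KG_TO_LB

# Ratios survive a ratio-scale conversion: 20 kg is twice 10 kg in pounds too.
ratio_lb = kg_to_lb(20) / kg_to_lb(10)

# They do not survive an interval-scale conversion: 20 C is not "twice" 10 C,
# because the same two temperatures are 68 F and 50 F.
ratio_f = celsius_to_fahrenheit(20) / celsius_to_fahrenheit(10)

# Store the primary units (metres, seconds); compute the derived unit on demand.
trips = [
    {"distance_m": 1000.0, "time_s": 60.0},
    {"distance_m": 5000.0, "time_s": 1500.0},
]

def speed_kmh(row):
    """Derived unit: kilometres per hour from stored metres and seconds."""
    return row["distance_m"] * 3.6 / row["time_s"]

for row in trips:
    print(row, "->", round(speed_kmh(row), 2), "km/h")
```

Because only metres and seconds are stored, a different derived unit (say, metres per second) is one more small function rather than an attempt to reverse-engineer the primaries from a stored result.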
So, in the old days, yes, when people had to do it, it was a little more work than we'd like, but it was worth storing the computation rather than the basic units. As a generalization, let the machine do the
computing. It's actually faster. Your processor is working in nanoseconds. Your disk drive is still a lot slower than nanoseconds, even if you're doing solid state drives. So it's actually faster. Quick summary on scales.
From weakest to strongest, nominal, categorical, absolute, ordinal, rank, interval in its various forms, either linear or log, and finally ratio scales. Okay, now that's usually one lecture sort of stretched out when I'm doing this
for a class. And at this point, you would all be drawing a 3x5 card. Oh, that's inches. An A5 sheet of paper with something written on it, like ring size, shoe size. And you would have to go to Google or the
library and look up exactly what kind of scale you're using. This is going to be bad. Note it. Make it easier to work with. Now, we're getting into a database
of symbols. And frankly, thanks to a wonderful thing called Unicode, the representations are going to be alphabets, actually a subset of the Latin alphabet, the bottom of the ASCII characters, the digits 0 to 9,
and some symbols. And I've got rules for manipulating the codes. I've got math for numbers. I've got string operations. And technically, these days, I can put data directly into the databases. But we really don't like to do that so much, or at least us old SQL guys that never really got
over the idea of having graphics and that sort of stuff in our computers don't like it. They're hard to search. One of my favorites was a couple of decades back, IBM was pushing picture recognition. It will probably be face
recognition now. And they had examples of their wonderful product where you could sit down with some colored pencils, draw a quick picture of something, and then search photographs with your drawing. Face recognition is a whole science in itself, but they were trying to do it very generally. Their example was finding a banana in all these pictures of
fruit. It found the bananas. It was actually quite good about that. It also found a toucan. He showed up as a banana. So we're still working on it.
In particular, we got Unicode. Let's put all the alphabets and symbol systems known to man in 16 bits. It's a nice ISO basis for encodings. The ISO people specifically wanted to get this minimal subset in
all the languages on earth. This is why you can write a VIN number in Chinese. It's part of the Unicode set or any of the other languages. Latin alphabet, no accents, no case sensitivity. Some of the positions can be
numeric. There might be rules for disallowing them so that we don't get confused about where the digits and the letters come up. And a minimal set of punctuation marks: pretty much commas, dashes, period or dot, and
things like, and I actually know the names for these, the octothorpe. It is not a hashtag. Typesetters call it an octothorpe. And that little thing that you think is an "and", it's an ampersand. All of them have rather fancy names. The at sign is technically the little snail. It sounds
much better in French. But that was the official name. The trouble is with the meanings in other languages and systems. Remember when Microsoft was talking about how they were going to really get on this internet thingamabob and do their part and blah, blah, blah? Well, they named something
C sharp without any knowledge that the octothorpe had meaning on the internet. Gee, that's careful research, guys. You're really into this. But when I'm doing an encoding, the display is important. Encodings should
be convenient for people. You know, those damn users. We'd have such wonderful systems if it wasn't for the users, always screwing up something. In particular, when I've got an encoding, I can either do fixed or
varying length. I would prefer fixed length. The length of a code is part of its validation. If I see five digits, I know it could be a US zip code. If I see a mixture of digits and letters, I know it could be a Canadian postal code if it follows the right pattern. Was it letter, digit, letter
for the first part? No Canadians here? Yeah. But the other one, if you remember, it's between four and 12 letters and numbers that are actually
the abbreviations or attempts at abbreviations of old post offices that existed in the late or middle 1800s in England. It's completely unusable and unparsable. It's so bad they're introducing a five digit commercial code based off of the US zip code for bulk mailers because their own
system has proven to be so unusable. Also the Royal Post Office, the Royal Mail, have a monopoly on their guidebooks for doing the addresses. It's illegal to set up your own postal code service in the UK. Your
government at work. Now fixed length also has another advantage. Everybody remember printers and paper when we used to get our data on that stuff? Well,
that goes back to the old punch card days when we had fixed length fields, fixed length columns, fixed length displays, 80 columns across on a 3270 video screen. But more than that, it's something a person can see and can line up. Varying length gets confusing. It's being nice about it. Okay, the worst
standard we'll probably run into is not the British postal codes. It is a thing called the IBAN, the International Bank Account Number. It can be up to 34 characters long. It includes
the account numbers, the country codes, a whole bunch of stuff crammed into this one unreadable string that only a machine on a SWIFT system can figure out. People that work with it can't read them. People who work in the
automobile trade can read a VIN number, which is only 17 characters long, but nobody can read an IBAN. The thing is with human processing, you don't read letter by letter. You read in chunks or BOMAs. You cluster things. Three is
the best. People will get three digits or three letters correct almost all the time. You can go up to five very safely, but beyond five you start getting errors. In fact, there are four common errors. Missing character, extra character, one bad character, and then pairwise transposes. That's probably from
typing, but pairwise transposes are the fourth most common. For example, phone numbers are grouped into an exchange, a dialing area, and then the actual phone within it. It's very convenient to read it that way. I'm doing okay on time, I
think. What about bad encoding schemes? Well, one of the characteristics is there's no room for growth. In the 1970s, when we were still on punch cards in the state of Georgia, down in the States, we had
auto-type codes. It was one punch in a punch card and it was originally taxis, private vehicles, farm vehicles, just seven or eight of them for the type of license tag you got. That was very nice and it worked fine. Then along came a thing called commemorative tags, which state governments love because with a
commemorative tag, you could charge extra. Great revenue source. California makes a few hundred million dollars off of their commemorative tags, so every group that had a cause, every college, veterans group, whatever, wanted a
commemorative tag. Would you like to be kind to animals? Fine, that'll cost you $35 and you can display it on your license tag and look better than your neighbors. The problem is we wound up having to put all kinds of different
codes. I mean, when I left, it was about 35 of them because every college had to have its own commemorative tag. So how do you get 35 different punches? How many people have ever worked with a punch card? Maybe I would ask this. Thank you. Yeah, thank you. Now usually when I do that, I get
what I call my fish market. All the kids, that's people under 40, sit there like dead fish, mouth open, eyes glazed, looking at you. So we found that you could multi-punch. You'd hold a key down and then you'd punch
several combinations and it's 12 columns, 12 rows to a column, so you had two to the twelfth possible combinations. You had a little translation thing to the side. Oh, wait a minute. We had 029, 027, and 028 IBM key punch machines,
which are all a little different, and UNIVAC key punch machines. So you not only had to know what the multi-punch was, you had to know what machine you were punching it on. Otherwise, the tags would get all messed up. No room for
growth. If they had allowed two digits for the license tag type, it would have been no problems and it would have saved us quite a lot of work. The other one, how many people have ever worked with COBOL? Yeah, I mean, don't tell
mother she'd be so ashamed. She thinks I was playing piano in a dealership. Figured we would never have more than 10,000 dealerships in the United States. This is when they were bringing over the Honda scooters. Do you
remember the Beach Boys song, My Little Honda? Google it. I'll tell you about hip hugger bell bottoms and miniskirts next, but they didn't allow room for that and they had to redo all of their COBOL files. You don't think about this with SQL. If we want to make something, we've
got an abstract view of data. If we want to make something bigger, we just change a check clause to give us a different range or we alter a table. You don't do that in COBOL. What you see is what is processed. Everything is
character strings exactly the way it appears on the physical media. It was really a major leap, but no more than 10,000 dealerships. Remember, was it Bill Gates? Why the hell would anybody need more than 64k on a home computer?
Did he say 40? 640, yeah. And now you've got a watch with more than that. Another bad encoding scheme that happens more than you think is ambiguous codes. My favorite example was the old international standard book
number, the ISBN. This is because they used to own bookstores. It was 10 digits made up of four parts. The first one or two digits, variable length pieces, was the language. Zero and one were English. 93 is Esperanto. It was sort of the end of the list. I don't know where Klingon and Dal Rafi figure on
them on the scale. If they're in there or not, they might be. The publisher's code, the bigger the publisher, the shorter the code. Three digits or up to seven if it was a small one-time thing. The book number within publisher, and then a mod 11 check digit. The catch is, without any punctuation in the 10
digit ISBN, you can cut it up various ways. And in the early days, they had ISBNs that could be parsed two ways. There were about 15 or 16 of them, and it was enough to mess up libraries for a while. That has since been fixed, and
the ISBN is good and usable now, and it's part of the EAN codes. The miscellaneous code, if it gets used a lot, something's wrong. You skipped too much. Now the other thing with a bad code, there's no support for exceptions.
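The mod 11 check digit on the old ten-digit ISBN mentioned above can be sketched in a few lines. This is only an illustration, assuming the usual weighting of 10 down to 1 from left to right and an "X" standing for 10 in the last position; the function name `isbn10_is_valid` is made up for this example. It also shows the earlier point that the length of a code is part of its validation.

```python
def isbn10_is_valid(isbn: str) -> bool:
    """Return True if a ten-character ISBN passes the mod 11 check.

    Illustrative sketch only: hyphens and spaces are ignored, and the
    function name is hypothetical, not part of any standard library.
    """
    chars = isbn.replace("-", "").replace(" ", "").upper()
    if len(chars) != 10:
        return False  # a missing or extra character fails on length alone

    total = 0
    for position, ch in enumerate(chars):
        if ch in "0123456789":
            value = int(ch)
        elif ch == "X" and position == 9:
            value = 10  # 'X' means 10 and is allowed only as the check digit
        else:
            return False
        total += (10 - position) * value  # weights run 10, 9, ..., 1

    return total % 11 == 0


print(isbn10_is_valid("0-306-40615-2"))  # a well-formed example
print(isbn10_is_valid("0-306-40651-2"))  # same digits with one pair transposed
```

Because 11 is prime and the weights are all different, a single wrong digit or a swapped pair of digits changes the weighted sum by something that is not a multiple of 11, so those errors get caught.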
Everything just gets into that miscellaneous category. But I can have unknown values, missing values. Oh, wait a minute. For us, that's a null. When you have a nullable column, how many of you bother to actually document what the null means in context? Does it mean that something's missing? If I
have a null phone number, does that mean I don't have a phone, or does it mean we don't know his phone yet? Hmm. Non-applicable. Something just came in crazy. The non-applicable you probably laugh at, but I just got through working for a company called MIB. No, they're in the insurance industry. They've been in
business for well over a hundred years. They did not get their name when the movie came out, but they get kidded about it when they go to insurance conferences and occasionally wear sunglasses. Their thing is they get dirty data coming in from the insurance companies, so they need a rather
elaborate set of non-applicable values. I've already mentioned miscellaneous and unclassified. It's a bad design. You've missed something. Overflows, underflows, bad divisions, computations that are garbage. Now I can
have an error in one field. That's, you know, take care of that. But how about where two fields are related? A medical record that shows a pregnant man. Bruce Jenner is going to be so disappointed. Have to get that in. How about it's computable? We could find it. We don't know. There was a SPARC committee back
in 1975 that issued a list of 14 different kinds of missing data. Later on, a follow-up to the SPARC committee report gave us 22 different kinds of missing data, and statisticians have all kinds of ways of trying to correct for the missing data. If you think that designing encoding schemes is not important,
try doing math with Roman numerals for a week. Roman numerals were such a bad encoding system
that even the Romans didn't do math with them. No, they looked it up; they had lookup tables for division and multiplication. Try living without alphabetical ordering for a week. In Hong Kong, I don't know if they still do, but in Hong Kong, the telephone operators used
to have a contest every year where they would identify somebody's phone number from their name. And the winners of these contests would have memorized 10,000, 15,000, 20,000 different names and phone numbers and be able to spit them back. I had a friend who taught English in China
before Tiananmen Square, and she would get her class roster of 150 students translated into Roman letters, but it was never sorted, no alphabetical order, and it was never the same order from week to week. Everybody can memorize 150 names, can't they, if you're used
to working with Chinese. Not so good for an English speaker. Go to a library and try and find something without the Dewey Decimal classification. Dewey Decimal has some problems, I'll get to that in a little bit, but I mention this thing about organizing a library
by color. Actually, before the Dewey Decimal classification system came along, every college library or public library had its own individual classification system invented by the head librarian at that particular school. My wife volunteered for a feminist bookstore in the 70s
in Atlanta, and one of their people was definitely not a book person. She did sort the books in the store by color because she thought they would look pretty. Do not work with hippies if you can help it. It will not end well. The drugs will be good, but it's not worth
the work you have to do afterwards. If I've got a good encoding scheme, my aggregations and queries and so on are easy, much easier. Again, going back to Dewey Decimal, if I
want to look up science books, I know they're in the 500 series. If I want to look up math books, I know they're in the 510s. I can immediately zoom in on a small subset. I can use BETWEEN predicates in my SQL. You'll also find your calculations get to be a lot more accurate. You don't have to worry about things on borders and fuzziness.
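Here is a small sketch of that zooming in, using Python's built-in sqlite3 module so it can be run as is. The table name `Books`, the column `dewey_nbr`, and the sample rows are all made up for illustration. The Dewey numbers are stored as text with a three-digit class in front, so a plain string comparison keeps them in shelf order; the upper bounds such as '519.999' are an assumption about how fine the decimals go.

```python
import sqlite3

# Hypothetical table and rows, invented for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Books (title TEXT NOT NULL, dewey_nbr TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO Books (title, dewey_nbr) VALUES (?, ?)",
    [
        ("General physics", "530"),
        ("Linear algebra", "512.5"),
        ("Calculus", "515"),
        ("Church architecture", "726.5"),
        ("Introductory chemistry", "540"),
    ],
)


def titles_between(low: str, high: str) -> list:
    """Return titles whose Dewey number falls in the closed range [low, high]."""
    rows = conn.execute(
        "SELECT title FROM Books "
        "WHERE dewey_nbr BETWEEN ? AND ? "
        "ORDER BY dewey_nbr",
        (low, high),
    )
    return [title for (title,) in rows]


science_titles = titles_between("500", "599.999")  # the whole 500 series
math_titles = titles_between("510", "519.999")     # just the 510s

print(science_titles)
print(math_titles)
```

The same two bounds work for any level of the hierarchy, which is the payoff of a hierarchical code: a narrower topic is just a narrower range.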
Kinds of encoding. Enumeration. Make a list of values, assign a name or a tag number to them. It's really a nominal scale with a name attached to it. If you can get some
kind of ordering on the symbols, that's nice. Chronological, procedural, some sort of physical thing. The bad news about sorting codes by alphabetical or numeric order is what languages were you using for the alphabet? Canadians are probably
a little more aware of that. I have to pick on the SQL Server people because they are the worst SQL programmers on earth. They're all VB programmers whose boss wouldn't pay to send them to a course. They learned it at home in a weekend. They will use their auto increment, their identity
column. Just make a list of things and whatever physical order the values happen to be in the table, in the physical storage, that becomes the encoding for it. No logic, no thought, no nothing. Measurement codes. I've got a
unit of measure. I put the units in. I know it's for that column and I express it. I may have to tell what unit I'm using, but essentially it's just a recording of measurements. Worst design is to put
the unit of measure in the same column as the value of the measurement. Dollar sign with the dollar amount, that's COBOL. Why? Because COBOL was concerned with physical display. There was no difference between storage and physical display. You'll still see people do it. If you have mixed units, you need
to have a column that tells us what our unit of measure is somewhere. Abbreviation encodings. Take a shortened version of the value, the name of the value, and come up with a, usually if you can, fixed length
abbreviation. The goal of an abbreviation code is to be people readable. I'm going to say everybody for the most part flew here. Did you notice your three-letter airport codes? It doesn't take too much to figure out that BOS is Boston or ATL is Atlanta. Y-O-W? Not so much. Yow. I'm going to
get you. Wow, Canada. And the same thing with some of the smaller, really small airports in the Great North and in Alaska. You
can't, Z's and W's on them. But get some pretty weird ones that just are not intuitively obvious to the casual observer. The nice part with abbreviation codes is that business about a human being figuring it out.
Don't fight your user. Algorithmic codes, I've got a procedure, I encode the value, it might not be immediately human readable. In fact, in the case of encryption, it better not be immediately human readable or I've got some real problems. You don't think about it, but rounding a number is technically an
encryption, technically an algorithm. Hashing functions where there's no way to look at them and immediately tell what the originals were. I really like hierarchical codes. I fell in love with Dewey Decimal, but they're
usually numerics, but they can be mixed alphanumerics. The Library of Congress system for libraries is actually more accurate than the Dewey Decimal and it's mixed alphabetic. Zip codes in the United States are based on geographic partition. If it begins with three, it's in the southeastern
United States, multiple states. And then bit by bit as you go from left to right, it gets more and more accurate. So 300 would be the southeast, 303 would be parts of Georgia, 30310 would be a subset of Atlanta, Georgia. And it's
nice and easy to look at and you've got an idea when you see the code what you're zooming in on. Now bad news with it, you can put stuff in the wrong part of the hierarchy. Dewey Decimal put logic under philosophy. Why? Because
when Melville Dewey was inventing this system, George Boole had not written the laws of thought and there was no mathematical logic. When's the last time you saw a philosophical logic book? There may have been one written but for the most part nobody's written philosophical logic for the past
150 years or longer. It became a branch of mathematics. So it's a little messy and we've gradually moved logic more toward math in the Dewey Decimal system. What if I don't have enough space in my hierarchy for some things? In particular, one of Melville Dewey's other prejudices was
that there were Catholics, Protestants, Jews, Muslims, and miscellaneous. The Library of Congress has more books on Buddhism than it does on Christianity. They've been writing longer. There's more breakdown of subsets and whatnot
in the Eastern religions but they all got put in one miscellaneous religion. It's an old British cartoon of Sergeant Major standing in front of his Indian troops. Church of England to the right, Protz in the middle, Papes to the left, and your fancy religions to the back.
And that was pretty much how Melville Dewey saw it. What happens if you've got an item that could reasonably fall under multiple codes? Church architecture and the worship service could be under religion and architecture. You don't think about it, but how Christian churches are laid out as a cross, that
affects how we do our ceremonies. Did you know the aisle is the short arm of the cross? So when the bride walks down the aisle, traditionally it wasn't down the long length of the church, it was originally from the sides. The Muslims have a square for a mosque and there's other things where the architecture affects how the service is held. It's really kind of an
interesting topic. Where do you put it? Architecture or religion or both? Hmm. Well, the solution for the librarians is whatever the Library of Congress says is the code is where you put it. Okay, we're now running out of time. Vector codes are made up of parts but the whole has to be there. The
components can be independent or dependent on each other but the whole code is a unit. My favorite is ISO tire sizes. A 155 SR 15 is metric width, the SR stands for steel radial, and then the diameter is still in inches but
that's going to change soon. And obviously I cannot have a wheel without a width, a diameter, and it's got to be made out of something. I cannot take any of those components out. They're pretty much independent. The Social Security numbers are a US thing. That needs to be removed. We've now done away with
meaningful Social Security numbers because we have so many illegals. They're now being assigned randomly. Yeah, it's that bad. Concatenation codes, variable number of parts, I just keep adding on to the end of it. They can be ordered or unordered. You don't think about it but a keyword
list at the front of an article from a limited vocabulary is a concatenation code. Checklists, yeah, they were called facet codes in Europe. They're not in favor so much anymore for designers. It used to
be when they were physically written down and they were literally concatenated and you'd initial each step in a process. They were popular in the aircraft industry but not so much anymore. Okay, guidelines. Do not reinvent the wheel. Okay, oh good, my pacemaker was
working. You can research encoding schemes. Do it. It's quick and easy.
Google's better than anything we ever had when we had to go to paper copies. Now here's the bad news on Google. It's too good at times so you really need to know your industry. In my book I have a whole chapter on sex
codes. The standard one that we'll probably use is ISO 5218. Zero for unknown, one for male, two for female, nine for a lawful person like a corporation or organization. Then there's a whole bunch of other codes used by biologists.
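The four-value sex code just listed is a plain enumeration encoding, and it can be written down as one. This is a sketch only: the class name `SexCode`, the member names, and the helper `parse_sex_code` are invented for this example, and the meaning given to 9 follows the description in the talk.

```python
from enum import IntEnum


class SexCode(IntEnum):
    """Enumeration of the four values as described in the talk (names assumed)."""
    UNKNOWN = 0
    MALE = 1
    FEMALE = 2
    LAWFUL_PERSON = 9  # e.g. a corporation or organization, per the talk


def parse_sex_code(raw: str) -> SexCode:
    """Validate a one-character field and return the matching code.

    Hypothetical helper: the fixed length of one character is checked
    first, then the character must be one of the four allowed digits.
    """
    if len(raw) != 1 or raw not in "0129":
        raise ValueError(f"not a valid sex code: {raw!r}")
    return SexCode(int(raw))


codes = [parse_sex_code(c) for c in "2091"]
print(sorted(codes))  # the 9 sorts to the end, the 0 to the front
```

Anything outside the list is rejected rather than quietly dumped into a miscellaneous bucket, so the exceptional values stay explicit.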
People are very dull. Other than Bruce Jenner we don't change around a lot. We just generally come in two flavors, but when you get to the different biological codes for all the medical stuff, whoo, which one do you want to use?
Probably for commercial stuff: male, female, unknown. Oh, by the way, the reason for zeros and nines for the unknown and the miscellaneous codes in the ISO sex codes is punch cards. In the old days a blank column, an un-punched column, would
be read as a zero by Fortran and you can also make COBOL read it as a zero. So that way you can take the card, the unit record as it was called, and re-punch it when you found out what the code should be. Nines were easy to do on a key punch machine. Just hold the key down, fill it with all nines and
that way it would always sort to the bottom of the report. Yes the encoding schemes were designed for the physical use of a key punch machine. A lot of things are like that. Why are railroads a certain width? Because I don't know if you've heard this one. Why are the US railroads a certain
width gauge on the tracks? Because we followed the British pattern. What did the British pattern follow? The Roman chariots, the width of a Roman war chariot. So we've been following horses asses on our railroads for years.
The exceptional values should be explicit. Allow for expansion. Also, this is going to sound really dumb, but you'd be surprised: actually put a translation of the codes in a database somewhere where somebody can get a machine version of it. I wish I was making that up, but when I had my first
set of eye surgery done in Los Angeles at Cedars-Sinai, a major hospital with a good reputation, blah blah blah, I went down to fill out the forms. The clerk had a loose-leaf notebook with all the codes so we could punch medical
codes and stuff into a 3270 IBM terminal. This is in the 1980s. This is not in the 1950s. This is in the 1980s. They had no drop-down menus, no PCs, and everything was still in laminated pages at a major hospital. Okay, I'm
running a little bit longer than I should. Questions, comments, feedback. If you want to throw something, it has to be soft. That's all I ask. Anybody? Give me
my applause and get out of here.