
Going the Distance


Formal Metadata

Title
Going the Distance
Series Title
Number of Parts
65
Author
License
CC Attribution - Share Alike 3.0 Unported:
You may use, modify, and reproduce the work or content in unchanged or modified form for any legal and non-commercial purpose, distribute it, and make it publicly available, provided that you credit the author/rights holder in the manner they specify and pass on the work or content, including in modified form, only under the terms of this license.
Identifiers
Publisher
Year of Publication
Language
Producer

Content Metadata

Subject Area
Genre
Abstract
If you've ever misspelled a word while searching on Google, you've benefitted from decades of distance algorithm development. In this talk we'll break down some popular distance measurements and see how they can be implemented in Ruby. We'll look at real world applications with some live demos. It won't matter if you get your kicks "Hamming" it up, if you drink your coffee in a "Levenshtein", or if you're new to Ruby: this talk is Rated R, for "all Rubyists". You'll be guaranteed to walk away with O(n^n) expectations met.
Transcript: English (automatically generated)
All right, hello everyone.
My name is Richard Schneeman, or @schneems. If you came to the lightning talk, you know the most important part about me. I'm incredibly, incredibly committed to Ruby. Sadly, she could not come to the conference. And you will also know that Ruby is, in fact, a Python programmer. It's okay.
I've gotten over it. I work for a small company based out of San Francisco that you may have heard of. It's called Heroku, where they give me some free time and allow me to work on things such as codetriage.com, which allows you to get an open source issue
in your inbox, one every single day. Or I also recently announced docsdoctor.org, which is basically the same thing, but with methods. So if you're looking for documentation, you can get method documentation, a documented method or an undocumented method in your inbox.
A little bit more about me. I am in the top 50 of the Rails contributors, so that kinda makes me a big deal. I do not have any cats in this presentation. This is a photo of my dog. Enjoy.
You may have seen a couple people wearing Keep Ruby Weird shirts. This is a conference that we threw for the first time ever in Austin, Texas. Austin is Keep Austin Weird, and you know what? Ruby's amazing and dynamic and powerful, and in the spirit of why, we wanted to preserve that.
We also, we got together and we were like, okay, well, you know what? We kept Ruby Weird in Austin, but at RubyConf, we should also try to keep it a little bit weird, and we were like, what can we do to keep Ruby weird? So we were just brainstorming, and I needed a longer segue.
So we were brainstorming, and we thought perhaps we could do the can-can, you ready? Nah nah nah nah nah nah nah nah nah nah nah nah nah.
Nah nah nah nah nah nah nah. Na, na, na, na, na, na, na, na, na, na, na, na, na, na. OK, so that was the fun part. Stay here. You're not done. Thank you very much. That was wonderful. Can I get a round of applause? They knowingly volunteered to do this.
That's amazing. That's amazing. What you do not know is you have also volunteered to do the conference can-can. So can I get you to all stand up? You have to stay weird. Everyone. So the conference can-can is similar to the actual can-can,
except you can't do this. So it's more kind of like a na, na, na, na. Pretend you're just like a really shy person doing the can-can. And we have some actual music that is perhaps a little bit better than mine. So you all ready?
And you do a sort of knee, step, kick. But really like knee, knee. Yeah, it's like knee, knee, knee, knee. If you're friendly, grab the person next to you. Thank you. Thank you very much. But that's not required. All right, you ready for this?
This is amazing. By the way, this song goes on for an entire minute.
OK, thank you. Thank you very much. All right, I hope you are fully weirded out now. Now that I have your attention, now that you're awake, I'd like to talk about algorithms.
But before I talk about algorithms, I want to talk about my background. So I went to Georgia Tech, and I studied mechanical engineering. I don't have a CS degree, and I've been teaching myself programming for about the last eight years, which has been super fun and super amazing. But computer science is like crazy boring. I don't know if you know this.
If you have a CS degree, it's all right. I find programming really, really interesting because I like building things. I like actually making and putting my hands on parts of systems and seeing how they move. But the CS students might know something that the
mechanical engineers of the world haven't quite touched on, which is algorithms are absolutely beautiful. Like, just unbelievably beautiful. And it's not because, oh, they come from a book, or you have to learn them to pass this test, because algorithms solve problems. They solve real world problems that someone else has
invested time into learning. So I have a really big problem, and that is spelling. I cannot spell at all, which is why I'm a programmer, because if I misspell my variable name, as long as I consistently misspell it, it's OK.
I'm only like half joking on that one. And spelling even becomes more difficult when you're tired or distracted. And like all the time, I'll do something along the lines of git commit, only misspelled, and then git's like, yeah, that's not a real command. And something else that git does, which is really cool,
it's like, hey, what you actually probably meant was commit, and that was amazing. And one day, I was like, hey, how does git know? So the method through which git determines this is something called edit distance. And edit distance is going to be the cost to change one word into another.
The less it costs to change one word into another, the more likely that is what you meant. If there's zero cost, then it is the exact same word. A quick example, from zat to bat, this would be a cost of one, or zzat to bat would be a cost of two. So that kind of makes sense.
But how would we go about coding it? When I first found out about this, I was like, oh, hey, that's like probably really simple, and I can do that. And I sat down at my editor and just kind of banged out this like, I don't know, five line something. I was like, hey, distance between two strings. And when I ran it with this really simple use case, I got
that the cost was one. Basically, all we're doing is iterating over each character in one of the strings and checking to see if it matches the exact same spot in the other string. It's super simple. That made sense to me. And I was like, oh, this is great. I'm done. And the talk is over.
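Something along the lines of this rough sketch, with hypothetical names (not the exact code from the talk):

    # Naive "distance": walk one string and count the positions where the
    # other string has a different character.
    def dirty_distance(a, b)
      cost = 0
      a.each_char.with_index do |char, index|
        cost += 1 unless char == b[index]
      end
      cost
    end

    dirty_distance("zat", "bat") # => 1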
But you may have guessed, the talk is not over. It wasn't perfect. And if I put in something a little bit more complicated, maybe things that didn't have the same length, then we would get a really high cost. And this is not correct. It does not take seven edits to change the word Saturday into Sunday.
So it turns out that in my naivety, I accidentally recreated a thing that actually exists and is actually useful. It's called Hamming distance. And Hamming distance is not edit distance. It's known as signal distance. And it will measure the errors introduced in a string.
And it's actually only valid for strings of the same length. It is really useful, just not for spelling. It's horrible for spelling. Don't use it for spelling. If you happen to be in the telecommunications industry and you're dealing with, essentially, strings of ones and zeros, then you can say, OK, well,
this is the likely match of this set of strings if we can detect that there was an error. So it's really bad for misspelled words. And the reason for that is it only includes substitution. It does not include inserting a character or deleting a character.
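A minimal sketch of Hamming distance, assuming equal-length strings (an illustration, not code from the talk):

    # Hamming distance: only defined for strings of the same length,
    # and only counts substitutions (no insertions or deletions).
    def hamming_distance(a, b)
      raise ArgumentError, "strings must be the same length" unless a.length == b.length
      a.chars.zip(b.chars).count { |x, y| x != y }
    end

    hamming_distance("1011101", "1001001") # => 2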
So I would like to introduce another algorithm called Levenshtein distance. In order to figure out Levenshtein distance, we have to figure out how to handle those extra deletions and insertions. So if we look at two strings that are really similar, we can see here that schneems and zschneems is almost a match, except for this first character.
So basically, if everything except for the first character on our second string is a match, then that's a really good candidate for deletion. That's saying, yes, we should delete that. And this is what that would look like in actual Ruby code. If those match, then we delete. For insertion, we can do a similar thought experiment
where we have just one character missing. So schneems and I don't think the other word is actually pronounceable. OK, yeah, that was good. And if we knock out the first character of the first word, then OK, we can solve this via an insertion.
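Rough sketches of those two checks, with hypothetical names (not the slides' exact code):

    # Deletion candidate: everything except the first character of the
    # second string matches the first string.
    def deletion_candidate?(a, b)
      a == b[1..-1]
    end

    # Insertion candidate: knocking the first character off the first
    # string leaves an exact match.
    def insertion_candidate?(a, b)
      a[1..-1] == b
    end

    deletion_candidate?("schneems", "zschneems")  # => true
    insertion_candidate?("schneems", "chneems")   # => true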
And here's the code about how we do that. With substitution, we already looked at that. And that's basically saying every single character matches except for the one that you're currently on. And yeah, so we're only looking at the first character in that string. Whenever we put all of these rules together, we can
calculate distance. If you pretended that we already had a distance measurement between different strings, we could calculate the distance between strings of different lengths. And if you have an empty string and a string, the cost to change between those two is going to be the length of that string.
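A sketch of just that base case (the recursive rules are filled in a little further below):

    # Base case: the distance to or from an empty string is simply the
    # length of the other string.
    def levenshtein(a, b)
      return b.length if a.empty?
      return a.length if b.empty?
      # ... deletion, insertion, and substitution rules go here ...
    end

    levenshtein("", "foo") # => 3
    levenshtein("foo", "") # => 3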
So that intrinsically makes sense. Nothing to foo costs three characters. Foo to nothing also costs three characters. We can represent that in code by just checking the length or returning the length if the string is empty. Once we have this piece, we can calculate the distance between every single substring and do it using the same
logic we did before, where we're going to call that distance measurement. And we're going to either knock off the first character for one of them. We're going to knock off the second character for the other one. We're going to only compare the first characters. And this represents deletion, insertion, and then substitution. We're going to go ahead and calculate all of those.
And we're going to take the minimum cost, because we only want to make the minimum number of edits. And then we're going to add one. So why did we add one? In this case, the cheapest operation just represents which one we should be picking. But adding one to that represents the actual cost of
the operation. OK, that covers all of our cases, except for when all of the characters match. And so that's fairly straightforward. Again, you just look at that individual character. And when we combine all of these things together, we
form, of course, not Voltron, we form recursive Levenshtein distance, which is amazing and really useful. It's incredibly simple. And it looks a little something like this. It's actually not that much longer than the other one. Each of the individual rules, if you look back at what I
previously talked about, vaguely makes sense if you just sort of pretend that the distance measurement method already works. So here we go. When I looked into this, I found this. I was like, OK, hey, we're done. I put in Saturday and Sunday. And it turns out that, hey, the distance, it only takes three edits to change Saturday into Sunday.
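Putting the base case and the three rules together, a minimal recursive sketch might look roughly like this (not the exact code from the slides):

    # Recursive Levenshtein distance.
    def levenshtein(a, b)
      return b.length if a.empty?
      return a.length if b.empty?

      # Matching first characters cost nothing for this position.
      return levenshtein(a[1..-1], b[1..-1]) if a[0] == b[0]

      deletion     = levenshtein(a, b[1..-1])
      insertion    = levenshtein(a[1..-1], b)
      substitution = levenshtein(a[1..-1], b[1..-1])

      # Take the cheapest option, plus one for the edit itself.
      [deletion, insertion, substitution].min + 1
    end

    levenshtein("Saturday", "Sunday") # => 3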
It's much better than seven. Then I was like, hey, what does this actually look like when it runs? And so I hooked it up all together. I don't know if you can see this. We're still going. We're still going.
OK. So that took 1,647 comparisons. And then you can see that the distance was three. So we got the correct distance. But it took a really long time in order to run that. By the way, all of these scripts are on github.com slash schneem slash going the distance. I think I also have a bit.ly URL.
That might be a little easier to remember. I have that later. So if we look at this, the measurement from before, the Hamming distance, which I didn't know was called Hamming, I just called it dirty distance, took eight comparisons. Recursive took 1,647. It's like, which would you rather have, fast and wrong
or really long and correct? That's not something you can really reason about. So it turns out that there is a better way to do this. We can do slightly better.
And one of the keys to this is if you watch the recursive algorithm closely enough, you'll notice that a lot of those strings that it was comparing were the same. So we can cache those in some sort of a way. And basically, if we have the distance of all of the substrings in a string, then we can add those up and get
the distance of the entire string. So first of all, I would like to invite you to join the club. I don't know if you know this, but there's a very prestigious club for members only.
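As an aside that isn't in the talk: one straightforward way to realize that caching idea is to memoize the recursive calls. The matrix approach described next is effectively the same idea built bottom-up.

    # Memoized recursive Levenshtein: cache results for each pair of
    # substrings so they are only computed once.
    def memo_levenshtein(a, b, cache = {})
      return b.length if a.empty?
      return a.length if b.empty?

      cache[[a, b]] ||=
        if a[0] == b[0]
          memo_levenshtein(a[1..-1], b[1..-1], cache)
        else
          [
            memo_levenshtein(a, b[1..-1], cache),        # deletion
            memo_levenshtein(a[1..-1], b, cache),        # insertion
            memo_levenshtein(a[1..-1], b[1..-1], cache)  # substitution
          ].min + 1
        end
    end

    memo_levenshtein("Saturday", "Sunday") # => 3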
And when you're in the members only club, you can also learn how the Levenshtein distance calculated with a matrix works, which is what we're going to go into right now. All right, we're going to stick with the same example. Going with an empty string over to Saturday, we can build
a matrix that looks like this. So we have our empty string on the up and down axis, and we're going to have Saturday on the top. If we just convert empty string into S, it costs one. SA is two, SAT is three, four, five, six, seven, eight.
So that hopefully makes a little bit of sense. We can look at it the other way and say, what does it cost to turn Sunday into an empty string? We form the other side of our axis, and we have one, two, three, four, five, so six. So it costs six edits to turn Sunday into an empty string.
We have to delete six characters in order to get it into an empty string. So that would be the distance. Once we have this, we have the starting point of our matrix. We don't need to really calculate this. This is kind of like the zeroth law of edit distance.
So we're going to go back to those rules that we looked at previously, and we're going to break it down. How much does it cost to change an S to an S? Well, intrinsically, we can say that, OK, it is the same thing, so that there is going to be no edit distance, and
it's going to be a cost of zero. With insertion, we can say, obviously, if we want to change S to SA, we're going to be adding the A. So that's going to cost one. That's going to cost one extra character. And so we would add one into our matrix. And we need to be able to do this, though,
programmatically. It's potentially relatively simple for people to do it. But we're going to go back to those rules. So again, we're going to be knocking off the first character of one of our strings. And the way we can do that in a matrix is actually by using the row index and the column index as a pointer
to the string that we're looking at. So we're looking at S, and then we're looking at A. But we want to knock off that last character to see if they match, and the cost of doing that would be a one. And then, of course, we add plus one to it.
And that would represent the cost of doing an insertion right here. So it would be zero plus one, which happens to be one, which is exactly what we thought previously. It's always really good when you're programming an algorithm, and your algorithm produces results that you expect, generally.
OK, so we're going to keep going. In this case, there are no extra characters. It's going to just all be insertion. So we go over and over and over again. And we end up just adding one, two, three, four, five, six, seven onto the end of our matrix. Our next character is going to be changing SU into S. Can
anybody guess the action of this? Changing SU into S. Deletion. OK, good. Again, this is something we can intuit, but we have to be able to figure out how to tell our machine.
Previously, we looked at some code where we knock off the first character of the second string. And again, we're going to look in our matrix. And our column index this time is going to be S. The row index is going to be U. We're going to bump it up by one, and that gives us our initial cost that we can then add one onto.
So it's going to be a cost of one. This covers insertion. This covers deletion. We also have to cover substitution. You probably see where this is going before. You've already seen these equations. And we get to our matrix. We find our rows. We find our columns. And in this case, we go back on each of them.
And this is exactly the same. So substituting A for U is exactly the same as saying, did our previous set of strings match? Because if they match, then we should substitute them. So that hopefully makes a little bit of sense.
We add one, and that covers all of our processes of determining insertion, deletion, and substitution. I did kind of gloss over match. Match is basically just minus one, minus one, where we look at the previous one. It's similar to insertion, except we don't add one.
Because there is no cost to change a string to itself. So essentially, changing an S to an S is the same as changing nothing to nothing, which would be a cost of zero. So once we have all of these things in place, we can
put them together. And we end up with something essentially like this. We're going to iterate over all of the characters in one of our strings. And we are going to store the values in a matrix. And you can see here, highlighted in yellow, all of our different cases.
Granted, if you were going to code this for performance, you probably wouldn't want to allocate a hash. But that's a different story. When we do this, we then iterate. We can run it. And you get something that sort of looks like this. Just goes, there you go.
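A matrix-based sketch along those lines, using nested arrays rather than a hash and hypothetical names (not the exact code from the slides):

    # Matrix-based Levenshtein distance. Rows follow the first string,
    # columns follow the second; cell [i][j] holds the distance between
    # the first i characters of a and the first j characters of b.
    def levenshtein_matrix(a, b)
      rows = a.length + 1
      cols = b.length + 1

      # The "zeroth law": converting to or from an empty string costs
      # the length of the other string.
      matrix = Array.new(rows) do |i|
        Array.new(cols) { |j| i.zero? ? j : (j.zero? ? i : 0) }
      end

      (1...rows).each do |i|
        (1...cols).each do |j|
          matrix[i][j] =
            if a[i - 1] == b[j - 1]
              matrix[i - 1][j - 1]     # match: no added cost
            else
              [
                matrix[i][j - 1],      # insertion
                matrix[i - 1][j],      # deletion
                matrix[i - 1][j - 1]   # substitution
              ].min + 1
            end
        end
      end

      matrix # the full matrix; the total distance is matrix[-1][-1]
    end

    levenshtein_matrix("Sunday", "Saturday")[-1][-1] # => 3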
OK, so you can see the final cost of changing Saturday into Sunday was three. One of the coolest things about this method is that we can also get the cost of all of our substrings. We've already calculated them.
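For example, with the matrix returned by the sketch above:

    matrix = levenshtein_matrix("Sunday", "Saturday")
    matrix[-1][-1] # => 3, Sunday to Saturday
    matrix[3][3]   # => 2, the cost of turning "Sun" into "Sat" (walked through next)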
If we were interested, we could find the cost of sun to sat. We just look in our matrix. We pull the n, because that's the last character, and the t. And we find out that the cost would be two. Was this better than recursive? It took a net of 48 iterations, which in my
personal opinion is a little better than 1,647. It is a little memory intensive. And both of these examples that you can feel free to play around with and actually iterate and step through and reason about, as well with all of my notes for this
presentation, are on bit.ly slash going the distance, if you're ever interested, as well as I'll throw up a couple of reference links for the research that I did in coming up with these. OK, so this is a tool. Previously, I was talking about programming. And I was saying, algorithms are really cool, because they
do things. And by now, you're like, string minus one, plus one, add cost, blah, blah, blah. What's going to be for lunch? And you haven't seen it actually in action. I mentioned previously, I'm a horrible speller. I happen to be human. And I get often tired.
And machines do not understand me very well when I get tired. So one day while I was typing, I was typing Rails generate migration. But it was really something that was like, microtoon. And it totally just blew up in my face. And I was super upset. And I was like, stressed. And it was the simplest task known to mankind. Like, why could I not possibly do this?
It just, things were not working. I was upset. I don't even know where this came from. Like, who made this? It's amazing. So I was saying, why can't more software, why can't the
software that I'm using be more like Git? We know what you're trying to accomplish. We know you're trying to run a generator command. Like, we know the generator command's available. Why can't we be a little bit more helpful? So I submitted a pull request, where we used Levenshtein distance. We then, basically, whenever we detect that you have an
error, we compare the command that you gave us to all the possible commands using Levenshtein distance. It's quick enough. And then we recommend the smallest possible distance.
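The idea is roughly this (a hedged sketch with a hypothetical command list, not the actual code from that pull request; it reuses the levenshtein method sketched earlier):

    # Compare the mistyped command against every known command and
    # recommend the one with the smallest distance.
    COMMANDS = %w[generate console server dbconsole new runner]

    def suggestion_for(mistyped)
      COMMANDS.min_by { |command| levenshtein(mistyped, command) }
    end

    suggestion_for("generat") # => "generate"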
So again, this is also similar to what Google does. If you've ever typed something into Google and misspelled it, it will give you a spelling suggestion. So this is really neat. Peter Norvig has a great paper online where he talks about how Google implemented the first spelling suggestion. And basically, Google, in addition to the Levenshtein
distance stuff that we talked about, they slurped in a lot of real world books. And in these books, they counted words. Every single time a word came up, they counted it. So the word "a" probably came up like a ton. And they're like, "a" is probably a real word.
"The" is probably a real word. "Shmorgasbord" is totally made up because we didn't find it in any of these books, like "dragon flesh" or something. So the higher the count of that word, the higher the probability that it's going to get in there, once we
have this information, we can also get the edit distance between the input that you gave it and the dictionary that we have. The lower the edit distance, the higher the probability. And then put that all together, boom, we show a suggestion.
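A very rough sketch of that counts-plus-distance idea, nothing like Google's actual implementation (and skipping the candidate-generation tricks in Norvig's write-up); it also reuses the levenshtein method from earlier:

    # Word counts harvested from a body of text (tiny made-up sample).
    WORD_COUNTS = {
      "saturday" => 10_000,
      "sunday"   => 12_000,
      "satay"    => 50
    }

    def suggest(input, max_distance: 2)
      # Keep only words that are close enough to the input...
      candidates = WORD_COUNTS.select do |word, _count|
        levenshtein(input, word) <= max_distance
      end
      # ...then prefer the most common of those words.
      candidates.max_by { |_word, count| count }&.first
    end

    suggest("saturdy") # => "saturday"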
Also, Google's really smart, and they totally cheat about this. Originally, running a Levenshtein distance across all of those would probably not return in whatever point-something-something seconds with network lag. So they cache the correct spelling suggestions. Basically, they show you a spelling suggestion, and then when you click on it, they're like, ha, gotcha, that's what
you actually wanted. And if most people search for that thing and also click on that spelling suggestion, then they know that that's probably the right one. Another really cool thing that actually came out relatively recently is the did_you_mean gem. Has anybody seen this?
So if you add the did_you_mean gem to your Gemfile, it will give you a really cool error. If you call user_logged_in and it doesn't exist, it's like, you know what? This object doesn't have a user_logged_in method. It basically catches NoMethodErrors and finds all of the methods on that object and runs Levenshtein distance
across all of them and then suggests a method for you to use. Probably don't use this in production.
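A rough illustration of the kind of suggestion described above, using a hypothetical class; the exact message format varies by Ruby version, and in recent Rubies did_you_mean ships with the language itself:

    class User
      def logged_in?
        true
      end
    end

    User.new.loged_in?
    # raises NoMethodError with a message along the lines of:
    #   undefined method `loged_in?' for an instance of User
    #   Did you mean?  logged_in?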
So this is really cool. This stuff, it wasn't just in a textbook where I was like, people are bored, and professors are trying to make your life horrible. This is applicable to actual things. This just came out. Somebody just thought of this. What else can you add it to? There's a ton of CLI commands. Yes, it's very specific. It's generally in dealing with misspellings.
But as programmers, we spell and misspell a lot of things. We type a lot of things. There are a ton of applications, I'm sure, that we could add this to. In preparing for this talk and doing research, originally I
was like, oh, I'm going to do all the distance measurements. And that's like saying, I'm going to do a talk on all the algorithms. Because there's a lot of them. So we talked about Levenshtein distance, and we talked about Hamming distance. There's longest common subsequence distance. There's Manhattan distance, tree distance. And there's also Jaro-Winkler distance, which is
nowadays, from what I've been told, what most people would actually use instead of Levenshtein if they're doing large, massive amounts of spelling suggestions. And you can get really creative. If you're going to be doing this often and frequent
enough, like Google had to bootstrap those spelling suggestions at one point in time, they can do things like store the edit distance calculations or the counts in a trie so that you can easily just give it all of your characters, and it will give you back a list
of possible spelling suggestions. So this is kind of just scraping the surface of all of those measurements. But really at the end of the day, hopefully you're going to walk away and be like, algorithms, not just for CS undergrads anymore.
Like, they're really cool. There's a lot that we can learn from them. And if you're interested in learning more, just in general about algorithms, Wikipedia has an absolute ton of stuff. Rosetta Code is great. If you're not familiar with this project, they will pick common algorithms and implement that algorithm in
a bunch of different languages, though the Ruby version of Levenshtein distance was so not Ruby code. It was obviously not written by a Ruby programmer. I felt free to fix that a little bit. Or, honestly, the best way to learn more about algorithms, and my personal recommendation is to give
an algorithm talk. Like, seriously, it's probably going to happen as soon as we open up this for questions. Somebody's going to be like, hey, did you hear about like this algorithm? And I'm like, no, and that's amazing, and we should talk about this. And algorithms are essentially these ways of sharing
knowledge, and they explore how we can do things and do them efficiently. So, this is the antepenultimate slide, which I've been informed means the third from the last.
And this is basically a nonsensical slide. So, does anybody have any questions? OK, cool. Well, I'm going to be around. All of this code is online. Like, I highly recommend going out and playing with it, actually exploring. All of my notes are on that Ruby gem, or on that GitHub
repo, and if you wait maybe like five minutes, I'm going to actually push up a bunch of links to the readme that help explain where a lot of my source material came from. So, thank you very much for coming, and yeah, have a great day.