Lightning Talk: Spell Correction & Query Segmentation
Formal Metadata
License | CC Attribution 3.0 Unported: You may use, modify, and copy the work or its content in changed or unchanged form for any legal purpose, distribute it, and make it publicly available, provided you credit the author/rights holder in the manner specified by them.
Identifiers | 10.5446/68818 (DOI)
Number of parts | 48
Berlin Buzzwords 2020, part 11 / 48
Transcript: English (auto-generated)
00:08
I think we'll move on. Our next talk is again from Lucky; he will be talking about SymSpell-based spell correction. Welcome back, Lucky.
00:21
Hi. Welcome back, everyone. In this one, I'm going to talk about the spell correction and query segmentation part, which is one of the basic things we need to handle in e-commerce search. The idea is that we created a SymSpell-based,
00:45
customized Java implementation. SymSpell was originally written by Wolf Garbe in C#. We adopted that and ported it to Java, and on top of that we added some customizations as well. So basically what SymSpell
01:01
comprises is single-word spelling correction and compound corrections over multiple words, that is, the word breaking and word joining part. For example, "nutfree chocolates" without a space and with a space. We added the QWERTY distance on top of that. So if someone types "slives",
01:21
and in the index we have "slices" and "olives": if I look at the edit distance, both are at edit distance one from the typed word. But if I look at the keyboard, I see that for "slives" compared to "olives", the S and O are poles apart; they're on opposite sides of the keyboard.
01:43
But the V and C of "slives" and "slices" are adjacent, which gives a strong signal that "slices" is closer to the original word. So in that case, the QWERTY distance helps us. The other part is that we have integrated SymSpell into Solr: we wrote a component around SymSpell
02:03
so that it can be used easily, which lets us use the search engine's data as a dictionary. With any spell correction, you need to provide a dictionary. So if I'm maintaining a dictionary by hand and I have a new brand in my catalog,
02:22
and someone misspells it, it will not get autocorrected unless we have that word in our dictionary and our spell corrector knows about it. With this integration, we can read the index data and build the dictionary either on every commit operation
02:41
or on any reload operation. So that's what we did. And there's one more idea built on top of SymSpell: context-based spelling correction. For example, if I type M-A-L-K, that could be "milk" or "malt". So which one would be appropriate?
03:02
If a person is browsing the beverages and types M-A-L-K, then it surely means "malt". But if they're browsing the groceries and type M-A-L-K, then it means "milk". So what we are currently doing is,
03:21
we are trying to approach this problem of context-based spelling correction using SymSpell. As I mentioned, there are examples like "nutfree" without a space and "skimmed milk". And in a case like "nutfree", it's not just correcting a single word
03:41
but also splitting words. SymSpell is a Symmetric Delete spelling correction algorithm, which drastically reduces the complexity of edit candidate generation. It's much faster and language-independent. Along with QWERTY, we have also created QWERTZ, the keyboard layout for German.
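The Symmetric Delete trick mentioned here can be sketched in a few lines. This is an illustrative reconstruction, not the speaker's actual code: only delete operations are generated, for both the dictionary terms (at build time) and the query term (at lookup time), and candidates are found by intersecting the two sets.

```java
import java.util.HashSet;
import java.util.Set;

// Minimal sketch of SymSpell's Symmetric Delete idea: instead of generating
// inserts/replaces/transposes at query time, generate only deletions for both
// dictionary words and the query, then match the resulting sets.
public class SymmetricDelete {
    // All strings reachable from `word` by deleting up to `maxDistance` characters.
    public static Set<String> deletes(String word, int maxDistance) {
        Set<String> result = new HashSet<>();
        result.add(word);
        if (maxDistance == 0) return result;
        for (int i = 0; i < word.length(); i++) {
            String shorter = word.substring(0, i) + word.substring(i + 1);
            result.addAll(deletes(shorter, maxDistance - 1));
        }
        return result;
    }

    public static void main(String[] args) {
        // "malk" and "milk" share the delete "mlk", so "milk" becomes a candidate.
        Set<String> query = deletes("malk", 1);
        Set<String> dict = deletes("milk", 1);
        query.retainAll(dict);
        System.out.println(query); // [mlk]
    }
}
```

Because the dictionary-side deletes are precomputed once, lookup is a handful of hash probes, which is where the speed and language independence come from.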
04:03
And it is much faster because it pre-computes the delete dictionary, so all the delete candidates at a given edit distance are generated while building the dictionary itself. So here are some metrics. Compared to Norvig's approach and LinSpell,
04:20
SymSpell performs much better on a 500,000-word dictionary, measured as search time for 1,000 words. This is the QWERTY adjacency matrix we use. We see that V connects to C, F, G, and B. We weight them on the basis of direct connectivity
04:41
and diagonal connectivity. Direct connectivity has a lower weight, and diagonal connectivity has a slightly higher weight compared to direct connectivity. And we use these weights in the Damerau-Levenshtein calculation, where we multiply them with the replace distance. So Damerau-Levenshtein is,
05:01
basically, an add-on to Levenshtein distance that also accounts for transpositions of adjacent characters. On top of that, we added the ability to assign weights to specific operations: you can weight the insert operation, the delete operation, and so on.
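The two ideas just described, per-operation weights and a keyboard-aware replace cost, can be combined in one distance function. The sketch below is an assumption-laden illustration (the tiny adjacency table and the 0.5/1.0 weights are made up for the example), not the speaker's implementation:

```java
// Sketch of a Damerau-Levenshtein distance with configurable per-operation
// weights and a QWERTY-aware substitution cost. Weight values and the partial
// adjacency table are illustrative assumptions only.
public class WeightedDistance {
    double insertW = 1.0, deleteW = 1.0, transposeW = 1.0; // defaults: all one

    // Illustrative QWERTY neighbourhood: adjacent keys get a cheaper replace.
    static double replaceWeight(char a, char b) {
        String[] pairs = {"vc", "vb", "cx", "sd", "sa"}; // partial neighbour list
        for (String p : pairs)
            if ((p.charAt(0) == a && p.charAt(1) == b)
                    || (p.charAt(0) == b && p.charAt(1) == a))
                return 0.5; // adjacent keys: likely a fat-finger typo
        return 1.0;         // far apart on the keyboard: full replace cost
    }

    double distance(String s, String t) {
        int n = s.length(), m = t.length();
        double[][] d = new double[n + 1][m + 1];
        for (int i = 0; i <= n; i++) d[i][0] = i * deleteW;
        for (int j = 0; j <= m; j++) d[0][j] = j * insertW;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                char a = s.charAt(i - 1), b = t.charAt(j - 1);
                double sub = (a == b) ? 0.0 : replaceWeight(a, b);
                d[i][j] = Math.min(Math.min(
                        d[i - 1][j] + deleteW,    // delete
                        d[i][j - 1] + insertW),   // insert
                        d[i - 1][j - 1] + sub);   // replace (or match)
                // transposition of two adjacent characters
                if (i > 1 && j > 1 && a == t.charAt(j - 2) && s.charAt(i - 2) == b)
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + transposeW);
            }
        }
        return d[n][m];
    }

    public static void main(String[] args) {
        WeightedDistance wd = new WeightedDistance();
        // "slives" -> "slices" replaces v with c (adjacent keys, cheap) ...
        System.out.println(wd.distance("slives", "slices")); // 0.5
        // ... while "slives" -> "olives" replaces s with o (far apart, full cost).
        System.out.println(wd.distance("slives", "olives")); // 1.0
    }
}
```

With uniform weights this reduces to plain Damerau-Levenshtein; lowering, say, deleteW biases the corrector toward treating extra characters as accidental insertions, which is the kind of tuning the talk suggests.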
05:21
So by default they are all one, but if deletes are more probable in your data, you can reduce the delete weight, and you can play with these weights to find the best possible configuration for your ecosystem. So this is our accuracy summary. When we indexed 3,695 terms and ran 8,060 searches,
05:44
with Levenshtein we were getting a lot of false negatives, 1,550. With vanilla SymSpell it gave 545 false negatives, and with QWERTY-based SymSpell it was 573 false negatives. Similarly, the true positives also increased
06:01
compared to Levenshtein, and that's why we plan to move it into Solr, so that we can leverage the higher accuracy we are getting from SymSpell. So that's all. This is the GitHub link for the customized SymSpell. I've open-sourced it, and it's available on Maven Central,
06:20
so you can use it directly. A few companies are already trying it, and here's my link. Thank you, everyone. That's all I have. Any questions? Great, super talk. I think I was looking forward to this one, at least, for tonight. Yes, we have questions.
06:42
There are two questions that I actually had, but I think you answered both of them. I wanted to see whether you compared it to the existing Solr spell checker; I think you covered that, plus the plan to incorporate it into Solr as well, which you've also covered. So both of my questions are answered, but we have one question from Jens, and he asks:
07:01
is there a way to combine keyboard and default distance measures? Combine keyboard? I didn't quite get that. Combining keyboard and default distance measures, as in, if I'm getting the question right, you're asking whether we use the default distance measures along with the QWERTY distance, is it?
07:20
If that is the question you're asking, then yes, yes, we have. So another question coming in from Zenit. He says: you mentioned compounds. Can it handle compounds of three or more words, or compounds that have infixes? Yeah, definitely.
07:41
So yeah, it can handle that. Suppose I write something like "a fox going to the market in the jungle" without any spacing, and also with spaces in a normally formatted way; it corrects both. It did that. Great. I cannot wait to try it, actually, and I'll let you know.
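The multi-word compound handling discussed here boils down to a word segmentation step. As a hedged illustration (SymSpell's own WordSegmentation also folds in spelling correction; this sketch only splits against a known vocabulary), a dynamic-programming splitter looks like this:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative dynamic-programming word segmentation over a known vocabulary,
// in the spirit of the compound handling described in the talk.
public class Segmenter {
    // Returns a space-separated segmentation of `text`, or null if none exists.
    public static String segment(String text, Set<String> vocab) {
        int n = text.length();
        String[] best = new String[n + 1]; // best[i] = segmentation of prefix 0..i
        best[0] = "";
        for (int i = 1; i <= n; i++) {
            for (int j = 0; j < i; j++) {
                String word = text.substring(j, i);
                if (best[j] != null && vocab.contains(word)) {
                    best[i] = best[j].isEmpty() ? word : best[j] + " " + word;
                    break; // first valid split is enough for this sketch
                }
            }
        }
        return best[n];
    }

    public static void main(String[] args) {
        Set<String> vocab = new HashSet<>(Arrays.asList("nut", "free", "chocolates"));
        System.out.println(segment("nutfreechocolates", vocab)); // nut free chocolates
    }
}
```

A production version would rank competing splits by word frequency and allow each candidate word to itself be a spell-corrected match, which is how three-or-more-word compounds with typos can still come apart cleanly.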
08:00
And I'm really excited that it's getting incorporated into Solr. So that was a great talk and kind of food for thought for tonight, at least. Thank you so much for a great talk, Lucky. I think we'll see you around at the social.