
Lightning Talk: Dense Native Vector Scoring in Elasticsearch


Formal Metadata

Title: Lightning Talk: Dense Native Vector Scoring in Elasticsearch
Number of Parts: 48
License: CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract: Lightning talk from Berlin Buzzwords | MICES | Haystack – Joint Event 2020
Transcript: English (auto-generated)
Hello, everyone. Welcome to the Lightning Talk session at Berlin Buzzwords 2020. I'm Atita Arora, and I'll be hosting the Lightning Talk session tonight. Looking at the list of topics we have for the session today, I'm super excited, and I look forward to each one of these talks. I hope you all have a good time. Just to let you know, if you have any questions, please keep posting them on the MICES live channel. And without further delay, let's welcome the first speaker for tonight. Nick, over to you.

Great, hi there. Thank you very much, Atita. Welcome to this very quick Lightning Talk on native dense vector scoring in Elasticsearch. It'll actually be my first Lightning Talk; I normally do longer ones, so let's see if we can get through everything on time. I'm MLnick on Twitter, GitHub, and LinkedIn, a principal engineer at the IBM Center for Open Source Data and AI Technologies, focused on machine learning and artificial intelligence open source software. We're a team of over 30 open source developers at IBM, focusing on a wide variety of data and AI open source software frameworks and projects.

So we'll start by just talking about what vectors are.
A vector is essentially an ordered, indexed set of numbers, very much like an array. There are dense and sparse vectors, but the dense vectors we'll be talking about today are very much like an array. And why do they matter? They actually arise in many different scenarios. We can represent various things as vectors, including images, music, and movies, and vectors can be used in e-commerce and recommendations, social networks, and even documents. One example where vectors are very common
is in recommender systems. In recommenders, we have a set of users and a set of items. For example, in movie recommendations, the set of items might be a set of movies or videos. You can see here on the bottom left that we represent the ratings users have given to the set of movies as a matrix. And it's a sparse matrix: not every entry is filled, which means that not every user has rated every movie. A typical approach for building recommender models is matrix factorization. That takes the matrix we see on the left and splits it up into two smaller matrices: the first is a user matrix, and the second is a movie or item matrix. And each column or row, as the case may be, in one of these matrices is actually a vector.
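As a minimal sketch of the shapes involved (illustrative values only, not the talk's own code), the factorization can be written in a few lines of NumPy:

```python
# Approximate a ratings matrix R by the product of a user-factor matrix U
# and the transpose of an item-factor matrix V; each row is a factor vector.
import numpy as np

n_users, n_items, k = 4, 5, 3      # k = number of latent factors

U = np.random.rand(n_users, k)     # user matrix: one k-dim vector per user
V = np.random.rand(n_items, k)     # item matrix: one k-dim vector per item

R_hat = U @ V.T                    # predicted ratings for every user/item pair
print(R_hat.shape)                 # (4, 5)
```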
So it turns out that to compute a predicted rating for a user and movie combination, we just take a user vector and an item vector from these matrices and perform a linear algebra operation between them: the dot product. And similarly, if we want to find similar items, which powers things like "products you may want to buy", we compute a cosine similarity. This looks conceptually very similar to the way search ranking works: we start with a query, we represent it as a term vector (a kind of binary term vector, perhaps), we compute a similarity, and then we sort the results by that similarity, effectively. So can we use the same machinery on arbitrary vectors that are not necessarily the typical search query vectors? For example, in recommendations we have a user vector, and the dot product and cosine similarity machinery is quite similar to what we use in search ranking.
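A minimal sketch of that shared machinery, again in plain NumPy rather than anything from the talk: score a set of item vectors against a user vector and sort by the score, just as a search engine sorts results:

```python
import numpy as np

def cosine_similarities(query: np.ndarray, vectors: np.ndarray) -> np.ndarray:
    # Cosine similarity between one query vector and each row of a matrix.
    return (vectors @ query) / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))

user_vec = np.array([0.8, -0.2, 0.5])     # hypothetical user factor vector
item_vecs = np.random.rand(5, 3)          # hypothetical item factor vectors

predicted_ratings = item_vecs @ user_vec  # dot product per item
ranking = np.argsort(-cosine_similarities(user_vec, item_vecs))
print(predicted_ratings, ranking)         # items ranked by similarity
```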
So scoring and ranking work, but the term vectors don't necessarily work, because the way arrays and vectors have been stored natively in Elasticsearch up until now has effectively meant that an array gets stored as an unordered set of numbers. So we lose the ordering, and we lose the ability to do the operations we need on those vectors. There is a way around this. Up until now, the way to do it was to use a different representation for the term vectors and then use a custom plugin, which would provide your scoring function and allow you to do things like dot products, cosine similarities, and other arbitrary functions. But this required custom code and loading a custom plugin in Java, and that adds complexity.
So it definitely works, but it's not necessarily the easy way to do things. And, for example, in environments where you don't necessarily control the cluster and can't add plugins, it's difficult to use. This has been solved recently in Elasticsearch 7: a native dense vector type, effectively an ordered array stored as binary, was added in Elasticsearch 7.0, and then built-in vector functions were added in 7.3, exactly the kind we need for similarity scoring, such as dot product and cosine similarity.
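As a rough sketch of what that mapping looks like (the index and field names here are made up, not taken from the talk), using the official Python client against Elasticsearch 7.x:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Create an index with a dense_vector field; "dims" fixes the vector dimension.
es.indices.create(
    index="users",
    body={
        "mappings": {
            "properties": {
                "model_factor": {"type": "dense_vector", "dims": 20}
            }
        }
    },
)

# A document's vector is then just an ordered array of floats.
es.index(index="users", id="1", body={"model_factor": [0.1] * 20})
```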
So you can see that we've got the dense vector as a mapping in our properties, and we just need to specify a dimension. This gives us everything we need if we compare it to what we had previously. We can take a vector that represents, let's say, a user or item in recommender systems, or an image if we're doing image search, or a document that is the output of a machine learning or deep learning model. And we can apply native scoring functions like dot product and cosine similarity and get our ranking. This is now all built in, and we don't need to do anything funny.

So I'm going to very quickly try to show an example of this. This is an Elasticsearch Spark recommender that I've created as part of my work at IBM. It's a very long notebook and I won't go through all of it, but it works through how to load data from Elasticsearch into Spark and then run recommendations over movies using exactly the functionality we've talked about. The key thing I want to highlight is that we can simply create a field for users which specifies the dense vector type and the dimension, and similarly for items, movies in this case. And then all we have to do is plug straight into a standard script_score query, where the score function is exactly cosine similarity or dot product.
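For reference, a script_score query of that shape looks roughly like this (again with made-up index and field names, using the built-in vector functions from 7.3):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

user_vector = [0.1] * 20  # e.g. a stored user factor vector

response = es.search(
    index="movies",
    body={
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    # cosineSimilarity returns values in [-1, 1]; adding 1.0
                    # keeps scores non-negative, as Elasticsearch requires.
                    # dotProduct(...) can be used in the same way.
                    "source": "cosineSimilarity(params.user_vector, doc['model_factor']) + 1.0",
                    "params": {"user_vector": user_vector},
                },
            }
        }
    },
)

for hit in response["hits"]["hits"]:
    print(hit["_id"], hit["_score"])
```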
And that's it. Just using that functionality, we get recommended movies and we get similar movies. Okay, so, yeah, thank you for sticking with me for what's probably a little bit more than five minutes now.
I encourage you to go and check out the code pattern that I mentioned, check out codait.org, and find me on Twitter and GitHub. Thank you.