Lightning Talk: Dense Native Vector Scoring in Elasticsearch
Formal Metadata
Title | Lightning Talk: Dense Native Vector Scoring in Elasticsearch
Title of Series | Berlin Buzzwords 2020
Number of Parts | 48
License | CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers | 10.5446/68821 (DOI)
Berlin Buzzwords 2020, 8 / 48
Transcript: English (auto-generated)
00:08
Hello, everyone. Welcome to the Lightning Talk session at Berlin Buzzwords 2020. I'm Atitha Arora,
00:22
and I will be your host for the Lightning Talk session tonight. Looking at the list of topics we have for the session today, I'm super excited, and I look forward to each one of these talks. I hope you all have a good time. Just to let you know, if you have any questions,
00:41
please keep posting them on the live chat channel. And without further delay, let's welcome the first speaker for tonight. Nick, over to you. Great. Hi there, thank you very much, Atitha. And welcome to this very quick Lightning Talk
01:03
on native dense vector scoring in Elasticsearch. It'll actually be my first Lightning Talk. I normally do longer ones, so let's see if we can get through everything on time. I'm ML Nick on Twitter, GitHub, and LinkedIn, a principal engineer at the IBM Center for Open Source Data and AI Technologies, focused on machine learning
01:21
and artificial intelligence open source software. We're a team of over 30 open source developers at IBM, working on a wide variety of data and AI open source frameworks and projects. So we'll start by talking about what vectors are.
01:41
A vector is essentially an ordered, indexed set of numbers, very much like an array. There are dense and sparse vectors, but the dense vectors we'll be talking about today are essentially just arrays.
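To make the dense-versus-sparse distinction concrete, here is a minimal sketch (the values are made up for illustration):

```python
# A dense vector stores every component, in order, like a plain array.
dense = [0.0, 2.5, 0.0, 0.0, 1.2]

# A sparse vector stores only the non-zero entries (index -> value),
# which saves space when most components are zero.
sparse = {1: 2.5, 4: 1.2}

# Both represent the same 5-dimensional vector; the dense form is the
# one that matters for what follows.
```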
02:02
And why do vectors matter? They arise in many different scenarios. We can represent various things as vectors, including images, music, and movies; vectors can be used in e-commerce and recommendations, in social networks, and even for documents. One example where vectors are very common
02:22
is in recommender systems. In recommenders, we have a set of users and a set of items. For example, in movie recommendations, the set of items might be a set of movies or videos. You can see here on the bottom left
02:40
how we represent the ratings users have given to the set of movies as a matrix. It's a sparse matrix: not every entry is filled, which means not every user has rated every movie. A typical approach for building recommender models is matrix factorization.
03:02
Matrix factorization takes the matrix we see on the left and splits it into two smaller matrices: the first is a user matrix, and the second is a movie (or item) matrix. Each row or column, as the case may be, in one of these matrices is actually a vector.
03:20
So it turns out that to compute a predicted rating for a given user and movie combination, we just take the user vector and the item vector from these matrices and perform a linear algebra operation, the dot product, between them.
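As a minimal sketch of this with made-up factor values (not the talk's actual model), showing the dot-product rating prediction and the cosine similarity discussed next:

```python
import numpy as np

# Hypothetical k=3 factor matrices from matrix factorization:
# one row per user and one row per movie, each a dense vector.
user_factors = np.array([[0.9, 0.1, 0.4],
                         [0.2, 0.8, 0.5]])   # 2 users
item_factors = np.array([[1.0, 0.0, 0.3],
                         [0.1, 0.9, 0.2]])   # 2 movies

# Predicted rating for user 0 and movie 1: the dot product of the two vectors.
predicted = user_factors[0] @ item_factors[1]
print(predicted)  # 0.9*0.1 + 0.1*0.9 + 0.4*0.2 = 0.26

# Finding similar items: cosine similarity between two movie vectors.
a, b = item_factors[0], item_factors[1]
cosine = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cosine)
```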
03:41
Similarly, if we want to find similar items, which power things like "products you may want to buy", we compute a cosine similarity. This looks conceptually very similar to the way search ranking works: we start with a query, represent it as a term vector (perhaps a binary term vector), compute a similarity,
04:01
and then effectively sort the results by similarity. So can we use the same machinery to score arbitrary vectors that are not necessarily the typical search query vectors? For example, in recommendations we have a user vector, and the dot product and cosine similarity machinery
04:22
is quite similar to what we use in search ranking. So scoring and ranking work, but term vectors don't, because the way arrays and vectors were stored natively in Elasticsearch up until now meant that an array
04:42
was stored as an unordered set of numbers. We lose the ordering, and with it the ability to do the operations we need on those vectors. There was a way around this: up until now, you could use a different representation for the vectors
05:00
and then use a custom plugin providing your scoring function, which would allow you to do things like dot products, cosine similarities, and other arbitrary functions. But this required custom code and loading a custom Java plugin, and that adds complexity.
05:20
It definitely works, but it's not the easiest way to do things, and in environments where you don't control the cluster and can't add plugins, it's difficult to use. This has been solved recently in Elasticsearch 7:
05:43
a native dense vector type, effectively an ordered array stored as binary, was added in ES 7.0, and then built-in vector functions of the kind we need for similarity scoring, dot product and cosine similarity, were added in 7.3.
06:02
So you can see there that we've got the dense vector type that we can use as a field mapping in our properties, and we just need to specify the number of dimensions.
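As a sketch of what such a mapping looks like, here is roughly how you'd create an index with a dense_vector field using the Python Elasticsearch client (the index and field names here are illustrative, not necessarily those from the talk):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Create an index whose documents carry a 20-dimensional factor vector.
# "model_factor" and dims=20 are illustrative choices.
es.indices.create(
    index="users",
    body={
        "mappings": {
            "properties": {
                "model_factor": {
                    "type": "dense_vector",  # native type added in ES 7.0
                    "dims": 20               # must match your vector length
                }
            }
        }
    },
)
```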
06:21
This gives us everything we need compared to what we had previously. We can take a vector that represents, let's say, a user or item in recommender systems, an image if we're doing image search, or a document that is the output of a machine learning or deep learning model, and apply native scoring functions like dot product and cosine similarity to get our ranking. This is now all built in, and we don't need to do anything funny. So I'm going to very quickly try and show
06:46
an example of this. This is an Elasticsearch Spark recommender that I've created as part of my work at IBM. It's a very long notebook, and I won't go through all of it,
07:03
but it works through how to load data from Elasticsearch into Spark and run recommendations over movies using exactly the functionality we've talked about. The key thing I want to highlight is that we can simply create a dense vector field
07:21
for users, which specifies the dense_vector type and the dimension, and similarly for items, movies in this case. Then all we have to do is plug straight into a standard script_score query, where the score function is exactly cosine similarity or dot product.
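A sketch of what such a query looks like, continuing the illustrative names above (note the exact function signature varies slightly across 7.x releases: in 7.3 the field is accessed via doc[...], while later releases take the field name as a string):

```python
# Find the movies most similar to a given factor vector. In practice,
# query_vector would be fetched from the index (e.g. a user's vector).
query_vector = [0.1] * 20  # illustrative 20-dim vector

resp = es.search(
    index="movies",
    body={
        "size": 5,
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    # "+ 1.0" keeps the score non-negative, as ES requires.
                    "source": "cosineSimilarity(params.query_vector, doc['model_factor']) + 1.0",
                    "params": {"query_vector": query_vector},
                },
            }
        },
    },
)

for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("title"))
```

Swapping cosineSimilarity for dotProduct gives dot-product scoring of the kind used for rating prediction instead.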
07:41
And that's it: just by using that functionality, we get recommended movies and we get similar movies. Okay, so thank you for sticking with me for what's probably been a little more than five minutes now.
08:03
I encourage you to go and check out the code pattern I mentioned, check out codait.org, and find me on Twitter and GitHub. Thank you.