
Comparing apples to apple trees: Injecting domain knowledge to similarity searches


Formal Metadata

Title
Comparing apples to apple trees: Injecting domain knowledge to similarity searches
Title of Series
Number of Parts
69
Author
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
Image & text similarity problems have plenty of textbook ML solutions. However, in the wild, these solutions often fail. In this talk I'll present a use case about inferring product similarity from multiple sources of data (images, text, etc.) and discuss how we developed a practical & scalable approach using our understanding of domain knowledge.
Transcript: English (auto-generated)
Today, I'll be talking to you about domain knowledge, about similarity calculations, how to combine them together, and also how to integrate similarity from different sources in, hopefully, a smart way.
So I'm not going to introduce myself too much. You can find me on my web page, on Twitter, and on LinkedIn. And I don't think I need to introduce my employer too much, especially after the last year. Shopify is basically an e-commerce giant that powers a lot of businesses around the world. It's a Canadian company, but there's a very strong presence here in Berlin.
And since we are talking about e-commerce, let's talk about products. So can you tell me just by looking at these product images and the little descriptions, what makes all of these products essentially the same thing? Well, it's pretty hard to tell. We have this psychedelic giraffe here.
We have a lovely place in Italy and a red umbrella in the middle of the street. Well, if we start working on the data that we have, say with some out-of-the-box NLP kit, we can immediately find that there are some similarities in the descriptions. But these descriptions, at least in the case I was working on, are marketing texts
that don't always describe the product itself; they are often more about the experience and branding. And if you want to work really, really hard and process hundreds of millions of images through some deep neural network labeling effort, you might be able to find some overlapping labels, but this is not great.
And something like embeddings: just by looking at the images, you can pretty much say that embeddings without a lot of training are probably not going to work here. So what is going to work? Well, dig a little bit deeper. Like I said, these are the main product images. All of these products have more than one image. They have three, four, six, 16 images in the background somewhere.
And each one of these products, somewhere in the background, had a version of one of these two images. So all of these are paint-by-numbers products, which became super successful in terms of sales when the quarantine started and are still doing really, really well.
Again, just by looking at the main image you can't really say what it is, but if you dig a little bit deeper and look at the other images of the product, you might be able to find the links. So what kind of use case am I describing here? We have an entity of interest. This can be something like a product. This can be something like a user, or a patient if you look at medical records,
and each one of these entities has some data items that are connected to it. In terms of products, we talk about something like product reviews or product images; in terms of patients, case notes, test results, imaging, et cetera.
Where domain knowledge comes into play in this example is the fact that all of these items are linked through the entity that we're interested in. Now, what we don't have for this case, and this is important, is a natural way to align or sort the items.
If you think about something like stocks, there is a natural way to align prices over time for each stock, and if that was the case, I would use something like time series clustering. But here there is no natural way to order things. We basically have to compare everything to everything, or think about how to calculate similarity on the item level
and then aggregate it, in a meaningful and also efficient and scalable way, into what we actually want to know, which is similarity between products, or between users, or between patients. So what does domain knowledge look like in our case? Like I said, we know that images are connected to products.
And again, I can work really, really hard to say that two images, one of an iPhone cover and one of a man speaking on the phone, are related to the same product, but I know this already. This is something that our suppliers told me.
If I see that product A and product B share an image (again, this doesn't have to be a one-to-many relationship), and product B and product C have two images that are similar enough in a sense, I can say, through the transitive property, that product A and product C are somehow similar. Just to emphasize, I'm not saying they're the same thing.
I can say that product A and product B are 70% similar and product B and product C are 90% similar, and therefore A and C are, let's say, 80% similar if I take the average, or 90% if I take the maximum, or 70% if I take the minimum, or whatever. In essence, I want to make the link between products, and I can do it through the images, and I can also augment the similarity between the images through the products.
So what would that look like in terms of implementation? Here comes a little bit of math, apologies. Take the approach I like to call "once and done" (you will see why in a second): images that are linked to a product represent the same thing, and it's enough that two products share a single image in order to make them linked, or similar, in my domain. Then I can do something really, really simple.
I take the image similarity matrix that I precalculated and add on top of it a block matrix of ones, with a block for all the images that are linked to the same product (or the same patient, or the same user), and do some capping. There are some details to take care of here, but essentially I overlay the product links on top of the image similarity.
If I then take this matrix and run whatever clustering algorithm I want, or extract connected components, what I get is clusters of images. All I need to do then is extract the product IDs, and I have clusters of products. Because of the way this works, I have the guarantee that a product will always fall in a single cluster, so everything behaves nicely.
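The overlay step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the speaker's implementation: "capping" is read here as clipping values at 1, the clustering step is plain connected components on a thresholded matrix, and the function names, the `threshold` value, and the toy data are all hypothetical.

```python
import numpy as np

def overlay_product_links(image_sim, image_product, link_strength=1.0):
    """Add a block of `link_strength` for images of the same product, capped at 1."""
    products = np.asarray(image_product)
    same_product = (products[:, None] == products[None, :]).astype(float)
    # Assumption: "capping" means clipping the overlaid similarity at 1.
    return np.minimum(image_sim + link_strength * same_product, 1.0)

def product_clusters(overlaid, image_product, threshold=0.5):
    """Connected components of the thresholded matrix, as sets of product IDs."""
    n = overlaid.shape[0]
    adjacent = overlaid >= threshold  # assumed cut-off for "similar enough"
    seen = np.zeros(n, dtype=bool)
    clusters = []
    for start in range(n):
        if seen[start]:
            continue
        stack, members = [start], set()
        seen[start] = True
        while stack:  # depth-first search over image nodes
            i = stack.pop()
            members.add(image_product[i])
            for j in np.flatnonzero(adjacent[i] & ~seen):
                seen[j] = True
                stack.append(int(j))
        clusters.append(members)
    return clusters

# Toy data: images 0-1 belong to product "A", 2-3 to "B", 4 to "C".
image_product = ["A", "A", "B", "B", "C"]
image_sim = np.eye(5)
image_sim[1, 2] = image_sim[2, 1] = 0.9  # an image of A resembles an image of B
image_sim[3, 4] = image_sim[4, 3] = 0.2  # weak resemblance between B and C

overlaid = overlay_product_links(image_sim, image_product)
clusters = product_clusters(overlaid, image_product, threshold=0.5)
print(clusters)  # products A and B end up together, C stays alone
```

Because every pair of images from the same product gets a full-strength link, all of a product's images land in the same connected component, which is why each product falls into exactly one cluster. Passing a `link_strength` below 1 gives the weakened links mentioned next.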
Also, as a side note, the connections between the images that belong to the same product don't have to be absolute. They can be weakened: I don't have to use one, I can use weaker links, or counts of how many users viewed the same product, or something a little bit more sophisticated. I wanted to keep it simple here.
But this is a very strong assumption: that it's enough for two products to share a single similar-enough image, and that immediately makes them stick together. What we found is that this was a little bit too sticky, in a sense: too many products clustered together when we used this kind of approach.
So what we did instead was to build something a little bit more sophisticated. If you remember from whichever machine learning course you took, a similarity matrix is equivalent to a similarity graph. What you see here is images: each node is an image, and I colored, for your convenience, all the images that belong to the same product.
The links between the images represent the similarity. What I want to do is take all the images that belong to the same product and squish them together into a single node, and this is what is happening on this graph. These are called quotient graphs, or graph minors.
Then I need to decide: okay, I squished all the images into the product, which is the entity I want to analyze, so how do I decide how this product relates to that product? Well, in this case, I just took the maximal similarity over all possible pairs.
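As a rough illustration of the "squish images into product nodes" step, here is a minimal pure-Python sketch with hypothetical names and toy data. It collapses an image-level similarity graph into a product-level one and keeps the maximum weight among all image pairs that cross between two products. NetworkX offers `quotient_graph` for this kind of operation; plain dictionaries are used here only to keep the example dependency-free.

```python
def quotient_similarity(image_edges, image_product, aggregate=max):
    """Collapse images into product nodes and aggregate cross-product edge weights."""
    collected = {}
    for (img_a, img_b), weight in image_edges.items():
        prod_a, prod_b = image_product[img_a], image_product[img_b]
        if prod_a == prod_b:
            continue  # edges inside a product disappear once its images are merged
        key = tuple(sorted((prod_a, prod_b)))
        collected.setdefault(key, []).append(weight)
    return {pair: aggregate(weights) for pair, weights in collected.items()}

# Hypothetical toy graph: which product each image belongs to, and image-level similarities.
image_product = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "c1": "C"}
image_edges = {
    ("a1", "a2"): 0.3,
    ("a1", "b1"): 0.4,
    ("a2", "b2"): 0.9,
    ("b1", "c1"): 0.6,
}

product_sim = quotient_similarity(image_edges, image_product)
print(product_sim)  # {('A', 'B'): 0.9, ('B', 'C'): 0.6}
```

Swapping `aggregate=max` for `sum`, or for a function that counts weights above a cut-off, gives other ways of relating two products.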
Then I get the new similarity matrix and I can continue with my analysis. But I just said a few very dangerous words when you talk about scale: "all possible pairs". Yeah, this does not scale very nicely. We looked through different implementations, and the one implementation I found that worked out of the box, kind of immediately, is NetworkX in Python.
If you know of any other implementations that work nicely out of the box, please reach out and let me know, I'd be super interested. That was the one that looked the most promising, but it definitely doesn't scale very well. So I did a little bit more math, took my software engineering skills,
and wrote two packages, one for Python and one for R. You can find them on GitHub, so feel free to explore them and use them if you want. There's also a paper. As for the approach that I use: NetworkX is built on graphs, whereas I went back to the similarity matrix,
which is really nice because you can use sparse representations, so immediately the scale problem goes down at least one or two orders of magnitude, especially when you have sparse problems. If you do want to extend this further, I'm also building something for TensorFlow, so you can scale this up, in terms of the memory representation,
to really, really large similarity structures. The trade-off is that when you use this kind of matrix approach, you are limited to a specific class of aggregation functions. I mentioned maximum similarity; there is also total similarity, or counting how many pairs of images cross a certain threshold. It's a very wide class of functions.
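To make the matrix view concrete, here is a small NumPy sketch. It is an illustration only and does not reproduce the speaker's packages or paper, whose internals are not described in the talk. It assumes a binary membership matrix `M` (images by products) and an image similarity matrix `S` with self-similarity left at zero for simplicity; all names and the toy data are hypothetical. Total similarity and threshold counts reduce to the product MᵀSM, while the maximum is computed here with an explicit loop, which is simple rather than efficient.

```python
import numpy as np

def membership_matrix(image_product):
    """Binary matrix M with M[i, p] = 1 if image i belongs to product p."""
    products = sorted(set(image_product))
    column = {p: k for k, p in enumerate(products)}
    M = np.zeros((len(image_product), len(products)))
    for i, p in enumerate(image_product):
        M[i, column[p]] = 1.0
    return M, products

def total_similarity(S, M):
    """Sum of image similarities for every pair of products."""
    return M.T @ S @ M

def count_above(S, M, threshold):
    """Number of image pairs at or above `threshold` for every pair of products."""
    return M.T @ (S >= threshold).astype(float) @ M

def max_similarity(S, M):
    """Largest image similarity for every pair of products (plain loop)."""
    n_products = M.shape[1]
    out = np.zeros((n_products, n_products))
    for p in range(n_products):
        rows = M[:, p] > 0
        for q in range(n_products):
            cols = M[:, q] > 0
            out[p, q] = S[rows][:, cols].max()
    return out

# Toy data: images 0-1 belong to "A", 2-3 to "B", 4 to "C".
image_product = ["A", "A", "B", "B", "C"]
S = np.zeros((5, 5))
for i, j, w in [(0, 1, 0.3), (0, 2, 0.4), (1, 3, 0.9), (2, 4, 0.6)]:
    S[i, j] = S[j, i] = w

M, products = membership_matrix(image_product)
total = total_similarity(S, M)
count = count_above(S, M, threshold=0.5)
largest = max_similarity(S, M)
print(products)
print(largest)
```

The two matrix products stay valid if `S` and `M` are stored in a sparse format such as those in `scipy.sparse`, which is where the memory savings for sparse problems would come from.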
There's a paper on my website if you want to see why the mathematics works and convince yourself. I'm trying to publish it now, so if you want to collaborate, feel free to reach out. Again, sorry for the shameless promotion. Now, once we have an efficient, scalable way to aggregate the similarity from the image level to the product level,
we can start doing some really interesting things, because with this tool in hand we can aggregate things in interesting ways. We don't have to limit ourselves to images: we can go to text, we can go to everything else,
we can train weights, we can combine things. We might even want to go beyond products: instead of comparing trees, we can compare orchards or forests. There's a lot to be done here. Again, thank you for listening. I have to mention that we're hiring. Like I said, it's a global company and fully remote, so feel free to browse the website, and if you find anything interesting,
please mention me to the recruiters. Feel free to reach out to me on Twitter or on my website, and again, thank you for your time.