# Haystack: Query Logs, Click Logs and Insights
Formal Metadata

Title: Query Logs, Click Logs and Insights
Title of Series: Berlin Buzzwords 2020
Number of Parts: 48
Author: Peter Dixon-Moses
License: CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/68809 (DOI)
Language: English
Transcript: English (auto-generated)
00:09
Hi from New York, everyone, and thank you for tuning in. I'm Peter Dixon-Moses, and this talk is titled Query Logs, Click Logs, and Insights.
00:24
Here I am in a tree. I clambered into the search business a little over 10 years ago, and I've been exploring the different branches ever since. My work today can loosely be described as search relevance in the service of funnel optimization.
00:41
But this isn't a search relevance talk. It's also not a machine learning talk. It's a data and product talk. Much of what I'm going to cover can be improved with machine learning, and I'll point out where in particular. But the purpose of what I'm about to share with you is simply to get acquainted with a really useful data collection, which grows as your business grows.
01:09
So where do most search teams spend their time? Well, on findability and ranking
01:20
for searchable business data, of course. It's the product, the showcase, the reason your customers are there, the reason why you have a job. It's the most important data set. It needs domain experts. It gets the best metadata, the sophisticated ranking techniques, real-time updates, and badges, lots and lots of badges.
01:44
But what about Retail 101? Hi, what can I help you find today? Let me show you what we have available. Did you find what you were looking for today? It's not an easy experience to replicate in a digital environment.
02:01
Chatbots can be super annoying. And post-checkout surveys? I mean, when was the last time you filled out a post-checkout survey? But step this way into the world of search, where customers just can't stop sharing what they're after and where they found it. It's your query log.
02:21
Everyone has one, but not everyone uses theirs. This sample has outbound clicks connected with search terms, which is great because once queries are connected with conversions, customers can help answer questions about search intent.
02:40
It's also a graph. It's a graph of searchers, queries, and their clicks or other conversions. The queries provide a hint of search intent, the motivation for why this person is searching. Click conversions provide linkage between search intent
03:01
and a physical resource on your site, in your collection. Possibly with a level of interest, if you collect other signals like dwell time or carts or checkouts or anything further down the funnel. Once you have that linkage, it makes it possible to crowdsource some answers to common search questions.
03:24
So most of this talk is going to be covering some recipes for things you can do with search logs. And I'm going to be switching back and forth with some Jupyter notebooks. The link is there for anybody that wants to clone those
03:42
and try this out. So be aware that with any sort of crowdsourcing project, you don't want to just serve raw information from your logs without a filter because query logs are rife with misspellings, inadvertently disclose personal information
04:02
and all sorts of other things you probably don't want people to see. Okay, now that we've had a disclaimer, here we go. So this demo dataset is a bunch of real estate searches, about 50,000.
04:20
The fields are user ID, query, timestamp, position, document. Position is useful for search relevance, but we're not going to be using it in any of these demos; we're going to be working on features. My first recipe is auto-suggest.
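For anyone following along, here's a minimal sketch of loading a log like this into pandas. The file name and column order are assumptions based on the fields just described; the linked notebooks are the authoritative version.

```python
# Hypothetical loader for a click log with the fields described above.
# File name and column order are assumptions, not from the talk.
import pandas as pd

log = pd.read_csv(
    "real_estate_search_log.csv",  # hypothetical path
    names=["user_id", "query", "timestamp", "position", "document"],
    parse_dates=["timestamp"],
)
print(log.head())  # one row per (user, query, clicked document)
```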
04:44
So what I like to do when I think about working with query logs, is I think about what question would I ask the crowd? What question would I ask all these searchers to answer to help fill out this feature? So with auto-suggest, you've got a query that somebody's typed in
05:02
and you want to serve up all the other common ways to frame this question. So the way we're going to do that is we're going to take our log and extract the queries, right?
05:22
We're just searching queries; that's what auto-suggest is. We're going to reshape it down: group it by distinct queries, and count the number of unique users who executed that query and clicked on something.
05:41
And sometimes it's useful to know how old a query is, or rather, the last time it was in circulation, right? So the last timestamp can be useful sometimes. And then that's going to go in an index. And we're just going to run a search against whatever somebody's typed so far.
06:02
And we're going to try to incorporate the popularity of that search, right? That's the count, and maybe the recency, along with how close a match what somebody's typed is to the stored query. And that's going to be our auto-suggest. So here we go.
06:22
All right, so first thing is to reshape the data, right? So here's the source data. This is like what I was showing before. And we're going to get distinct queries out of this and count.
06:47
And that timestamp is the last timestamp, right? So for example here, this transformation says that 189 people searched for Prudential real estate. They ran that query.
07:01
189 distinct people, because this log has a new record for every single document somebody clicked, even after executing one search. You run one search, and you click on five things, you get five records, right? So it's important to be unique by user in this case.
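A minimal sketch of that reshape in pandas, assuming the `log` DataFrame from the earlier loading sketch: group by distinct query, count unique users (the log repeats a query once per clicked document), and keep the most recent timestamp.

```python
# Reshape: one row per distinct query, with a distinct-user count and
# the last time the query was seen. `log` is the DataFrame from above.
suggest = (
    log.groupby("query")
       .agg(count=("user_id", "nunique"),    # unique by user, as noted
            last_seen=("timestamp", "max"))  # recency signal
       .reset_index()
)
```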
07:23
So let's make sure it's in a shape Elasticsearch can use. This is how Elasticsearch consumes records, so we're changing it to this JSON lines format. We're going to load the index. This is the mapping we're using.
07:43
It's got the count. It's got the last query date. And it's got the query, the field for the query that people are typing. And it's got this search_as_you_type type, which is relatively new in Elasticsearch. It's a convenience feature that creates a bunch of ngrammed or shingled fields for you,
08:02
which gives you fields optimized for prefix searches. So it's a convenience thing; it's in the last few releases of Elasticsearch, and you could always build those fields manually, but this is convenient. Let's load the index.
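A sketch of that index setup with the Python Elasticsearch client (8.x-style calls; index and field names are illustrative). The `search_as_you_type` mapping creates the prefix-optimized subfields (`query._2gram`, `query._3gram`, `query._index_prefix`) mentioned above.

```python
# Create the auto-suggest index and bulk-load the reshaped rows.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # assumed local cluster
es.indices.create(
    index="autosuggest",
    mappings={
        "properties": {
            "query": {"type": "search_as_you_type"},
            "count": {"type": "integer"},
            "last_seen": {"type": "date"},
        }
    },
)
# Serialize timestamps explicitly so every document is plain JSON.
suggest["last_seen"] = suggest["last_seen"].dt.strftime("%Y-%m-%dT%H:%M:%S")
helpers.bulk(
    es,
    ({"_index": "autosuggest", "_source": row}
     for row in suggest.to_dict(orient="records")),
)
```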
08:21
All right, that's done. Only 50,000 records. It's great. Actually less, because we just compressed it. And now, so we're going to query for suggestions. And take a quick look at the query we're going to use. This is a query template. So it's in Mustache.
08:41
And it's trying to match the text of whatever's been typed with these different fields that were created by that search_as_you_type field, the subfields. And then we're going to influence the score by multiplying it by that count field, which we're
09:02
going to take the natural log of so it doesn't blow the score too high up. But this way, things that are queried more frequently will rise to the top. So let me pretend I'm typing here. I'm going to run some queries.
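A sketch of that query, with the Mustache template written out as a plain Python function instead: a `bool_prefix` multi-match over the `search_as_you_type` subfields, with the score multiplied by a log-dampened count. The `ln1p` modifier (ln(1 + count)) stands in for the natural log mentioned in the talk, since it avoids log(0).

```python
# Suggest-as-you-type: prefix match weighted by query popularity.
def suggest_queries(prefix, size=10):
    return es.search(
        index="autosuggest",
        size=size,
        query={
            "function_score": {
                "query": {
                    "multi_match": {
                        "query": prefix,
                        "type": "bool_prefix",  # prefix on the last term
                        "fields": ["query", "query._2gram", "query._3gram"],
                    }
                },
                # Popular queries rise without blowing up the score.
                "field_value_factor": {"field": "count", "modifier": "ln1p"},
                "boost_mode": "multiply",
            }
        },
    )

for hit in suggest_queries("presc")["hits"]["hits"]:
    print(hit["_source"]["query"])
```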
09:22
All right, 'p'. Let's see. All right, there's some p's. 'R'. Prudential is still up top. But now 'e', so I'm going to lose Prudential. And there's Prescott, and some stuff for Prescott, Arizona; Prestige; Tennessee. 'S', 'c', and we get Prescott.
09:47
Everything's got Prescott, right? So five letters, and I'm already down to a bunch of stuff about one city, which is great. And that's what we want out of auto-suggest. And notice it's also finding infix matches, right?
10:01
So it's not just queries that start with Prescott, right? It can be at the end. It can be in the middle. So all right, that was auto-suggest. The second recipe we're going to look at is related queries.
10:25
So related queries you often see at the bottom of a search results page. Because by the time you get there, and you scroll down and say, all right, well, these 10 don't help me, or however many results you have on that first page, what are some related searches?
10:44
And these often help people specialize their search. They may help them broaden it out to an adjacent area. And one way we can use query logs to help generate related searches,
11:04
thinking of it like a graph, is to look at other queries that found the documents you found. So I made a pretend scenario here, because I don't actually have the search engine, I just have the logs. So this is a search for licensing. And it returned, among nicer-looking stuff, these 20 links.
11:28
So these are 20 links about licensing in the real estate domain, and they're 20 links from the log I'm working with. So we want to ask the crowd, what are some
11:42
of the other commonly answered questions by the documents you found? What are the other queries people ran to find and click on the documents that have just come up on your first page of results? All right, that makes sense. So we're finding the other ways in.
12:00
So we found some things that are relevant to licensing, but we want to know what other topics they're relevant to, because those will be your related searches. So this time we need the query and the documents. And we're going to group it, this time by the document.
12:21
So it's a different index, different shape, different transformation. So we're going to group by distinct documents. And then we're going to aggregate, list up all of the queries that found each document in an array, in a list.
12:42
Every query that resulted in a click on that document. So we're using that as a form of crowd proof that that document's relevant to that query, that click. And then when we serve it up, we want to display the little widget at the bottom of the search results page.
13:01
We just want to match all of the documents that came up on that first page. Sometimes you can fetch more if it's helpful, but since you have those already, you can match those ones on the first page. And then just aggregate across queries and see which ones bubble up. So here we go, related queries.
13:27
So here's the source data, looks exactly the same. But this time we want to transform it so that the document column is the key, the distinct key.
13:42
And with those queries, whenever multiple queries find the same document, we want a list of those. So here it goes. All right, so now we've got these document links. In most cases, it's just one query, for some of these obscure ones. But here you have Grand Junction,
14:00
a couple of Grand Junction queries, maybe more than two, that found this 126 real netsystems.com link. All right, so this is the shape we want. We've got to convert that into a form that Elasticsearch will like, which looks like this.
14:22
All right, you've got your document links, and you've got your query array. We're going to create an index with this mapping, a real simple mapping. Nothing is even analyzed, right?
14:41
This is just a keyword field for the document and keyword field for the query. So these are only used for exact matching. This is really basically just graph traversal in a search engine. So let's load it. Done. All right, great.
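A sketch of the transform and index for this recipe, under the same assumptions as before: group the log by document, collect every query that led to a click on it, and map both fields as plain keywords since the lookups are exact.

```python
# One record per distinct document, with the list of queries that
# found it (the "crowd proof" described above).
by_doc = (
    log.groupby("document")["query"]
       .apply(lambda q: sorted(set(q)))
       .reset_index(name="queries")
)

es.indices.create(
    index="related_queries",
    mappings={
        "properties": {
            "document": {"type": "keyword"},  # exact match only
            "queries": {"type": "keyword"},
        }
    },
)
helpers.bulk(
    es,
    ({"_index": "related_queries", "_source": row}
     for row in by_doc.to_dict(orient="records")),
)
```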
15:04
So here's the query we're going to use. This is a Mustache template; this is one of the ways that Elasticsearch allows you to run queries by substituting variables. So this is going to take the list of documents
15:20
that I pass, in this case those links, and substitute them in for this block. It's going to do a terms query with all of those documents. So it's going to find those documents in the index. And then it's going to aggregate the queries that found those documents. Right? Super simple. So it's going to reach into those arrays,
15:42
those lists of queries for each document, and make a single aggregate with a count for those, just to see what bubbles up, right? So here goes. This is the fake query I created, the fake results. These are the document links.
16:00
And the results, they're real results from the log. But this was in response to my query for licensing over this real estate data set, right? So these are the things that came back for licensing. We're going to see what other queries caused people to click on these links. That's what this query is going to run.
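The same lookup as a plain Python function rather than a Mustache template: a terms query over the first-page document links, then a terms aggregation over the query lists to see what bubbles up.

```python
# Related queries: which other queries found these documents?
def related_queries(doc_links, size=10):
    resp = es.search(
        index="related_queries",
        size=0,  # we only need the aggregation
        query={"terms": {"document": doc_links}},
        aggs={"related": {"terms": {"field": "queries", "size": size}}},
    )
    return [b["key"] for b in resp["aggregations"]["related"]["buckets"]]
```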
16:21
Here goes. Cool. So: broker real estate license in Maryland, California real estate license, Connecticut real estate license. These different search intents in this log break across states. It's US, obviously, but it breaks across states and cities.
16:42
So there's a lot of cool stuff you could do with geolocation. But right now this is going to just be scattershot. There's licensing classes. Real estate classes doesn't even have the word license in it. So that's cool, right? We get that back. New Jersey real estate courses.
17:01
So we can get that back as related to licensing, because you got to take a course to get licensed. But the word license isn't in there. So that's related queries. It's fun stuff, right?
17:21
So the third recipe is synonym candidates. Part of the job of a search relevance engineer is keeping track of synonyms within a business domain. This is super important for precision, especially with concepts in your domain
17:43
that have more than one word in them. So for example, I live in New York. When you stand up a search engine for the first time and you index a lot of content that has, let's say, US states in it, anytime somebody types New Mexico, New Jersey, New York,
18:00
they're going to get all of those coming back, because they all match new, right, until you start working on precision. And there are some cases where you want that. But if you've already said New York, the search engine should be smart enough to say, hey, New York is a thing. It's a thing that has these two terms. Let's make it one thing instead of two things
18:21
and not search for the independent parts. Real estate's another one of those too, right? So grooming your synonyms is a big part of search engine upkeep. And it's useful to have tooling that helps suggest new synonym candidates.
18:41
So this is something that a back-end relevance engineer or a taxonomist would use to find new synonym candidates. So take an example, again working with this concept, license. We're going to try to find in the query logs all the other phrases or short phrases
19:07
or terms that appear around license. You can do this a lot better using NLP tools, but I just wanted to illustrate the data for this purpose.
19:21
So we can just do a proof of concept here. So what we need is the queries and documents again. This is the same shape as the last one: distinct documents and a list of queries that found that document, that clicked on that document. And then we're going to match on license,
19:46
licensing, the concept that we want to find synonyms for. And then we're going to query the different ngrams, all the different combinations of adjacent terms, in the index. So here we go.
20:07
So synonym candidates, same shape, document, and a list of queries. Here's our same source data. And transform it the same way we did last time.
20:22
Load it. For this index, we're going to use search_as_you_type again, because it's convenient for the ngrams for those adjacent terms. So that's slightly different.
20:40
And we're going to load it. Cool, that's done. All right, here's the query we're going to use. This is a multi-match again across all these different ngram fields that were created by the search_as_you_type type. You're going to say 100% needs to match, because if you are looking for a concept
21:04
that has multiple terms, if you're looking for New York, you want to find New York City; you don't want to find New Mexico as a synonym. Or New York bagels or something; that may not be a synonym, but it's a related term. So we're going to run the query,
21:21
match everything that matches at 100% what you've typed in the queries. And then we're going to aggregate up the significant one-term queries, the significant two-term queries, and the significant three-term queries that we found in the query log.
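A sketch of that query, assuming an index shaped like the related-queries one but with `queries` mapped as `search_as_you_type` (so the `._2gram` and `._3gram` shingle subfields exist). Whether `significant_text` plays nicely with `search_as_you_type` subfields is an assumption on my part; explicitly mapped shingle fields would be the conservative alternative.

```python
# Synonym candidates: require a full match on the seed concept, then
# surface significant co-occurring 1/2/3-term phrases.
def synonym_candidates(concept):
    return es.search(
        index="synonym_candidates",  # illustrative index name
        size=0,
        query={
            "multi_match": {
                "query": concept,
                "fields": ["queries", "queries._2gram", "queries._3gram"],
                "minimum_should_match": "100%",  # every term must match
            }
        },
        aggs={
            "unigrams": {"significant_text": {"field": "queries"}},
            "bigrams": {"significant_text": {
                "field": "queries._2gram",
                "source_fields": ["queries"],  # subfield text lives here
            }},
            "trigrams": {"significant_text": {
                "field": "queries._3gram",
                "source_fields": ["queries"],
            }},
        },
    )
```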
21:40
And we're going to filter out the one you're looking for. So let's run this and see what happens. So again, we're looking for license. This one is less spectacular. We're going to parse out the synonym candidates. Oh, and it blows up.
22:00
Hold on, let's skip something.
22:23
I was trying to get fancy, and it blows up because we get license back. That's the input. And you'd strip that out before you showed candidates,
22:41
synonym candidates to somebody. But here are some other terms that are related to license. So: conditional requirements, real estate license. That's a good one, especially in this domain, as a thing that you might want to make synonymous with license, because in this domain, if this was your business, that would be synonymous.
23:04
Licenses, renewal, department licensing. Basically, this just gives somebody who's building a controlled vocabulary a lot of fodder for constructing that controlled vocabulary so they can make sure that when people search for an exam,
23:21
that they find the same things that somebody would find for a licensing exam, for example. So there you go. So that's a bunch of different synonym suggestions. You take those and build up synonym tables, taxonomies, suggestions, all sorts of fun stuff.
23:43
All right, recipe four. This is the last one. This is taste profiles. A taste profile is what your searcher is interested in; these are the signals your searcher has provided to let you know what they're after.
24:04
Their interests. And one of the things you use taste profiles for is recommendations. Search logs are a great source of taste profiles. So what we're going to do this time,
24:21
first of all, we need the user ID because we're looking for user and the query and the document. And this time, we're going to group it by user. So distinct user and then a list of queries that they've run and a list of documents that they've clicked. A super simple taste profile. If you have metadata or other things like that
24:41
on your documents, that's often really useful too because it could be more coarse-grained interests. In this case, just queries and documents. And then once you have a list of these users with their bags of queries and documents, then you can look for similar users. It's sort of like a collaborative filter
25:03
lite. You look for similar users, and you try to find things you can recommend that other users have found that your user has not yet found or not yet queried. So, taste profiles.
25:24
Same source data. This time, we're going to transform it to keep the user and group by user. And we get a list of queries and a list of documents for each user. So in this case, user 756 searched for
25:41
Chesapeake Real Estate Assessor and Virginia Beach. And at some point, they clicked on this Chesapeake City link and at some point some other link. So that's what this shape tells us. So we're going to make sure it gets converted into an Elasticsearch-happy format.
26:04
We're going to create an index. This is the mapping, so new index for this. Notice I like to make lots of new indexes. This is index per feature. Often you want to scale things differently, so it's good to keep things compartmentalized.
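A sketch of that transform and index, again under the earlier assumptions. Plain keyword fields are shown here for simplicity; as noted next, the talk reuses search_as_you_type.

```python
# One taste-profile record per user: bags of queries and clicked docs.
import pandas as pd

queries = log.groupby("user_id")["query"].apply(lambda s: sorted(set(s)))
documents = log.groupby("user_id")["document"].apply(lambda s: sorted(set(s)))
profiles = (
    pd.DataFrame({"queries": queries, "documents": documents})
      .reset_index()
)

es.indices.create(
    index="taste_profiles",
    mappings={
        "properties": {
            "user_id": {"type": "keyword"},
            "queries": {"type": "keyword"},
            "documents": {"type": "keyword"},
        }
    },
)
helpers.bulk(
    es,
    ({"_index": "taste_profiles", "_id": str(row["user_id"]), "_source": row}
     for row in profiles.to_dict(orient="records")),
)
```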
26:20
This is using search_as_you_type again. Not sure if it needs to, but we've got the query and the document and the user. It's just the same triple, the core triple of the query log. And we're going to load it. Here we go. It's done. All right, so we're going to take one demo user
26:43
that I just grabbed from close to the top of the list, and see what that one record looks like for them. So this is the taste profile for user 1008. They did a bunch of searches for Cape Cod, New Hampshire, Maine, Falmouth.
27:04
So it's New England real estate searches. New Hampshire, different parts of Massachusetts, somebody's particular realty, Marcy Brodney area. And they clicked on a bunch of these sites.
27:21
So let's ask the crowd. Here's our query to the crowd: who else shares these interests? So this is the more_like_this query. And we're looking at similarity across the query and document fields for all these users. Who's got the most similar queries and documents
27:41
for this particular user? Because we want to look at what they have that this user does not have. Here's our user going into this index, and we're looking for users that are like this user along query and document. Some tuning parameters. And we're going to get out a bunch of users that are similar.
28:00
And we're going to aggregate their queries and see what bubbles up. We're going to aggregate their documents and see what bubbles up. But in this case, I'm just going to pull the queries for us to look at. Just because that's what we've been looking at for everything else.
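A sketch of that lookup: `more_like_this` against the user's own profile document, a `sampler` aggregation so only the most similar profiles are counted (my stand-in for "a bunch of users that are similar"), and a client-side filter that strips queries the user has already run, as described next.

```python
# Recommend queries from the taste profiles of similar users.
def recommend_queries(user_id, size=10):
    uid = str(user_id)
    resp = es.search(
        index="taste_profiles",
        size=0,
        query={
            "bool": {
                "must": [{
                    "more_like_this": {
                        "fields": ["queries", "documents"],
                        "like": [{"_index": "taste_profiles", "_id": uid}],
                        "min_term_freq": 1,  # illustrative tuning values
                        "min_doc_freq": 2,
                    }
                }],
                "must_not": [{"ids": {"values": [uid]}}],  # not ourselves
            }
        },
        aggs={
            "similar": {
                "sampler": {"shard_size": 25},  # top-scoring profiles only
                "aggs": {"queries": {
                    "terms": {"field": "queries", "size": 50}
                }},
            }
        },
    )
    # Strip queries the user has already run.
    seen = set(es.get(index="taste_profiles", id=uid)["_source"]["queries"])
    buckets = resp["aggregations"]["similar"]["queries"]["buckets"]
    return [b["key"] for b in buckets if b["key"] not in seen][:size]
```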
28:23
So here's our user, 1008. And here's some suggested queries. So a lot of them are redundant. Because there already has been some New Hampshire interest. But it's been specific things in New Hampshire. It's not the exact same thing.
28:40
Oh, I went in here and I specifically stripped out all of the queries that this user had already run, everything that was already in their taste profile, so they don't see anything that they've done before. So it's all new queries for them. Maine real estate. And these are the most common ones. So that's why it's broader first.
29:00
And then more narrow. Bath, Maine. Brunswick. Somebody's looking at North Carolina. Oregon. Some different outfits. Some of these. So it's a little bit across the board. And down here, it's starting to be more one-offs. But the common ones that bubbled up are these Maine and New Hampshire.
29:21
And some of these particular popular places in Maine and New Hampshire. Oh, we can do the documents. Let's see. So here's suggested documents. Although this is a little bit unfocused, because we're taking somebody's whole history and saying, oh, look at these documents. If you're going to do a document recommendation feature, you would probably do it with more of a condensed time frame
29:43
and decay things out over time as somebody loses interest. So those are our recipes. And that's the presentation. Thank you very much. And the link is there for the notebooks
30:01
that I just showed you. And if you have any questions, you can find me either at searchintubition.com or on LinkedIn. And thank you very much. And I hope you enjoy the rest of the sessions. And thank you, Peter. And Peter joins us now live to answer any questions you have on his presentation.
30:22
So I'm just going to have a quick look at the slide. Oh, can we hear you there, Peter? I can now. Can you hear me? Thank you. OK, we'll just check the Slack. David says, nice bikes.
30:41
But we're going to move on to some questions first. Zenit asks, would it make sense to add terms that were in the queries but not in the documents and then add them to the documents? This is towards recipe two. Sure, Zenit. I saw a similar question on the Zalando talk from you about broadening queries out, even misspelled queries,
31:04
and figuring out how to make that direct connection, right? They were using a neural IR approach, which takes into account how likely it is that those misspelled queries actually have to do with the items
31:20
that were found. And you could certainly do what you're describing, but it becomes more of a binary solution, right? It either finds it or it doesn't. There's no nuance. Mostly for this talk, I just wanted to focus on what kind of information you can get out of the logs. And there's certainly a ton of different things
31:40
you could subsequently do with it, whether it's feeding a manual process or an automated process or a semi-automated process. It's a good question. And that's the kind of insight that you really do want to get from your users, is what is the lexicon mismatch between your users and your collection.
32:03
Fantastic. Thank you. And I know you've partially answered this in the Slack, but Tito asked, apart from user query, timestamp, position, and document, is there any other search transactional log metadata you have found useful to store? Well, this is one where I'm sure there's
32:21
a ton that just could be done here, right? It's just flipping around and looking at the experience completely from the user's shoes to look at the collection, right? So it's pretty greenfield. The thing that I'm particularly interested in
32:40
is doing entity recognition over query logs to really start to understand, what are the domain-aligned concepts that exist in there? And how can you use that to help search relevance, right, in terms of query rewriting and everything else, or query segmentation, right?
33:01
How do you train a classifier to figure out what kind of query this is? But also, my answer here is just with respect to taste profiling, if you're able to pull out entities and especially control domain entities, then you've got a much better signal that is domain-aligned for what this user is interested in, right,
33:21
if you want to recommend them content in the future than if you're just working with terms. Fantastic. So that's it for the questions in the channel. I perhaps got one for you. So you've done recipes one to four. Give us an idea of the next recipe you could do.
33:43
Sure, I think I went as far as I could without machine learning. And that was sort of the point of the talk. There's probably some other cool stuff you could do here, just sort of exploring that graph. Also, the data set itself is really limited; it was hard just trying to find a data set that I could use.
34:02
But obviously, if you've got the full collection and if you've got full user profile information, you could analyze this much more deeply, in terms of identifying what this user's persona is based on what you know about them, and how you would use that. How would you interpret that from their queries, right? Or what's all the metadata on those documents
34:23
that's useful in really understanding query intent, or especially in terms of expanding the context. This is something that's come up in the neural IR tracks a lot, right? When you're working with embeddings,
34:41
they typically do better with longer queries. And one of the ways that people overcome this, because most queries are short queries, right, or the common ones are, is that they will expand user queries with additional context from the user's profile, right? Whether it's information about the user or past searches they've run or things like that, right? So this is another sort of wide open area to explore.
35:04
And I also really like the Zalando idea of sessionization: just taking everything that a user, in this case, has searched in a day and everything they clicked on in a day, dumping it into a bag, and saying, okay, this is the intent
35:22
and this is the actual activity of this user for today. And if we look at enough users like that, you start to see patterns, and it starts to relate queries and items, right? So there's definitely much, much more that could happen here in the NLP
35:44
and the machine learning vector embedding space. Great, thank you. So there's one more question from Tito here. When generating related queries and synonym candidates at runtime, do you limit the depth of the documents analyzed, for performance reasons? So this was a toy data set.
36:01
So there was no need to really do that, but there is a problem of course, which is that when you start to get really down the tail of frequency of occurrence of something, like it gets very noisy, right? So there are practical reasons at scale to want to look for a certain level of agreement between users on some of these areas, right?
36:23
So like something might not truly be a synonym candidate unless it shows up frequently, right? So when I think about depth, I think about like how far down the tail of people's queries do you go before you chop it off, right?
36:45
And especially with auto-suggest, that's a big deal, right? I mentioned filtering, right? If you're using raw queries, you get profanity; depending on what your search engine is, you've got all sorts of content that you don't want to just immediately surface. And a big part of that filtering gets done for you
37:02
if you just look for consensus. So for suggestions, look at the terms most commonly typed along with this term, and you'll often get the best suggestions. That said, you still have to filter out the things you don't want in there, but a lot of the noise comes out by just deciding: this is where I want to cut the tail.
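To make that concrete, a tiny sketch of the consensus cut applied to the auto-suggest table from recipe one: drop any query that fewer than N distinct users ran before serving it. The threshold is illustrative and depends on your traffic.

```python
# Cut the tail: only suggest queries with some user consensus.
MIN_USERS = 3  # illustrative threshold; tune to your traffic
suggest_trusted = suggest[suggest["count"] >= MIN_USERS]
```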