Privacy-Preserving Web Search
Formal metadata
Title: Privacy-Preserving Web Search
Series: Berlin Buzzwords 2023
Number of parts: 60
License: CC Attribution 3.0 Unported: You may use, modify, reproduce, distribute, and make the work or its content publicly available in unchanged or modified form for any legal purpose, provided you credit the author/rights holder in the manner they specify.
Identifiers: 10.5446/66632 (DOI)
Language: English
Berlin Buzzwords 2023, part 33 of 60
Transcript: English (automatically generated)
00:09
OK, so, hi everyone, and welcome to my talk about privacy-preserving web search.
00:22
So, here's just a quick overview of what I'm going to talk about today. First I'm going to introduce you to Qwant, which maybe not all of you know. Then I'm going to talk about the two main subjects: privacy, what it means for us and for our users,
00:40
and also web search specificities. Then I'm going to dive a little deeper into our ranking pipeline and what it means to improve relevance when you don't know anything about the users. And then I'm going to finish with a short outlook.
01:03
OK, so first about me: I'm Lara Panetti, I've worked at Qwant since 2020, and I started there as a data scientist specialized in natural language processing,
01:20
and now I'm kind of a machine learning engineer slash project manager. I work there in the search team, mainly on the index, so query processing, ranking, and evaluation of the whole pipeline. So, what is Qwant? Qwant is a web search company based in France, currently focused on the French market.
01:49
We do respect the privacy of our users, and some statistics that I can give you: we receive 200 million monthly requests, which represents roughly 6 million monthly users.
02:10
In the search team, we are a group of 50 people, and we are building our own search engine, from crawling to indexing and evaluation.
02:22
As I said before, we are focused on the French market, so our crawler is focused on French URLs, and we are focused on the relevance of our results on French queries. But you can use Qwant in Germany or in other countries; it's just that we will serve Bing results instead.
02:50
Some other statistics: we have 5 billion indexed web pages in our cluster, and as I said, we receive 200 million queries monthly, but 70% of those queries are unique,
03:09
so we haven't seen them before and we won't see them afterwards. OK, so first I wanted to focus on privacy: what it means for us and for our users.
03:25
So let me first introduce you to some buzzwords about privacy. Of course we follow the GDPR regulation, but on top of that we do not show consent banners to users,
03:42
which means a lot of things, but one thing is that we do not use tracking cookies. About our users: we only receive user queries, we log their clicks if there are any, and we also keep a hashed user IP, only because we are bound by law to do so,
04:04
but it's hashed, so it's not possible to make a connection between the queries, clicks and the IP. Of course there is no user session, so we do not record history,
04:21
and because of that we do not create filter bubbles, except that results can differ when we run A/B tests.
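As an illustration of what such hashed, unlinkable logging could look like (a hypothetical sketch, not Qwant's actual scheme), an IP can be stored only as a salted one-way hash, so the raw address never appears in the logs:

```python
import hashlib
import os

def hash_ip(ip: str, salt: bytes) -> str:
    """Return a one-way salted hash of a client IP; the raw address is never stored."""
    return hashlib.sha256(salt + ip.encode("utf-8")).hexdigest()

# Hypothetical: a random, regularly rotated salt means hashes cannot be
# compared across retention periods, so no long-term profile can be built.
salt = os.urandom(16)
digest = hash_ip("203.0.113.7", salt)
```

With a fresh salt per retention period, even identical IPs produce different digests, which is one way to keep the legally mandated IP record unjoinable with queries and clicks.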
04:40
OK, so what does this mean for users? We do not track their search history, as I said before, we do not monitor their behaviour, we do not have their precise location, and we do not have any demographic or socioeconomic information about our users,
05:02
such as gender, income level, age, etc. This ensures that search results remain unbiased with respect to those kinds of information, and free from potentially discriminatory personalisation.
05:22
OK, what does it mean for us? It means that there are a lot of queries that are really ambiguous, and if we don't know anything about our users, they can't expect us to be relevant when they search for best restaurants, or Python; I mean, we don't know if they are
05:47
fans of snakes or computer scientists. So how does it work in practice? This is an example of a query:
06:01
I queried qwant on Qwant in Germany, and this is the answer. So what do we record? We record the query, we record that the first thing the user did was to click on the first result, and then they went back to this page and clicked on the third one.
06:27
What we record for that is that some user has queried qwant and clicked on the first result, and another user has queried qwant and clicked on the third one.
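Since there is no session identifier, each click in the example above is stored as an isolated record. A minimal sketch of what such a log entry could look like (the field names here are hypothetical, not Qwant's actual schema):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ClickEvent:
    query: str
    clicked_rank: int  # 1-based position of the clicked result

# One real browsing episode with two clicks becomes two unrelated events:
# there is no user or session identifier that could join them back together.
log = [
    ClickEvent(query="qwant", clicked_rank=1),
    ClickEvent(query="qwant", clicked_rank=3),
]
```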
06:41
OK, so now that I've introduced you to the privacy focus, I also wanted to introduce you to web search specificities, because at Berlin Buzzwords there are not that many people working on the web.
07:03
So, the web is vast and constantly expanding, so our web crawler must handle the massive scale of the web. The web pages themselves are dynamic and constantly changing
07:22
in terms of content, structure and also availability. So our web crawler needs to adapt to these changes and revisit previously crawled pages to detect the updates and store them.
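As an illustration of the revisiting idea (a toy policy for the sketch, not the actual crawler's scheduler), a crawler can shrink a page's revisit interval when its content changed since the last fetch and grow it when the page was stable:

```python
def next_revisit_interval(current_interval: float,
                          changed: bool,
                          min_interval: float = 3600.0,          # 1 hour
                          max_interval: float = 30 * 24 * 3600.0  # 30 days
                          ) -> float:
    """Adaptive revisit policy: halve the interval (in seconds) when the page
    changed since the last crawl, double it when it did not, clamped to bounds."""
    interval = current_interval / 2 if changed else current_interval * 2
    return max(min_interval, min(max_interval, interval))
```

Multiplicative increase/decrease like this lets fast-changing pages converge to frequent revisits while static pages drift toward the maximum interval, keeping crawl capacity focused on the pages that actually change.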
07:40
And as you might know, a lot of web pages can be malicious or just have deceptive content, so our web crawler must detect this kind of content and avoid it by using machine learning,
08:04
for example spam detection techniques. So we have developed our own web crawler. At first we used the Apache Nutch web crawler, but we wanted our own to run 24/7,
08:21
and Nutch is a batch crawler. For now we crawl 3K URLs per second, 24/7, but the goal for the end of September is to crawl 10K URLs per second.
08:41
Once we have crawled our URLs, we do some feature extraction, which we split into two parts. Synchronous features are extracted from a single document: we extract content, title, et cetera, but we also use machine learning to extract other kinds
09:05
of information, such as language or topic detection. And we also use the information from the web crawler, the web graph it builds, to get some asynchronous features, such as popularity scores, PageRank, or also anchor text;
09:27
anchor text is the text describing page A on page B. Then we index those documents and those features using Vespa, and as I mentioned before, we have indexed 5 billion
09:50
documents in our Vespa cluster, so we do some query processing, but we do not use vector search
10:04
for now in production, and we have two-phase ranking, both phases learned with machine learning. The first one is a linear model, because it has to be fast, and then we have a LightGBM model; we tried all other kinds of models, like deep learning models,
10:26
but this one is the one that works best, so that's it. And now that I've introduced you to privacy and web, I wanted to dive a little deeper into
10:45
some challenges about how we improve the relevance of our ranking without knowing anything about our users. As I mentioned before, our ranking is based on machine learning,
11:03
so we have to use datasets. We do not use manually annotated data, first of all because it can be really expensive to produce, but also because manually annotated data is usually static, and it can't capture the evolving nature of the web and of user behavior,
11:28
which changes a lot through time. And the third reason is that we do not have that much data about the user; but the only dataset that we can use can be seen as a cheap implicit
11:47
relevance signal, which is the click information. So we use it as implicit feedback, but it's even more implicit since we don't have a user session. So each click event is
12:05
isolated from the others. So by using clicks, we know that we introduce some biases, such as the noisiness of a click, which means that a click doesn't mean that the document
12:23
is relevant, and the other way around: if a document was not clicked, it doesn't mean that it was not relevant. And there are also other biases, such as position bias or presentation bias: users tend to click on the top-ranked documents, even if the lower-
12:48
ranked ones are also relevant. So that's why we use a click model. So just a reminder of what we have: in our logs we have the query, the displayed documents, and the rank at
13:04
which a click occurred. And the goal is not to use the clicks as raw data, but to de-bias those clicks. That's why we use a click model, which is a Bayesian model, and we use a specific one called the cascade model. Basically, a cascade model
13:29
assumes that the user scans the results from top to bottom and chooses the relevant one. And we use the cascade model because it doesn't allow sessions with more than one
13:43
click, which is what we have. And then we get the probability of the attractiveness of the document given a query. So that's what we do to improve our ranking without
14:01
knowing anything about our users. And now just a short outlook. So, sorry. Yeah, we have to imagine new ways of understanding our users without knowing anything about them.
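The cascade model just described can be sketched as follows: assuming the user examined every result down to the clicked one, the attractiveness of a document for a query is estimated as the fraction of its examinations that led to a click. This is a simplified illustration of the idea, not the production estimator:

```python
from collections import defaultdict

def cascade_attractiveness(sessions):
    """Estimate P(attractive | query, doc) under the cascade model.

    Each session is (query, displayed_docs, clicked_rank) with a single
    1-based click. Under the cascade assumption, the user examined every
    document down to the clicked one: the clicked document was attractive,
    the skipped documents above it were not.
    """
    examined = defaultdict(int)
    clicked = defaultdict(int)
    for query, docs, clicked_rank in sessions:
        for rank, doc in enumerate(docs[:clicked_rank], start=1):
            examined[(query, doc)] += 1
            if rank == clicked_rank:
                clicked[(query, doc)] += 1
    return {key: clicked[key] / n for key, n in examined.items()}

# Two isolated single-click sessions for the same query, as in the talk:
sessions = [
    ("qwant", ["a", "b", "c"], 1),   # clicked the first result
    ("qwant", ["a", "b", "c"], 3),   # skipped a and b, clicked the third
]
attractiveness = cascade_attractiveness(sessions)
```

Here document "a" was examined twice but clicked once, "b" was examined once and skipped, and "c" was clicked every time it was examined, which is how the model de-biases position: skipped-but-examined documents are penalised, unexamined ones are not.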
14:24
We still have a lot of difficulties and challenges ahead of us, such as the query ambiguities that I mentioned before. And the next step is that we want to open up to other countries, European countries first, which will lead to other challenges, like new
14:44
languages to handle, but also new user behaviors. And we also want to use vector search, and maybe large language models, in production. And that's it. If you have any questions now or afterwards, you can talk to me or to Lara.
15:07
Thank you so much, Lara. So we have time for a couple of questions. Does anybody?
15:26
Hello. Thanks for the presentation, it was very interesting. I have a question about relevancy. You mentioned that you don't collect any data and you don't know who is making the request. In this context, how do you make sure that the response that
15:48
you give is relevant? Because if you don't know the person, if you don't have any context around the person that made the request, how can the results be relevant for that
16:00
particular person? They won't be. They won't be? I mean, it's just going to be relevant for the query, not for the user. So, I don't know, if you search for a recipe for an apple pie, we are just looking for apple pie recipes and not, I don't know, a certain domain that
16:23
the user wants. OK, so that's like the price to pay in order to... That's it. OK, thanks. Any other questions?
16:43
Probably removed from the focus of the presentation, but I'm just genuinely curious and really like the idea: how does monetization look for you? Monetization, how does it look for Qwant? Sorry, I didn't catch the word. Monetization. OK, sorry. How do we make money?
17:03
Yes. We do make money through advertising, as Google or other search engines do. But it's not personalized for a user, only contextualized for a query. So if you're looking for a chair, you're going to get an IKEA ad. That's it.
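Query-contextual ad matching of the kind described in that last answer, as opposed to personalised targeting, can be sketched like this (the keyword table and ads are invented for illustration):

```python
# Hypothetical contextual ad matching: the decision depends only on the
# query text, never on a user profile, history, or location.
ADS = {
    "chair": "IKEA furniture ad",
    "flight": "airline ad",
}

def pick_ad(query: str):
    """Return the first ad whose keyword appears in the query, else None."""
    words = query.lower().split()
    for keyword, ad in ADS.items():
        if keyword in words:
            return ad
    return None
```

Because the function's only input is the query string, two different users issuing the same query necessarily see the same ad, which is the contextual (not personalised) property described above.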