Scaling Facets to the Stars
Formal Metadata

Title: Scaling Facets to the Stars
Title of Series: Berlin Buzzwords 2021
Number of Parts: 69
License: CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/67355 (DOI)
Language: English
Transcript: English (auto-generated)
00:07
So hi, everyone. I'll start by introducing myself. Who am I? Well, my name is Shikhar Srivastava. I'm a software engineer on the News Search team at Bloomberg, in the European headquarters in London.
00:20
I'm originally from India and have been with Bloomberg since I graduated from college a few years ago. I have experience working across different technology domains, from machine learning and market data to distributed search engines. I'm here to talk about some of the interesting problems we're trying to solve at Bloomberg News. So with this, let's begin our talk.
00:42
And let's start with a quick chat about one of the problems we're trying to solve. We deal with a lot of search queries in News Engineering every day. Some of these are simple search queries, which are easy to handle. But some of them need a bit of extra effort. Let us look at the query written
01:00
on the slide, which answers questions like: give me the number of stories matching apples and oranges. This is just a simple search query that any search engine can resolve. Things get a bit more interesting when I add this bit. Now the query becomes: give me the number of stories matching apples and oranges for the last five years, aggregated by day.
01:23
This has made the query much more complicated for a simple search engine to resolve, because now there are elements of analytics and time series attached to the search query as well. So the goal here is to solve this problem and build a system that can answer user queries
01:42
like this one in an interactive manner. By interactive, I mean the end users of this service are going to be humans and not scripts. We will show you in this talk how we achieved this by scaling Solr facets to the stars. OK, so before going any further, let me try to answer the first question you all might have,
02:03
which is: why Solr? We did some initial research and analyzed a bunch of solutions available out there, some of which, like Druid, Solr, and Cassandra, are mentioned here on the slide. We compared the solutions against each other based on certain parameters: responsiveness, what query flexibility we have,
02:21
what the operational costs of setting it up and maintaining it are, and what analytical features are supported. We found Solr to match everything we needed. And again, this is a subjective choice depending on the use case; for our use case, Solr provided us with everything that we needed, so we decided to go ahead with it.
02:42
OK, so now let's have a quick intro to what Apache Solr is before we move any further. Apache Solr is an open source, enterprise-ready search engine. It is based on another open source Java library called Apache Lucene. Lucene does the actual heavy lifting in Solr: the indexing, searching, and everything else. Solr is a layer on top of Lucene,
03:02
which makes it distributed, scalable, and reliable. This is what makes Apache Solr an enterprise-ready solution for search. Cool. So this talk is about scaling facets, so it's really important to understand what faceting is, right? Faceting is a popular search feature
03:20
which most modern search engines provide. It allows you to group your search results based on certain dimensions or fields of the results. Let's understand this with a simple, quick example. You can see a job search page here, which has a list of all the jobs available at Bloomberg. As you can see, there are around 544 results in total.
03:41
By the way, we are actively hiring on my team. So yeah, that's something. OK, coming back to the example: every job in the results has many more fields associated with it, some of which we can see on the screen, like the experience and the location, and many more behind the scenes that we don't really see in the search results.
04:01
A quick example: the checkboxes that you see on the left-hand side are actually faceting in action. This part of the webpage uses faceting on the field location to group all the results based on the value of that field. The way it works is that the search engine takes all the unique values of this field
04:21
from the search results, creates that many groups, and then assigns every document to one group. So as I said earlier, faceting basically just groups the search results based on a field.
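To make this concrete, here is a minimal sketch of what such a terms facet might look like with Solr's JSON Facet API; the collection name, endpoint, and "location" field are illustrative assumptions, not the actual Bloomberg setup:

```python
import requests  # assumes a local Solr with a hypothetical "jobs" collection

# Terms facet: group all matching jobs by the unique values of "location".
response = requests.post(
    "http://localhost:8983/solr/jobs/query",
    json={
        "query": "*:*",   # match all documents
        "limit": 0,       # we only want the facet counts, not the documents
        "facet": {
            "by_location": {"type": "terms", "field": "location"}
        },
    },
)
for bucket in response.json()["facets"]["by_location"]["buckets"]:
    print(bucket["val"], bucket["count"])  # e.g. "London 120"
```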
04:41
OK, so what's special about Solr? Solr provides a way for this dimension, or field, to be a time range. Now, that gets interesting. Let us look at a quick example of what this means. Assume every search result also has a field called date posted, which just stores the timestamp of when a particular job was posted. If Solr were the search backend used here, I could say: hey Solr, give me all the results faceted on the field date posted,
05:03
and keep the time range to one day. What Solr would do then is create one-day-long groups and assign documents to the correct groups based on the value of their date posted field. The interesting thing is that I can set the time range to anything: one day, one week, one month,
05:22
or just about anything else. This is the feature that actually makes it possible for Solr to solve some of the problems that a traditional time series database would solve, and it is why we chose Solr as well. This is the Solr feature that we will use to solve the problem we discussed earlier.
05:42
OK, moving forward. Now we know that range facets, which we discussed earlier, can be used to solve our problem, but it's not really that straightforward. Faceting has its own challenges. First, faceting is actually very slow compared to a simple search request. The latency of a faceting search
06:00
grows tremendously with the number of documents we have. The next major problem is that faceting is very resource intensive: it takes a lot of CPU when there are a lot of documents to count. And at Bloomberg scale, we have around 3 billion news stories on which we want to apply this faceting solution. Our expected latency
06:20
is somewhere between one and two seconds for the 90th percentile of the heaviest queries. Heavy queries are queries that match a lot of documents. And again, to remind you, this is for the analytics use case and not for the search use case, where the latency should be very small. Moving ahead, let's get started. Now that we know the problem
06:41
we are trying to solve and what challenges it has, this talk will present how we solved it by following an experimental approach. Okay, so before we start with the experiments to solve the faceting challenge, we need to have some basic components ready. The first of these is obviously the Solr setup itself. Solr needs a schema defined
07:00
before we can start ingesting data into it. The data ingested into Solr is called a document, and you can imagine it as a simple JSON object which has all the fields defined in the schema and their values. We started out with a very simple Solr schema which has just three fields: the timestamp, the company, and the topic. And we also did some basic optimizations
07:20
for the faceting use case. We use a special data structure called docValues to store the fields that we will be faceting upon; we will discuss what docValues are later in this talk. The next optimization is that, since we are only concerned with the counts of the results and don't really need to return any content,
07:42
we just don't store the data on disk. This saves us some space and makes things slightly faster, because there are no reads and writes of stored content. The next important component is the query that we will be using for the faceting. It is a simple JSON facet query from the Solr API.
08:00
The query you see on the screen basically asks Solr to give us the results from a start timestamp to an end timestamp, aggregated by a period called the aggregation period, which you can see as gap in this query.
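Since the slide itself isn't visible here, a sketch of what such a range facet query plausibly looks like follows; the collection name, field name, and the concrete start/end/gap values are assumptions:

```python
import requests  # hypothetical "news" collection with a "timestamp" field

facet_query = {
    "query": "apples AND oranges",   # the search part
    "limit": 0,                      # counts only; content isn't stored anyway
    "facet": {
        "counts_over_time": {
            "type": "range",
            "field": "timestamp",
            "start": "NOW-5YEARS/DAY",  # start timestamp
            "end": "NOW/DAY",           # end timestamp
            "gap": "+1DAY",             # the aggregation period
        }
    },
}
response = requests.post("http://localhost:8983/solr/news/query", json=facet_query)
for bucket in response.json()["facets"]["counts_over_time"]["buckets"]:
    print(bucket["val"], bucket["count"])  # one count per day
```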
08:22
Okay, so now let's take a quick sidestep and try to understand what docValues are, which we just mentioned, and see how they speed up faceting. Let us take an example where we have three documents in our Solr collection, as stated here. These three documents have just two basic fields, domain and company. This is a simpler example than our earlier schema, just to make it easier for everyone
08:42
to understand docValues. If we ingest all this data into Solr and index these two fields, Solr will store an inverted index data structure that looks something like this. Now let's say we get a simple query, and the query says: give me all the companies starting with an L.
09:01
Solr will just look up the inverted index for the field company and find that documents one and two match the query, because in the inverted index that's really straightforward to get. Now, what if the query also had a facet part and asked Solr to facet the results based on the domain field of the results we got?
09:23
For this, Solr would have to know what values are stored in the field domain for documents one and two. And that's not really straightforward, as you can see from the inverted index: Solr would have to look through the entire index to find the values of this field for those documents.
09:43
And this is where docValues actually come into play. If the field domain were marked as a docValues field, Solr would store a column-oriented data structure like the one you see in green. Then it can very quickly find the value of the field domain for the documents it matched,
10:01
which are documents one and two. This is how docValues speed up faceting a lot, and not just faceting but other features like sorting as well.
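As an illustration, enabling docValues on a field can be done through Solr's Schema API; this is a sketch with the example's field names, not the schema actually used in the talk:

```python
import requests  # hypothetical "news" collection

# Define facet-friendly fields: docValues on (fast faceting and sorting),
# stored off (we only ever need counts, never the content back).
schema_changes = {
    "add-field": [
        {"name": "domain", "type": "string",
         "indexed": True, "stored": False, "docValues": True},
        {"name": "company", "type": "string",
         "indexed": True, "stored": False, "docValues": True},
    ]
}
requests.post("http://localhost:8983/solr/news/schema", json=schema_changes)
```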
10:20
Okay, now coming back to our discussion of the faceting challenge. Remember, we don't have a final solution yet, so we need to follow an experimental approach and move ahead in iterations. The most important thing we need for that is a well-defined process. The process outlined here is relatively straightforward. We first load all our data into Solr. The next step is to send a lot of faceting queries to Solr. Then we capture the results from each of these queries and generate reports, insights, and aggregates.
10:42
Afterwards, for the next iteration, we change some variables, like the queries themselves, the load distribution, the parallelism, or anything else. And then we repeat all over again from step two, which is sending the queries to Solr again. We keep repeating this process until we reach our final goal and meet our expectations.
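As an illustration, steps two to four might look something like this naive Python script; the Solr endpoint, field name, and parameter ranges are hypothetical:

```python
import random
import statistics
import time

import requests

SOLR = "http://localhost:8983/solr/news/query"  # hypothetical endpoint

def run_iteration(num_queries: int, gap: str) -> None:
    """Steps 2-3: fire random time-range facet queries, report the p90 latency."""
    latencies = []
    for _ in range(num_queries):
        year = random.randint(2016, 2020)
        body = {
            "query": "*:*",
            "limit": 0,
            "facet": {"counts": {
                "type": "range", "field": "timestamp",
                "start": f"{year}-01-01T00:00:00Z",
                "end": f"{year + 1}-01-01T00:00:00Z",
                "gap": gap,
            }},
        }
        t0 = time.monotonic()
        requests.post(SOLR, json=body).raise_for_status()
        latencies.append(time.monotonic() - t0)
    p90 = statistics.quantiles(latencies, n=10)[-1]
    print(f"gap={gap} queries={num_queries} p90={p90:.3f}s")

# Step 4: tweak a variable (here, just the aggregation period) and repeat.
for gap in ("+1DAY", "+7DAYS", "+30DAYS"):
    run_iteration(num_queries=100, gap=gap)
```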
11:01
Okay, so steps two to four might look trivial, but they really take most of the time. There are many ways to go about this. Probably the easiest, which most of us would do, is to use Python or shell scripts, or maybe create a simple Python framework. But there are many drawbacks to doing this as well. We need graphs and charts to get proper insight
11:22
from the results, and we also want to be able to easily tweak as many variables as possible while still maintaining reproducibility. Taking all of this into account, it becomes a totally different problem, and definitely not the one we initially set out to solve. And this is why we used another great tool
11:40
called Apache JMeter. Apache JMeter is open source software that can be used as a load-testing tool for analyzing and measuring the performance of a variety of services. It is primarily used for web services, but you can test just about any server. It is very easy to set up and modify test suites, as JMeter comes with a good GUI for that.
12:02
It has lots of plugins that support many complex testing requirements. It is also very easy to automate, and it supports scripting. The bit that I personally like the most is that it generates beautiful reports, some of which you can see on the screen as well, which give us a lot of insight into the test results.
12:23
So now that we have everything we need to run our experiments, the next step is to actually run them. This talk will present three major experiments which helped us overcome the faceting challenge. There were many more small ones, but these three summarize all the findings from the smaller ones as well. Every experiment helped us understand
12:41
the Solr functionality better and gave us pointers for the next step or the next set of experiments. We iteratively moved forward until we were satisfied with our results, which happens in experiment number three. Okay, so let's start with our first experiment. We just ingested all the data into the basic setup
13:00
which we discussed earlier and used JMeter to send out lots of queries with totally random time ranges, because remember, we are faceting on time ranges. We didn't expect too much from this initial experiment, but the results were just way too bad. We had a near-zero cache hit rate, and the Solr JVM went out of memory really quickly.
13:21
The throughput was really bad, at around 10 queries per minute, and the latency was in the tens of seconds for the facet queries. Then again, remember, we are talking about 3 billion documents, so we didn't expect great results, but this was just way too bad. Okay, so at this point, we had two major unknowns.
13:40
Solr has many different caches. Why did the caches not work? And the second question was: why does the Solr JVM go out of memory? Just how much memory does it need? What is the magic number we should use for our use case? With all of these questions, we moved to the next phase of analyzing what we should do. The only thing we had at this point
14:01
that we could really analyze was the heap dump from the Solr JVM, so we just used it as a starting point and started inspecting it. This is where we used another great tool called Apache, sorry, not Apache, obviously: the Eclipse Memory Analyzer. The Eclipse Memory Analyzer is a fast and feature-rich Java heap analyzer
14:22
that helps you find memory leaks and reduce memory consumption. It is a great tool for inspecting Java heap dumps. It is also very easy to use and really helps you see a lot of things inside your heap dump. We will now go through the information we got while analyzing our heap dump.
14:40
Okay, so this is how the tool looks when you load a heap dump into it. On closer inspection, we were able to see the different caches inside the heap dump. As you can see on the right, the filter cache is clearly visible. Going a bit further, we can also see the size
15:00
of the filter cache when the Solr JVM went out of memory: it's around 690 megabytes. Digging even deeper, and this is where things get really interesting, we can also see what is stored as a key in the filter cache. As you can see, it is actually a time range, represented by a lower timestamp and an upper timestamp.
15:21
This was a really interesting insight we got by using this tool. And if we dig even deeper, we can actually see the value that is stored, or mapped, to this key in the filter cache. As you can see, the value is essentially an array of document IDs. These are the document IDs of all the documents
15:41
that lie within the time range represented by the key. This reverse engineering using the heap dump was a breakthrough in understanding how Solr uses the filter cache for our use case. Okay, so before we go any deeper and try to understand the earlier findings, it's important to know what a filter cache is.
16:00
Solr has many different caches, like I talked about, which serve different purposes. The filter cache is one such cache, typically used for storing the filters used in the fq (filter query) part of a query. It is also extensively leveraged in faceting when using the default facet method, which is fc. In essence, it's just a fast LRU cache
16:22
which stores the last X recently used entries. Usually, the filter cache stores filters, or bit sets. So if a Solr index has, say, 1,000 documents, each such bit set in the filter cache will have a length of 1,000, with all the documents matching a query as one
16:42
and everything else as zero. These bit sets are essentially a compressed way to represent the documents matching a certain query. However, in our use case, as we saw earlier, the filter cache did not store bit sets; it stored an array of document IDs. So let's see how the filter cache works for our use case. As we saw earlier
17:02
in the Eclipse Memory Analyzer, the filter cache in our use case stores a time range, the start and end times, as its key, and for the value it stores an array of all the document IDs within that time range. So let us see what happens when we send a facet query to Solr, to understand
17:21
how the filter cache is storing things. If we send a facet query like this one, which has a time range of the 1st of January 2020 to the 1st of January 2021, and we want the results aggregated by one day, a total of 365 entries will be generated in the cache, which is the total time range divided by the aggregation period.
17:40
Each entry in the cache can be thought of as a time range bucket, which will be one day long and which will store as its value the document IDs of all the documents that lie within that one-day time range. Let's take this example a bit further and see what happens if we send the same query with an aggregation period of seven days this time.
18:02
This time, 52 entries are generated in the cache, as one year has around 52 seven-day buckets. These entries will each span a period of seven days, and the values will be, just like before, all the document IDs within the seven-day period represented by the key.
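One way to picture this structure is as a toy model, a map from (start, end) keys to arrays of document IDs; this is only an illustration of the idea, not Solr's actual implementation:

```python
from datetime import datetime, timedelta

# Toy model of what the heap dump showed: the filter cache maps a
# (start, end) time-range key to the IDs of all documents in that range.
def bucket_entries(start: datetime, end: datetime, gap: timedelta,
                   docs: dict[int, datetime]) -> dict:
    """Build one cache entry per bucket of length `gap` over [start, end)."""
    entries = {}
    lo = start
    while lo < end:
        hi = min(lo + gap, end)
        entries[(lo, hi)] = [d for d, ts in docs.items() if lo <= ts < hi]
        lo = hi
    return entries

docs = {1: datetime(2019, 3, 14), 2: datetime(2019, 3, 14), 3: datetime(2019, 7, 1)}
year = (datetime(2019, 1, 1), datetime(2020, 1, 1))
print(len(bucket_entries(*year, timedelta(days=1), docs)))  # 365 daily buckets
print(len(bucket_entries(*year, timedelta(days=7), docs)))  # 53: 52 full weeks + a partial one
```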
18:22
This really helps us understand the structure of the cache and how it is being used for our faceting use case. Let's take a step further. If you noticed, in my earlier slide I deliberately left out the search part of our Solr query and only talked about the faceting bit. This is because the way Solr computes these facets does not depend on the search query too much.
18:41
This might sound a bit surprising, so let's dive a little deeper to understand how a facet query sent to Solr works under the hood. Okay, let's take a simple query like COVID-19 and vaccines, and say we want to aggregate the count of matching stories from the 20th of January 2020 to the 15th of November 2020,
19:01
with all the results aggregated by one day. This is what the Solr query means. When we send this time range facet query to Solr, the search part of the query, which is COVID-19 and vaccines, goes to the Solr index. The facet part of the query, however,
19:20
generates smaller subqueries. Each of these subqueries has a time range equal to the aggregation period we specified, which is one day, and together they span the total time range of our query, which is the 20th of January to the 15th of November 2020. They are exactly the same as the time range buckets we saw earlier in the filter cache.
19:43
And what really happens is easy to guess: these small time range buckets are then looked up in the filter cache for a match. Finally, we get results from both of these sides, so we have two results in total. The result from the Solr index is all the documents matching only the search query.
20:01
This includes documents that are outside the time range of our facet part; they could be way outside. Similarly, what the smaller time range buckets give us is all the documents within the total time range given in the facet query, so it also includes documents which do not match COVID-19 and vaccines.
20:23
This is something very interesting. Once Solr has these two results, it applies an intersection to give us back all the documents that are within the time range and also match our query.
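Conceptually, the per-bucket facet counts fall out of that intersection; a rough sketch of the idea, with made-up document IDs (Solr does this internally with far more efficient structures):

```python
# Per-bucket facet counts come from intersecting the query's matching
# document IDs with each cached time-range bucket's document IDs.
query_matches = {3, 7, 12, 15, 42}        # docs matching "COVID-19 AND vaccines"
cached_buckets = {                         # filter cache: bucket key -> doc IDs
    ("2020-01-20", "2020-01-21"): {1, 3, 9},
    ("2020-01-21", "2020-01-22"): {7, 8, 12, 40},
}

facet_counts = {bucket: len(query_matches & doc_ids)   # set intersection
                for bucket, doc_ids in cached_buckets.items()}
print(facet_counts)  # counts 1 and 2 for the two one-day buckets
```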
20:41
Now, putting everything together: the filter cache is actually the part that we have to exploit. Not hitting the cache and having to count all the documents in these small time range buckets on every query is really the bottleneck here. It takes a lot of time to count all of these documents when the number of documents is very large, and it also consumes a lot of resources. If we somehow fill this filter cache with everything
21:03
that we will ever need, we can theoretically always hit the filter cache, be really fast, and unblock this bottleneck. So this is what we tried to do, based on our understanding of the filter cache and result construction. We can now apply a few optimizations to, like I said,
21:21
theoretically always hit the cache. For that to happen, we need to pre-fill the filter cache with all possible intervals and all the document IDs in those time ranges, and then rely on the intersection by Solr to give us the correct results. The first thing we need to do is fix the maximum time range supported by our application.
21:41
This is because, if we want to fill everything into the filter cache, we need to know its size, which depends on the maximum time range our application will use in its queries. The next thing is to fix the number of aggregation periods that we support; it cannot be infinite. As we saw earlier, Solr stores small time range buckets as its cache entries,
22:01
which are calculated based on the total time range and the aggregation period in the query. So if we are to pre-fill the cache with all possible time range buckets, the set of aggregation periods, just like the maximum time range, has to be finite. The next important thing is to fix the time boundaries of each of these aggregation periods. Since we know that the small time range buckets
22:21
store timestamps with millisecond granularity, to always hit the cache we need to make sure that the boundaries of all these individual time range buckets are properly aligned and consistent. Every day should always start at the same time; it cannot be totally random queries at random times.
22:41
Okay, so the next thing is to actually fill the filter cache. We can use multiple queries with the search part being *:*, which just means give me everything, over all the aggregation periods and the maximum time range that we are actually going to use. This will fill everything into the filter cache.
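A sketch of what such cache warming might look like; the endpoint, field name, maximum range, and the set of supported gaps are all assumptions:

```python
import requests

SOLR = "http://localhost:8983/solr/news/query"                 # hypothetical
MAX_RANGE = ("2016-01-01T00:00:00Z", "2021-01-01T00:00:00Z")   # fixed max range
SUPPORTED_GAPS = ("+1DAY", "+7DAYS", "+30DAYS")                # fixed, finite set

def warm_filter_cache() -> None:
    """Send *:* facet queries so every possible bucket lands in the filter cache."""
    for gap in SUPPORTED_GAPS:
        body = {
            "query": "*:*",  # match everything; we only care about the buckets
            "limit": 0,
            "facet": {"warm": {
                "type": "range", "field": "timestamp",
                "start": MAX_RANGE[0], "end": MAX_RANGE[1], "gap": gap,
            }},
        }
        requests.post(SOLR, json=body).raise_for_status()

warm_filter_cache()
```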
23:01
And finally, once we have pre-filled the cache, to actually hit it we need to align the time boundaries in our queries with the ones in the filter cache. So we need to normalize our queries to have the same time boundaries as we used to warm the cache in steps three and four.
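A sketch of such normalization, assuming day-aligned buckets: snap every query boundary down to UTC midnight so it coincides with a warmed bucket boundary.

```python
from datetime import datetime, timezone

def normalize_to_day(ts: datetime) -> datetime:
    """Snap a timestamp down to UTC midnight so query boundaries always
    coincide with the day-aligned buckets warmed into the cache."""
    return ts.astimezone(timezone.utc).replace(
        hour=0, minute=0, second=0, microsecond=0)

raw = datetime(2020, 6, 15, 13, 37, 25, tzinfo=timezone.utc)
print(normalize_to_day(raw).isoformat())  # 2020-06-15T00:00:00+00:00
```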
23:23
These were all just theoretical optimizations which we came up with after analyzing the heap dumps and during our testing. We ran the same JMeter tests as in experiment one, but this time with the optimizations we just discussed. As you can see from the graphs, we had a 100% cache hit rate and our throughput increased significantly, from 10 queries per minute to five queries per second,
23:43
which is around 30 times better. As a result of always hitting the cache, the query latency also dropped drastically, from tens of seconds to well within a second this time. However, as you can see, we were still getting memory errors. But one thing was sure: the analysis we did really helped us. So although we had come quite far from where we started,
24:05
we still didn't understand why Solr was getting these out-of-memory issues. This led us to do some analysis once again, to get more insight into the internal workings of the setup. We know the only heavy-duty memory usage is by the filter cache, which we use to store everything right now.
24:21
On a closer look at what is being stored, we can see that all the different aggregation periods over a time range take the same amount of memory. So if the maximum time range is one year and you are storing both daily and weekly aggregates in the filter cache, they will take the exact same amount of space in the cache, because essentially it's just two copies of the same thing, structured differently.
24:41
And the next thing: even at this point, the only valuable thing we had was the heap dump, so we did a further analysis of it to see if we could get more insights. As you can see, every Solr shard basically has its own filter cache. What this means is that if we reduce the size of the filter cache by X, and let's say we have five shards on a Solr JVM,
25:03
then we would actually reduce the load on the Solr JVM by five X. So this was again a very interesting insight which helped us optimize memory. This time, based on our analysis, we figured that we only need to store the smallest aggregation period in the filter cache,
25:21
and not all the ones supported by the application. Why? Because if we just store the smallest one, we can calculate the coarser ones in the service layer. For example, we could just add up 30 daily buckets and create a monthly aggregation in the service layer, as the sketch below shows.
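A minimal sketch of such a service-layer roll-up, assuming the daily counts have already been fetched from Solr:

```python
# Service-layer roll-up: Solr only caches the smallest aggregation period
# (here, daily); coarser periods are computed by summing daily buckets.
def roll_up(daily_counts: list[int], period_days: int) -> list[int]:
    """Sum consecutive daily buckets into buckets of `period_days` days."""
    return [sum(daily_counts[i:i + period_days])
            for i in range(0, len(daily_counts), period_days)]

daily = [5, 3, 8, 1, 4, 9, 2, 7, 6, 0]   # counts for ten consecutive days
print(roll_up(daily, 7))    # weekly buckets: [32, 13]
print(roll_up(daily, 30))   # "add 30 days" for a monthly bucket: [45]
```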
25:40
The next important thing is that we need to balance the number of shards on a single Solr JVM, as it has a multiplicative effect on memory utilization. Finding the right balance between throughput and memory is really important. Then there are some empirical things that we observed, such as: the moment you commit data into the Solr index, the cache is cleared. So you don't really need to commit until you actually have to.
26:01
It may vary depending on the use case, but for our use case, we were fine committing once a day. And the final thing is that the filter cache is not exclusively used for faceting; it's also used for other things. So it will not always remain warm, or prefilled with every entry that you put inside it. You need a cache warmer, and that's where we use another service
26:21
which sends the same queries again and again over a period of time to keep the cache warm. Okay, so this brings us to the final phase of the experiments, in which we included all these optimizations. And the results this time were really good; they met all our expectations. We were able to scale the range faceting performance from 10 queries per minute to 10 queries per second
26:41
for the heaviest faceting queries, which is about a 60-times improvement. It can easily go much higher than that for simpler queries, but these numbers are for the heaviest ones. And remember, this latency is not for the search traffic but for the analytics, and also on about 3 billion documents. Okay, so the key takeaways from this talk,
27:00
which I think will really help you reach your goals faster: experimentation is the key. The more you experiment and play around with things, the better you can innovate and find solutions to reach your goal. And I cannot stress enough how important it is to have a robust testing system. We used JMeter to quickly validate the findings of our experiments and decide the direction,
27:20
instead of spending too much time on the wrong things. And always keep an eye out for awesome tools. The reverse engineering from the heap dump all the way to understanding the filter cache functionality was only made possible by the Eclipse Memory Analyzer. Be prepared to dig deeper and deeper, all the way to the root of something. It gives you so much insight, which might not be applicable immediately
27:42
but helps you understand ways to reach your goals faster. And the final thing is: you should really iterate fast. This is the model we followed, and it really helped us. With this, I would like to conclude this talk. Thank you so much, everyone, for your time. And as I mentioned earlier, we are actively hiring on our team, so do reach out to me if you're interested.
28:00
Thank you, and I'm ready to take questions. Yes, lovely talk. We have a couple of questions for you in the queue. The first one is: do you have experience using nested facet queries in the JSON Facet API? Any tips or tricks that you could pass along? Yeah, so we actually did investigate quite a bit
28:22
into how to optimize our queries, which were JSON Facet queries, and one of those things was actually nested facet queries as well. We also dug quite deep into the Solr code to understand whether there are ways to optimize them. We thought that whether the heavy part of the nesting is on the inside or on the outer layer
28:42
might affect the performance, but it really doesn't. So as of now, I don't think I have very valuable insight into how you can optimize them. It just works pretty well, I think. Awesome. Second question for you: did you consider allocation profiling options like JFR
29:02
instead of the out-of-memory heap dumps? No, not really. To be honest, we didn't even think about analyzing heap dumps at first. It was the last thing we did, out of frustration, because we had tried reading the Solr code and we had tried going to blogs, reading and researching stuff.
29:20
In the end, when we were frustrated and unable to find any answer, we just thought, okay, let's try this. So maybe that idea is a very good one, and if I did it all over again, I would definitely try it, but we didn't. Yeah, that's fair. There's only so much time in each iteration cycle, right? You have to be pretty ruthless in figuring out what goes into each one.
29:42
Third question for you: did the team consider decomposing timestamps into rougher buckets? Like year, year-month, year-month-day, and then using those in standard term faceting. Sorry, so when you say year, year-month, year-month-day,
30:02
do you mean all the aggregation periods that we talked about, or something different? I think they're referencing pre-computing it, almost like a keyword field: pre-constructing the actual timestamp down into one of those buckets. So you add a field for year,
30:21
a field for year-month, a field for year-month-day, and then use those like a standard terms aggregation, like the facet you showed for category or something like that. Yeah, actually we didn't really think in that direction at all. The way we started is that we just basically wanted counts over all the news stories that we have at Bloomberg,
30:43
and we wanted to use exactly the news schema for Solr that we already have; we didn't really want to modify it too much. We tried to have solutions which would use the same schema as our other news systems. So yeah, no. Yeah, that makes sense. And you guys are in an analytics situation too,
31:02
where you might want to reshape what the buckets are on the fly as questions come up.