
Building Better Benchmarks


Formal Metadata

Title
Building Better Benchmarks
Series Title
Number of Parts
32
Author
Contributors
License
CC Attribution 3.0 Unported:
You may use, modify, and reproduce the work or its content for any legal purpose, and distribute it and make it publicly available in unchanged or changed form, provided you credit the author/rights holder in the manner they specify.
Identifiers
Publisher
Publication Year
Language

Content Metadata

Subject Area
Genre
Abstract
TPC, SSBM, YCSB... These are just some of the database benchmarks out in the wild, yet there are many derivatives, implementations, and variations of these and other benchmarks. They differ in ease of use and in how effectively they test at large scale. End-user licensing agreements and the licensing of the software being used can be just as prohibitive as the test itself when it comes to measuring the performance of your software. "Better" is certainly subjective. From the point of view of open source software developers, we want something that: * can push the limits of our software and hardware * we are allowed to modify and change to better suit our needs * can be shared with colleagues and the world to show the achievements made. This talk surveys the current state of the art: what is available to help us now, and what hurdles remain before we can do more. These hurdles remain only as opportunities for us to overcome with open source solutions.
Transcript: English (automatically generated)
Hello! Welcome to this presentation on building better benchmarks. My name is Mark Wong. I'm employed by 2ndQuadrant, a Postgres support company.
I've been recognized as a Postgres contributor since 2005. I'm also on the board of directors at the United States PostgreSQL Association, a nonprofit supporting the adoption of Postgres, and I'm involved in co-organizing the Portland PostgreSQL Users Group.
My background is in developing and running TPC benchmarks for marketing and engineering purposes. So I'm going to talk about what benchmarking is about, as it can mean a couple different things, and try to focus on what I think is most relevant to the Postgres community.
I'm just going to generally mention what benchmarks are out there, in particular open source benchmarks, and talk about what the issues are with these benchmarks and how those issues relate to the Postgres community.
Then after establishing what is out there and what some of the issues are, I want to propose some thoughts on where we want to go and get some feedback on what the community's needs are. So I think there are two general reasons for benchmarking.
One reason is competitive benchmarking, and the other is self-assessment. With competitive benchmarking, I think this is most important to the end user, where the end user is looking at how many transactions we can execute per minute, or how fast we can execute a reporting query, and how much it is going to cost to get that performance.
With self-assessment, we're trying to characterize our development work. For example, is Postgres scaling the parallelization of a query across all processors? Is this code optimization actually improving system performance?
Or is this new index performing better than any of the previous indexes, and what types of queries does it help with, and what types doesn't it?
So let's take a look at a couple of industry consortiums to describe what the competitive benchmarking arena looks like. This isn't meant to single anyone out in particular, but I think we can take a look at a couple of specific consortiums to give a pretty good picture
of what competitive benchmarking is about. First, I want to point out that there have only been a couple of industry benchmarks published using Postgres over the years, specifically SPECjAppServer, published in 2007, well over 10 years ago now.
And SPECjAppServer is a multi-tiered Java application server benchmark. It's not really about the database itself. And for those of you familiar with the TPC, there haven't been any TPC benchmark publications with Postgres at all.
Now, I will admit that SPEC and the TPC are the ones that I tend to look at. But if there are any other industry standard benchmarks using Postgres, I'd be happy to hear about them. Please let me know. So a question I want to raise is, is this reflective of the demand for an industry standard benchmark publication with Postgres?
Perhaps it's really not all that important to Postgres. Perhaps it may be in the future, but just food for thought for now. So SPEC is a nonprofit organization.
They develop their own benchmark suites that measure system performance. SPEC covers its costs by charging vendors for the benchmarks that they want to run. The vendors will run the benchmarks and return the results to SPEC.
SPEC will then handle the publication of the results. The TPC, the Transaction Processing Performance Council, is another nonprofit organization. They primarily do two things. One is to create benchmark specifications
and to define a process for reviewing and policing the publication of the benchmark results. They do provide a variety of OLTP and business intelligence benchmarks.
So TPC benchmarks are not all that trivial to run. It takes a bit of effort to develop a benchmarking kit and can take quite a bit of effort to execute. The majority of these benchmarks that the TPC defines do not provide a complete kit.
Some do not provide any code at all. Some of these kits provide some code that will help you create the data for the benchmark and create the queries that you need to run. Some of these benchmarks also provide a framework that you need to develop your kit around.
The benchmark specifications are pretty detailed. They do describe how the database needs to be created, how the data needs to be generated, and how to drive the workload. So the responsibility is on the parties that wish to publish a benchmark
to actually develop the kit themselves. Then they need to also get the kit audited by a third party before results can be published. In addition to paying for these auditors, the TPC also has a fee for publishing the benchmark results.
There are other requirements for publication, such as the software you're benchmarking and the hardware needing to be available for sale and supported for some amount of time after the publication.
So to summarize what competitive benchmarking is about: these industry consortiums do provide well-defined workloads, but they are not a trivial amount of effort to implement.
If they're not providing you a kit, you need to put in the effort to develop the kit and then get it audited. It's also not a trivial effort to execute these benchmarks. In addition to putting these kits together, you do need to get hardware to execute them on.
So these can be expensive for a variety of reasons: purchasing the kit, getting the kit audited, and then publishing the results. It's not something intended for the individual developer; these are activities that a company would be taking on, something a company might be able to afford.
So I think what is more relevant to the Postgres community is using these benchmarks for self-assessment. To best characterize our work, though, we need more than the metrics provided by the benchmark.
We need to know more than how many transactions we run per minute or how fast we can execute a query. Benchmark publications don't ask for things like processor utilization or I/O statistics,
but these can be good information to have in order to know how well we're utilizing the system. Furthermore, profiling the system, including the operating system, Postgres, and any other software components that support the execution of the benchmark, can be very helpful.
So can having annotated source code, generating call graphs, generating flame graphs, anything else that helps us understand what the system is doing and what the code is doing. I like how OLTP-Bench described this in their VLDB paper.
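To give a rough idea of the kind of system statistics collection I have in mind, here is a minimal sketch, assuming the third-party psutil package is available; the sampling interval, duration, and CSV layout are placeholder choices and not part of any existing kit:

```python
import csv
import time

import psutil  # third-party package: pip install psutil


def sample_system_stats(duration_s=60, interval_s=1.0, outfile="sysstat.csv"):
    """Record CPU and disk I/O utilization while a benchmark runs elsewhere."""
    with open(outfile, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["elapsed_s", "cpu_pct", "read_mb", "write_mb"])
        start = time.time()
        prev_io = psutil.disk_io_counters()
        while time.time() - start < duration_s:
            # cpu_percent blocks for interval_s and returns average utilization.
            cpu_pct = psutil.cpu_percent(interval=interval_s)
            io = psutil.disk_io_counters()
            writer.writerow([
                round(time.time() - start, 1),
                cpu_pct,
                round((io.read_bytes - prev_io.read_bytes) / 1e6, 2),
                round((io.write_bytes - prev_io.write_bytes) / 1e6, 2),
            ])
            prev_io = io


if __name__ == "__main__":
    sample_system_stats()
```

Something like this, started alongside the benchmark driver, is enough to correlate the throughput numbers with how busy the processors and storage actually were.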
In many cases, researchers and developers are limited to a small number of workloads to evaluate the performance characteristics of their work. This is due to the lack of a universal benchmarking infrastructure and to the difficulty of gaining access to real data and workloads. This results in lots of unnecessary engineering efforts and makes the performance evaluation results difficult to compare.
So along the lines of unnecessary engineering effort, I did a quick search to see what benchmarking kits are out there and found a dozen projects that provide at least one derivative of a TPC benchmark, including OLTP-Bench.
So my intent is not to show a comprehensive list of what is out there, but to suggest that there might be more options than there needs to be. I think the number of implementations is reflective of the difficulty of providing a few good benchmarking kits. I think there is room to have more than one good kit, and reasons to,
but having over a dozen feels excessive, and these are just derivatives of TPC benchmarks. I think part of the problem is that not everyone is comfortable working in the same programming language for a variety of reasons. Maybe someone is more comfortable working in Java or working in C or Tcl or just plain shell scripts.
But I don't think it's a bad idea to try starting from something that's already out there and extending it or adapting it to meet our needs better if it doesn't already.
These are only some of the options, and I'd be happy to hear about others that people think are worth taking a closer look at or starting with.
After showing a lot of TPC-derived projects, I now have a TPC-related licensing issue that isn't meant to pick on the TPC as much as show what could happen with anything that doesn't have an open source license.
This clause in the end-user licensing agreement didn't always exist, and I think it has since been removed. But this clause says that we cannot contribute any part of the materials to an open source software project without the express written consent of the TPC chair. As an open source community, I would suggest that it doesn't make sense to rely on a non-open consortium as the basis for our benchmarking kits.
We may be better off looking for another source of materials or, unfortunately, building our own from scratch. There are a couple that I am aware of: the Yahoo! Cloud Serving Benchmark (YCSB), which is a key-value store type workload,
and the LDBC, which is another open consortium, although they focus on graph databases. So these are the only two that I'm aware of. I'd be happy to hear about more if you know of them.
So I think these are our needs as an open source community for self-assessment, to be able to characterize our work: we need something with unencumbered licensing, and we need something that collects system statistics and profiles the system for us,
in addition to just generating or measuring transactions per second or execution time. So here is where I start asking, where do we need to go from here?
This is where I want to start polling the audience to get feedback about some of the issues I've presented thus far. But I'll continue to raise some questions and see if there's enough interest to have an unconference session. I believe that we are still planning to have some online unconference sessions.
Although I'm not sure of how that's going to happen yet. But I'm sure it will be posted on the website soon. So I do think we are going to have to ultimately develop some of our own workloads to get away from disagreeable licenses.
I consider these two archetypes of workloads to cover the spectrum of workloads that we want to have.
So on one end, there are OLTP type workloads: multi-tier client-server architectures that involve emulating users and require a transaction manager or connection pooler in addition to the database system. Some examples of these are electronic data processing: wholesale suppliers managing orders, or brokerage firms executing customer transactions.
At the other end are business intelligence type workloads. So you may also hear these called decision support systems, data warehousing, online analytical processing.
But basically these tend to be single systems executing business-oriented ad hoc queries against large, star schema designed databases.
So if I were going to create a benchmarking kit from scratch, I wouldn't recommend an OLTP type benchmark, or at least not for the first one. It's complex because it's multi-tier. On large systems, we'll need to simulate thousands, if not tens of thousands, of users, with that many processes and threads.
We shouldn't be running that on the same system as a database. We know at that scale we need to have a connection pooler in order for Postgres to perform well. So then do we use an existing connection pooler or develop one for the kit? Are we developing one for the kit because it makes it easier to execute the benchmark? Or are we using it because we can actually get better results than using an existing connection pooler?
Then we need to make decisions like should the application logic be on the client side? Or should we implement stored procedures and functions on the server side for better performance? What procedural language should we use?
And while the kit itself might not need to scale to the same degree as a database, nevertheless, when the database gets large enough and we need more users to drive it, the kit will need to scale to some degree.
Perhaps a single process can drive thousands of users well, but I have a hard time believing that it can drive tens of thousands of users efficiently. But then maybe it's not important to have a kit that drives large systems, or to rephrase, maybe it's sufficient to be
running benchmarks on small systems, something where we can run everything all in one tier and still get meaningful and useful results. And in some cases, perhaps pgbench is also a good tool for some of these smaller-scale benchmarks.
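To make concrete why even the driver side of an OLTP kit takes real engineering, here is a minimal sketch of a thread-per-terminal driver in Python using psycopg2; the connection string, the placeholder transaction, and the keying and think-time values are hypothetical illustrations, not part of DBT-2, OLTP-Bench, or any other existing kit, and a real kit would need a far more scalable design (and likely a connection pooler) to emulate tens of thousands of users:

```python
import random
import threading
import time

import psycopg2  # third-party PostgreSQL driver

DSN = "dbname=benchmark"      # placeholder connection string
TERMINALS = 10                # one emulated user per thread
DURATION_S = 60
KEYING_S = 0.0                # set > 0 to emulate keying time
THINK_S = 0.0                 # set > 0 to emulate think time


def terminal(counts, idx):
    """One emulated user: its own connection, looping over a transaction."""
    conn = psycopg2.connect(DSN)
    cur = conn.cursor()
    deadline = time.time() + DURATION_S
    while time.time() < deadline:
        time.sleep(KEYING_S)
        # Placeholder transaction; a real kit would implement something
        # like the TPC-C New Order transaction here.
        cur.execute("SELECT count(*) FROM orders WHERE terminal_id = %s", (idx,))
        cur.fetchone()
        conn.commit()
        counts[idx] += 1
        if THINK_S:
            time.sleep(random.uniform(0, THINK_S))
    conn.close()


if __name__ == "__main__":
    counts = [0] * TERMINALS
    threads = [threading.Thread(target=terminal, args=(counts, i))
               for i in range(TERMINALS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(f"{sum(counts) * 60 / DURATION_S:.0f} transactions per minute")
```

Even this toy version already forces the design questions above: where the transaction logic lives, whether threads or processes drive the users, and how far one driver process can realistically scale.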
But if I were to start out and create a new benchmark from scratch, I would recommend creating some kind of BI type workload, because it's much simpler.
You only need a single system, just the database system itself. It's not as important what the language is that the kit is written in, whether it be shell scripts, Python, Perl, Julia, Go, Tcl, C.
We know we don't need a connection pooler, because these types of benchmarks don't need to have thousands, or maybe even hundreds, of users executing queries at once.
Some of these benchmarks are typically just a single user executing a series of queries. Although in some cases, it can also be interesting to see how the system performs running multiple series of these queries simultaneously.
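As a rough sketch of what the driver for such a BI-style kit could look like, here is a single-stream run that times a series of queries, with an option to run several streams concurrently; the connection string and the star-schema queries are hypothetical placeholders, not a proposal for the actual schema or query set:

```python
import threading
import time

import psycopg2  # third-party PostgreSQL driver

DSN = "dbname=benchmark"  # placeholder connection string

# Placeholder ad hoc reporting queries against a hypothetical star schema.
QUERIES = [
    "SELECT d.year, sum(f.amount) FROM fact_sales f"
    " JOIN dim_date d ON f.date_id = d.id GROUP BY d.year",
    "SELECT p.category, count(*) FROM fact_sales f"
    " JOIN dim_product p ON f.product_id = p.id GROUP BY p.category",
]


def run_stream(stream_id, results):
    """Run the whole query series once on its own connection, timing each query."""
    conn = psycopg2.connect(DSN)
    cur = conn.cursor()
    timings = []
    for q in QUERIES:
        start = time.time()
        cur.execute(q)
        cur.fetchall()
        timings.append(time.time() - start)
    conn.close()
    results[stream_id] = timings


if __name__ == "__main__":
    streams = 1  # raise this to run several query series simultaneously
    results = {}
    threads = [threading.Thread(target=run_stream, args=(i, results))
               for i in range(streams)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    for sid, timings in sorted(results.items()):
        print(f"stream {sid}: " + ", ".join(f"{s:.2f}s" for s in timings))
```

The point is simply that a single-system driver with one or a handful of connections is a much smaller engineering problem than the multi-tier OLTP case.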
So if there was enough interest, I would propose having an unconference session to define a database schema, how the data should be generated, and what kinds of queries should run against the database.
I just wanted to revisit the competitive benchmarking briefly. I do think it has its place, but I'm not sure what the interest level is with this group. And I don't think it's necessarily more interesting than having a good benchmark kit for self-assessment, for characterizing the work that's done in the community.
I don't think there's anything wrong with anyone trying to enter the TPC arena and getting benchmark results published with Postgres. I think it'd personally be kind of fascinating to see what a result would look like. But I would question, do we have to enter an existing benchmark arena?
Perhaps it makes more sense to create a new arena for open source software. But I don't think this is a trivial amount of work. There's lots to consider, such as how do we create a fair playing field?
How do we enforce that people play fair? I think it could be done, but it's going to take a lot of effort to create this arena. I'm happy to discuss further in an unconference session. So if this is something
that you're interested in, please let me know. We can see how much interest there is. So to wrap up in the next few slides, I wanted to do something that I hope is a little fun for everyone. I wanted to compare the performance of a Java benchmarking kit with a C benchmarking kit. My intent isn't to show why or how Java is better than C or how C is better
than Java, but to show why it's not trivial to decide how to build an OLTP benchmarking kit.
So I have a little HP ZBook mobile workstation, which is a fancy way of saying laptop. But it has a six core processor, 64 gigabytes of memory, a two terabyte NVMe solid state device.
And what I'm comparing is OLTP-Bench, which is a Java implementation of the TPC-C benchmark, and DBT-2, which is a C-based kit, also a derivative of the TPC-C benchmark.
So what I'm going to run is a database built with only a single warehouse. And there are going to be 10 terminals, each of which emulates a user.
So 10 users, each with their own database connection, for 10 total connections to the database. And each of these users is going to execute queries as fast as possible, meaning there's no keying or think time involved. And I'm going to disable simultaneous multithreading on the processors, just because.
So the results of running each of these tests are that OLTP-Bench can execute over 21,000 New Order transactions per minute,
while DBT-2 executes a little under 29,000 New Order transactions per minute, so roughly a 30% throughput difference. With OLTP-Bench, the system was reporting that each of the processor cores was between 80 and 90% utilized,
jumping around within that range throughout the duration of the test. With DBT-2, none of the cores were more than 60% utilized.
According to the per-process statistics, the OLTP-Bench Java process was supposedly utilizing 100% of a processor, whereas the two processes that drive the workload in DBT-2
did not total more than 76% of a processor. Both of these tests were I/O bound; they were both saturating the 2 TB storage device. And in each case, the 10 Postgres backends were not utilizing more than 30% of a processor each.
So then the question I pose is not so much Java
versus C, but whether using one language is worth a significant performance difference at the cost of, say, maintenance, of getting other people involved, and of keeping these benchmark kits up to date.
So with that, with those final thoughts, I want to thank you for listening. Please get in touch. My email address is mark at secondquadrant.com. I'd like to get together with people and have a couple of unconference sessions.
If there's enough interest to talk about any of the things that I've gone over in this presentation, or if there are other things that you feel are worth discussing that I didn't touch on, please let me know. Thanks again. We're back live with Mark for Q&A. Mark, go ahead.
Hi. So I got a question about who the potential collaborators would be for an open source benchmark, given that all the other open source databases are competitors to Postgres. I think that depends on the purpose.
I think from the point of view of competing with each other, it's people who want to be able to say, we do this as well as someone else, or we really do handle this workload better than someone else. People who want to be able to say that, what's the word I'm looking for?
With confidence that no one's going to say it's an unfair comparison or whatnot.
I think it's also for purely being able to see how you do compared to the rest of the world. I suppose the argument might not be as strong, but say Postgres has a feature that someone else has too; we can
see if, well, I suppose, they play off one another to see where one lags behind or where one can improve. That's how I would think about it. Just reading through some of the comments.
Companies that sell PGaaS could compare who has the highest PPS for their offering. Yeah, yeah. I think, if I'm taking that correctly, that's from the point of view of making sales.
I don't know if these necessarily have to be open source benchmarks as opposed to using one of the already established consortiums to make that kind of claim. So another question, if we make open source benchmarks, how do we compare the performance of Postgres to that of some proprietary database?
I don't think that will be possible unless the proprietary databases give... My recollection, even though it's been many years now, is that all of the proprietary databases have end user licensing agreements.
So if you download or purchase their software, you can't use it to make performance comparisons. So comparing open source to proprietary isn't really solved by having an open source benchmark.
And I think that's all the questions for now. I'm happy to still hang out for the remaining time if more questions come up.
Thanks, Dan. All right. Bye bye. Thanks, everyone. Bye bye.