Postgres Performance for Humans
Formal Metadata
Title: Postgres Performance for Humans
Number of Parts: 199
License: CC Attribution 2.0 Belgium: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/32609 (DOI)
FOSDEM 2014, part 75 of 199
Transcript: English (auto-generated)
00:00
You can see the title and you're at the right place. Really quickly, a show of hands, is there anyone here that doesn't use Postgres? Okay. Jump right in. So a few just completely shameless plugs at first.
00:20
So I do a good bit of writing about Postgres, among other things, in my personal blog. Hopefully it's a helpful resource. I curate something called Postgres Weekly. So this is a little different from the one that hits the Postgres mailing list. It's more end-user content, kind of how-tos. It's not what's going on on the hackers list.
00:42
It's more curated for end-users. I like to think it's a pretty good resource. Postgres Guide is another thing. If you're on a Mac, give Postgres.app a look. Hopefully it makes your lives much better and happier than Homebrew or MacPorts.
01:01
And then I work at Heroku. Has everyone here heard of Heroku? Okay. We run essentially the largest fleet of Postgres in the world. So if you don't want to deal with all the administration, we pretty much take care of it for you. So I travel a decent bit and speak at quite a few conferences. And when I'm talking about Postgres, I usually put up a slide like this, either at the beginning
01:25
or the end, as kind of the "why is Postgres great". For those of you who were in here for Magnus's talk, not nearly enough hands went up for some of these things. Extensions in particular, like hstore: it's a key-value store directly in your database.
01:43
I think Magnus in his talk yesterday was saying you shouldn't use your database as just a key-value store, and it's a perfectly good one. We'll get into a little bit of that and how, with indexes, it can work pretty well. Common table expressions and window functions are awesome too.
02:02
If there's things up there that are surprising, write them down and search for them. You should probably know about maybe every one of these features if you're using Postgres, just because it can make your life better if you know what to look for. So digging in, there's kind of two general workloads for any database, OLTP and OLAP.
02:25
OLTP is essentially web apps. So this is mostly where I'm going to focus on for the next while. OLAP is a perfectly valid place and needs performance considerations as well. But it starts getting much more varied and deeper much quicker.
02:43
In the OLAP world, you're looking at BI and reporting. I'll give some general rules for it, but it's essentially a whole other talk that we can dig into. Really quickly here, are people more on the web application side here, or are they doing more OLAP? How about the web side?
03:01
Okay. People doing data warehousing type things? Okay. So sorry if I disappointed any of you on the latter half. Hopefully there's a little bit in here that's still useful. So really quickly on Postgres setup and config. I'm not going to dig too much into this, because there's some guides basically on what to do
03:22
and basically copy and follow along, and then don't worry about it. So if you're on Amazon, shameless plug again: use Heroku, or look at Christophe Pettis's talk, "PostgreSQL when it's not your day job". Basically set about 50 configuration values and then you're good, and then you don't have to worry about it.
03:41
For other clouds, look at his talk. And if you're on real hardware, so good performance machines, Christophe's talk can be a good starting point. But to really optimize it, High Performance PostgreSQL. This is a book by Greg Smith.
04:00
It is pretty much the de facto reference for tweaking and tuning Postgres configuration. It's based on 9.0, but it's still pretty relevant overall. And there's the link. All the slides will be up online afterwards as well, so don't worry too much about links. But there's the link to Christophe's talk.
04:26
So generally when looking at a database, I work with a lot of customers and they'll say, you know, I'm having performance problems. And I think the first thing to do is to take a look from the 30,000-foot view: the nice, far-away view of how healthy this system is overall.
04:44
And I think the number one thing to pay attention to here is the cache. So Postgres is really good about keeping frequently accessed data in memory. If you were in Magnus's talk before this, he mentioned pre-warming, which is huge: when you fail over to a read slave
05:02
that hasn't been receiving traffic, your performance is basically going to go to crap. So pre-warming that before you fail over every so often is really nice. Postgres is going to do a better job of this than you are. You don't have to think about it. It's pretty intelligent. It's going to keep your frequently accessed data in cache.
05:22
And you're looking at a thousand-to-one performance difference between hitting cache and having to seek to disk. What we've seen is that data generally follows the 80-20 rule, so you don't need as much memory as the total amount of your data; only part of your data is going to be really frequently accessed.
05:43
And that's kind of rough, but generally 80% of your data should be in cache. So the cache hit rate, there's a nice simple query to get it. You can Google, you can find this.
06:02
If you're on Heroku, we have a plugin called pg-extras. You can just view the source and grab this query from there. It's going to give you something nice and simple that looks like this at the end of it. And what you're looking for is somewhere around 99% or higher. As soon as you start to dip below that, you're going to see pretty significant performance degradation on your queries.
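A common form of that cache hit rate query, roughly what pg-extras ships, built on the standard pg_statio_user_tables statistics view (a sketch; aim for a ratio around 0.99 or higher):

-- overall heap cache hit ratio
SELECT
  sum(heap_blks_read) AS heap_read,
  sum(heap_blks_hit)  AS heap_hit,
  sum(heap_blks_hit)::float
    / nullif(sum(heap_blks_hit) + sum(heap_blks_read), 0) AS ratio
FROM pg_statio_user_tables;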
06:24
The other one is index hit rate. So how often are you using indexes versus sequentially scanning a ton of data? This one's a little bit prettier; it's going to give you something that looks like this. On this one, I would say indexes follow a little bit of a different rule.
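A sketch of an index hit rate query along those lines, again from the standard statistics views (the exact pg-extras version differs slightly):

-- per table: how often an index satisfied a scan, with table size for context
SELECT relname,
       n_live_tup AS rows_in_table,
       100.0 * idx_scan / nullif(idx_scan + seq_scan, 0) AS pct_index_used
FROM pg_stat_user_tables
ORDER BY n_live_tup DESC;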
06:46
Right here I would probably say on anything that's over those top three tables, they're performing horribly. I'm doing a sequential scan over a few hundred thousand rows. And this is actually a real-world example from an app that we had
07:01
where queries I think were taking about three seconds; adding some indexes on them dropped it down to, I think, around ten milliseconds. So that's what you're typically looking at there. I'll get into a rough rule of thumb in a second. So I mentioned the 80-20 rule. I would say rough guidelines:
07:21
Cache hit rate, you're going to want it to be at 99% or higher. So anytime I'm looking at a database, if I can get it to 99% by just throwing more memory at it, my life's going to be a lot better. Index hit rate, looking at 95% or higher. And I put the where 10,000 rows there kind of in gray.
07:42
This isn't hard and fast. Doing a sequential scan across a hundred rows isn't too bad. Sometimes, if you've got ten rows in a table, the sequential scan is going to be faster than your index. But it's not going to hurt much to have an index even on a small table if you're frequently querying it.
08:04
So I mentioned some of those queries, the cache hit, the index hit. I generally recommend just saving these and using them all the time. A nice thing to do: how many people here customize their .psqlrc? OK, cool.
08:21
So hopefully this isn't news. It's basically assigning a query to a name, and then I can just run it by that name, like the bottom one right there. It's really nice to save these, so if you're on your own local machine: things like cache hit, index hit, slow queries.
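For example, a minimal ~/.psqlrc along these lines (the names are arbitrary; the bodies are the queries sketched above):

-- ~/.psqlrc
\set cache_hit 'SELECT sum(heap_blks_hit)::float / nullif(sum(heap_blks_hit) + sum(heap_blks_read), 0) AS ratio FROM pg_statio_user_tables;'
\set index_hit 'SELECT relname, 100.0 * idx_scan / nullif(idx_scan + seq_scan, 0) AS pct_index_used FROM pg_stat_user_tables ORDER BY n_live_tup DESC;'
-- then, inside psql, just type :cache_hit or :index_hit to run them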
08:41
We'll get to slow queries in a little bit. But these are super handy, and you can save yourself so much time versus googling for them every single time. One really quick detour: Datascope is an awesome tool for visualizing this, cache hit, index hit, and some other things.
09:01
It'll actually track it, visualize, overlay, look for things like block contention and other stuff. Just pull it out from the GitHub repo, install it. That should be pretty straightforward. So from there, the obvious thing is cache is decent, indexes are decent, but some things are still slow.
09:23
So understanding specific query performance. How many of you understand what an explain plan looks like in Postgres and can read it and make sense of it? There was a lot of really unsure hands there.
09:43
So, a pretty basic query. For those of you that aren't familiar, every query has a query plan. EXPLAIN will show you what Postgres thinks it's going to do; EXPLAIN ANALYZE will show you the actual path it took and how long each step took. So this is what one looks like, essentially.
10:03
And I think when I showed this slide once before, well, I'll talk about that in a second, but I got yelled at because I hadn't analyzed my data. So it'll show the cost there, and the cost is roughly what it thinks the query is going to do. It's actually in some measurement that I don't understand.
10:22
I don't think it's in actual milliseconds like it is with EXPLAIN ANALYZE; it's some internal costing unit in Postgres, but you can generally judge it the same way. So, trying to simplify something super confusing: the first number right there is the startup cost.
10:42
The second number is the total cost, and on the right are the estimated rows returned. I believe in 9.2 now, maybe it's 9.3, it'll also show rows removed sometimes on EXPLAIN ANALYZE when it discards rows, which is very helpful as well.
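For reference, a plain EXPLAIN and the kind of plan it prints (table name and numbers are illustrative, echoing the example in the talk):

EXPLAIN SELECT * FROM users WHERE email = 'craig@example.com';
--                        QUERY PLAN
-- -------------------------------------------------------
--  Seq Scan on users  (cost=0.00..1846.00 rows=1 width=72)
--    Filter: (email = 'craig@example.com'::text)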
11:01
So this is nice and marginally helpful. The real value is explain analyze. The difference here is it's actually going to execute the query. So if this is a query you don't want running on your system because it's going to take down production, then don't do this with it. It's going to run it. But it's also going to show the actual time.
11:21
So you've got the estimates out there that we saw before, and it's going to show you the actual time that it took for each step there. So you can start to see certain things: right there, 295 milliseconds. I'd say this is actually pretty bad for what it's doing. It's returning three rows in total,
11:42
which is pretty poor. So that actual time there is what you want to pay attention to, and a rough rule of thumb there, for most web applications, page response times I would say you're looking to get under 100 milliseconds. So common queries under 10 milliseconds,
12:00
rare queries under 100 milliseconds, and this entirely changes in the data warehousing world, but on small kind of result sets, this should be pretty much what you're aiming for. And I'd say common queries, you can get them down very close to one millisecond pretty commonly,
12:20
but at that point you start kind of micro-optimizing where you could be solving bigger problems. So as I said, that's a pretty bad time overall. Enter the obvious thing, indexes. So creating an index on this, you're going to see it start to change.
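A sketch of that before-and-after, with illustrative timings that echo the numbers in the talk (the index name is made up):

EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'craig@example.com';
--  Seq Scan on users  (cost=0.00..1846.00 rows=1 width=72)
--                     (actual time=0.049..295.473 rows=3 loops=1)
--    Filter: (email = 'craig@example.com'::text)

CREATE INDEX idx_users_email ON users (email);

EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'craig@example.com';
--  Index Scan using idx_users_email on users  (cost=0.29..8.31 rows=1 width=72)
--                                             (actual time=0.031..1.728 rows=3 loops=1)
--    Index Cond: (email = 'craig@example.com'::text)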
12:40
So it's going to use an index scan now, and it's going to bring that time down to 1.7 milliseconds. So there we're looking at a couple orders of magnitude shaved off, which I think is where you want to optimize. So any query, I'm looking at how many orders of magnitude versus just raw seconds. So Magnus also mentioned this in his talk,
13:02
and it was a mix: there were some people that were aware of it and using it, but still way too many that weren't. If you have Postgres installed, you should probably be using this on a weekly basis. So pg_stat_statements normalizes every single query that's run against your database.
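Enabling it is a one-time setup (a sketch; the module ships with Postgres as a contrib extension):

-- in postgresql.conf, then restart:
--   shared_preload_libraries = 'pg_stat_statements'
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
-- the normalized queries and their statistics then show up in the pg_stat_statements view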
13:20
So if I actually have something in there, like select star from users where email is craig@heroku.com, it's going to replace craig@heroku.com, as you see right there, and it's going to record a ton of information about it. It's going to record the number of times it's run, the average time that it takes,
13:41
the number of IO blocks dirtied, written, all of these things that are interesting. But I think the most interesting thing is that you can, with a very, very simple query, get an insight of where can I get the best cost savings on my database. So hopefully this is fairly straightforward.
14:00
It's basically going to take the total time of every query, render that as minutes, take the average time that it takes, so I can start to tell where do I want to optimize, and then how many times it's run as well.
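A query along those lines (a sketch; the column is total_time on the versions discussed here, while newer releases call it total_exec_time):

SELECT
  (total_time / 1000 / 60)::int        AS total_minutes,   -- overall time spent in this statement
  (total_time / calls)::numeric(10,2)  AS average_ms,      -- mean time per execution
  calls,
  query
FROM pg_stat_statements
ORDER BY total_minutes DESC
LIMIT 10;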
14:20
So that's going to start to give something like this. Assuming the full query is out there, so you can see that I've got 295 minutes on the first one against my database, and 10 milliseconds on average against each time it's run.
14:40
Looking at those two things, I can pretty much tell where I can give myself a ton back. I actually probably wouldn't optimize the first one, and would just jump down to the second one, since the totals are close enough and I can probably get two orders of magnitude back there. Has anyone in the room looked at their database like this and taken advantage of pg_stat_statements this way?
15:03
OK, a few people. This is a super easy way just to know where you are spending time and where you can get it back on your system. All right, so indexes. Postgres indexing is really awesome. There's quite a few types:
15:22
B-tree, GIN, GiST, KNN, SP-GiST. I hear VODKA is coming along as well, but I don't know when. Or at least it's rumored by the crazy Russians that create all of the index types. If you're like me, though, the first time I looked at this,
15:42
this was kind of my response. This is great, and it was explained to me, and I still didn't quite follow when to use which. And they have entirely different implications for whether they will actually help performance, size on disk, and other things.
16:01
I think Christophe talked a little bit about hstore and GIN and GiST earlier, on performance. I'm going to boil it down, hopefully, much simpler, and this isn't a perfect rule of thumb. B-tree is usually what you want. When you say create index, this is what you're going to get. In most cases, this is what you want. Use it. Don't worry about it.
16:23
GIN. For those using arrays or hstore. If you're not using hstore, take a look at it. It's awesome, like I was saying earlier; it's essentially a key-value store directly inside your database. You can shove a dictionary right in there, and it just works. GIN, generally, is for when you have
16:41
multiple values inside a single column. Arrays, hstore: it obviously makes sense there. GiST. Here's where you essentially have something 2D that can overlap. Full text search, shapes:
17:03
a rough rule of thumb is to use it where a value can fall under multiple different ones. KNN is for similarity. SP-GiST, I think I've had that explained to me 20 times, and I still only vaguely understand it. I hear it's good for phone numbers,
17:21
so when you've got different densities within different subgroups, it's the best I can do there. I'm sure there's some people in the room that maybe can explain it better. I would love to hear from you afterwards. I hear it's good for phone numbers, so if you're doing anything with phone numbers, give it a try.
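Circling back to GIN for a moment, a concrete sketch of the GIN-on-hstore pattern described above (table and column names are made up):

CREATE EXTENSION IF NOT EXISTS hstore;

CREATE TABLE products (
  id    serial PRIMARY KEY,
  attrs hstore                -- arbitrary key/value pairs per row
);

-- GIN indexes the individual keys and values inside the column
CREATE INDEX idx_products_attrs ON products USING gin (attrs);

-- containment queries like this can then use the index
SELECT * FROM products WHERE attrs @> 'color => red';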
17:42
But then there's a few more indexes, right? So there's your index types. There's also the ability to do conditional and functional indexes, which for a ton of applications are hugely applicable. So conditional ones, basically if I'm selecting star from places,
18:02
maybe I only want to index my top cities. I want to index places that have a population of over 10,000. So this is going to keep entries in that index only where that condition is true.
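A minimal sketch of that conditional (partial) index, assuming a places table like the one on the slide:

-- only rows matching the WHERE clause are kept in the index
CREATE INDEX idx_places_big_cities
  ON places (name)
  WHERE population > 10000;

-- queries that repeat the condition can use it
SELECT * FROM places WHERE population > 10000 AND name = 'Brussels';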
18:24
So creating it is pretty straightforward: create it on the table, pick the column, and put your condition there. You can actually have functions in these as well. So functional indexes, also equally awesome. Is anyone here using JSON in production?
18:42
A few people. I'm betting it's not performing well for you unless you're doing some functional indexes, or unless you're just storing a raw JSON blob. So JSON in Postgres is getting better. Hopefully when we have JSONB it actually gets a lot more performant.
19:01
But right now JSON is... it's not going to take advantage of a GIN or GiST index, just because it's text. With functional indexes, with something like PLV8... So PLV8, for those not familiar, is the V8 JavaScript engine embedded directly into Postgres.
19:22
So you can execute whatever JavaScript you want, which is awesome if you're using JSON. So using it, you can start to do things like this. So I've got a basic JSON blob down there. I can create a function called getNumeric, which is actually going to traverse down into the JSON,
19:42
pull out the key population from my column data, which data is an awful, awful column name, except I think it works for JSON because there's still all sorts of random data down in there. And so in this case I'm actually combining the functional index...
20:03
Or no, I'm not. Let's see. So here we go. So you can create the index on that function, and it's going to do what you expect there. One probably important thing is declaring it as immutable.
20:22
Immutable essentially says if I pass this same thing in, I'm going to get the same result out, versus if your function has the ability to do random things, to write data, to change data, which there are some valid cases for,
20:40
you're going to suffer then... Essentially, it's not going to be able to be immutable, and Postgres is going to optimize for that differently. So if you've got something that always returns the same result based on the same input, then you want it to be immutable because you're going to get the performance gains there.
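A hedged sketch of what that looks like end to end, assuming a places table with a json column called data and the get_numeric helper the speaker describes:

CREATE EXTENSION IF NOT EXISTS plv8;

-- pull one numeric key out of a JSON document; IMMUTABLE so it can back an index
CREATE OR REPLACE FUNCTION get_numeric(key text, data json)
RETURNS numeric AS $$
  return data[key];
$$ LANGUAGE plv8 IMMUTABLE STRICT;

-- index the extracted value...
CREATE INDEX idx_places_population
  ON places (get_numeric('population', data));

-- ...and use the same expression in queries so the planner can pick it up
SELECT * FROM places WHERE get_numeric('population', data) > 10000;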
21:03
And then, of course, you can combine them. So you can do conditional combined with functional indexes. Pretty straightforward. Create index concurrently. Is anyone here not aware of this?
21:23
Does anyone have friends that use MySQL? Well, friends may be the wrong word. This is the single easiest way to convince them. So hopefully I'm preaching to the choir here, but it's going to be roughly two to three times slower,
21:41
and when I say it doesn't lock the table, it doesn't take a lock the entire time on the table. It's going to take a lock, but it's essentially unnoticeable, and it's going to build up your index in the background. So you can keep writing data, keep doing shit, and you don't have to worry about bringing down production for a day or two while you try to improve performance,
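The syntax is just one extra keyword, for example:

-- builds the index without holding a long exclusive lock on the table;
-- roughly 2-3x slower, and it can't run inside a transaction block
CREATE INDEX CONCURRENTLY idx_users_created_at ON users (created_at);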
22:02
which certain other databases I hear can happen. So another big thing is pooling. Does everyone here... I'll just get into it. This is a New Relic performance chart, and right there at the bottom is the connection time from psychopg2,
22:26
and this is pretty normal. Some people have commented that it seems really high. The bulk of the cost there is SSL negotiation, which does have some overhead, but this was the default for a long time in Django,
22:42
that it would grab a connection on every request. I would say there's a couple of options here to fix it. There's the application framework layer, which often has connection pooling; I think Rails does by default, and the Sequel library in Ruby does.
23:02
Django now has persistent connections as of 1.6. I think the Play framework has persistent connections. I would look at your language framework; if it doesn't have it, figure out how to get it in there. It should just be a default. The other is a standalone daemon, and I think this is what most people think of
23:21
when they think of connection pooling. In reality, both can work and be useful, and larger applications will have both. A general opinion, I'd say you don't need the standalone daemon until you're maybe at 100 connections or more. There's varying opinion there in the Postgres community.
23:42
On the Postgres daemon side, there's two options, pretty much. PgBouncer, PgPool. I'd pretty much say PgBouncer is what you should be using. I'm sure people may disagree with that, but it's kind of one tool for the job,
24:01
which is nice, where PgPool does a lot of different things and comes with a few more caveats; you need to know a little bit more about what you're doing. As I mentioned early on, cache is important, and it's probably the most important piece I'm paying attention to when building OLTP applications.
24:23
The easiest way when performance is crappy is to just add more cache, throw more hardware at the problem. One nice option of doing that is replication. Replication has gotten a little bit better in Postgres, but the nice thing when I say replication and cache,
24:43
you can send read queries to a slave. If you can handle 10 milliseconds, 100 milliseconds of latency from the time it's written to the time it's read, which most applications can, the read slave is going to have its own cache and optimize for that, which can be very, very nice.
25:01
When it comes to replication, there's quite a few options. There's Slony, Londiste (it's supposed to be Londiste, that's a typo on the slide), Bucardo, PgPool, and WAL-E and Barman. Personal recommendation: if you're not already using replication and want to get started,
25:22
I would start with one of those two. They're simpler to run. Again, it's that really small, lightweight tool, so you don't have to worry about too many things. You'll know when you need the others generally. If you're just setting up for replication, they generally just work. They work with the Postgres streaming replication,
25:41
and there's just some nice tooling around it to make it a little easier to set up. In updating Postgres settings, I encourage most people, if you're not comfortable with looking at a Postgres config file,
26:03
I encourage you not to look, because it's scary. When you do need to update settings, there's essentially just a few I would generally point people at. I'm not going to go into super detail, because I think the Postgres docs actually explain them pretty well overall,
26:21
but essentially you're optimizing the cache size for your database and what's allocated to shared_buffers, plus work_mem and maintenance_work_mem, though apparently now there's going to be... I forget what it's called. The new one in 9.4
26:40
is autovacuum_work_mem, essentially work_mem for autovacuum. But work_mem is where, if you're doing a little more OLAP-type stuff, you're going to start tweaking and optimizing a little bit more.
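For illustration, hypothetical starting values for a dedicated box with around 8 GB of RAM, using the 9.4+ ALTER SYSTEM syntax (on older versions, edit postgresql.conf directly; the numbers are assumptions, not recommendations):

ALTER SYSTEM SET shared_buffers = '2GB';           -- ~25% of RAM is a common starting point; needs a restart
ALTER SYSTEM SET effective_cache_size = '6GB';     -- what the planner assumes the OS cache holds
ALTER SYSTEM SET work_mem = '50MB';                -- per sort/hash operation; raise it for OLAP-style queries
ALTER SYSTEM SET maintenance_work_mem = '512MB';   -- used by VACUUM and CREATE INDEX
SELECT pg_reload_conf();                           -- picks up everything except restart-only settings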
27:01
On backups, that's another area that actually has performance implications. Most people just add them automatically to their database, which is a good thing to do. I'm not saying get rid of your backups, but there's a couple of types of backups, and the choice does have some implications. The first one, I think, is what most people really do. It's what's called a logical backup. This is going to be a pg_dump of your database. It can be human readable.
27:21
It doesn't have to be, but it can be human readable. It's portable, so you can take it, restore it locally, cross-platform. It just works. This is what most people do. The other option is physical, so this is the actual bytes on disk. Generally, it's said to be
27:43
kind of architecture-dependent, that you're on the same one. I've heard rumors that it can work across architectures, but I don't think anyone in the community commits to that working. Basically, you're going to have to have the same infrastructure there to restore to it. It's not for pulling from a production setting down to your development laptop,
28:00
unless it's identical. But this is the actual bytes on disk. The base backup by itself is often useless. You need some of the write-ahead log as well to actually take advantage of it. Kind of comparing and contrasting the two. Logical, it's good across multiple architectures.
28:21
It's good for portability. It has load on the database. It has fairly significant load. I think the rule of thumb in the community, I would say it works for under 50 gigabytes. If you're over 50 gigabytes, it's going to start failing on you and not perfectly consistently.
28:40
It's just going to get worse over time. If you're up in the terabyte range, I'd be shocked to hear that you're doing a pg_dump and it actually works. It just starts failing. The really nice thing about logical backups is that, because they're human readable and portable, they actually do some validation of your data.
29:01
It's going to be a way to find corruption. I would encourage people to do it, but do it on a once-a-week, once-a-month basis. Checksums, now in 9.3, definitely help with checking for corruption. But if you're on an older database, this is probably the best way to actually find it and isolate it. Physical ones: more initial setup.
29:23
There's definitely some more work early on than with pg_dump; it is less portable, but it has limited load on the system. I think as you scale, this is really where you want to be. The other thing is that physical backups are typically what enable the replication setup,
29:42
and they're just a good thing to do. I'd encourage not necessarily one or the other; you shouldn't be looking at it that way, but at what combination of both you need. For us, we actually do base backups daily, and logical backups are kind of up to the users, but I encourage a once-a-week, once-a-month type thing.
30:03
I actually blew through all of that, but I'll recap, and there's plenty of time for questions. For OLAP, like I said, it's essentially a whole other talk. I think the key principle is there: if you're doing something in data warehousing, disk I/O is typically going to be pretty important to you.
30:22
The order on disk is helpful. There's an interesting tool that I'm not sure I would encourage you to use, pg_reorg. It comes with a lot of caveats, where you can essentially ruin your entire database if you do it wrong. But it is really awesome, and I think we're moving towards a little bit of that functionality
30:41
to appear in core over time. There's some interesting things there, but I don't think the full functionality is coming anytime super soon. The other thing, as you start to get more and more hardcore, is MPP solutions on top of Postgres. These can be commercial products, or they can be homegrown solutions
31:03
with things like PL/Proxy. For OLTP, the basics: ensure your data is in cache as much as possible, so 99%, and I think for us, anytime we see someone drop below that, we encourage them to upgrade.
31:22
Optimize your query load with pg_stat_statements. I would simply start with pg_stat_statements; I wouldn't look at one-off queries. I'd look and see how much overall load they have on the system, unless for some reason there's one report you really, really care about. Efficient use of indexes. And then, when scaling any database,
31:42
the easiest way to scale it is to throw more money at the problem. It actually does work: databases need memory and work pretty well with more of it. All the slides should be online, on Speaker Deck, Craig Kerstiens. I'll open it up for any questions.
32:13
I definitely recommend using pg_stat_statements on production systems. It's got some overhead, but it's pretty low overall.
32:20
I think the overhead that it introduces, you'll easily get back as you optimize. We started turning it on for every single customer by default because it is so valuable. Inevitably, you're going to say, how do I optimize my database? Having that data already there is hugely valuable.
32:47
pg_dump not being reliable above 50 gigabytes, can you talk more about what you meant by that? I mean it won't dump. The question, in case everyone didn't hear it, was can I elaborate a little more on pg_dump not being reliable
33:02
over 50 gigabytes? At a certain point, it'll just fail. It won't give you a dump. It won't really clearly say why either; pg_dump doesn't have the best error reporting. Sometimes it'll work, sometimes it won't. Typically,
33:20
when you get to that size, it's usually the one large table problem. You've got one table that's just really, really huge. For some reason, it just doesn't work well with it. I think there's some improvements being made to help. It also fails when you have thousands and thousands of tables.
33:41
So some of the management with getting that information from the catalogs doesn't work very well. So essentially, users are going to get nothing, and you're going to know it's not working. But the solution isn't to try to keep inching it along more and more and more. It starts to go the other way.
34:09
The question, in case anyone didn't hear it, is do I have any input on which file system should be used for Postgres? What layout?
34:21
File system layout. I think it really depends on a lot of things, on what you're doing. I'd say look at Christophe's talk. He gives a pretty basic version of: if you're on cloud drives, do this. If you're on these shittier cloud drives, do this.
34:41
And if you're on real hardware, do this. Essentially what you're doing there is you're just setting some of the page cost settings of how long does it typically take to grab a page. Once you tweak that for what you're on, Postgres will pretty much do the right thing with the planner. In terms of
35:00
is there a preferred file system to run on, I think that's probably a huge bike shed really. There's a ton of different opinions. I'm trying to think offhand of what we run on and it's escaping me.
35:25
I think you're going to find people in all camps. I don't have a strong opinion personally there, but check with me after and I can double check what we run on and probably get some more information on the way.
35:45
In Postgres terms, what do you use for automated backups? Yeah, so for automated backups... well, it depends. For all of our customers we do base backups every 24 hours. And our customers don't see that. We just do it.
36:02
We manage it. We worry about it. They don't have to think about it. Because people like to feel and touch and hold a backup, we have something called PG Backups for Heroku, and that is a pg_dump one. And that's because people do want to pull their data down locally. I'm not discouraging the use of
36:20
pg_dump. Just know that it has load on your system. Know that you should be doing other things, and know that it will stop scaling for you at some point. I think a lot of people add it on as a daily backup and then set up replication as well, where replication should lend itself to replace some of that. And by running PG Backups or pg_dump
36:42
on a daily basis, you're introducing unnecessary load. And we have a request come in every so often: hey, can I run this backup every hour? And it's like, why? You don't need it. You're not going to recover that way. There's a better approach. So yeah, it's there.
37:01
Users use it. It's helpful for moving data around. But it's not our default backup option. Awesome. Thank you guys.
37:31
Christophe, if you search for "PostgreSQL when it's not your day job", you'll find it. Yeah, that link is there.
37:41
But if you get to my slides, that link's on there.
38:00
My problem is I really need to be able to read fast from the disks, because we have these tables of over 100 gigs, and basically I have huge
38:22
sequential reads, and it's over 20 million records. So in this case, my main issue is how fast I read from the disks, and with those records
38:41
I sometimes can't delete things from that database because it gets overblown. Yeah, so there I would definitely say that's where Greg Smith's book becomes really handy. Greg does a lot of work on exactly that kind of heavy-I/O setup, so if you've got one,
39:02
Greg's book is the best source there. I hear he's updating it; I don't know when that's happening, but I know he's working on it off and on.
39:27
Are you Greg? Yes. I'm Kathleen. Oh, nice to meet you. How's it going so far? Good, busy. Just a very good question. Induction.
39:41
Do you actually have Induction in the app? It's not. Postgres.app is still maintained, but Induction was kind of a nice prototype; Matt, I know, is generally spending a lot more time on other things. But I'm actually working on a blog post around
40:01
Postgres query GUI tools. So if you're on Postgres Weekly, you'll see it. It'll be up in a week or two. Because there have been a few that have popped up, three or four, and some are better than others, but I think they're all improving.
40:20
So I'm trying to actually review all of them and say which ones are nice or not. Induction, I don't think, is getting a lot of love and updates. No. The promise was kind of cool. I agree. And it's something I definitely want to see keep improving and
40:40
changing. Because right now the state of GUI editors sucks. I'm not sure if it was a gist that Matt actually published at one point, where he revealed the future look; maybe it was a private branch that didn't get pushed or something. There was another
41:01
take on it that he was making. There were various problems with it. So stay tuned for most of that. Thanks a lot. Hi, I just wanted to ask about a quite special use case I have.
41:20
I tried to use Postgres basically as a data manipulation engine, because I just load the data and do some manipulation. You could say it's sort of OLAP. Not really. But I don't really need consistency or anything. Just
41:40
transform it and dump it out. So do you have any tips about this and keeping it in memory? Does it have an in-memory mode, with everything off? Well, no, I think there's a full in-memory mode for Postgres, but I can never remember how to run it. You can also use
42:01
RAM disks. There's another way to start it actually all in... From what I read online, I think in the official documentation about consistency, they say don't use RAM disks; these are the options which... Yeah, but if you're using it in-memory
42:21
the other way, you're not interested in consistency for that; there are all these settings, fsync off and the commit delay, and there's the synchronous commit mode that you can disable,
42:40
So if you turn fsync off, basically you have asynchronous... Running in Mongo mode. It's the MySQL mode, you say. Yeah, I mean, I don't know if anything immediately comes to mind, just because, I mean,
43:00
I think a lot of people will just run with fsync off, because it's like, OK, this works well enough. It's probably not going to lose data in that time. Load it into memory, run it, and then you're good. The people I know that have done similar things have done that and been OK with it, and that usually works well enough for them, but I don't know if
43:20
there's anything a whole lot better. And there you have some risk of corruption, but if you're just loading it and then running it, 99% of the time... I don't know of anything super more elegant or better right off.
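For what it's worth, a hedged sketch of the usual knobs for that kind of throwaway, load-transform-dump workload (never on data you care about: with fsync off, a crash can corrupt the cluster):

ALTER SYSTEM SET fsync = off;                -- don't force WAL to disk
ALTER SYSTEM SET synchronous_commit = off;   -- don't wait for WAL flush at commit
ALTER SYSTEM SET full_page_writes = off;     -- skip torn-page protection
SELECT pg_reload_conf();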
43:45
Are you running in that mode now or not? I'm just playing with it. I think that's the right general direction. I want to say I know of one or two people that have done that, and that's basically the direction they went. But I don't remember
44:02
a definitive result or how that went