
Highway to Hell or Stairway to Cloud?


Formal Metadata

Title
Highway to Hell or Stairway to Cloud?
Subtitle
How we migrated our biggest cluster to AWS
Number of Parts
35
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
At Zalando we run PostgreSQL at scale: a few hundred database clusters in sizes from a few megabytes up to 10 terabytes of data. What is a bigger challenge than running a high-OLTP multi-terabyte PostgreSQL cluster? The migration of such a cluster from the bare-metal data-center environment to AWS. There were multiple problems to solve and questions to answer:
- Which instance type to choose: i3 with ephemeral storage, or m4/m5/r4 plus EBS volumes?
- Should we give Amazon Aurora a try?
- A direct connection from AWS to the data center is not possible. How to build a replica on AWS and keep it in sync if VPN is not an option?
- The database (primary and replica) is used by a few hundred employees for ad-hoc queries; ideally, they should retain access through the old connection URL.
- How to do backups of such a huge database on AWS?
- We should be able to switch back to the data center if something goes wrong.
In this talk, I am going to provide a detailed account of how we managed to successfully solve all these problems.
Transcript: English (auto-generated)
It's a pleasure to see some people here, because in the other rooms there are also very important people giving talks, like Robert Haas and Peter Eisentraut. My name is Alexander. I work at Zalando as a database engineer and I'm active in the Postgres community.
People know me as the Patroni guy or Mr. Patroni; I'm responsible for maintaining and developing the Patroni project. Probably not many of you know about Zalando. It's a European fashion retailer.
We operate in 17 countries across Europe and sell a lot of shoes, clothes and accessories online. More than 15,000 employees work for Zalando, and among those 15,000, about 2,000 work in the technology department.
In total, we have more than 200 teams developing applications. We started, like many companies, on-premise, and we still have a lot of infrastructure on-premise.
If we talk about Postgres, we run more than 300 databases on-premise. Those are not database clusters; the number of database clusters is probably closer to 100. Four years ago, we started migrating a lot of things to the cloud.
Unfortunately, we didn't get very far with migrating the existing on-premise infrastructure to the cloud as it is; it's a long process. Meanwhile, a lot of the new things being developed are deployed directly to the cloud. I updated this number just yesterday and was impressed that we already have more than 1,000 database clusters there.
Those are real clusters, and they are highly available. About 200 of those clusters run on EC2 instances on AWS, and the remaining ones run on Kubernetes, which is deployed on AWS.
So what is this talk about? Historically, we came to support a very big database in the data center which holds different business events. This database is used by different applications to track orders and by BI people to do reports.
It is accessed by applications and different people from our organization.
At some point this database became so big that, four years ago, we had to order new hardware. In the data center, that process took us more than one month from the moment we ordered the hardware until we had Postgres running and everything migrated to it.
We knew in advance that one day we would hit the limit of disk space on this server. Since the company goal is to migrate everything to the cloud whenever possible, even if it costs
a little bit more, we decided that we should not order anything for the data center anymore but migrate instead. We had to take care of some preconditions. The cloud setup should be powerful enough to handle the load we currently have.
For employees, and especially for applications, nothing should really change. They should be able to access the database as they used to, with a username and password and some host and port.
In case something goes wrong in the cloud, or it is not powerful enough or way too expensive, we still want to have a path to migrate everything back. Of course, downtime should be minimized as usual, and backing up 10 terabytes or more, no matter whether on-premise
or in the cloud, is a very important topic: such backups sometimes take too long and are not very efficient.
So how did the old setup look? It was a quite powerful machine with 16 cores and 256 gigabytes of RAM. It's not the most power you can get, but we had 14 SSDs of 1.5 terabytes each per machine, arranged in a RAID 10.
So in total there are 10.5 terabytes of disk space. On every machine we run Postgres, and for applications and developers we use a virtual service IP to give access to the database.
In case of disaster, we promote the replica and move the service IP away from the primary, and people still connect to the new database without any changes. So what do we have inside this database?
Maybe it sounds funny and ridiculous, but there are 3,000 tables. In total we have about 1,500 different event types, and every event type is stored in its own tables. One table is for hot data for the past 10 days.
It's important to keep it separate, because most of the queries people execute are against these relatively small tables. After 10 days, the oldest day of hot data is migrated to the long-term archive table.
We have some policies about how long we keep archived data. Some events we keep more or less forever; other events we expire depending on the event type.
It might be a few months, half a year, maybe a year. One and a half years ago, the size of our cluster grew to 10 terabytes. According to the statistics we accumulated during the previous years, the average growth of our cluster was about 2 terabytes per year.
In total, we had only 10.5 terabytes of disk, so we really had to decide what to do very fast, because the remaining half a terabyte gave us at most three months.
After that, nothing more could be stored in this cluster. We decided, okay, let's give the cloud a try. But of course the cloud is not cheap; everybody knows it.
By my estimate, we are currently paying in the cloud roughly two and a half times more for the same setup than in the data center. In case something goes wrong, we want to be able to switch back without rebuilding everything from scratch; basically, we want to replicate data back from AWS to the data center.
Applications, developers and other people from the company who use this database should still be able to use it as it is. Since we keep data in the cloud and there are communication channels between the data center and the cloud,
we have to take care of security and protect everything with certificates. Downtime should be minimal. So what can we choose from on AWS? Of course, there is the amazing offering of Amazon Aurora.
General availability was announced just a few months before we started this project. Another option is to do it yourself: as usual, you can spin up an EC2 instance, install Postgres,
replicate your data or restore from a backup, and start running. There are different kinds of EC2 instances on AWS. We really like i3 instances; they are very powerful and very fast. There are certain disadvantages of i3 instances, but we will get back to them a little bit later.
Another option is an instance that relies on Elastic Block Store, EBS. In the cloud, you can allocate a volume of arbitrary size
and use this volume to keep your data. In case an instance is terminated, you just attach the existing volume to a new instance and your data is back; you only need to start Postgres up. So what was wrong with Aurora? It seems to be very nice.
You pay for storage only once, although you can run multiple instances of Aurora, because they use shared storage. The price for storage is exactly the same as for gp2 volumes on EBS. At the same time, there are hidden costs.
On Amazon Aurora, you have to pay for every I/O request to this storage. One million I/O requests costs you 22 cents. It seems like a very small number, but if you are doing billions of I/O requests,
at some point this number really blows up and you end up paying two, three, maybe five times more than you expected. The most important factor for us is that the number of extensions offered by Amazon is very limited.
Among other things, the PL/Proxy extension is not available. We use it, so Aurora was basically out of the game. Now, i3 instances. They have local NVMe devices, which means that disk I/O latencies are very small.
They give you high bandwidth and high throughput, and the price per terabyte is really small. The biggest instance, i3.16xlarge, has nearly half a terabyte of RAM. That's a lot.
The disadvantage of local NVMe volumes is that they are ephemeral: if the instance is stopped or terminated, your data is gone. For high availability, in order to make sure you don't lose a single bit, you therefore need to run at least three instances, in my opinion.
The biggest instance with local volumes in those days had only 15 terabytes. About one month ago, AWS announced a new type of instance with way more disk space on ephemeral volumes than i3.
But it wasn't available one year ago. EBS-based instances have different pros and cons. On EBS, data survives an instance stop or restart. That's nice, and we can benefit from it:
if you want to run a better instance with more performance, you can simply start a new instance, attach the volume, and start Postgres. That's it. And since data is persisted on EBS, we can afford to run only two instances for high availability,
one primary and one replica. The disadvantages of EBS: I/O latencies are much higher compared to local volumes, and performance is limited. Actually, this number is a little bit outdated. Right now, on gp2 you can get up to 250 megabytes per second per volume,
but in those days it was 160. The number of IOPS is also higher now, 16,000. io1 gives you better performance: throughput, IOPS and so on.
But let's compare gp2 and io1. Let's imagine we want a volume that provides us 10,000 IOPS and holds 10 terabytes of data.
In fact, on gp2 we will actually get 16,000 IOPS, but it doesn't matter: the price remains the same, and this price is about 1,190 euros per month for the gp2 volume,
and nearly twice as much for the same amount of IOPS on io1. It's a really interesting factor. On io1 volumes you pay not only for the storage, you are also paying for the provisioned IOPS. You pay slightly more, roughly 30% more, for the allocated storage itself, and on top of that you pay for the IOPS.
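As a rough back-of-the-envelope illustration (the per-gigabyte and per-IOPS rates below are my assumptions based on typical eu-central-1 list prices of that time, not figures quoted in the talk), the monthly cost works out roughly like this:

```
\begin{aligned}
\text{gp2, 10 TB:}\qquad            & 10\,000 \times 0.119                             \approx 1\,190 \text{ EUR} \\
\text{io1, 10 TB, 10k IOPS:}\qquad  & 10\,000 \times 0.149 + 10\,000 \times 0.078      \approx 2\,270 \text{ EUR} \\
\text{io1, 10 TB, 30k IOPS:}\qquad  & 10\,000 \times 0.149 + 30\,000 \times 0.078      \approx 3\,830 \text{ EUR}
\end{aligned}
```

These assumed rates reproduce the "nearly twice" and "nearly three times" factors mentioned in the talk.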
If you want to have more IOPS, you can take different approaches. With gp2, for example, you can allocate multiple volumes and build a stripe out of them,
effectively multiplying I/O performance. You can still use the same strategy on io1, but it will cost you more: if you want to scale IOPS on io1, you end up simply paying for the additional provisioned IOPS
on top of the price for the storage. On this picture, we see that if you go from 10,000 to 30,000 IOPS
with io1 volumes, you end up paying nearly three times more. That's not nice, so we always use gp2 volumes. Still, the cloud has certain advantages.
You can spin up instances and allocate volumes in a matter of minutes and do your performance tests. Ideally, you deploy your database and apply load that is very similar to production load; in the best case, you can actually duplicate production load.
That's what we did. The application that writes events into the database is able to push the data to multiple databases at the same time.
This ability wasn't really used before, but it was very handy for testing. Another very nice thing from AWS is the spot market for spot instances.
Spot instances are usually two or three times cheaper than regular on-demand instances, but they can be terminated at any time, because you are bidding on the instance: while the price is below your bid, your instance keeps running.
If the average price goes above your bid, AWS takes your instance away. But for testing a system, this is fine, and you can save some money on such tests. Still, you should never forget about risks.
If you are building something that is supposed to be highly available, you should consider all the risks. If you choose i3 instances with ephemeral storage, you have to run at least three, because the chance that two instances go down at the same time
is much higher than the chance that all three instances go down at the same time, and restoring from S3, for example, would take hours. I have some numbers. With EBS, we can afford to run only two instances.
These are the prices, in US dollars, for a single i3 instance and for an EBS-based instance with the same amount of storage that an i3 instance provides, about 15 terabytes.
But for a highly available cluster we would need to run three i3 instances, and that becomes about twice as expensive as running just two EBS-based instances.
So we made up our minds and came up with the following setup: two r4.8xlarge instances. This instance spec is pretty similar to what we had in the data center. The amount of RAM is slightly less, but the number of CPU cores is exactly the same, 32,
although the storage is twice as big. We ended up provisioning six gp2 volumes and made a RAID 0 out of them: we striped the six volumes and got 20 terabytes of space.
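As an illustration, here is a minimal sketch of how such a stripe could be assembled with mdadm; the device names and mount point are assumptions, not the exact commands used at Zalando:

```
# Stripe six attached gp2 EBS volumes into one RAID 0 array (device names are assumptions)
mdadm --create /dev/md0 --level=0 --raid-devices=6 \
      /dev/xvdb /dev/xvdc /dev/xvdd /dev/xvde /dev/xvdf /dev/xvdg

mkfs.ext4 /dev/md0                       # file system on top of the stripe
mkdir -p /home/postgres/pgdata
mount /dev/md0 /home/postgres/pgdata     # PGDATA lives on the striped volume
```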
The next thing, which we explored in parallel, was how to retain access to the database, now running in the cloud, for applications, developers and analysts.
What are the possible options? We can use DNS, for example, because people and applications are not using IP addresses; they are just using DNS names. Another option is some kind of proxy strategy:
we can, for example, route connections with the help of iptables, or run HAProxy, or use PgBouncer. But there are certain things which are very important. Some of the old applications do not use SSL and don't really support SSL connections to the database.
Those applications are considered legacy, and nobody likes to touch legacy code. Basically, the only option that gives us SSL connections over the internet
while still allowing unencrypted access for the local legacy clients is PgBouncer. So in the cloud, we start up the two replicas.
We create two instances of PgBouncer in the data center, and we just move the virtual IP of the existing database host to the PgBouncer host. Basically, everybody remains happy; they keep using the same connection string.
Everything goes encrypted over the internet, and everything seems to be fine. So how do we make it more secure? This also required a bit of investigation from us, and we had to set up our own small certificate authority.
This is a very important step. It gives us control: on the server side we can check the client certificates presented by PgBouncer, and PgBouncer, in turn, will check that the server is exactly the server it is supposed to connect to.
It checks the server certificate against the certificate authority. So how do we configure this whole setup? In postgresql.conf it's fairly straightforward: we enable SSL, we specify the server certificate and server key, and we point to the root CA certificate.
In pg_hba.conf, we open SSL-only access from the specific public IP we have in the data center,
and we specify just the clientcert option. This option checks that the client has a valid certificate and verifies it, but it does not take the username from the certificate, so existing users keep using the same usernames. Everything that does not come through an SSL connection is rejected by pg_hba.conf.
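A minimal sketch of what the server-side configuration could look like; the file names and the placeholder IP are assumptions, and on newer Postgres versions the clientcert option is spelled verify-ca instead of 1:

```
# postgresql.conf: enable SSL, point to the server certificate, key and our own CA
ssl = on
ssl_cert_file = 'server.crt'
ssl_key_file  = 'server.key'
ssl_ca_file   = 'root.crt'

# pg_hba.conf: SSL-only access from the data-center public IP (placeholder address),
# requiring a valid client certificate but keeping the usual md5 username/password
hostssl all all 198.51.100.10/32 md5 clientcert=1
```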
On the PgBouncer side it's also more or less straightforward. Since we have tons of different users,
we cannot use PgBouncer's best mode, transaction pooling; we have to use session pooling. In the userlist, there is only one user, which PgBouncer itself uses to execute the auth query,
and this user-lookup function simply extracts the MD5 hash from the database, so PgBouncer itself does not connect to the database as a superuser. We configure PgBouncer to verify the server certificate against our certificate authority, and of course the client certificate and client key have to be present as well.
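A minimal pgbouncer.ini sketch of this setup; the database name, paths and the auth function name are assumptions:

```
[databases]
; forward the existing database to the new primary in the cloud (names are placeholders)
eventsdb = host=events.example.org port=5432 auth_user=pgbouncer

[pgbouncer]
pool_mode = session                ; too many distinct users for transaction pooling
auth_type = md5
auth_query = SELECT * FROM pgbouncer.user_lookup($1)  ; returns username and md5 hash
server_tls_sslmode   = verify-ca   ; check the server certificate against our CA
server_tls_ca_file   = /etc/pgbouncer/root.crt
server_tls_cert_file = /etc/pgbouncer/client.crt
server_tls_key_file  = /etc/pgbouncer/client.key
```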
So how did we migrate the data?
We considered basically two options. First, we could use pg_basebackup to pull a backup either from the primary or from the replicas in the data center. But unfortunately, in the data center everything is protected and there is no direct connection from the internet.
Setting up any kind of VPN would require communication with the system administrators, and from past experience it usually takes a few weeks to set such a VPN up. Another option would be an SSH tunnel, for example.
But still, in case something breaks, we would have to restart the base backup from scratch; it's not possible to continue if the connection was broken.
The second option is to use some S3-compatible backup tool; S3 is the standard storage on AWS. There are actually four tools on the market that implement S3 support. The oldest one is WAL-E; it is battle-proven and everybody is using it.
The second one is pgBackRest, which got S3 support probably a couple of years ago. The third one, WAL-G, is very modern, written in Go, and one of its authors is here in the room. The fourth tool is pghoard, which also supports S3,
but at the moment we were doing the migration we were not aware of it. Since we already had experience with WAL-E and use it very actively in the cloud,
we decided, okay, let's go with WAL-E. So here is our plan. First, we push a base backup to S3 from the primary; there is no way to do it from a replica with WAL-E because it doesn't support the non-exclusive backup method.
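For illustration, the WAL-E side of that plan could look roughly like this (the envdir path and archive_command follow the usual WAL-E pattern; the exact values are assumptions):

```
# On the primary in the data center: push a full base backup to S3
envdir /etc/wal-e.d/env wal-e backup-push "$PGDATA"

# postgresql.conf on the primary: ship every completed WAL segment to the same bucket
archive_mode    = on
archive_command = 'envdir /etc/wal-e.d/env wal-e wal-push %p'
```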
In the cloud, since I'm a Patroni developer and everything is supposed to be supervised by Patroni, with a few tricks we set up, so to say, a standby cluster on AWS. It has no real primary, only replicas, and all replicas replicate data from the S3 bucket.
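Today this can be expressed directly with Patroni's standby_cluster section; the sketch below uses the current syntax and is only an approximation of the 'few tricks' mentioned here, with paths as assumptions:

```
# patroni.yml (fragment): a standby cluster with no reachable primary,
# recovering WAL from S3 via WAL-E instead of streaming
bootstrap:
  dcs:
    standby_cluster:
      restore_command: envdir /etc/wal-e.d/env wal-e wal-fetch %f %p
```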
Here are some statistics: pushing the 10-terabyte base backup from the data center to S3 took nearly half a day, roughly 12 hours.
Thanks to data locality on AWS, fetching the base backup from S3 and restoring it onto the system took nine hours, so that was slightly faster. But meanwhile, a lot of WAL accumulated during those 21 hours,
and it took roughly four hours to replay this WAL stream from S3 and get the replica quite close to the primary.
Since we don't have streaming replication and are using WAL shipping, the replication lag can sometimes be a few seconds, and basically this replication lag is determined by the amount of write activity on the primary. During business hours, when a lot of people are working,
or in the evening, when a lot of customers are purchasing something on the website, we get a lot of events and a lot of write activity, and therefore the replication lag is very small. During night hours, the replication lag can be up to one minute, maybe more,
but that is acceptable for us. Here is our switchover plan, but it's boring to read a plan, so let's just walk through it on the picture. First, we set up the two replicas, which replicate data from S3, while the primary pushes WAL to the S3 bucket with the help of WAL-E.
The very first thing we did was move the virtual IP of the replica to the replica PgBouncer. From that point, everybody who was using the replica was already executing queries against the replica on AWS, and it made it possible to ensure that our plan would finish successfully.
We could analyze the performance of the replica, find weak points, and so on and so forth. We actually ran in this state for a few days.
On the day of the migration, this is what we did. First we shut down the replica and moved the virtual IP to the primary PgBouncer.
At this moment, applications and users started getting some exceptions if they wanted to run transactions, but only for a very short time. Why was this necessary? In order to remove load from the primary and minimize the time
it takes to archive and push the last WAL segments. Then we shut down the old primary, it pushed the last WAL segments to S3, and we promoted one of the nodes on AWS. Basically, at this moment the migration was done, except that we still had to prepare our fallback plan.
So we started both servers in the data center back up with a recovery.conf whose restore_command pulls WAL from S3 with the help of WAL-E.
There is one interesting trick we had to use. When you shut down the primary, Postgres doesn't archive the very last WAL segment: it writes the shutdown checkpoint record into this segment, but the segment does not get archived.
You either have to push this WAL segment to S3 manually, or you have to use pg_rewind to rewind the former master. We preferred to just archive this WAL segment manually.
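A minimal sketch of that fallback configuration on the old data-center nodes, with paths and the WAL segment name as placeholders:

```
# recovery.conf on the former primary and replica in the data center
standby_mode    = 'on'
restore_command = 'envdir /etc/wal-e.d/env wal-e wal-fetch %f %p'

# The shutdown checkpoint ends up in the last WAL segment, which is not archived
# automatically, so push it to S3 by hand (segment name is just an example):
envdir /etc/wal-e.d/env wal-e wal-push "$PGDATA/pg_wal/000000010000ABCD00000042"
```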
Okay, so what about backups? Because of the lack of free space on some storage systems in the data center, we had to use logical backups there, and luckily pg_dump can do backups in parallel if you use the directory format.
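For example, a parallel directory-format dump might look like this (database name, job count and target path are assumptions):

```
# logical backup in directory format, 8 parallel jobs
pg_dump --format=directory --jobs=8 --file=/backup/eventsdb_$(date +%F) eventsdb
```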
In the cloud, we have a more or less infinite amount of space in S3. So how does it work with the usual base backups and WAL archiving? Again, with tooling we are already familiar with:
WAL-E, pgBackRest, and WAL-G. Every tool has certain pros and cons. WAL-E is our primary backup tool; we have a lot of experience with it and know what to expect, but it's not so good when you are backing up tons of data, terabytes of data,
and it can't take a base backup from a replica. pgBackRest is very modern and supports S3. It can do incremental and differential backups: you do a base backup once, and the next day you just take the difference between the last base backup and what has changed.
So it's much faster and way more efficient. Unfortunately, pgBackRest doesn't support AWS instance profiles. That's a feature of AWS where, through the metadata API on the instance,
you can fetch temporary credentials and use them to access AWS services like S3. We really rely on this feature, it's very convenient for us, and unfortunately there is no support for it in pgBackRest.
With WAL-G, you can also do delta backups, and you can use it on a replica. It's very important that it is backwards compatible with WAL-E. You can also choose a different compression method; by default it uses LZ4, which is very fast, but the compression ratio is not the best.
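A hypothetical WAL-G invocation showing these features (bucket name and settings are assumptions, not our production values):

```
# full or delta base backup taken from a replica, compressed with Brotli
export WALG_S3_PREFIX=s3://example-backup-bucket/events-cluster
export WALG_COMPRESSION_METHOD=brotli
export WALG_DELTA_MAX_STEPS=6       # allow up to 6 delta backups on top of a full one
wal-g backup-push "$PGDATA"
```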
Meanwhile, WAL-E supports only LZO compression, and pgBackRest uses zlib, which is a few dozen years old, but we are living in the 21st century. So, what about efficiency?
As a baseline, what does that mean? If you just read the data from disk, with the performance limitations of the specific instance type and the EBS volumes, it takes four hours.
It was actually already 12 terabytes when I was doing these tests, and it takes four hours just to read this data from disk. With WAL-E, it takes 17 hours to read those 12 terabytes, compress them,
write temporary files, and push everything to S3. Unfortunately, we are still using WAL-E, and by now it takes 22 hours. So we still have some homework to test WAL-G better, but WAL-G really shows incredible results.
It manages to compress 12 terabytes down to 2 terabytes, saving us money we pay to AWS for S3 storage, and it takes only six hours to do such a backup, so it's basically three times faster than WAL-E, although it takes more CPU to compress with the Brotli algorithm.
And with delta backups, it's really amazing: a delta backup takes about two and a half hours, and since our database is mostly insert-only, there are not so many changes.
Therefore, the delta backup on the next day is only 75 gigabytes. So basically, the whole story was a success for us, but at first we didn't pay much attention to monitoring our database.
What we noticed: on this picture you see a graph of the average runtime of the stored function that inserts data into those different tables.
Before the migration, the line was very smooth, on average about two milliseconds; maybe it's not very well visible. After the migration, we got a lot of spikes. But we didn't pay much attention, and nobody really complained, because the data doesn't arrive synchronously anyway:
it is buffered in Kafka or some other queues, and the application just pulls data from the queue and writes it to our database. So for end users it basically means that some events show up in the database
a few seconds later. We did some analysis and figured out that we had forgotten to turn synchronous commit off.
In the data center, it wasn't enabled; there is not much value in synchronous commit in such a setup. After switching it off, the line became relatively smooth again, though not as smooth as it was in the data center, and I can only explain that by the storage system.
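Turning it off is a one-liner; a minimal sketch (connection details omitted):

```
psql -c "ALTER SYSTEM SET synchronous_commit = off;"
psql -c "SELECT pg_reload_conf();"   # the setting does not require a restart
```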
Yeah, basically that was it. A few useful links. Personally, I like EC2Instances.info the most. If you want to compare instance performance or instance specs on AWS,
they really have everything on one page: what memory a specific instance has, how many CPU cores, what the network performance is, and so on and so forth. On AWS itself, if you start looking for some numbers,
they are spread over multiple pages, and it's not really easy to click here and there, compare, and maybe write something down on paper. So I hope this talk will help you in case you have to do something similar.
And if you have some questions, I'm happy to answer. Well, there is one thing I didn't mention.
The application that pushes most of the events into the database was also migrated to AWS, so the latency for it is not a big deal. And for the applications that access the database for reporting,
or for developers and BI people, 15 milliseconds of round trip on top of a query doesn't really matter.
Yes, it was moved to AWS; that was relatively simple. Yes, that is for the apps, for developers, for BI people. The biggest problem for us is that some of the legacy applications
have a weird setup and don't support SSL in the database connection. Changing anything in those applications would be a big pain for us, because the people who actually implemented them are not with the company anymore.
Nobody else wants to touch legacy code. By using PgBouncer, we actually avoided a lot of headache.
On AWS, you can increase volume size online without a big performance hit. So you simply increase the volume,
and then there is a command on Linux you can execute to extend the file system onto the bigger volume. Basically, you do it online, without downtime.
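A hedged sketch of that procedure for a single gp2 volume (volume ID, device name and sizes are assumptions; with a RAID 0 stripe, every member volume would need to be grown the same way before growing the array):

```
aws ec2 modify-volume --volume-id vol-0123456789abcdef0 --size 2000   # grow to 2000 GiB
growpart /dev/xvdb 1        # only needed if the file system sits on a partition
resize2fs /dev/xvdb1        # grow ext4 online, no downtime
```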
Well, basically you are referring to something like the noisy-neighbour problem, I think.
AWS provides certain SLAs and guarantees; it's not in our control. They claim that their infrastructure is absolutely elastic.
In fact, we all know that this is not 100% true. They also have some capacity reservations. If we speak about a disaster plan, we do run those two instances in different availability zones.
Availability zones, in AWS vocabulary, are different data centers. They are geographically distributed, and between two availability zones the latency is something like one millisecond. So it's quite unlikely that if one data center, one availability zone, goes down due to some disaster,
it will also affect the second one. But we will see. So far, in the past four years, we've had no big problems.
More questions? Yeah. Oh yes, he's a former colleague of mine. Unfortunately, he left our company a few months ago. He's quite an amazing artist, but an even more amazing engineer.
Some people are very talented. All right, so thank you very much for coming. Yeah, yeah, sure, I will do it.