Yandex.Mail success story
Formal Metadata

Title: Yandex.Mail success story
Number of Parts: 34
License: CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifier: 10.5446/48480 (DOI)
PGCon 2016, talk 1 of 34
Transcript: English (auto-generated)
00:01
Hello, everyone. My name is Vladimir. I'm really excited to be here. This is my first visit to Canada, and I'm a bit nervous because it's my first talk outside of Russia in English, and probably in front of an audience that doesn't know anything about Yandex. Could you please raise your hand if you have ever heard about Yandex?
00:23
Quite a lot. That's a surprise. For all the others, long story short, Yandex is the Russian Google. Our main business is the search engine. We are number one in Russia, and we are doing really well in several other countries. You can see some details here.
00:42
I'm working as a DBA in the Yandex Mail team. We have a mail service. It was launched nearly 16 years ago, so it is quite old. Our daily audience is more than 10 million people, and our backends receive more than 200,000 requests per second.
01:04
So we think it's quite a high-load system. Every day we store more than 150 million new letters. Those are the letters that we store; there are many more spam letters that we reject.
01:22
And our storage right now is more than 20 petabytes in size, so we think of it as a somewhat big data service. What I'm going to talk about today: we made a decision to migrate all our metadata
01:42
from Oracle to Postgres. And we did it. It is nearly 300 terabytes in size with a total workload of 250,000 requests per second. It is OLTP workload, so small queries.
02:01
Most of them are read queries. Actually, this is not our first attempt to remove Oracle from our production. More than 10 years ago, we tried to migrate to MySQL, and it failed, which is obvious, since I'm at a Postgres conference.
02:22
We also thought we were so cool that we could write our own solution for storing our metadata. It was really fashionable to write different NoSQL solutions nearly seven years ago, and that also failed.
02:42
So what actually is mail metadata? This is everything you see when you open your inbox. For example, folders, which are actually a hierarchy with some counters. Labels, for example important letters or unread letters or something else.
03:05
Collectors, a way to collect your letters from other mail services; I'll talk about them a bit later. And of course, letters. We don't store message bodies in our databases; they are stored in our object store.
03:22
In databases, we store things that we call envelopes. They contain meta information about the letter: for example, the subject; the recipients (from, to, and so forth); the first line of the letter; dates; different flags; information about attachments; size; and so on.
03:45
And letters can be grouped into threads, or, as some call them, discussions. This is a group of letters with the same subject, for example, or something else. It is also stored in the databases.
04:04
So how was it nearly four years ago? All this metadata was stored in Oracle. We had lots of PL/SQL logic. These hosts had probably the most efficient hardware usage in all of Yandex, and probably in many other systems.
04:26
We used to store more than 10 terabytes of metadata per shard. Our load average was 100. That was not the load average when we had problems; it was a typical, normal load average.
04:41
We usually have 32 cores per host. We had not so many databases, so lots of operations were done manually, and there was nearly no automation. For hardware efficiency, we used to divide our databases into warm databases with SSD disks
05:05
and cold databases with SATA disks, for different types of users. There are users that register and then just never use the mailbox.
05:23
So nearly one quarter of our databases had SATA disks, and the rest had SSD disks. Sharding is very important for Yandex because we are used to scaling horizontally. We used to have the following scheme.
05:42
When a request comes to one of our backends, the backend exchanges the user information for a shard name via our internal service called Blackbox. Blackbox is used for checking authentication: login, password, cookie, all that stuff, everything.
06:03
In response, the backend receives a user ID and a shard name. It then gives this shard name to the Oracle driver. Our backends are written in C++, so we use the Oracle C++ Call Interface, OCCI. OCCI takes the shard name and looks it up in a special file, tnsnames.ora.
06:24
It gets the connection strings for the different hosts from there, and the Oracle driver decides which of the hosts is the primary and which is the standby. It then opens a connection, always to the primary.
06:42
So sharding was implemented with our internal service, and fault tolerance was implemented inside the Oracle driver. Okay? As I already said, most of our applications are in C++, and they had an abstraction library for accessing our databases.
07:03
It is called macs, for meta access. And for many years it had only one implementation, macs_ora. We also used to have a library implementing a connection pool, called dbpool, and at the bottom, of course, we had OCCI.
07:23
This is how it was designed, but over many years it became the following: the abstraction leaked. Many of our backends started using methods of macs_ora or dbpool directly. We also started having backends in different languages, and of course they couldn't use a C++ library.
07:43
So we had a mess, and it took us a lot of time to bring order to it. Although Oracle is really good at many things, we had some problems with it. For example, deploying new PL/SQL code was a big pain for DBAs,
08:06
because Oracle uses a library cache. When you update some stored procedure, for example, lots of other sessions hold a shared lock on it, and when you deploy new code you get lots of problems with locking.
08:23
As I already said, we didn't have much automation for most operations. Switchovers, setting up a new database, and transfers between shards were done mostly manually.
08:41
The C++ driver for Oracle has only a synchronous interface, while most of our applications are based on an event-loop model or use many asynchronous calls, and that was a big problem for us. Since we didn't have much automation for deploying new databases,
09:01
we had problems with development environments. And although Oracle has support, and it is good in many cases, sometimes we had problems with it. For example, when you have a problem and need it fixed very fast, because we are a 24x7 service, sometimes that didn't happen.
09:23
But the main reason is, of course, the price. Oracle costs are terrifying. If you don't know them, just sleep in peace and don't go there. So, how did we do it?
09:42
In October of 2012, a political decision to get rid of Oracle in three years was made. That was a political decision. That was not a decision to migrate from Oracle to Postgres. That was a decision to get rid of Oracle.
10:00
Six months later, half a year, we started our first experiments with different databases. That was Postgres, lots of NoSQL databases that were really fashionable at that time, and we even tried our self-written solution based on our search engine. Actually, that half a year was wasted as far as this goal is concerned.
10:27
We were doing some other useful stuff for improving our stability and for other product launches but for six months nothing happened for this goal. And it took us nearly a year
10:41
to make our first experiment with Postgres. That was Collectors. We took a small part of our mail metadata and migrated it to Postgres. What are Collectors? Collectors are a way to collect your letters from other mail services. For example, if you have a mailbox on,
11:01
for example, both Gmail and Yandex, you can collect letters from Gmail and see them in your Yandex inbox. That was our first experience with Postgres. We had lots of problems, but we solved them and we really liked it. The experiment was successful.
11:21
We moved nearly 1% of our metadata, two terabytes, and with a good workload: 40,000 requests per second. After that, we decided to make a prototype of the whole metadata
11:40
with folders, labels, letters, and so on and so forth. And we started also storing all our incoming letters to Postgres: we stored them to our object store, to Oracle, and asynchronously to Postgres. Actually, if Postgres broke, nothing would happen,
12:02
because all the workload was served from the Oracle databases. And it let us make some initial decisions on our schema. We didn't just copy it; we changed it, because the Oracle schema was not ideal, since it had been changing for many years.
12:25
The initial schema and decisions were really important for our abstraction library, because it is really difficult to design it without understanding the schema. We also did some load testing with our workload. Unfortunately, the Oracle license prohibits publishing results that compare Oracle with other databases.
12:43
I could maybe tell you something in private after the talk. It let us choose the hardware that we would use for our Postgres databases, and of course we got a lot of new experience with Postgres. And then we started the main work.
13:04
It took us a lot of time to rewrite all our backends to use Oracle and Postgres simultaneously. Actually, we understood that we couldn't migrate 300 terabytes in one night, for example. So we decided that we would have
13:21
some users that would live in Oracle databases and some users that would live in Postgres databases and a controlled process of migrating them. In June of 2015, we started dogfooding. We moved our inboxes from Oracle to Postgres
13:41
and it really accelerated development. We fixed lots of bugs and started implementing many new features, because we really needed them ourselves. Sometime later, we started the migration of inactive users. They live in the SATA databases, and all the activity they have
14:02
is receiving new letters. They don't actually come to read them or anything else. We didn't have all the features completed by that time, but we started migrating them to Postgres to fix bugs in our transfer code. And of course, we had a rescue plan
14:21
in case, for example, we migrated some user to Postgres and then he started using his inbox: we had a reverse transfer, so we could transfer him back to Oracle. We built it even before we built the main transfer. And then, of course, we did the migration.
14:42
Some cool things about it. It took us ten man-years to rewrite all our code. That is a huge amount of time, actually. We had thought it would take nearly half as much, but there
15:01
were some problems with that. But after the long rewriting of our code, we had a really fast migration process. These graphs show how much workload is served from Postgres databases overall.
15:22
It is divided per service: for example, how much IMAP workload is served from Postgres databases. And this graph covers only three months, so the main migration process took us nearly three months. We expected lots of problems, and there weren't any.
15:41
That was a really nice surprise for us. Right now Postgres databases serve more than 95% of our workload, but not all of it, because new users are still registered in Oracle. They are registered in Oracle and then transferred to Postgres.
16:01
While I am giving this talk here, the guys in Moscow are launching registration directly to Postgres, and after we do that, we expect to migrate the remaining users, to 100%, by the 1st of July. So what are the main changes that we made?
16:20
First of all, we brought order to our abstraction library, and now it has two implementations: one to work with Oracle and one to work with Postgres, macs_pg. At the bottom there is, of course, libpq, and we also have a header-only library implementing a connection pool
16:41
and timeouts. Sharding and fault tolerance: fault tolerance was hidden inside the Oracle driver, and the Postgres driver can't do that. So what we do now: after the backend exchanges the user information for a shard name,
17:01
if the shard name is the magic word 'postgres', it goes to another service, also an internal service of ours, called Sharpei. The backend gives the user ID and the preferred mode. The mode could be 'give me the primary host', or 'give me
17:21
the standby with minimum replication lag', or 'give me the nearest host', or something else. Sharpei gives back connection strings, or one connection string, and the backend opens a connection to the appropriate database.
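The resolution flow just described can be sketched roughly as follows. The service names (Blackbox, Sharpei) follow the talk, but the API shapes, the data, and the connection-string format here are all invented for illustration:

```python
# Illustrative sketch only: Blackbox and Sharpei are the services named in
# the talk, but these dictionaries and function signatures are invented.

# Blackbox: the authentication service, maps a login to (uid, shard name).
BLACKBOX = {"user@example.com": (42, "postgres")}

# Sharpei: returns connection strings for the hosts serving a user's shard.
SHARPEI = {
    42: {
        "primary": "host=pg1 port=5432",
        "standbys": ["host=pg2 port=5432", "host=pg3 port=5432"],
    }
}

def resolve_connection(login, mode="primary"):
    """Resolve a login to a connection string, as a backend might."""
    uid, shard = BLACKBOX[login]
    if shard != "postgres":        # the magic word: anything else is Oracle
        raise RuntimeError("user still lives in Oracle")
    hosts = SHARPEI[uid]
    if mode == "primary":
        return hosts["primary"]
    # the real modes include "min replication lag" and "nearest host";
    # here we just pick the first standby
    return hosts["standbys"][0]

print(resolve_connection("user@example.com"))                   # the primary
print(resolve_connection("user@example.com", mode="standby"))   # a standby
```

The point of the design is that the backend asks for a *mode* rather than a specific host, so fault tolerance and lag awareness live in the shard-locator service instead of in the driver.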
17:44
We changed our hardware in several ways. For example, for the Oracle databases we used to scale vertically, because Oracle is licensed by CPU cores: we used to attach lots of SSD disks, lots of network
18:01
bandwidth, and so on, to one host and optimize the workload so that CPU consumption was minimal. That was vertical scalability. And we used to have one standby in each shard, one primary and one standby, for fault tolerance, because another
18:22
standby would be too expensive. In Postgres we started having smaller, not so big, databases, and we started having two standbys and serving some of the read-only workload from our standbys. This approach was so successful that at some point we even started doing the same with Oracle:
18:43
our Oracle databases, which will be gone in two months, now also have two standbys. Another change in the hardware was introducing so-called hot databases. We counted that
19:01
2% of our users generate nearly half of our workload. They're really hot: they receive lots of letters, and they do lots of operations through IMAP, for example, or mobile applications. And we started having separate databases for them,
19:20
with hardware optimized for them. And of course, we built automation to move users between different shard types. For example, when you register and start using the mail service, you are put into a warm shard, and if you stop using your mail for three months, you are moved to a SATA database.
19:41
That is done automatically. If you live in a SATA database and start using your mail again, you are migrated to an SSD database within one hour. And one more thing we really want to do is migration of a single user's old letters from SSD to SATA.
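The shard-type automation described here could be sketched like this. The three-month inactivity window comes from the talk; the "hot" cutoff, the shard names, and the function itself are hypothetical:

```python
from datetime import datetime, timedelta

# The 90-day inactivity window follows the talk ("stop using your mail in
# three months"); HOT_DAILY_OPS is an invented cutoff for the heaviest ~2%.
INACTIVITY_LIMIT = timedelta(days=90)
HOT_DAILY_OPS = 10_000

def choose_shard_type(last_login, daily_operations, now):
    """Pick a shard type for a user: cold (SATA), hot, or warm (SSD)."""
    if now - last_login > INACTIVITY_LIMIT:
        return "cold"    # SATA disks for users who stopped coming
    if daily_operations >= HOT_DAILY_OPS:
        return "hot"     # hardware tuned for heavy IMAP/mobile users
    return "warm"        # the default SSD shard

now = datetime(2016, 5, 20)
assert choose_shard_type(datetime(2016, 1, 1), 5, now) == "cold"
assert choose_shard_type(datetime(2016, 5, 19), 50_000, now) == "hot"
assert choose_shard_type(datetime(2016, 5, 19), 5, now) == "warm"
```

In the real system this decision runs periodically and triggers an automated transfer between shards, rather than being computed at request time.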
20:01
Right now, if you are an active user, all your letters are stored on SSD disks. That's not really efficient, because you probably don't access your old letters. And for this to be done, we need normal partitioning in Postgres. We also changed our identifiers.
20:21
In Oracle they were globally unique: we had different sequences for different shards, and these sequence ranges were stored in a special database. And we had a fuck-up when two Oracle shards were using the same sequences for
20:41
some time, and we spent a lot of time fixing that. Also, all identifiers were quite big. In Postgres we changed the approach, and our identifiers are now not globally unique; they are unique inside a particular account.
21:00
For example, the message ID was globally unique, and now what is unique is the combination of user ID and message ID. This gives us some space savings, and there is also an obvious win here:
21:22
we have less contention on a single index page, because when you take your identifiers from a sequence, all of them are inserted into the last page of the index, the same page of the index. To solve this problem in Oracle, we used to have reverse key indexes;
21:42
in Postgres we can have normal B-trees with normal range scans. Next, we added revisions to all our objects. For example, when you mark some letter as read, we also increment the revision of the folder the letter is in.
22:00
This lets us query only up-to-date data from our standbys. You first make a query to the primary: 'give me the revisions of all folders'. Then, when you get data about a particular folder from a standby, you can compare the revisions to decide
22:22
whether this data is current or you should re-read it from the primary. It also made it easier for us to implement incremental diffs for IMAP and mobile applications. Mobile applications usually come to a server with a query like 'give me what has changed since yesterday',
22:42
or, for example, 'give me what has changed in such a folder since such a revision'. Revisions made that much easier. We also denormalized some data, because Postgres has good support for arrays, and we are heavily using composite types.
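The revision-based staleness check just described can be sketched as follows; the dictionaries stand in for real primary and standby connections, and all names are invented for illustration:

```python
# Sketch of the revision check: a cheap revision query goes to the primary,
# the bulk read goes to a standby, and the primary wins only when the
# standby's revision shows the data is stale.

PRIMARY = {"folders": {"inbox": {"revision": 7, "unread": 3}}}
STANDBY = {"folders": {"inbox": {"revision": 6, "unread": 4}}}  # lagging

def read_folder(name):
    """Read folder contents from a standby, falling back to the primary
    when the revision shows the standby's copy is stale."""
    fresh_rev = PRIMARY["folders"][name]["revision"]  # cheap primary query
    data = STANDBY["folders"][name]
    if data["revision"] == fresh_rev:
        return data                      # standby data is current
    return PRIMARY["folders"][name]      # stale: re-read from the primary

assert read_folder("inbox")["unread"] == 3   # standby lagged, primary wins
```

The win is that the expensive reads usually land on standbys, while the primary only answers tiny revision lookups.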
23:01
Here is one of the examples. This is our main table, mailbox, which contains one row per letter. As I already said, the globally unique combination is user ID and message ID, and this table also has a field with labels.
23:22
You can put several labels on one letter, so that's an array. And to answer the question 'give me all letters with the following label', we have a special GIN index, and it's not just a GIN index, it's a functional GIN index. So it's
23:40
quite tricky, but it works, if you turn off fastupdate, of course. In Oracle we couldn't afford such a thing: we had a separate table with normalized labels, we had lots of stored logic for it in Oracle, and we had lots of pain with it.
24:02
And when we started our experiments, we were pretty sure that we would have no stored logic in our Postgres database. But PL/pgSQL is awesome; it is much cooler in many ways than Oracle PL/SQL, and we still have some stored logic
24:21
in our database, though much less than we used to have in Oracle. We have logic for ensuring data consistency: for example, when you store a letter, you increment the counters for the appropriate folder, or something else. This is implemented inside the database,
24:42
not in the code of our backends. We also greatly increased the test coverage of our stored logic, because in Oracle there is undo, and when you have logically corrupted data,
25:01
you can restore it from undo; in Postgres you can't, so the cost of failure is higher, and we are now writing more tests. The main win here is easier deployment, since there are no library cache locks. We also changed our maintenance approach,
25:23
because we now have many more databases than we used to. We use SaltStack for managing everything on our databases: configs, packages, and so on and so forth. And at any point in time,
25:41
we can see a detailed difference between what should be on a host and what is actually on the host. That's really convenient: if you change something manually, we can see it in our monitoring solution. All schema and code changes are now deployed to the databases through migrations; we use a tool similar to Flyway.
26:03
It is an open-source solution, and almost all the migrations are done transactionally. We automated all our common tasks: for example, switchovers, failovers,
26:21
re-indexing, rebuilding bloated indexes, and so on and so forth. And since we have lots of automation now, we can get a new testing environment in one click. That's really convenient for our developers. So what problems did we have?
26:42
We had lots of problems during our experiments. These are links to community threads for the problems that we couldn't solve ourselves; of course, there were many more problems that we did solve ourselves.
27:00
We even used to joke that when you don't know what's happening, autovacuum is to blame, because usually when you have some performance problem with your database, you run top, and at the top of the resource consumption you see autovacuum processes.
27:20
Actually, we had some problems with it, but we solved them. We had a big problem with understanding what is happening inside our database. The guys from PostgresPro made a solution for that; it is called pg_stat_wait. You can read about it at the first link.
27:40
And with great effort from the EnterpriseDB guys, this became part of 9.6. Actually, a small part of it became part of core Postgres, but I hope it will be developed further in the future; we were discussing it here yesterday. This is not the only tool we
28:02
use for diagnostics. I once gave a talk about it, but unfortunately it's in Russian. You could use Google, sorry, Yandex Translate for that, if you are interested in it. We also had a problem with our backups.
28:22
We want the ability to restore to any point in time within the last seven days, and in Oracle, storing all the backups and archive logs for seven days took nearly the same space as the database itself.
28:41
For example, if you have a 15-terabyte database, backups for one week take 15 terabytes. In Postgres it is at minimum five times more, so for storing backups for our 300 terabytes we needed 2 petabytes. 2 petabytes just for
29:02
backups; of course we didn't like that, and we made a patch for Barman. Actually, the guys from 2ndQuadrant are not really responsive to us; I hope to talk to them here. This patch implements parallelism, compression, and page-level increments, and that led us
29:21
to nearly the same numbers as before. For example, right now for a 5-terabyte database we have nearly 4-5 terabytes of backups for a week. This is thanks to compression and page-level increments.
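The rough arithmetic behind those numbers can be sketched like this. The 5-terabyte database size and the resulting ~4.9 TB come from the talk; the compression and increment ratios are illustrative guesses chosen to reproduce that outcome:

```python
# Illustrative backup-size arithmetic; the ratios below are invented,
# only the database size and the final ballpark follow the talk.

db_tb = 5.0                 # database size, terabytes

# naive scheme: one full, uncompressed base backup per day for a week
naive = 7 * db_tb           # 35 TB for one week

# patched Barman: one compressed weekly base backup plus six daily
# compressed page-level increments
full = db_tb * 0.5          # assumed ~2x compression on the base backup
daily_delta = db_tb * 0.08  # assumed size of one compressed increment
patched = full + 6 * daily_delta

print(f"naive: {naive:.1f} TB, compressed + incremental: {patched:.1f} TB")
```

With these assumed ratios the week of backups costs roughly the database size (about 4.9 TB) instead of seven times it, which matches the shape of the improvement described.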
29:42
As I already said, we expected to have lots of problems during the migration process, but there weren't; that was a nice surprise for us. There weren't any problems related to Postgres. We had lots of problems with Linux, RAID, hardware, and so on and so forth. And of course, we had lots of problems
30:02
with our data, because over many years of developing this service you accumulate lots of legacy in your data. And of course we had bugs in the transfer code, and we fixed them. We also have a list
30:22
of things that we really want to see in core Postgres to make our life easier. We want normal partitioning, for moving a particular user's old letters from SSD to SATA. We really want a good recovery manager, implemented not in Barman
30:40
or anywhere else, but in core Postgres, because some of the features, like partial online recovery, can't be implemented outside of Postgres. We want the wait interface, the monitoring of waits, to be developed further in the future; that was discussed yesterday. Partitioning was also
31:02
discussed yesterday, and that's really good. We want the ability to give huge shared buffers to Postgres, with O_DIRECT and asynchronous I/O, because on I/O-bound workloads Postgres is actually not as good as Oracle.
31:20
We also want quorum commit and many other things that don't fit here, but these are the main ones, actually, and many of them were discussed yesterday and are planned for the future, and that's really good. So, for the conclusion, what do we have right now? We moved nearly one petabyte, counting standbys,
31:41
replicas, and indexes; without them it is nearly 300 terabytes, with a workload of nearly 250,000 requests per second. It took us three calendar years and more than ten man-years to do it.
32:00
But right now we have faster deployment, and although we have three times more databases, we have the same number of DBAs managing them. We also did a huge refactoring during the migration process. We haven't had major
32:21
problems yet, and that's cool. Postgres became one of the open-source products that are heavily used in our production. We use lots of open-source technologies, products and libraries, but the products that are used
32:41
nearly everywhere in our production were Linux and nginx, and right now it is also Postgres, and I suppose that's a success. I would be glad to answer your questions.
33:08
we have right now nearly 90 shards
33:25
One more thing: I remember that I need to repeat the question, because you couldn't hear it. The question was how much of these ten man-years went into rewriting the PL/SQL. That was nearly two weeks.
33:40
That was a really fast process. Yes, that was the initial version; then we started fixing bugs, writing tests and adapting our backends. I can't divide the time between these processes, but the initial implementation took nearly two weeks.
34:01
It was really fast; we didn't expect it. The question is how we provide consistency
34:21
during the migration. Actually, when we start the migration, we open a transaction to the Oracle database and we lock the inbox of the particular user for the period of the migration. His inbox is read-only; the user doesn't receive any letters
34:40
and can't change anything. But for most users, who don't have more than 50,000 or maybe 100,000 letters, this process takes several seconds, less than ten. And being read-only for ten seconds
35:00
is not such a big deal. There are large users with more than 20 million letters, but they just have to suffer. Actually, as DBAs, we don't really like users that generate a lot of
35:20
workload. Usually these are users that receive lots of new letters and never read them; they never use these letters. But we are a free unlimited service, and that's what we provide. So while we are migrating fat
35:41
users with lots of letters, it can take several minutes, and of course the user is read-only during the migration. That is the easiest way to provide consistency. Actually, our migration tool is nearly 5,000
36:00
lines of code. It is already too complicated, and implementing consistency without going read-only would be really difficult.
36:21
Well, actually, the metadata in the database contains the storage ID, the link to our object store where the letter itself is stored. The letters stay in our object store; they are not moved during the migration. We only move the metadata.
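The flow described in these answers can be sketched roughly like this. Every class and method name below is a hypothetical stand-in (the real tool is nearly 5,000 lines); only the ordering, lock the inbox read-only, copy the metadata, flip the routing, unlock, comes from the talk:

```python
class FakeSource:
    """Stand-in for the Oracle side; this interface is invented."""
    def __init__(self, mail):
        self.mail = mail              # user_id -> list of metadata rows
        self.locked = set()           # users whose inbox is read-only
    def lock_inbox(self, uid):
        self.locked.add(uid)          # no new letters, no flag changes
    def fetch_metadata(self, uid):
        return list(self.mail[uid])   # envelopes, folders, labels, ...
    def unlock_inbox(self, uid):
        self.locked.discard(uid)

class FakeShard:
    """Stand-in for a destination Postgres shard."""
    def __init__(self, shard_id):
        self.shard_id = shard_id
        self.rows = []
    def copy_in(self, rows):
        self.rows.extend(rows)        # a real tool would bulk-load via COPY

def migrate_user(uid, src, dst, shard_map):
    """Move one user's metadata; the user is read-only while this runs.

    Letter bodies stay in the object store; only metadata rows move.
    """
    src.lock_inbox(uid)
    try:
        dst.copy_in(src.fetch_metadata(uid))
        shard_map[uid] = dst.shard_id  # new requests now route to Postgres
    finally:
        src.unlock_inbox(uid)          # on failure the user stays on Oracle
```

Note how the routing flip happens only after the copy succeeds, so a failed migration leaves the user served from Oracle as before.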
36:48
Again, the question is: what is the difference between hot databases and warm databases? They have fewer SSD disks
37:00
per CPU core, because the basic idea here is to find an optimal balance between CPU usage and I/O usage; these are the main resources that we consume on our databases. In warm databases
37:20
we can attach more disks to the same CPU count, and that makes them cheaper than hot databases.
37:45
The question was when we started paying less money. Actually, I suppose, when we did some load testing and we understood
38:01
that we would need more hardware for the Postgres databases, we were happy, because the additional hardware we needed was much cheaper than the price of Oracle. It took us ten man-years to
38:21
rewrite everything, and then we started the migration. Actually, when we migrated the first database, which took us nearly one week, we started saving money, and then we just went database by database, migrating all users
38:41
from one Oracle database to several Postgres databases. Did I answer the question? Okay, sorry.
39:01
Ah, so the question is how much money did we save? Actually, I can't answer your question in strict numbers;
39:21
it was millions of dollars, which is quite a lot of money. It's also too complicated to count, because we have Oracle in other products of ours, mainly internal services, and we have, I don't know how to say it in English,
39:43
sorry, we don't pay the full price of the license. Sorry? Yeah, we have a good discount, because we have Oracle in other services, but
40:01
even with the maximum discount that Oracle can give, 70%, we saved millions of dollars.
40:22
We have a DBA team; actually, I'm the head of the team. That's five people, and these 90 shards are not the only databases we have; we have databases for our other services. We now really love Postgres, but we have
40:40
many other technologies for storing our data. So actually we now have fewer people managing more databases, which is even nicer, I suppose. We don't use any of them.
41:12
Well, actually, there was a problem when we started
41:21
getting segfaults in Postgres after upgrading to 9.4.1. Some of our databases were segfaulting randomly, and I reported it to pgsql-bugs, and Tom Lane fixed the bug in 38 minutes. That was nearly
41:41
100 times faster than Oracle does, although Oracle has support which costs lots of money. Tom Lane fixed it much faster, and many thanks to him and many others.
42:04
Our shards are dedicated and they don't have any relationship to each other. Each shard is one primary and two standbys, one of them synchronous, and one shard doesn't know anything about the other shards.
42:20
There is a sharding service called Sharpei that knows about all the shards, but one shard doesn't know anything about the others.
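A rough sketch of that lookup path, assuming a dict-based registry. Only the service name Sharpei and the one-primary-two-standbys layout come from the talk; every host name and function here is invented:

```python
# Shards know nothing about each other; only the sharding service
# (Sharpei in the talk) maps users to shards. All data is hypothetical.
SHARDS = {
    "shard01": {"primary": "pg01a", "standbys": ["pg01b", "pg01c"],
                "sync_standby": "pg01b"},
    "shard02": {"primary": "pg02a", "standbys": ["pg02b", "pg02c"],
                "sync_standby": "pg02b"},
}

USER_TO_SHARD = {"alice": "shard01", "bob": "shard02"}

def resolve(user, for_write=True):
    """Return the host the application should connect to for this user."""
    shard = SHARDS[USER_TO_SHARD[user]]
    # Writes always go to the primary; reads may be served by a standby.
    return shard["primary"] if for_write else shard["standbys"][0]
```

Because the mapping lives in one central service, moving a user between shards (as during the migration) only requires updating this registry.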
42:40
Because all current solutions don't really work under a huge workload. I suppose that will happen in the near future; I hope it will, but right now it doesn't.
43:01
Schema changes are made shard by shard, so we deploy new code or some DDL on one shard, then on the next, and so on and so forth. And of course our applications support working with both the old schema and the new schema,
43:21
because it is impossible to do that all at once in a high-load production. Yes, sorry.
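One generic way to keep an application compatible with both schemas during such a shard-by-shard rollout is to normalize rows with a fallback for a column that only the already-migrated shards have. This is a sketch with assumed column names, not Yandex's actual code:

```python
def read_message(row: dict) -> dict:
    """Normalize a row that may come from the old or the new schema.

    During a rolling DDL deploy, some shards already have the (assumed)
    new column 'thread_id' and some do not, so the application must
    tolerate both shapes of the row.
    """
    return {
        "mid": row["mid"],
        "subject": row["subject"],
        # Fall back to a sentinel on shards still running the old schema.
        "thread_id": row.get("thread_id", None),
    }
```

Once every shard has the new column, the fallback branch becomes dead code and can be removed in a later deploy.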
43:41
Yes, we use a self-written tool we call pgmigrate, and the principle of operation is really close to Flyway. You have a separate table in your database called schema_version,
44:01
and all the changes you make are done through migrations. So, for example, if you want to add a column to a table, you create a new migration, and when pgmigrate is launched it goes to this table and sees that
44:21
the last applied migration was number 9, while on the file system there is migration number 10, so it applies it and adds the column to the table. And it is heavily integrated with SaltStack, our configuration-management system, so that deploying new code is one click.
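The scheme just described, a schema version table plus numbered migrations on disk, applying everything above the recorded version, can be sketched like this. The in-memory dict stands in for a real database and the migration bodies are purely illustrative:

```python
def apply_pending(db, migrations):
    """Apply every migration whose number exceeds db['schema_version'].

    'db' is a dict standing in for the database; 'migrations' maps a
    migration number to a callable (a real tool would read SQL files
    and run each one in a transaction).
    """
    current = db.get("schema_version", 0)
    for number in sorted(migrations):
        if number > current:
            migrations[number](db)         # e.g. ALTER TABLE ... ADD COLUMN
            db["schema_version"] = number  # record progress after each step

# Example: migration 10 adds a column to a fake table definition.
db = {"schema_version": 9, "columns": ["mid", "subject"]}
migrations = {
    10: lambda d: d["columns"].append("thread_id"),
}
apply_pending(db, migrations)
```

Running the tool again is a no-op, since no migration number exceeds the recorded version; that idempotence is what makes one-click deploys safe.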
44:41
I suppose we have run out of time. I will be here today and tomorrow, and you can ask me any questions in private. Thank you.