
Non volatile Memory Logging

Formal Metadata

Title
Non volatile Memory Logging
Title of Series
Number of Parts
34
Author
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
Emerging byte-addressable non-volatile memories (NVMs) are revolutionary in that data written by a CPU store operation is non-volatile. This talk presents the architecture of a prototype and performance evaluation results that demonstrate the potential of NVMs when they are used for the WAL buffer in PostgreSQL: a transaction becomes durable promptly after its WAL records are written to the WAL buffer. There is no need to wait until the WAL records are written to the storage device. In a nutshell, by using NVM for the WAL buffer we can exploit the performance of asynchronous commit without impairing transaction durability. Although the idea is simple, its implementation is not. This talk also covers the difficulties of implementing an NVM WAL buffer and how to address them. Finally, I would like to share some knowledge that I obtained through the implementation and examination of the prototype.
Transcript: English (auto-generated)
to solve the problems in the prototype. Then I'll show some evaluation results for performance and durability, and share some related technical trends before concluding this talk.
Let's start with the introduction. This figure briefly shows the basic flow of write data in a database management system. Table and index data are written to the storage device asynchronously, while transaction log (XLOG) records are written synchronously.
Thanks to this synchronous write, transactions in the DBMS are guaranteed to be durable. However, a synchronous write operation takes a long time, which introduces overhead in the transaction processing done by the worker processes, especially in write-heavy transactions.
Now, let's look at the difference between synchronous and asynchronous commit.
This time chart shows transaction processing and the WAL write operation. Transaction processing modifies the DB contents and is carried out mainly by the CPU.
The WAL write, on the other hand, is an I/O operation. The important point is that the transaction becomes durable only after the WAL write is completed.
In the synchronous commit case, the notification of a transaction commit is issued after the transaction's durability is guaranteed, but this is achieved at the cost of a slow response.
On the other hand, when you use asynchronous commit in your DBMS, the notification of a transaction commit is issued to the client before the transaction becomes durable. Therefore, asynchronous commit carries a risk of losing committed transactions.
In a nutshell, the performance improvement from asynchronous commit is achieved at the risk of transaction loss.
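As a rough illustration of the trade-off just described, here is a minimal Python sketch. It is a toy model and not PostgreSQL code: the class `WalModel`, the helper `lost_after_crash`, and the `background_flush_at` parameter are hypothetical names introduced only for this example. The volatile WAL buffer is a list that is emptied by a crash, and the only difference between the two modes is whether the client is notified before or after the flush to storage.

```python
from dataclasses import dataclass, field


@dataclass
class WalModel:
    """Toy model: a volatile WAL buffer in front of a durable WAL file."""
    synchronous: bool
    buffer: list = field(default_factory=list)        # lost on a crash
    storage: list = field(default_factory=list)       # survives a crash
    acknowledged: list = field(default_factory=list)  # commits reported to clients

    def commit(self, txid):
        self.buffer.append(txid)
        if self.synchronous:
            self.flush()               # wait for the WAL write before replying
        self.acknowledged.append(txid)  # notify the client

    def flush(self):
        self.storage.extend(self.buffer)
        self.buffer.clear()

    def crash_and_recover(self):
        self.buffer.clear()            # volatile contents vanish
        return list(self.storage)


def lost_after_crash(synchronous, txids, background_flush_at=None):
    """Return the acknowledged transactions that recovery cannot find."""
    wal = WalModel(synchronous)
    for i, txid in enumerate(txids):
        wal.commit(txid)
        if i == background_flush_at:   # hypothetical background WAL flush
            wal.flush()
    recovered = wal.crash_and_recover()
    return [t for t in wal.acknowledged if t not in recovered]
```

In this model a synchronous run never loses an acknowledged commit, while an asynchronous run loses whatever was acknowledged after the last flush. NVM logging, as presented in the talk, aims to make the buffer itself survive, so the early notification becomes safe.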
These graphs show the difference in performance between synchronous and asynchronous commit, measured with pgbench. As can be seen, the difference is large when the disk drive cache is off, and it is still visible even when the disk drive cache is on. The experimental setup will be shown later.
The fundamental idea of NVM logging is derived from one question: why is a synchronous write necessary for the WAL write? The reason is that the memory for the WAL buffer is volatile.
This is because DRAM is usually used for this memory. It has been common sense that main memory, usually implemented with DRAM, is volatile. Given this, the fundamental idea of NVM logging is very simple:
a synchronous write of WAL is no longer necessary if the WAL buffer is on non-volatile memory. In other words, the key point of NVM logging is expanding the non-volatile world to include the WAL buffer.
Here, I would like to note an important characteristic of the NVM used for the WAL buffer. If we interpret the word NVM literally, hard disks and SSDs are also NVM.
However, they differ from the NVM used in NVM logging, which is byte-addressable NVM: data written by a CPU store instruction is durable. Usual storage devices, on the other hand, need an I/O operation issued by the operating system. From now on, I will use the word NVM to mean byte-addressable NVM.
As for byte-addressable NVM, roughly speaking, there are two ways of implementing it. One is a combination of existing technologies, such as DRAM, SSD, and backup batteries, the so-called NVDIMM. The other uses non-volatile memory cells, such as PCM, MRAM, FRAM, and memristor.
NVM logging does not depend on the type of NVM device, but it is rather well suited to NVDIMM, as NVDIMM has a relatively small capacity and is ready to use. Access time is also a strong point of this device, because in normal operation DRAM is used as the memory device; the SSD is used only at events such as a power failure.
Next, the problems to be solved. Again, the fundamental idea of NVM logging is simple, but implementing it is not as simple as it looks. A naive implementation of NVM logging would be allocating the WAL buffer in an NVM area and using asynchronous commit mode in the DBMS, but that is not sufficient, because there are problems such as partial writes, unreachable XLOG records, and CPU cache effects.
Before I detail the problems, let's review the necessary condition for recovery. In the recovery procedure, a transaction can be recovered if the recovery process reads all the XLOG records of the transaction correctly. That is, the transaction will be lost if the recovery process reads an incorrect XLOG record, or cannot find a necessary XLOG record.
The first problem is the partial write, in which the recovery process reads an incomplete XLOG record.
This occurs when the system crashes in the middle of writing an XLOG record. This figure shows the situation when the partial write problem occurs. The brown area is the part of the XLOG record that has been written, and the white area has not been written. If the system crashes in the middle of copying the XLOG record, this situation arises.
Next is the unreachable XLOG record problem. This picture shows the WAL buffer when this problem occurs. The problem is that the write of XLOG record 3 has finished while the write of XLOG record 2 has not begun. This problem occurs when the DBMS uses a pipelined XLOG flush mechanism.
If recovery is carried out using this WAL buffer, the XLOG reader cannot find XLOG record 3, because it cannot know the length of XLOG record 2.
The XLOG reader accesses XLOG record 1 and adds its length to its log sequence number to obtain the head of XLOG record 2. It then tries a similar calculation, but no length has been written there, so XLOG record 3 cannot be found. As the write of XLOG record 3 is completed, the commit of the corresponding transaction has possibly finished, resulting in a lost transaction.
The third problem is the CPU cache. As the CPU cache employs a write-back policy, data written by a store instruction of the CPU does not reach memory immediately. Of course, the same applies to NVM. That is, there is a risk of transaction loss if the system crashes after a transaction commit finishes and before its XLOG records reach NVM. In this case, the recovery process will read incorrect XLOG records.
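The unreachable-record problem described above can be reproduced with a small Python sketch. This is only a model under stated assumptions: the record layout (a 4-byte little-endian total length followed by the payload) and the helper names `put_record`, `read_records`, and `unreachable_demo` are hypothetical and much simpler than PostgreSQL's real XLOG record format.

```python
import struct

# Hypothetical header: nothing but a 4-byte total length (header + payload).
HEADER = struct.Struct("<I")


def put_record(buf, offset, payload):
    """Write one complete record at offset and return the offset after it."""
    total = HEADER.size + len(payload)
    HEADER.pack_into(buf, offset, total)
    buf[offset + HEADER.size: offset + total] = payload
    return offset + total


def read_records(buf, start=0):
    """Walk the buffer like a log reader: head + length gives the next head."""
    records, pos = [], start
    while pos + HEADER.size <= len(buf):
        (total,) = HEADER.unpack_from(buf, pos)
        if total == 0:          # no length written here, so the walk stops
            break
        records.append(bytes(buf[pos + HEADER.size: pos + total]))
        pos += total
    return records


def unreachable_demo():
    buf = bytearray(64)                      # zero-filled, like a fresh buffer
    head2 = put_record(buf, 0, b"rec1")      # record 1 is complete
    head3 = head2 + HEADER.size + len(b"rec2")  # space for record 2 is reserved only
    put_record(buf, head3, b"rec3")          # record 3 is complete
    return read_records(buf)
```

Calling `unreachable_demo()` yields only the first record: record 3 is physically present in the buffer, but the reader cannot step over the unwritten record 2 to reach it.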
Next, I present the implementation of the prototype.
This diagram illustrates the architecture of the prototype that implements NVM logging. The prototype uses two Linux kernel modules, this one and this one, written specifically for the prototype. The PLAM module reserves kernel memory, which is used as a pseudo-NVM. It is an NVM emulator in that it preserves its contents even if the processes that use the pseudo-NVM are aborted. However, the contents are lost on power loss, so this prototype does not handle power loss events, but it is sufficient for process aborts.
This slide shows how to prevent the partial write problem. Normally, XLOG records are copied into the WAL buffer from head to tail, but in this prototype everything other than the length field of an XLOG record is written first, and the length field is written to the buffer last. So if the system crashes in the middle of the XLOG copy, the length is zero, and the XLOG reader recognizes that there is no XLOG record. Thus the partial write problem does not occur.
This slide shows how to prevent unreachable XLOG records. The prototype prevents the unreachable XLOG record problem in the strictest way.
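Going back to the partial-write countermeasure, the "length field last" rule can be sketched in Python as follows. This is an illustration and not the prototype's code: the 4-byte length header, the function names, and the `crash_after` parameter (which simulates a crash after a given number of payload bytes) are assumptions made for the example. On real hardware the ordering between the payload stores and the length store would also need attention, which this model ignores.

```python
import struct

LEN = struct.Struct("<I")   # hypothetical 4-byte length field at the record head


def write_record_length_last(buf, offset, payload, crash_after=None):
    """Copy the payload first and store the length field last.

    crash_after: number of payload bytes copied before a simulated crash
    (None means the write runs to completion). Returns True if the record
    was published, False if the simulated crash happened first.
    """
    body = offset + LEN.size
    n = len(payload) if crash_after is None else min(crash_after, len(payload))
    buf[body: body + n] = payload[:n]
    if crash_after is not None:
        return False            # crashed before the length was written
    LEN.pack_into(buf, offset, LEN.size + len(payload))
    return True


def record_visible(buf, offset):
    """A reader treats a zero length as 'no record here'."""
    (total,) = LEN.unpack_from(buf, offset)
    return total != 0
```

With a zero-filled buffer, a write interrupted at any point leaves the length at zero, so the reader never sees a half-copied record; it sees either the whole record or nothing.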
To prevent unreachable records, when a worker process finishes writing the XLOG record of a commit, it checks whether all the previous XLOG records have been written. If those write operations are not finished, the worker process waits until they are. In this example, the worker process that carries out transaction 3 waits until the writes of the previous XLOG records have finished.
This wait control is implemented using an existing PostgreSQL function. Fortunately, PostgreSQL has a function that implements the necessary wait mechanism, so the prototype uses it for this wait control. This simplifies the implementation of the wait control very much.
As for the cache problem, the prototype uses write-combining mode. That is, the memory area assigned to the NVM is set to write-combining mode, which is a variation of the write-through mode of the CPU cache. With write-combining mode, PostgreSQL does not need to care about inconsistency between the cache and main memory.
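The wait control for commit records described above can be modelled with a few lines of Python. This is a sketch under assumptions, not the PostgreSQL function the prototype reuses: `InsertTracker` and its methods are hypothetical names, and instead of blocking the worker the model only answers whether a commit may be acknowledged yet.

```python
class InsertTracker:
    """Hypothetical bookkeeping of reserved WAL ranges still being copied."""

    def __init__(self):
        self.next_lsn = 0
        self.in_progress = {}          # start LSN -> end LSN of unfinished writes

    def reserve(self, size):
        """Reserve buffer space for a record and return its start LSN."""
        start = self.next_lsn
        self.next_lsn += size
        self.in_progress[start] = self.next_lsn
        return start

    def finish(self, start):
        """Mark the record that starts at `start` as completely written."""
        del self.in_progress[start]

    def safe_to_acknowledge(self, commit_start):
        """True when the commit record and every earlier record are written."""
        return all(s > commit_start for s in self.in_progress)
```

For example, if three workers reserve space in order and the first and third finish first, `safe_to_acknowledge` for the third stays false until the second one finishes. A real implementation would make the third worker sleep and be woken up rather than poll.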
This is a small thing, but a new GUC parameter is added to the prototype: the NVM file name. When this GUC parameter is set, the NVM logging mechanism is activated. When the parameter is not specified, the behavior is the same as that of the original PostgreSQL.
Recovery is performed as follows. First, the recovery process checks whether the data in the NVM is valid. If the NVM contains valid XLOG records, their begin and end LSNs (log sequence numbers) are identified. After this check, the recovery procedure is performed, in which XLOG records whose LSN is between the minimum and maximum LSN are read from the NVM instead of from the WAL files on the storage device. Other records are read from the WAL files on the storage device.
Other than the XLOG reader, the recovery process is exactly the same as in the original PostgreSQL.
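The source selection during recovery can be summarised by a short Python sketch. It is illustrative only: the function names are hypothetical, and treating the NVM range as half-open (begin included, end excluded) is an assumption, since the talk only says "between the minimum and maximum LSN".

```python
def choose_source(lsn, nvm_range):
    """Decide where the reader fetches the record at `lsn` from.

    nvm_range: (begin_lsn, end_lsn) of the valid records found in NVM,
    or None when the NVM holds nothing valid (or NVM logging is off).
    """
    if nvm_range is not None:
        begin, end = nvm_range
        if begin <= lsn < end:      # assumed half-open interval
            return "nvm"
    return "wal_file"


def recovery_plan(record_lsns, nvm_range):
    """Pair each record LSN with the place it would be read from."""
    return [(lsn, choose_source(lsn, nvm_range)) for lsn in record_lsns]
```

Everything after this choice, that is, the replay of each record, would be unchanged, which matches the statement that only the XLOG reader differs from the original recovery process.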
Another point, maybe also trivial: wrap-around control of the WAL buffer is necessary.
As shown in the previous slide, the NVM WAL buffer is managed as if it were a collection of multiple WAL segments. In this figure the number of WAL segments is four. This diagram illustrates the logical state of the WAL, where "evacuated up to" points to the current tail of the XLOG records that have been saved to the storage device.
"Current write point" indicates the tail of the valid log records, and "initialized up to" points to the tail of the initialized WAL buffer.
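The pointers on this slide, together with the rule the talk turns to next for keeping "initialized up to" and "evacuated up to" out of the same segment, can be sketched in Python. The names and the arithmetic are assumptions for illustration: the segment size is taken to be PostgreSQL's default of 16 MB, the number of segments is the four shown on the slide, and LSNs are treated as plain byte offsets.

```python
SEGMENT_SIZE = 16 * 1024 * 1024   # assumed: PostgreSQL's default WAL segment size
NUM_SEGMENTS = 4                  # as in the figure


def physical_slot(lsn):
    """Map a logical LSN to the physical segment slot in the ring."""
    return (lsn // SEGMENT_SIZE) % NUM_SEGMENTS


def can_initialize_next(initialized_upto, evacuated_upto):
    """May 'initialized up to' advance into the next segment?

    initialized_upto is assumed to sit on a segment boundary, so the next
    segment to initialize starts there. Reusing its physical slot is only
    safe once 'evacuated up to' has left the old segment held in that slot.
    """
    next_segment = initialized_upto // SEGMENT_SIZE
    evacuating_segment = evacuated_upto // SEGMENT_SIZE
    return next_segment - evacuating_segment < NUM_SEGMENTS
```

With four slots, logical segment 4 maps to the same physical slot as segment 0, so initializing it has to wait while records in segment 0 have not yet been saved to storage.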
The corresponding physical state is shown in the lower diagram. As the WAL buffer is used in a ring-buffer manner, a race arises between the "initialized up to" and "evacuated up to" pointers. The current prototype manages the race so that these two pointers do not point to the same WAL segment. That is, the advance of "initialized up to" into a new WAL segment has to wait until the "evacuated up to" pointer leaves that WAL segment. Otherwise, those records could not be read by the recovery process. From here,
I will show the evaluation results. In the experimental setup, the DB server has 16 cores and 64 gigabytes of memory, and its storage is a RAID 0 configuration of SSDs for data and of HDDs for the WAL area. The client has this configuration, and they are connected via a 1-gigabit Ethernet network. This graph shows the variation of throughput as a function of the number of clients. It is the graph shown previously, except that the results of the NVM logging case are added.
As can be seen, NVM logging delivers performance similar to the asynchronous commit case. In the cache-on case, NVM logging also delivers the same performance as asynchronous commit. These are the DBT-2 cases.
DBT-2 is an open source implementation of the TPC-C benchmark. In the disk-cache-on case, the performance of synchronous commit was closer to that of asynchronous commit and NVM logging.
The reason is that the response of write I/O is short when the write-back cache of the disk drive is used. Therefore, you may think NVM logging is not necessary when an SSD is used for WAL storage. However, NVM logging is also good for SSD lifetime because it reduces write amplification, as shown in this figure. When the synchronous commit strategy is used, an I/O operation is performed every time a transaction is committed. These multiple writes to one data block increase write amplification, which reduces the SSD lifetime.
On the other hand, when NVM logging is used, the I/O for one data block is performed only once. This is good for SSD lifetime.
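A back-of-the-envelope Python sketch makes the block-write argument concrete. The numbers and the function name are assumptions for illustration: an 8 kB WAL block, fixed-size records, one flush per commit in the synchronous case (no group commit), and exactly one write per block under NVM logging. It counts block writes issued to the device, not the amplification inside the SSD itself.

```python
def block_writes(num_commits, record_size, block_size):
    """Return (writes with synchronous commit, writes with NVM logging).

    Synchronous commit: every commit flushes the block(s) its record touches,
    so a partly filled block is rewritten by each later commit that lands in it.
    NVM logging: each block is written to storage once, when it is evacuated.
    """
    sync_writes = 0
    touched = set()
    pos = 0
    for _ in range(num_commits):
        first = pos // block_size
        last = (pos + record_size - 1) // block_size
        sync_writes += last - first + 1
        touched.update(range(first, last + 1))
        pos += record_size
    return sync_writes, len(touched)
```

For instance, with 1 kB records and 8 kB blocks, eight commits cause eight writes of the same block under synchronous commit, but a single block write under NVM logging.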
Durability has also been examined for NVM logging. As I could not find a test tool for the durability of a DBMS, I made one and used it. This slide illustrates the outline of the tool.
There are two tables. Each has key and value columns, and each client constantly updates the value column in its corresponding row.
In this situation, a fault is injected into the DBMS. More specifically, the PostgreSQL processes are intentionally killed and then restarted, at which point recovery is carried out.
The criterion for durability is whether the values in table 1 and table 2 are equal to the value that each client recorded as the result of its last transaction.
The results were just as we expected: committed transactions were recovered when PostgreSQL used synchronous commit or NVM logging, and were not recovered when asynchronous commit was used.
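The durability criterion of the test tool can be expressed as a small Python check. This is a hypothetical reconstruction, not the author's tool: the tables are modelled as dictionaries from key to value, each client is assumed to own the row whose key is its client id, and `last_committed` holds the value each client recorded for its last acknowledged transaction.

```python
def check_durability(table1, table2, last_committed):
    """Return the client ids whose last acknowledged update did not survive.

    table1, table2: {key: value} contents read back after recovery.
    last_committed: {client_id: value} recorded on the client side.
    """
    return sorted(
        client
        for client, value in last_committed.items()
        if table1.get(client) != value or table2.get(client) != value
    )
```

An empty result corresponds to the outcome reported for synchronous commit and NVM logging; a non-empty one corresponds to the outcome reported for asynchronous commit.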
Before concluding my talk, I would like to share recent technical trends.
NVM devices have been manufactured for a long time, and they were used for limited applications such as RAID systems. But recently, thanks to NVM standardization, NVM has extended its application area to DB servers. In fact, HPE has announced this server, which supports NVM.
This movement indicates that DB servers with NVDIMM are just around the corner. Programming support for NVM has also been developed.
An example is HPE IO, which provides access to NVM through the Linux operating system. Here, persistent memory is an alias of NVM.
In conclusion, NVM is becoming a commodity, and NVDIMM is already shipped as a product. DB servers have begun to be equipped with NVM. The benefits of NVM logging are a performance improvement that is almost the same as asynchronous commit, durability that is ensured as in synchronous commit, and write amplification reduction, which is good for SSD lifetime.
Future work includes bringing the source code to a state that is acceptable for the mainline.
To do that, we need to cope with a standard for NVM access; libpmem is a promising candidate, I think. We also need to check the correctness of operation using real NVM. This prototype uses a pseudo-NVM, so we have to use real NVM and check that transactions survive a power failure.
Does it sounds for this thing?
Pardon? libpmem is not directly related to XLOG logging, I think.
I think the libpmem logging mechanism is not used. It is only used to allocate an NVM area for PostgreSQL's WAL buffer, I think.
From the trend of logging, libpmem already has that.
Yes. I don't think PostgreSQL would use the libpmem logging mechanism. This implementation proposes only allocating an NVM area for PostgreSQL's WAL buffer, so it uses only that function of libpmem. It does not fully use libpmem's functions. Does that answer your question?
Yes, it is a very simple case of NVM utilization, I think, but because it is easy, we can implement it soon.
Other questions? Okay. Thanks.