Jim Gray SIGMOD Dissertation Award Winning Thesis: Data Management on Non-Volatile Memory
Formal Metadata
Number of Parts: 155
License: CC Attribution 3.0 Germany. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifier: 10.5446/42975 (DOI)
SIGMOD 2019, talk 108 of 155
Transcript: English(auto-generated)
00:03
Thank you so much. It is an enormous honor to receive this award. Jim Gray was a source of inspiration for a large part of the work in this thesis. So today, I'll be presenting a brief overview of the thesis and reflect on the lessons that we learned.
00:23
So this thesis focused on classic data management problems that emerge with non-volatile memory technologies. So this timeline shows the evolution of memory technology over the last several decades.
00:41
Right now, we are at an exciting point in this timeline. Device manufacturers have created a new non-volatile memory technology that can serve as both system memory and storage. Non-volatile memory invalidates several key design assumptions that have guided the design of database systems for the last several decades.
01:07
Database systems have always been designed for a two-tiered storage hierarchy with DRAM and disk. So we can do fast reads and writes to DRAM, but all data is lost in case of a power failure. In contrast, disk is durable, but only supports slow bulk data transfers as blocks.
01:28
So database systems employ complex protocols and data structures to work around the performance and durability trade-offs between these devices. Non-volatile memory, or NVM, blurs the gap between memory and storage.
01:45
So we can do fast reads and writes to NVM, similar to DRAM, but all writes to NVM are persistent, like with disk. So this table presents the high-level characteristics of NVM in comparison to DRAM and disk. So let's first look at the device latency.
02:01
So NVM is roughly 100 times faster than an SSD. It supports fine-grained access like DRAM. Unlike DRAM, it is durable and supports high capacity, and the cost of NVM lies between that of DRAM and SSD. So with this unique set of properties,
02:20
we believe that NVM is a game-changer for database systems. There have been several promising NVM-related developments over the last few years. First, JEDEC, the semiconductor standardization body, has published the design specifications for NVM technologies.
02:41
Next, recent versions of major operating systems like Linux and Windows now natively support NVM. Intel has released new assembly instructions in their latest Kaby Lake processors for managing data on NVM. And lastly, Intel started shipping NVM DIMMs just a few months back.
03:02
So all of these developments indicate the potential for widespread impact of non-volatile memory. So the key question is, are database systems ready for NVM? The last 50 years of database systems research have been grounded in a few fundamental design assumptions.
03:23
So we assume that memory is fundamentally different from storage, that compute is significantly slower than storage, and that random accesses are significantly slower than sequential accesses. NVM invalidates all of these key design assumptions that are deeply embedded in today's database systems.
03:43
So in this thesis, we revisit the data management problem with a special focus on NVM. This is challenging because NVM is fundamentally different. So we can't just take an off-the-shelf database system, tweak it a bit, and run it on NVM and expect it to work well.
04:03
Instead, it is important to design new protocols and data structures from first principles for NVM. So to resolve the shortcomings of today's database systems, we studied and built Peloton, the first fully featured database system for NVM, and our research
04:22
demonstrates that the impact of non-volatile memory cuts across all the layers of the database system, from how we do logging and recovery all the way up to how we process queries. So in this talk, I will give a brief overview of how we redesigned the logging and recovery protocol for non-volatile memory.
04:45
The logging and recovery protocol allows the database system to provide the durability and atomicity guarantees. These guarantees make it easier to design database applications. So let's look at an example to understand these guarantees. Let's say we are running a shopping application on top of a database system.
05:02
In this particular transaction, we are placing an order and updating the relevant item stocks. Now we want to ensure that after a user has placed an order, that data is safe, even if the system crashes due to a power failure. The database system provides this kind of fault tolerance by logging the information on durable storage.
05:25
So we refer to that as the durability guarantee. Next, we want to ensure that after the system comes back from a failure, either all or no changes made by a transaction are present in the database. So this is accomplished by removing the effects of uncommitted transactions during recovery.
05:43
So we refer to that as the atomicity guarantee. Most database systems provide the atomicity and durability guarantees using the classic write-ahead logging protocol. This protocol is designed for a two-tiered storage hierarchy with DRAM and disk. So the primary storage location of the database is on disk, and the database system buffers some
06:04
pages in memory, so when a transaction comes in, we cache the changes initially in this buffer pool. Now all the data on DRAM will be lost in case of a power failure, so we need to make sure that the changes reach disk. Since random writes to disk are 10 times slower than sequential writes, the database system first migrates those changes using
06:24
fast sequential writes to the log, and later propagates them to the database outside the critical path of transaction processing. Because we write the changes first to the log and then propagate them to the database, this protocol is referred to as write-ahead logging. So that takes care of the durability guarantee.
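The write-ahead discipline described above can be sketched in a few lines of Python. This is a toy illustration, not the talk's actual system: the names (`log`, `buffer_pool`, `database`, `checkpoint`) are assumptions, and in-memory lists and dicts stand in for real durable storage, pages, and fsync.

```python
log = []            # sequential log on durable storage
buffer_pool = {}    # volatile DRAM cache of pages
database = {}       # primary copy of the database on disk

def update(txn_id, key, value):
    # Step 1: append the change to the log FIRST (fast sequential write).
    log.append(("update", txn_id, key, value))
    # Step 2: cache the change in the buffer pool (volatile DRAM).
    buffer_pool[key] = value

def commit(txn_id):
    # The transaction is durable once its commit record is logged.
    log.append(("commit", txn_id))

def checkpoint():
    # Later, outside the critical path, propagate cached pages to disk.
    database.update(buffer_pool)

update(1, "stock:widget", 4)
commit(1)
checkpoint()
```

Note that the same logical change exists in three places here (buffer pool, log, database), which is exactly the data-duplication problem discussed below.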
06:43
What about atomicity? Let's say we have a bunch of users concurrently placing orders on our shopping application, and the system crashes. Now we want to ensure that the changes made by the uncommitted transactions are removed when we come back online. This is accomplished using a three-phase recovery protocol.
07:00
So we first analyze the log to figure out the transactions that are running at the time of failure. We then replay the log to repeat history so that all the changes make it to the database. And lastly, we remove the changes made by the uncommitted transactions, and this protocol restores the database to a transactionally consistent state. So now we can start handling new transactions.
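The three-phase recovery just described can be sketched as follows, assuming a toy log of `("update", txn, key, value)` and `("commit", txn)` tuples. The record format and the simplistic undo are illustrative only; a real system tracks prior page values and log sequence numbers.

```python
log = [
    ("update", 1, "a", 10),
    ("update", 2, "b", 20),
    ("commit", 1),
    # crash here: transaction 2 never committed
]

def recover(log):
    db = {}
    # Phase 1: analysis -- find which transactions committed before the crash.
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    # Phase 2: redo -- repeat history so every logged change reaches the database.
    for rec in log:
        if rec[0] == "update":
            _, txn, key, value = rec
            db[key] = value
    # Phase 3: undo -- remove the changes made by uncommitted transactions.
    for rec in log:
        if rec[0] == "update" and rec[1] not in committed:
            _, txn, key, value = rec
            if db.get(key) == value:
                del db[key]   # toy undo: drop the uncommitted write
    return db

db = recover(log)   # transactionally consistent state
```

Note that both the redo and undo phases scan the whole log, which is why recovery time grows linearly with the number of log records.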
07:23
So write-ahead logging works really well on a two-tiered storage hierarchy with DRAM and disk. But that's not the case with non-volatile memory. With non-volatile memory, write-ahead logging suffers from two major problems. First, there's a lot of data duplication happening here.
07:40
For a single logical page, we are making three physical copies, one in the buffer pool, another one in the log, and a third one in the database. All of this data duplication hurts the performance and increases the storage cost of the system. The second major problem with write-ahead logging is that recovery takes a lot of time because it is linearly dependent on the number of log records.
08:02
And this greatly reduces the availability of the system. So the key question is, how can we improve performance and availability on non-volatile memory? So I will next introduce the write-behind logging protocol, which helps solve these problems. So the write-behind protocol takes an NVM-centric design and improves the availability of the system by enabling instantaneous recovery from failures.
08:28
In particular, it exploits the ability of NVM to support fast random writes. And it provides the same guarantees as write-ahead logging. The key techniques in write-behind logging are twofold.
08:41
We directly propagate changes to the database during active transaction processing. And we only record metadata about the active transactions in the log. So I will next explain how this is sufficient to ensure the same guarantees as write-ahead logging. So with write-behind logging, we greatly simplify the architecture for NVM.
09:01
We no longer need a buffer pool because the database itself is on NVM. And unlike disk, NVM supports fast random writes, so we don't really need a log. So when you have a bunch of transactions, we can directly propagate the changes out to the database. And we just have to make sure that the changes actually reach NVM from the volatile processor caches.
09:21
And this can be done using the newly added instructions from Intel. So that takes care of the durability guarantee. What about atomicity? To ensure atomicity, the database system assigns a timestamp interval for each batch of transactions. And it records this timestamp interval metadata in the log before propagating the corresponding changes to the database.
09:43
By using this timestamp interval metadata, the database system can ignore the changes made by uncommitted transactions when it comes back online from a failure. So this solves both of the problems associated with write-ahead logging. First, there is no data duplication because we only record the data once in the database.
10:04
We only record metadata in the log, so that helps improve the performance and reduces the storage cost of the system. And it also enables instantaneous recovery from system failures, because there is no longer a need for a redo process: all the changes are propagated to the database during active transaction processing.
10:22
There is no need for a physical undo process because we can logically ignore the changes made by uncommitted transactions. So this constant-time recovery scheme greatly improves the availability of the system. So to summarize, write-behind logging improves the availability, increases the performance, and reduces the storage cost of the system.
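The write-behind idea above can be sketched in Python. Everything here is an assumption for illustration: the `("dirty"/"clean", lo, hi)` interval records, the per-write timestamps, and the dictionaries standing in for NVM. A real implementation would also flush each write from the processor caches to NVM (e.g. with Intel's CLWB instruction followed by a fence).

```python
nvm_db = {}     # the database itself lives on NVM; no buffer pool
nvm_log = []    # tiny log: only timestamp-interval metadata, no data

def run_batch(start_ts, end_ts, writes):
    # Record the interval metadata in the log BEFORE propagating changes.
    nvm_log.append(("dirty", start_ts, end_ts))
    for key, value, ts in writes:
        nvm_db[key] = (value, ts)   # direct write to the NVM database
    nvm_log.append(("clean", start_ts, end_ts))   # batch fully committed

def read(key):
    # Versions whose timestamp falls in an interval that was marked dirty
    # but never clean are logically ignored -- no redo, no physical undo,
    # so recovery after a crash is effectively instantaneous.
    dirty = ({rec[1:] for rec in nvm_log if rec[0] == "dirty"} -
             {rec[1:] for rec in nvm_log if rec[0] == "clean"})
    value, ts = nvm_db[key]
    if any(lo <= ts <= hi for lo, hi in dirty):
        return None   # uncommitted change: invisible
    return value

run_batch(1, 10, [("stock:widget", 4, 5)])
# Simulate a crash mid-batch: interval logged, write applied, no "clean".
nvm_log.append(("dirty", 11, 20))
nvm_db["order:42"] = ("placed", 15)
```

Note that each change exists exactly once, in the database itself, and restart needs no log replay: reads simply filter out versions from incomplete intervals.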
10:44
It advances the state of the art by shifting the complexity class of the recovery protocol on NVM. So reflecting back on this thesis and prior research on hardware management, we observed that most hardware-centric optimizations are system-specific.
11:04
So there is this one-to-one mapping between hardware-centric optimizations and software systems. For example, in this thesis, we optimized the Peloton database system for non-volatile memory. Another example would be how researchers at Google have optimized the TensorFlow machine learning system for tensor processing units.
11:25
I believe that this approach does not scale. As a community, I hope that we can work towards declarative hardware management for scaling out hardware-centric optimizations. So a declarative hardware manager could transform declarative data storage and data
11:44
processing requests from a wide range of systems onto concrete hardware-centric mechanisms. And this will help accelerate a wider range of data processing systems using a specific hardware-centric optimization. To conclude, the advent of non-volatile memory has begun a new era in systems research.
12:05
In this talk, I presented the design of an NVM-centric logging and recovery protocol. This protocol illustrates the importance of designing protocols from first principles for NVM. Although I focused only on database systems, the insights from this work are more broadly
12:23
applicable to all types of data processing systems, including file systems and machine learning systems. All of these systems are amenable to similar architectural changes to unlock high performance and availability on NVM. So this thesis was greatly influenced by several NVM researchers.
12:43
I would like to especially acknowledge researchers at TU Dresden, University of Toronto, TU Munich, and UCSD. This thesis was the result of a wonderful set of collaborations. First and foremost, I had a truly spectacular advisor in Andy Pavlo.
13:08
So when we started this project in 2013, it was a moonshot because we were not sure if NVM would ever see the light of day. It was Andy's vision, thoughtful guidance, and unceasing support that made this thesis possible.
13:23
I have been fortunate to have had the guidance of several remarkable mentors in Sam Madden, Donald Kossmann, and Jack. Their valuable feedback and constructive criticism helped shape this thesis. And lastly, I am greatly indebted to a number of phenomenal colleagues and collaborators at CMU, Microsoft Research, Intel, and Samsung Research.
13:49
So with that, thank you again for this really terrific award. I am incredibly humbled and honored to receive it. It is a pleasure to be a member of this community. Thank you so much.