We're sorry but this page doesn't work properly without JavaScript enabled. Please enable it to continue.
Feedback

SMT-Friendly Formalization of the Solidity Memory Model

00:00

Formal Metadata

Title
SMT-Friendly Formalization of the Solidity Memory Model
Title of Series
Number of Parts
13
Author
Contributors
License
CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
Solidity is the dominant programming language for Ethereum smart contracts. This paper presents a high-level formalization of the Solidity language with a focus on the memory model. The presented formalization covers all features of the language related to managing state and memory. In addition, the formalization we provide is effective: all but few features can be encoded in the quantifier-free fragment of standard SMT theories. This enables precise and efficient reasoning about the state of smart contracts written in Solidity. The formalization is implemented in the solc-verify verifier and we provide an extensive set of tests that covers the breadth of the required semantics. We also provide an evaluation on the test set that validates the semantics and shows the novelty of the approach compared to other Solidity-level contract analysis tools.
Keywords
Modal logicRead-only memorySolid geometryProgramming languageRead-only memoryModel theoryFormal grammarComputer animation
SpeichermodellPlastikkarteCodeComputing platformCommunications protocolAbstractionContext awarenessNeuroinformatikCodeData storage deviceMachine codeDesign by contractSingle-precision floating-point formatCommunications protocolComputer animation
Gamma functionPlastikkarteSpeichermodellBootingPermanentSet (mathematics)World Wide Web ConsortiumWitt algebraData storage deviceBoolean algebraState of matterStatistical dispersionParameter (computer programming)Functional programmingSolid geometryDesign by contractRead-only memoryRow (database)Contrast (vision)Local ringDatabase transactionComputer programmingFormal verificationPointer (computer programming)Social classJSONXML
SpeichermodellFunctional programmingProgramming languagePoint (geometry)Type theoryStatement (computer science)Regulärer Ausdruck <Textverarbeitung>Steady state (chemistry)Data storage deviceSemantics (computer science)Read-only memoryPointer (computer programming)Local ringFormal grammarProgramming languageRead-only memoryMemory managementStatistical dispersionElement (mathematics)Level (video gaming)Model theoryState of matterBytecodeDesign by contractVulnerability (computing)Pattern languageInstance (computer science)Local ringUniform resource locatorPointer (computer programming)Data storage deviceException handlingFunctional programmingIntegerCategory of beingForm (programming)Hydraulic jumpSoftware bugExpressionComputer programmingBlock (periodic table)Formal verificationGraph (mathematics)Term (mathematics)PlastikkarteSolid geometryStatement (computer science)Database transactionSemantics (computer science)BuildingContrast (vision)Computer animation
StrutStandard deviationModel theoryRead-only memorySpeichermodellIntegerPointer (computer programming)Interior (topology)Default (computer science)Read-only memoryMemory managementDatabase transactionFunction (mathematics)Recursive languageQuantificationType theoryDesign by contractFormal verificationCondition numberData storage deviceNetwork topologyRecursionLocal ringData modelElement (mathematics)Error messageBootingTexture mappingModel theorySemantics (computer science)State of matterPointer (computer programming)Formal grammarData structureArray data structureData storage deviceCASE <Informatik>Design by contractParameter (computer programming)Functional programmingMemory managementArithmetic progressionDifferent (Kate Ryan album)Conditional probabilityIntegerSubject indexingPrice indexNetwork topologyElement (mathematics)P (complexity)Solid geometryRead-only memoryState observerProgramming languageData typeInstance (computer science)ExpressionLocal ringQuantificationNumberLengthCondition numberFinitismusFlow separationField (computer science)Statistical dispersionCategory of beingCuboidDatabase transactionComputer animation
Context-sensitive languageMusical ensembleSpeichermodellCodeGenetic programmingModul <Datentyp>Read-only memoryModel theoryMemory managementData storage deviceCompilerFunction (mathematics)Statistical hypothesis testingPersonal digital assistantCategory of beingPointer (computer programming)WeightLatent heatSoftware testingDesign by contractRead-only memoryResultantCodierung <Programmierung>Proper mapLengthDifferent (Kate Ryan album)Solid geometrySuite (music)Symbol tableModel theoryLatent heatCompilerDefault (computer science)Statistical hypothesis testingMathematical analysisFormal verificationCategory of beingData storage deviceSemantics (computer science)Computer animation
Woven fabricData storage deviceSpeichermodellCodeFormal verificationMusical ensembleOrder of magnitudeArithmetic meanFormal verificationSystem on a chipBytecodeLevel (video gaming)ResultantRevision controlGreen's functionGraph coloringSolid geometryComputer animation
SpeichermodellRead-only memoryStandard deviationData storage devicePointer (computer programming)Local ringComputational physicsEvent horizonQuantificationSystem on a chipException handlingRead-only memoryEvent horizonBytecodeLevel (video gaming)Model theoryFormal grammarSolid geometryModul <Datentyp>Standard deviationContext awarenessPointer (computer programming)Category of beingComputer animation
Transcript: English(auto-generated)
So welcome, my name is Akos Haidu and I'm going to present an SMT-friendly formalization of the Solidity memory model, a paper originally published at ESOP 2020. Solidity is the most prominent programming language of the Ethereum blockchain, so it is a rather specialized language. Nevertheless, its memory model has some interesting aspects, so I hope that you can
take away some general ideas from this talk. So first, just let's start with a little bit of context. In blockchain-based distributed computing platforms, there is a bunch of nodes storing the same data and executing the same code. This is what they call smart contracts.
Now the main feature is that there is no distinguished central party that has to be trusted, but rather a consensus protocol ensures that they all share the same view as if it was a single world computer that stores data and executes code. The code is called a smart contract and it is mostly written in the Solidity programming language, and these contracts
are similar to classes at the first glance. They can define data types, for example a struct called record with a boolean and an array, and they can also define state variables. This is stored permanently on the blockchain. For example, here records is a mapping.
Contracts can also define functions which can be called as transactions. For example, append gets a record at a given address, sets its flag, and pushes the data to the end of the array. And as I already mentioned, state variables live in the permanent storage on the blockchain,
but in contrast parameters, return values, and local variables live in a transient memory that only exists locally when the individual nodes are executing the transaction. And in an internal scope it is also possible to define pointers to the permanent storage as
seen for example in the parameter of the internal isSet function. Formal verification has gained great interest in this field, mostly due to the financial consequences of bugs. There are tools that operate on the compiled bytecode of smart contracts which has various
formalizations, but these tools are usually limited to common vulnerability patterns. And on the other side, there are tools that operate on the Solidity level, so directly on the level of smart contracts. They can check high-level functional properties and they are usually based on SMT, and here a precise formalization is really required to ensure
that they follow the actual execution semantics. And basically we observed that the memory model lacks a detailed and effective formalization that could be used as a building block for automated verifiers. So based on the motivations presented before, we did our
formalization of the memory model in terms of simple SMT-based programs. We used SMT types, we allow variables to be declared, and we allow basic statements and expressions from SMT so that
these programs can be expressed in any modern SMT-based tool and can be checked by SMT solvers by simply translating it into an SSA form. So let's jump into the actual formalization. Let's start with an overview. As introduced, one of the locations where data can be stored
is the permanent storage on the blockchain. This is where the state variables are stored. It has pure value semantics, which means that there is no overlapping, there's no aliasing, and so on. For example, suppose that in a contract we have a struct T with an integer Z and the struct S with an integer X and an array of Ts. For example, if we have state variable T1,
it will have its own slot. If you have a state variable S1, that also has its own slot, including its own T instances. And the same holds for arrays, they all have their own slots recursively for the inner members. And the other location for data is the transient memory
that is used locally during the execution of the transactions. And this is where parameters, return values, and local variables are stored. And in contrast to the storage, it has pure reference semantics. For example, we can allocate a new S instance with the T array and T instances
inside, and they are all pointers. Then we can just point to, for example, the second element of the array, or we can also allocate a new S instance, which shares the array. So basically arbitrary aliasing graphs can be created in memory. Now, a nice property is
that there is no mixing in storage. There are no pointers and the memory you cannot store by value. There is one exception though, in a local internal context, it is possible to define pointers to storage. So for example, here TP is a pointer to one entity in a storage. And interestingly,
these pointers can be passed around in internal functions and they can also be reassigned to point somewhere else. So let's start the formalization with the memory for which we use a standard heap
model, assigning a separate heap per type, because Solidity is a strongly typed language. Pointers are simply SMT integers, structs are SMT data types with their fields inside, and arrays are data types with the actual SMT array inside and the length. So for example,
if we have this struct T and struct S from the previous example, then these data types can encoded with the following SMT data types. You can see, for example, that the T array is an
array from integers to integers, basically pointers. And then for each type, we have an actual heap to dereference the pointers. Now, interestingly, there are no null pointers in Solidity. If we allocate something, its members and elements are recursively allocated to
default values. So let's see an example, allocating a new S instance with a T array of size two. This will be a pointer, so an SMT integer. And first, the two T instances inside the array are allocated and stored on their respective heap. And we use an allocation
counter that is incremented for each allocation. Then the array is allocated in its heap with the two pointers from the previous steps. And finally, the S instance is also allocated with a pointer to the actual array from the previous step, so that in the end,
we get our pointer. Now, then accessing an element can be done in the opposite direction, basically accessing and dereferencing the heaps step by step. So for example, SM is heap S3.TA,
again with the heap, the first element and the member Z can be accessed this way. So a nice aspect of the memory model of Solidity compared to traditional languages is that its scope is limited to the execution of a single transaction. And this is what makes this heap
model tractable. We don't have to reason about allocating permanent and global stuff. Now, one thing that needs to be ensured though, is that new allocations should not overlap or should not alias with previous allocations. So for example, if we have this function F and we want to allocate something inside, then we know that this should not overlap
with the parameter SM. And this can be done by assuming that the parameter SM is less than the allocation counter and also recursively for its members like the array, and again,
recursively for each array element. So this requires quantifiers in the general case, but this is limited to a decidable fragment. This is work in progress. Now let's move on to the storage, which has pure value semantics. So we will not be using heaps here. For example, if we have this struct T and S as previously, they are encoded like
this. The main difference is that now nested structure arrays are not pointers, but the actual data types. For example, the T array now stores actually the instances
and not just pointers. Now, if we have these state variables, T, S, and an S array, then simply they become, they simply become variables of these data types. And a nice property of this encoding compared to using a heap-based model is that it ensures non-aliasing
and deep copy out of the box. And this is especially useful in modular verification, otherwise we would require many framing conditions. And we also know precisely what is being modified, which makes reasoning more effective. Now, the one big question remains,
how do we deal with these local storage pointers if we use these data types to encode the value semantics of storage? And the key observation here is that storage is basically a finite depth tree of values. For example, let's say we have this struct T with an integer Z
and struct S with an integer X and T, and also let's suppose we have a T array, and suppose that we have state variables T1, S1, and an S array. So here, the storage tree looks like the following. In the root, we have the contract itself, and the state variable T1
is simply a T instance, and S1 is an instance of S, which has a T inside, and also a T array, which again has a number of Ts inside. And SA is an S array, which again has S instances inside,
which are again recursively consisting of Ts and array of Ts. Now, the main observation is that each element can be identified by a path in this tree, and basically by indexing the edges, a path will be just an array of integers. We can assign an index to each
state variable, and then recursively to each member, and so on, and we can also use array indices. So now, if we want to, or if we need to point to an expression like, for example, sa8.ta5, then we have to fit this expression onto this tree, which we call packing. And in
this example, sa8.ta5 identifies this path highlighted with the indices 2, 8, 1, and 5. So
the pointer is just an array from integers to integers with these values mentioned just before. Now, the opposite is if we have a pointer, say as a function parameter, and we want to use it, like this T storage PTR, then we have to deconstruct where it can
possibly point, and we call this unpacking. Here, we create a conditional expression based on the tree. For example, with this tree, if we have some pointer to a T instance, then we first have to check the following. If it starts with a 0, then it can only be T1.
Otherwise, if it starts with 1, then it can be either s1.T if the next element is 0, or s1.ta index with the second element, which is the array index, if the next element is 1,
and so on. And since the tree has a finite depth, we can always deconstruct a finite conditional expression. Now, I mentioned that there is no mixing between the different data locations, so there are no pointers in storage and no values in memory. However,
it is possible to make assignments between the two locations, which can result in very different behaviors, like deep copy or just a pointer assignment. And the solidity semantics can be summarized with this table, and I'm not going to go into the details, but our formalization covers all the cases with different aspects. And these were just the key
ideas. The paper has all the details and all the corner cases, most importantly, formalized, so I encourage you to check it out. And let's now evaluate if we achieved our goals.
We implemented this encoding in our tool, saucy-verify, which is a modular verifier based on Boogie and SMT, and this encoded that I have just presented. And we compared it to which is a symbolic execution engine operating over the bytecode, and to VeriSOL, which is again
a modular and BMC-based tool, also based on Boogie and SMT, but they use a heap-based model for the storage as well. And we also compared to SMTChecker, which is the intra-function analyzer built into the solidity compiler. Now, just a side note, these experiments were performed for
ESOP 2020, so basically we did them almost one and a half a year ago, so the tools, including also ours, could have improved since then. And we developed a test suite with around
300 tests organized into different categories like assignments, deleting, initialization, storage operation, storage pointers, and so on. And each test exercises a specific feature and checks the result with assertions or with an assertion. So for example, here the contract
initializes a fixed-size memory array and checks whether it is really initialized to the proper length and the proper default values. So this way, the tests should only pass if tools properly translate and run the analysis with appropriate semantics. Now you can see the results grouped by categories, each tool having a bar with a color coding, green means correct,
gray is unsupported, and red is a wrong result. And highlighted with these purple boxes, you can see that the bytecode level material is precise. It does not have to deal with many of these high-level features, but we still managed to find some bugs. See, for example,
this issue on the slides. And you can also see that SoC verifiers comes close, especially compared to other SMT-based tools. We still have some unsupported features that we formalized but did not implement yet, like quantifiers, and there are some that we choose not to support
because it is rare or it will be removed in newer versions of Solidity. And another aspect is efficiency. And you can see that SoC verify, represented by the green bars, is one or two order of magnitudes faster than material, which is represented by the blue bars. And it is also
faster than Verisol. And yeah, with SMT checker, it did not support most of the features at that time, so we couldn't really compare. So to summarize, I presented an SMT-friendly formalization of the Solidity memory model. There are two data locations,
memory, which is a model with a standard heap, and storage, which has a pure value semantics, with the exception of pointers in a local context. And the formalization is implemented in SoC verify, modular verifier, and our conclusion was that SoC verify is on pair
with bytecode level tools, but at a low computational cost. And since then, we have actually been working on some stuff like verifying properties that involve Solidity events. We also started to support quantified properties, and we are working on upgrading to the latest
version, Solidity 0.8, because these experiments were still performed with 0.5, which was the latest back then. So thank you for your attention, and please check the paper, our tools, or our websites. Thank you.