
SPARKNaCl: A Verified, Fast Re-implementation of TweetNaCl


Formal Metadata

Title
SPARKNaCl: A Verified, Fast Re-implementation of TweetNaCl
Number of Parts
287
License
CC Attribution 2.0 Belgium:
You may use, modify, and reproduce the work or its content in unchanged or modified form for any legal purpose, and distribute and make it publicly available, provided that you credit the author/rights holder in the manner they have specified.
Abstract
SPARKNaCl is a new, verified and fast reference implementation of the NaCl API, based on the TweetNaCl distribution. It has a fully automated, complete and sound proof of type-safety and several key correctness properties. In addition, the code is fast - out-performing TweetNaCl on an Ed25519 Sign operation by a factor of 3 at all optimization levels. This talk will cover how "Proof Driven Optimization" can result in code that is both correct and fast on bare-metal embedded targets. TweetNaCl is a compact reference implementation of the NaCl API. It was initially constructed to show that an entire crypto library could fit into "100 tweets", but has since been re-used in some critical applications, such as the WireGuard VPN. There are no comments in the code at all, and all assurance rests on a single brief academic paper, and the formidable reputation of the authors. Can we do better? Can we produce a reference implementation which is amenable to automatic verification and yet is competitive with TweetNaCl in terms of performance and code size? This talk presents SPARKNaCl - a complete re-implementation of TweetNaCl in SPARK, which comes with a fully automated proof of type-safety, memory-safety and a number of key correctness properties. Having established a solid foundation, we went on to compare the performance and code size of SPARKNaCl against the original C implementation. Various transformations and optimizations have been applied that result in SPARKNaCl out-performing TweetNaCl on a bare-metal 32-bit RISC-V machine for a single Ed25519 "Sign" operation, while retaining automation and completeness of the proof. Furthermore, SPARKNaCl is freely available under the 3-clause BSD licence. This talk will present an overview of the results from both the proof work and performance analysis of SPARKNaCl.
Transcript: English (automatically generated)
Okay, thanks for coming everybody. Nice to see you all at FOSDEM this year. Nice to see you even virtually, if not in person. So this afternoon, I'm going to be talking about SPARKNaCl, which is a cryptographic library written
in Spark, which I'm sure most of you already know is a subset of Ada and associated tooling. This was a kind of a lockdown side project for me. It started out as a bit of fun, just to do some programming work during lockdown, and then kind of became a bit more serious as I got it more towards a production level of quality.
So what I want to talk about then, in order, is to start by talking a little bit about something called TweetNaCl, which is an existing version of the NaCl crypto library, which I started from. Tell you a bit about that and then tell you, well, why am I bothering to do this in the first place? What caught my interest about it?
I'll tell you about the goals that I set myself for this project. I'll also tell you what my goals definitely weren't in terms of things I didn't want to try to achieve. The two main technical contributions, I suppose, are firstly a section I'll talk about the proofwork on the software. So this is about the formal verification of the Spark version and what I was able to verify and what I wasn't able to verify.
Then we'll talk about the performance side, which is kind of saying, well, OK, we verified it, but can we actually make it go fast enough to be competitive with the C code without breaking the proof as well, which is a very interesting adventure. And I'll talk about finally some further work and ideas that we might have to take it further.
So moving straight ahead, what is NaCl? Well, OK, NaCl on its own is sometimes pronounced "salt", or it actually stands for Networking and Cryptography Library, but no one calls it that. People refer to it as libsalt or libsodium sometimes. NaCl is, I guess you could describe it as, the specification of a cryptographic library's API, which embodies a number of relatively recent cryptographic algorithms.
And there are many implementations of NaCl written in different languages. There are implementations in C, Python, Rust, C++, you name it. Lots of languages have implementations. But there are particularly well-regarded implementations of this, which have been subject to a
lot of quality control, and it is very highly regarded as a piece of software. TweetNaCl is an implementation of the NaCl API, which is written in C, and it has some rather unusual properties. This version of NaCl was actually written and designed by the people that designed the API in the first place.
So a rather odd property of TweetNaCl is that all of the source code for TweetNaCl, for the whole library, can be tweeted in 100 tweets. Now, this is a slightly strange thing to do. Firstly, it's an exercise in demonstrating that a useful cryptographic library can actually be
done in such an amazingly small amount of code, which is an achievement in itself. And secondly, it's a little bit of an exercise in saying, well, you know, demonstrating that you can distribute this kind of stuff to the whole world by tweeting it without having to get in the way of any copyright or distribution or export controls or anything like that.
So, as a result of trying to get the thing down to 100 tweets, one issue is that there are really actually almost no comments in the code whatsoever. So it is spectacularly difficult to understand how it works. I'll show you a bit of the code in a minute or two.
There is a technical paper of about 16 pages that describes how the library is put together. It is a really rather succinct paper, and it makes sense if you understand the mathematics of things like elliptic curves and all these other algorithms. But if, like me, you're not a hardcore crypto mathematician, if you're a software engineer,
it's pretty difficult reading this paper, and trying to understand the code is still pretty difficult. And the assurance for TweetNaCl largely rests on things like, well, there are lots of people using it. There are some significant users. I believe a version of it has been reused in something called the WireGuard
VPN system, which is widely used and is built into many Linux distros these days. So there are lots of users, and the designers of NaCl, particularly a chap called Dan Bernstein and his colleagues, do have a very formidable reputation as cryptographic mathematicians, certainly.
But it's interesting to look at a piece of code like that and say, well, you know, is it really actually trustworthy and has it got any bugs in it? And if we applied state of the art formal verification tools, what would we find? So that's kind of where I came from. I thought, well, OK, let's just rewrite it in Spark and see what we get. What you get with TweetNaCl, in terms of the algorithm set, is a number of modern cryptographic algorithms.
So Salsa20 is a modern stream cipher. There's an authenticator called Poly1305. It uses a standard hash algorithm, just the 512-bit variant of SHA-2. And for signatures and key exchanges, it uses an elliptic curve called Curve25519, on top of which is built the Ed25519 signature scheme.
And you can also use it to establish a shared secret using the variant of ECDH built on that elliptic curve. So you get a really useful set of primitives with which you can do most things that you would want to do. So it is really a useful library in so little code, it's quite surprising.
So as I said, looking at the C code, when I originally looked at it, I just looked at it and thought, gosh, I don't get this. How is this possibly working? Because, as I said, there were no comments. There's also a bit of trickery going on where the code is doing some cunning mathematical tricks to achieve an acceptable level of performance.
The code is doing some interesting mathematical trickery to make things like the elliptic curve operations go sufficiently fast. It's not as highly optimized as some libraries that you might come across, so it's not written in assembler or doing really wild things, but it is still pretty difficult to understand. Secondly, it is programmed in a fashion called constant-time programming style.
So the idea here is that the running time of the algorithms does not vary in a way that depends upon any variable or any input variable of any of the algorithms, if you like. So this is to defend the algorithms against side channel attacks based upon the execution time of the algorithm.
And that makes for the code being more difficult to understand, because you look at a piece of code and think, what on earth's going on? And then you think, oh, why didn't they just write it differently, you know, it would be much easier if you wrote this out as an if statement. And then you think, oh, they can't write an if statement because you'd be testing a sensitive variable, and you can't branch based on a sensitive variable.
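To make that constant-time idea concrete, here is a minimal, hedged sketch of the style (an illustrative fragment, not the actual TweetNaCl or SPARKNaCl source): a conditional swap written without an if statement, where the secret bit is turned into an all-zeros or all-ones mask so that the same instructions execute whatever the secret value is.

```ada
with Interfaces; use Interfaces;

--  Illustrative constant-time conditional swap. B is assumed to be 0 or 1.
--  The mask is all-zeros or all-ones, so there is no branch on the secret
--  value and the running time does not depend on B.
procedure CSwap (P, Q : in out Unsigned_64;
                 B    : in     Unsigned_64)
is
   Mask : constant Unsigned_64 := 0 - B;              --  0 or all-ones
   T    : constant Unsigned_64 := Mask and (P xor Q); --  0 or (P xor Q)
begin
   P := P xor T;
   Q := Q xor T;
end CSwap;
```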
So it's kind of written in a slightly weird way in this constant time style. And finally, as I said, the paper is rather terse. So here's an example of what the paper says and then what the code looks like. So, for example, here's a rather marvelous paragraph from the paper. There's an algorithm in the signing algorithm.
There's an algorithm that reduces a number, a large number, modulo a big number called the order of the group. And you think, OK, and this one paragraph is the description of this whole code function in the library. And you look at that and you think, huh, what on earth is that about?
And is that really right? Well, yeah, except actually, when you get down into the weeds of it, and as a sort of spoiler of something I found while doing the formal verification of this code, there is actually a mistake in this paragraph, and I'm sure you all spotted it, right?
Obvious. It's a harmless mistake, I'm pleased to say, but there is actually a factual error in this paragraph. I won't tell you where it is yet. But that one paragraph is the only description you get of some code. Just for the sake of readability, we just introduce a couple of things to make it readable.
There's a couple of typedefs you need to know about. There's a signed integer 64 called i64 that the code uses. There's a type called GF, which is an array of 16 of those signed 64 bit values. So the GF type is used to represent very large numbers, basically.
It's an array of integers where each integer is like a digit of a large number made up of an array of digits. And those are used to represent numbers which are points on the elliptic curve. So this is a type that's all over the place in the software. It then uses a couple of #defines just to abbreviate a couple of common idioms,
which makes the code shorter and therefore easier to fit in 100 tweets. So that paragraph of mathematical stuff becomes that code in the TweetNaCl distribution. And you look at that and you think, wow, what on earth is going on there? If you look here, you can see all these mathematical operators.
There's lots of various multiplications and additions and subtractions and shifting. And imagine every single one of those operators: you've got to convince yourself that it can't overflow, and these are all signed integer operations. So remember that in C, signed integer overflow is undefined behavior.
So you've got to convince yourself that every one of those could never overflow or underflow in any way. That's pretty difficult to do. And it's pretty difficult to see how this code works at all, let alone know it's doing the right thing and you're not going to be undefined or anything. So that's the kind of scale of the challenge that you're faced with this kind of code.
So why did I even bother to do this? In the last five to ten years, I guess there's been a real explosion in interest in this field, particularly people doing formal development or formal verification of cryptographic code. There have been many, many notable achievements in this area from people like Microsoft Research,
Amazon Web Services, the people at Galois out in the States with their Cryptol and SAW tool sets. There's a system called Jasmin with EasyCrypt, which is extremely strong, an effort called hacspec with Rust to specify and refine cryptographic code, Fiat Cryptography. And probably as many projects again that I've forgotten to mention.
And I apologize to anyone who, you know, feels aggrieved that I've forgotten to mention their favorite project. And there are lots and lots of very impressive results in this area. Also, there are very highly respected implementations of NaCl out there already; in particular, the kind of benchmark implementation that many other distributions are based upon is called libsodium.
And libsodium is very highly regarded. It has been subject to a great deal of audit and analysis. It is also extremely efficient because bits of it are implemented in assembly language for particular targets. And they use some extraordinary tricks to make the thing go very, very fast. So that's a really well respected implementation. So, you know, why bother doing another implementation of NaCl?
Well, I really wanted to see, particularly with Spark 2014, since the language was, you know, so significantly enhanced with Spark 2014 and modern-day proof engines, I wanted to see if the proof system could deal with code of this kind of complexity, this kind of mathematical
nature, using this constant-time programming style, to see if the tools could cope with code of this nature. So what could we prove? Could we prove that it's so-called type safe, or free of runtime errors, and prove the code couldn't raise an exception, for example? Or could we do more? Could we actually go on to prove some more advanced sort of correctness properties?
I'm also interested in automation, meaning can I get the theorem proving tools to prove everything automatically? So you just push the button and it says, yes, everything is OK. And it proves all of the verification conditions automatically with one or more of the main proof engines, because I don't want to be involved with having to check false alarms from the proof tool.
I definitely don't want to have to be firing up some sort of interactive proof tool to kind of finish off the proving job. I want to automate the thing completely. So it'd be interesting to see what level of automation can be achieved. There's also a myth that I come across quite a lot working in industry.
There is a general perception I see in industry that people think that formal languages are either too slow or not compatible with people's target hardware and sort of unusable in the real world, where the real world might be, you know,
running on a small embedded target, something that hasn't got much memory or much computing power, or, these days, not much battery power. People are very interested in power consumption as well as speed. But there is this perception that formal languages are sort of too slow or too big for production use. This often comes out in the wash restated as people saying, well, this code has to be really fast and really efficient.
And it's got to run on this little embedded target, so we have to program it in C, or C++, or C#, or any language that starts with C, or possibly assembly language. And languages like Spark don't even get a look in. People don't consider that it could be done in Spark. And I was interested in seeing if that really was the case and seeing if we can achieve both efficiency at execution time and also
the level of assurance and proof that we can do with Spark and see if we can get both of those at the same time. So what goals did I set myself and not set myself? So I wanted to do the whole library. I didn't just pick on one
of those cryptographic algorithms. I decided I'd do the whole library, all of the algorithms. So did that. Fine. Obviously, my new library should at least pass all the regression tests that exist for TweetNACL. And indeed it does. And that's fine. That's pretty, pretty obvious. The implementation retains the constant time property of the programming.
That's important. That stuff about avoidance of the side channel attack from timing variation. It is an important property of the library. And I decided I had to retain that to see if the tools could deal with it as well. And what I really set my goal on was this idea that I get a completely automatic
proof of at least so-called type safety, where type safety in Ada terms basically means proving that the program could never raise an exception at runtime, but also covers things like completely getting rid of any possibility of undefined behavior or erroneous behavior and so on.
Auto-active is a term that was, I believe, coined by Rustan Leino, who used to work for Microsoft Research and more recently works for Amazon. Auto-active is a sort of portmanteau of automatic and interactive. So it's this idea that the proof is automatic in that you just press the button
and it works, but it's interactive in that, to make it automatic, you have to write contracts in the code, where contracts are things like assertions, preconditions, postconditions, loop invariants and so on. So there is a bit of work involved, you actually have to think. But once you've got those contracts in place, the proof is completely automated.
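As a flavour of what that looks like in practice, here is a small, hedged sketch (an illustrative example written for this summary, not code taken from SPARKNaCl): the postcondition and the loop invariant are the contracts you write by hand, and the proof tool then discharges the resulting verification conditions without further interaction.

```ada
package Contracts_Sketch
  with SPARK_Mode => On
is
   type Byte is range 0 .. 255;
   type Byte_Seq is array (Natural range <>) of Byte;

   --  Zeroize a buffer; the postcondition states the result we want proved.
   procedure Zeroize (X : in out Byte_Seq)
     with Post => (for all I in X'Range => X (I) = 0);
end Contracts_Sketch;

package body Contracts_Sketch
  with SPARK_Mode => On
is
   procedure Zeroize (X : in out Byte_Seq) is
   begin
      for I in X'Range loop
         X (I) := 0;
         --  The loop invariant records what has been zeroed so far; with it
         --  in place, the postcondition is proved automatically.
         pragma Loop_Invariant (for all J in X'First .. I => X (J) = 0);
      end loop;
   end Zeroize;
end Contracts_Sketch;
```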
That matters because I really didn't want to have to fire up an interactive proof tool like Isabelle or Coq or any of those other tools, because I know from experience they're not much fun to drive. Another goal was that, for bare-metal, kind of very small embedded targets, I
wanted the whole library to be compatible with the zero footprint runtime, which is implemented by GNAT. So that means you put the compiler in a particular mode, and it never ever generates a call to the Ada runtime library, so that the code will effectively run on anything. There are no dependencies on any other packages or libraries, there are no COTS components at
all, and the whole library is programmed in this sort of heap free style. So there's no use of dynamic allocation or deallocation or anything, it needs a bit of stack memory, and that's basically all. And it's completely portable, the same code compiles, if you can compile it, I assume, it will run on anything, any 32 or 64 bit machine, it should just work straight out of the box.
I wanted the performance to be, shall we say, competitive with TweetNaCl, we'll see how we go, we'll come back to that. And will we find any bugs? If I actually attempt this verification, will I actually find bugs in my code, which turn out to be bugs in the original master copy from TweetNaCl?
We have some form in this area, I did a similar exercise some years ago with a hash algorithm called Skein. And in doing that verification in Spark, I found a bug in the Skein algorithm, or at least the implementation of the Skein algorithm in C from the designers of the algorithm; it actually had a very subtle arithmetic overflow bug in it, which
had never been discovered before by anyone else, so we have some sort of form in this regard. As for what I didn't want to do: I'm not going to try and compete with libsodium in terms of performance, because that would be silly, I'm not going to compete with very carefully crafted assembly language. I'm not going to try to prove, you know, actual
heavyweight mathematical correctness properties of the code; that seemed probably a bit ambitious. And I'm not trying to say anything about the mathematical security of things like the elliptic curve, Curve25519; that's the domain of crypto mathematicians, not software engineers, so I'm not sort of venturing into that territory.
OK, so on to the proof work. Firstly, a few statistics for you. It's about 1700 logical lines of code, comprising statements and declarations, and slightly over 100 subprograms: 74 procedures and 37 functions.
In terms of contracts, this is a bit interesting. There's quite a lot of types and subtypes, as you know from using Ada, types and subtypes are sort of super important in the correct use of the language. But other contracts, there's 27 preconditions, 20 explicit post-conditions, dynamic
subtypes, or dynamic subtype predicates, turn out to be really important. I think they are something of a killer feature of Ada 2012, which the Spark tools can exploit. There are a few of those. And there are 51 explicit assertions in the code, assert pragmas. That's quite a lot, and that reflects the difficulty of some of the proof that we have to do: the asserts are used
to kind of help the prover, if you like, to kind of give it a hint of things it should prove first. Before it can prove C, you should prove B, and before B, you should prove A. Well, you can kind of do that by giving it assertions. So there are quite a few intermediate assertions in the code that we'll come back to.
Proving things about loops is often tricky. I had a quick look. There were 56 loop statements in the library. Interestingly, of those 56, 25 don't require any loop invariants at all to be stated by the user. The tool just synthesizes something and that's enough to prove what we want to prove. 18 of them require a user written loop invariant, but it's really simple. It just
took me about 30 seconds to write the correct loop invariant and it was done. And only 13 of them are sufficiently hard that I had to write a loop invariant and it actually took some real work and some real thinking to get it out. But I will admit those 13 were actually quite tricky. If you look in the code, you'll see some of them are really quite, quite non-trivial.
And I'll tell you where those are in a minute. Yeah, the difficult bits, unsurprisingly, are things like there's a multiply operator for that GF type. So this is multiplying 256 bit integers modulo a large prime number.
The large prime number, by the way, is two to the power of 255 minus 19, which is where the curve gets its name. So, yeah, multiplication modulo two to the 255 minus 19. That's a really difficult bit of code. And proving that it doesn't overflow, for example, proving that all the intermediate numbers that you get don't overflow is actually quite tough.
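As a hedged reconstruction of the kind of bound involved (assuming the radix-2^16, 16-limb representation that TweetNaCl uses, so this is indicative rather than the exact figures from the proof):

```latex
% p and the limb representation:
%   p = 2^{255} - 19, \qquad a = \sum_{i=0}^{15} a_i\,2^{16 i}, \qquad 0 \le a_i < 2^{16}.
% High product limbs fold back into low ones with a factor of 38, because
\[
  2^{256} = 2 \cdot 2^{255} \equiv 2 \cdot 19 = 38 \pmod{2^{255} - 19}.
\]
% Each raw product limb is a sum of at most 16 terms a_i b_j < 2^{32}, so after
% the times-38 folding the intermediate limbs stay below roughly
% 16 * 39 * 2^{32} < 2^{42}, comfortably inside a signed 64-bit type, which is
% the sort of bound the overflow verification conditions have to establish.
```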
That was a difficult one. There is a subsidiary function to that called carry and reduce, which is abbreviated to CAR. That function does the modulo reduction thing. That's quite tricky as well. And proving that that terminates is actually quite tricky and proving that it converges turned out to be quite difficult.
And we've actually done that. And then there's this modulo L operation, the fragment of code I showed you earlier. That is spectacularly difficult to understand and was really quite difficult to prove to be type safe. Interestingly, the first two of those that the multiply and the carry and reduce operations are also the most performance critical in the loops of the whole library.
For a single signature generation in the library, I think the carry and reduce function is called about 20,000 times. So that's very intensely performance critical. So trying to, for example, get the inner loop of that code to be really, really fast was something of a big deal.
So what have we proven? So this notion of type and memory safety in Ada terminology kind of reduces to the complete absence of undefined or erroneous behavior. There's no dependence on unspecified behavior, which is an intrinsic property of Spark.
There's nothing that would raise an exception. So we prove that all range checks, overflow checks, index checks, division-by-zero checks and so on can never fail, using the theorem proving tools. Obviously, because it's Spark, there's no explicit use of pointers and no use of the heap. So that makes a whole load of problems go away.
Also, obviously, what we're doing is proving that all object values always satisfy their type declarations and all users supplied contracts are always true. So preconditions, post conditions, assertions and loop invariants are always true as well. And then if you get the whole lot out, you know, you're in a good place. So that's basically what we've proven.
Some observations about that. Types basically are awesome. If you're used to using Ada, which I'm sure most of you are, since you're in the Ada room at FOSDEM, you'll know that already. But this comes as a surprise to other people, people that are used to programming in C. They kind of don't get the idea of things like user defined scalar types,
user-defined integer types are something of a crazy concept to C programmers, you know. And it's just so important in Ada to use these things appropriately. It really is the key to achieving automation and completeness, meaning zero false alarms and getting one hundred percent of the VCs out.
It really is all about using the type system properly. And another observation is that types, this idea of type safety is not really a fixed price concept because in Spark, the more effort you put in to declaring and using types and subtypes properly, the more you get back in terms of what you can prove. So it's really up to you once you've established these basic properties of no undefined behavior and everything else.
How far you go then depends upon what you actually write: the more declarations you write, and the stronger they are, the more value you can get back from the proof system. Types are like, if you like, a weak specification of correctness.
And so some of the properties that emerged from the proof are actually what you would think of as being correctness properties of the code. Although they're expressed as types in the language. So this notion of, well, is it type safe or is it partially correct? Actually, it kind of is a bit fuzzy in Spark and in this library because
very strong use of types actually starts to establish partial correctness properties as well. And a really interesting observation for me was the discovery that the so-called dynamic subtype predicate thing from Ada 2012 really is amazing.
It offers a leap in the expressive power of the type system in Ada 2012, which is enormously important. I've recently discovered they also exist in languages like Haskell and OCaml, where they're called liquid types, where liquid stands for logically qualified, which is a rather strained abbreviation of the term, I suppose.
But yeah, they do exist in other languages. But this idea that you can define a subtype with a fully general predicate is really, really amazing and very, very powerful. So an example from the code is this bunch of declarations. So what we're looking at here is this type, the type GF we saw in C, which is this array of 64 bit integers.
You can see that in the Spark code it's called type GF64. And it's an array of 16 values of type GF64_Any_Limb, where GF64_Any_Limb is an I64 subtype. Because we're in Ada, we're going to define the bounds on that thing, which are defined in terms of some constants we've declared above.
But then, interestingly, I've got a subtype called Normal_GF64, which is a GF64 with a dynamic predicate that says: it's a GF64, so it's an array of 16 64-bit signed integers, but each of the elements of that array is constrained to be in the range 0 to 65535.
So that's like an array of 64-bit integers where each digit is constrained to be normalized down into a much smaller range. So the use of this dynamic predicate thing is really, really important. And the proof tools are aware of the dynamic predicate and what it means and will prove that you always respect it.
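As a hedged sketch of the style of those declarations (the names and the exact bound constants here are illustrative, not copied from the SPARKNaCl sources):

```ada
with Interfaces; use Interfaces;

package GF_Sketch is
   subtype I64 is Integer_64;

   Max_Limb : constant := 2**53;   --  illustrative bound on an un-normalized limb

   subtype GF64_Any_Limb is I64 range -Max_Limb .. Max_Limb;

   type Index_16 is range 0 .. 15;
   type GF64 is array (Index_16) of GF64_Any_Limb;

   --  A "normal" GF64 is still an array of 16 signed 64-bit limbs, but the
   --  dynamic predicate constrains every limb to 0 .. 65535. The provers
   --  know this predicate and check that it is always respected.
   subtype Normal_GF64 is GF64
     with Dynamic_Predicate =>
       (for all I in Index_16 => Normal_GF64 (I) in 0 .. 65535);
end GF_Sketch;
```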
And it's really, really cool. In terms of volume of proof, the code generates exactly 907 verification conditions for a particular release. I'm pleased to say that with a combination of the provers Z3, CVC4 and Alt-Ergo, the distribution, basically the community
edition, the 2020 and 2021 community editions of Spark managed to prove all 907 of those verification conditions completely automatically. About 770 something are so trivial that all three provers blow them away every time very, very trivially.
And 130 are hard, in that those 130 are ones where at least one of the provers fails to prove the VC. So you have to have more than one prover switched on. Of the 130 that are hard, the scores are that Z3 gets 122 of them.
CVC4 then comes along and proves another two. And then Alt-Ergo comes along and proves the last six. And that's how we managed to get them all out. But it's interesting to note that you do need all three of the provers. No two of them alone are enough: you won't get 100 percent with only two of them. You do still need all three to get to 100 percent.
In terms of elapsed time, it's about four and a half minutes on the laptop sitting in front of me. This has eight processor cores, so that's not too bad. That's kind of nearly a coffee break, which is reasonably acceptable. The tools are pretty good at being parallelized, and that's good news: they generate lots of verification conditions.
And, you know, you get a VC file almost for each check, for each line of code that has a check on it. For each subprogram, you get a verification condition file, and you can prove them in parallel using as many provers as you like. That works very, very well and scales very well. And you can just buy more processor cores to speed the proof up.
That's really good news. OK, so on to performance. So what does it actually go like when you run it on a real machine? For the test setup, I needed an embedded target, so I bought one of these things. This is a SiFive HiFive1 Rev B board, which is a small 32-bit RISC-V machine, because it's a small embedded machine.
Also, the GNAT Community Edition release from 2020 has support for this out of the box. This is a 32-bit RISC-V, so it was supported straight out of the box by GNAT, which was very nice. And thanks to AdaCore and the team that work on the BSPs and so on for supporting it.
So I started with the GNAT Community 2020 edition, which is essentially, under the hood, a release of GCC 8.4.1. And what I decided to test was a sign operation, because that's computationally intensive, it's important, and it uses all of the mathematically interesting stuff, the elliptic curve operations.
And I'm signing a 256-byte message using a particular fixed key. And what I did was to sign the same message using SPARKNaCl and using the C implementation from TweetNaCl, and measure their runtime on the board to see what I got.
There's some reasons why I thought Spark might be slower than C. So firstly, I adopted a more functional programming style in Spark than you would in C. So particularly in Spark, for example, I was able to overload the mathematical operators for that GF type so that I can supply a proper binary plus operator, a multiply operator and so on for that type,
which essentially, you know, in Spark is a function that returns a GF type. And normally in Ada, the canonical semantics say that that function return would be by copy. But that GF type, remember, is 16 64-bit integers, which is 1024 bits.
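A hedged sketch of what that looks like (illustrative, not the actual SPARKNaCl spec): the operator is just a function returning the GF type by value, and marking it with GNAT's Pure_Function aspect records that it has no side effects, which is what later lets the compiler apply the return slot optimization discussed a little further on.

```ada
with Interfaces; use Interfaces;

package GF_Ops_Sketch is
   type GF is array (0 .. 15) of Integer_64;

   --  "+" returns a 1024-bit value, nominally by copy. Pure_Function is a
   --  GNAT aspect saying the function has no side effects.
   function "+" (Left, Right : in GF) return GF
     with Pure_Function;
end GF_Ops_Sketch;

package body GF_Ops_Sketch is
   function "+" (Left, Right : in GF) return GF is
      R : GF;
   begin
      for I in GF'Range loop
         R (I) := Left (I) + Right (I);   --  sketch only: no carry handling here
      end loop;
      return R;
   end "+";
end GF_Ops_Sketch;
```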
And obviously returning 1024 bits by copy on the stack might actually be really, really slow. So that might be a reason why we would be slow in Spark. So that's a possibility. TweetNaCl also has a rather interesting trick in it where it only does two of those normalization steps after the multiply operation.
In SPARKNaCl, it was actually a lot easier to prove if you did three normalization calls after a multiply. And also, SPARKNaCl normalizes after addition and subtraction of those GF values. So the bottom line is that in SPARKNaCl we do more of those normalization operations, so that might slow us down.
Finally, SPARKNaCl sanitizes its local variables, which is good practice in cryptographic programming: you sort of erase your local state before you return from a function. TweetNaCl doesn't do that. So that's another reason why Spark might be slower. There are also some reasons why Spark might be faster than C.
Obviously, once we've proven that the Spark code is type safe and free of runtime errors, we are compiling it with the switch on the compiler that says suppress all runtime checking, the -gnatp switch, because we can. We've got justification and evidence to do that, because the proof tools tell us that the runtime checks can never fail.
Therefore, you can never raise an exception; therefore, suppress all the runtime checking. So that's good news. Spark also has many properties that make it intrinsically good for an optimizer: things like having no aliasing of names in the language, no undefined behavior, and no side effects in expressions or function calls.
These are correctness properties, but also they're properties that an optimizer might be able to exploit to improve the code. So there's a potential way we might be good. So what did we get? Well, we started optimization level zero, one and two. And obviously I'm using GNAT Community Edition here to compile the C as well as the Spark.
So it's the same back end of the same compiler. At optimization levels one and two, the C code basically won by a bit, as you can see in green there. But a really big surprise was that the Spark code was substantially faster at optimization level zero, much faster than the C.
And that was actually quite a surprise. So what was going on? Firstly, I found at optimization level zero, the C code is very slow at doing bitwise rotate because in C there is no bitwise rotate operator. So you have to write it out as a full blown function. And without the optimizer, that function doesn't get inlined in C.
It's a full-blown function call. In Spark, rotation is an intrinsic function in package Interfaces, so you typically get one instruction or a very small number of instructions for a rotate operation, therefore very, very fast. And that saved us a lot of time.
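For example, a hedged sketch of the Ada side (illustrative; SPARKNaCl's own rotate usage will differ in detail):

```ada
with Interfaces; use Interfaces;

--  Rotate_Left on Unsigned_32 comes from package Interfaces and is an
--  intrinsic, so even at -O0 it typically compiles down to a rotate
--  instruction or a very short shift/or sequence, not an out-of-line call.
function RL32 (X : Unsigned_32; Amount : Natural) return Unsigned_32 is
begin
   return Rotate_Left (X, Amount);
end RL32;
```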
There's a small improvement owing to an improvement in the carry and reduce code, which comes from the WireGuard sources that I adopted. That's not that significant. A big deal was this thing called return slot optimization. So, that concern I had about the cost of a function return of a thousand bits or more being very, very slow:
turns out that's almost entirely removed by the compiler because, again, of the properties of Spark. Because the function calls are side-effect free and alias free, the compiler is able to actually optimize those function calls, and essentially it optimizes them into doing a return by reference directly into the caller's stack frame. And that works even at optimization level zero in GNAT.
And that's really, really cool because that concern just disappears. We'll come back to return slot optimization a bit later on. So the golden rule I had with improving performance was whatever I do to improve the performance of this code, I'm not going to break the proof. So having established a 100 percent automatic proof, I'm going to keep it 100 percent and fully automated whatever I do to improve performance.
So that was my kind of guiding rule, if you like. It also was really nice because the proof stops me making stupid mistakes. If I start with type safe code, obviously I should finish with type safe code because I shouldn't break the correctness properties of the code.
So the proof tool is really, really useful because it stops me doing things I shouldn't do. I also found, and we'll come to this, that the proof system and the types in particular language can kind of suggest or guide optimization. And we'll see that in just a moment. So the first round of things I did was some fairly basic optimizations that I knew would help.
So things like going for an optimal initialization of large objects rather than initializing everything in a large object once and then initializing it all again in a loop or something. I was able to optimize that. I did a bit of manual common subexpression elimination and partial redundancy elimination, and a little bit of judicious unrolling, particularly of the inner loop of the GF multiply operation.
There's a big, big improvement available by unrolling the innermost of two nested loops. I removed some slices from the Spark code because that was a bit of a source of slowness. And where possible, I actually applied those changes back into the C source code of TweetNaCl to kind of level the playing field.
So with that lot in place, the numbers come out like this. And this is a bit less surprising. So now we see that TweetNaCl is winning by a bit at all optimization levels, but with a big improvement across the board in both languages, which was very pleasing, just by doing some very basic stuff.
In my second round, I came across this observation about the GF type: most of the time, the numbers in that GF type are actually limited to be unsigned and 16 bits in range, even though they're being stored in 64-bit digits.
Because when you're doing a multiply, those digits overflow a normal 16-bit range. The digits actually get very, very big as you multiply up all the different digits and coefficients and add them all together. Some of those digits get up to about two to the forty-two or two to the forty-three or something. So you need 64 bits to store one, for sure.
But a problem is that, while the code is compact, it's very wasteful of storage. You're storing a lot of zeros there. You're storing 1024 bits where really, most of the time, you could store one of those GF things in 256. So a question is, could we compress that type down to be 256 bits?
And also, can we use the proof system to prove that some of the 64 bit mathematical operations could actually be done in 32 bits? Because if you can prove that something will always fit in a 32 bit plus or a 32 bit multiply, you can save a lot of time. So, for example, on RISC-V 32 bit, a 32 bit multiply is one instruction, but it's six instructions to do a 64 by 64 bit multiplication.
So that's a potentially enormous saving. So, for example, in the carry and reduce code, the first application of carry and reduce starts with 64 bit values.
But it turns out the result of that operation, all the digits fit in 32 bits. The second carry and reduce code goes from 32 bit to 32 bit. The third application goes from 32 bit back to 16. What you can do is in that second and third call, you can actually do all the arithmetic in 32 bit maths, not 64 bit maths.
And that makes a huge difference. The other potential benefit from compressing the representation is that just assigning these things is fast. If you do need to copy one of them, you only need to copy 256 bits instead of 1024. Also, you might get more of them in the data cache, which would help as well.
And as I said, the proof system is backing me up here by proving that these transformations are OK and proving that the operations still don't overflow, even if I do them in 32 bits and so on. So that's nice to know I'm not going to make a mistake. With that in place, there is an absolutely amazing improvement.
The Spark code goes from ninety-something million CPU cycles for a sign operation down to thirty-something or twenty-something. So we're getting about a factor of three better from the Spark code with this revision, this so-called operator narrowing kind of transformation and compressing the storage. We absolutely get an amazing improvement at pretty much all the optimization levels, particularly at one and two.
We get down to 32 or 27 million CPU cycles from 90 something, which is really, really, really impressive. In the third round of optimization, I thought, well, let's try optimization level three. But use the no-unroll pragma to tell optimization level three where not to unroll loops just to make sure we were in control of what it was doing.
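Concretely, that looks something like the following hedged fragment (illustrative; GNAT's Loop_Optimize pragma is placed inside the particular loop you want to keep rolled):

```ada
with Interfaces; use Interfaces;

procedure No_Unroll_Demo (Total : out Unsigned_32) is
begin
   Total := 0;
   for I in 1 .. 64 loop
      --  Ask the optimizer not to unroll this particular loop, even at -O3.
      pragma Loop_Optimize (No_Unroll);
      Total := Total + Unsigned_32 (I);
   end loop;
end No_Unroll_Demo;
```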
I also had a look at optimization level S, which means optimize for size, to see what it would do. I also tried the so-called compressed instruction set from RISC-V, where basically some of the 32-bit instructions on RISC-V can actually fit in 16 bits.
They can be compressed. So you get smaller code size and possibly better hit rate in the instruction cache. On this particular target machine, there was actually a penalty, though, because if you branch to an instruction that's not on a 4-byte boundary, there's actually a performance penalty for that. So we're not sure that's going to help or not, but we'll try it.
Then I discovered the compiler could actually change that. You can say, give me compressed instructions, but force alignment. So all jump targets, all basic blocks always start on a 4-byte boundary. The compiler has switches to do that. So that's what I actually settled on. So with that in place, we get the following results.
We get a bit more improvement across the board, except for some reason optimization level two, there's some sort of degradation that I never really understood. But we do get a bit of improvement at that level. So at this point, the best we've got is about 25 million cycles for a sign operation at optimization level three.
The other interesting numbers: the stack usage is down to about two and a half K for a sign operation. That's computed using the GNATstack tool. The code size for the whole library is about 18K with optimization S switched on. That's OK. I mean, it's only 18K here, folks. I don't know if many of you remember what it was like programming
when your memory was measured in kilobytes, not megabytes, but that's quite reasonable. A bit more optimization in round four was things like doing some loop fusion: merging the first carry and reduce operation with the plus and minus functions helped a bit. I also realized I could narrow another multiplication operation in the actual multiply
function itself to do a 16 by 16 bit multiplication instead of a 64. Again, the proof system said that was OK. And a bit more judicious loop unrolling in the multiply function to avoid double initialization of the intermediate. That gave us another few hundred thousand CPU cycles.
So we were down into the realms of 23 and a bit million cycles at optimization level three, which was quite nice. Then a year passed, and GNAT Community 2021 came along, based on GCC 10.3. And as a result of the work I'd done earlier, many thanks to Eric Botcazou at AdaCore, who actually implemented some changes
in GCC so that the return slot optimization stuff was enabled for more cases of function calls because of the properties of Spark. So particularly, if you put the aspect Pure or Pure_Function onto a function in Spark now, the compiler picks that up and says, aha, and is able to turn on return slot optimization in more cases.
So that got us about another two or three hundred thousand CPU cycle saving. So down to about twenty three point two two million at optimization level three, which was useful. Observations, then the optimizations seem to come in several flavors. There are language specific stuff like array slices and this return slot stuff.
There's stuff to do with the micro architecture, like compressing instructions or not, as the case may be. There's algorithmic improvement like transforming the mathematics, but using the proof engine to support and prove that we're not making a mistake. And then there's what I would call back porting. So this is taking optimizations that happen at a higher
optimization level and transforming the source code so that you get the benefit of that optimization, but without the optimizer. This is things like partial redundancy elimination and loop unrolling, where you change the source code to make the optimization happen, but you still get the benefit at lower optimization levels.
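For instance, a hand back-ported partial redundancy elimination might look like this toy example (invented, not code from the library):

```ada
--  Invented example of "back-porting" an optimization to the source:
--  the common subexpression X (I) * Y (I) is computed once and reused,
--  so even at -O0 the multiply is only executed once per iteration.
with Interfaces; use Interfaces;

procedure PRE_Demo is
   type Vec is array (0 .. 15) of Integer_64;
   X, Y, R, C : Vec := (others => 1);
   T : Integer_64;
begin
   for I in Vec'Range loop
      --  Before (the product is evaluated twice unless the optimizer
      --  removes the redundancy):
      --     R (I) := R (I) + X (I) * Y (I);
      --     C (I) := C (I) + X (I) * Y (I);
      --  After hoisting the product into a temporary by hand:
      T := X (I) * Y (I);
      R (I) := R (I) + T;
      C (I) := C (I) + T;
   end loop;
end PRE_Demo;
```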
It is hard to predict what you're going to get, though. You really do have to have the hardware and do the measurements to understand what's going on. Right. The combination of using the proof system with the optimization is really, really great. The proof tool just stops me making mistakes, but it also suggests things.
It says, you know, the types are there, and the types are saying you could do this operation in 16 bits or 32 bits instead of 64 — and I can prove that that's safe. That's really, really great and works really well. I do come across a lot of projects in industry where people don't like using the optimizer, which is quite a surprise.
People still compile with the optimizer off. And there is a general perception, particularly among C programmers, that they don't trust the optimizer — or at least they say, well, we tried switching it on and our code broke, so we switched it back off again. That's disappointing. SPARK is amazingly robust in the face of optimization, and this is not a surprise: SPARK was designed to have this property
that the semantics is independent of optimization, and the complete elimination of undefined and unspecified behavior from the language means that optimization is typically very safe and robust on SPARK. And that's good news. Loop invariants are still really hard to find for the non-trivial stuff.
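To give a flavor, here is a deliberately simple, invented example — the real invariants in the carry/reduce and multiply routines are considerably nastier:

```ada
--  Invented example, much simpler than the real SPARKNaCl invariants:
--  the loop invariant gives the prover a bound on the running total,
--  so it can show the next addition cannot overflow and that the
--  postcondition holds on exit.
with Interfaces; use Interfaces;

procedure Sum_Demo
  with SPARK_Mode => On
is
   subtype Limb is Integer_64 range 0 .. 65535;
   type GF is array (0 .. 15) of Limb;

   function Sum (X : in GF) return Integer_64
     with Post => Sum'Result in 0 .. 16 * 65535
   is
      Total : Integer_64 := 0;
   begin
      for I in X'Range loop
         Total := Total + X (I);
         pragma Loop_Invariant (Total in 0 .. Integer_64 (I + 1) * 65535);
      end loop;
      return Total;
   end Sum;

   Result : Integer_64;
begin
   Result := Sum (GF'(others => 1));
   pragma Assert (Result = 16);
end Sum_Demo;
```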
The difficult ones were really difficult. If you want to have a look at them in the code, be my guest — they're not very nice in about three or four of the functions. Reproducibility of results with the prover is a tricky subject at the moment. Things like: if you take a new version of a tool like Z3 or CVC4, it's not guaranteed that you'll get the same results.
And it's not guaranteed that you'll get better results. You know, I've tried this and sometimes a newer proof tool actually degrades in that it fails to prove something it proved before. And that's really disappointing. Also, timeouts in proof tools are a terrible, terrible thing.
The proof tool timing out, as a way of generating what amounts to a false alarm, is not a great thing if the timeout is based on the passing of wall-clock time, because then you don't get the same results on all your desktop machines.
You don't get the same result on your server, or in your continuous integration, or out there in the cloud. That's really not very helpful. So getting the same results on all the platforms, for all the developers, all the time is really important, and you have to go to things like using step limits in the proof engines rather than timeouts.
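In gnatprove terms that means pinning an explicit step limit in the invocation everyone uses — something like the line below, where the project file name and the step count are just illustrative assumptions:

```
gnatprove -P sparknacl.gpr -j0 --steps=20000
```

Because the step count is prover-internal rather than wall-clock based, the same verification conditions succeed or fail identically on a laptop, a CI runner, or a cloud machine.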
In terms of further work, then — what we might do next with it. Firstly, what if we ran on a 64-bit target? Would all this still apply? What sort of results would we get? The answer is: don't know, but we might try it. It would also be very interesting to do this all again, but instead of using GCC, try it with GNAT LLVM to see what the performance is like with an LLVM-based compiler.
That would be interesting. It would also be interesting to take some of those verification conditions — for example, the eight verification conditions that Z3 can't prove. We know those eight VCs are true, because CVC4 and Alt-Ergo can prove them.
We could take those eight that Z3 won't prove and submit them to the Z3 developers to say, hey, come on, why don't you prove these? And maybe we can get one of the provers up to a standard where it proves everything. That would be good. It would also be nice to try and prove that we only need to do two applications of that carry-and-reduce operation rather than three.
That proof has actually been done, by someone called Benoît Viguier over in Belgium. He did it in Coq, though, using a really high-powered theorem prover, working interactively in higher-order maths, and he's been able to prove that that's the case. So can we get that at the code level, with automation, as well? Because that would save us another ton of time.
And finally, there's formal verification of the constant-time property, which we don't have at the moment. I believe we can do that by using the information flow analysis engine: inside that analysis there's this thing called a program dependence graph, and I think we can use that PDG to prove that subprograms exhibit the constant-time property.
And that's something we'd like to look at in the future. So that pretty much gets us to the end — I hope I'm on time. In terms of resources, the code is on GitHub at that location. If you want to look at the code, be my guest, and please submit comments or issues if you have any. If you want to see how the optimizations were done:
each of the optimizations is tagged in the GitHub repo, so from one tag to the next you'll be able to see the changes I made to achieve a particular transformation or optimization — you can see exactly what happened. And there is a sequence of blog entries on the AdaCore blog which documents this performance analysis and optimization work in gruesome detail.
If you want to know exactly what I did and how it worked, then please go and check out those blog entries. Thank you all very much for listening. I'll leave it there at this point and hope we get some really interesting questions and comments from you all in just a moment. Thank you to the organizers of FOSDEM for inviting me, and I hope to speak to you soon.
Amazing presentation, especially for making this topic, which is quite complex, digestible, in my opinion.
So the first question I would like to ask you is: was SPARK worth it? Was it worth using? That's a good question. Well, I would like to think so, but obviously I'm biased.
It was an interesting exercise. It is interesting to see how far the proof tools have come. I think, you know, 10 years ago we wouldn't have dreamt of doing that depth of proof with the tools, especially with that level of automation. The tools just couldn't have managed it with the previous version of SPARK, before SPARK 2014.
Now, with SPARK 2014 and the really modern proof engines, it's amazing — the power of the proof tools and also the power of the language. As I said, things like the type system in SPARK are really powerful. The addition of the dynamic subtype predicate in Ada 2012, coming into SPARK 2014, is amazing in terms of what we can express now.
And the tools' ability to automate the process has come on in leaps and bounds from just a decade or so ago. So it was really worth it, and it's interesting — and, I suppose I would say, it was good fun to do. I did most of it during lockdown; I needed something to do as a side project.
So I enjoyed it and thought it was worth doing. And I hope the rest of you found it interesting to listen to. OK, so you mentioned that maybe you could find some bugs in the original implementation.
Interestingly, from the release of TweetNaCl that I started with, the answer is no: we didn't find any clear bugs in TweetNaCl, which is a pretty remarkable result, actually, for such a significant body of complicated C code.
There are a couple of things in TweetNaCl where I came to translate it to SPARK and thought, oh, that's terrible. Things like: TweetNaCl does shift operations on signed integers, which is a terrible thing, because in C that's implementation-defined in one direction and undefined in the other, and it's not a great thing to do. So there's a little bit of slightly dodgy practice in there.
But we didn't find any clear bugs. There is one interesting result related to that, though. If you look online, the first release of the TweetNaCl sources was made in December 2013, and then in April 2014 there was a second release saying, oh, we fixed a bug.
So there was a bug in the first release of TweetNaCl, which means that for about five months no one noticed there was actually a rather subtle, nasty bug in it. This was in the modular reduction algorithm. And what I did, actually, just for fun, was to take my copy of SPARKNaCl, re-inject that original bug, and put it back in.
It's literally a 14 where there should be a 15, or a 15 where there should be a 14 — it's normalizing the wrong element of an array of digits. I put that bug back into SPARKNaCl, ran the proof tools, and because of the predicated types, because of things like predicated subtypes, the proof tool spots the bug immediately.
It comes back and says: no, you've got one undischarged VC there — and spots it straight away. And that's a really cool result, because it means you couldn't have made that mistake. If the code had been written in SPARK in the first place, you never would have made that bug, because the proof tool spots it.
So it was really pleasing to see that SPARK would have got that one. Other than that, no, we didn't find any bugs in the original, which is good news, I guess — it speaks very highly of the ability of the original designers. Thank you. You mentioned using SPARK 2014.
SPARK has evolved since then, and there have been new additions to the language and new tools. Have you tried those new additions and improvements or not? So I think the one really recent feature of the language that I did exploit was this attribute — well, not an attribute.
There's an aspect in the language called Relaxed_Initialization, which I believe appeared in the 2020 or 2021 release of the SPARK Community edition — I'm sure Yannick will correct me. That allows us to prove initialization using the proof tools rather than by the more pessimistic flow analysis.
That's very important for performance, because what you'd normally do in SPARK, if you've got a composite object like an array, is initialize it all in one go using an aggregate, and then maybe you initialize it again, or initialize a few elements again, element by element. With the Relaxed_Initialization aspect and this 'Initialized attribute, you can use the proof system to prove initialization instead. What that means in terms of performance is that you end up initializing what you need to initialize exactly once, which is faster. Basically, you get faster code by being able to exploit this more expressive way of proving initialization of arrays.
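A minimal sketch of that pattern, with invented names rather than the real SPARKNaCl code:

```ada
--  Invented example: R is written exactly once, element by element,
--  and the proof tools (rather than the pessimistic flow analysis)
--  show that it is fully initialised on return.
with Interfaces; use Interfaces;

procedure Relaxed_Demo
  with SPARK_Mode => On
is
   type GF is array (0 .. 15) of Integer_64;

   procedure Init_Squares (R : out GF)
     with Relaxed_Initialization => R,
          Post => R'Initialized
   is
   begin
      for I in R'Range loop
         R (I) := Integer_64 (I) * Integer_64 (I);
         --  The prover tracks which slice has been initialised so far.
         pragma Loop_Invariant (R (R'First .. I)'Initialized);
      end loop;
   end Init_Squares;

   X : GF;
begin
   Init_Squares (X);
end Relaxed_Demo;
```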
So that was very, very useful. I guess the other big technological advance came, as I said, with the Community 2021 release of GNAT: Eric Botcazou had actually implemented some changes in the middle end of the compiler to turn on this thing called the return slot optimization for SPARK functions.
So if the SPARK function is marked pure, and there's no aliasing of the actual parameters with the returned object, then the compiler switches on this return slot optimization. And again, we got a noticeable performance improvement from the compiler being smart enough to exploit the properties of SPARK when generating code.
So that was very useful as well. And, as I said, the killer feature from Ada 2012 is the dynamic subtype predicate. I couldn't have done it without dynamic subtypes and the predicates on them. They're really powerful — definitely the thing I found most useful for doing this work.
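For anyone who hasn't met them, the idea looks roughly like this — invented names and a simplified range, not the actual SPARKNaCl declarations:

```ada
--  Illustrative only: a dynamic subtype predicate stating that every
--  limb of a "normalised" GF value fits in 16 bits. Any value carried
--  around with this subtype is known by the prover to satisfy the
--  bound, which is exactly the kind of property that catches a
--  wrong-index normalisation bug.
with Interfaces; use Interfaces;

package Predicated_Types
  with SPARK_Mode => On
is
   type GF is array (0 .. 15) of Integer_64;

   subtype Normal_GF is GF
     with Dynamic_Predicate =>
       (for all I in Normal_GF'Range => Normal_GF (I) in 0 .. 65535);

end Predicated_Types;
```

With a predicate like that in place, re-introducing the off-by-one normalization bug mentioned earlier would leave at least one verification condition undischarged, which is exactly the behaviour described above.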
OK, thank you. We have another question: did you have too many false alarms? Well, initially, when I did the first port — if you look at the GitHub repo and go right back to the beginning,
you'll find my initial port to SPARK, and I hadn't really developed the contracts very well at that point. If you run the tools on that version, yes, there were false alarms, because I hadn't written a sufficiently strong set of contracts — I hadn't written the preconditions and loop invariants yet. So initially there were quite a few false alarms; I can't remember how many.
But then, as my understanding of the code developed — how this multiplication and reduction work, and all these other algorithms — you begin to understand how they work and go, aha, I've understood something, OK, I'll put that in. I could encode my understanding of the algorithm as a precondition or as an assertion as I developed the contracts.
This is the sort of auto-active bit of it. You put contracts in, the false alarm rate comes down and comes down, and eventually we got there: we got to a point where there were zero false alarms, because all 900-and-something verification conditions are automatically discharged. So effectively we got to zero, which was the point — that's what I was trying to get to, to see if it was possible.
So we got there in the end, and I won't pretend it was easy. Three or four of those functions took a lot of thinking about to understand. It was really a matter of just understanding how they work in the first place. Once you understand how they work, you go, ah — you express your understanding as a contract and, bang, the proof tool goes, oh yeah, that one's easy.
And they managed to prove the code. So it's an interesting process, but yeah, it was worth doing, and we got to zero in the end, which was a good thing. So Frederick Braga asks: how much time did it take you to build this project, knowing that you are a long-time SPARK user?
That's a good question. I regret to say that I did not actually keep a log of the time I spent — I wish I had — so I don't have a record of how many hours this all took.
It's interesting that there's a big split. Some of the routines, like the SHA-2 / SHA-512 hash implementation, translated from C into SPARK very easily, and almost all of it proved at the first attempt — it was like a day or so, a few hours' work, easy peasy. At the other end of the spectrum is the mathematically complicated stuff in the elliptic curve algorithm.
Translating the code — you know, 10 lines of C translates into SPARK, not too bad, a few hours. But the proof of some of those hard algorithms wasn't sitting-at-the-keyboard typing work. It was thinking; it was just getting my head around how it worked.
That was measured in hours and hours and hours of just going: why can't I understand this algorithm? What's it doing? How does it work? There was a lot of sitting and thinking, and sometimes waking up in the middle of the night and going, aha, I've got it — quick, I know how it works — and having to get up in the middle of the night and write an assertion down before I forgot what I was thinking.
Some of those algorithms took quite a lot of effort. I mean, measured in many hours of thinking about them and then trying the proof. The proof tool said, oh, not quite. No, can't prove it. And then you go, ah, I understand something else. And then the proof tool would say yes. I mean, end to end, sort of working on this just part time as a side project.
The whole thing has been spread over more than a year or two of work. Again, if you look at the GitHub repo, you'll see the commits on it go back — it must be almost two years now. So it's quite a lot of effort, but very spread out, in spare time, over more than a year.
You'll see there's sort of groups of commits where there are lots of commits in a short space of time. And then nothing happens for months because I'm doing some real work and getting paid to do something elsewhere. And then there's obviously a period where I had time and I did a lot more work and you'll see the commits are grouped into little bursts.
So, yeah, I don't know the actual numbers for productivity and effort, but that's the way it goes. So we have a question from Jeffrey, who asks whether it is even possible to prove the correctness of an encryption algorithm.
Well, if you mean partial correctness — does the ciphertext you get back, are the bits actually correct with regard to the specification of the cryptographic algorithm — then yes, it is possible. We've not done that in SPARKNaCl, but it is possible, because lots of other groups have done it. Look at Fiat Cryptography.
Look at the work with EasyCrypt and Jasmin. Look at the work that Galois have done with their cryptographic specification language, from which they can refine code. Microsoft's work on Project Everest, Amazon's work — lots of other groups are looking at really serious correctness of cryptographic algorithms.
So, yeah, it's been done. It comes down to having a specification. Thankfully, most of these crypto algorithms do have very strong mathematical specifications you can work from, and that's great. We haven't done it in this case because I was going for automation, and type safety was my main goal, rather than full correctness.
I mean, for TweetNaCl there is a correctness proof of the scalar multiplication on the elliptic curve which, as I said, was done by a guy called Benoît Viguier in Belgium. But that was kind of a Ph.D.-sized project, and he had to use a full-blown interactive theorem prover to do it.
I believe he used Coq to do it, and it's a big, complicated bit of real, hardcore mathematics — he's actually proved that the code really does generate the right answer. So that's an entirely different scale of effort, if you like, and a different complexity from what we did here. But it has been done, and there are lots of projects out there doing really serious correctness proofs on cryptographic algorithms all the way down,
some even going down to the level of the machine code and proving correspondence with the binary. It really is very impressive what else has been going on out there in the crypto community. OK, so another question: how many contracts, assertions, et cetera, did you have to add
that should not have been necessary — meaning, I shouldn't have to guide the prover towards this goal? That's a good question, to which I have to say I don't know the answer. Adding these assertions is sometimes a bit difficult. You have to think: well, if I add this assertion, will it help? And you add an assertion, and sometimes it helps.
And the proof tool says, yeah, that works. And sometimes it doesn't change anything — oh, that didn't help. So how many shouldn't be needed? Well, if I had an infinitely powerful theorem prover, hopefully you wouldn't need that many. But we don't have infinitely powerful theorem provers. The theorem provers are as capable as they are today, and they'll get better as time goes on, no doubt.
It would be an interesting exercise. I suppose you could take the SPARKNaCl sources and the proof as it is, with zero false alarms and full automation, and try selectively removing some of the assertions to see if the proof still works. And if it does, remove another assertion and see if the proof still works, and see how many you can remove before you break the proof.
That would be an interesting exercise; I haven't had the time to do it, so I just don't know how many you could remove. What would be the minimum set you'd need to retain automation of the proof? Don't know. It would be a fun thing to do, but probably pretty time-consuming, because you'd spend a lot of time waiting for the proof tools to see if they still work.
OK, sounds like a bit of a Herculean effort, frankly. Yeah. So we've got 10 seconds left. So there are still some questions left unanswered. So please, viewers, join this room. I think Rod is going to stay here. Thank you.