
Good news, everybody!


Formal Metadata

Title: Good news, everybody!
Subtitle: Understanding Guile 2.2 Performance
Alternative Title: Good news everybody Guile 2 2 Performance notes
Part: 95
Number of Parts: 110
License: CC Attribution 2.0 Belgium: You may use and modify the work or content for any legal purpose, and reproduce, distribute, and make it publicly available in unchanged or modified form, provided you credit the author/rights holder in the manner specified by them.

Transcript: English (automatically generated)
Feel free to start, Andy. Oh, yeah? Because I'm going to use all the minutes, right? Exactly. OK. All right. Well, thanks for coming, folks. My name's Andy Wingo, and I co-maintain Guile,
along with Ludovic. And I've been doing most of the work on the upcoming Guile 2.2, which is a new compiler and a new virtual machine. And let's see, am I controlling it? So yeah, the good news is that Guile is faster across the board, and your programs are going to run faster. You don't even have to think about it, right? So that's the great news.
But the bad news is that if you are interested in getting the best performance you can possibly get, your mental model of how your code works, like what code it gets translated into and what the costs are in different parts of your program, well, that model is now out of date, right? Peter Norvig wrote a book a long time ago called
Paradigms of Artificial Intelligence Programming, which is not good as an artificial intelligence book, but it has really interesting Lisp optimization notes. And one of the points which has stuck with me is that the expert Lisp programmer eventually develops a good efficiency model, a model of what their code is doing.
And we need to update ours. So if we look back at 1.8, in the bad old days, the approximate efficiency model you would have for your program was: the cost of your program is O(n) in the number of reductions in your program. So you just had to basically count the number of open parentheses, more or less, to see the cost of your program,
plus the 2 and the 3; all of those have costs as well. So we can say that (+ 1 5), even though these are the same program, is cheaper than (+ 1 (+ 2 3)), because the one on the left has fewer reductions and the one on the right has more. Well, that was because Guile 1.8 had a pretty simple interpreter. And in the switch to Guile 2.0, we got a compiler,
which not only made syntax free, as it should be (the idea that macro expansion is a runtime cost was a pretty terrible thing about the way we wrote our programs in Guile 1.8, and that went away), but also gave us a bit more of an optimizer. And the biggest part of optimization in Guile 2.0
is partial evaluation, and I'll go into that in a little bit. We call it peval. It means that some reductions that can be done at compile time are done at compile time. So the costs of these two expressions are the same. And there are a couple of other changes in the runtime environment that materially affected the way you program. But in essence, your cost model in 2.0 is:
how expensive is my program? Well, it's: how many instructions does it take to run? Maybe that's something you don't have a very good idea about, and it's true; between your program and its instruction count, there's not such a direct correspondence. You can inspect the effect of partial evaluation
on your code. I don't know, who here has used the little ,optimize command at the REPL? If you use Scheme, and I know many of you are Scheme programmers who have used Guile before, definitely try it out on an expression to see what the partial evaluator will do with it. And the other thing you use to inspect
the result of your compilation, besides just timing your program, is the ,disassemble command, which shows you the bytecode instructions. Now, I'm going to be talking about the cost of programs, but peval, partial evaluation, really does a lot of stuff, right? If we take this simple loop that's adding numbers and counting down,
partial evaluation removes it entirely, right? And it does a lot of things. Function inlining is the biggest thing it does, but it'll effectively unroll loops too. Tail calls are loops, and so by inlining tail calls up to a certain size, it can completely fold both iterative and recursive loops.
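(The slide code isn't reproduced in the transcript; here is a minimal sketch of the kind of counting loop meant, which you can feed to ,optimize at the REPL to watch peval residualize it to the constant 55:)

    ;; a countdown loop that peval can fold away completely
    (let loop ((n 10) (sum 0))
      (if (zero? n)
          sum
          (loop (- n 1) (+ sum n))))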
It can do all sorts of things. But basically, it is able to do its job when it's able to see definitions and see uses. If it can see all the definitions and all the uses, it can make transformations on your code. So you definitely need to check out ,optimize; it's part of the basic mental model of the cost of Guile. And Guile 2.2
has a similar message in the end, in that the cost of your program is O(n) in instructions, but the translation from Scheme to instructions gets more complicated now. So you need to refresh your model in this regard. There are many improvements of degree, but the ones I'm going to focus on in this talk
are the improvements of kind, in the sense of improvements such that you might consider writing your program in a different way, or understanding your program in a different way, once you know these bits. So here they are, the ones I'm choosing to focus on. First of all, lambda is not always a closure. Names don't keep data alive.
We can do unlimited recursion. Guile does dramatically better loop compilation. It has a lower footprint, and it does unboxed arithmetic. And I'm going to go into all these points. First of all, lambda: you know, that's our tribe, right? That's the thing that some of us have tattooed on our arms, and, you know, how we identify ourselves.
But sometimes we think of lambda expressions, which define functions, if you've never worked with Scheme before; it's just a function expression, right? And that's what it does: it defines a function. But that function doesn't necessarily exist at runtime. We saw an example before, with that loop that peval completely took out. And so we need to take a look at:
how can a lambda be represented at runtime? Is it always a closure? Is it always a cost? Does it have a cost at all? And if it does, what kind of cost is that? So, the lambda can be completely gone, right? For example, if it's not used. It can be completely inlined. It can be contified, which I'll explain in a bit. It can be represented as a code pointer,
or it can be a closure. As you can see, a closure is just one of the five things that a lambda can be. Gone is a very common thing, actually. So in our first case, we bind our function to F in a let expression, and then the body of the let expression doesn't reference F at all. So partial evaluation reduces this directly to the body.
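(A sketch of the first case as described; the actual slide isn't in the transcript:)

    ;; f is bound but never referenced, so peval deletes it
    (let ((f (lambda (x) (* x x))))
      42)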
So that reduces to 42, and the lambda's gone, right? The lambda doesn't exist at runtime. In the second case, we have two variables. One controls whether or not to call the function: launch is this Boolean false value, and we bind the function to F. And then in the body of our let expression, if the value of launch is true, we call F,
and otherwise: just kidding. Well, peval also sees through that launch binding and folds that branch, and so it reduces the whole thing to just kidding.
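(A sketch of the second case, with launch-the-missiles! as a hypothetical stand-in for the real procedure:)

    (let ((launch #f)
          (f (lambda () (launch-the-missiles!))))
      (if launch (f) "just kidding"))
    ;; peval folds the branch: the whole expression reduces to "just kidding"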
So lambdas that are never used as values, or used in any way, will pretty much be removed from your program. The other way that peval can transform the lambdas in your program is by inlining them. The heuristics for whether this happens are a bit tricky, so if inlining is important to you and you know that you need something to be inlined, you need to be checking the results of ,optimize to see what the transformation is. So in this case, for example, it's the same as the previous one, except I've put #t, which is true in Scheme,
as the launch value, and so it folds to the value of applying this F procedure, but it inlines F, right? So after optimization, this code at runtime will still call launch-the-missiles, but F, the wrapping lambda, isn't there, right? It's been inlined.
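(The corresponding sketch, flipping the flag from the previous example:)

    (let ((launch #t)
          (f (lambda () (launch-the-missiles!))))
      (if launch (f) "just kidding"))
    ;; the call to launch-the-missiles! survives, but the wrapping
    ;; lambda is inlined away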
So this isn't really a peval talk, that's kind of another talk entirely, but that's the second way that Guile can treat your lambdas. The third way is a new way in Guile 2.2, and this is contification. At this point it gets more complicated, because we can't use the nice source-to-source
representation of the optimization that ,opt or ,optimize can show us. Contification is when you take a function whose callers are all known and which always returns to the same place, and you wire the call and the return directly into the calling function.
A little bit complicated, so I'm going to give a sort of extended example. Here's our countdown. Instead of writing it as a named let, I've expanded out the binding, so you can actually see the lambda. There's our lambda that is going to get contified in this example. And we see it has two callers. It has the caller at the end, in the body; that's a tail-recursive call.
And then it has the caller which starts off the loop iteration. So I do a disassembly of this. To show the effect of many of these optimizations I'm going to be showing some disassembly, and I don't feel like you have to understand it all; I'm just going to point out the relevant bits of it,
and this presentation will be online, so if this is something you're interested in, you can have a look at it later. The point being: this loop has been compiled down to jumps. Basically, br is branch, right? So I'm basically doing these four instructions in the countdown loop, and there's no lambda, right?
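(For reference, a reconstruction of the countdown as described, first as a named let and then with the binding expanded so the lambda and its two callers are visible; this is a sketch, not the slide itself:)

    (define (countdown n)
      (let loop ((n n) (sum 0))
        (if (zero? n) sum (loop (- n 1) (+ sum n)))))

    ;; which expands to roughly:
    (define (countdown n)
      (letrec ((loop (lambda (n sum)
                       (if (zero? n)
                           sum
                           (loop (- n 1) (+ sum n))))))  ; caller 1: the tail call
        (loop n 0)))                                     ; caller 2: entering the loop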
So what has happened here? The loop was contified into its caller. It had one caller from outside it. Let's see, I think I have a bit more information here: we have one caller from outside it, and the only call from inside it was a jump back to the beginning. So it's different from inlining.
Inlining copies the body of a function and then specializes it to the arguments at the site where it's being called, and inlining can happen at multiple call sites. Contification rewires the function entry and the function exit. Inlining is very tricky, right? I don't know if any of you have worked on inliners before, but they're like rabid animals that you're, you know,
barely restraining on a chain or something, because they want to cause code growth all over your compiler. Contification is not like that. It's always an optimization, and it never causes code growth, because instead of calling out to a function, you're incorporating its body into its caller.
And because you do that, you enable more optimizations, because you turn a higher-order flow graph into a first-order flow graph. It's kind of complicated, but just know that if you call a function and it always returns to the same place, always returns to the same continuation (which is why it's called contification), then it will be contified.
It will be wired directly into its caller. And it's a very reliable optimization: once you get a feel for when it's going to happen, you can expect it. It's going to happen. The next representation is as a code pointer. The thing is, I had to defeat peval in this case, and the issue is that if the function I'm working on is too small, it will be inlined
into each of its callers. So I had to make it bigger, by telling a story about my chickens and the names they could have. So I have this function, and it's going to be represented as a code pointer. And we can see it's called from three places outside its body. So it's not called with the same continuation: this one continues to here, and this one continues to here. So it's not a candidate for contification, right?
So what happens with this function? Well, I do a disassembly of it, and I see a disassembly printed for the outside function and a disassembly printed for the inside function. And that means there are two functions, right? It has not been contified; that's what you see from this, so I'm actually testing what I need to be testing.
And then if I look at the call, it has replaced the indirect call with this call-label instruction, which goes to an offset. And this is a faster call, because we know the procedure's offset. It can only happen where all of the callers of the function we're dealing with are known. But this isn't the real optimization.
With a code pointer, you never need the value of that log procedure. So I'll go back and show that again: it's never necessary to build a procedure object for log, right? And for that reason, I can call it by the code pointer, and then we can use a different representation of the procedure as a value.
Guile has a uniform calling convention where the thing being called, the callee, is passed as a zeroth argument, and then the positional arguments are passed on a stack, in order, like that. So the zeroth one is the callee. But we don't need to build a closure to pass as the callee. We can use a different representation
for the free variables of this function. So if it doesn't have any free variables, we don't need to pass anything in particular; we actually pass a false value, because it happens to be a cheap value to make. If it has one free variable, we don't need to allocate a closure. We can just pass that free variable as the closure, and then inside the log function, it will reference argument zero
to get the value of the free variable. It's a little bit complicated, but the essence is that this can be cheaper than a closure, which is a big object with a code pointer and all of the free variables. With two free variables, we use a pair, because it happens to be smaller. With three or more, we stuff the free variables in a vector. Basically, we are representing the free variables of the function as an object which is not a closure,
and passing it in as a specialized argument zero. And this can happen for a mutually recursive set of procedures as well: you take the union of all of the free variables of all of the functions, and you represent them in the same specific way, depending on whether there are zero, one, two, or three or more. And that way, things improve a bit.
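(To make the free-variable story concrete, a hedged sketch with hypothetical names, not code from the talk:)

    (define (make-adder delta)
      ;; delta is the single free variable of this lambda. When all of the
      ;; lambda's callers are known, Guile 2.2 can pass delta itself as
      ;; argument zero rather than allocating a closure; with two free
      ;; variables it would pass a pair, with three or more a vector, and
      ;; with none, just a cheap #f.
      (lambda (x) (+ x delta)))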
If not all of the callers are lexically visible, then we do create a closure. A closure is an object on the heap. It has a tag indicating it's a closure, it has a code pointer, and it has the free variables. So you do have to create this object. Although, if there are zero free variables, you don't need to create it every time; you can just statically allocate it in the binary.
And if you have one entry point to a set of mutually recursive functions, you can use just the one closure for all of them, if the other ones' callers are all known. It's complicated, and if you're interested, I'll point you to a paper that describes all this. OK, yeah: lambda. It's complicated. I have very little time, so I'll just point out,
let's see. Oh my goodness. No, papers. All right, I'll point out one thing that you will notice as a difference with 2.2.
In 2.0, if a value was ever allocated to a local variable slot, it would be kept alive within its scope. And in particular, this was the case for procedure arguments. So you would call a function, and the arguments to that procedure would always be kept alive
in the function. That's no longer the case, and this is to preserve the property called being safe for space, meaning Guile will no longer retain memory that is not necessary for the rest of your program. It's free to reuse that slot, and it's free to collect its value if it hasn't reused it, though garbage collection happens at some point.
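(A hedged illustration; summarize and crunch are hypothetical procedures:)

    (define (process big-data)
      (let ((summary (summarize big-data)))  ; last use of big-data
        ;; In Guile 2.0, big-data stayed reachable for this whole call just
        ;; because it was a named argument; in 2.2 its slot can be reused,
        ;; and its value collected, as soon as it is no longer needed.
        (crunch summary)))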
So you should never feel like you need to null out any value or something just to release its memory anymore, because Guile no longer holds on to values just because they have names. This is a property of the new compiler construction. Really? Five minutes? No. It's not 54? I have six minutes, don't I?
Oh, come on. I got lots of time. OK. All right. Yeah, but the upshot is that when you do a backtrace in Guile 2.2, you won't see as many arguments. And this is a kind of negative thing, and I don't know whether to introduce a new compile flag or something to make sure it preserves the values of arguments so that you get a nicer backtrace.
It will replace them with an underscore. That's basically the deal, because it could be that garbage collection happened and the argument's gone, or the slot was reused for something else and the argument's gone. Oh, and this one's awesome. In Guile 2.0, your stack had a limited size. It was 64,000 elements.
You could set an environment variable if you wanted it to be bigger. But once you reached the end, you were done. Well, in Guile 2.2, the stack has unlimited size. It's super rad. Whenever it needs to grow the stack, it will allocate a new stack area and copy; and whenever the stack shrinks past some point, it will return those pages to the operating system. And this, combined with the closure optimizations,
made it free to implement map as a nice recursive function. And that actually fixed some other problems, relating to what happens if you capture the continuation in the map procedure. That's another mess, which I'm happy to discuss later.
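(With an unlimited stack, map can be the natural recursive definition; a sketch only, since Guile's real map also handles multiple lists and error cases:)

    (define (my-map f lst)
      (if (null? lst)
          '()
          (cons (f (car lst))
                (my-map f (cdr lst)))))  ; deep recursion is fine in 2.2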
So yeah, this is described in the manual: we do great loop compilation now. It's pretty good. It could get better, but it's a startling change compared to Guile 2.0, and it includes hoisting and different sorts of loop transformations. The footprint is smaller, too. On this machine, Guile 2.0 starts up in 13.5 milliseconds, and now it's 7.5.
There was 3.4 megabytes of overhead in Guile 2.0, and there's only 2 in Guile 2.2. And this is per-process overhead, if you have multiple processes out there. So this can make it much more plausible for you to say, oh yeah, we're deploying Guile in my organization, because it's fast, starts up quickly,
and has low overhead and everything. And plus, we just landed some unboxed arithmetic, whereby we can work on floating-point values without having to allocate them on the heap. And it's kind of shameful that we were ever doing this in Guile, but understandable as well. Whenever you worked with floating-point numbers, you would get these bolded things,
where I'm pulling a 32-bit float out of a bytevector and multiplying it by 2. These bolded operations are allocating new objects. So you're going through this loop, and in every iteration you're allocating two objects, and you're doing indirect type dispatch on all of the multiplications and the references and all the things.
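(A sketch of the kind of loop meant, using the standard R6RS bytevector procedures; in 2.0 each float reference and multiplication boxed a fresh heap object, while 2.2 can keep the intermediate values unboxed:)

    (use-modules (rnrs bytevectors))

    ;; multiply each 32-bit float in bv by 2, in place
    (define (scale-floats! bv)
      (let lp ((i 0))
        (when (< i (bytevector-length bv))
          (bytevector-ieee-single-native-set!
           bv i (* 2.0 (bytevector-ieee-single-native-ref bv i)))
          (lp (+ i 4)))))  ; step by 4 bytes per 32-bit float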
Now that's no longer the case. If I take a look at the difference in times, it's 10x. 10x, right? So it can be a big deal. And this is just a little loop we happened to have. Lots of nice things are coming out of this, and the loop index also got unboxed, as a 64-bit integer. So that's all the things. The details are gnarly; I wrote a blog post about it recently.
But yeah, it's the same model, O(n) in instructions. The instructions themselves are cheaper, and the mapping of Scheme to the instructions is a bit more complex. If you need peak performance, you've got to make your peace with ,disassemble at some point. But if you don't, just enjoy the general improvements.
So I think I'm done now. Thank you for dealing with the whirlwind. Any questions? Yes, sir. So, are there other kinds of dramatic changes to debugging? Like, if you end up getting an exception or something like that, do you lose a lot more information
about the variables that are available? You still have the maps of the variables that are available, the variables which are live, but fewer variables might be live. And you might have a different set of variables and a different set of temporaries as well. So it could feel different, but I don't know;
I feel it's OK, but you'll have to say. OK, thank you very much.