Zero-Overhead Metaprogramming
Formal Metadata
Title: Zero-Overhead Metaprogramming
Number of parts: 150
License: CC Attribution 2.0 Belgium: You may use, modify, and copy, distribute, and make publicly accessible the work or its content, in unmodified or modified form, for any legal purpose, provided you credit the author/rights holder in the manner they specify.
Identifiers: 10.5446/34477 (DOI)
Production year: 2015
FOSDEM 2015 — talk 13 of 150
Transcript: English (automatically generated)
00:05
That's what I want to achieve, and it's a little research project we did together with Chris Seaton from Oracle Labs, who contributed the JRuby experiments, and with Stéphane Ducasse in Lille.
00:24
So what do I mean by metaprogramming? As a Smalltalk group you probably know this kind of stuff. I'm sorry, I still used my old slides, which are adapted for a more general audience, so everything you see here you can imagine in Smalltalk syntax — it's just a little Smalltalk hiding between curly brackets. You have your perform: with a symbol, you have your instance-field reading and writing methods, so that kind of metaprogramming. For people coming from the Ruby world there is the same thing with doesNotUnderstand: (method_missing, for the Ruby folks): you implement a little proxy, catch doesNotUnderstand:, and perhaps forward the call using perform:. And then there is something not so common in Smalltalk, a proper metaobject protocol — more the CLOS, Lisp style of things — and that's what I'm actually really interested in.
01:21
It's really cool to make that stuff fast, but here is where I actually want to get. This is research I did during my studies here in Brussels, and the goal is to have something where I can modify the whole behavior of the language just by defining something like a metaclass. In my case I was focusing on concurrency: I wanted a metaobject, a domain, that I can redefine so that I can, for instance, provide the semantics of an actor model within my language. I don't have to adapt the virtual machine to provide the guarantees; I can really just define my concurrency model on top — actors, STM, whatever you can imagine.
02:03
What I wanted to be able to do is redefine the semantics of writing to fields just by providing a little snippet like this, and what you see here is actually almost sufficient to implement a little actor system. The most important thing for actors is proper isolation between them, so that one actor cannot see what another actor is doing and cannot access an object graph inside another actor. In the naive version you essentially just test whether you are in the current actor or in another one: if you are in the current one you can just access the object's fields, otherwise you throw an error. Now, you can imagine that this is executed for every single field write in the system, so it's really terribly slow if you have no way of optimizing it. That's a problem.
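To make the shape of such a snippet concrete, here is a rough Python sketch of a domain metaobject that redefines field writes with that naive isolation check. All names (Domain, ActorDomain, Obj, current_actor, IsolationError) are illustrative, not the actual OMOP API:

    # Hypothetical sketch of a metaobject (domain) that redefines field-write
    # semantics with a naive actor-isolation check. Names are illustrative.

    class IsolationError(Exception):
        pass

    def current_actor():
        # Placeholder: would return the actor executing the current message.
        return None

    class Obj:
        def __init__(self, num_fields, domain):
            self.fields = [None] * num_fields
            self.domain = domain        # the metaobject governing this object

    class Domain:
        # Standard semantics: a field write is just a plain store.
        def write_field(self, obj, index, value):
            obj.fields[index] = value

    class ActorDomain(Domain):
        # Only the owning actor may write to the fields of its objects.
        def __init__(self, owner):
            self.owner = owner

        def write_field(self, obj, index, value):
            if current_actor() is not self.owner:
                raise IsolationError("object belongs to another actor")
            obj.fields[index] = value

Every field write in the program is routed through write_field, which is exactly why a naive implementation pays this check on every single write.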
03:03
And in general, of course, everybody knows that metaprogramming is slow. Why do people know that? Because of Java reflection, for instance. But it's not all that much better if you're working with Cog, because Cog at the moment also doesn't optimize it. Take a reflective method invocation: when we measure it, we see roughly 7x overhead, and it's about the same order of magnitude in Pharo or Squeak. Same thing with dynamic proxies — at least on the JVM you also see about 7x overhead. So it's much slower than it should be. The assumption is: if you want to encourage people to use metaprogramming, and not to avoid it in performance-critical code, then you want zero overhead. Otherwise you get the situation we have now, where people find workarounds that can make the code less readable. In Smalltalk the standard trick is to use doesNotUnderstand:, and the first time you hit it you generate a method, for instance to delegate to another object. That way you avoid the cost of the doesNotUnderstand: and the reflective invocation, but then you have that strange doesNotUnderstand: handler and all those generated methods popping up, and whoever has to maintain the system doesn't really know what's going on.
04:22
So: zero overhead. The idea is to make the VM smarter, so that all the metaobject and metaprogramming facilities are fast — that is, to work at the VM level. Generic metaprogramming is pretty slow on Java. Compared to that, PyPy for instance has a meta-tracing just-in-time compiler that is actually pretty fast, but if you start to implement a simple metaobject protocol like the one I showed, you still get around 50% overhead, because even though the compiler can optimize all those little things, it still doesn't have enough information to optimize the example I showed you with the actor metaobject protocol. So we need something better, and what we found is a nice way of implementing interpreters, where we also experimented with those metaobject protocols. I will briefly show you how we implemented the interpreters, because I think it's a very nice idea — it's not actually mine; I forgot to put the reference here, but I will put it in the slides I put online.
05:42
Before I start explaining: if you see Java-like code, that's application-level code; if you see Pythonic code, that's the interpreter-level implementation — just to make things clear up front. If you want to implement a little interpreter, you will probably start with a parser. For a code snippet like this you have a conditional statement, with a condition expression and the two branches, then and else, and that gets parsed into a tree like this: the conditional statement and its two sub-expressions. Our interpreter in essence looks like this: we parse a file, we get back a root node, and then we call an execute method on it and pass it a frame, which captures all the local variables of the method activation.
06:40
The question then is how to implement those execute methods for each of those nodes, and that's pretty straightforward: the simplest possible way is to directly implement the semantics of the language you want. For instance, for literals — the zero and the one here — the parser will, while parsing, create such an object, put the constant value in it, and during execution you just return that. Reading variables is very similar: there we use the frame object I mentioned before, and we can assume that languages like Smalltalk can be compiled so that you can enumerate the local variables, map each one to an index, and access that specific index in the frame, which has an array to store the local variables. Writing a variable is again very similar, but the difference is that you first have a sub-expression to evaluate: for an assignment statement you first want to get the value, so you go into the child node, execute it, get back a value, and store it into the frame object at the index of that variable. That's pretty much the generic way of implementing an AST — abstract-syntax-tree — based interpreter. Not very fancy.
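A minimal sketch of that kind of interpreter, in the same Pythonic interpreter-level style (the class names are mine, not the actual SOM code):

    # Minimal AST-interpreter sketch: each node evaluates itself against a
    # frame that holds the local variables of the current method activation.

    class Frame:
        def __init__(self, num_locals):
            self.locals = [None] * num_locals

    class Literal:
        def __init__(self, value):
            self.value = value          # constant stored by the parser
        def execute(self, frame):
            return self.value

    class ReadLocal:
        def __init__(self, index):
            self.index = index          # variable mapped to a frame index
        def execute(self, frame):
            return frame.locals[self.index]

    class WriteLocal:
        def __init__(self, index, expr):
            self.index = index
            self.expr = expr            # child node producing the value
        def execute(self, frame):
            value = self.expr.execute(frame)
            frame.locals[self.index] = value
            return value

    # Usage: the parser would normally build the tree; here we do it by hand.
    frame = Frame(1)
    WriteLocal(0, Literal(42)).execute(frame)
    assert frame.locals[0] == 42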
08:22
The main idea, though, is to start optimizing that AST interpreter during execution. One thing that's important for performance is to realize that in Smalltalk we don't know exactly at compilation time what a value will be. Here we have a statement that increments a count; just looking at the code we intuitively know it will only ever be integers, but in Smalltalk even integers are objects. One trick virtual machines use is small integers, where you essentially steal a bit and represent the integer value directly, so you don't have to allocate objects — but depending on the kind of integer you may still need to allocate an object, and we want to avoid that. So with self-modifying, self-optimizing interpreters we try to optimize each of those nodes, for instance so that they can handle integers directly. Concretely, we compile that little code snippet to an uninitialized version of the node. When that node executes, it still executes its sub-expression first, gets a value, and then calls a specialize method with that value. In that specialize method we, as language implementers, can decide on the heuristics we want to use to optimize our little interpreter; in this case I decided to use the type, and to provide a node specific to instance-variable writes of type integer. At the moment you see it in red here; when we execute it, we exchange that node in the tree, and the next time we execute it we actually use the node optimized for integers.
10:23
The implementation looks like this: instead of just calling a generic execute method, we ask the whole subtree to provide us with an integer, and we get that returned. Sometimes, on some code paths, the value might not actually be an integer; then we need to handle that exceptional case and change the tree again. But for surprisingly many code snippets it works directly, and we can just optimize for integers. If you implement your interpreter in a Java-like language you can properly type all of this to use plain integers directly and completely optimize that AST to really work on integers without having to wrap or tag them, and the plus can use the processor's plus instruction directly instead of being surrounded by all kinds of management instructions. For our write node specifically, that means we can also store the plain integer value directly in the frame: we have a second, dedicated array next to the object array, where we can store integers directly to avoid boxing, tagging, and the other tricks virtual machines usually need to apply.
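A rough sketch of such a self-specializing write node, with the node replacement collapsed into a state field for brevity (a real Truffle- or RPython-style implementation would replace the node object in the tree instead; names are illustrative):

    # Sketch of a self-optimizing write node: on first execution it specializes
    # on the observed value type; if that speculation later fails, it falls
    # back to the generic version.

    class Frame:
        def __init__(self, num_locals):
            self.locals = [None] * num_locals     # boxed object storage
            self.int_locals = [0] * num_locals    # unboxed storage for ints

    class WriteLocal:
        UNINITIALIZED, INT, GENERIC = range(3)

        def __init__(self, index, expr):
            self.index = index
            self.expr = expr
            self.state = self.UNINITIALIZED       # stands in for node rewriting

        def execute(self, frame):
            value = self.expr.execute(frame)
            if self.state == self.UNINITIALIZED:
                # Specialize on the first observed type.
                self.state = self.INT if isinstance(value, int) else self.GENERIC
            if self.state == self.INT:
                if isinstance(value, int):
                    frame.int_locals[self.index] = value   # no boxing, no tagging
                    return value
                self.state = self.GENERIC                  # speculation failed
            frame.locals[self.index] = value
            return value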
11:41
That's the basic idea of building self-modifying, self-optimizing interpreters. Now I want to show you how we use that to also optimize reflection. Just to give you a little overview of what the language looks like: it's still a Smalltalk, just with brace notation so as not to confuse the non-Smalltalkers. As you can see, there's our perform:, our instVarAt: and instVarAt:put:, doesNotUnderstand:, and here global reading and writing, and so on. Let's focus just on perform:, the standard reflective method invocation: you give it a symbol, and you have an array with all the arguments. The question is how we can possibly optimize that.
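As a baseline, a naive, uncached implementation of such a reflective send redoes the full method lookup on every call, which is roughly where the usual reflection overhead comes from. A hypothetical sketch, where lookup stands in for the language's method lookup:

    # Naive, uncached sketch of a reflective send such as
    # `receiver perform: selector withArguments: args`:
    # the full method lookup is repeated on every single call.

    def lookup(cls, selector):
        # Placeholder for the language's method lookup.
        return getattr(cls, selector)

    def perform(receiver, selector, args):
        method = lookup(type(receiver), selector)   # slow path, every time
        return method(receiver, *args)

    assert perform(1, '__add__', (2,)) == 3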
12:45
Let's first look at what people came up with for standard message sends in Smalltalk — the classic polymorphic inline cache paper. The idea is that you have some variable, with whatever kind of value you can imagine in it, and then a plus message send with one argument; in the more Java-like notation you see the message-send operator directly. The problem is that this send is potentially highly polymorphic, so a static compiler would have to guess what is going to happen there, because usually we don't know. The idea behind polymorphic inline caches is to observe at runtime what actually happens and then cache the lookup for that operator — for the plus message send, for instance. In these little interpreters this takes the form of a dispatch chain, which is essentially part of the plus invocation node and caches the lookup result.
13:44
Let's walk through how that works. We have our little plus, and our count, which holds an integer object — the zero — and the dispatch chain starts out in an uninitialized state. We execute that first message send and note: we actually got an integer as the receiver. We do the lookup, we get the integer plus operation, we cache it, and we also add a check that the receiver is still an integer — because maybe at some point the count overflows into a large integer or something, and the receiver changes, so we need to handle that case too. That's the general idea. When you now apply a compiler to it, the compiler can speculate: it sees what we call a monomorphic send — only one kind of receiver has ever been observed — and it can completely inline that plus operator and do the plus operation directly in your method. The benefit is that even the interpreter avoids the extra lookup, and it enables inlining and further optimizations by just-in-time compilers. So that's the basic idea.
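A dispatch chain is essentially a linked list of cache entries hanging off one send site. A rough sketch with illustrative names (in a real implementation the uninitialized node would rewrite itself in the AST rather than just building a new entry on the fly):

    # Sketch of a dispatch chain (polymorphic inline cache) for one send site.
    # Cached entries guard on the receiver's class; the uninitialized tail
    # performs the actual lookup and extends the chain.

    def lookup(cls, selector):
        # Placeholder for the language's method lookup.
        return getattr(cls, selector)

    class CachedDispatch:
        def __init__(self, expected_class, method, next_node):
            self.expected_class = expected_class
            self.method = method
            self.next = next_node

        def dispatch(self, receiver, args):
            if type(receiver) is self.expected_class:    # guard
                return self.method(receiver, *args)      # cached, inlinable
            return self.next.dispatch(receiver, args)    # fall through the chain

    class UninitializedDispatch:
        def __init__(self, selector):
            self.selector = selector

        def dispatch(self, receiver, args):
            method = lookup(type(receiver), self.selector)
            cached = CachedDispatch(type(receiver), method, self)
            # A real implementation would splice `cached` into the chain here.
            return cached.dispatch(receiver, args)

    # A monomorphic site: after the first send, the lookup result is cached.
    site = UninitializedDispatch('__add__')
    assert site.dispatch(1, (2,)) == 3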
15:04
Now back to our little reflective example — remember, perform: with a symbol. Here we just generalize that idea: we have our little invoke node, also with a dispatch chain, and we first resolve that plus. We note that we got a plus symbol as the parameter, we cache that, and again we have a check in case some other symbol comes along dynamically at some point. Then we simply nest another dispatch chain, doing the thing I just showed on the previous slide: there we will again have an integer node, and we can cache and inline the plus operator. That provides enough information to the just-in-time compilers and optimizers and allows us to completely remove the overhead of reflective method invocation and reflective field access — all the 7x is gone if the just-in-time compiler sees that the site is monomorphic all the way through.
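In other words, the same trick is applied twice: an outer chain that caches on the symbol passed to perform:, and, per cached symbol, an inner chain that is just an ordinary send-site dispatch chain. A hypothetical sketch, reusing the UninitializedDispatch from the previous sketch:

    # Sketch of the nested dispatch chains behind a perform: site.
    # The outer chain guards on the selector symbol; once the symbol is fixed,
    # the inner chain is a normal dispatch chain guarding on the receiver.

    class UninitializedPerform:
        def perform(self, receiver, selector, args):
            cached = CachedPerform(selector, self)
            # A real implementation would splice `cached` into the chain here.
            return cached.perform(receiver, selector, args)

    class CachedPerform:
        def __init__(self, expected_selector, next_node):
            self.expected_selector = expected_selector
            self.send = UninitializedDispatch(expected_selector)  # inner chain
            self.next = next_node

        def perform(self, receiver, selector, args):
            if selector == self.expected_selector:        # guard on the symbol
                return self.send.dispatch(receiver, args) # monomorphic again
            return self.next.perform(receiver, selector, args)

    site = UninitializedPerform()
    assert site.perform(1, '__add__', (2,)) == 3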
16:25
Okay, so that solves those two categories: the simple reflective accesses and, in combination, the doesNotUnderstand:/message-missing case, where there is a perform: as well. Now the question is what we do about the thing I was actually interested in, the fancy metaobject protocols. Remember, we have a metaobject protocol that really cross-cuts the whole application: we want to change every single field write. Take a little example that uses that actor metaobject protocol and look at the field-write node — the protocol will also apply to method invocations, a part I leave out here, but certainly to all the field-write accesses.
17:22
For those it is not statically known what the semantics is, so we cannot just emit the bytecode for writing the field, or the corresponding native code; we first have to resolve what happens at the metaobject level. The solution is to apply the same caching and dispatch-chain tricks again: first we observe what kind of metaobject actually comes along — in our case we see it is always an actor — and then we can inline that write-to-field handler, and by that we give the compilers enough knowledge to optimize it. Interestingly, it is a very generic abstraction: if we only have the standard semantics of the language and don't actually use all that fancy stuff, we can really just do the write directly. So in the end there is supposed to be no overhead in the system if you don't use the fancy metaobject-protocol machinery.
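A hedged sketch of what such a field-write node could look like when it caches on the metaobject it observes (illustrative names; with the standard Domain from the earlier sketch the cached handler is just the plain store, with an ActorDomain it is the isolation check):

    # Sketch of a field-write node that dispatches through the metaobject
    # protocol and caches on the identity of the observed metaobject (domain).

    class UninitializedFieldWrite:
        def __init__(self, field_index, value_expr):
            self.field_index = field_index
            self.value_expr = value_expr

        def execute(self, frame, obj):
            return self.write(frame, obj, self.value_expr.execute(frame))

        def write(self, frame, obj, value):
            cached = CachedFieldWrite(obj.domain, self.field_index,
                                      self.value_expr, self)
            # A real implementation would replace this node in the AST.
            return cached.write(frame, obj, value)

    class CachedFieldWrite:
        def __init__(self, expected_domain, field_index, value_expr, next_node):
            self.expected_domain = expected_domain
            self.field_index = field_index
            self.value_expr = value_expr
            self.next = next_node

        def execute(self, frame, obj):
            return self.write(frame, obj, self.value_expr.execute(frame))

        def write(self, frame, obj, value):
            if obj.domain is self.expected_domain:     # guard on the metaobject
                # Inlined handler: a plain store for the standard Domain,
                # the isolation check for an ActorDomain.
                self.expected_domain.write_field(obj, self.field_index, value)
                return value
            return self.next.write(frame, obj, value)  # polymorphic: fall through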
18:21
Okay, so now, briefly, to check whether it is actually fast. What I used for that is not the Smalltalk you probably know — not Squeak, not Pharo — but something much simpler, because that makes it much easier for me to experiment: it's called SOM, the Simple Object Machine. It's a little Smalltalk, not exactly in the Smalltalk-80 tradition, much simpler. It has been used at a couple of universities in the past; it's designed for teaching, so the concepts are expressed more clearly, and performance was usually not a goal until I picked it up and got it to almost Java-level speed. There are a couple of implementations: C, C++, Java, JavaScript, RPython, and of course also a Smalltalk one. For the setup here I use two implementations. One is based on RPython — RPython is something you might know from PyPy; PyPy is built on it and is a very fast Python implementation, roughly seven times faster than CPython, I think — and the compilation there uses meta-tracing, which I'm not going to detail. The interesting part is that the second one uses a completely different technique and is based on Java. I just wanted to show that the technique is general enough to be applicable to all kinds of VMs and compilation technologies. If you're interested, you can find those two little Smalltalks on GitHub.
19:42
Now, to show you the numbers: these are a couple of benchmarks to show that we can remove all the reflective overhead while having that kind of fancy metaobject protocol. Imagine that all field writes, all field reads, and all message sends have an extra interception level where you can redefine their semantics. The question was: if I put that into the system, do I actually see overhead in the end, and if so, how much? I tried it on both little Smalltalks, and in the end we see an overhead of about four or nine percent, depending on the compilation technology. The most important insight is that we do not see the 7x overhead. There is still a little bit of overhead because of the extra semantics: at any single point in the program we could change again what a field write means by associating another metaobject with a base-level object, so there is some semantics we still need to support. Other than that there is essentially no overhead to speak of — at least not the 7x we saw before. Now you might wonder why those metaobject protocols are important. Maybe they're not important for you; maybe it's really just an academic thing I'm interested in. So let's look at something more practice-oriented.
21:26
That's where Chris Seaton came in. This is about the JRuby implementation: he applied the same techniques in that context, somewhat independently, and he checked what kind of speedup he actually gets on real applications. For real applications he looked into a Ruby image-processing library, which is a good example where the Ruby people also think very much in the dynamic Smalltalk way: let's use perform, let's use doesNotUnderstand — send and method_missing in Ruby terms — to implement our little Photoshop-style processing library. It turns out this is used really everywhere, including for every single pixel operation when you, say, merge the layers of a multi-layer document. We have all those benchmarks, and we see that if you apply this kind of technique you get a 10 to 20x speedup on processing images in JRuby. That's just to show that this is really useful. I think Eliot Miranda was also thinking about introducing something like that into Cog, so I hope at some point we have it in Pharo and Squeak as well.
22:45
Okay, let's skip the demo; it's not really important. Just to sum up: the idea I wanted to show here is that metaprogramming doesn't have to be slow. There is a very simple technique — at least from my perspective — with those dispatch chains, which are essentially just a more generalized form of things virtual machines already do. We can apply it to reflective method invocation, reflective field access, and so on, and we can also apply it to fancy metaobject programming, and perhaps enable new kinds of applications with that. That's, I think, the most important bit for me. If you have any questions, or if you would like to read up on more of the details, I have a paper draft for this — it's not yet accepted, but shoot me an email and you can get it. Okay, thank you very much.