Unfolding the paper windmills
Formal Metadata

Title of Series: EuroPython 2022
Number of Parts: 112
License: CC Attribution - NonCommercial - ShareAlike 4.0 International: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose, as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared, also in adapted form, only under the conditions of this license.
Identifiers: 10.5446/60862 (DOI)
Transcript: English (auto-generated)
00:06
So I've been working in academia for a very long time, and this talk is inspired by someone I was mentoring this year. It's basically like science is built on the shoulders of giants, right? And those giants look scary; you look up at them and think, oh,
00:24
how are we going to do this? But sometimes giants are just windmills. Granted, windmills that speak weird English, that publish way too many papers, and it's basically impossible to keep up. So I hope this story sounds familiar, because what I'm going to do for the next half an hour, even less,
00:45
a little bit less, is be your squire and try to give you the tools to read papers properly. There are two parts to this talk. On one hand, I will give you tools for reading papers, and every time
01:00
I talk about this, a co-worker says, "but I know how to read papers", and I'm like, let me show you how. The second part is the more computer science part, where I will give you some tools to implement a paper. In this case, we're going to look at the very, very famous Attention Is All
01:21
You Need. And as a coworker told me, we all use transformers, so yeah, let's do it. So, how to read academic papers. There are some tools that I think everyone should know about for reading academic papers.
01:41
And as I said, it's not like, yeah, I read papers: I sit down, I print them and I read them. No, don't do that. First of all, and I think the most important tool, are repositories, because we have all been there: hundreds of tabs open with different papers, blog posts,
02:02
YouTube videos, podcasts. These days science is distributed across multiple mediums, and we have everything open and never have time to read it. So we need something to properly collect and categorize everything. The main thing these tools need is to be distributed,
02:24
multi-platform. I need to be able to see a paper on my phone, save it and maybe label it, so it needs to be distributed. There are the old-school tools like Mendeley, but I really, really like Paperpile. I don't know, it feels like a pile of papers
02:44
I'm never going to read, but I'm trying. Then, note taking. Again, note taking could be pen and paper, but if you like digital tools, and being able to make nice summaries and share them across the internet,
03:00
I find that GoodNotes and Notability are very good tools. And finally, organizing. Sometimes you read a paper because you think it's interesting, but it might not be relevant at that moment, and you need to be able to come back and remember that paper. Again, Notion, GoodNotes, Paperpile,
03:21
and Obsidian are very good tools for doing that. And that's basically it. Those are the tools you need for reading papers. Almost. I have a couple of bonus tracks. The first bonus track is tools for discovering new papers. Granted, Twitter is my main way to discover papers.
03:44
Academic Twitter is right there, and you get new papers every day. But that might skew you towards big labs, big corporations. So Research Rabbit and Litmaps will find papers related to yours and show how they link to each other.
04:01
And then for my neurodivergent family, there's Bionic Reading. That's something very, very cool and it will help us read better. OK, so we have the tools, but how do we need to read? Hopefully the repository has helped you massively: you no longer have all those tabs open, so you should be able to read.
04:21
Cool. So now what? Now we sit at the desk, have like 200 cups of coffee, and read through them cover to cover? Well, no, because that's infeasible. Please be kind to yourself: no knight is able to read everything.
04:40
So I do this thing, the three-pass approach. The first pass is me trying to figure out: is this paper relevant? I try to be brutal; I'm not going to spend more than 50 minutes doing this. I read the title, I read the abstract, skim a little bit through the introduction,
05:01
maybe read the discussion, and that's it. Nothing else. Then comes the moment where I maybe know the paper is interesting. I might start brewing a cup of coffee, because I'm going to need it to read this. Again, not cover to cover, just the introduction,
05:20
the contributions and the limitations. My favorite authors always have this last paragraph where they itemize their contributions and the limitations. That is fantastic. Please, authors in the room, do that; I would be very grateful. And then I read the figures and the results section.
05:42
Depending on how expert you are on the topic, this might be more or less useful. And yeah, skim through the rest of the paper, grab more or less the idea of what the paper is about, and write a summary. Granted, it's not going to be the best summary; it's just, well, these topics are discussed in this paper.
06:02
Cool, we're good. And then the next pass is when we properly need multiple cups of coffee and sit and read it cover to cover. There's no shortcut here; we need to read it properly. But know that you don't need to read it alone: find help, find colleagues.
06:23
Asking for help is a sign of strength, not weakness. Something I also do when I'm reading the paper cover to cover is add new papers to the repository, so I know where to follow the lead. And extend the summary: at this point, you have a much better idea of what you're talking about.
06:45
A brief note on how to highlight papers. Since I do these three passes in stages, which I think is pretty common, maybe not, I do this thing, because sometimes I read a paper but I'm not going to implement it, or maybe it will take almost a year before I read it again.
07:03
So I have this traffic-light scheme where I highlight in red the problem that the authors are trying to solve. In yellow, the hypothesis or the methodology that the authors are proposing. And finally, in green, I highlight the evidence that backs up the hypothesis.
07:24
And how does this look in practice? So this is my first pass: very few things. Basically, the only things highlighted are the things I've read, and it really does take 50 minutes. This is the second pass, where I highlight more things and go through the figures.
07:44
Here I'm trying to pay more attention; here I have an idea of what the Transformer looks like. And this is the third pass. No, this is not Attention Is All You Need, but I wanted to show you a paper that was very recently published, about the values encoded in machine learning research,
08:02
because we think machine learning is neutral, and hopefully from yesterday's keynote you know that it's not. So here, oh, there's a pointer, okay, here you can see that I'm making annotations as things come up while I'm reading.
08:22
And finally I go to the summary: I go back to Paperpile and write this summary. That's it. Now I have a very good idea of what the Transformer looks like, what they do, what methodology they follow, what evidence they have.
08:40
OK, so if I need to implement it, I know how to do it. Great, now let's implement an academic paper. But before I jump into that, let's have a quick think about what a Transformer is. I feel like everybody knows what a Transformer is, so indulge me and let's go through it together. The Transformer is actually a family of neural networks.
09:03
It looks more or less like this diagram; this is the original diagram. It has an encoder branch and a decoder branch. And it's very, very popular because it allows parallelization of the computation. Recurrent neural networks were very slow because you need to go through the whole recursion.
09:20
And here we have some parallelization, which allows us to train faster. Then we have the new magical block, attention, which allows us to capture long-distance relationships in a sequence. And finally, we have positional encodings. Positional encodings are very important because they allow us to know
09:41
the position of a token in the sequence. Because if you think about it, it's not the same if I say "no" at the beginning of a sentence or at the end; it might change the meaning. OK, so positional encodings: as I was just saying, sequence problems need to understand the order of the sequence.
10:00
The authors use a sinusoid to encode the position, so every token at every time step has a deterministic vector. And they basically sum them: word embeddings plus positional embeddings. That's it, that's positional encodings. You're more than welcome to try other positional encodings.
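For reference, a minimal sketch of that sinusoidal encoding in jax.numpy (my own reconstruction, not the slide code; it assumes an even model dimension):

```python
import jax.numpy as jnp

def sinusoidal_positional_encoding(seq_len, d_model):
    """Deterministic [seq_len, d_model] matrix of position vectors (even d_model)."""
    positions = jnp.arange(seq_len)[:, None]               # [seq_len, 1]
    dims = jnp.arange(0, d_model, 2)[None, :]               # [1, d_model // 2]
    angles = positions / jnp.power(10000.0, dims / d_model)
    pe = jnp.zeros((seq_len, d_model))
    pe = pe.at[:, 0::2].set(jnp.sin(angles))                 # even dimensions: sine
    pe = pe.at[:, 1::2].set(jnp.cos(angles))                 # odd dimensions: cosine
    return pe

# The block input is then simply: word_embeddings + sinusoidal_positional_encoding(T, d)
```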
10:23
There's no rule, but, I don't know why, by inertia we all use the sinusoid. And what's the attention block? The attention block is where you try to capture the similarities between two words in a sequence. This is very easy to understand when you're talking about translation.
10:43
For example, the word "windmills" is translated into Spanish as "molinos". And when you're doing translation you want to know that, even though the words might not be aligned, might not be in the same place, the word "molinos" is very tightly linked to "windmills".
11:00
So you need to have this relationship, and that's what attention is trying to capture. So we have, I lost one thing, we have the embeddings plus the positional encodings. We project that into a smaller vector space. We do the dot product between the queries and the keys.
11:20
If the queries and the keys belong to the same sequence, that's self-attention. If the queries and the keys belong to an input and an output sequence, like in the translation case, that's cross-attention. Then we apply a non-linear projection, compute another dot product, and we get the attention coefficients.
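As a sketch, the dot-product attention just described looks roughly like this (a simplified, single-head version of my own, not the talk's code):

```python
import jax
import jax.numpy as jnp

def dot_product_attention(q, k, v, mask=None):
    """q: [T_q, d_k]; k: [T_k, d_k]; v: [T_k, d_v]  ->  [T_q, d_v]."""
    scores = q @ k.T / jnp.sqrt(q.shape[-1])        # similarity of every query with every key
    if mask is not None:                            # e.g. a causal mask in the decoder
        scores = jnp.where(mask, scores, -1e30)
    coefficients = jax.nn.softmax(scores, axis=-1)  # the attention coefficients
    return coefficients @ v                         # weighted sum of the values
```

Whether q, k and v come from the same sequence (self-attention) or from two different ones (cross-attention) only changes what you pass in.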
11:42
And that's all the magic, all the sugar, spice and everything nice that makes transformers. So, in summary, we have an encoder branch that has multi-headed attention, because the mechanism I just showed you is repeated multiple times. Then we have the decoder branch,
12:01
which has multi-headed attention and cross-attention, feed-forwards, and add-and-normalization layers. The add layers are there because we are also adding the residuals; those are the dotted lines connecting in between. And that's it. And now, the quickest introduction to JAX.
12:22
This is not a prescription; there are myriad tools out there that you can use, and by all means pick the best one for your needs. Having said that, we actually love JAX. So, why do we love JAX? JAX is a NumPy-like library that runs on accelerators.
12:43
That means that if you know NumPy, you kind of know JAX as well. Kind of; I've been thinking about that claim for a very long time, and it's not completely true. And the good thing about JAX is that it has these transformations, which I'm going to explain in a minute.
13:01
And this is the promised land. You have the predict function that takes the inputs, computes the dot product, adds the bias, applies a non-linear function, and then you compute the mean squared error. And basically I switch NumPy for jax.numpy, and, yeah, that's it.
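A small example in the same spirit as that slide (reconstructed from the description, so the exact names are assumptions):

```python
import jax.numpy as jnp   # the only change from plain NumPy

def predict(params, inputs):
    w, b = params
    return jnp.tanh(inputs @ w + b)          # dot product, bias, non-linearity

def loss(params, inputs, targets):
    preds = predict(params, inputs)
    return jnp.mean((preds - targets) ** 2)  # mean squared error
```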
13:20
That's brilliant. So, if it's exactly the same, why make the change? We make the change because we have transformations. We have grad; grad and jit are going to be the most common transformations. Grad basically takes a function and returns a function that computes the gradient of the original one.
13:42
If you want both the value and the gradient, because you might want the gradient and also the loss, you have value_and_grad. And then you have jit, which is just-in-time compilation. What it does, basically, is trace your program and write an intermediate representation, a jaxpr.
14:05
Normally, the trade-off between flexibility and speed is the ShapedArray level of the tracer: we keep the shape of the array, but we don't keep the values. So you can operate on different batches,
14:21
but all the batches need to have the same shape. Cool. Oh, I forgot to show you, it's here. Look how easy grad and jit are: you have the gradient function and the trace. Brilliant.
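Roughly, and continuing the hypothetical loss function above, those transformations look like this:

```python
import jax

grad_fn = jax.grad(loss)                      # new function: params -> gradients
loss_and_grad_fn = jax.value_and_grad(loss)   # params -> (loss value, gradients)
fast_loss = jax.jit(loss)                     # traced once per input shape, compiled via XLA

# jax.make_jaxpr(loss)(params, inputs, targets) shows the traced intermediate
# representation (the jaxpr) that jit works with.
```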
14:41
And then you have vectorization and parallelization: vmap and pmap are quite similar. vmap works on batches and pmap works across devices, which allows us to do per-example gradients and parallel gradients. We could not train the big, big neural networks we are training at the moment without them.
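A sketch of how vmap gives per-example gradients for the hypothetical loss above (pmap is the same idea spread across devices):

```python
import jax

# One gradient per example in the batch: parameters are broadcast (None),
# inputs and targets are mapped over their leading batch axis (0).
per_example_grads = jax.vmap(jax.grad(loss), in_axes=(None, 0, 0))

# The same idea across devices (data parallelism):
# parallel_grads = jax.pmap(jax.grad(loss), in_axes=(None, 0, 0))
```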
15:00
Okay, so let's implement a transformer. It's been a long road to get here, and, well, actually, we don't really work on raw JAX, because JAX is function-oriented and has tons of boilerplate, and we don't like to write the same thing over and over again.
15:22
So what we do is use this very nice library, Haiku, which allows us to write object-oriented models: you have a Haiku module that builds the model, holds some parameters, and has the function to apply to the inputs. And these models need to be initialized,
15:41
because we somehow need to go from regular functions to pure functions. So we use hk.transform, which gives us the pure init and apply versions of the function. Okay, now we're here for real. Brilliant.
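A minimal sketch of that Haiku pattern, with a hypothetical toy module of my own rather than the talk's code:

```python
import haiku as hk
import jax
import jax.numpy as jnp

class MyLinear(hk.Module):
    """A toy Haiku module: owns parameters, applies them to the inputs."""
    def __init__(self, output_size, name=None):
        super().__init__(name=name)
        self.output_size = output_size

    def __call__(self, x):
        w = hk.get_parameter("w", [x.shape[-1], self.output_size],
                             init=hk.initializers.TruncatedNormal())
        b = hk.get_parameter("b", [self.output_size], init=jnp.zeros)
        return x @ w + b

forward_t = hk.transform(lambda x: MyLinear(32)(x))      # -> pure init / apply pair
params = forward_t.init(jax.random.PRNGKey(42), jnp.ones([8, 16]))
outputs = forward_t.apply(params, None, jnp.ones([8, 16]))
```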
16:01
So now we have the embedding block, which has the positional encodings and the word embeddings. It's not really word embeddings, because we all use SentencePiece subwords these days, but yeah. And you can see the most common modules, like hk.Embed, and since we have the parameters,
16:25
we can say, hey, get the positional embedding, and then we sum them and that's it. We have both of them and we're good to go. The attention block that we were just talking about is this thing. And again, we do some housekeeping to know
16:42
whether it's self-attention or cross-attention, and then we call the parent, because multi-headed attention is such a common module that it's already implemented. Also, very important, please remember that you need to add causal masks in the decoder, because you don't want to learn from something
17:01
that you should not have seen at the current time step. It's obvious, but it has led to a lot of bugs. The feed-forward layer basically initializes the variables, computes a linear layer, applies a GELU, and returns another linear layer.
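Sketched in Haiku, and assuming that GELU reading, the feed-forward block could look like this (hypothetical names, not the slide code):

```python
import haiku as hk
import jax

class FeedForwardBlock(hk.Module):
    """Linear -> GELU -> Linear, the position-wise feed-forward layer."""
    def __init__(self, d_model, widening_factor=4, name=None):
        super().__init__(name=name)
        self.d_model = d_model
        self.widening_factor = widening_factor

    def __call__(self, x):
        h = hk.Linear(self.widening_factor * self.d_model)(x)
        h = jax.nn.gelu(h)
        return hk.Linear(self.d_model)(h)
```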
17:23
Very, very common stuff, so it's really simple to implement; we don't need to do a lot. And here's the whole transformer. Maybe this is too small. Oh, you're seeing it? I thought you were seeing the slideshow. Brilliant.
17:42
So, yeah, this might be too small for you to read, but what I want you to see is that it's very, very similar to all the other toolkits. Basically, you have the attention block, a dropout, and you can see here the normalization.
18:03
It takes the attention output and the residuals, so something that has not passed through the attention; if something is meaningful, we keep it. And then the feed-forward block: a dense layer, dropout, and layer norm. And we repeat this multiple times. Cool.
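A hedged sketch of one such block, reusing the hypothetical FeedForwardBlock above; it has to run inside hk.transform, and it is a simplification of what the slide shows:

```python
import haiku as hk

def transformer_block(x, num_heads, key_size, dropout_rate, is_training):
    """One encoder-style block: self-attention, dropout, add & norm, feed-forward."""
    rate = dropout_rate if is_training else 0.0

    attn = hk.MultiHeadAttention(num_heads=num_heads, key_size=key_size,
                                 w_init=hk.initializers.VarianceScaling(1.0))
    h = attn(query=x, key=x, value=x)                       # self-attention
    h = hk.dropout(hk.next_rng_key(), rate, h)
    x = hk.LayerNorm(axis=-1, create_scale=True, create_offset=True)(x + h)  # residual

    h = FeedForwardBlock(d_model=x.shape[-1])(x)
    h = hk.dropout(hk.next_rng_key(), rate, h)
    return hk.LayerNorm(axis=-1, create_scale=True, create_offset=True)(x + h)
```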
18:25
And then we have to build the forward function. This is slightly different from other toolkits, but it's not completely insane: basically, get the tokens, get the embedding block,
18:40
get the transformer block, and apply the transformer. And that's it. The loss function is very, very similar to the first thing we saw at the beginning: you take the one-hot embedding of the target.
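A hedged sketch of that forward pass and loss, reusing the hypothetical transformer_block above; the sizes, and a plain hk.Embed standing in for the full embedding block, are my own simplifications:

```python
import haiku as hk
import jax
import jax.numpy as jnp

VOCAB_SIZE, D_MODEL, NUM_LAYERS = 32_000, 256, 4   # made-up sizes

def forward(tokens):
    x = hk.Embed(VOCAB_SIZE, D_MODEL)(tokens)      # stands in for the embedding block
    for _ in range(NUM_LAYERS):
        x = transformer_block(x, num_heads=8, key_size=32,
                              dropout_rate=0.1, is_training=True)
    return hk.Linear(VOCAB_SIZE)(x)                # logits over the vocabulary

forward_t = hk.transform(forward)

def loss_fn(params, rng, tokens, targets):
    logits = forward_t.apply(params, rng, tokens)
    one_hot = jax.nn.one_hot(targets, VOCAB_SIZE)  # one-hot embedding of the target
    return -jnp.mean(jnp.sum(one_hot * jax.nn.log_softmax(logits), axis=-1))
```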
19:01
See, I'm completely sure that even though this might be the first time you see JAX, you're perfectly capable of reading this code, because you might already know NumPy. Well, provided that you already know NumPy; if not, it might be a little bit tricky. Cool. And we have a lovely Colab,
19:24
which I'm going to share online, and which is basically everything that you just saw. We need to install a couple of libraries. This might not be big enough. Okay, you need to install a couple of libraries
19:43
and build the SentencePiece tokenizer, although everything these days is so easily accessible: TensorFlow Hub has a SentencePiece tokenizer that you can basically import. And all these are the model parameters, which you are more than welcome to tune.
20:01
Even though some of them, like the dropout rate, are pretty much standard. And then you load the dataset. This dataset is available both from TensorFlow and from Hugging Face, so you can decide how you want to mix these things. Then the embedding block that we just saw, but now with a little bit more boilerplate,
20:22
the attention block, again with a little bit more boilerplate, and the forward pass; I'm going to skip everything we just saw. And here you define the update function, which basically gets and splits the key in order to make it reproducible,
20:42
applies the optimizer and returns the new state. That's absolutely it. And when you train the model, I trained for very little time, but you can see that the loss is getting lower, so let's take that as a good sign. Okay, let's go back to this.
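Sketched with optax (an assumption; the Colab may use a different optimizer library), and reusing the hypothetical loss_fn above, the update step could look like this:

```python
import jax
import optax

optimiser = optax.adam(1e-4)

def update(params, opt_state, rng, tokens, targets):
    rng, step_rng = jax.random.split(rng)          # split the key to stay reproducible
    loss, grads = jax.value_and_grad(loss_fn)(params, step_rng, tokens, targets)
    updates, opt_state = optimiser.update(grads, opt_state)
    params = optax.apply_updates(params, updates)
    return params, opt_state, rng, loss            # the new state

# opt_state = optimiser.init(params); then call jax.jit(update) in the training loop.
```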
21:01
So the main takeaways I hope you get from this talk: first of all, find the right system that allows you to keep up with the literature. There's no single right tool; I hope some of the tools I presented are useful to you, but by all means,
21:20
find the one that is useful for you. Be smart about how you read papers: you don't need to read absolutely everything. And if a paper is relevant to you, summarize it and store it somewhere safe, so you can go back and remember it. I have a colleague who told me the other day that they keep all the papers in their brain.
21:42
And I was like, no, don't do that. Okay, so on transformers: remember that the key thing about transformers is that they allow parallelization, and therefore faster training times. A lot of new flavors of transformer have improved the long-range behaviour, but at the cost of harder training.
22:02
So that's a caveat. Attention allows us to capture information across long distances; the longer the context, the better the prediction. That's why a lot of new variations, like S4, try to improve the context length. But then, if you want to put them in production
22:21
and run experiments with them, they might be too slow. And finally, positional encodings capture the absolute position of the tokens in a sequence. As for implementing papers: we will need to implement papers at some point, either for our academic career or for our business career.
22:42
Find the right tool to implement the paper. There's always a trade-off between flexibility and convenience, so there's no right answer; find the best one for you. We really like JAX because it's very easy to jump in
23:01
and it allows us to do a lot of things. And on top of that, when we don't like raw JAX, we have Haiku, which is a JAX library that allows us to write normal Python code. And that's nearly it from me. Let's build amazing new roads.
23:20
And please, please, please, if you implement new systems, new machine learning models, be conscious of your users, be conscious of the repercussions. This is not a black box: we understand what's happening, and there's massive new research on interpretability, trying to understand the depths of the transformer.
23:41
So I hope this talk gives you an idea of what's happening in the field, but also, be happy. I'm very cheerful about the future; I think it's bright, because we have all these new tools and all this new blooming research. And yeah, that'll be me.
24:02
If there is some time for questions, I'd be more than happy to take them. Sorry, I rushed through a bit.
24:25
JAX looks pretty cool, I haven't seen it before. What would be the reasons to move over from something like PyTorch, either in a research setting or, more particularly, in a production setting? Well, it allows way more flexibility than PyTorch.
24:42
And then there are business reasons: JAX is implemented within the company, so we have the original developers we can ask, and it's very well set up for our infrastructure. It works very well with our TPUs, with our CPUs. And people just use it.
25:01
But again, try to find the right tool. From my experience, I'm a machine learning practitioner and I used TensorFlow in the past; I never used PyTorch. I feel like PyTorch allows more high-level development.
25:20
I'm not sure, if I want to touch something, like getting a gradient with respect to different variables, how that is going to be done. So it's probably about finding the tool that is right for you. Maybe you just want to import the transformer and you don't care how many layers you have.
25:40
You just want to say, these are the main things I want to modify, and you don't need to do anything fine-grained. Is there much of an ecosystem, sorry. Yeah, go right ahead. Is there much of an ecosystem, you know, on Papers with Code, are there a lot of JAX models up there, or is it still being developed? Yeah, yeah. So Google, most of their research
26:02
is either in JAX or TensorFlow, and the research we open source is in JAX these days. So there are a lot of things. Obviously not as much as other people; we don't open source that much, sadly, but for good reasons too.
26:20
But yeah, there's a good ecosystem out there. Good, thank you. Thank you. Thank you so much for your time and your talk. Thank you.