Unfolding the paper windmills
Formal Metadata

Title of Series: EuroPython 2022
Number of Parts: 112
License: CC Attribution - NonCommercial - ShareAlike 4.0 International: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose, as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared, also in adapted form, only under the conditions of this license.
Identifiers: 10.5446/60862 (DOI)
Transcript: English (auto-generated)
00:06
So I've been working in academia for a very long time, and this talk is inspired by someone I was mentoring this year. It's basically like science is built on the shoulders of giants, right? And those giants look scary; you look up at them and think, oh,
00:24
how are we going to do this? But sometimes giants are just windmills. Granted, windmills that speak weird English, that publish way too many papers, and it's basically impossible to keep up. So I hope this story sounds familiar, because what I'm going to do for the next half an hour, even less,
00:45
a little bit less, is be your squire and try to give you the tools to read papers properly. There are two parts to this talk. On one hand, I will give you tools for reading papers, and every time
01:00
I talk about this, a co-worker says, "but I know how to read papers", and I'm like, let me show you how. The second part is the more computer science part, where I will give you some tools to implement a paper. In this case, we're going to look at the very, very famous Attention Is All
01:21
You Need. And as a coworker told me, we all use transformers, so yeah, let's do it. So, how to read academic papers. There are some tools that I think everyone should know about for reading academic papers.
01:41
And as I said, it's not like, yeah, I read papers: I sit down, I print them and I read them. No, don't do that. First of all, and I think the most important tool, are repositories, because we have all been there: hundreds of tabs open with different papers, blog posts,
02:02
YouTube videos, podcasts. These days science is distributed across multiple mediums, and we have everything open and never have time to read it. So we need something to properly collect and categorize everything. The main thing these tools need is to be distributed,
02:24
multi-platform. I need to be able to see a paper on my phone, save it and maybe label it, so it needs to be distributed. There are the old-school tools like Mendeley, but I really, really like Paperpile. I don't know, it feels like a pile of papers
02:44
I'm never going to read, but I'm trying. Then, note taking. Again, note taking could be pen and paper, but if you like digital tools, and being able to make nice summaries and share them across the internet,
03:00
I find that GoodNotes and Notability are very good tools. And finally, organizing. Sometimes you read a paper because you think it's interesting, but it might not be relevant at that moment, and you need to be able to come back and remember that paper. Again, Notion, GoodNotes, Paperpile,
03:21
and Obsidian are very good tools for doing that. And that's basically it. Those are the tools you need for reading papers. Almost. I have a couple of bonus tracks. The first bonus track is tools for discovering new papers. Granted, Twitter is my main way to discover papers.
03:44
Academic Twitter is right there, and you get new papers every day. But that might skew you towards big labs, big corporations. So Research Rabbit and Litmaps will find papers related to yours and show how they link to each other.
04:01
And then for my neurodivergent family, there's Bionic Reading. That's something very, very cool and it will help us read better. OK, so we have the tools, but how do we need to read? Hopefully the repository has helped you massively: you no longer have all those tabs open, so you should be able to read.
04:21
Cool. So now what? Now we sit at the desk, have like 200 cups of coffee, and read through them cover to cover? Well, no, because that's infeasible. Please be kind to yourself: no knight is able to read everything.
04:40
So I do this thing, the three-pass approach. The first pass is me trying to figure out: is this paper relevant? I try to be brutal; I'm not going to spend more than 50 minutes doing this. I read the title, I read the abstract, skim a little bit through the introduction,
05:01
maybe read the discussion, and that's it. Nothing else. Then comes the moment where I maybe know the paper is interesting. I might start brewing a cup of coffee, because I'm going to need it to read this. Again, not cover to cover, just the introduction,
05:20
the contributions and the limitations. My favorite authors always have this last paragraph where they itemize their contributions and the limitations. That is fantastic. Please, authors in the room, do that; I would be very grateful. And then I read the figures and the results section.
05:42
Depending on how expert you are on the topic, this might be more or less useful. And yeah, skim through the rest of the paper, grab more or less the idea of what the paper is about, and write a summary. Granted, it's not going to be the best summary; it's just, well, these topics are discussed in this paper.
06:02
Cool, we're good. And then the next pass is when we properly need multiple cups of coffee and sit and read it cover to cover. There's no shortcut here; we need to read it properly. But know that you don't need to read it alone: find help, find colleagues.
06:23
Asking for help is a sign of strength, not weakness. Something I also do when I'm reading the paper cover to cover is add new papers to the repository, so I know where to follow the lead. And extend the summary: at this point, you have a much better idea of what you're talking about.
06:45
A brief note on how to highlight papers. Since I do these three passes in stages, which I think is pretty common, maybe not, I do this thing, because sometimes I read a paper but I'm not going to implement it, or maybe it will take almost a year before I read it again.
07:03
So I have this traffic-light scheme where I highlight in red the problem that the authors are trying to solve. In yellow, the hypothesis or the methodology that the authors are proposing. And finally, in green, I highlight the evidence that backs up the hypothesis.
07:24
And how does this look in practice? So this is my first pass: very few things. Basically, the only things highlighted are the things I've read, and it really does take 50 minutes. This is the second pass, where I highlight more things and go through the figures.
07:44
Here I'm trying to pay more attention; here I have an idea of what the Transformer looks like. And this is the third pass. No, this is not Attention Is All You Need, but I wanted to show you a paper that was very recently published, about the values encoded in machine learning research,
08:02
because we think machine learning is neutral, and hopefully from yesterday's keynote you know that it's not. So here, oh, there's a pointer, okay, here you can see that I'm making annotations as things come up while I'm reading.
08:22
And finally I go to the summary: I go back to Paperpile and write this summary. That's it. Now I have a very good idea of what the Transformer looks like, what they do, what methodology they follow, what evidence they have.
08:40
OK, so if I need to implement it, I know how to do it. Great, now let's implement an academic paper. But before I jump into that, let's have a quick think about what a Transformer is. I feel like everybody knows what a Transformer is, so indulge me and let's go through it together. The Transformer is actually a family of neural networks.
09:03
It looks more or less like this diagram; this is the original diagram. It has an encoder branch and a decoder branch. And it's very, very popular because it allows parallelization of the computation. Recurrent neural networks were very slow because you need to go through the whole recursion.
09:20
And here we have some parallelization, which allows us to train faster. Then we have the new magical block, attention, which allows us to capture long-distance relationships in a sequence. And finally, we have positional encodings. Positional encodings are very important because they allow us to know
09:41
the position of a token in the sequence. Because if you think about it, it's not the same if I say "no" at the beginning of a sentence or at the end; it might change the meaning. OK, so positional encodings: as I was just saying, sequence problems need to understand the order of the sequence.
10:00
The authors use a sinusoid to encode the position, so every token at every time step has a deterministic vector. And they basically sum them: word embeddings plus positional embeddings. That's it, that's positional encodings. You're more than welcome to try other positional encodings.
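For reference, a minimal sketch of that sinusoidal encoding in jax.numpy (my own reconstruction, not the slide code; it assumes an even model dimension):

```python
import jax.numpy as jnp

def sinusoidal_positional_encoding(seq_len, d_model):
    """Deterministic [seq_len, d_model] matrix of position vectors (even d_model)."""
    positions = jnp.arange(seq_len)[:, None]               # [seq_len, 1]
    dims = jnp.arange(0, d_model, 2)[None, :]               # [1, d_model // 2]
    angles = positions / jnp.power(10000.0, dims / d_model)
    pe = jnp.zeros((seq_len, d_model))
    pe = pe.at[:, 0::2].set(jnp.sin(angles))                 # even dimensions: sine
    pe = pe.at[:, 1::2].set(jnp.cos(angles))                 # odd dimensions: cosine
    return pe

# The block input is then simply: word_embeddings + sinusoidal_positional_encoding(T, d)
```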
10:23
There's no rule, but, I don't know why, by inertia we all use the sinusoid. And what's the attention block? The attention block is where you try to capture the similarities between two words in a sequence. This is very easy to understand when you're talking about translation.
10:43
For example, the word "windmills" is translated into Spanish as "molinos". And when you're doing translation you want to know that, even though the words might not be aligned, might not be in the same place, the word "molinos" is very tightly linked to "windmills".
11:00
So you need to have this relationship, and that's what attention is trying to capture. So we have, I lost one thing, we have the embeddings plus the positional encodings. We project that into a smaller vector space. We do the dot product between the queries and the keys.
11:20
If the queries and the keys belong to the same sequence, that's self-attention. If the queries and the keys belong to an input and an output sequence, like in the translation case, that's cross-attention. Then we apply a non-linear projection, compute another dot product, and we get the attention coefficients.
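As a sketch, the dot-product attention just described looks roughly like this (a simplified, single-head version of my own, not the talk's code):

```python
import jax
import jax.numpy as jnp

def dot_product_attention(q, k, v, mask=None):
    """q: [T_q, d_k]; k: [T_k, d_k]; v: [T_k, d_v]  ->  [T_q, d_v]."""
    scores = q @ k.T / jnp.sqrt(q.shape[-1])        # similarity of every query with every key
    if mask is not None:                            # e.g. a causal mask in the decoder
        scores = jnp.where(mask, scores, -1e30)
    coefficients = jax.nn.softmax(scores, axis=-1)  # the attention coefficients
    return coefficients @ v                         # weighted sum of the values
```

Whether q, k and v come from the same sequence (self-attention) or from two different ones (cross-attention) only changes what you pass in.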
11:42
And that's all the magic, all the sugar, spice and everything nice that makes transformers. So, in summary, we have an encoder branch that has multi-headed attention, because the mechanism I just showed you is repeated multiple times. Then we have the decoder branch,
12:01
which has multi-headed attention and cross-attention, feed-forwards, and add-and-normalization layers. The add layers are there because we are also adding the residuals; those are the dotted lines connecting in between. And that's it. And now, the quickest introduction to JAX.
12:22
This is not a prescription; there are myriad tools out there that you can use, and by all means pick the best one for your needs. Having said that, we actually love JAX. So, why do we love JAX? JAX is a NumPy-like library that runs on accelerators.
12:43
That means that if you know NumPy, you kind of know JAX as well. Kind of; I've been thinking about that claim for a very long time, and it's not completely true. And the good thing about JAX is that it has these transformations, which I'm going to explain in a minute.
13:01
And this is the promised land. You have the predict function that takes the inputs, computes the dot product, adds the bias, applies a non-linear function, and then you compute the mean squared error. And basically I switch NumPy for jax.numpy, and, yeah, that's it.
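A small example in the same spirit as that slide (reconstructed from the description, so the exact names are assumptions):

```python
import jax.numpy as jnp   # the only change from plain NumPy

def predict(params, inputs):
    w, b = params
    return jnp.tanh(inputs @ w + b)          # dot product, bias, non-linearity

def loss(params, inputs, targets):
    preds = predict(params, inputs)
    return jnp.mean((preds - targets) ** 2)  # mean squared error
```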
13:20
That's brilliant. So, if it's exactly the same, why make the change? We make the change because we have transformations. We have grad; grad and jit are going to be the most common transformations. Grad basically takes a function and returns a function that computes the gradient of the original one.
13:42
If you want both the value and the gradient, because you might want the gradient and also the loss, you have value_and_grad. And then you have jit, which is just-in-time compilation. What it does, basically, is trace your program and write an intermediate representation, a jaxpr.
14:05
Normally, the trade-off between flexibility and speed is the ShapedArray level of the tracer: we keep the shape of the array, but we don't keep the values. So you can operate on different batches,
14:21
but all the batches need to have the same shape. Cool. Oh, I forgot to show you, it's here. Look how easy grad and jit are: you have the gradient function and the trace. Brilliant.
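Roughly, and continuing the hypothetical loss function above, those transformations look like this:

```python
import jax

grad_fn = jax.grad(loss)                      # new function: params -> gradients
loss_and_grad_fn = jax.value_and_grad(loss)   # params -> (loss value, gradients)
fast_loss = jax.jit(loss)                     # traced once per input shape, compiled via XLA

# jax.make_jaxpr(loss)(params, inputs, targets) shows the traced intermediate
# representation (the jaxpr) that jit works with.
```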
14:41
And then you have vectorization and parallelization: vmap and pmap are quite similar. vmap works on batches and pmap works across devices, which allows us to do per-example gradients and parallel gradients. We could not train the big, big neural networks we are training at the moment without them.
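A sketch of how vmap gives per-example gradients for the hypothetical loss above (pmap is the same idea spread across devices):

```python
import jax

# One gradient per example in the batch: parameters are broadcast (None),
# inputs and targets are mapped over their leading batch axis (0).
per_example_grads = jax.vmap(jax.grad(loss), in_axes=(None, 0, 0))

# The same idea across devices (data parallelism):
# parallel_grads = jax.pmap(jax.grad(loss), in_axes=(None, 0, 0))
```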
15:00
Okay, so let's implement a transformer. It's been a long road to get here, and, well, actually, we don't really work on raw JAX, because JAX is function-oriented and has tons of boilerplate, and we don't like to write the same thing over and over again.
15:22
So what we do is use this very nice library, Haiku, which allows us to write object-oriented models: you have a Haiku module that builds the model, holds some parameters, and has the function to apply to the inputs. And these models need to be initialized,
15:41
because we somehow need to go from regular functions to pure functions. So we use hk.transform, which gives us the pure init and apply versions of the function. Okay, now we're here for real. Brilliant.
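A minimal sketch of that Haiku pattern, with a hypothetical toy module of my own rather than the talk's code:

```python
import haiku as hk
import jax
import jax.numpy as jnp

class MyLinear(hk.Module):
    """A toy Haiku module: owns parameters, applies them to the inputs."""
    def __init__(self, output_size, name=None):
        super().__init__(name=name)
        self.output_size = output_size

    def __call__(self, x):
        w = hk.get_parameter("w", [x.shape[-1], self.output_size],
                             init=hk.initializers.TruncatedNormal())
        b = hk.get_parameter("b", [self.output_size], init=jnp.zeros)
        return x @ w + b

forward_t = hk.transform(lambda x: MyLinear(32)(x))      # -> pure init / apply pair
params = forward_t.init(jax.random.PRNGKey(42), jnp.ones([8, 16]))
outputs = forward_t.apply(params, None, jnp.ones([8, 16]))
```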
16:01
So now we have the embedding block, which has the positional encodings and the word embeddings. It's not really word embeddings, because we all use SentencePiece subwords these days, but yeah. And you can see the most common modules, like hk.Embed, and since we have the parameters,
16:25
we can say, hey, get the positional embedding, and then we sum them and that's it. We have both of them and we're good to go. The attention block that we were just talking about is this thing. And again, we do some housekeeping to know
16:42
whether it's self-attention or cross-attention, and then we call the parent, because multi-headed attention is such a common module that it's already implemented. Also, very important, please remember that you need to add causal masks in the decoder, because you don't want to learn from something
17:01
that you should not have seen at the current time step. It's obvious, but it has led to a lot of bugs. The feed-forward layer basically initializes the variables, computes a linear layer, applies a GELU, and returns another linear layer.
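Sketched in Haiku, and assuming that GELU reading, the feed-forward block could look like this (hypothetical names, not the slide code):

```python
import haiku as hk
import jax

class FeedForwardBlock(hk.Module):
    """Linear -> GELU -> Linear, the position-wise feed-forward layer."""
    def __init__(self, d_model, widening_factor=4, name=None):
        super().__init__(name=name)
        self.d_model = d_model
        self.widening_factor = widening_factor

    def __call__(self, x):
        h = hk.Linear(self.widening_factor * self.d_model)(x)
        h = jax.nn.gelu(h)
        return hk.Linear(self.d_model)(h)
```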
17:23
Very, very common stuff, so it's really simple to implement; we don't need to do a lot. And here's the whole transformer. Maybe this is too small. Oh, you're seeing it? I thought you were seeing the slideshow. Brilliant.
17:42
So, yeah, this might be too small for you to read, but what I want you to see is that it's very, very similar to all the other toolkits. Basically, you have the attention block, a dropout, and you can see here the normalization.
18:03
It takes the attention output and the residuals, so something that has not passed through the attention; if something is meaningful, we keep it. And then the feed-forward block: a dense layer, dropout, and layer norm. And we repeat this multiple times. Cool.
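A hedged sketch of one such block, reusing the hypothetical FeedForwardBlock above; it has to run inside hk.transform, and it is a simplification of what the slide shows:

```python
import haiku as hk

def transformer_block(x, num_heads, key_size, dropout_rate, is_training):
    """One encoder-style block: self-attention, dropout, add & norm, feed-forward."""
    rate = dropout_rate if is_training else 0.0

    attn = hk.MultiHeadAttention(num_heads=num_heads, key_size=key_size,
                                 w_init=hk.initializers.VarianceScaling(1.0))
    h = attn(query=x, key=x, value=x)                       # self-attention
    h = hk.dropout(hk.next_rng_key(), rate, h)
    x = hk.LayerNorm(axis=-1, create_scale=True, create_offset=True)(x + h)  # residual

    h = FeedForwardBlock(d_model=x.shape[-1])(x)
    h = hk.dropout(hk.next_rng_key(), rate, h)
    return hk.LayerNorm(axis=-1, create_scale=True, create_offset=True)(x + h)
```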
18:25
And then we have to build the forward function. This is slightly different from other toolkits, but it's not completely insane: basically, get the tokens, get the embedding block,
18:40
get the transformer block, and apply the transformer. And that's it. The loss function is very, very similar to the first thing we saw at the beginning: you take the one-hot embedding of the target.
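A hedged sketch of that forward pass and loss, reusing the hypothetical transformer_block above; the sizes, and a plain hk.Embed standing in for the full embedding block, are my own simplifications:

```python
import haiku as hk
import jax
import jax.numpy as jnp

VOCAB_SIZE, D_MODEL, NUM_LAYERS = 32_000, 256, 4   # made-up sizes

def forward(tokens):
    x = hk.Embed(VOCAB_SIZE, D_MODEL)(tokens)      # stands in for the embedding block
    for _ in range(NUM_LAYERS):
        x = transformer_block(x, num_heads=8, key_size=32,
                              dropout_rate=0.1, is_training=True)
    return hk.Linear(VOCAB_SIZE)(x)                # logits over the vocabulary

forward_t = hk.transform(forward)

def loss_fn(params, rng, tokens, targets):
    logits = forward_t.apply(params, rng, tokens)
    one_hot = jax.nn.one_hot(targets, VOCAB_SIZE)  # one-hot embedding of the target
    return -jnp.mean(jnp.sum(one_hot * jax.nn.log_softmax(logits), axis=-1))
```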
19:01
See, I'm completely sure that even though this might be the first time you see JAX, you're perfectly capable of reading this code, because you might already know NumPy. Well, provided that you already know NumPy; if not, it might be a little bit tricky. Cool. And we have a lovely Colab,
19:24
which I'm going to share online, and which is basically everything that you just saw. We need to install a couple of libraries. This might not be big enough. Okay, you need to install a couple of libraries
19:43
and build the SentencePiece tokenizer, although everything these days is so easily accessible: TensorFlow Hub has a SentencePiece tokenizer that you can basically import. And all these are the model parameters, which you are more than welcome to tune.
20:01
Even though some of them, like the dropout rate, are pretty much standard. And then you load the dataset. This dataset is available both from TensorFlow and from Hugging Face, so you can decide how you want to mix these things. Then the embedding block that we just saw, but now with a little bit more boilerplate,
20:22
the attention block, again with a little bit more boilerplate, and the forward pass; I'm going to skip everything we just saw. And here you define the update function, which basically gets and splits the key in order to make it reproducible,
20:42
applies the optimizer and returns the new state. That's absolutely it. And when you train the model, I trained for very little time, but you can see that the loss is getting lower, so let's take that as a good sign. Okay, let's go back to this.
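Sketched with optax (an assumption; the Colab may use a different optimizer library), and reusing the hypothetical loss_fn above, the update step could look like this:

```python
import jax
import optax

optimiser = optax.adam(1e-4)

def update(params, opt_state, rng, tokens, targets):
    rng, step_rng = jax.random.split(rng)          # split the key to stay reproducible
    loss, grads = jax.value_and_grad(loss_fn)(params, step_rng, tokens, targets)
    updates, opt_state = optimiser.update(grads, opt_state)
    params = optax.apply_updates(params, updates)
    return params, opt_state, rng, loss            # the new state

# opt_state = optimiser.init(params); then call jax.jit(update) in the training loop.
```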
21:01
So the main takeaways I hope you get from this talk: first of all, find the right system that allows you to keep up with the literature. There's no single right tool; I hope some of the tools I presented are useful to you, but by all means,
21:20
find the one that is useful for you. Be smart about how you read papers: you don't need to read absolutely everything. And if a paper is relevant to you, summarize it and store it somewhere safe, so you can go back and remember it. I have a colleague who told me the other day that they keep all the papers in their brain.
21:42
And I was like, no, don't do that. Okay, so on transformers: remember that the key thing about transformers is that they allow parallelization, and therefore faster training times. A lot of new flavors of transformer have improved the long-range behaviour, but at the cost of harder training.
22:02
So that's a caveat. Attention allows us to capture information across long distances; the longer the context, the better the prediction. That's why a lot of new variations, like S4, try to improve the context length. But then, if you want to put them in production
22:21
and run experiments with them, they might be too slow. And finally, positional encodings capture the absolute position of the tokens in a sequence. As for implementing papers: we will need to implement papers at some point, either for our academic career or for our business career.
22:42
Find the right tool to implement the paper. There's always a trade-off between flexibility and convenience, so there's no right answer; find the best one for you. We really like JAX because it's very easy to jump in
23:01
and it allows us to do a lot of things. And on top of that, when we don't like raw JAX, we have Haiku, which is a JAX library that allows us to write normal Python code. And that's nearly it from me. Let's build amazing new roads.
23:20
And please, please, please, if you implement new systems, new machine learning models, be conscious of your users, be conscious of the repercussions. This is not a black box: we understand what's happening, and there's massive new research on interpretability, trying to understand the depths of the transformer.
23:41
So I hope this talk gives you an idea of what's happening in the field, but also, be happy. I'm very cheerful about the future; I think it's bright, because we have all these new tools and all this new blooming research. And yeah, that'll be me.
24:02
If there is some time for questions, I'd be more than happy to take them. Sorry, I rushed through a bit.
24:25
JAX looks pretty cool, I haven't seen it before. What would be the reasons to move over from something like PyTorch, either in a research setting or, more particularly, in a production setting? Well, it allows way more flexibility than PyTorch.
24:42
And then there are business reasons: JAX is implemented within the company, so we have the original developers we can ask, and it's very well set up for our infrastructure. It works very well with our TPUs, with our CPUs. And people just use it.
25:01
But again, try to find the right tool. From my experience, I'm a machine learning practitioner and I used TensorFlow in the past; I never used PyTorch. I feel like PyTorch allows more high-level development.
25:20
I'm not sure, if I want to touch something, like getting a gradient with respect to different variables, how that is going to be done. So it's probably about finding the tool that is right for you. Maybe you just want to import the transformer and you don't care how many layers you have.
25:40
You just want to say, these are the main things I want to modify, and you don't need to do anything fine-grained. Is there much of an ecosystem, sorry. Yeah, go right ahead. Is there much of an ecosystem, you know, on Papers with Code, are there a lot of JAX models up there, or is it still being developed? Yeah, yeah. So Google, most of their research
26:02
is either in JAX or TensorFlow, and the research we open source is in JAX these days. So there are a lot of things. Obviously not as much as other people; we don't open source that much, sadly, but for good reasons too.
26:20
But yeah, there's a good ecosystem out there. Good, thank you. Thank you. Thank you so much for your time and your talk. Thank you.