Keynote - PyTorch: Framework for fast, dynamic deep learning and scientific computing
Formal Metadata
Title: Keynote - PyTorch: Framework for fast, dynamic deep learning and scientific computing
Number of Parts: 43
License: CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/38191 (DOI)
Production Place: Erlangen, Germany
Transcript: English (auto-generated)
00:06
Okay, good morning, everyone. I hope we didn't lose too many people after the social event last night and the beers afterwards, but good to see most of you made it here this morning.
00:21
So we're gonna start off the second day of talks with a keynote by Soumith Chintala, who's gonna be talking about PyTorch. I'm actually a user of PyTorch. I think it's a fantastic deep learning framework, especially for experimentation. Soumith is a research engineer at Facebook AI Research in New York,
00:42
and he's one of the core developers of PyTorch. And yeah, I'll leave the talk to Soumith. Thanks for coming. Thanks, Federico. Hello, thank you for coming. I will be talking about PyTorch today. And, oh, actually, give me one second, yes.
01:36
So I'll be talking about PyTorch today. It's a deep learning framework,
01:42
but also a scientific computing package. This is the PyTorch team of people from all over the world, and we've been developing it for about a year now. It actually started off as an intern project.
02:03
So for those of you who don't know, there was a package called Torch that was written in Lua that we used to use a lot in the deep learning community. I was one of the maintainers for that.
02:23
And we ran Torch for a while, but one of the big downsides was that there was no ecosystem for Lua, no standard library there, and many other problems. And we were thinking of writing a new version of Torch,
02:40
and there is no better ecosystem to compete with than Python. And we decided to go in this direction, but we also rewrote the design of PyTorch as well to upgrade it from Lua,
03:01
from Torch, which was about seven, eight years old. So what is PyTorch? It is many things, but fundamentally, it has an ndarray library with GPU support.
03:25
So basically you can think of it as a NumPy alternative that also has support for doing computations on GPU. And that forms the absolute base and core
03:41
of PyTorch, and then there is an automatic differentiation package. So you can do computations, and then you can take the derivatives of one node in your computation with respect to another. So this is useful, especially,
04:01
well, the way we designed it was so that we can do particular deep learning workloads and do a lot of AI research. And to support the automatic differentiation engine, we have a optimization package that does numerical optimization.
04:22
It implements a lot of standard, great and decent optimization methods. And lastly, we have many utility packages that some of them I will be going through. They're mostly around like data loading,
04:40
automatically downloading pre-trained models from the internet, et cetera. So at any point of the talk, if you wanna ask questions, please do raise your hand. Okay. So, an ndarray library: just like NumPy
05:05
provides an ndarray object, in PyTorch we have torch.Tensor, and you have different tensor types, like torch.FloatTensor, DoubleTensor, LongTensor, and so on.
05:22
By default, torch.Tensor is an alias for the default tensor type, which is FloatTensor. We have many mathematical operations, linear algebra operations, indexing. I mean, the standard suite of operations you would expect from a multidimensional array library.
05:42
And we have very fast acceleration on NVIDIA GPUs. We also have some work that is getting this towards AMD GPUs, particularly AMD Vega cards, but that is ongoing work along with AMD.
06:04
All the unit tests are passing on the AMD thing, but there's still some perf issues. Okay, so to help you relate a little bit more to PyTorch, don't worry if you can't read the code.
06:21
I guess the letters are too small, but basically, there's a side-by-side example of a small, manually written, I think, one-layer neural network that is doing some kind of L2 loss, and the left side
06:46
is written in NumPy, and the right side is written in PyTorch, and you can see that there's basically a lot of parallels between both the APIs, so the PyTorch Tensor API is not an exact API compatible with NumPy,
07:02
but all the operations that you would expect in NumPy are present in PyTorch, probably with slightly different function signatures. And there is no particularly good reason to not have adopted the NumPy API. It's just that we were coming from our own legacy
07:23
of coming from LuaTorch to writing PyTorch, so we carried that API over, but we are slowly making transformations to the API to make it look more and more like NumPy.
07:40
So with a slightly bigger font, the ndarray library: you can create tensors, you can print them, you can, you know, like we have a random package, you can create randomly initialized tensors, you know, you can print their sizes,
08:02
and you can do standard indexing, like basic and advanced indexing, just like you would with NumPy arrays. You can do standard mathematical operations. Here I'm just adding up two tensors, x and y.
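A rough sketch of the kind of snippet being described here, reconstructed rather than copied from the slide:

```python
import torch

x = torch.ones(5, 3)       # create a tensor
y = torch.rand(5, 3)       # randomly initialized tensor
print(y)
print(y.size())            # torch.Size([5, 3])

print(y[:, 1])             # basic indexing, as with NumPy arrays
print(y[y > 0.5])          # advanced (mask) indexing

z = x + y                  # standard mathematical operations
z = torch.add(x, y)        # the equivalent functional form
```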
08:23
One of the things we can do is that you can convert from a torch tensor to a NumPy array and back without any memcpy. So it's basically like a free operation, especially if the torch tensor is in the CPU.
08:41
So under the hood, NumPy provides a nice API for this, and we leverage that. So you can create a torch tensor, convert it to a NumPy array, and then make changes to the NumPy array, and then the changes will automatically be reflected in the torch tensor,
09:00
because the underlying memory pointers are the same. And similarly, you can create a NumPy array and convert it to a torch tensor, and you would again see that if you change the NumPy array or change the torch tensor, the changes are reflected in the other direction as well.
09:23
Here is a small example of me showing that. I show that I add the number one to A, which is a torch tensor, and then the NumPy array that it was converted from also changed.
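A small sketch of that zero-copy round trip, reconstructed from the description above:

```python
import torch
import numpy as np

a = torch.ones(5)
b = a.numpy()            # shares the same underlying memory, no copy
a.add_(1)                # in-place add on the torch tensor
print(b)                 # [2. 2. 2. 2. 2.] -- the NumPy array changed too

c = np.ones(3)
d = torch.from_numpy(c)  # zero-copy in the other direction
np.add(c, 1, out=c)
print(d)                 # the change shows up in the torch tensor as well
```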
09:43
So all the torch tensors on the CPU, except for the char tensor, support converting to NumPy and back. Then coming to GPU tensors, it's kind of very explicit
10:02
on how tensors are used on the GPU. So if you have a particular tensor, to put it on the GPU, you call .cuda on the tensor, and then if you have a torch.FloatTensor, and you call .cuda on it, then you would get a torch.cuda.FloatTensor
10:21
as your return type. And that tensor is now pointing to a memory region on the GPU. If you do any computations on it, then the results will also be on the same GPU. And only if you want to transfer it back to the CPU, you call .cpu on that particular torch.cuda.FloatTensor.
10:43
And again, you will get back a torch.FloatTensor. The reason we have an explicit model like this, rather than trying to hide it where all of your intermediates are NumPy arrays, for example, is because doing CPU to GPU transfers is a fairly expensive operation,
11:02
and you wouldn't want to have this automatic way of trying to figure out where to keep memory and so on. Doing it explicit also gives the user a lot of power. So that is an overview of our tensor library.
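A sketch of that explicit device model, assuming a CUDA-capable machine:

```python
import torch

x = torch.rand(3, 3)          # a torch.FloatTensor in CPU memory
if torch.cuda.is_available():
    y = x.cuda()              # explicit copy to the GPU: torch.cuda.FloatTensor
    z = y + y                 # computed on the GPU, result stays on the GPU
    z_cpu = z.cpu()           # explicit copy back to a torch.FloatTensor
```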
11:26
We do have all of the standard linear algebra functions with LAPACK interfaces, and if there's something missing on the CPU side, for example, you could always just convert the tensor
11:43
to a NumPy array and call, you know, SciPy or NumPy operations. Okay, coming to my next section, which is the automatic differentiation engine.
12:00
So, so far I showed like how you could use the standard tensor library. So the automatic differentiation engine I will show it by example. What it is is you would do a bunch of tensor operations, and then you can then take the gradients
12:21
of one of the results with respect to another. And this is often useful if you're trying to do gradient descent based machine learning. As an example, so we provide in the AutoGrad package, we named it AutoGrad because there's a lot of history
12:43
to that. There's also a Python AutoGrad package, which is basically the original automatic differentiation package by some of the Harvard folks. We named ours torch.autograd due to a comedy of errors.
13:03
The right name for it probably would be torch.autodiff. Anyways, so in our AutoGrad package, we have an object called variable, a class called variable. So a variable is something that loosely wraps a tensor.
13:21
So you put a tensor inside a variable, and then on this variable you can do a lot of operations. And then what was the sequence of operations will be recorded by this variable object. And then in this example, I create four random tensors,
13:44
x, W_h, W_x, and prev_h. This is a small, recurrent-network-style graph. And I create these four initial tensors, which I will call leaf variables
14:02
because these are user-created tensors. That is, these are not intermediate variables that were the results of some operation, the user explicitly created them. And then I will do two matrix multiplies on them to get this resulting graph.
14:21
And then I will add the results of those two matrix multiplies, and I will finally get the node nextH. So this is my graph. And then finally, I will add a tanH operation
14:42
on the nextH. So in terms of my computation that I've done so far, this is what the computation graph looks like. And the variable nextH has a pointer to the tanH function,
15:01
which in turn has a pointer to the add function. And you can basically recreate the entire computation graph from the last variable nextH. So if you want to compute the derivative of nextH with respect to any of the leaf variables up there,
15:23
like W_h or prev_h or W_x or x, then what you would do is you would call nextH dot backward with some gradients with respect to nextH. And then what will happen is the gradients
15:41
of nextH with respect to all of the leaf nodes will then be accumulated into the .grad attribute of each of those leaf variables, specifically x, prev_h, W_h, and W_x.
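A reconstruction of that graph as code, using the Variable API described in the talk; the shapes are illustrative, not the ones from the slide:

```python
import torch
from torch.autograd import Variable

# Leaf variables: created by the user, so gradients accumulate into .grad
x      = Variable(torch.randn(1, 10), requires_grad=True)
prev_h = Variable(torch.randn(1, 20), requires_grad=True)
W_h    = Variable(torch.randn(20, 20), requires_grad=True)
W_x    = Variable(torch.randn(20, 10), requires_grad=True)

i2h = torch.mm(W_x, x.t())          # first matrix multiply
h2h = torch.mm(W_h, prev_h.t())     # second matrix multiply
next_h = i2h + h2h                  # add the two results
next_h = next_h.tanh()              # final tanh node

# Backpropagate some gradient w.r.t. next_h; the gradients end up in
# x.grad, prev_h.grad, W_h.grad and W_x.grad.
next_h.backward(torch.ones(20, 1))
```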
16:02
And so this is basically how our autograd engine works by example. To give a more concrete and fuller neural networks example, so this is a short example of creating a neural network.
16:24
In the first constructor section, we are just creating particular layers that we would like to use in our neural network. This kind of neural network, it's called a convolutional neural network. So we create some convolutional layers,
16:41
some affine transform layers. And then we define a forward function and the forward function basically it takes the inputs and then you define how you transform the inputs to get the output, the output final result.
17:04
And so the forward function is just like a standard Python piece of code. You just, you can use the layers that you created in your constructor and you can do various other operations, torch operations, and then you get your result that you return.
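A sketch along those lines; the layer sizes here are placeholders, not the ones from the slide:

```python
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # layers created in the constructor
        self.conv1 = nn.Conv2d(1, 6, 5)         # convolutional layers
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)   # affine transform layers
        self.fc2 = nn.Linear(120, 10)

    def forward(self, x):
        # plain Python code describing how inputs become the output
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = x.view(x.size(0), -1)
        x = F.relu(self.fc1(x))
        return self.fc2(x)

model = Net()   # instantiate the network
```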
17:20
And this is basically how you define a neural network. And then you can create an instantiation of the neural network in the last line. And then you can pass inputs to the network and then call backward on it. And then the gradients with respect to each of the
17:46
parameters defined in the network will be accumulated into their grad attributes. And then you can do some kind of gradient descent method. And to do the gradient descent steps, we provide an optimization package
18:01
which has a very simple interface. We implement many methods that are standard in the neural network literature, Stochastic Gradient Descent, Adagrad, RMSprop, LBFGS, Adam, and so on. And basically, this is typically how a training loop
18:23
for your neural network would look like. You would iterate over your inputs and some target values in your data set. You would zero the previous gradients and you would pass your input through your neural network. And in this case, it's called model.
18:42
And then you would get a particular output. And then you would take your output and some target value that you want your neural network to compute. And then you would compute some loss function that compares your output to the target value and produces some how far away is your output
19:01
from your target value. And then you will compute the gradients with respect to that loss function for all of your parameters by calling loss.backward. And then all of the gradients are accumulated in those particular parameters of your neural network. And then when you call optimizer.step,
19:23
then the neural network will take a small gradient descent step where its weights will be slightly transformed to have less of an error the next time it sees that input-target pair.
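A sketch of such a training loop; `model` is a network like the one sketched above, `dataloader` is assumed to yield (input, target) pairs, and the loss function and learning rate are placeholders:

```python
import torch.nn as nn
import torch.optim as optim

criterion = nn.CrossEntropyLoss()                    # compares output to target
optimizer = optim.SGD(model.parameters(), lr=0.01)   # could also be Adam, RMSprop, ...

for input, target in dataloader:      # iterate over inputs and targets
    optimizer.zero_grad()             # zero the previous gradients
    output = model(input)             # forward pass through the network
    loss = criterion(output, target)  # how far the output is from the target
    loss.backward()                   # accumulate gradients into the parameters
    optimizer.step()                  # take a small gradient descent step
```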
19:55
So the question is the optimization is itself implemented in terms of torch operations.
20:01
So can you do meta optimization? Yes, the answer is yes. You can create an optimizer that itself is some kind of parameterized model that will provide like the gradient descent step. And it is learning how to optimize as well.
20:22
You could do all of these inception-style steps. Okay, so now I just gave you a rough overview of the basic packages, but I presume a lot of you
20:44
are not following a lot or it's just early in the morning. So what's left is I will go through what typical machine learning workflows look like,
21:01
especially in the deep learning space. And what are the usual pain points of dealing with you know, building these workflows and how PyTorch can help in dealing with them. And then like one slide summary of like what are the core philosophies of PyTorch
21:21
when we develop this package and to guide our future. And then lastly, I will talk a little bit about the upcoming features in PyTorch. So talking about ML workflows, typically in my short experience,
21:40
this is kind of what I think machine learning workflows are like. You or your advisor or your friend will have an idea or some theory that you came up with and then you want to design experiments to validate or invalidate that theory.
22:01
And then you would select some data sets or like some appropriate environments to validate this and you would pre-process your data or set up your environments. You would implement your particular models like all kinds of machine learning models.
22:20
And then you would train and you would have some part of your data set be a validation set and some part of it be a testing set. And so you would do training and validation and then finally you would test your final model, see if it works and then if it works, you would publish it if not, you would write a blog post.
22:47
So Python combined with PyTorch is one environment to do all of this. It's not the only one, definitely not, but it is one. And let me go into how PyTorch can help you
23:03
do some of these blocks here. So writing data loaders, I think it's one of the most horrible parts of doing machine learning workflows. It's annoying, it's not rewarding, but it has to be done. So the problem is every data set
23:23
is slightly differently formatted, slightly differently processed, and when you get it, there's some noise in the data set of various forms, or it's written in some XML format that you have to parse. So what you typically need is some kind of pre-processing
23:49
on this data set that is consistent and typically when you're using GPUs to train your neural network models on these data sets,
24:02
GPUs are generally like fairly fast machines. So you would need your data pre-processing to not become a bottleneck while you're training your neural network. So you would need a multi-threaded data loader or like some kind of data loader that will alleviate the CPU bottlenecks that you might have.
24:22
So our solution to these things is, the first is we have packages that share all of the data loaders across the PyTorch community so that you don't have to write the same data loader that some other person somewhere wrote
24:42
and it's been fairly successful. We've been getting pull requests to add more and more data sets, especially academic data sets. We have a vision package, a text package and then we just started an audio package. But this is not really limited to like,
25:03
oh, do you need to use PyTorch data loaders to use PyTorch? No, you can just use regular Python to write torch data sets and leverage existing Python code. As an example, there's a project called ParlAI
25:21
that Facebook released. It has like 20 question-answering data sets and interfaces to them. And it's a standard Python API and you can use this to write PyTorch data loaders. In practice, this is how the code looks.
25:43
And so if you're writing a script to train a network on a bunch of data sets, you would just have like some conditionals based on like whatever your command line is parsing. It's like, oh, if my command line says
26:01
my data set is LSUN, then I would just have this one snippet that would initialize the LSUN data set and specify how to pre-process the data set. And the last line here is, so we have two abstractions. One is called a data set
26:20
and the other is called a data loader. The data loader takes a data set and just makes it into like a multi-processing beast so that you just have parallel processing while you're like loading data from disk or doing pre-processing. So the data loader takes in a data set and takes in the number of workers you want to use
26:44
to process the data set and you have some other options like whether you wanna shuffle your data set or how big of a mini-batch you want to load per iteration and so on.
27:01
And actually writing a new Torch data set for a data set that you wanna use is also not that hard. The interface is small: you only have to implement two functions.
27:23
You have to subclass from the torch Dataset class and then you have to implement the __getitem__ method, which takes an index. So if your data set has 10,000 samples and the samples can be, let's say, images or audio clippings or some sentences.
27:43
So the __getitem__ function will just take the particular index of the sample that you wanna load in your data set. If you have 10,000, then your index can range from zero to 9999. And then that function will just define
28:01
how to load that particular index and pre-process it. And then the other function that you need to implement is the __len__ method that just defines how big your data set is. And it's fairly mechanical to implement a new data set.
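A sketch of that two-function interface plus the data loader wrapping it; `load_and_preprocess` and `my_file_paths` are placeholders:

```python
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    def __init__(self, file_paths):
        self.file_paths = file_paths                 # e.g. paths to samples on disk

    def __getitem__(self, index):
        # define how to load and pre-process the sample at this index
        return load_and_preprocess(self.file_paths[index])   # placeholder helper

    def __len__(self):
        return len(self.file_paths)                  # how big the data set is

dataset = MyDataset(my_file_paths)
loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=4)
for batch in loader:
    pass    # batches arrive already pre-processed, in parallel worker processes
```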
28:21
You know, like you just define how to load it from disk and pre-process it and so on. And lastly, for data loaders, which are the multiprocessing helpers for data sets, what we ended up doing in PyTorch
28:41
is we did a small fork of the Python multiprocessing package and just put it under the torch.multiprocessing namespace. The only thing we did was instead of using the standard Python pickler, we use a custom pickler which whenever it encounters tensors,
29:01
being passed through from one process to another, instead of like serializing the tensor and then deserializing it on the other side, it just like puts the tensor into a system shared memory and then it just like shares tensors across the processes. So if you change the tensor in one process,
29:22
like the change is gonna be reflected on the other process, for example. And this is kind of both needed for efficiency. Because Python pickling is fairly slow. And also it's useful when you're trying to implement
29:40
certain variants of training loops where let's say you want to have multiple copies of your neural network all in different processes but doing different things and optimizing different things. So if you want to know more subtleties
30:01
of general Python multiprocessing, there's a talk at 2:15, I think, by one of your upcoming speakers. And the content of the talk goes into a lot of very, very important subtleties in multiprocessing.
30:24
You should definitely go for that. And lastly, apart from like loading datasets, disk-based datasets or text-based datasets, you would also want to interface with environments. For example, like video game environments
30:42
or like the real world, you have like a real-time camera somewhere. And pretty much every environment provides a Python API. So there's like not much else you really need to do to natively interact with your environment separately.
31:03
So that covers the data loading part. And the next couple of slides are gonna be about how PyTorch really helps you do better debugging and do easier profiling of your code,
31:21
finding hotspots. I wanted to give a little bit of context on why these slides are there. In the deep learning framework space, it is not obvious that if you use some framework that just say has a Python API,
31:41
that you could use the Python debugger, for example. Because like a lot of these frameworks are provided as black boxes where you create your model, but then when you run it, it's run in some kind of C++ runtime, so you can't actually set Python breakpoints and see what's going on and printing and so on.
32:02
So that is particularly why I do mention that with PyTorch, PyTorch is just a CPython extension. It doesn't work with PyPy or other variants. But you can use your favorite Python debugger. You can use PyCharm, you can use pdb,
32:23
you can use print, for example. And it's kind of as smooth as debugging other parts of your Python code. And similarly, if you want to profile your code, identify bottlenecks, you can use your favorite
32:45
profiler from Python. Now, for these profiling, it's not gonna go into the C code and further break down which parts of the C code
33:00
are being a bottleneck, but usually when you're doing PyTorch stuff, the granularity of hotspots you would find is at the Python level. So this is particularly not going to hold you back when you're identifying bottlenecks, for example.
33:21
And lastly, PyTorch is written so that you don't have any runtime compilation time. So if you use a package like Theano, for example, to do certain optimization, what they would do
33:43
is they would invoke GCC at runtime, like stitch C code and then compile it. But these things generally take time. And from the beginning, PyTorch has been built so that as a user, you don't actually wait on anything.
34:01
You use PyTorch and it feels like a scripting language, pretty much. And we basically precompile all of the kernels we really need, but that comes with a slight downside, which is that the binaries that we ship are fairly large.
34:22
A PyTorch binary is about 400 to 450 MB. Most of it is precompiled GPU kernels, which we don't want to specialize at runtime for performance, because that means
34:41
that as a user, you're waiting on something to be happening. So yeah, that's the only downside. Our binaries are fairly large. So this slide is kind of useless for this audience, but yes, if you use PyTorch,
35:01
you can use any of your favorite packages, even while writing neural network layers, you can use SciPy, Scikit-learn. As a small, anecdotal example, this user called Brandon Amos, he wrote some really complicated thing that he couldn't do in other frameworks,
35:23
but more importantly, he interspersed like PyTorch stuff, went into SciPy for certain operations, came back and like all of these are pretty natural when you're writing PyTorch layers. And we even have tutorials of like,
35:43
okay, if you want to write like a PyTorch extension, can you implement parts of it in NumPy and leverage SciPy, for example. So another thing that we have in PyTorch that also tries to create a community
36:02
and share things is users can share their neural network models across the community. And just with like, say, one line of code, you can get a pre-trained network, trained on some data set
36:21
that was trained by someone else in the community. And that just saves you a lot of academic research time. For example, someone took all of the models in the PyTorch model zoo, and they just wanted to see
36:40
if that particular neural network, which was trained and computed in 32-bit floating-point precision, if they quantized it to various bits, how the accuracy of the neural network would degrade. And doing something like this, if you already have pre-trained models, it's kind of much easier to do.
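That one-line pre-trained model loading would look roughly like this with the torchvision model zoo; the `pretrained=True` flag is the interface from that era, newer releases use a `weights=` argument instead:

```python
import torchvision.models as models

# Downloads the weights the first time and caches them locally.
resnet = models.resnet18(pretrained=True)
resnet.eval()
```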
37:07
So one of the things you will notice in PyTorch is we built all the APIs so that you, as a user,
37:20
will write code with a linear style of programming, that is, you will write code top-down, and you can interactively write it in IPython notebooks, for example.
37:40
And there's other models of building neural networks and doing training as well. Usually people build a particular neural network beforehand, and then they run it in some other engine. So if there's something that they wanna debug, they would have to relate their debug message
38:04
to where they constructed the neural network, which is what I call nonlinear dependencies. But in PyTorch, you can basically have a linear code flow, and debugging is also fairly easier. This is not something in the neural network world
38:21
is not unique to PyTorch, but to build this automatic differentiation engine that is interactive and imperative, the fundamental bottleneck before PyTorch came out was that these packages had a huge node creation
38:40
and bookkeeping overhead. We spent a lot of time micro-optimizing our automatic differentiation engine so that the overhead of node creation, of bookkeeping is at the max not more than 20 to 30 microseconds, which is quite large
39:00
if you have very, very small computation workloads. But in general, the workloads that we've been seeing in the community and the workloads that we care about, that overhead is fairly okay. And we're of course working on reducing that even further.
39:22
The other options that you have in the community generally have a few milliseconds or more of node creation overhead, especially if you have an interactive mode. Okay, coming to the second last section,
39:41
which is the philosophy of PyTorch. These are basically what we kind of keep in mind. We wanna stay out of the way of the user. We don't want to overburden them with abstractions or complicated API procedures.
40:05
We wanna cater to the impatient. We want things to always be interactive, quick, no compilation time. We wanna promote a linear and interactive code flow. And we wanna be interoperating with the Python ecosystem
40:24
as naturally as possible. And we wanna be as fast as any other package that provides the same features that we do. And the last section is the recent and upcoming features.
40:43
Which might be useful in context of like existing users but also might be useful otherwise. So distributed PyTorch is something we released about a month ago as part of our point two release.
41:02
It's an MPI style distributed communication. So you can basically exchange tensors between multiple nodes and multiple machines or multiple processes. And you can also do like reductions. For example, you can say,
41:21
there I have these tensors on each of my machines. I wanna compute the overall sum of all of these tensors into the same buffers. And you can do scatters, gathers and so on. This style of distributed programming is called an MPI style.
41:42
We introduced that package. We have examples of how to use this in terms of neural network training. If you have multiple nodes with multiple CPUs or GPUs, like how can you leverage all of them at the same time to make your training accelerate faster?
42:01
And then we introduced a feature to compute not just first derivatives through our automatic differentiation package, but also higher-order derivatives. And this is kind of useful to implement more crazy ideas in research and practice.
42:27
A lot of recent research seems to be using higher order gradients for implementing some of their ideas. So that's something we implemented.
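A small sketch of taking a second derivative through the autograd engine, using `create_graph` so the first gradient itself stays differentiable:

```python
import torch
from torch.autograd import Variable, grad

x = Variable(torch.Tensor([3.0]), requires_grad=True)
y = (x ** 3).sum()                        # y = x^3

dy_dx, = grad(y, x, create_graph=True)    # dy/dx   = 3 * x^2 = 27
d2y_dx2, = grad(dy_dx.sum(), x)           # d2y/dx2 = 6 * x   = 18
```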
42:41
And until the 0.2 release, PyTorch didn't have NumPy-style broadcasting or advanced indexing. Across the library, we've implemented this now. And that's been pretty useful for a lot of our users.
43:00
And this last feature I want to talk about is just-in-time compilation. This is an upcoming feature. We have pull requests open for this and we're in various levels of code review slash changes. But basically, I described to you how PyTorch computation graphs are built
43:23
and you can do a forward and a backward and you have neural network models. Now, all of this computation is interactive, which means that the amount of optimization we can do to make the computation go faster
43:40
on your particular hardware is very limited. So what we ended up building is a tracing just-in-time compiler that can cache and compile your graphs and subgraphs. I'll explain to you what a tracing JIT is in very, very simple terms. If there's compiler experts in the audience, don't kill me.
44:03
But basically, let's say you have a function foo that takes an input tensor x and it first computes the total sum of all elements of the tensor. And if the sum is less than five, it returns x plus one.
44:21
Otherwise, it returns x plus two. So let's take two example tensors, x and x two. In retrospect, I should have called them x one and x two. Well, okay, so x sums to less than five and x two sums to more than five.
44:42
So if you do foo of x, you would get basically x plus one and if you do foo of x two, you would get x plus two. And now let me explain what a tracing JIT would look like.
45:02
So now let's say that we call torch.jit.trace of foo and then you would get a new function called JIT foo. And you pass x through JIT foo and this is what happens.
45:21
There's a tracing engine running somewhere that is going to trace all the operations that are happening on tensors in that particular Python function. So as you execute that line, sum equals x.sum,
45:42
it would record that x was passed into the sum function and you got some output variable t one. And then here sum is just like a Python float so it doesn't record anything there. And then when you return x plus one, it records that you added x to one
46:02
and then that's your return value. So that's your trace that gets recorded and then you would get the output zero, one, two, three, just like above. Now what a tracing JIT is, is it records a trace of what happened
46:21
and then next time when you wanna run that function, instead of running the actual Python function, it will just run the trace again and again. And it can do various optimizations on the trace to make it more efficient. So when you run JIT foo of x two,
46:40
what actually happens is that it will run the trace instead of the actual Python function. It will just sum x two and give an output variable t one, which is basically unused in computing the return value. And then it will add x two to one and then return x two plus one
47:05
and as you see, this is the wrong result. But that's kind of how the tracing JIT works. So basically you would want to apply torch.jit.trace
47:24
to functions that don't have conditionals, that have just a straight-line computation without control flow. So depending on the syntactic sugar you wanna use,
47:41
but instead of using torch.jit.trace, you could just annotate your function with a decorator that will do the tracing for you. Now, I hope you guys got a good understanding of how the tracing JIT works. Now, looking at the benefits, if you go back to this trace here,
48:04
while computing this trace, the first operation sum of x to t one is absolutely useless because you don't use it to compute the return value. So once you get the trace, you can send it to like a compiler optimizer, which will remove the dead code that is not needed.
48:21
It will also do various things like operator fusion. So most of the operators that we use, especially like in PyTorch or NumPy, they're all bandwidth bound, which means that the operations are running as fast as how fast you can transfer them into CPU registers
48:43
from main memory and then back again. So you can do what we call kernel fusion, which is like you can try to combine multiple operations into inner loops and generate a new C function that does like multiple operations at once
49:03
and much more efficiently. Another benefit of doing compilation is you can do stuff like out of order execution, like if you have code that does instruction one, two and three, but instructions one,
49:20
two and three don't depend on each other, you can change the order of instructions to maybe run them more optimally. And lastly, you can do automatic work placement. For example, if some of your CPUs are free and not being used, you can say offload some of the computation onto these free CPUs,
49:41
even though the user has never specified explicitly to do so. In the tracing JIT that we built, we have, at the moment we have optimization passes to do operator fusion and generate
50:02
more optimized code, well code. And another feature we have is you can use a tracer to export your models from PyTorch to another framework. And for example, run it in purely C++ runtimes.
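To make the foo example from the tracing discussion concrete, here is a sketch using the `torch.jit.trace` API as it later shipped; at the time of this talk the feature was still in pull requests, so details may differ:

```python
import torch

def foo(x):
    s = x.sum()
    if s < 5:                 # data-dependent control flow
        return x + 1
    return x + 2

x1 = torch.Tensor([1., 1., 1.])   # sums to 3 -> foo returns x1 + 1
x2 = torch.Tensor([3., 3., 3.])   # sums to 9 -> foo returns x2 + 2

jit_foo = torch.jit.trace(foo, (x1,))   # records the trace taken for x1
print(jit_foo(x1))   # x1 + 1, as expected
print(jit_foo(x2))   # x2 + 1: the trace baked in the "< 5" branch, the wrong result
```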
50:25
And with that, I conclude my talk. The last slide I have is just an overview of who is involved with PyTorch and thank you for having me here.
50:52
If you have any questions, I will take them, though not sure how much time there is. Oh no, yeah. So questions?
51:04
Start raising your hand if you have questions, so I can go to you. So thank you very much for your talk. Concerning the NumPy compatibility, would you expect this project somehow to supersede PyCUDA or PyOpenCL
51:21
in that sense, what it could do also with NumPy? So PyCuda, if I remember correctly, is like a low-level interface to CUDA. So this project is slightly orthogonal to PyCuda in that sense.
51:41
For example, you can write new PyTorch CUDA kernels in PyCuda instead of writing in C++. PyCuda itself, from what I know, doesn't have like a NDRA abstraction and stuff. So I guess it's like a lower-level package
52:00
that you can use in conjunction with PyTorch. I wouldn't expect it to replace PyCUDA. It's just orthogonal. I would expect it to be, there's some packages like, I think, CUDAMat, or some Python CUDA ndarray libraries
52:24
that are lying around. It probably will be a replacement for those. Would you use PyTorch for production kind of things,
52:41
or would you say, and if yes, do you have any pointers for blog posts or so to learn what people do to do that? Sure. Sure. Generally, production means various things. If you're okay with shipping Python into production, yes, we would absolutely use PyTorch into production,
53:02
but it depends on where you work or what your production systems look like. Sometimes you wouldn't wanna ship Python into production, and in that case, we wouldn't wanna use PyTorch into production. We made an explicit choice that PyTorch will have a hard dependency on Python.
53:22
That comes with huge benefits, but also the downside that you have to ship Python wherever you wanna ship PyTorch. But as of my last slide, as I said, the tracer can be used to export models to purely C++ runtimes. So it can use PyTorch for R&D,
53:41
and then you can export the traces and run it in pure C++ runtimes for those kinds of production environments. I was curious. So you said that you can mix, one of the strengths of PyTorch is that you can mix PyTorch with other Python,
54:00
like the rest of the SciPy stack and so forth. How does the automatic differentiation play along if you call into a lot of SciPy functions, which are written in C? So if you want to write a particular AutoGrad node that leverages SciPy, for example,
54:21
you'd write the forward and backward yourself, and then you explicitly are telling PyTorch how to do that differentiation. Any more questions?
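As an illustration of that answer, a sketch of a custom autograd Function whose forward and backward drop into NumPy; the staticmethod-with-ctx style is the one used in current PyTorch, and the exp example is purely illustrative, not something from the talk:

```python
import numpy as np
import torch
from torch.autograd import Function

class NumpyExp(Function):
    """exp(x) computed in NumPy, with the backward written by hand."""

    @staticmethod
    def forward(ctx, x):
        result = np.exp(x.detach().numpy())   # drop to NumPy (CPU only)
        out = torch.from_numpy(result)
        ctx.save_for_backward(out)
        return out

    @staticmethod
    def backward(ctx, grad_output):
        result, = ctx.saved_tensors
        return grad_output * result           # d/dx exp(x) = exp(x)

x = torch.randn(5, requires_grad=True)
y = NumpyExp.apply(x)
y.sum().backward()                            # x.grad now holds exp(x)
```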
54:44
The question is, you talked about fusing operations, will they be supported on the CPU too? Actually, the current status is that we've only implemented this for the GPU at the moment, because that's where they matter much more.
55:02
So we are, almost all of the PyTorch priorities are on GPUs. We first implement things for the GPUs, and then we do it for the CPU. How does this compare to TensorFlow?
55:21
When would you advise to use TensorFlow versus PyTorch? Well, if you ask me, I would always advise to use PyTorch. It's something that's providing the same high-level functionality as TensorFlow.
55:41
It's providing differentiation and providing neural networks and providing a Tensor package and stuff. But in TensorFlow, you can't really use a lot of the Python ecosystem. You can't write TensorFlow nodes that call into Python, for example,
56:02
like use SciPy, for example. And you can't use the standard Python debugging tools. You have to use TensorFlow's own debugger and stuff. So, but the upside of TensorFlow is that when you have the question of like, oh, can we ship to purely C++ runtimes
56:20
or can we ship to mobile, it becomes easier in TensorFlow. So depending on what your trade-offs are, what you care about, use PyTorch or TensorFlow, I guess. Okay, oh, Olivier? Olivier. Can you just ask? Olivier.
56:48
The question is, do you plan to support other hardware platforms besides GPUs? Can you give me examples?
57:06
Right. So, I mean, if they have usable APIs, and I mean, if they have public APIs and stuff, like we have plans to extend PyTorch to other hardware platforms.
57:23
It's just like a matter of when those things mature. Like for example, it's no different for us to support a new hardware platform versus supporting AMD GPUs, for example. Like we have to write another engine
57:41
that will add all of the functionality for that particular AMD GPU. And we've already started testing that and supporting that. So like, it would be equally, in terms of priorities, we care about it. If people care about it enough,
58:01
then we don't have any problems supporting it. Okay, well, thank you very much. Thank you very much.