
5th HLF – Lecture: Deep Learning and the Grand Engineering Challenges


Formal Metadata

Title: 5th HLF – Lecture: Deep Learning and the Grand Engineering Challenges
Number of Parts: 49
License: No Open Access License. German copyright law applies. This film may be used for your own use but it may not be distributed via the internet or passed on to external parties.

Content Metadata

Abstract
Over the past several years, Deep Learning has caused a significant revolution in the scope of what is possible with computing systems. These advances are having significant impact across many fields of computer science, as well as other fields of science, engineering, and human endeavor. For the past five years, the Google Brain team (g.co/brain) has conducted research on deep learning, on building large-scale computer systems for machine learning research, and, in collaboration with many teams at Google, on applying our research and systems to dozens of Google products. In this talk, I'll describe some of the recent advances in machine learning and how they are applicable towards many of the U.S. National Academy of Engineering's Global Challenges for the 21st Century (http://engineeringchallenges.org/). I will also touch on some exciting areas of research that we are currently pursuing within our group. This talk describes joint work with many people at Google. The opinions expressed in this video do not necessarily reflect the views of the Heidelberg Laureate Forum Foundation or any other person or associated institution involved in the making and distribution of the video.

Transcript: English (auto-generated)
Okay, so our next speaker is a winner of the ACM Prize, I think in 2012, and so it's Jeff Dean. He leads the Google Brain Group, which if any of you know is an amazing group doing
really interesting things. And I think he's going to tell us a little bit about what they do. Deep learning and humanity's grand challenges. Jeff Dean. All right, thank you very much.
First off, let me say that I'm thrilled to be here at the forum. I've been really enjoying interacting with the other laureates and with the young researchers, so thank you for having me. I'm going to be presenting work that some of which I've done and some of which my colleagues at Google have done. But first, one of the things I want you to leave this talk with is that deep learning
is really causing a revolution in machine learning. Here's a graph of interest in the term deep learning to the Google search engine over the last sort of four or five years, six years. Here's a graph of the number of machine learning papers posted on arXiv, which is a preprint posting service.
And you can see it's growing faster than the Moore's Law doubling every couple of years, to the point where it's now 50 papers a day being posted on arXiv in the machine learning domain. Here's a graph of NIPS conference registration. So NIPS is one of the two main machine learning conferences. And this is different years, and you can see them color-coded, but basically as the
years go by, the graphs go up. The lines are higher, and that's this year right there. I hope you registered early. Which has caused some tongue-in-cheek cartoons by some machine learning researchers.
This was drawn actually before this year's registration, so they might need to revise the cartoon a little. Okay. But what is deep learning all about? We've had two very nice talks by Alexei and John earlier today that gave you some of the introduction to what kinds of approaches this involves.
But let me just give my own take on it. So really deep learning is just a rebranding of artificial neural networks, which were popular kind of in the 80s and 90s. And they're a collection of simple trainable mathematical units that can learn from observations
to accomplish different kinds of tasks. And they're organized in layers, and they work in this hierarchical manner, where we build complex pattern recognition systems on top of simpler pattern recognition systems in earlier layers. And so they can do things like you put in an image, and you get out a prediction of what is in that image, you know, that's a cat.
Really the techniques that we're using are not all that different from the techniques that were popularized in the 80s and 90s. So what's really new is that we now have scale. We have some new network architectures and some new training methods, but really the fundamental basis is really at its core the same algorithms that were being used in the 80s and 90s.
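To make this concrete, here is a minimal, purely illustrative TensorFlow sketch (my own example, not code from the talk): a small stack of simple trainable layers, plus a single training step of the kind described next, where a prediction is compared to the label and backpropagation nudges every floating-point parameter a little in the right direction.

    import tensorflow as tf

    # A stack of simple trainable units organized in layers: raw pixels in, prediction out.
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(32, 32, 3)),   # raw pixels
        tf.keras.layers.Dense(256, activation="relu"),      # lower-level pattern detectors
        tf.keras.layers.Dense(128, activation="relu"),      # higher-level patterns built on top
        tf.keras.layers.Dense(2),                           # logits, e.g. cat vs. dog
    ])

    optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

    def train_step(images, labels):
        with tf.GradientTape() as tape:
            logits = model(images, training=True)   # forward pass: which "neurons" fire
            loss = loss_fn(labels, logits)          # how wrong was the prediction?
        grads = tape.gradient(loss, model.trainable_variables)            # backpropagation
        optimizer.apply_gradients(zip(grads, model.trainable_variables))  # small adjustments
        return loss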
One of the key benefits for deep learning is that they can learn from very low-level features, very raw kinds of data. So you can put in raw pixels of images, raw audio waveforms, individual characters in a text document, and these systems can then learn hierarchical features on top of
those raw features. So you don't have to sort of figure out what are interesting higher-level features from pixels. It learns that through the course of learning. Here's a kind of cartoon diagram, but I think it's useful to illustrate what happens in the learning process. So in this learning process, we have a very simple task. We want to identify if that's a cat or a dog in an image.
And so the model takes in the raw pixels of an image, and it's going to activate certain pattern recognition neurons throughout the model. Some of them will fire. Some of them will not. And then higher-level ones will then fire based on the behavior of the lower-level ones. And near the top of the model, we'll finally try to make a prediction based on the high-level
features we've learned. Is that a cat or a dog? And in the training process, if we get this right, we're done. If we get it wrong, then through the magic of backpropagation, we can essentially make little adjustments to all the neurons throughout the model, all the parameters in the model, to make it more likely that the next time we see that image, or more importantly, an
image like that, we'll make the correct prediction rather than the wrong one. So that's the basic notion of how you train a neural net, is these sequences of observing what the model does, making little adjustments to the model's behavior by changing its parameters. And the parameters you can think of as just floating-point numbers on every edge in this
connectivity structure in the model. Okay, so what can neural nets do? You've already heard some of this, but you can take in the pixels of a raw image and make a prediction of a category of what that is. It doesn't have to be just two categories, like cat or dog. We can train models with 20,000 categories of objects and distinguish Doberman pinschers
from schnauzers from fire trucks. You can take in the raw audio features of an audio waveform and then learn to emit a transcript of what is being said in there. So a completely end-to-end-learn speech recognition system, you see that, and you emit how cold
is it outside. You can take in text in one language and produce text in another language automatically. Hello, how are you? Bonjour, comment allez-vous? As Alexei already showed, you can take in pixels of an image and not just emit a category, but actually emit a sentence that describes that image, a blue and yellow train traveling
down the tracks. You know, that's pretty magical. And that, you know, if you'd asked me a few years ago, would we be able to do that with computers anytime soon, I would have said no. But that's really a sign of progress, I think, in this field. So what is happening, given that we're using the same underlying algorithms that we were using in the 80s and 90s, at that time, neural nets actually seemed like a really
interesting abstraction. In fact, I did an undergraduate thesis on parallel training of neural nets, because I was so enamored with them. And I felt like if we could just get a little bit more compute, you know, we could make them solve even more impressive problems than they could solve then. And at that time, they could solve kind of very interesting but toy problems.
So I felt like, you know, a 60x speedup on the 64-processor machine I was working on would be just what we need. It turns out what we actually needed was a million-x speedup. And fortunately Moore's Law has come to the rescue, and so we now have a lot more compute. And we can solve problems that we either can't solve in any other way, or that we can
improve on the solutions that we already had for some of these problems. So you've already heard some of these stats. I have a more visual representation of how much computer vision is improved. So there it is in 2011, and there it is now. Right? So we've gone from 26% error for winning the ImageNet Challenge in 2011 to 3% error
in 2016. And that's fundamentally transformative for computing, right? Computers can now see, and they couldn't before. And if you think about when that happened in real biology, that was probably a time of great change. And so that's sort of where we are now, is we can now see, and that opens up our
eyes to all kinds of different things. So in 2008, the National Academy of Engineering, in collaboration with the Chinese Academy and the Royal Society, put out a list of grand challenges for the 21st century. And they had 14 of them. I think it's a pretty good list. You know, a lot of them are, you know, geared around making people healthier, happier,
more productive, advance our knowledge of the world. Good list. I would add a couple others. I think enabling universal communication is important, and building flexible general purpose AI systems also seems like a challenge for this century.
So in this, I actually think machine learning is going to help with all of these to different degrees, but really, it will be at the heart of many of these. And I'm going to talk only about a few of these. I don't want you to get the idea that only the red ones will be aided by machine learning. You know, it's going to help a lot of things. So let's go through a few examples of where deep learning is going to improve our ability
to solve some of these challenges. So one is restore and improve urban infrastructure. At the heart of that is building autonomous vehicles. If we had autonomous vehicles today that worked, and were quite close, you know, all of a sudden our cities would be transformed.
You wouldn't need parking lots, you wouldn't need as many cars to be owned, the pollution levels would be way down, and this is really, getting these on the road is at its heart a perception problem and a, you know, reliable, safe control problem. And you really need to actually understand the world around you from the perceptual inputs you get. So you get a sort of sensors of various kinds, you get LIDAR, you get video data,
and you want to then build a model of what's happening in the road and the world around you so that you can make safe decisions and really accomplish the task of driving autonomously. And we're making pretty good progress there. I think in the next few years, you will see autonomous vehicles actually operating in the wild.
And that will be pretty exciting. Another area where I think machine learning really is going to be impactful is in the area of healthcare. I'll give you a few examples. So one of the areas we've been doing work in in the longest in our group is in the area of diagnosing a disease called diabetic retinopathy.
So you get a retinal image like this, this is something your ophthalmologist does when you go in for an eye exam, they take a picture of your retina, and then they want to see if you have any signs of this disease. Because if you catch it early enough, it's very treatable. But if you don't, it can actually cause blindness. And there's 400 million people at risk for this around the world.
And when an ophthalmologist grades this, they give it a score of one, two, three, four, or five. Now, it's slightly terrifying, as we've learned a bit more about this disease. If you show the same image to two ophthalmologists, they agree on this number 60% of the time. Slightly more terrifying, though, is if you show the same image to the same ophthalmologist
a few hours later, they agree 65% of the time. And it's a pretty big deal, right? The difference between two and three is like, yeah, yeah, go back and come back in a year, and we should get you in next week, right? So this is pretty important. And so to actually collect a training set that we could use for this computer vision problem,
we actually had to get every image labeled by seven ophthalmologists so that we could actually really understand, is it more like a two or more like a three? But once you do that, you can actually train a pretty good model. We collected about 150,000 of these images and got about a million judgments on them.
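As a toy illustration of reconciling several grader opinions per image (my own sketch; the talk does not specify the exact aggregation used), one simple option is to take the median grade, or to keep the full distribution of grades as a soft label that captures "more like a two than a three":

    import numpy as np

    # Hypothetical grades (1-5) from seven ophthalmologists for one retinal image.
    grades = np.array([2, 3, 2, 2, 3, 2, 4])

    median_grade = int(np.median(grades))                         # a single "hard" label: 2
    soft_label = np.bincount(grades, minlength=6)[1:] / len(grades)
    # soft_label = [0, 4/7, 2/7, 1/7, 0]  ->  mostly a two, partly a three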
And we now have a model that is actually better than the median board-certified ophthalmologist. And this work was published in JAMA at the end of last year. And that's pretty exciting. We're actually doing clinical trials in India with this. And we've, with our Verily subsidiary, licensed it to a third-party device manufacturer
to commercialize it. And you're going to see results like this in many other fields of medical imaging modalities. So we've been doing some work in our group on pathology. And we actually now have a tumor localization model that is actually better than the pathologist.
And the interesting thing about that work is that was trained on only 270 images. Admittedly, pathology images are very large, so like 100,000 by 100,000 pixels. But that's only 270 labeled images. And we now have a model that is better at this task. The Karolinska Institute in Sweden did some very nice work on radiology.
And you can see they're getting good results there as well. So medical imaging, because computer vision works, is going to be fundamentally changed because of these problems. Another area we're pretty excited about is sort of higher-level abstract tasks. In particular, can we predict the future for patients?
And that would be pretty useful. And deep learning methods are actually becoming quite good at sequential prediction tasks. So let me detour through a translation task. And then we will come back to medicine. So translation, it turns out, one of the successful models for machine translation
was developed by Ilya Sutskever, Oriol Vinyals, and Quoc Le in our group. The idea is you have aligned sentence pairs, where you have two sentences in different languages that mean the same thing. And you just have lots and lots of examples of those. And you train what's called the sequence-to-sequence model, where you take an input sequence
and you try to predict the output sequence. And in this case, the input sequence is going to be English, or French, I guess, and the output sequence is going to be English. And by observing lots and lots of sentence pairs, you can actually train a model to do this translation task without engineering any higher-level features.
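Here is a highly simplified sketch of the sequence-to-sequence idea (an illustrative toy encoder-decoder of my own, not the production Google system): an encoder reads the source sentence into a state, and a decoder predicts the target sentence token by token, trained only on aligned sentence pairs.

    import tensorflow as tf

    VOCAB_SRC, VOCAB_TGT, DIM = 8000, 8000, 256   # hypothetical vocabulary and layer sizes

    # Encoder: read the source sentence into a fixed-size state.
    src = tf.keras.Input(shape=(None,), dtype="int32")
    x = tf.keras.layers.Embedding(VOCAB_SRC, DIM)(src)
    _, state_h, state_c = tf.keras.layers.LSTM(DIM, return_state=True)(x)

    # Decoder: predict the target sentence one token at a time, conditioned on that state.
    tgt_in = tf.keras.Input(shape=(None,), dtype="int32")
    y = tf.keras.layers.Embedding(VOCAB_TGT, DIM)(tgt_in)
    y = tf.keras.layers.LSTM(DIM, return_sequences=True)(y, initial_state=[state_h, state_c])
    out = tf.keras.layers.Dense(VOCAB_TGT, activation="softmax")(y)

    model = tf.keras.Model([src, tgt_in], out)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    # Trained purely on aligned sentence pairs -- no grammar or parse features supplied.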
So it knows nothing about grammar or nouns or verbs or sentence structure other than observing from these sentence pairs, which is quite remarkable. And so here are the results. Sorry, the colors don't show up that well. But the blue bar at the bottom of each of these language pairs is we're showing
translation quality of the different translation systems. And so the blue line is the old phrase-based translation system, which did not use deep learning. It had a very complicated set of statistical models of how words align in French and English and then a phrase table of likely phrases in French.
And that gets a quality score that you see at the bottom of each of these stacks. And it was also 500,000 lines of code and very complicated and engineered set of four cooperating sort of complex systems. What you see in the sort of horizontal green lines is the new neural-based translation system.
And you can see we got a big jump in quality from doing this. And the other really nice thing is that system is just learning from the observations of the sentence pairs and is expressed in 500 lines of TensorFlow code rather than 500,000
lines of sort of complicated code. The line at the top there is bilingual human translations as judged by other humans. So not professional translators, but someone who's fluent in both languages producing translations. You can see for some of the language pairs, we've nearly closed the gap there. And more importantly, for some of these language pairs, we have so much data that
we're only able to get through one-sixth of the training data that we have for some of these language pairs. And so we know that if we can get through all of it, the quality of these models will improve greatly. When we released this in Japan, we were trying to actually quietly launch it. But a bunch of people in Japan noticed that the quality had improved a lot, including
a Japanese sort of literature professor who then decided to write up a blog post about how much it had improved. And so he ran through the first paragraph of Hemingway's The Snows of Kilimanjaro. And there's the old phrase-based translation system first paragraph when you translate
it into Japanese and then back from Japanese into English. Let's just focus on the last sentence. Whether the leopard had what the demand at that altitude, there is no that nobody explained. Crystal clear. And the new system says no one can explain what leopard was seeking at that altitude.
So it left out the word the. But other than that, it got the entire first paragraph right except for a and the. And so that's really quite dramatic improvements in usability for English, Japanese, and Japanese to English. And you've seen that across other language pairs as well.
OK. Now back to healthcare. So if you imagine a medical record as a sequence and what you're trying to do is take, say, a prefix of that sequence and predict the rest of it or predict high-level attributes about what's going to happen in the rest of it, you can imagine this would be pretty useful for medical kinds of tasks.
In particular, I might want to say, will this patient be readmitted in the next N days? Or what medications should a doctor think about prescribing for a patient in this condition? Or what diagnoses are most likely for this patient? We think we could give doctors a list of five probable diagnoses with calibrated probabilities
for each of them. And that might be quite useful because a doctor's experience, they maybe see 10,000, 20,000 patients in their career, and many rare conditions they might have never encountered themselves. And so if you can say, hey, the likely diagnosis, 93%, is this, but there's a 1% chance
it could be this other thing, that will really, I think, improve healthcare a lot because it might catch medical errors. It might make doctors think, oh, maybe I should do a test for that other rare thing. Engineering better medicines is another one of the challenges. And obviously part of that is understanding chemistry.
And so this is actually work that was done in our group in the context of chemistry but is a trend we've seen across lots of different scientific domains where the pattern is you have some highly complex computationally expensive simulator that can simulate some scientific process for you so that you can then sort of get results
from the simulator and understand the world better. And in this case, for chemistry, you put in a molecule configuration and you run your density functional theory simulator for thousands of seconds or an hour, and then you get out interesting results like what are the quantum properties of this chemical configuration,
does it bind with a given protein, that kind of thing. So it turns out you can actually use a neural net to learn the simulator. And so that's what my colleague George Dahl and his collaborators did. And they developed a new kind of neural net architecture that was particularly good for these kind of graph-based chemical problems.
And they used the output of the simulator, input and output from the real simulator, to train the neural net. And what they now have is a system that uses a neural net to do the simulation rather than the real computational HPC code and is 300,000 times faster. And the accuracy is indistinguishable from the other simulator.
So you can imagine if you're a chemist, all of a sudden if you have a tool that gives you the same results you were getting but is 300,000 times faster, that's going to dramatically change how you do chemistry. You might screen 100 million things and then say, oh, these thousand are more interesting. I'm going to explore those in more detail.
You can read about that on that blog post. And that's a general pattern we've seen. We have a visiting researcher in our group from Harvard who does earthquake science. He's seen the same thing by replacing the inner loop of an earthquake simulator. They took something that was a million CPU hours and made it 100,000 times faster with a teeny, teeny neural net and indistinguishable accuracy again.
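The general pattern just described can be sketched in a few lines (a hypothetical example of my own; it is not the message-passing architecture actually used for the chemistry work): run the expensive simulator to collect input/output pairs, fit a neural net to imitate it, and then use the net as a fast stand-in for screening.

    import numpy as np
    import tensorflow as tf

    def expensive_simulator(x):
        # Stand-in for a slow scientific code (e.g. hours of DFT per configuration).
        return np.sin(x).sum(axis=1, keepdims=True)

    # 1. Collect training data from the real simulator (the slow part, done once).
    X = np.random.uniform(-3, 3, size=(10000, 16)).astype("float32")
    Y = expensive_simulator(X).astype("float32")

    # 2. Fit a small neural net to imitate the simulator.
    surrogate = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(16,)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    surrogate.compile(optimizer="adam", loss="mse")
    surrogate.fit(X, Y, epochs=10, batch_size=256, verbose=0)

    # 3. Screen many candidates with the fast surrogate; re-check the best few with the real code.
    candidates = np.random.uniform(-3, 3, size=(100000, 16)).astype("float32")
    scores = surrogate.predict(candidates, batch_size=8192, verbose=0)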
And so we've seen this pattern repeat itself across a few different scientific domains, and I think it's worth paying attention to. OK, reverse engineer the brain. So one of the things that is important if you're trying to understand the brain is just to understand the static structure of it.
And we don't even really have a very good handle on that today. So neuroscientists are hard at work to be able to reconstruct the structure of the neural wiring diagrams between neurons. And the way you do this today is you start with some brain tissue and then you slice it up.
Patient is hopefully not there anymore. And you then do high-resolution electron microscope scans of the 2D slices. And now you have a whole stack of these slices. And the goal is to actually reconstruct the wiring diagram in that volume of neural tissue. And it's actually pretty hard because you don't know, you know,
does this neuron here line up on the next slice or a few slices up with this thing that looks like another arm of a neuron? And you want to really understand, is it this neuron making its way all the way through the slices? You can view the connectivity in that way. Or if you're more mathematically inclined, you might ultimately like a connectivity matrix
and then be able to do kind of graph analyses of this kind of thing. So my colleague Viren Jain and his collaborators at the Max Planck Institute and at Google have been working on this for a little while. And they've, in the last 18 months or so, made about 1,000x improvement in the accuracy,
the reconstruction error of these models. Essentially the metric they use is how many micrometers of neural tissue can you trace out before you make a mistake. And it turns out that we're now almost to the level of being able to trace out an entire songbird brain without making an error.
And one of the key techniques that they developed was a technique called flood-filling networks, where essentially you use the raw input and then also the predictions you've made so far to make your next prediction. So that you essentially focus in on areas where you're highly confident, and then as you kind of sketch that out and then you lose a little bit of confidence there,
you then jump across to another area where you're now more confident and start to trace things out there. And this really improved the accuracy quite a lot. You can see it doing this in 2D to trace out a cell boundary, but it looks cooler in 3D. And this is extremely computationally intensive.
So this is on a high-end GPU card, and we speeded it up 60x. So you can see it trace along one of the branches as it's more confident, and then eventually it loses confidence and then it switches over to another part of the neural tissue and starts to trace out the branch there. And you get really nice kind of reconstructions like this.
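A rough sketch of the flood-filling idea as described here (a toy 2D version of my own; the real flood-filling networks operate on 3D electron-microscope volumes with a moving field of view): the network sees both the raw image patch and the object mask it has predicted so far, and it repeatedly grows the mask outward from a seed in the regions where it is confident.

    import tensorflow as tf

    PATCH = 33   # assumed toy patch size

    def make_ffn():
        inp = tf.keras.Input(shape=(PATCH, PATCH, 2))       # channel 0: raw image patch
        x = inp                                             # channel 1: current predicted mask
        for _ in range(4):
            x = tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu")(x)
        out = tf.keras.layers.Conv2D(1, 1, activation="sigmoid")(x)  # updated mask
        return tf.keras.Model(inp, out)

    ffn = make_ffn()

    def flood_fill(image_patch, mask, steps=8):
        # Iteratively refine the object mask, feeding predictions back in as input.
        for _ in range(steps):
            net_in = tf.concat([image_patch, mask], axis=-1)
            mask = ffn(net_in)          # confident regions reinforce themselves
        return mask

    patch = tf.random.normal([1, PATCH, PATCH, 1])
    seed = tf.scatter_nd([[0, PATCH // 2, PATCH // 2, 0]], [0.95], [1, PATCH, PATCH, 1])
    result = flood_fill(patch, seed)    # start from a seed point and grow outward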
And so this is, you know, very computationally intensive, but they think the next sort of practical step is to actually reconstruct the wiring diagram of this entire songbird brain. Okay. Perhaps the sort of broadest area and one of the grand challenges
is to engineer the tools for scientific discovery. And we have some take on some of those tools. So one of the things that we think is important is because machine learning is going to be at the core of many of these challenges, we think building tools that allow you to express machine learning ideas
and to try out machine learning research ideas quickly is pretty important. That's something we've been developing in our group. And a system called TensorFlow is actually our second generation machine learning software system that we've built to sort of underlie our research and also some of our production uses of machine learning models.
And we decided we would open source the second generation system, which we did in November 2015, with the hope that we could express a common platform for exchanging machine learning ideas. You know, arXiv papers are great. There's tons and tons of arXiv papers being posted every day. But ultimately when you write an English description of what you've done,
sort of like Leslie Lamport's description, the textual description of a proof is not nearly as good as an executable thing that shows you exactly what was done for a paper and allows other people to build on it or to reproduce those results. So one thing we're hoping and seems to be happening is that people are now open sourcing code associated with their papers
in TensorFlow that allows other people to then experiment and try out ideas. We want this system to be able to run machine learning sort of wherever it wants to run. So that means it runs on things like Raspberry Pis, these tiny little computers that runs on Androids and iPhones
and sort of single desktop machines and large data centers. And we want to make it good for both flexible trying out of new research ideas but also robust enough that we can take production ideas and like put the real translation system out and have that be used by millions of people.
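A small illustration of that "write it once, run it anywhere" goal (a hypothetical snippet using today's tf.distribute API, which is newer than this talk): the same model definition can be trained on a single CPU or GPU, several local GPUs, or a TPU, by swapping the distribution strategy rather than rewriting the model.

    import tensorflow as tf

    def build_model():
        return tf.keras.Sequential([
            tf.keras.layers.Dense(128, activation="relu", input_shape=(20,)),
            tf.keras.layers.Dense(1),
        ])

    # Pick a strategy for the hardware at hand; the model code itself does not change.
    strategy = tf.distribute.get_strategy()            # default: single CPU/GPU
    # strategy = tf.distribute.MirroredStrategy()      # multiple local GPUs
    # strategy = tf.distribute.TPUStrategy(resolver)   # a TPU, given a cluster resolver

    with strategy.scope():
        model = build_model()
        model.compile(optimizer="adam", loss="mse")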
So this is one metric of how much interest there is in a variety of different open source machine learning packages, which is GitHub stars. So when you're interested in a repository on GitHub, which is a popular open source hosting site, you can star it. It's not the finest metric, but it is a metric.
And you can see when we released TensorFlow there was a lot of interest, and that interest has continued to grow relative to a bunch of other open source machine learning packages. And we've now got a pretty vibrant external community working to improve TensorFlow along with our own efforts.
And so we have about 800 non-Google contributors contributing code to TensorFlow. We've done about 1,000 commits a month over the last 21 months, really working to improve the system, add new capabilities, and millions of downloads of the system. And it's starting to be used in various machine learning classes at different universities, Toronto, Berkeley, Stanford,
as sort of the way of expressing machine learning ideas in those courses. Okay, so I've mentioned a few times that computation is a real bottleneck in what we want to do. And deep learning actually has a couple of nice properties that are transforming how we think about building computer hardware.
So two properties I want to address, draw your attention to. So one is that neural nets turn out to be incredibly tolerant of very low precision arithmetic. Like, it's perfectly fine to do kind of the rough calculations you might do in your head, like, oh, 10 times 9, that's about 100.
You know, that's perfectly fine for a neural net. And so that really changes the kinds of circuitry you might build for multipliers. You don't need as much precision. You can pack in way more multipliers on the same chip area. And the other property that they have is that all of the algorithms I've shown you essentially are built out of a handful of very specific operations.
Think of it as linear algebra. You know, basic matrix multiplies, vector operations, 3D tensor equivalents of those. And so if you can essentially build a specialized chip or system that speeds up very low precision linear algebra and does nothing else, that's going to allow you
to really speed up the kinds of computations that are at the core of all these machine learning algorithms. So we've been doing this for a little while. Our first design started about four years ago, maybe a little more, and has been in deployment for about 30 months in our data centers. And the first problem we tackled was inference.
So inference is the process of getting predictions from a model but not training it. So you already have a trained model. You now want to say, is that a cat or a dog or a fire truck or what sentence should I produce for translation? And so that turns out to be an easier problem
for computer architecture design because usually you can get it so that a model fits on a single chip and you just want to run lots and lots of computations through that chip with as low a latency and high a throughput as you possibly can. And so there was an ISCA paper published this year with those four authors and 40 others about that paper, about that chip.
That's the 2-rack system that was used for the AlphaGo match in Korea. And one of the key ingredients of that match was being able to use these custom ASICs to speed up some of the computations done for the Go board calculations.
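To make the earlier point about low-precision tolerance concrete, here is a small experiment of my own (not from the talk): doing the same matrix multiply, the core operation of these models, in float32 and in reduced float16 precision gives answers that differ only slightly, which is typically acceptable for neural-net inference.

    import numpy as np

    # The core of most neural-net computation: a matrix multiply.
    a = np.random.randn(256, 256).astype(np.float32)
    b = np.random.randn(256, 256).astype(np.float32)

    full = a @ b                                                   # float32 result
    low = (a.astype(np.float16) @ b.astype(np.float16)).astype(np.float32)

    rel_err = np.abs(full - low).max() / np.abs(full).max()
    print(f"worst-case relative error: {rel_err:.4f}")             # small -- fine for inference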
But ultimately what we care about is not just inference but training as well. And so really speeding up training is in some sense even more important. You want to make researchers more productive so that you can get an answer to an experimental question you have in an hour instead of a week. If you have that time scale,
that just changes fundamentally the kind of way you do science and the way you kind of make progress in the field. And it also can be used to increase the scale of problems that we can tackle. So the second generation system that we built was really designed to do both training and inference.
And this is a picture of the board which has four chips on it. And this is a designed device that has 180 teraflops of compute, which is quite a lot, 64 gigabytes of memory. And if that amount of compute is not enough, we've also designed them to be connected together into a much larger configuration that we call a pod.
And so these pods have 64 of these boards in them that are all connected together in a two-dimensional mesh with 11.5 petaflops of compute. Now, if you're not used to speaking of petaflops, 11.5 petaflops is quite a lot. The number 10 supercomputer in the world
has about 11 petaflops at its peak, although it's not exactly comparable, because this is a lower-precision computation than that, which is probably full double precision. But quite a lot of compute. And that will allow us to tackle larger problems. And we're going to be building many of these in our data centers
and actually have many of them already deployed. And normally, programming a supercomputer is kind of annoying. It's like lots of fiddly details. These systems are designed to be programmed using TensorFlow. So you express a machine learning computation, and then the same program will run relatively unchanged
on CPUs, GPUs, or on these new TPUs. And you would express the same program to run it on one of these devices, or by scaling up with data parallelism, you can actually then, without any real changes to the program, run it on an entire pod if your problem is big enough. And we're also going to make these available to customers who have their own machine learning problems,
because machine learning is something that's going to be impacting lots and lots of industries. And there's tremendous demand for getting faster turnaround time on experimental runs of machine learning models. We're also making 1,000 of those available for free to researchers who are committed to doing
open machine learning research. So the only requirement is that you will submit a proposal of what you want to do, and that you're committed to publishing the work openly. And you can find out more at that link. Okay, so in the time remaining, let me go through a few trends in the kinds of models we want to train.
So one thing we want, I think, is here's where a bit of biological inspiration is useful, I think. Our brain has many different specialized pieces, and our entire brain is not active for everything that we do, right? When we need to turn on a piece, we do. And so we want some of that same kind of notion
in the machine learning models that we build. We really want huge model capacities so we can remember lots and lots of stuff so that we can remember how to do lots of different tasks. But we only want to activate a small portion of the model so that we're computationally efficient and power efficient.
And so we've been doing a bit of work in this, and we've come up with a piece of a model that you can stick into other models that we call a mixture of experts layer. And so this has a few properties. One is each expert has lots and lots of parameters. So you can think of it as a big matrix
that's going to take in some input and transform it in some way to produce some output. So that's four million parameters or something. And you might have lots and lots of these experts, a couple thousand of them. So that's eight billion parameters, which is quite large as neural nets go. But what we're going to do is we're going to sparsely activate it. So any given example is only going to go through
one or maybe two of these experts, and the rest will be idle. And the input representation is given to a piece of the model that's actually going to learn to route them to the most appropriate expert. And so you can see that as you train these models,
they develop different kinds of expertise, and these are the examples that were routed to expert 381. You know, it's talking about research and innovation and science, and expert 2004 over there seems to like to talk about rapidly changing things and drastic changes and volatile things.
So it's really getting at different kinds of structure in language and different kinds of clusters. And so the interesting thing is that this actually works quite well. So this is a comparison on the bottom row with the production machine translation system. And the main metrics to look at are the BLEU score, which is a measure of the translation quality.
Higher is better. So you see the mixture of experts system in the top row there actually achieves a pretty significant boost in BLEU score. A one-point boost is actually quite large. And it does so with a lot less computation per word,
so roughly half the compute per word. It does have a lot more parameters because it has all these mixture of experts with 8 billion parameters in them. But the other interesting thing, because of that less compute per word, you're actually able to train it in one day on 64 GPUs instead of six days on 96 GPUs. That's about a 10x reduction
in the amount of compute needed to train one of these models, and you get a more accurate model that is then ultimately cheaper for inference. So I think that's a pretty interesting trend is building these sparsely activated models that are going to be quite successful at building bigger capacity and more accuracy with less compute.
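A minimal sketch of a sparsely gated mixture-of-experts layer (illustrative only, with toy sizes and simple top-1 routing; the published layer uses a noisy top-k gate and on the order of thousands of experts). For clarity this sketch runs every expert and masks the results; a real implementation dispatches each example only to its chosen expert, so most experts stay idle.

    import tensorflow as tf

    NUM_EXPERTS, D_IN, D_OUT = 8, 64, 64    # toy sizes

    class MixtureOfExperts(tf.keras.layers.Layer):
        def __init__(self):
            super().__init__()
            self.experts = [tf.keras.layers.Dense(D_OUT) for _ in range(NUM_EXPERTS)]
            self.gate = tf.keras.layers.Dense(NUM_EXPERTS)     # learns to route inputs

        def call(self, x):
            gate_logits = self.gate(x)                         # (batch, experts)
            top1 = tf.argmax(gate_logits, axis=-1)             # each example -> one expert
            mask = tf.one_hot(top1, NUM_EXPERTS)               # (batch, experts)
            expert_out = tf.stack([e(x) for e in self.experts], axis=1)  # (batch, experts, d)
            return tf.reduce_sum(expert_out * mask[..., None], axis=1)   # only chosen expert counts

    layer = MixtureOfExperts()
    y = layer(tf.random.normal([32, D_IN]))    # (32, 64)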
So another thing we're really excited about is automation of the machine learning process. So if you think about how you solve a machine learning problem today, you have some data, you have some compute devices, and then you have a human machine learning expert to stir it all together. They stir it all together,
out pops a solution to your problem, maybe after a lot of perspiration by the human machine learning expert to run lots of experiments. The problem with that is there's probably 10 million organizations in the world that should be using machine learning, but there's probably 1,000 or so that have hired machine learning experts
and are actually really productively deploying machine learning systems to their real problems. So how do we get from 1,000 to 10 million? We could wait to train a lot more machine learning experts, or perhaps in addition to that, we think an interesting approach is to automatically learn to solve machine learning problems.
And so can we turn this into something where we have data and maybe a lot of compute into a solution? So here's one piece of work in this direction, done by my colleagues Barret Zoph and Quoc Le, and the idea is we're going to have
a model-generating model, right? Usually the human machine learning expert sits down and says, okay, we're going to train a nine-layer model with this connectivity and five-by-five filters, and instead we're going to have a model-generating model. The model-generating model will generate a description of a neural network architecture and say we're going to generate 10 models,
we're going to train each of them for a few hours, and we're going to see how accurate they become on a test set or a validation set. And then we're going to use the loss of those 10 generated models, how effective, how accurate they were, to train the model-generating model. So now we can steer it away from models that seem to really kind of suck
and towards ones that were really, really good. And then we're going to repeat that process. Now we have a probability distribution over models that is improved, right? We've now steered it closer to models that seem to work well for this problem. Or perhaps we're going to train this thing to solve not just one problem but thousands of problems.
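A schematic sketch of that outer loop (my own pseudo-ish Python with stand-in functions; in the actual work the model-generating model is a recurrent controller updated with a policy-gradient step):

    import random

    def sample_architecture(controller_state):
        # Stand-in for the model-generating model: emit a hypothetical architecture description.
        return {"layers": random.randint(2, 12),
                "filters": random.choice([16, 32, 64]),
                "kernel": random.choice([3, 5])}

    def train_and_evaluate(arch):
        # Stand-in: build the described network, train it briefly, return validation accuracy.
        return random.random()

    controller_state = {}
    for _ in range(100):                     # repeat the generate / evaluate / update loop
        candidates = [sample_architecture(controller_state) for _ in range(10)]
        rewards = [train_and_evaluate(a) for a in candidates]
        # Use the rewards to update the model-generating model, steering it toward
        # architectures that scored well and away from ones that scored poorly.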
And this works, which is nice. It generates model architectures that look unlike what a human would probably come up with. And so this is one of the best models for this particular problem, which is a small problem in image recognition, CIFAR-10. So it's about 60,000 small images with 10 categories,
like plane and horse and truck or something, car. But the nice benefit is it's been pretty well studied in the machine learning community. And so everything with the last four rows here are human-generated machine learning models, and you can see the progress in improving the state of the art over time
as you go down, as the error rate has been dropping, so lower error rates are better. And this architecture here that was generated on a weekend on hundreds of GPU cards but with no human involvement in designing that architecture gets very close to that state of the art result. And it works in other domains as well. So here's a language modeling problem.
I've now switched metrics, so this is perplexity, so lower perplexity is better. And this is, again, a problem that's been pretty well studied. That's the Penn Treebank language modeling task. And normally people use something called an LSTM cell, which was designed by Jürgen Schmidhuber a while back as a way of modeling
language and sequential processes. And this system came up with an architecture that is somewhat reminiscent of that but different in various ways. And it actually gets better perplexity. We've then taken that architecture and used it on a completely different medical prediction task, and it did better than an LSTM cell.
So we're now working on scaling this up. It's very computationally intensive because you're training models for each of these tasks, and you're doing that thousands of times. But, importantly, it actually works when you scale it up. So each of these black dots here
is a human-generated machine learning research paper for ImageNet, where you can see the x-axis is computational cost of that model. So generally, as you do more compute, you get more accurate results. And accuracy is on the y-axis here. And so the black dots are all human-generated models.
And accuracy has generally been improving over time as well. But each one of those black dots is a human machine learning expert or team sitting down for multiple weeks or months to come up with the best architecture for this problem. And what you see is the red dots are automatically generated architectures with this architectural search.
And you see they're outside the boundary of the black dots. So that should be pretty startling to you, I hope. And you see that they're both, when you want high accuracy, they can give you that with less compute cost. And at low accuracy, they can give you more accuracy at the same compute cost. And you can sort of trade this off
wherever you want on this sort of frontier. So that's pretty exciting, I think. You can also learn the optimization update rule. So here are four ways of updating the parameters in a neural net when you get a gradient. So normally, you take your parameters and you apply a learning rate times some expression.
And here are four human-generated expressions that pretty much all machine learning research that's been published in the deep learning space over the past few years generally use. You use SGD or SGD with momentum or Adam or RMSProp. And so those are four quite well-known techniques.
Which one of them works better for your problem? Sometimes you have to try all four and just see which works better. So we're gonna learn an optimization update rule. We're gonna generate symbolic expressions for updating the parameters. And we're gonna do the same kind of thing. We're gonna generate a whole bunch of symbolic expressions for a particular problem.
And we're then going to train the models with different symbolic expressions and see which ones work better. And so there's the four human-generated ones for CIFAR. In this case, the momentum turned out to be the best of the four, and you can see pretty significant differences between them.
And here's a bunch of symbolic expressions that this model generated. And pretty much all of them are better than any of the human-generated ones. And if you look at them, they kind of make intuitive sense. So many of them have this shared sub-term. Let's see if I can do the laser pointer. Ooh, look at that.
E to the sign of the gradient times the sign of the recent accumulated gradients. Essentially that means if you want to go in the same direction you've been going, you should speed up by a factor of E. And if you want to go in a different direction than you've been going recently, you better slow way down by one over E,
which sort of makes intuitive sense. That's sort of the intuition behind momentum, but this seems to actually work better. And in fact, you can take one of those symbolic expressions that work quite well on one problem and take it to a completely different domain. So it was trained on a vision model. This is it transferred to a language translation task,
and you see it gets better BLEU score and lower perplexity for that task. So I think there's encouraging signs that we can actually use machine learning to improve machine learning itself and to more quickly generate solutions and help to spread the ideas of getting machine learning into the hands of people
so that perhaps you could get it to someone who maybe doesn't know anything about machine learning but has the ability to write sort of a SQL query level of expertise. If we could get them to be able to use machine learning models as well, that would be amazing. Okay, so what might a plausible future look like? Let's combine some of these ideas.
What we want is a large model, sparsely activated. I think one of the problems we have in machine learning today is that we generally train a model to do one thing. We train it to do something perhaps pretty complicated, like translate English to Japanese, but generally we don't train models to do lots and lots of different things.
And really, humans are incredibly good because we know how to do lots of things and then when we encounter a new thing, we can build on our experience in solving the other things we know how to do. And I think we want to dynamically learn and grow pathways through a large model for new tasks. And then obviously we want to run it on fancy hardware
and have machine learning actually efficiently map it onto the hardware. So here's a cartoon diagram of how this might look. So different colors here are different tasks. And imagine we've learned a bunch of pathways for different tasks through this large model and notice that some of the components for these different tasks are shared, right,
because we want to build representations that are good for doing lots of different things and now a new task comes along. So now we can do, for example, instead of doing architecture search, we can do pathway search, which is not that big a leap from where we are today to say let's find good pathways for this new task.
So maybe that pathway works pretty well, gets us into a pretty good state. One benefit of having this heavily multitask approach is I think it will make us be much more data efficient, right? If you know how to do a million things and the million-and-first comes along, then with a relatively small number of examples you can usually figure out how to do that new thing. And maybe we'll decide that this task is really important
and we'll want to add a task specific component and change this pathway a bit. So that's where I think we should be heading is really to build flexible general purpose systems that can do lots of different things. So I hope I've convinced you deep nets and machine learning are producing really big breakthroughs and they're gonna help us solve some of these challenges
and I think there's lots and lots of areas of research. So thank you. Okay, so we can take one question if there's a, yes.
Dean, I'm an error analyst and I hope that the precisions that you use for training are more than twice the precisions that you use in your TensorFlow for prediction.
If you don't do that, then from time to time very strange and inexplicable things will happen. Ah, yes, that is a good observation that generally for training you do need significantly more precision for the training process and then once you have a model and you've sort of frozen it
and are ready to use it for inference, you can actually quantize it down to many fewer bits. So yes, that is definitely a good observation and what we do. And I highly recommend it if you're doing it as well. Okay, so let us thank Jeff. Thank you.