Virtual HLF 2020 – Talk: David A. Patterson
Formal Metadata
Number of Parts: 19
License: No Open Access License: German copyright law applies. This film may be used for your own use but it may not be distributed via the internet or passed on to external parties.
DOI: 10.5446/49579
Transcript: English (auto-generated)
00:01
It's my pleasure to introduce the next speaker of today's sessions. It is David Patterson, recipient of the ACM A.M. Turing Award in 2017. It's fairly recent. For, and I quote, pioneering a systematic, quantitative approach to the design
00:21
and evaluation of computer architectures with enduring impact on the microprocessor industry. He will talk about the impact of AI on processor architecture, or, to put it more precisely, about the special requirements that AI technology imposes on what processors should provide in terms of functionality.
00:46
There will be time for questions and the audience is encouraged to send questions via the chat channel of the HLF app. And we will then relay these questions to David after the presentation.
01:01
The presentation itself is prerecorded. If there are more questions than can be handled in the remaining time frame, then, of course, we have to stop at some point. But nevertheless, you are encouraged to ask questions via that function. With that said, we can start the recording. David.
01:22
I'm going to talk to you about the importance of computer architecture and how it affects the rate at which we're going to improve artificial intelligence.
01:54
Artificial intelligence started in the 1950s and the dominant way to think about it was a top down approach.
02:02
If we just wrote down all the rules (kind of, if this happens, do that) and got the rule set right, then with the proper logic, intelligence would emerge. But within artificial intelligence, there was another group that said that's impossible: the only way we're going to learn those rules is from the data; you have to go bottom up from the data to create them.
02:26
Human beings can't figure that out themselves. And within machine learning, there's another group that said the only way we can do this is supervised learning, where we try and imitate the neurons of the brain. And this is called deep neural networks.
02:42
Deep neural networks have two pieces: training, where you're deriving a model from the data; and then once you have that model, you infer or serve it, you put it to work. So training could take days, but inference could take milliseconds.
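As a concrete, if toy, illustration of the two phases, here is a minimal sketch in plain NumPy, using logistic regression rather than a deep network: training loops over the data many times to derive the model's weights, while inference is a single cheap forward pass with those weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data set: 10,000 examples, 20 features, binary labels.
X = rng.normal(size=(10_000, 20))
true_w = rng.normal(size=20)
y = (X @ true_w > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# --- Training: many passes over the data to derive a model (the slow part) ---
w = np.zeros(20)
for epoch in range(200):                      # days at real scale
    grad = X.T @ (sigmoid(X @ w) - y) / len(y)
    w -= 0.5 * grad                           # gradient-descent step

# --- Inference/serving: one forward pass with the learned model (the fast part) ---
x_new = rng.normal(size=20)
prediction = sigmoid(x_new @ w)               # milliseconds (microseconds here)
print(f"P(label=1) = {prediction:.3f}")
```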
03:01
And this deep neural network piece is what led to the Turing Award for Hinton, Bengio and LeCun, who got it the year after Hennessy and I did, recognizing those contributions. So what we see as advances in AI is really deep neural networks starting to work.
03:23
So why is it starting to work? These aren't new algorithms that they invented. The algorithms are around 20 years old. What's happened is that we needed a lot more data and a lot faster machines than when those algorithms were invented. It's easy to get more data today from the cloud and Internet of Things devices.
03:41
But what about faster machines? Sadly, just as we need much faster machines, Moore's law is letting us down, because Moore's law is slowing down. As you can see in the graph below, this is Moore's law's prediction of doubling every two years versus the microprocessors that came from
04:00
Intel: in terms of transistors per chip, we're off by at least a factor of 15 from where we would be if Moore's law still held. And that lack of transistors per chip turns into lost performance, as you can see in the graph on the right. In the 1980s and 90s, when Moore's law was alive and well, we were turning those transistors into faster computers, doubling performance about every 18 months.
04:25
At the time, people would throw away perfectly good working computers because they were two years old and much slower than your friend's computer. These days, nobody gets rid of a good computer. My laptop, I throw it away only if the battery breaks or the display breaks or something like that, because the performance is about the same.
04:47
And indeed, in benchmark measurements, it's only improving a few percent per year. So instead of doubling performance every 18 months, we double performance every 20 years. So we need faster machines.
05:01
How are we going to do that without the help of Moore's law? Computer architects think the only way we can do this is domain specific architectures, DSAs. These don't try to do everything, but just do a few things exceptionally well. What this means for computer architects is that five decades of experience in designing general purpose processors may not apply.
05:23
So if you're a researcher, this is an exciting new time: there's innovation, brand new ways to design computers. Oh, boy. However, if you're at a company trying to sell products, it's a very scary time, because which things should you build? And we need these domain specific architectures for both the cloud and the edge.
05:42
So the cloud is simply warehouse-scale computers, with 50,000 servers in them, in remote places. They also have storage in them, and the data centers are distributed around the world and connected via the Internet. That is the cloud: warehouse-scale computers all over the world.
06:02
And the edge is the tiny computers, often battery powered: Internet of Things devices, cell phones, laptops, cars, tennis shoes, everywhere. So that's the cloud and the edge. Let's do them in sequence, starting with the cloud. Well, Google was one of the first to get excited about both deep neural networks and then domain specific architectures.
06:25
In 2013, they calculated that if 100 million users started doing deep neural networks three minutes a day on CPUs, they would have to double the size of their data centers. Not only would that be very expensive, it would take forever to build twice as many data centers in the cloud.
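To get a feel for the kind of estimate being described, here is a hedged back-of-envelope in Python. Only the "100 million users, three minutes a day" figure comes from the talk; the assumed CPU cost per minute of processed speech is an illustrative made-up number, so the result is an order-of-magnitude sketch, not Google's actual 2013 calculation.

```python
# Back-of-envelope in the spirit of the 2013 estimate described in the talk.
users = 100e6                          # from the talk
minutes_per_user_per_day = 3           # from the talk
cpu_seconds_per_audio_minute = 80      # ASSUMPTION: DNN inference is expensive on a CPU

cpu_seconds_per_day = users * minutes_per_user_per_day * cpu_seconds_per_audio_minute
cpu_cores_needed = cpu_seconds_per_day / 86_400   # spread evenly over a day

# The exact answer depends entirely on the assumed per-minute cost; the point is
# that it lands in the hundreds of thousands of extra cores, i.e. data-center scale.
print(f"~{cpu_cores_needed:,.0f} extra CPU cores needed")
```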
06:45
So they started an emergency project whose goal was a factor of 10 improvement over existing CPUs and GPUs. And they gave it very little time, because any day people could start using deep neural networks like that.
07:00
So it was done in just 15 months, from ideas to working hardware and software. And remarkably enough, they exceeded expectations in that short time: it was about a factor of 80 better than the contemporary general purpose CPUs and about 30 times faster than the NVIDIA GPUs.
07:23
And putting this in perspective, these are amazing numbers, because factors of 10 in commercial products are rare. You can sell a lot of products if you're only a factor of two better. This was factors of 10 to 80. Why? Why was it so successful? First of all, an amazing number of arithmetic units.
07:43
It has a 256 by 256 array of arithmetic units: 65,536 multiply-accumulators. Secondly, they were doing the work on 8-bit integer data rather than 32-bit floating point data, so it can be more energy efficient, take less memory capacity and be faster.
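A rough sketch of what an 8-bit multiply-accumulate array computes: quantize floating point inputs to int8 with a scale factor, multiply-accumulate in wider integers, then rescale. The symmetric quantization scheme below is a generic textbook choice, not necessarily what TPUv1 actually does.

```python
import numpy as np

def quantize_int8(x):
    """Map a float32 tensor to int8 plus a scale factor (symmetric quantization)."""
    scale = np.abs(x).max() / 127.0
    return np.round(x / scale).astype(np.int8), scale

# Two small float32 matrices standing in for activations and weights.
rng = np.random.default_rng(1)
A = rng.normal(size=(4, 256)).astype(np.float32)
W = rng.normal(size=(256, 256)).astype(np.float32)

A_q, a_scale = quantize_int8(A)
W_q, w_scale = quantize_int8(W)

# Multiply-accumulate in integers (accumulate in int32 to avoid overflow),
# then rescale back to float: the pattern a 256x256 MAC array implements.
acc = A_q.astype(np.int32) @ W_q.astype(np.int32)
approx = acc * (a_scale * w_scale)

exact = A @ W
print("max relative error:", np.abs(approx - exact).max() / np.abs(exact).max())
```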
08:01
And because it was domain specific, it dropped a lot of the general purpose features that dominate CPUs and GPUs, like caches and branch predictors. This saves area and energy and lets the transistors be reused for domain specific hardware. Okay, TPUv1 was a success.
08:20
TPUv1 was for inference, which is the simpler task. So for TPUv2, they wanted to take on training. That's a bigger task: it's more computation, it needs more memory, and the arithmetic can no longer be simple 8-bit integers. And then they had to figure out which of the good ideas they should carry over and what had to be new.
08:42
Well, the first thing, as I said, is that it takes longer to train. On the Google production applications we were trying to do, training with one chip could take more than a year. Obviously no one's going to wait more than a year to get the results back. And when you think about it, bigger machines and more data lead to bigger breakthroughs in machine learning.
09:04
That was true in 2015, just like it is today. The goal was to build a supercomputer for deep neural networks. And in retrospect, that was a great decision. Here's a result from our colleagues at OpenAI just showing the thirst for machine learning training.
09:23
So if you want to stay at the state of the art, they calculated, going back from 2012 up through 2019, the appetite grows 10x per year. Whereas Moore's law, when it was at full speed, was 10x every five years.
09:41
So the appetite for training at the very state of the art grows dramatically faster. A critical feature for a supercomputer is how the chips talk to each other. Google decided to build inside every chip what we call ICI, the Inter-Core Interconnect.
10:01
Each link in each direction has 500 gigabits per second, and there are four of them per chip. That's pretty phenomenal, but it's not very expensive: it uses only about an eighth of the die to do the distributed switch and the interconnect. And the TPUv2 supercomputer scales up to 256 TPUv2 chips.
10:22
So compared to the classic data center network, the links are faster. It's cheaper because there are no network interface cards and no switches in it. So it's cheaper and faster, and those components don't form bottlenecks. It's maybe five times faster at one tenth the cost.
10:42
So this was a big feature of our domain specific architecture. Then the question, in designing the chip, is how many cores per chip. For TPUv1, we had one core per chip. But, you know, GPUs can have 100 cores per chip. So where should we go? Well, the challenge is that, as the feature size gets smaller with more advanced semiconductor technology,
11:06
the global wires that go across the chip don't scale at all, so the delay increases. That's an argument for not making the core too big. Now, we know that training can use lots of processors, so it's acceptable to use more cores.
11:20
So where we went for advice is Seymour Cray, the greatest supercomputer architect of all time. When he was asked how many cores, he said: if you're plowing a field, what would you rather use, two strong oxen or 1,024 chickens? So we went with two strong oxen. The TPUv2 has two cores per chip, to avoid the longer wires, so it wouldn't have a slower clock cycle.
11:44
And we thought it would be much easier to program two beefy cores per chip rather than, you know, 1,024 wimpy cores. What about the supercomputer arithmetic? The TPUv1 success was this 256 by 256 8-bit integer multiply and accumulate.
12:02
If we did that for 32-bit floating point, that would just be too big, too much area and too much energy to do that on a single chip. And 16-bit floating point is much faster. Typically, it would be eight times faster because the mantissa that you multiply is much smaller.
12:22
So we experimented with doing the arithmetic using 16-bit floating point. Now, as you can see in the format below, the IEEE standard was developed at Berkeley and led to my colleague William Kahan winning the Turing Award. 16-bit IEEE floating point has only five bits of exponent.
12:43
So that is a very narrow range. You need to represent really small numbers when you're training deep neural networks, and five bits of exponent doesn't support numbers that small. So what they found was that if they tried to use IEEE 16-bit floating point, they ran into problems.
13:01
But if they kept the exponent the same as full precision, which is eight bits, they didn't run into those problems. And they didn't need very many bits of mantissa. So this brain float format has the same eight-bit exponent as single precision, but only seven bits of mantissa instead of the 23 bits of single precision. And that worked just fine. So this is a great result, because it's faster.
13:24
It's less die area, because the multiplier dominates: the seven-bit mantissa multiplier is about half the die area of the 10-bit one needed for IEEE FP16, and it's also half the energy. So it works better for software, with less hardware and less energy.
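For readers who want to see the format difference concretely, here is a small NumPy sketch. Brain float (bfloat16) is just the top 16 bits of an IEEE float32 (1 sign, 8 exponent, 7 mantissa bits), whereas IEEE FP16 has 5 exponent and 10 mantissa bits. The conversion below is simple truncation with no rounding, which is only meant to show the format, not how any particular piece of hardware converts.

```python
import numpy as np

def to_bfloat16(x):
    """Truncate float32 to bfloat16 by keeping its top 16 bits (no rounding).
    Real hardware may round; this is just to illustrate the format."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

# bfloat16 keeps float32's 8 exponent bits, so tiny training gradients survive...
g = np.float32(1e-20)
print(to_bfloat16(g))       # still roughly 1e-20: representable

# ...whereas IEEE FP16, with only 5 exponent bits, flushes them to zero.
print(np.float16(g))        # 0.0: underflow

# The price is mantissa precision: 7 bits instead of FP16's 10 or FP32's 23.
x = np.float32(1.2345678)
print(to_bfloat16(x))       # about 1.234 (roughly 2-3 decimal digits)
```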
13:42
So brain float is a much better match to machine learning training. And in fact, as a result, everyone else has embraced it. ARM, Intel, and many startups have decided to include brain float for machine learning. This is the floor plan of the chip.
14:01
The dividing line in the middle shows the two cores, the upper core and the lower core. The matrix multiply units are only about 10 percent of the chip. Interestingly, as I said, the ICI links there in purple are just about 12 percent of the chip.
14:21
Google did a second implementation in the same technology. After doing TPUv2, they thought that with a little bit more work they could do another one that would be even better. So the TPUv3 has about a one-third faster clock, the ICI link bandwidth is about a third faster, and the memory bandwidth is about a third faster.
14:40
The big change is they decided they could do two of these matrix multiply units per core rather than one per core as in TPUv2. So with a one-third faster clock and twice as many multiply units, it's about 2.7 times the peak performance (roughly 1.33 times 2). Now, to go that much faster, they burned a lot more power, so it went from the air cooled design that was on the prior slide to a liquid cooled design.
15:02
In the middle of the slide, you see these gray pipes connecting the four TPUs together; those pipes contain liquid that cools the chips. It has twice as much memory as TPUv2. And it can scale up to a much bigger supercomputer: it goes up to 1,024 chips.
15:22
The die size grew only a little bit, in the same technology, despite these enhancements, basically because the Google designers had a better idea of how to do the layout the second time around. Before we talk about scaling, let's talk about individual chip performance.
15:41
Well, we need benchmarks. For TPUv1, we used production applications to evaluate it. That works for Google, but no one else could use those Google production applications, because we keep them secret. So we helped create benchmarks that the whole industry uses, called MLPerf.
16:01
We, with some other organizations, got in on the ground floor. So we've got two sets of evaluations of the speedup of the TPUv3 over the contemporary GPU, which was the Volta. Using the MLPerf benchmarks, where Google works hard to get them running fast and NVIDIA works hard to get them running fast, it was about a tie.
16:24
It's about the same speed; the geometric mean is one. Now, if we look at Google production applications, it's completely different: the TPU is five times faster. How could that be? Well, the problem goes back to that brain floating point format we talked about.
16:41
Bfloat, which is in TPUs, was easy for Google application developers to use. But they couldn't, and didn't want to, get IEEE FP16 to work on GPUs, because it takes extra work to get the same results: you have to change the software to do that. So we didn't do that. That was our experience, but we're not the only ones.
17:05
At the Vector Institute in Toronto, only one of the 200 people there uses 16-bit floating point when they use GPUs; everybody else uses 32-bit floating point. So they would probably see that same factor-of-five difference there.
17:22
Okay, let's talk about scale-up. Remarkably, AlphaZero, the program that beat all human beings in the world at chess, Go and one other game, scales almost perfectly. With 1,024 TPUv3 chips, it goes 980 times faster, which is 96% of perfect speedup.
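The 96% figure is simply measured speedup divided by chip count; a one-line check:

```python
chips = 1024
speedup = 980                              # measured speedup vs. a single chip
print(f"parallel efficiency: {speedup / chips:.1%}")   # ~95.7%, i.e. roughly 96% of perfect
```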
17:43
And three other applications run at 99%. So it's almost perfect scale-up, just what you want from a supercomputer. How can we compare this domain-specific supercomputer to conventional supercomputers? Well, it's going to be apples versus oranges, but there are some similarities.
18:02
So we're going to use the AlphaZero program, the one that beats everybody at chess. This is the production program playing Go, using real data. And it's using brain float 16 and IEEE 32-bit in its calculations. For supercomputers, the most common benchmark is Linpack.
18:22
This is a benchmark that isn't a real application: you can scale Linpack up to any size to keep all the processors busy. This is called weak scaling. And the data is synthetic, not real; it's just created, so the more chips you have, the more data you get. The other difference is that they're not running the same benchmark.
18:41
And also, this is doing 32-bit and 64-bit floating point instead of 16 and 32. So the results are at the bottom. Remarkably, running a production application, AlphaZero gets 70% of peak. That's 70% of 1,024 chips times the peak performance per chip, which is amazing.
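Those numbers are mutually consistent. Using only the figures quoted in the talk (about 90 PFLOP/s delivered at 70% of peak on 1,024 chips, both rounded), the implied per-chip peak works out as follows; this is a back-of-envelope check, not a published specification.

```python
# Consistency check using only figures quoted in the talk (rounded, so approximate).
delivered_pflops = 90          # "almost 90" petaflops per second achieved
fraction_of_peak = 0.70        # "70% of peak"
chips = 1024

system_peak_pflops = delivered_pflops / fraction_of_peak      # ~129 PFLOP/s
per_chip_peak_tflops = system_peak_pflops * 1000 / chips      # ~126 TFLOP/s per chip
print(f"implied per-chip peak: ~{per_chip_peak_tflops:.0f} TFLOP/s")
```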
19:02
But you can see that for Linpack, the two supercomputers we have here get about 60%. In terms of actual petaflops per second, it's almost 90 for TPUv3 versus about 60 and 1. The reason the slower computer there in yellow is included is that there's another way you can look at the Linpack results.
19:22
It's called the Green500, which re-sorts the Top500 by flops per watt. And the Green500 winner is the Saturn V there, at about 15 gigaflops per watt. But TPUv3 is 10 times better in performance per watt than the number one green supercomputer, which is pretty remarkable.
19:48
Let me conclude. Moore's law is slowing down just at the wrong time. We're going to have to tailor our machines to AI for both the cloud and the
20:01
edge, for machine learning training and inference, because the transistors are not getting that much better. Fortunately, from a computer designer's perspective, the machine learning appetite is ravenous. For example, GPT-3, which was recently in the news, does a pretty impressive imitation of writing human language.
20:23
And what's the big idea? The big idea of GPT-3 is simply: bigger. It has 100 times as many parameters as GPT-2. They had calculated on their own that a factor of 100 should dramatically improve the quality, and it did. From a computer design perspective, being domain specific really simplifies supercomputer design.
20:45
How much performance you need, what type of arithmetic you need to do, how much memory you need, the speed of the network links, the topology: these questions are simpler when you're trying to serve a specific domain, in this case machine learning training, rather than doing everything. Google's TPU supercomputer demonstrates a factor of 50 improvement in performance
21:05
per watt over general purpose supercomputers, which is an amazing benefit. I think this decade is going to go down as a Cambrian era. Computer architecture will see many exotic species appear to try and do this tailoring, and we'll learn later which ones flourish. As you can see, I've magically transported myself here to Heidelberg and I'm ready to answer questions.
21:50
Thank you very much for this great presentation, for this overview. We are still checking the chat channel to see which questions came in from the audience.
22:05
Just checking the technical parameters at the moment. All the right microphones are on and the other ones are off and so on. And I'm receiving the first questions and I'm reading them just as they appear.
22:28
First one, irrespective of the amount of training time, can hardware type, for example, CPU, GPU, TPU cause variation in the learning quality of the deep learning model?
22:41
Can different hardware cause variation in the learning? You're trying not to do that. You're hoping it will be portable across all the applications. If you're writing this in PyTorch or TensorFlow, you don't really want to have to retrain things if it's running on different hardware. So our goal, just like porting any other program, is to be compatible and get the same answers across all the platforms.
23:07
So hopefully the answer is no. Okay, second question is, is the high bandwidth memory considered the go-to domain specific memory for machine learning? How does the machine learning thirst influence the current DRAM technologies?
23:25
Great question. It appears that a lot of this computation is memory intensive. And so the speed of the memory affects how fast you can do things, especially, as I mentioned towards the end, this interest in giant models.
23:42
The two types of memory are static RAM, which fits on the chip, and dynamic RAM, which comes in separate chips. High bandwidth memory is dynamic RAM. As we're getting these really big models, it's just not possible to keep them all in the static RAM on the chip, so the speed of access to that DRAM is very important.
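A hedged back-of-envelope of why DRAM speed dominates for big models: with illustrative round numbers (not the specs of any particular TPU or GPU), reading a large model's weights from HBM takes far longer than the arithmetic performed on them, so the computation is memory-bound.

```python
# Why big models become memory-bound: a rough arithmetic-intensity estimate.
# Hardware numbers below are ASSUMED round figures, not any specific chip's specs.
peak_flops = 100e12          # assumed: 100 TFLOP/s of 16-bit matrix math
hbm_bandwidth = 1e12         # assumed: 1 TB/s of HBM bandwidth

# Serving a large model at batch size 1: every parameter (2 bytes in bfloat16)
# must be read from HBM and is used in roughly 2 FLOPs (one multiply-accumulate).
params = 10e9                # a hypothetical 10-billion-parameter model
bytes_moved = params * 2
flops_done = params * 2

time_compute = flops_done / peak_flops       # ~0.2 ms
time_memory = bytes_moved / hbm_bandwidth    # ~20 ms: memory-bound by roughly 100x
print(f"compute-limited: {time_compute*1e3:.2f} ms, memory-limited: {time_memory*1e3:.2f} ms")
```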
24:01
Another issue in terms of memory technology is actually the cost. And if we're going to start doing models that have hundreds of billions of parameters, we may have to start looking at more exotic memory technologies that are cheaper, to hold all of those parameters.
24:22
Thank you. The next question is a fairly long one; I'll start reading it. Thanks for the nice talk. On slide 14 and some after that, you mentioned the use of a half precision floating point format. So is this a sacrifice of precision for performance? The question is, is precision important in the AI world?
24:43
Besides, how about the use of posits instead of IEEE floating point numbers? Now, in general, I think if you talk to people in machine learning, they say precision isn't as important as it is in scientific computing and high performance computing.
25:04
Because precision is so important in high performance computing, they almost always use double precision, 64-bit floating point, for their calculations. What we found, and what many people have said, is that we don't need that for machine learning; it's a stochastic process. So 16-bit floating point seems to work well enough.
25:25
Now, the IEEE standard, which, as I mentioned, Professor Kahan led, has a 16-bit version with the wrong balance of exponent and mantissa for machine learning. You want to have a bigger range to represent these small numbers.
25:41
Narrow numbers are much more energy efficient and more memory efficient, but you need a bigger exponent than the IEEE 16-bit standard provides. Posits are an interesting research area, and it's still to be determined how important they will be. We can certainly do really well already: I mentioned this other format called brain float, and it's become a standard data type.
26:02
All companies have implemented it. So I don't see floating point formats as an obstacle to making progress. Short question from my side. Did you talk to William Kahan about brain float? What does he think about it?
26:21
Right now in COVID, I don't get to talk to anybody. I did ask him about POSITs and he is not a fan of POSITs. But I haven't had a chance to talk to him to see what he thinks of brain float. I'm sure it'd be a very long conversation. Okay, next question.
26:41
What do you think about using memristors for realizing neural networks? In recent years, there has been an emergence of research on memristors that brings memory and computing close together. What do you think about deviating from the von Neumann architecture this way? I think, in general, there's this area called processing in memory.
27:00
I worked on this earlier in my career; we had a project called Intelligent RAM. It'll be interesting to see if there are pieces of the machine learning problem where this is a good application. The memristor is a more novel memory technology, and I think maybe there'll be some embedded applications where the non-volatility
27:27
of the memristor (it's like flash: if you turn it off, it doesn't forget) is going to be interesting. Basically, what's happened in computer architecture is that a lot of ideas that didn't work for general purpose computing have been resuscitated and are being evaluated for machine learning, for these domain specific architectures.
27:45
Practically every idea we ever tried is being retried here. So I think processing in memory and using more novel memory technologies will be, you know, part of what I call this Cambrian era, and you can see all these investigations. It's hard for me right now to see that that's going to be a winner, but it's a good time for doing these investigations.
28:08
Next question: if you increase the number of parameters for a learning model for better performance, the need for compute increases. However, is it really hardware that needs to catch up, or is it that current algorithms are yet to fully utilize current hardware capabilities?
28:24
Well, I would say it slightly differently. It'd be great if we could come up with algorithms that need less computation; there's a real need for that. This growth of a factor of 10 every year for training, or a factor of 100 every two years, is just pretty unaffordable.
28:44
I think the GPT-3 model that I mentioned, my understanding is, took months to train, even on 8,000 GPUs. So it's a huge computation, one of the biggest computations I know about. The excitement is that, by making it tremendously bigger,
29:02
it gets so accurate that it's becoming even more useful. So that's the temptation. But it would be fantastic if there were less expensive ways to train. Right now, everybody's working on machine learning, and what we're doing in the hardware space is trying to find more cost efficient ways to scale up to really big designs.
29:24
But boy, if we had better algorithms, that would be fantastic. I'd be surprised if, you know, there's some algorithm that could just get all the work done on your desktop CPU; that would be an amazing breakthrough. I think you're going to need more hardware, but it'd be great if it weren't as expensive as it is today.
29:46
So, the last question we are probably able to handle in this session: what do you think about wafer scale processors? Or, in other words, does it make sense to just combine more and more units into one piece of silicon? I think you addressed that briefly.
30:01
Yeah, I'll try and be quick. Wafer scale is where, instead of taking the wafer and dividing it up into pieces, which we call chips or dies, which is the normal way, and then putting them in packages and reassembling them, you just make the whole wafer be the computer. And Cerebras is doing that; that's a startup company. It's a very creative, exciting approach.
30:23
It's not clear yet whether it's cost effective. It's certainly revolutionary. If it proves to be cost effective, then they'll have a head start, because it's an idea that has been around for decades, and many people have gone out of business trying to do it. If Cerebras makes it work, it'd be a very exciting, innovative way of building computers.
30:44
I don't think it's a necessary way, but it could be a big advantage if they can make it work. Thank you very much, David, for your presentation and for the Q&A session. We really appreciate it. And we are, of course, looking forward to seeing you in person again in Heidelberg next year, talking about the latest progress in this area.
31:04
If you have time and interest, you can join our VR space and meet some of our participants and discuss with them, or join some other part of the program. For the time being, thank you very much for joining us. And it was a great talk. To the audience.
31:22
I just want to thank you for all your work on the Heidelberg Forum. It's been fantastic. So we'll miss you when you're gone. There will be a successor and the show will go on, no problem. The next session will start at 20 hours CEST.
31:44
And stay tuned. It will be Alan Kay talking about the process of innovation and problem solving. Again, thank you, everybody.