
5th HLF – Lecture: Self-Supervised Visual Learning and Synthesis


Formal Metadata

Title: 5th HLF – Lecture: Self-Supervised Visual Learning and Synthesis
Number of Parts: 49
License: No Open Access License. German copyright law applies. This film may be used for your own use but it may not be distributed via the internet or passed on to external parties.

Content Metadata

Abstract
Computer vision has made impressive gains through the use of deep learning models, trained with large-scale labeled data. However, labels require expertise and curation and are expensive to collect. Can one discover useful visual representations without the use of explicitly curated labels? In this talk, I will present several case studies exploring the paradigm of self-supervised learning – using raw data as its own supervision. Several ways of defining objective functions in high-dimensional spaces will be discussed, including the use of Generative Adversarial Networks (GANs) to learn the objective function directly from the data. Applications in image synthesis will be shown, including automatic colorization, paired and unpaired image-to-image translation (aka pix2pix and cycleGAN), and, terrifyingly, #edges2cats. The opinions expressed in this video do not necessarily reflect the views of the Heidelberg Laureate Forum Foundation or any other person or associated institution involved in the making and distribution of the video.
Transcript: English (auto-generated)
It is my great pleasure to introduce Alexei Efros. He's coming here on stage already. And indeed, it's a first here at the 5th HLF: it's the first presentation from a winner of the ACM Prize in Computing, if I see this right, which is because we added this prize to the list of prizes here just last year, at the meeting last year.
So, this is a talk about computer vision, self-supervised visual learning. Thank you, Alexei. Thank you very much. I'm very humbled to be here in front of this august audience. And I was originally going to give a more technical presentation,
but John did so well in setting up the stage that I thought I would give maybe a little bit more of a kind of an overview talk about visual data. So, everybody's talking about the big data, data deluge, all this data being rained down on us. But I think a lot of people don't appreciate the fact that most of the data is actually visual,
that all these videos and images flowing at us, you know, YouTube claims to have 500 hours uploaded every single minute. The Earth has something like 3.5 trillion images, and half of that has been captured in the last year or so.
So, 74% of traffic is visual, you know, and much of it is cats flowing around the internet. And a lot of it is basically just too big for humans. There is a wonderful little YouTube clone, it's called PetitTube.
So, you go to PetitTube and it plays you a random YouTube video that has exactly zero views. It's never been seen by a human being before, and of course, once you watch it, it will again never be seen by a human being after, because it goes away from PetitTube. So, in the words of Pietro Perona, visual data is the digital dark matter of the internet.
There's just a lot of it and we don't really have a way to access it. And the reason is that basically visual data is difficult to handle. Text is clean, it's compact, segmented, it's one-dimensional, it's indexable.
Visual data is noisy, it's very high-dimensional, it's two- or three-dimensional. And even the simplest operations, something like, you know, comparison, distances between points, turn out to be very hard.
So, you know, with scalars, it's easy to compute a distance. With, you know, word strings, you can do Hamming distance. With, you know, with single pixel brightnesses, you can still say, okay, this is, you know, 50 gray levels brighter than the other one. But what's the distance between these two things? Nobody really knows.
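To make the comparison concrete, here is a minimal sketch (my illustration, not code from the talk) of the three cases he lists: scalars, strings, and raw pixels. The point is that the pixel-wise distance is trivial to compute but is a poor proxy for visual similarity.

```python
# Minimal sketch (not from the talk): the "easy" distances vs. raw pixel distance.
# Assumes numpy is available; the images here are random arrays standing in for photos.
import numpy as np

# Scalars: distance is trivial.
a, b = 3.0, 7.5
print(abs(a - b))  # 4.5

# Equal-length strings: Hamming distance (count of differing positions).
s1, s2 = "penguin", "pengiun"
print(sum(c1 != c2 for c1, c2 in zip(s1, s2)))  # 2

# Images: pixel-wise Euclidean (L2) distance is easy to compute...
img1 = np.random.rand(224, 224, 3)
img2 = np.random.rand(224, 224, 3)
print(np.linalg.norm(img1 - img2))
# ...but it says almost nothing about whether the two images show the same thing:
# two photos of the same penguin from different angles can be farther apart in
# pixel space than a penguin and a fire hydrant.
```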
Now, Borges, of course, had foreseen this problem. In his short story Funes the Memorious, he says, it irritated him that the dog at 3:14 in the afternoon, seen in profile, should be indicated by the same noun as the dog at 3:15 seen frontally. My memory, sir, is a garbage heap.
This is really what the computer is feeling, I believe. This is very much from the Borges short story, that all this data is sitting there, but the connections are often missing. And what can we do about it? Well, one way to do it is to just, you know, get more data. For example, here is your dog at 3:14 in the afternoon,
and you don't know what it looks like, but if you just add more data, there will be a very, very similar dog, okay? That's a very simple idea, but it turns out to be extremely powerful. And here is one example that we did a few years back. So here is an image, and let's say you want to get rid of this foreground castle.
So you go to Photoshop, you erase it, no problem. Now you have a hole. How do you fill in the hole? So what we did is something very, very dumb. We just went to internet, went on Flickr, and just downloaded two million images, and then just searched for the closest image in some very, very simple distance metric. And here is one of the closest images,
and then some computer graphics trickery, and we fill in the hole, okay? Nothing very fancy. It's very, very simple. It's just the data allows you to do this. So here is another example. You don't like the view from your window. You can get a better one. You have some buildup on your favorite beach.
You can get rid of that. Here we get rid of the crane. So why is it working? Well, it's working because look at these. So this is the set of kind of nearest, closest images in our data set. And look at the closest one, right? It's a different city, different river, and yet it looks so similar
because a lot of the time we all, we're boring. We take the same pictures over and over and over again, and we build the same kind of cities over and over again. So a lot of this is helped by the fact that our visual world is very, very structured, okay? So, but there is a limit to this. For example, these two images,
we would feel that they should be close, but actually there is not a single pixel in common. They're actually very, very far in the space or in Euclidean distance or anything like that. They're very, very different. So what can we do?
Well, here is where we can add a little bit of semantics. We can say, well, let's say that both of them are called penguin. I don't know what penguin is, but it's the same thing. Now, can we use this to somehow find a better distance between our images? And this is one way that machine learning comes into the picture.
A lot of the time with dealing with visual data, machine learning is really there as data association. Okay, so here we have an image X, and then we have this label Y, the string penguin. And now we basically have some black box
that tries to associate X with the Y. And so then you give it another penguin and another, and lots and lots of data, lots and lots of data. And this is why convolutional neural networks have turned out to be so powerful because they are able to just eat up a lot of the data.
They're a very high capacity classifier as opposed to the previous one. So a lot of the reasons why the ConvNets are so powerful is because they can just gobble up millions and millions of data in doing this data association. Okay, so now you get a new image.
You have never seen this before. And at this time, this network is gonna, first it's gonna tell you that it's a penguin. Okay, that's good. But in a way, more importantly, it's going to tell you, it's going to have a representation, a vector in some high dimensional feature space,
some embedding space. That is going to be close to the vector for the other penguins that it has seen. So in a way, it's learning a better distance, a better space where distances make more sense. And this turns out to be a really powerful thing for visual data.
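As an illustration of what "a better space where distances make more sense" means in practice, here is a hedged sketch (my example, not the speaker's code) that uses an off-the-shelf ImageNet-pretrained network as a feature extractor and compares images by the distance between their embeddings rather than their pixels. It assumes a recent torch/torchvision install; real use would load actual photos and apply the standard ImageNet preprocessing.

```python
import torch
import torchvision.models as models

# Pretrained backbone as a generic feature extractor (weights download on first use).
backbone = models.resnet18(weights="IMAGENET1K_V1")
backbone.fc = torch.nn.Identity()   # drop the classifier head, keep the 512-d embedding
backbone.eval()

def embed(x: torch.Tensor) -> torch.Tensor:
    """Map a batch of (3, 224, 224) images to feature vectors."""
    with torch.no_grad():
        return backbone(x)

# Stand-in "images": in practice these would be preprocessed photos.
img_a = torch.rand(1, 3, 224, 224)
img_b = torch.rand(1, 3, 224, 224)

f_a, f_b = embed(img_a), embed(img_b)
similarity = torch.nn.functional.cosine_similarity(f_a, f_b)
print(similarity.item())  # high for semantically similar images (two penguins), low otherwise
```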
And this is where deep networks have been used a lot in our field. And this is, John has already mentioned the ImageNet Challenge. You can do things like, given lots and lots of photographs, you can basically detect objects in them. You can even automatically generate captions. Look at this.
A group of people posing for a picture on a skillet. This is amazing. This is like, wow, we're done, we can go home. Well, there are still some problems with this human supervised learning. One obvious one is that getting these human labels is very, very expensive, right? So you have to spend a lot of time
getting humans to click on these labels for thousands or millions of videos. It's very, very expensive; that doesn't scale up, okay? So this is one big problem with supervised learning. Another is that it's often very easy to fool yourself into thinking that the network is doing more than it's actually doing.
So for example, here is an image, and I ran it through a kind of a standard image captioning software, and it says, a car parked by the side of the road. And I say, wow, this is so cool, right? But if you go and you look for cars on the internet, here's just a set of cars from Google Image Search, you can basically say that that description applies
to pretty much all of those images, right? All of them you could say, yeah, it's a car parked by the side of the road, more or less, right? And then I tried it on something like this. And it's a car parked by the side of the road, which is kind of true, it is, it was. What about this?
Well, there is a car, there is a road, and probably that's really all it's getting. It's probably not getting that much more. And a lot of it is just our kind of wishful thinking, to think that it's doing something more than basically just finding a small set of texture patterns.
And so I think this is something that one has to be very careful to make sure that we understand that there is no magic, you're getting whatever you put in. And so, in a way, this idea of kind of learning connections between images using words,
well, there is a certain issue with the fact that the world, the visual world, is just so much richer than the world of words. And so it's not a one-to-one mapping. You're losing a lot of information, a lot of what we, we don't have words for many things that we see.
So here is just one example. Here is a picture of Pittsburgh, Pennsylvania, and Paris, France. And they happen to be labeled the same English word, city. That's what language tells us. But actually, visually, they're very, very different, extremely different. And you can say, well,
they all have buildings. Well, but the buildings look different. Well, all the buildings have windows. But the windows, every single pixel is different. So in a way, you're really asking the computer a very hard, maybe even impossible task, somehow finding connection between those two things. And so, in a way, this kind of going through
these word labels is, in a way, kind of a language bottleneck. And so what we have been doing in my lab is to try to see if we can somehow get rid of this language bottleneck, and see if we can get visual data to be taken on its own merits, to be a first-class citizen.
So I will show you a few examples of some of the stuff that we have been doing. One kind of overarching idea is this idea of self-supervised learning. So it's a supervised learning paradigm. But the idea is that the labels, they don't come from some human labeler. They come from the data itself.
So the data is supervising itself, okay? So let me explain to you what I mean. So here is one idea. So the idea is to have a pretext task, which is maybe not very useful, but which we have infinite labels for, and try to get the computer to be good at this task.
Here is one example task. So the task is I have two patches taken from an image, A and B. And I want to ask the computer, what is their spatial arrangement? So given that the patch A is here, where should patch B go? So now let's pretend that you are all computers, where should patch B go?
Lower right, very good. Now let's try to introspect. In psychology, you're not supposed to, but we're okay, we haven't seen any psychologists here. When we introspect, why did we do this? Well, the top thing kind of looks like the top of a bus. The side looks like the side of the bus.
Once I can remember how buses look like, I have this kind of geometric connection, and then I can just kind of import it there. Now imagine that you have never ever seen a bus in your entire life. This would be an impossible task. You would just not be able to do this. So now if we force the computer
to try to train to solve this task, the hope is that it will have to learn about buses. It will have to learn about some semantics just to be able to do this task. And so even though the task itself is not very useful, nobody really needs to know this solution to this task, the hope is that we'll actually force the computer
to learn something about the visual world. And so here is the setup we have. So we basically took millions of images. For every image, we pick a random patch, and then we put eight other patches around it, and then we basically train a Siamese-type network
that given two patches, it basically predicts one of eight possible outputs. And then we train this for six weeks on a GPU. Very long time. It's not an easy problem. Even humans are not very good at this problem. But the idea is that once you train it on this useless pretext task, then you can look at the learned representation that you get.
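A toy sketch of this setup (my simplification, not the architecture from the actual paper): a shared encoder embeds both patches, and a small head classifies the relative position of patch B into one of the eight neighboring slots, so the label comes from the image itself.

```python
# Toy sketch of the context-prediction pretext task (not the paper's architecture):
# a shared ("Siamese") encoder embeds two patches, and a head predicts which of the
# eight neighboring positions patch B came from, trained with cross-entropy.
import torch
import torch.nn as nn

class ContextPredictor(nn.Module):
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        # Shared patch encoder.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        # Head over the concatenated embeddings -> 8 relative positions.
        self.head = nn.Sequential(nn.Linear(2 * feat_dim, 256), nn.ReLU(), nn.Linear(256, 8))

    def forward(self, patch_a, patch_b):
        return self.head(torch.cat([self.encoder(patch_a), self.encoder(patch_b)], dim=1))

model = ContextPredictor()
loss_fn = nn.CrossEntropyLoss()

# Stand-in batch: pairs of 96x96 patches and the true relative position (0..7),
# which comes for free from where the patches were cropped.
patch_a = torch.rand(16, 3, 96, 96)
patch_b = torch.rand(16, 3, 96, 96)
target = torch.randint(0, 8, (16,))

loss = loss_fn(model(patch_a, patch_b), target)
loss.backward()
```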
So you can get rid of one of those things, and you look at what is the feature embedding that you get, and the hope is that in that embedding, distances in that embedding would be somehow better. So here is one very simple example. We take an input image, a little patch, cat patch,
and then we look for its nearest neighbors, the closest patches in this new learned embedding. And the nice thing is that the nearest patches from a whole bunch, from millions of other images, they're all also cats, which is kind of remarkable
because training was done one image at a time. Nobody has ever told the computer, these two things are both called cats, right? Before we had the images, we said penguin, penguin, penguin. Here, nothing like this. The training was done in a single image at a time, and yet, through context, it learned to find these correspondences across categories.
So it kind of strung together categories from instances all by itself. And so this, we thought, was a very hopeful result because it might get us to where we want to go. But of course, this is very slow, and you basically get one bit of information
per training pair. And so you can do better, you can do faster. For example, we can just do a prediction of all pixels at once. So what is the way to set up this kind of prediction problem where we don't have human labels? Well, one example is we can predict color from an image.
So we can take an image, we can split it into the grayscale and the color components, and then we basically try to train a network to predict the color from the grayscale, okay? So we have half the data predict the other half of the data, okay? And of course, the nice thing is that
you could put it back together and you have a nice colorized image, but kind of more exciting is that hopefully the representation is being learned while it's doing this task is going to somehow be meaningful, okay? And you can do this task forever because you don't need any label data. Basically, this process labels its own data for you, okay?
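Here is a minimal sketch of the "half the data predicts the other half" idea. It is my simplification: plain L2 regression from a luminance channel to two chroma channels, whereas the system described in the talk works in Lab space and, as he discusses shortly, needs a more careful loss than plain L2.

```python
# Hedged sketch of colorization as self-supervision (my simplification, not the
# actual system): split each image into luminance and chroma, and train a small
# network to predict the chroma from the luminance with a plain regression loss.
import torch
import torch.nn as nn

def rgb_to_luma_chroma(rgb):
    """Rough YUV-style split: one luminance channel, two chroma channels."""
    r, g, b = rgb[:, 0:1], rgb[:, 1:2], rgb[:, 2:3]
    y = 0.299 * r + 0.587 * g + 0.114 * b
    return y, torch.cat([b - y, r - y], dim=1)

colorizer = nn.Sequential(            # tiny fully-convolutional "colorizer"
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 2, 3, padding=1),   # predicts the two chroma channels
)

rgb = torch.rand(8, 3, 64, 64)        # stand-in for a batch of real photos
luma, chroma = rgb_to_luma_chroma(rgb)

pred = colorizer(luma)
loss = nn.functional.mse_loss(pred, chroma)   # the image supervises itself
loss.backward()
```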
So of course, you know, show some pictures. So this is Ansel Adams and we made it a little bit colorized. Here is another, you know, this is not supposed to be art, it's just to show that the algorithm is actually doing something reasonable. Even the mistakes are kind of fun. So can you see what the mistakes here are?
It's not very bright, unfortunately, but here I will toggle. So it puts some pink underneath the chin, right? And the reason it did this is because the dogs in the training data have their tongues out. And so, but this is a very good point. This basically means that whatever it's learning,
it's not some low-level signal. It's actually recognizing that it's a poodle, I guess, and then saying, well, all the poodles I have seen before have their tongues out, so probably this one also has its tongue out. Okay, so that actually suggests that it's actually learning something, something higher level and semantic about this problem. Okay, but whenever you're doing this prediction task,
things are a little bit tricky because, so what you're doing is you're doing two things, right? So first is you're training a network, which is basically a mapping from some, you know, input domain to the output domain. And then you're actually trying to make this mapping
minimize this equation. Basically you want to say that you want whatever the F produces to be as close as possible to Y. So this F is the neural network and L is the objective function or the loss, which says, what do I want it to do? And so in this case, we want F of X to be as close to Y as possible.
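The equation on the slide is not reproduced in the transcript; in standard form, the setup he is describing is roughly:

```latex
F^{*} \;=\; \arg\min_{F}\; \mathbb{E}_{(x,y)}\big[\, L\!\left(F(x),\, y\right) \big],
\qquad \text{e.g. } L\!\left(F(x),\, y\right) = \lVert F(x) - y \rVert_{2}^{2}.
```

Under the squared (L2) loss, the minimizer is the conditional mean of y given x, which is exactly why multimodal outputs get averaged, as the bird example that follows shows.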
But now we're back to this problem that close in what sense? You know, in L2, L1, what's the distance that we're trying to get? And of course, the standard thing is to do something like L2, but in high dimensional data like images, that doesn't work very well. So here is an input image,
and this is a regression with L2 distance, and it doesn't look very good compared to the ground truth. And the reason is, well, one of the reasons is basically, imagine that you have multiple modes. Imagine that this bird could be blue or it could be red. What is L2 going to do? It's going to try to make both of those happy. It's going to do something that splits down the middle.
So it's going to do something in the middle, and then what's in the middle? In the middle, it's green. So that's why neither one is going to be happy, okay? So it's not very good when you have multiple modes in the data. And so, you know, we did some fancy things with, you know, doing cross-entropy loss with boosting the colors,
and my student Richard Zhang has spent a lot of time really fitting the right loss function. And we got something like this. But the problem is that whenever you do something by hand, sometimes there are, you know, unintended consequences. So this image, for example, it just over-colorizes it. The back wall, you know,
should really be white, and it's put some yellows and stuff in there. And so what we really would love is to have some sort of a universal objective function that just works for any of these what we call image-to-image translation problems. Just find something that will tell us if this is a good image or not.
And this is, you know, this is a good thing because we do know what the right answer should look like. We have a whole bunch of real images that look real. So all we need to do is somehow try to have our outputs be indistinguishable from real data.
And just like John mentioned already, there is this wonderful paper by Ian Goodfellow and colleagues called Generative Adversarial Networks that does exactly this. And it's just perfect for this task. Basically what it does is it takes the generator, puts a discriminator on top of it, and says,
can I tell the difference between whatever I have produced and a real photo? And it's a minimax optimization. In the end, hopefully the discriminator will say, okay, I give up, I cannot tell the difference. And then we win. And so we have our generator where it does the colorization, for example, and then we have this discriminator. And so G is trying to synthesize fake images
that fool D. D tries to identify those fakes. And one way to think about what it's doing, which is kind of nice, I think, is that from G's perspective, D is kind of like a loss function. It's kind of like your objective. It's basically kind of like L1 or L2 distance, but it's learned.
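Written out, the learned-loss idea is roughly the conditional objective from the pix2pix paper (paraphrased here, dropping the noise input for brevity; the lambda term is the conventional pixel loss kept alongside the adversarial one):

```latex
\min_{G}\,\max_{D}\;\;
\underbrace{\mathbb{E}_{x,y}\!\left[\log D(x,y)\right]
\;+\; \mathbb{E}_{x}\!\left[\log\!\big(1 - D\!\left(x,\, G(x)\right)\big)\right]}_{\text{adversarial term: a learned ``loss'' for } G}
\;+\; \lambda\,\mathbb{E}_{x,y}\!\left[\lVert y - G(x)\rVert_{1}\right]
```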
Basically, D tells G how to get better, how to push towards reducing the error, not in some predefined way, but learned for this particular problem. And this turns out to be a very, very powerful technique. And so we have recently had a paper
basically trying to use this for a whole bunch of what we call image-to-image translation problems. So for example, we tried it for colorization and it seems to work. And then we take exactly the same approach, retrain on new data. For example, we downloaded images from Google Maps and we made it produce images
of the satellite view of that same place. So going from Google Maps to satellite, and here is the ground truth so you can see that it's a little bit blurrier, but it's actually doing pretty well. And of course, we can do it the other way around, going from satellites to Google Maps. Unfortunately, you can't really see very well here. And it's exactly the same code,
exactly the same parameters because the GANs, they basically learn the right metric automatically for us. We can go from day to night. Again, same code, same parameters, just train on different things. We can go from edges to images. So train on edges and produce images.
And then, using that trained network, we tried it on kids' sketches of things and it's still able to do something reasonable, right? We posted the data and the code online about a year ago. We called it pix2pix. And then lots and lots of people
were able to just create stuff with it. And not even scientists, like there's a bunch of artists that would just use the code. It turns out to be very, very easy now to just get the code and just start working with it. Maybe you have seen there's this cat sensation. People trained it on cats
and started to have a little interactive cat thing. So you can try it at home, this little app. You draw something and then it calls our code and then you can basically catify anything because of course cats are what internet is all about.
Now the only problem with the current setup is that there's still some supervision. And the supervision is that we have to give pairs of training data. So we say okay, edges and the corresponding image, right? But what if we want to get rid of it?
Sometimes this paired supervision data is just not available. For example, I have a whole bunch of photos that I took while on different trips and I want to translate them into the style of say Cezanne. But I don't have any correspondences. I have not taken a picture exactly where Cezanne has painted his painting, okay?
How can I do this? Well, one can try to do this again with this discriminator from the GAN. We can say given some instance in domain X, let's train a generator to produce something in domain Y that looks like Cezanne.
Now that's reasonable but most of the time it doesn't work because it's too open-ended. It will just produce something of Cezanne but it might not have anything to do with their initial X. And so this is where we can give more constraints by adding what we call cycle consistency to basically try to go back to the original image.
So the idea here is you go from the input image to the other domain. We don't know what it should be in the other domain, but then we can translate it back, and hopefully it should be close to what it started with. And this is something that, as John mentioned, people use in language all the time.
In fact, Mark Twain did this exercise translating, finding a French translation of his short story and then translating it back into English and being horrified at the results. But this back translation idea is actually being used in language all the time and so we basically adopted it to visual data.
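For reference, the cycle-consistency idea he describes can be written, roughly following the CycleGAN formulation, as an extra penalty for not returning to the starting image after a round trip through both translators G: X to Y and F: Y to X:

```latex
\mathcal{L}_{\text{cyc}}(G, F) \;=\;
\mathbb{E}_{x}\!\left[\lVert F\!\left(G(x)\right) - x \rVert_{1}\right]
\;+\;
\mathbb{E}_{y}\!\left[\lVert G\!\left(F(y)\right) - y \rVert_{1}\right],
```

```latex
\mathcal{L}(G, F, D_X, D_Y) \;=\;
\mathcal{L}_{\text{GAN}}(G, D_Y)
\;+\; \mathcal{L}_{\text{GAN}}(F, D_X)
\;+\; \lambda\,\mathcal{L}_{\text{cyc}}(G, F).
```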
And so here now we have a couple of photos that I took in Paris and here is what happens if we translate them to be more like Cezanne. And people have, as John has shown, there are some other work that has been doing this kind of things. Now I'm particularly proud of our stuff,
the Cezanne clouds I think. I think we have the best Cezanne clouds, because we are basically able to use not just one Cezanne image but his whole output, a thousand images of Cezanne. So it's really a domain-to-domain translation. But people have done this before. Now what people haven't really done before is to go the other way.
So starting with say this Monet, can you make it look like something like a real image? And we're not quite there yet, it's not quite there but we feel like it's getting close. That really, fooling humans is on the horizon, maybe in a couple of years.
So here is another example, here's a Monet and this is kind of our version and some of it is really looking quite good and some of it doesn't. And of course it would be nicer if I showed some Cezanne examples instead of Monet but Cezanne doesn't work as well. So still future work. So here are some other examples
of translating a photograph into different styles. I just like to show off my travel photos here. We can also translate other things. For example we can translate seasons. We can go from summer to winter and of course back again. We can go from oranges to apples
or from apples to oranges. We can go from horses to zebras and back again. And amazingly, even just running it one frame at a time, without any temporal consistency, it's doing okay, although look at the tail.
The tail is definitely weird here. Yeah, but even the failures are kind of fun. So this, I showed this in my talk in Moscow, and I thought they're not gonna let me out. But it actually makes sense.
There is actually no supervision. Nobody told the network what a horse looks like, what a zebra looks like. So it's really trying to find some correspondence. It's kind of really doing this bipartite matching to see what corresponds to what in this translation. It's like two visual languages without a dictionary. So maybe it's not that weird that it thought
that Putin was part of the horse, right? Because it doesn't really have any extra data. And of course it makes sense to try to add more information to this. So in conclusion, visual data is really the biggest big data we have right now. And it's time that it starts being treated
as a first-class citizen, but it's hard. So hopefully using deep learning and in particular this idea of self-supervised learning might be a trick to get us there. Thank you very much.
Thanks a lot for this beautiful and inspiring talk. I think we have time for one or two quick questions. Somewhere from the back over there. Yes? Can we have a microphone over there?
Is the mic on the way? Yes. Does this technique offer a view toward solving the learning from one image problem that John described?
So again, the learning from a single image? Well, it is kind of in that direction. You can think of it that there is no learning from a single example. Basically the idea would be that you're learning
all your life without any supervision. And then finally at the very end, you give an image and say, okay, this is a fire truck. And then boom, it just connects it with the representation that you have built up. So in a way, that's the idea here, that you build up this representation and then just one or just a few labels
should get you there. And we're not quite there yet. We have some experiments on pre-training using the self-supervised data and then testing on some smallish data sets like Pascal. And we're getting reasonable results, but they're not something to write home about yet. So I think this is definitely still a big problem
and hopefully some of you guys can help us solve it. Thank you. So you mentioned before that most of the deep learning
results right now: most of the data available on the internet is images, so most of the results deal with image or vision problems. So what are your thoughts for the future of deep learning with data that cannot be embedded in a Euclidean space, or that is not
naturally represented as a vector? Because, I mean, then there might be some difficulties applying these methods to such data. So I think methods for data that is not in Euclidean space are developing, for example, various types of embeddings, meshes in 3D graphics, for example.
I think the places where deep learning will help are when your data is high dimensional and noisy. I think those are the two requirements. It needs to be high dimensional and because it's high dimensional, there is a lot of weird things going on. I think for low dimensional data,
probably deep learning might not be that useful because other methods are already reasonably good. But the reason why deep learning made such a big revolution in something like audio and video was because it's a very, very high dimensional space with lots of junk in there and other methods just couldn't deal with all that junk.
And so I think you want to look for, I presume something like financial data, if you look at the whole stock market, that's probably high dimensional enough. So that makes sense. But I think you really want to look for problems where you have the dimensionality is high but the intrinsic dimensionality is maybe much lower.
And that's what deep learning seems to be good at. But again, as John mentioned, there is a lot of questions and I think there's more questions than answers right now. We know that it works amazingly well for some problems but the theory is still lacking, so there is plenty to do.