
5th HLF – Lecture: Self-Supervised Visual Learning and Synthesis


Formal Metadata

Title: 5th HLF – Lecture: Self-Supervised Visual Learning and Synthesis
Number of Parts: 49
License: No Open Access License. German copyright law applies. This film may be used for your own use but it may not be distributed via the internet or passed on to external parties.

Content Metadata

Abstract
Computer vision has made impressive gains through the use of deep learning models, trained with large-scale labeled data. However, labels require expertise and curation and are expensive to collect. Can one discover useful visual representations without the use of explicitly curated labels? In this talk, I will present several case studies exploring the paradigm of self-supervised learning – using raw data as its own supervision. Several ways of defining objective functions in high-dimensional spaces will be discussed, including the use of Generative Adversarial Networks (GANs) to learn the objective function directly from the data. Applications in image synthesis will be shown, including automatic colorization, paired and unpaired image-to-image translation (aka pix2pix and cycleGAN), and, terrifyingly, #edges2cats. The opinions expressed in this video do not necessarily reflect the views of the Heidelberg Laureate Forum Foundation or any other person or associated institution involved in the making and distribution of the video.
Transcript: English (auto-generated)
It is my great pleasure to introduce Alexei Efros. He's coming here on stage already. And indeed, it's a first here at the 5th HLF: it's the first presentation from a winner of the ACM Prize in Computing, if I see this right, which is because we added this prize to the list of prizes here just last year, at the meeting last year.
So, this is a talk about computer vision, self-supervised visual learning. Thank you, Alexei. Thank you very much. I'm very humbled to be here in front of this august audience. And I was originally going to give a more technical presentation,
but John did so well in setting up the stage that I thought I would give maybe a little bit more of a kind of an overview talk about visual data. So, everybody's talking about the big data, data deluge, all this data being rained down on us. But I think a lot of people don't appreciate the fact that most of the data is actually visual,
that all these videos and images flowing at us, you know, YouTube claims to have 500 hours uploaded every single minute. The Earth has something like 3.5 trillion images, and half of that has been captured in the last year or so.
So, 74% of traffic is visual, you know, and much of it is cats flowing around the internet. And a lot of it is basically just too big for humans. There is a wonderful little YouTube clone, it's called PetitTube.
So, you go to PetitTube and it plays you a random YouTube video that has exactly zero views. It's never been seen by a human being before, and of course, once you watch it, it will again never be seen by a human being after, because it goes away from PetitTube. So, in the words of Pietro Perona, visual data is the digital dark matter of the internet.
There's just a lot of it and we don't really have a way to access it. And the reason is that basically visual data is difficult to handle. Text is clean, it's compact, segmented, it's one-dimensional, it's indexable.
Visual data is noisy, it's very high-dimensional, it's two- or three-dimensional. And even the simplest operations, something like, you know, comparison, distances between points, turn out to be very hard.
So, you know, with scalars, it's easy to compute a distance. With, you know, word strings, you can do Hamming distance. With, you know, with single pixel brightnesses, you can still say, okay, this is, you know, 50 gray levels brighter than the other one. But what's the distance between these two things? Nobody really knows.
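To make the comparison concrete, here is a minimal sketch (my illustration, not code from the talk) of the three cases he lists: scalars, strings, and raw pixels. The point is that the pixel-wise distance is trivial to compute but is a poor proxy for visual similarity.

```python
# Minimal sketch (not from the talk): the "easy" distances vs. raw pixel distance.
# Assumes numpy is available; the images here are random arrays standing in for photos.
import numpy as np

# Scalars: distance is trivial.
a, b = 3.0, 7.5
print(abs(a - b))  # 4.5

# Equal-length strings: Hamming distance (count of differing positions).
s1, s2 = "penguin", "pengiun"
print(sum(c1 != c2 for c1, c2 in zip(s1, s2)))  # 2

# Images: pixel-wise Euclidean (L2) distance is easy to compute...
img1 = np.random.rand(224, 224, 3)
img2 = np.random.rand(224, 224, 3)
print(np.linalg.norm(img1 - img2))
# ...but it says almost nothing about whether the two images show the same thing:
# two photos of the same penguin from different angles can be farther apart in
# pixel space than a penguin and a fire hydrant.
```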
Now, Borges, of course, had foreseen this problem. In his short story Funes the Memorious, he says, it irritated him that the dog at 3:14 in the afternoon, seen in profile, should be indicated by the same noun as the dog at 3:15 seen frontally. My memory, sir, is a garbage heap.
This is really what the computer is feeling, I believe. This is very much from the Borges short story, that all this data is sitting there, but the connections are often missing. And what can we do about it? Well, one way to do it is to just, you know, get more data. For example, here is your dog at 3:14 in the afternoon,
and you don't know what it looks like, but if you just add more data, there will be a very, very similar dog, okay? That's a very simple idea, but it turns out to be extremely powerful. And here is one example that we did a few years back. So here is an image, and let's say you want to get rid of this foreground castle.
So you go to Photoshop, you erase it, no problem. Now you have a hole. How do you fill in the hole? So what we did is something very, very dumb. We just went to internet, went on Flickr, and just downloaded two million images, and then just searched for the closest image in some very, very simple distance metric. And here is one of the closest images,
and then some computer graphics trickery, and we fill in the hole, okay? Nothing very fancy. It's very, very simple. It's just the data allows you to do this. So here is another example. You don't like the view from your window. You can get a better one. You have some buildup on your favorite beach.
You can get rid of that. Here we get rid of the crane. So why is it working? Well, it's working because look at these. So this is the set of kind of nearest, closest images in our data set. And look at the closest one, right? It's a different city, different river, and yet it looks so similar
because a lot of the time we all, we're boring. We take the same pictures over and over and over again, and we build the same kind of cities over and over again. So a lot of this is helped by the fact that our visual world is very, very structured, okay? So, but there is a limit to this. For example, these two images,
we would feel that they should be close, but actually there is not a single pixel in common. They're actually very, very far in the space or in Euclidean distance or anything like that. They're very, very different. So what can we do?
Well, here is where we can add a little bit of semantics. We can say, well, let's say that both of them are called penguin. I don't know what penguin is, but it's the same thing. Now, can we use this to somehow find a better distance between our images? And this is one way that machine learning comes into the picture.
A lot of the time with dealing with visual data, machine learning is really there as data association. Okay, so here we have an image X, and then we have this label Y, the string penguin. And now we basically have some black box
that tries to associate X with the Y. And so then you give it another penguin and another, and lots and lots of data, lots and lots of data. And this is why convolutional neural networks have turned out to be so powerful because they are able to just eat up a lot of the data.
They're a very high capacity classifier as opposed to the previous one. So a lot of the reasons why the ConvNets are so powerful is because they can just gobble up millions and millions of data in doing this data association. Okay, so now you get a new image.
You have never seen this before. And at this time, this network is gonna, first it's gonna tell you that it's a penguin. Okay, that's good. But in a way, more importantly, it's going to tell you, it's going to have a representation, a vector in some high dimensional feature space,
some embedding space. That is going to be close to the vector for the other penguins that it has seen. So in a way, it's learning a better distance, a better space where distances make more sense. And this turns out to be a really powerful thing for visual data.
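As an illustration of what "a better space where distances make more sense" means in practice, here is a hedged sketch (my example, not the speaker's code) that uses an off-the-shelf ImageNet-pretrained network as a feature extractor and compares images by the distance between their embeddings rather than their pixels. It assumes a recent torch/torchvision install; real use would load actual photos and apply the standard ImageNet preprocessing.

```python
import torch
import torchvision.models as models

# Pretrained backbone as a generic feature extractor (weights download on first use).
backbone = models.resnet18(weights="IMAGENET1K_V1")
backbone.fc = torch.nn.Identity()   # drop the classifier head, keep the 512-d embedding
backbone.eval()

def embed(x: torch.Tensor) -> torch.Tensor:
    """Map a batch of (3, 224, 224) images to feature vectors."""
    with torch.no_grad():
        return backbone(x)

# Stand-in "images": in practice these would be preprocessed photos.
img_a = torch.rand(1, 3, 224, 224)
img_b = torch.rand(1, 3, 224, 224)

f_a, f_b = embed(img_a), embed(img_b)
similarity = torch.nn.functional.cosine_similarity(f_a, f_b)
print(similarity.item())  # high for semantically similar images (two penguins), low otherwise
```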
And this is where deep networks have been used a lot in our field. And this is, John has already mentioned the ImageNet Challenge. You can do things like, given lots and lots of photographs, you can basically detect objects in them. You can even automatically generate captions. Look at this.
A group of people posing for a picture on a skillet. This is amazing. This is like, wow, we're done, we can go home. Well, there are still some problems with this human supervised learning. One obvious one is that getting these human labels is very, very expensive, right? So you have to spend a lot of time
getting humans to click on these labels for thousands or millions of videos. It's very, very expensive; that doesn't scale up, okay? So this is one big problem with supervised learning. Another is that it's often very easy to fool yourself into thinking that the network is doing more than it's actually doing.
So for example, here is an image, and I ran it through a kind of a standard image captioning software, and it says, a car parked by the side of the road. And I say, wow, this is so cool, right? But if you go and you look for cars on the internet, here's just a set of cars from Google Image Search, you can basically say that that description applies
to pretty much all of those images, right? All of them you could say, yeah, it's a car parked by the side of the road, more or less, right? And then I tried it on something like this. And it's a car parked by the side of the road, which is kind of true, it is, it was. What about this?
Well, there is a car, there is a road, and probably that's really all it's getting. It's probably not getting that much more. And a lot of it is just our kind of wishful thinking, to think that it's doing something more than basically just finding a small set of texture patterns.
And so I think this is something that one has to be very careful to make sure that we understand that there is no magic, you're getting whatever you put in. And so, in a way, this idea of kind of learning connections between images using words,
well, there is a certain issue with the fact that the world, the visual world, is just so much richer than the world of words. And so it's not a one-to-one mapping. You're losing a lot of information, a lot of what we, we don't have words for many things that we see.
So here is just one example. Here is a picture of Pittsburgh, Pennsylvania, and Paris, France. And they happen to be labeled the same English word, city. That's what language tells us. But actually, visually, they're very, very different, extremely different. And you can say, well,
they all have buildings. Well, but the buildings look different. Well, all the buildings have windows. But the windows, every single pixel is different. So in a way, you're really asking the computer a very hard, maybe even impossible task, somehow finding connection between those two things. And so, in a way, this kind of going through
these word labels is, in a way, kind of a language bottleneck. And so what we have been doing in my lab is to try to see if we can somehow get rid of this language bottleneck, and see if we can get visual data to be taken on its own merits, to be a first-class citizen.
So I will show you a few examples of some of the stuff that we have been doing. One kind of overarching idea is this idea of self-supervised learning. So it's a supervised learning paradigm. But the idea is that the labels, they don't come from some human labeler. They come from the data itself.
So the data is supervising itself, okay? So let me explain to you what I mean. So here is one idea. So the idea is to have a pretext task, which is maybe not very useful, but which we have infinite labels for, and try to get the computer to be good at this task.
Here is one example task. So the task is I have two patches taken from an image, A and B. And I want to ask the computer, what is their spatial arrangement? So given that the patch A is here, where should patch B go? So now let's pretend that you are all computers, where should patch B go?
Lower right, very good. Now let's try to introspect. In psychology, you're not supposed to, but we're okay, we haven't seen any psychologists here. When we introspect, why did we do this? Well, the top thing kind of looks like the top of a bus. The side looks like the side of the bus.
Once I can remember how buses look like, I have this kind of geometric connection, and then I can just kind of import it there. Now imagine that you have never ever seen a bus in your entire life. This would be an impossible task. You would just not be able to do this. So now if we force the computer
to try to train to solve this task, the hope is that it will have to learn about buses. It will have to learn about some semantics just to be able to do this task. And so even though the task itself is not very useful, nobody really needs to know this solution to this task, the hope is that we'll actually force the computer
to learn something about the visual world. And so here is the setup we have. So we basically took millions of images. For every image, we pick a random patch, and then we put eight other patches around it, and then we basically train a Siamese-type network
that given two patches, it basically predicts one of eight possible outputs. And then we train this for six weeks on a GPU. Very long time. It's not an easy problem. Even humans are not very good at this problem. But the idea is that once you train it on this useless pretext task, then you can look at the learned representation that you get.
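A toy sketch of this setup (my simplification, not the architecture from the actual paper): a shared encoder embeds both patches, and a small head classifies the relative position of patch B into one of the eight neighboring slots, so the label comes from the image itself.

```python
# Toy sketch of the context-prediction pretext task (not the paper's architecture):
# a shared ("Siamese") encoder embeds two patches, and a head predicts which of the
# eight neighboring positions patch B came from, trained with cross-entropy.
import torch
import torch.nn as nn

class ContextPredictor(nn.Module):
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        # Shared patch encoder.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        # Head over the concatenated embeddings -> 8 relative positions.
        self.head = nn.Sequential(nn.Linear(2 * feat_dim, 256), nn.ReLU(), nn.Linear(256, 8))

    def forward(self, patch_a, patch_b):
        return self.head(torch.cat([self.encoder(patch_a), self.encoder(patch_b)], dim=1))

model = ContextPredictor()
loss_fn = nn.CrossEntropyLoss()

# Stand-in batch: pairs of 96x96 patches and the true relative position (0..7),
# which comes for free from where the patches were cropped.
patch_a = torch.rand(16, 3, 96, 96)
patch_b = torch.rand(16, 3, 96, 96)
target = torch.randint(0, 8, (16,))

loss = loss_fn(model(patch_a, patch_b), target)
loss.backward()
```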
So you can get rid of one of those things, and you look at what is the feature embedding that you get, and the hope is that in that embedding, distances in that embedding would be somehow better. So here is one very simple example. We take an input image, a little patch, cat patch,
and then we look for its nearest neighbors, the closest patches in this new learned embedding. And the nice thing is that the nearest patches from a whole bunch, from millions of other images, they're all also cats, which is kind of remarkable
because training was done one image at a time. Nobody has ever told the computer, these two things are both called cats, right? Before we had the images, we said penguin, penguin, penguin. Here, nothing like this. The training was done in a single image at a time, and yet, through context, it learned to find these correspondences across categories.
So it kind of strung together categories from instances all by itself. And so this, we thought, was a very hopeful result because it might get us to where we want to go. But of course, this is very slow, and you basically get one bit of information
per training pair. And so you can do better, you can do faster. For example, we can just do a prediction of all pixels at once. So what is the way to set up this kind of prediction problem where we don't have human labels? Well, one example is we can predict color from an image.
So we can take an image, we can split it into the grayscale and the color components, and then we basically try to train a network to predict the color from the grayscale, okay? So we have half the data predict the other half of the data, okay? And of course, the nice thing is that
you could put it back together and you have a nice colorized image, but kind of more exciting is that hopefully the representation is being learned while it's doing this task is going to somehow be meaningful, okay? And you can do this task forever because you don't need any label data. Basically, this process labels its own data for you, okay?
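Here is a minimal sketch of the "half the data predicts the other half" idea. It is my simplification: plain L2 regression from a luminance channel to two chroma channels, whereas the system described in the talk works in Lab space and, as he discusses shortly, needs a more careful loss than plain L2.

```python
# Hedged sketch of colorization as self-supervision (my simplification, not the
# actual system): split each image into luminance and chroma, and train a small
# network to predict the chroma from the luminance with a plain regression loss.
import torch
import torch.nn as nn

def rgb_to_luma_chroma(rgb):
    """Rough YUV-style split: one luminance channel, two chroma channels."""
    r, g, b = rgb[:, 0:1], rgb[:, 1:2], rgb[:, 2:3]
    y = 0.299 * r + 0.587 * g + 0.114 * b
    return y, torch.cat([b - y, r - y], dim=1)

colorizer = nn.Sequential(            # tiny fully-convolutional "colorizer"
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 2, 3, padding=1),   # predicts the two chroma channels
)

rgb = torch.rand(8, 3, 64, 64)        # stand-in for a batch of real photos
luma, chroma = rgb_to_luma_chroma(rgb)

pred = colorizer(luma)
loss = nn.functional.mse_loss(pred, chroma)   # the image supervises itself
loss.backward()
```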
So of course, you know, show some pictures. So this is Ansel Adams and we made it a little bit colorized. Here is another, you know, this is not supposed to be art, it's just to show that the algorithm is actually doing something reasonable. Even the mistakes are kind of fun. So can you see what the mistakes here are?
It's not very bright, unfortunately, but here I will toggle. So it puts some pink underneath the chin, right? And the reason it did this is because the dogs in the training data have their tongues out. And so, but this is a very good point. This basically means that whatever it's learning,
it's not some low-level signal. It's actually recognizing that it's a poodle, I guess, and then saying, well, all the poodles I have seen before have their tongues out, so probably this one also has its tongue out. Okay, so that actually suggests that it's actually learning something, something higher level and semantic about this problem. Okay, but whenever you're doing this prediction task,
things are a little bit tricky because, so what you're doing is you're doing two things, right? So first is you're training a network, which is basically a mapping from some, you know, input domain to the output domain. And then you're actually trying to make this mapping
minimize this equation. Basically you want to say that you want whatever the F produces to be as close as possible to Y. So this F is the neural network and L is the objective function or the loss, which says, what do I want it to do? And so in this case, we want F of X to be as close to Y as possible.
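The equation on the slide is not reproduced in the transcript; in standard form, the setup he is describing is roughly:

```latex
F^{*} \;=\; \arg\min_{F}\; \mathbb{E}_{(x,y)}\big[\, L\!\left(F(x),\, y\right) \big],
\qquad \text{e.g. } L\!\left(F(x),\, y\right) = \lVert F(x) - y \rVert_{2}^{2}.
```

Under the squared (L2) loss, the minimizer is the conditional mean of y given x, which is exactly why multimodal outputs get averaged, as the bird example that follows shows.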
But now we're back to this problem that close in what sense? You know, in L2, L1, what's the distance that we're trying to get? And of course, the standard thing is to do something like L2, but in high dimensional data like images, that doesn't work very well. So here is an input image,
and this is a regression with L2 distance, and it doesn't look very good compared to the ground truth. And the reason is, well, one of the reasons is basically, imagine that you have multiple modes. Imagine that this bird could be blue or it could be red. What is L2 going to do? It's going to try to make both of those happy. It's going to do something that splits down the middle.
So it's going to do something in the middle, and then what's in the middle? In the middle, it's green. So that's why neither one is going to be happy, okay? So it's not very good when you have multiple modes in the data. And so, you know, we did some fancy things with, you know, doing cross-entropy loss with boosting the colors,
and my student Richard Zhang has spent a lot of time really fitting the right loss function. And we got something like this. But the problem is that whenever you do something by hand, sometimes there are, you know, unintended consequences. So this image, for example, it just over-colorizes it. The back wall, you know,
should really be white, and it's put some yellows and stuff in there. And so what we really would love is to have some sort of a universal objective function that just works for any of these what we call image-to-image translation problems. Just find something that will tell us if this is a good image or not.
And this is, you know, this is a good thing because we do know what the right answer should look like. We have a whole bunch of real images that look real. So all we need to do is somehow try to have our outputs be indistinguishable from real data.
And just like John mentioned already, there is this wonderful paper by Ian Goodfellow and colleagues called Generative Adversarial Networks that does exactly this. And it's just perfect for this task. Basically what it does is it takes the generator, puts a discriminator on top of it, and says,
can I tell the difference between whatever I have produced and a real photo? And it's a minimax optimization. In the end, hopefully the discriminator will say, okay, I give up, I cannot tell the difference. And then we win. And so we have our generator where it does the colorization, for example, and then we have this discriminator. And so G is trying to synthesize fake images
that fool D. D tries to identify those fakes. And one way to think about what it's doing, which is kind of nice, I think, is that from G's perspective, D is kind of like a loss function. It's kind of like your objective. It's basically kind of like L1 or L2 distance, but it's learned.
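Written out, the learned-loss idea is roughly the conditional objective from the pix2pix paper (paraphrased here, dropping the noise input for brevity; the lambda term is the conventional pixel loss kept alongside the adversarial one):

```latex
\min_{G}\,\max_{D}\;\;
\underbrace{\mathbb{E}_{x,y}\!\left[\log D(x,y)\right]
\;+\; \mathbb{E}_{x}\!\left[\log\!\big(1 - D\!\left(x,\, G(x)\right)\big)\right]}_{\text{adversarial term: a learned ``loss'' for } G}
\;+\; \lambda\,\mathbb{E}_{x,y}\!\left[\lVert y - G(x)\rVert_{1}\right]
```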
Basically, D tells G how to get better, how to push towards reducing the error, not in some predefined way, but learned for this particular problem. And this turns out to be a very, very powerful technique. And so we have recently had a paper
basically trying to use this for a whole bunch of what we call image-to-image translation problems. So for example, we tried it for colorization and it seems to work. And then we take exactly the same approach, retrain on new data. For example, we downloaded images from Google Maps and we made it produce images
of the satellite view of that same place. So going from Google Maps to satellite, and here is the ground truth so you can see that it's a little bit blurrier, but it's actually doing pretty well. And of course, we can do it the other way around, going from satellites to Google Maps. Unfortunately, you can't really see very well here. And it's exactly the same code,
exactly the same parameters because the GANs, they basically learn the right metric automatically for us. We can go from day to night. Again, same code, same parameters, just train on different things. We can go from edges to images. So train on edges and produce images.
And then, using that trained network, we tried it on kids' sketches of things and it's still able to do something reasonable, right? We posted the data and the code online about a year ago. We called it pix2pix. And then lots and lots of people
were able to just create stuff with it. And not even scientists, like there's a bunch of artists that would just use the code. It turns out to be very, very easy now to just get the code and just start working with it. Maybe you have seen there's this cat sensation. People trained it on cats
and started to have a little interactive cat thing. So you can try it at home, this little app. You draw something and then it calls our code and then you can basically catify anything because of course cats are what internet is all about.
Now the only problem with the current setup is that there's still some supervision. And the supervision is that we have to give pairs of training data. So we say okay, edges and the corresponding image, right? But what if we want to get rid of it?
Sometimes this paired supervision data is just not available. For example, I have a whole bunch of photos that I took while on different trips and I want to translate them into the style of say Cezanne. But I don't have any correspondences. I have not taken a picture exactly where Cezanne has painted his painting, okay?
How can I do this? Well, one can try to do this again with this discriminator from the GAN. We can say given some instance in domain X, let's train a generator to produce something in domain Y that looks like Cezanne.
Now that's reasonable but most of the time it doesn't work because it's too open-ended. It will just produce something of Cezanne but it might not have anything to do with their initial X. And so this is where we can give more constraints by adding what we call cycle consistency to basically try to go back to the original image.
So the idea here is you go from the input image to the other domain. We don't know what it should be in the other domain, but then we can translate it back, and hopefully it should be close to what it started with. And this is something that, as John mentioned, people use in language all the time.
In fact, Mark Twain did this exercise translating, finding a French translation of his short story and then translating it back into English and being horrified at the results. But this back translation idea is actually being used in language all the time and so we basically adopted it to visual data.
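For reference, the cycle-consistency idea he describes can be written, roughly following the CycleGAN formulation, as an extra penalty for not returning to the starting image after a round trip through both translators G: X to Y and F: Y to X:

```latex
\mathcal{L}_{\text{cyc}}(G, F) \;=\;
\mathbb{E}_{x}\!\left[\lVert F\!\left(G(x)\right) - x \rVert_{1}\right]
\;+\;
\mathbb{E}_{y}\!\left[\lVert G\!\left(F(y)\right) - y \rVert_{1}\right],
```

```latex
\mathcal{L}(G, F, D_X, D_Y) \;=\;
\mathcal{L}_{\text{GAN}}(G, D_Y)
\;+\; \mathcal{L}_{\text{GAN}}(F, D_X)
\;+\; \lambda\,\mathcal{L}_{\text{cyc}}(G, F).
```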
And so here now we have a couple of photos that I took in Paris and here is what happens if we translate them to be more like Cezanne. And people have, as John has shown, there are some other work that has been doing this kind of things. Now I'm particularly proud of our stuff,
the Cezanne clouds I think. I think we have the best Cezanne clouds, because we are basically able to use not just one Cezanne image but his whole output, a thousand images of Cezanne. So it's really a domain-to-domain translation. But people have done this before. Now what people haven't really done before is to go the other way.
So starting with say this Monet, can you make it look like something like a real image? And we're not quite there yet, it's not quite there but we feel like it's getting close. That really, fooling humans is on the horizon, maybe in a couple of years.
So here is another example, here's a Monet and this is kind of our version and some of it is really looking quite good and some of it doesn't. And of course it would be nicer if I showed some Cezanne examples instead of Monet but Cezanne doesn't work as well. So still future work. So here are some other examples
of translating a photograph into different styles. I just like to show off my travel photos here. We can also translate other things. For example we can translate seasons. We can go from summer to winter and of course back again. We can go from oranges to apples
or from apples to oranges. We can go from horses to zebras and back again. And amazingly, even just running it one frame at a time, without any temporal consistency, it's doing okay, although look at the tail.
The tail is definitely weird here. Yeah, but even the failures are kind of fun. So this, I showed this in my talk in Moscow, and I thought they're not gonna let me out. But it actually makes sense.
There is actually no supervision. Nobody told the network what a horse looks like, what a zebra looks like. So it's really trying to find some correspondence. It's kind of really doing this bipartite matching to see what corresponds to what in this translation. It's like two visual languages without a dictionary. So maybe it's not that weird that it thought
that Putin was part of the horse, right? Because it doesn't really have any extra data. And of course it makes sense to try to add more information to this. So in conclusion, visual data is really the biggest big data we have right now. And it's time that it starts being treated
as a first-class citizen, but it's hard. So hopefully using deep learning and in particular this idea of self-supervised learning might be a trick to get us there. Thank you very much.
Thanks a lot for this beautiful and inspiring talk. I think we have time for one or two quick questions. Somewhere from the back over there. Yes? Can we have a microphone over there?
Is the mic on the way? Yes. Does this technique offer a view toward solving the learning from one image problem that John described?
So again, the learning from a single image? Well, it is kind of in that direction. You can think of it that there is no learning from a single example. Basically the idea would be that you're learning
all your life without any supervision. And then finally at the very end, you give an image and say, okay, this is a fire truck. And then boom, it just connects it with the representation that you have built up. So in a way, that's the idea here, that you build up this representation and then just one or just a few labels
should get you there. And we're not quite there yet. We have some experiments on pre-training using the self-supervised data and then testing on some smallish data sets like Pascal. And we're getting reasonable results, but they're not something to write home about yet. So I think this is definitely still a big problem
and hopefully some of you guys can help us solve it. Thank you. So you mentioned before that most of the deep learning
results right now: most of the data available on the internet is images, so most of the results deal with image or vision problems. So what are your thoughts for the future of deep learning with data that cannot be embedded in a Euclidean space, or that is not
naturally represented as a vector? Because, I mean, then there might be some difficulties applying these methods to such data. So I think methods for data that is not in Euclidean space are developing, for example, various types of embeddings, meshes in 3D graphics, for example.
I think the places where deep learning will help are when your data is high dimensional and noisy. I think those are the two requirements. It needs to be high dimensional and because it's high dimensional, there is a lot of weird things going on. I think for low dimensional data,
probably deep learning might not be that useful because other methods are already reasonably good. But the reason why deep learning made such a big revolution in something like audio and video was because it's a very, very high dimensional space with lots of junk in there and other methods just couldn't deal with all that junk.
And so I think you want to look for, I presume something like financial data, if you look at the whole stock market, that's probably high dimensional enough. So that makes sense. But I think you really want to look for problems where you have the dimensionality is high but the intrinsic dimensionality is maybe much lower.
And that's what deep learning seems to be good at. But again, as John mentioned, there is a lot of questions and I think there's more questions than answers right now. We know that it works amazingly well for some problems but the theory is still lacking, so there is plenty to do.