Style transfer and image synthesis
Formal metadata
Number of parts: 10
License: CC Attribution 3.0 Germany. You may use, modify, and reproduce the work or its content in original or modified form for any legal purpose, distribute it, and make it publicly accessible, provided that you credit the author/rights holder in the manner specified by them.
Identifier: 10.5446/68390 (DOI)
Transcript: English (automatically generated)
00:09
Hello and welcome. In this lesson we will look at style transfer and image synthesis. Image modification. We want to perform image transformations that change certain characteristics of the image,
00:24
but semantic ones, for example, changing the age of a person, figure 1. Similar problems are now solved using neural networks of the encoder-decoder type, that is, we modify the networks that we considered for semantic segmentation,
00:41
so that at the output they give not a segmentation map but an RGB image and change some features of the original image. Another option for modifying images is their stylization, figure 2. In this case it is not the semantic property of the image that changes, but the rendering style.
01:01
The third class of problems we consider is image synthesis. The task is to create from scratch a certain class of objects, for example, a human face. Image synthesis problems are now solved by networks such as DCGAN, figure 3, which look like half an encoder-decoder network.
01:24
In fact, this is a decoder that takes a certain compressed representation, a certain random vector, as input. Passing this random vector through a chain of transformations, we obtain an RGB image. Methods based on visualizing feature vectors.
01:41
Let's start by considering methods based on visualizing feature vectors. Let's try to synthesize images that produce the same feature vector as a given image, that is, let's see what other images, from the point of view of the neural network, are similar to this one. Let's find an image x that gives the same feature vector as the given image.
02:04
We will solve the problem as follows. Initialize image x with white noise. We then optimize the following objective with respect to x using gradient descent. We minimize the sum of two quantities.
02:21
The first quantity is the distance between the corresponding neural network features. That is, we run the image through the neural network, obtain a feature vector, and compare it with the feature vector of the original image, for example by the root mean square distance. The second is a regularizer.
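A rough sketch of this reconstruction procedure (PyTorch; VGG16 is used as an assumed feature extractor, and total variation serves as the regularizer, which is one common choice rather than necessarily the one meant here):

```python
import torch
import torchvision

# Pretrained network used only as a fixed feature extractor (an assumption;
# the lecture does not fix a particular architecture).
vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1").features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def features(img, layer=16):
    """Run an image through the network up to module index `layer`."""
    x = img
    for i, module in enumerate(vgg):
        x = module(x)
        if i == layer:
            return x

target = torch.rand(1, 3, 224, 224)            # stand-in for the given image
target_feat = features(target)

x = torch.rand(1, 3, 224, 224, requires_grad=True)   # initialize with noise
opt = torch.optim.Adam([x], lr=0.05)

for step in range(200):
    opt.zero_grad()
    feat_loss = torch.nn.functional.mse_loss(features(x), target_feat)
    # Total-variation regularizer keeps the reconstruction from turning into noise.
    tv = (x[..., 1:, :] - x[..., :-1, :]).abs().mean() + \
         (x[..., :, 1:] - x[..., :, :-1]).abs().mean()
    loss = feat_loss + 1e-2 * tv
    loss.backward()
    opt.step()
```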
02:40
We can, as an experiment, reconstruct an image based on features from different layers of the network. From shallow layers the image is reconstructed very accurately; the deeper the convolutional layer, the less clear the image becomes, but the overall spatial configuration is preserved. When we move to fully connected layers, the spatial distribution is disrupted
03:03
but an image that vaguely resembles the original object is still preserved. Thus, if we try to reconstruct an image based on convolutional features that store spatial information, then the resulting images are similar to each other. And if we start the reconstruction from fully connected layers,
03:24
then the spatial configuration is different, but the images of the objects are still preserved. This is how the stylization problem came to be approached. Basic color transfer methods work with the properties of the entire image as a whole, for example, equating the mean and variance for each Lab channel.
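Such a global transfer amounts to matching per-channel statistics; a minimal sketch (done here directly in RGB for brevity, whereas the lecture mentions the Lab color space):

```python
import numpy as np

def match_mean_std(content, style):
    """Shift/scale each channel of `content` so its mean and std match `style`.

    Both inputs are float arrays of shape (H, W, 3) in [0, 1]; a simplified
    stand-in for per-channel Lab statistics matching.
    """
    out = content.astype(np.float64).copy()
    for c in range(out.shape[-1]):
        mu_c, std_c = out[..., c].mean(), out[..., c].std() + 1e-8
        mu_s, std_s = style[..., c].mean(), style[..., c].std()
        out[..., c] = (out[..., c] - mu_c) / std_c * std_s + mu_s
    return np.clip(out, 0.0, 1.0)
```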
03:44
The main idea of stylization is to modify the image so that, in terms of its features, it is similar in content to one target image and in style to another. In order to do this, you need to decide how to describe the content of the image and how to describe its style.
04:06
The outputs of a neural network, its features, are a good descriptor of image content. That is, if we synthesize an image that is similar in its neural network features to the target image, then the content has been preserved.
04:21
Several years ago it was discovered experimentally that a metric defined as the proximity of neural network features corresponds very well to proximity from the point of view of human perception. This is much more accurate than per-pixel metrics. How do we describe the image style?
04:42
The correlation of responses of different filters throughout the image can be taken as a description of the style. You can calculate the style based on the characteristics of the first layers and try to reconstruct the image with the same style.
05:00
The style can be described by the correlations of filter responses, written as a Gram matrix. We generate an image with the style of the original image by minimizing the mean squared difference between the Gram matrices G of the original image and A of the generated image, summed over the first L layers.
05:21
In other words, let's take the output of some layer of the neural network. This is a three-dimensional tensor in which each channel along the third dimension is the response map of a certain filter. We can flatten each channel into a vector. The correlations of these responses with each other then form a matrix called the Gram matrix.
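For a single layer's activations this computation is a few lines (a sketch, normalizing by the number of spatial positions):

```python
import torch

def gram_matrix(feat):
    """Gram matrix of a feature tensor of shape (C, H, W).

    Each channel is flattened into a vector; entry (i, j) is the dot product
    (un-normalized correlation) of the responses of filters i and j. Dividing
    by the number of positions keeps the scale comparable across layers.
    """
    c, h, w = feat.shape
    f = feat.reshape(c, h * w)
    return f @ f.t() / (h * w)
```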
05:44
The Gram matrix can be treated as a descriptor of image style. Moreover, to compare images by style, we can compare the Gram matrices of two images not at one layer, but at several layers. That is, we take an image and run it through a neural network.
06:05
We calculate the Gram matrix on each layer. We do the same for another image. We consider the proximity of Gram matrices, for example, as the sum of squared differences of the corresponding matrix elements.
06:20
We sum over all layers. We now have a function for comparing two images by style and a method for reconstructing an image with given characteristics. That is, we can take a noisy image and modify it by gradient descent until it becomes similar in its characteristics to another image.
06:41
We can apply the same thing to make an image similar in style. Let's take an image, calculate its Gram matrices, and begin to reconstruct an image whose Gram matrices are similar to those of the given one. In this way we can generate textures, figure 4. Let's take an image that is entirely a texture and consists of chaotically repeating elements, for example pebbles.
07:06
Next we take the outputs of the first convolutional layer, calculate the Gram matrix, and reconstruct an image that, in terms of the outputs of the first convolutional layer, is similar to the target one. We obtain a texture.
07:21
Then we can match the sum of Gram matrix differences over two layers: the output of the first convolutional layer and the output of max pooling. When we go up to four convolutional layers and reconstruct the image, we obtain a texture image very similar to the original one, but differing in the spatial distribution of its texture elements.
07:48
In this way we can generate an arbitrary number of textures if we provide an image with chaotic elements as input. Nine years ago a work was published whose key point was the following observation.
08:02
In the optimization process we can mix similarity in content and similarity in style. That is, content and style, from the point of view of neural network features, turn out to be separable. The upper layers describe more of the image's content, while the correlations of the lower layers describe its style.
08:23
We will generate an image starting from a random approximation, optimizing a weighted combination of content similarity to one image (a direct comparison of feature vectors) and style similarity to a second image (proximity of Gram matrices).
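A condensed sketch of this joint optimization (PyTorch; the Gram matrix helper from the previous sketch is repeated here, and the layer indices and weights are illustrative, not the ones from the original paper):

```python
import torch
import torchvision
import torch.nn.functional as F

vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1").features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def gram_matrix(feat):                      # (C, H, W), as in the earlier sketch
    c, h, w = feat.shape
    f = feat.reshape(c, h * w)
    return f @ f.t() / (h * w)

STYLE_LAYERS, CONTENT_LAYER = [0, 5, 10, 17], 17   # illustrative choices

def layer_outputs(img):
    outs, x = {}, img
    for i, m in enumerate(vgg):
        x = m(x)
        outs[i] = x
    return outs

content_img = torch.rand(1, 3, 256, 256)    # stand-ins for real images
style_img = torch.rand(1, 3, 256, 256)
content_feat = layer_outputs(content_img)[CONTENT_LAYER].detach()
style_grams = {i: gram_matrix(layer_outputs(style_img)[i][0]).detach()
               for i in STYLE_LAYERS}

x = torch.rand(1, 3, 256, 256, requires_grad=True)   # random approximation
opt = torch.optim.Adam([x], lr=0.02)
alpha, beta = 1.0, 1e4                      # content vs. style weights

for step in range(100):
    opt.zero_grad()
    outs = layer_outputs(x)
    content_loss = F.mse_loss(outs[CONTENT_LAYER], content_feat)
    style_loss = sum(F.mse_loss(gram_matrix(outs[i][0]), style_grams[i])
                     for i in STYLE_LAYERS)
    (alpha * content_loss + beta * style_loss).backward()
    opt.step()
```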
08:42
It turns out that after about 100 iterations of gradient descent, we end up with a good quality image that truly combines the content of one image with the style of the other. Since our objective is a weighted linear combination, by adjusting the relative weight of style
09:01
to content, we can tell the algorithm whether the final image should have more content or more style. It can be seen, figure 5, that by increasing the style weight we stylize the image more and more, until we finally begin to generate a pure texture. With the help of such optimization processes we can build unusual image modification techniques,
09:26
such as the one published in the work on deep feature interpolation for image content modification. Let's assume that we have a large collection of annotated images of the same class, for example, people's faces, figure 6.
09:43
These faces are annotated with different attributes: the identity of the person, the presence of a mustache or beard, ethnic origin, etc. Let's say we have a photo of a person whose attribute we want to change; for example, from a
10:00
photo of a person without a mustache, we want to synthesize a photo of the same person with a mustache. To do this, we will find several people similar to him. We evaluate similarity by comparing neural network features, but only among people who share the same set of attributes; that is, if we take a photo of George Clooney, we need photos of white men without glasses, without mustaches, etc.
10:28
In order to determine what a mustache is, we also find several people who are similar to George Clooney but differ in one attribute, the presence of a mustache.
10:42
Then the difference between the average feature vector of those with a mustache and the average of those without will be the mustache vector. Accordingly, let's shift George Clooney's feature vector in the direction of the mustachioed people and reconstruct the image that corresponds to this modified vector.
11:02
So George Clooney will become mustachioed. Since we move a person in the direction of an attribute, depending on how much we move him, we can model the degree of change in the attribute. Of course, in order to be able to manipulate images in this way, other factors of variability had to be eliminated.
11:22
For example, before the transformation all images are aligned, that is, registered with each other so that the eyes are in the same places, etc., because we want the difference in features to correspond not to differences in spatial layout but only to differences in appearance attributes.
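The feature arithmetic itself is just a few lines; a schematic sketch (the feature extractor and the reconstruction step described above are assumed, and all names are illustrative):

```python
import numpy as np

def attribute_vector(feats_with, feats_without):
    """Mean feature of neighbours WITH the attribute minus mean WITHOUT it."""
    return np.mean(feats_with, axis=0) - np.mean(feats_without, axis=0)

def shift_features(f_source, attr_vec, alpha=1.0):
    """Move the source feature vector along the attribute direction.

    `alpha` controls how strongly the attribute is expressed; the shifted
    vector is then fed to the image reconstruction procedure described above.
    """
    return f_source + alpha * attr_vec
```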
11:42
To compare two images, this work used a network that was trained to classify not people but all kinds of images, and only then used it to compare people. If we replace the ImageNet-trained network in this algorithm with a network trained to recognize people, the results become much worse.
12:04
This is because a network that learns to recognize people learns to ignore various small changes in appearance, preserving only the identity. Algorithm for image stylization based on gradient descent. The gradient-descent image stylization algorithm attracted a lot of attention because it was
12:24
the first method that made it possible to stylize arbitrary images to resemble artists' paintings. But it had one serious drawback: it is optimization-based, and optimization starting from white noise is slow. To get an acceptable result it was necessary to do about 100 iterations of gradient descent, roughly 200 passes through the neural network, which is quite long.
12:49
We need a neural network that does stylization in one pass over the image. So they took an encoder-decoder type network and trained it in such a way that
13:00
it transformed images, preserving their content so that the neural network feature vector did not change, while in style the image became similar to another image. To train such a neural network, let's fix the style image. Let's train it in the following way, figure 7.
13:21
We will use the VGG16 network, pre-trained on the ImageNet classification task, to extract features. It stays fixed while training the transformation network. Through it we run the style image, the content image, and the image produced by the transformation network, which
13:41
we feed to the input of this fixed network pre-trained on a classification task, for example VGG16. We obtain its neural network features. We remember one of the feature maps and compare it with the corresponding feature map of the original image. We take the original image, run it through this network, and get its features.
14:04
We want these features to be the same, so that the content of the image does not change but the style does. To do this, we take the style image, run it through the same network, take features from several layers, calculate the Gram matrices, and, as a loss function, compute the difference between the Gram matrices of the two images.
14:25
That is, our objective is a weighted linear combination of similarity in content and similarity in style. We compute the error, backpropagate it, and modify the transformation network so that it stylizes the image better.
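A compressed sketch of one training step of this scheme (the transformation network here is a toy placeholder; VGG16 is frozen and used only to compute the content and style losses; layer indices and weights are illustrative):

```python
import torch
import torchvision
import torch.nn.functional as F

vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1").features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def gram_matrix(feat):                       # batched version of the earlier sketch
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (h * w)

def vgg_layers(img, layers=(3, 8, 15, 22)):  # illustrative layer indices
    outs, x = [], img
    for i, m in enumerate(vgg):
        x = m(x)
        if i in layers:
            outs.append(x)
    return outs

# Placeholder for the encoder-decoder transformation network being trained.
transform_net = torch.nn.Sequential(
    torch.nn.Conv2d(3, 32, 9, padding=4), torch.nn.ReLU(),
    torch.nn.Conv2d(32, 3, 9, padding=4), torch.nn.Sigmoid())
opt = torch.optim.Adam(transform_net.parameters(), lr=1e-3)

style_img = torch.rand(1, 3, 256, 256)       # the fixed style image
style_grams = [gram_matrix(f).detach() for f in vgg_layers(style_img)]

def train_step(content, content_w=1.0, style_w=1e4):
    # Batch size 1 is assumed for brevity.
    opt.zero_grad()
    stylized = transform_net(content)
    feats_out, feats_in = vgg_layers(stylized), vgg_layers(content)
    content_loss = F.mse_loss(feats_out[2], feats_in[2].detach())
    style_loss = sum(F.mse_loss(gram_matrix(f), g)
                     for f, g in zip(feats_out, style_grams))
    (content_w * content_loss + style_w * style_loss).backward()
    opt.step()

train_step(torch.rand(1, 3, 256, 256))       # one step on a dummy content image
```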
14:41
As a result of training, we get a neural network that stylizes the image in one pass. The disadvantage of this approach to stylization is that for each style image we need to train a separate neural network. We would like a universal algorithm so that, on the one hand, we can stylize an image to match
15:03
any given image, and on the other hand, so that we can do this in one pass through a neural network. The idea is this. Let's train an autoencoder, a network of the encoder-decoder type which should compress the image into a latent vector representation f and decode it into the original one without loss.
15:26
We will manipulate the hidden vector representation f of the image. Let us denote the features of the two images, the content and the style features. Accordingly, we train the autoencoder and obtain two latent feature representations: one for the image that we want
15:44
to transform, from which we take the content, and one for the image from which we take the style. Next, we want to find a transformation such that the new vector f is similar to the original vector, but at the same time its Gram matrix coincides with the Gram matrix of the style image.
16:05
We can solve this problem by computing a special linear transformation of the features. To do this, we need to break the task into two stages. First, remove the correlations of the original features with a linear transformation, obtaining a whitened representation of the image without its style.
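A sketch of this two-stage whitening-coloring idea on a single layer's features (a minimal version; the encoder and decoder of the autoencoder are assumed to exist around it):

```python
import torch

def whiten_color(f_content, f_style, eps=1e-5):
    """Whitening-coloring transform on features of shape (C, H, W).

    Stage 1 (whitening): remove correlations of the content features.
    Stage 2 (coloring): impose the covariance of the style features, so the
    Gram/covariance matrix of the result matches that of the style.
    """
    def centered(f):
        c, h, w = f.shape
        f = f.reshape(c, h * w)
        mean = f.mean(dim=1, keepdim=True)
        return f - mean, mean, h * w

    fc, _, n_c = centered(f_content)
    fs, mean_s, n_s = centered(f_style)

    # Whitening: multiply by the inverse square root of the content covariance.
    cov_c = fc @ fc.t() / (n_c - 1) + eps * torch.eye(fc.shape[0])
    e_c, v_c = torch.linalg.eigh(cov_c)
    whitened = v_c @ torch.diag(e_c.clamp(min=eps) ** -0.5) @ v_c.t() @ fc

    # Coloring: multiply by the square root of the style covariance.
    cov_s = fs @ fs.t() / (n_s - 1) + eps * torch.eye(fs.shape[0])
    e_s, v_s = torch.linalg.eigh(cov_s)
    colored = v_s @ torch.diag(e_s.clamp(min=eps) ** 0.5) @ v_s.t() @ whitened

    return (colored + mean_s).reshape(f_content.shape)
```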
16:22
Then we apply a second linear transformation so that the required covariance (Gram) matrices match. As a result, we get a network that does the necessary styling in roughly one and a half network passes. Let us remember that we generated textures starting from a random vector. Let us now throw out the hidden feature vector of the image and replace it
16:40
with a random noise vector, and then stylize the random noise vector as a texture image. In this case we get a texture generator: we take the random noise vector after stylization and decode it into a texture image. This method was further developed with a post-processing modification.
17:04
After the style transformation we apply to the image a filter borrowed from image matting, which makes the gradients of the final image similar to the gradients of the original one; this is also implemented as a linear transformation.
17:21
It can be seen, figure 8, that as a result of stylization the image is somewhat degraded: quite strong transitions appear, and the fine texture has changed as well. This happens because we take the feature vector at one level and do the stylization at that same level.
17:41
In order to reduce these distortions, the authors came up with the following post-processing method. We look for an image that is pixel-by-pixel similar to the stylized one, but whose gradients are similar to the gradients of the original image. This problem can be solved with a linear transformation, and the result is shown in figure 8.
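The original work solves this with a closed-form linear filter derived from image matting; as a rough stand-in, the same idea can be written as a small optimization problem (a toy sketch, not the actual filter):

```python
import torch

def smooth_to_original(stylized, original, lam=10.0, steps=200, lr=0.05):
    """Find x close to `stylized` pixel-wise while its gradients match `original`.

    Inputs are tensors of shape (3, H, W). A toy stand-in for the linear
    post-processing filter described above.
    """
    def grads(img):
        gx = img[..., :, 1:] - img[..., :, :-1]
        gy = img[..., 1:, :] - img[..., :-1, :]
        return gx, gy

    x = stylized.clone().requires_grad_(True)
    opt = torch.optim.Adam([x], lr=lr)
    gx_o, gy_o = grads(original)
    for _ in range(steps):
        opt.zero_grad()
        gx, gy = grads(x)
        loss = ((x - stylized) ** 2).mean() + \
               lam * (((gx - gx_o) ** 2).mean() + ((gy - gy_o) ** 2).mean())
        loss.backward()
        opt.step()
    return x.detach()
```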
18:06
Image synthesis. We want to learn how to synthesize entirely new images that are similar to the training set. We can also solve the problem of image synthesis using a neural network, that is, build a neural network that maps a feature vector into an image.
18:25
One of the main problems in computer graphics is assessing the realism of images. In computer graphics two methods were used: look with your eyes, or compare with a reference image. Reference images are extremely difficult to obtain.
18:41
In addition, the difference between an image and the reference does not say anything about how the algorithm needs to be changed to make the picture more realistic. In a GAN, we instead train a classifier that distinguishes synthesized images from real ones. Since the classifier is also a neural network, we can consider the generator and classifier as one large
19:05
neural network, calculate the classifier's error and propagate it through the generator, but with the opposite sign. The generator's task is to deceive the discriminator. The discriminator's task is to learn to recognize what the generator has produced.
19:22
However, we cannot train the discriminator well from the very beginning, because we do not yet know what fakes look like. Therefore, training takes place in stages. Initialize the generator and discriminator. Fix the generator's weights. Take examples of real and generated images.
19:42
We teach the discriminator to distinguish them, figure 9. We then fix the weights of the discriminator, synthesize examples with the generator, and propagate the error from the discriminator back into the generator. This alternating training of discriminator and generator is repeated until convergence, which may not occur.
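A schematic of the alternating scheme (architectures are toy placeholders, and the standard non-saturating generator loss is used rather than literally flipping the sign of the discriminator's error):

```python
import torch
import torch.nn.functional as F

z_dim = 64
G = torch.nn.Sequential(torch.nn.Linear(z_dim, 256), torch.nn.ReLU(),
                        torch.nn.Linear(256, 28 * 28), torch.nn.Tanh())
D = torch.nn.Sequential(torch.nn.Linear(28 * 28, 256), torch.nn.LeakyReLU(0.2),
                        torch.nn.Linear(256, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

def train_step(real):                       # real: (B, 28*28) batch of images
    b = real.shape[0]
    # Discriminator step: generator fixed, learn to separate real from fake.
    with torch.no_grad():
        fake = G(torch.randn(b, z_dim))
    d_loss = F.binary_cross_entropy_with_logits(D(real), torch.ones(b, 1)) + \
             F.binary_cross_entropy_with_logits(D(fake), torch.zeros(b, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: discriminator fixed, try to fool it.
    fake = G(torch.randn(b, z_dim))
    g_loss = F.binary_cross_entropy_with_logits(D(fake), torch.ones(b, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

train_step(torch.rand(8, 28 * 28))          # one step on a dummy batch
```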
20:00
In the end, we hope that the generator will learn to synthesize images such that the discriminator will not be able to distinguish between them, figure 10. When the pictures generated by neural networks became more realistic, we began to wonder how the elements of the generator are connected to the resulting images.
20:22
Is there a relationship between specific convolutions and the classes of objects that are generated? To check this, we select an area on the generated images, for example a window, and look for a correlation between the presence of a window in a given location and the convolution activations.
20:41
Having found the convolutions whose correlation is the strongest, we can manipulate their values, for example increase or decrease them. If we suppress the convolutions that have the strongest correlation, the windows in the images disappear or degrade.
21:01
This means that there really is some correlation, and certain filters generate a certain visual element. You can also try vector arithmetic, figure 11. Let's remember the images that we like and the noise vectors that were used to generate them. We do not know in advance how a random vector and the resulting picture are connected, so we simply select the pictures that we like.
21:26
Let's take several pictures that we classify as a smiling woman, average their noise vectors, and check that the average really generates a smiling woman. We do the same for a woman with a neutral facial expression and a man with a neutral facial expression.
21:41
We subtract the neutral woman's vector from the smiling woman's vector and get a noise vector responsible for the smile. We add this vector to the neutral man's vector and get a smiling man. Wasserstein GAN, figure 12, is an example of a line of work on training generators by improving the discriminator. One of the conceptual shortcomings of the original formulation was the following.
22:05
The discriminator simply determines whether an image is a fake, and it usually detects fakes quite easily, especially in the initial stages of training. For training the generator, the mere information that it is producing fakes is not very useful.
22:22
It would be better if the discriminator were not just a discriminator, but a critic that could grade how bad a fake is, for example by comparing a sample of generated images with a sample of real images, that is, computing a probabilistic distance between the samples.
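One concrete way to set up such a critic is the Wasserstein formulation discussed next; a minimal sketch of the corresponding losses (weight clipping is how the original WGAN enforced the required Lipschitz constraint):

```python
import torch

def critic_loss(critic, real, fake):
    # Estimate of the (negative) Wasserstein distance between the two samples:
    # the critic is trained to score real images high and generated ones low.
    return critic(fake).mean() - critic(real).mean()

def generator_loss(critic, fake):
    # The generator tries to raise the critic's score on generated images.
    return -critic(fake).mean()

def clip_weights(critic, c=0.01):
    # Crude Lipschitz enforcement used in the original WGAN.
    for p in critic.parameters():
        p.data.clamp_(-c, c)
```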
22:41
One option for doing this is the Wasserstein distance. If we replace the discriminator with a critic based on the Wasserstein distance, the visual quality of the generated images improves. Conditional generators. Another major area of research is conditional generators.
23:00
The main problem with all such generators is their complete randomness: we feed in some vector and get an image as output, but we would like to control the generator, at least to determine the class of objects that we want to generate. This can be modeled in the following way: we make both the generator and the discriminator conditional.
23:22
That is, the generator will receive as input, in addition to the noise vector, some vector of parameters, for example the class label of the object that needs to be generated, and the discriminator will receive this label in addition to the generated image, figure 13.
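A minimal sketch of how such conditioning can be wired in (labels are embedded and concatenated with the inputs; all sizes are illustrative):

```python
import torch

n_classes, z_dim, img_dim = 10, 64, 28 * 28

class CondGenerator(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = torch.nn.Embedding(n_classes, 16)   # label -> vector
        self.net = torch.nn.Sequential(
            torch.nn.Linear(z_dim + 16, 256), torch.nn.ReLU(),
            torch.nn.Linear(256, img_dim), torch.nn.Tanh())

    def forward(self, z, labels):
        # The noise vector is concatenated with the class-label embedding.
        return self.net(torch.cat([z, self.embed(labels)], dim=1))

class CondDiscriminator(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = torch.nn.Embedding(n_classes, 16)
        self.net = torch.nn.Sequential(
            torch.nn.Linear(img_dim + 16, 256), torch.nn.LeakyReLU(0.2),
            torch.nn.Linear(256, 1))

    def forward(self, img, labels):
        # The discriminator sees both the image and the label it should match.
        return self.net(torch.cat([img, self.embed(labels)], dim=1))
```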
23:42
But conditional GANs are a very complex architecture that cannot always be trained. The idea of a discriminator as a method that allows one to evaluate the quality of a generated image is very good not only for networks that synthesize images, but also for networks that perform image transformations.
24:05
One of the pioneering works connecting GANs and transformation networks is Pix2Pix. The idea is the following: learn an image transformation between two domains for which there is initially a strict, paired correspondence.
24:21
For example, learn to generate an image of a shoe from a contour drawing of the shoe. We call such an image transformation network a generator, figure 14. The input is an edge image and the output is an image of a shoe. The discriminator then compares the shoe image with the edge map and returns a decision: was the correct shoe generated or not?
24:46
What happens if, instead of a discriminator, we use pixel-by-pixel similarity, a sum over components, as the loss function in the Pix2Pix setting? There is a training pair, figure 15: input and ground truth, a label map and the image from which this label map was generated.
25:03
If we train the transformation network by comparing the target image with the generated one using the L1 metric, we get images whose spatial layout is generally correct, but the pictures are very blurry. If we use a GAN, the images are sharp, but they may differ from the original images.
25:24
And if we use a combination of the L1 metric and the GAN loss as the objective, we get the best result. This method has one interesting detail. We work with high-resolution images. High-resolution images are difficult to train on, and it is not obvious what kind of discriminator to use for them.
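Schematically, the combined objective for the generator can be written as follows (a sketch; the conditional discriminator's interface and the weight `lam` are assumptions):

```python
import torch
import torch.nn.functional as F

def pix2pix_generator_loss(discriminator, edges, generated, target, lam=100.0):
    """Adversarial term (fool the conditional discriminator) plus an L1 term
    (stay close to the ground-truth image pixel by pixel)."""
    d_out = discriminator(edges, generated)              # conditional score(s)
    adv = F.binary_cross_entropy_with_logits(d_out, torch.ones_like(d_out))
    l1 = F.l1_loss(generated, target)
    return adv + lam * l1
```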
25:45
The number of pairs of source and target images is limited. This means that instead of training the discriminator on the entire image, we can train the discriminator on fragments of the image. That is, it will take fragments of the target and source images and compare them.
26:04
Did we generate the object correctly? Then, if the picture is high resolution, we run the discriminator over the entire image like a sliding window and average the result. This version of the discriminator is called PatchGAN, figure 16, because we apply the discriminator to patches of the images.
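A sketch of such a patch discriminator: a small fully convolutional network whose output is a grid of scores, one per local patch, averaged at the end (layer sizes are illustrative):

```python
import torch

class PatchDiscriminator(torch.nn.Module):
    """Each output element scores one local patch of the (condition, image) pair."""
    def __init__(self, in_channels=6):           # e.g. edge map + generated image
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Conv2d(in_channels, 64, 4, stride=2, padding=1),
            torch.nn.LeakyReLU(0.2),
            torch.nn.Conv2d(64, 128, 4, stride=2, padding=1),
            torch.nn.LeakyReLU(0.2),
            torch.nn.Conv2d(128, 1, 4, stride=1, padding=1))  # grid of scores

    def forward(self, condition, image):
        x = torch.cat([condition, image], dim=1)
        patch_scores = self.net(x)                # shape (B, 1, H', W')
        return patch_scores.mean(dim=(2, 3))      # average over all patches
```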
26:25
The Pix2Pix method is quite good, but it has a significant conceptual drawback: it requires a training set in the form of pairs of corresponding images. We often have a situation where there are two domains, for example photographs
26:44
and paintings, and there is no pixel-by-pixel correspondence between them. How do we train then? The answer was CycleGAN, figure 17. The idea behind CycleGAN is to train two generators simultaneously, one transforming from
27:01
the first domain to the second and one from the second domain to the first. Let's say we have two domains X and Y. We map X to Y and check, using a discriminator, that the generated image actually belongs to domain Y. Then we use the inverse generator from Y to X, get a back-transformed image, and compare it with the original one.
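The forward-adversarial-backward check can be written as two loss terms (a sketch; the generators G_xy, G_yx and the discriminator D_y are assumed to be defined elsewhere and to output logits):

```python
import torch
import torch.nn.functional as F

def cycle_losses(G_xy, G_yx, D_y, x):
    """Forward mapping X -> Y, adversarial check in Y, then back to X."""
    fake_y = G_xy(x)                       # translate into domain Y
    d_scores = D_y(fake_y)                 # discriminator: does it look like Y?
    adv = F.binary_cross_entropy_with_logits(d_scores,
                                             torch.ones_like(d_scores))
    recon_x = G_yx(fake_y)                 # translate back into domain X
    cycle = F.l1_loss(recon_x, x)          # double transformation must return x
    return adv, cycle                      # total: adv + lambda * cycle
```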
27:23
It is necessary that the images do not change after the double transformation. We will train CycleGAN in stages. First we train one generator in one direction, then we train the second generator. We check that the second generator now generates correct pictures from domain
27:41
X and that after the reverse transformation the pictures are not corrupted. Everything is trained jointly; the quality of the transformation is worse than that of Pix2Pix, but we can work with arbitrary, unpaired domains. Training generator networks is one of the most time-consuming and computationally intensive tasks at the moment.
28:01
No one was able to train a network that generates high-resolution images from scratch until NVIDIA implemented Progressive GAN, figure 18. They showed that, given enough computing resources, you can train a neural network to generate high-resolution, for example megapixel, images by training it sequentially.
28:25
Let's train a generator and discriminator for 4x4 pixel images. Let's add layers to them to now generate and classify pictures of 8x8 pixels. Let's do one more stage of training. And so on, we'll build the network to generate megapixel images.
28:44
It took 14 days of training to achieve the final high quality image. Thank you for your attention and see you again.