
Painting with GANs: Challenges and Technicalities of Neural Style Transfer


Formal Metadata

Title
Painting with GANs: Challenges and Technicalities of Neural Style Transfer
Subtitle
Building Artistic Artefacts using Generative Networks
Number of Parts
130
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.

Content Metadata

Abstract
A lot of advancements are happening in the field of Deep Learning, and Generative Adversarial Networks (GANs) are one of them. We have seen GANs being applied to photo editing and in-painting, generating new image datasets and realistic photographs, increasing the resolution of images (Super Resolution), and many more things. Some people have also exploited GANs for generating fake content. All the above-mentioned examples are results of a technique where the focus is to generate uncommon yet original samples from scratch. However, these examples have few commercial applications, and GANs are capable of doing much more. The focus of this talk is a technique called "Neural Style Transfer (NST)", which has numerous commercial applications in the gaming world, the fashion/design industry, mobile applications, and many more fields. The challenges and technicalities of NST will be covered in great detail. We will teach machines how to paint images and utilize Style Transfer networks to generate artistic artefacts. The flow of the talk will be as follows:
~ Self Introduction [1 minute]
~ A Succinct Prelude to GANs [10 minutes]
~ Understanding Style Transfer [5 minutes]
~ Learning about Neural Style Transfer Networks [5 minutes]
~ Loss Functions: Content, Style, Total Variation [10 minutes]
~ Code Walkthrough and Result Analysis [5 minutes]
~ Challenges and Applications [5 minutes]
~ Questions and Answers Session [3-4 minutes]
Transcript: English (auto-generated)
So, welcome to the next session. We will have Anmol Krishan Sachdeva as our speaker. He will talk about GANs. I met Anmol, I think, for the first time at the GeoPython conference some years ago.
Anmol is very active in visiting conferences, real physical conferences back in the day. He was also at PyCon Thailand, PyCon Malaysia, and many, many more, and I think last year at, I don't even remember, GeoPython or EuroPython.
So, welcome, Anmol. I also have to mention that this year he is a volunteer for EuroPython, so thank you very much for volunteering. And now over to you for the talk; start your slides, please. Thanks, Martin, for the introduction. Okay, so hi, everyone.
I'm Anmol Sachdeva. The title of today's talk is Painting with GANs. We'll be talking about neural style transfer and the technicalities and challenges of using it. A brief introduction about myself: I'm an international tech speaker and a distinguished guest lecturer, and I work at OLX Group.
I did my master's in advanced computing at the University of Bristol, with a specialization in computational neuroscience and artificial intelligence. I have represented India in various international hackathons, and I'm also a researcher. About OLX Group: it's a group which consists of 20-plus brands,
has around 45 offices spanning five continents, and serves around 350 million people per month. So, the flow of the talk will be as follows.
First, we will look at an introduction to GANs. Then we'll take a look at what style transfer is. Thereafter, we'll learn about the different neural style transfer networks that are available and popular at this time. And then we will dive into the actual NST implementation
by looking into the loss functions: the content loss function, the style loss function, and the total variation loss function. And then we'll do a kind of code walkthrough as well. The talk will be supported by a few demos, which are adapted from the official
TensorFlow and Keras repositories. I'll be pushing the code to GitHub and sharing the link in the breakout channel. And then, post-talk, we can have the Q&A, the question and answer session, in the talk's "Painting with GANs" channel on Discord.
So yeah, the prerequisites for this talk are that you should be familiar with Python and Keras, especially with the TensorFlow backend, and some experience with artificial neural networks is good to have. It's also good if you have experience with convolutional neural networks and generative adversarial networks. And you should be inquisitive to learn about deep learning.
So first, let's start by revisiting the fundamentals of generative adversarial networks. In short, I'll be referring to them as GANs. So discriminative and generative models are the two types of models that we use in a GAN.
So first coming to the discriminative model, a discriminative model forms a discriminative network or the discriminator network. And it's essentially a supervised learning model which tries to classify the data which is fed into it. So it is just kind of a classification model that we are using here.
It doesn't really care about the underlying distribution of the data; only the quality of the data matters to it, so that it can classify it into categories properly. Then, on the other hand, we have a generative model that forms the generative network. Instead of classifying data,
it is used to generate data: it actually learns the underlying distribution of the data that's provided, and on the basis of it, it tries to generate samples that are near-real-looking. So mostly it's unsupervised learning, but what if we want to have some conditional training done?
In that case, we may supplement the training set with labelled data as well, so that it becomes kind of supervised plus unsupervised learning. Conditional GANs actually use labelled data, and there we have a bit of supervised learning implemented as well. And the way of actually learning
the underlying data distribution, which is what the GAN does, is through implicit density estimation. We don't need to calculate any probabilities explicitly; everything is done internally by the network itself, and that's called implicit density estimation.
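To make the two roles concrete, here is a minimal sketch (my own illustration, not code from the talk) of what the two models of a vanilla GAN can look like in Keras, assuming toy 28x28 grayscale samples:

```python
import tensorflow as tf
from tensorflow.keras import layers

latent_dim = 100  # size of the random-noise vector z

# Discriminator: a plain supervised classifier -> "real or fake?"
discriminator = tf.keras.Sequential([
    layers.Flatten(input_shape=(28, 28, 1)),
    layers.Dense(128, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # probability of "real"
])

# Generator: maps latent noise to a sample shaped like the training data.
generator = tf.keras.Sequential([
    layers.Dense(128, activation="relu", input_shape=(latent_dim,)),
    layers.Dense(28 * 28, activation="tanh"),
    layers.Reshape((28, 28, 1)),
])
```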
So I'll be referring to these terms going forward. But just to give you a gist of what a GAN network, a vanilla GAN, actually looks like in the form of a schematic diagram, it's this. The goal is to generate near-real-looking samples of the underlying distribution that we are provided with, that is, the training set.
We have this input layer where we feed the random noise; this random noise forms part of the latent space. It can be a uniform distribution or a normal distribution. We pass this noise through the GAN, which is formed by two neural networks:
one is the discriminator network, one is the generator network. We'll cover the details of what's hidden inside this box in the coming slides. But once we pass this input to the GAN, the GAN produces an output of some other dimension, say N dimensions, and that's maybe an image that we form
out of random noise, or something like that. So the GAN training algorithm has two essential parts: one is training the discriminator, and the second is training the generator. Training the discriminator network involves the following flow. We take a real sample,
that is, a sample from the training set, pass it through the discriminator, and have the discriminator classify it. On the other hand, we have a generator network to which random noise is fed, and that generator network produces a sample, which we call X star or X bar; that's a fake sample.
That fake sample is also fed to the discriminator. Now, the discriminator should be able to classify this as a fake sample, but our aim is to have the generator improve to the extent that the discriminator starts failing
in distinguishing between the real sample and the fake sample. So there will be a point where the discriminator starts classifying the fake sample as a real sample. And that's the part about generator training: the generator uses this random noise in the second phase of training,
generates a fake sample, the sample is fed to the discriminator, and the discriminator classifies it as real or fake. But the essential thing is that during this generator training phase, we backpropagate the errors to the generator instead of backpropagating them to the discriminator.
So in the discriminator phase, backpropagation is done to the discriminator network, whereas in the generator phase, backpropagation is done to the generator. And here we make sure that the discriminator network's parameters are not trainable: in the second phase, we set the trainable parameter of the discriminator to false,
because we just want the generator to improve. In a schematic diagram, it looks like this. This is the first phase, training the discriminator; the second phase is training the generator. We have X, the real sample that we provide to the discriminator, and then we have the generator,
to which we feed some random noise from the latent space, which we call Z. It can be normal, uniform, or any other distribution. The generator produces a sample X bar, and that is also fed to the discriminator. So the discriminator classifies X and X bar into some category, maybe real or fake,
and then the classification errors are propagated to the discriminator so that the discriminator learns. In the second phase, the only thing we do is remove this real-sample path, pass only X bar, and propagate the errors back to the generator. That's the only difference. And then we repeat the cycle over iterations
so that the network learns. The discriminator gets better at distinguishing between real and fake data, and the generator, on the other hand, gets better at generating data which is near-real-looking and which adheres to the underlying real training
dataset distribution. So these are all fake samples generated by NVIDIA's StyleGAN. No one can tell that these are fake; they look pretty much real. That's how much we have advanced in the last five years since the inception of the concept of GANs.
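As a rough sketch of the two training phases just described (again my own illustration, reusing the toy `generator` and `discriminator` from above; the `trainable` trick is the part to note):

```python
import numpy as np
import tensorflow as tf

discriminator.compile(optimizer="adam", loss="binary_crossentropy")

# Phase 2 uses a stacked model in which the discriminator is frozen,
# so only the generator's weights receive the back-propagated error.
discriminator.trainable = False
gan = tf.keras.Sequential([generator, discriminator])
gan.compile(optimizer="adam", loss="binary_crossentropy")

def train_step(real_batch, batch_size=32, latent_dim=100):
    # Phase 1: train the discriminator on real samples (label 1) and fakes (label 0).
    z = np.random.normal(size=(batch_size, latent_dim))
    fake_batch = generator.predict(z, verbose=0)
    discriminator.train_on_batch(real_batch, np.ones((batch_size, 1)))
    discriminator.train_on_batch(fake_batch, np.zeros((batch_size, 1)))

    # Phase 2: train the generator to make the (frozen) discriminator say "real".
    z = np.random.normal(size=(batch_size, latent_dim))
    gan.train_on_batch(z, np.ones((batch_size, 1)))
```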
So next comes the main concept for which we are here today. We have gained quite a good hold on generating near-real-looking images, or photo-realistic imagery. But what if we now want to generate art?
Instead of just generating images of static objects, we want to design new objects, to build artistic artefacts. So how can we do that using GANs? Here comes the concept of style transfer. So, as I told you earlier,
we now dive into generating new kinds of artistic artefacts. What if we have an image which we call the content image or base image, here of a dog, and we have a style image, for which I have taken an image of grass? What if I just apply the style of this grass
onto the content image, and say it generates this output? That means the style of this image has been imposed on the content image, and we get some unique form of art that contains both the content as well as the style. You see that the content is dominant here
and the style is also reflected in this image. So this image holds a combination of both the content image and the style image. That's our aim. We will be generating art along similar lines: we'll have a content image,
we'll have a style image, and our goal will be to transfer or embed the style from the style image onto the content image in order to produce an image which is a combination of both and looks good, realistic, and original. So yeah, that's the aim of the neural style networks
that we'll be covering now. You must take note of one thing: the model that we saw here, the training that happened, didn't learn the underlying distribution. So here comes the first difference with respect to vanilla GANs. Vanilla GANs actually had the underlying distribution
being learned, but here we are transferring styles. That's the first difference. We extract the style from the style image and then embed it into the content image, and the result should look like a blend of both images. So why not simply interpolate the pixels? Because if we interpolate one with the other,
what we get is a blurry image that's highly distorted, where the style dominates the content image. It will look muddy, it will not be clear, and both images will lose their identity. That is why we should not use
simple interpolation for this sort of thing. Style transfer networks have wide applications in the area of gaming and in the area of developing applications. A few years back, we had an app called Prisma which made this sort of style transfer public.
So people were able to apply styles from different images to their selfies and so on. That saw a real boom in the last two or three years, and people have advanced in generating new, state-of-the-art networks
that can do style transfer. Many more applications have come up, which we will discuss in a few minutes. And the last point is that in the resulting image, if you look, you have just applied the style onto the content image.
You can play with the dimensions and so on, but ultimately it looks like someone has applied some art onto the content image and we got the generated image. So, popular style transfer networks: there are three networks, one is Pix2Pix, one is CycleGAN,
and then there's neural style transfer. We'll cover a bit on CycleGAN and Pix2Pix after we cover neural style transfer, but first we'll start with neural style transfer. Neural style transfer, as I told you earlier, doesn't require any training set. And there's no training of network weights, no
backpropagation into trainable parameters, because we just have two images to deal with: we have to transfer the style of one image to another image. So there's no involvement of any training set. That's another unique thing about neural style transfer: it picks up the features
from the style image and applies those features to the content image, and it creates some hyper-realistic imagery. So let's say we have this base image and then we apply a style image
that's shown here at the bottom, and we get this kind of combined image. Just by playing around with the loss functions and the hyperparameters, we can get highly varying resultant images. This is one image that we see; likewise, we can have different degrees
to which the image is transformed by the neural style transfer network, and we can have different images which have the style features transferred or embedded onto the base image to different degrees. So we can generate multiple images from the same style and base image, using a combination of both.
So the core of neural style transfer is that essentially we have a loss function which is made up of a content loss, which I'll tell you more about in a bit, a style loss, and a total variation loss. The first thing is the content loss,
where we would like the combined or resultant image to be as similar as possible to the content image, the base image. So that's a loss between the base image and the generated or resultant image.
Then comes the style loss, where the generated image is compared to the style image, and on the basis of the degree of correlation between the style of both images, we calculate the loss. And last is the total variation loss,
sometimes also called total variance loss, wherein we check whether the generated image is smooth or whether the pixels are distorted and blurry. Ultimately, the goal of neural style transfer is to minimize the combination of these losses.
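In code form, that combination is just a weighted sum; this is a minimal sketch, and the weight values below are placeholders to tune, not numbers from the talk:

```python
content_weight = 1.0
style_weight = 100.0
total_variation_weight = 10.0

def combined_loss(content_loss, style_loss, tv_loss):
    # The optimizer minimizes this single scalar by nudging the pixels of the
    # generated image along the negative gradient.
    return (content_weight * content_loss
            + style_weight * style_loss
            + total_variation_weight * tv_loss)
```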
So how do we minimize this loss? We use a gradient descent technique wherein we update each pixel over the iterations, and then we get something like what you saw in the previous slide, the combined image. And there's a difference with the vanilla GAN, which I highlighted earlier as well: there's no training set required here
and no network weights being trained. So, coming to content loss. From now onwards I'll be making references to bits of code as well, but before moving forward, I would like to show a quick demo of how we can utilize a pre-trained network
to generate these art pieces. So yeah, this is the IPython notebook that I'm using. I'll just increase the size so that you are able to see. Here we are importing libraries: the TensorFlow library, Matplotlib.
We are setting the run-config parameters for Matplotlib, then NumPy, the Python Imaging Library (PIL), and functools. So this is the import part. Next comes a function. This code has been adapted from the official TensorFlow repository. The function converts a tensor to an image.
It does nothing but use some NumPy functions, checking the channels and dimensions of the image, and then taking the first element of the batch if the number of dimensions is greater than three. Here we will be using four-dimensional tensors.
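For reference, the tensor-to-image helper in the official TensorFlow style-transfer tutorial (which the notebook is adapted from) looks roughly like this:

```python
import numpy as np
import PIL.Image

def tensor_to_image(tensor):
    # Scale float values in [0, 1] back to 8-bit pixel values.
    tensor = tensor * 255
    tensor = np.array(tensor, dtype=np.uint8)
    if np.ndim(tensor) > 3:
        # Drop the leading batch dimension of a 4-D tensor (a batch of one image).
        assert tensor.shape[0] == 1
        tensor = tensor[0]
    return PIL.Image.fromarray(tensor)
```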
So it converts the tensor to an image, and that function is utilized below. We have the content path. The content path uses the utils function of Keras: get_file actually fetches the file from a remote location.
We are using the image of the dog which I showed earlier, and for demo purposes I have taken three style images which I will run through in a bit. These are the style images we want to transfer: this is the content image, and this is a style image.
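The download step is a one-liner with Keras; the URLs below are placeholders standing in for the actual image locations used in the notebook:

```python
import tensorflow as tf

# Placeholder URLs: substitute the real locations of the content (dog) image
# and the style image used in the demo.
content_path = tf.keras.utils.get_file("content.jpg", "https://example.com/dog.jpg")
style_path = tf.keras.utils.get_file("style.jpg", "https://example.com/bushes.jpg")
```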
Then we have the load-image function, just to load and display the images. This again uses the usual NumPy and TensorFlow functions and libraries: we read the image, decode it, convert it, and scale it. So it's a kind of image processing, and then we show the newly formed image. We resize the image afterwards
and then return it. And the imshow function, this is a typical function; we just attach a title to the image and then we have the plot function. Up to this point, I'll run each of the cells so that you can observe what is happening. First we will see the style image
of bushes being applied to the dog. In the slide earlier, I showed grass being applied to the dog; now we'll see bushes being applied to the dog. So yeah, let me print the images
that we are taking into account. So this is the image of bushes, and we are applying it to the dog. And then we have the pre-trained model, VGG19. I'm taking the VGG19 pre-trained model,
which has all the weights set and is able to classify into a thousand categories; the network has been trained on 1 million images taken from the ImageNet dataset. So you'll see that the combined image has the style of the bushes while also adhering to the content image;
it keeps the content of the actual base image that we passed. Nothing much is done here: I have just used the VGG19 pre-trained model, and this is the path to that model. We have passed it the content image along with the style image, and then we have passed the result to the tensor-to-image helper,
which processes the image and then displays it. Let's quickly jump on; I'll uncomment this and we'll have some other style transferred to this image. We can go directly.
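For the one-call stylization itself, the official TensorFlow tutorial uses a pre-trained arbitrary-stylization module from TF Hub; I'm assuming a similar call here (the notebook shown in the talk may load its pre-trained model differently), with `content_image` and `style_image` coming from the load-image helper above:

```python
import tensorflow as tf
import tensorflow_hub as hub

# Assumption: the same TF Hub module used in the official TensorFlow tutorial.
hub_model = hub.load(
    "https://tfhub.dev/google/magenta/arbitrary-image-stylization-v1-256/2")
stylized = hub_model(tf.constant(content_image), tf.constant(style_image))[0]
tensor_to_image(stylized)  # helper defined earlier
```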
So I'll show which style image we are referring to now. So, okay, let me change the name here.
It actually loaded from the cache. That's the style image we are trying to apply to the content image. I'll quickly apply this to the base image
and we should get a result like this. This looks like a novel art piece. So that's it for the demo of the first part. Now, moving back to content loss. Content loss is essentially the L2 difference, or the mean squared error, between the content image and the generated image.
So what if we compare apples to apples? The content is similar, so the information in the pixels should also be similar to some degree, and we will have a low content loss. But if we compare an ocean or sharks, or say oranges with an apple or a banana with an apple,
there will be a higher degree of content loss. However, we will not do a pixel-by-pixel comparison; what we will do is compare the higher-level features. So how do we compare higher-level features? When training a neural network, say a convolutional neural network, there are blocks or stages
at which the network is trained, and as one stage progresses to the next, some of the features get dropped: the lower-level features get dropped and we are left with higher-level features. So in a convolutional neural network,
the lower layers represent very minute details, and the higher layers, the layers at the top, contain the higher-level features, broad features like "this is a car", "this is a building". To capture the higher-level features,
we'll be dealing with the top layers of a pre-trained network, and in our case that's VGG19. It has a 19-layer CNN architecture and, as I told you, it's capable of classifying images into 1,000 categories and has been trained on over 1 million images from the ImageNet dataset.
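A minimal sketch of loading that pre-trained VGG19 in Keras and exposing its intermediate layers (layer names follow the Keras VGG19 implementation; which content layer to pick is the tunable choice the talk mentions):

```python
import tensorflow as tf
from tensorflow.keras.applications import vgg19

# Load VGG19 with ImageNet weights, without the classification head.
base_model = vgg19.VGG19(weights="imagenet", include_top=False)

# Map every layer name to its symbolic output so features can be read anywhere.
outputs_dict = {layer.name: layer.output for layer in base_model.layers}
feature_extractor = tf.keras.Model(inputs=base_model.inputs, outputs=outputs_dict)

# Higher layers of block 5 carry content; the first convolution of each block
# is commonly used for style. "block5_conv3" matches the talk's choice of the
# third layer of block five ("block5_conv2" is another common pick).
content_layer = "block5_conv3"
style_layers = ["block1_conv1", "block2_conv1", "block3_conv1",
                "block4_conv1", "block5_conv1"]
```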
So the VGG-Net architecture looks something like this. We have an input layer and then five chunks: one, two, three, four, five, these are five blocks. Each block has a bunch of convolutional layers, so block1_conv1, block1_conv2, and likewise up to block5_conv1 through block5_conv4,
and then we have dense layers, where the input is flattened and the classification happens. So we'll be using this VGG19 network, and from the higher layers of block five we'll be capturing the higher-level features, the content, the features with which we'll actually be collecting
and computing the content loss. So this is the code. I'll just zoom in if that's possible. Yeah. Okay, so we have the Keras library being used.
VGG19 is the pre-trained model that we are using, and we just specify the path to the base image and the path to the style image, and then we specify some weights: there's a content weight,
a total variation loss weight, and a style weight, which we'll look at a bit later. The base image and the style image are processed from the paths we have provided, and then we have a placeholder, you could say a placeholder image, created,
into which the combined image will go. So we have these three images. We pass these three images by concatenating them into one tensor, and we feed them into VGG19, which has the pre-trained weights,
and we get the resulting feature outputs. From the pre-trained model we take, you could say, the third layer of block five; if you see, we take this third layer of block five as the reference for calculating the loss,
because higher-level features are captured here. We may take the second layer, the third layer, or the fourth layer; it totally depends on you, and taking different layers may give varying results. I have chosen the third layer;
you may choose the second as well. So we pass the base image and collect the output from this layer, and we also take the combination features, from the combined image, at the same layer of the network.
And then we just apply the L2 norm, the squared error, to both: the generated image features and the content image features, and then the content loss is calculated by multiplying by the content weight that we specified at the top.
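In code, the content loss is just the squared difference between those feature maps, scaled by the content weight; a minimal sketch, with variable names of my own choosing:

```python
import tensorflow as tf

content_weight = 1.0  # placeholder value; tune as needed

def content_loss(base_features, combination_features):
    # Sum of squared differences between the high-level feature maps
    # (a scaled version of the mean squared error mentioned in the talk).
    return tf.reduce_sum(tf.square(combination_features - base_features))

# With the feature extractor from before, and the three images stacked into one
# batch (base at index 0, style at 1, combination at 2), this would look like:
# features = feature_extractor(input_tensor)["block5_conv3"]
# loss = content_weight * content_loss(features[0], features[2])
```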
Coming to style loss, that's the second type of loss we want to cover. For style, we have images which may be similar to each other; those images we call correlated images. But there are images which are different, which have different styles or different lower-level details
and don't share common lower-level details; those images are considered to have a lower degree of correlation. So how do we calculate the correlation between the layers? The degree of correlation between two images can be computed by calculating the degree of correlation
between their feature maps. We want to capture lower-level features, because lower-level features represent the style; just note that higher-level features represent content, and lower-level features, the output from the lower layers of the neural network,
represent the style. So we take the lower layers of the convolutional neural network, fetch the feature maps from there, flatten the feature maps, and take the dot product of the feature maps between the two images. Then, depending on that dot product, if its value is greater
than some value we have specified, we consider the images to have a higher degree of correlation, which means the style of the two images matches. Suppose image A is an image of grass and image B is also an image of grass. You see the dark orange points on both images:
where they overlap, that's the area where the images have a similar, correlated style, so we can say these images are correlated to some extent, and if block B were fully orange, we would say
that combination has a higher degree of correlation. In our case, when we calculate style loss, image A will be the style image and image B will be the combined image, the resultant image that we get after training. So, in order to actually have the style loss calculated
for different layers of our network, we use a thing called the Gram matrix. The Gram matrix is the dot product of all the feature maps against all the feature maps. Suppose you have a layer: you take the dot product of feature A with feature A,
then the dot product of A with B, and so on. In this image it should be clear. We have image A, image B, and image C. All these images are of grass, but one contains only grass, one contains bushes, and one contains some brown grass. What we have is the Gram matrix,
which you see calculated on the right. We map feature one against feature one and calculate the correlation between them by taking the dot product of the feature maps; then we take the dot product of feature one with feature two of image A, meaning the dot product of this one and this one is taken,
and the overlap, the result of the dot product, is shown by some colour. Likewise, we do this for all the features of a layer of an image. So for image A, we first take the dot product across all the feature maps
against the feature maps of that same image, and we get this Gram matrix. Likewise, we generate Gram matrices for image B and image C. Then, if the Gram matrices of two images, in our case the style image and the resultant generated image,
are highly comparable, we say that the style actually held throughout the training, so the style was transferred to the generated image, and in that case we consider it a success. So we again have to calculate the mean squared error, that's the L2 error,
and we have to minimize that error. Coming to the code, we start by setting the style loss to zero. We have the definition of the gram_matrix function here. This function flattens the feature map, as I showed you,
then takes the dot product of it against itself and returns a Gram matrix: a feature-map dot product is taken and we get one cell of the Gram matrix, and doing this for all the feature maps we get the whole Gram matrix. Then we have the style loss function,
where we calculate the mean squared error between the Gram matrix of the combined image and that of the style image. So it's nothing but the squared error between the combined image and the style image, and these are just some parameters that we pass to the loss function.
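A minimal sketch of those two functions, following the Keras neural-style-transfer example the talk adapts (the normalisation constants come from that example):

```python
import tensorflow as tf

def gram_matrix(features):
    # Flatten each feature map (channels last -> channels first), then take the
    # dot product of the flattened maps against themselves.
    features = tf.transpose(features, (2, 0, 1))
    features = tf.reshape(features, (tf.shape(features)[0], -1))
    return tf.matmul(features, tf.transpose(features))

def style_loss(style_features, combination_features, img_nrows, img_ncols):
    S = gram_matrix(style_features)
    C = gram_matrix(combination_features)
    channels = 3
    size = img_nrows * img_ncols
    # Squared error between the two Gram matrices, normalized as in the example.
    return tf.reduce_sum(tf.square(S - C)) / (4.0 * (channels ** 2) * (size ** 2))
```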
This feature_layers list shows the active layers that we chose. Suppose we chose block one: the loss would be calculated against the layers chosen from block one. In fact, we took block one, block two, block three,
block four, block five, taking one layer from each of the five blocks: block1_conv1, block2_conv1, block3_conv1, block4_conv1, block5_conv1. So the lower-level layers from each block have been considered in order to have the style loss calculated. We then extract the feature maps from these layers
and pass them to the style loss function that we have. Then comes the total variation loss. The total variation loss is nothing but a loss with respect to the quality of the resultant image that we are observing. In case the combined image is distorted
and is pixelated, we will consider it noisy and the loss will be very high. What we can do is take the combined image and shift its pixels to the right by one, and, as another step, take each pixel of the generated image
and shift it down by one pixel. We store both of these results in A and B respectively, then take the sum of the two and calculate the error. So we calculate the error by shifting the pixels
to the right and downwards, and that tells us whether the image is highly distorted or not. So that's it for the total variation loss.
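Here is that pixel-shift formulation as a minimal sketch, again following the Keras example (`x` is the 4-D generated-image tensor, `img_nrows`/`img_ncols` its height and width):

```python
import tensorflow as tf

def total_variation_loss(x, img_nrows, img_ncols):
    # a: difference between each pixel and its neighbour one step down.
    a = tf.square(x[:, : img_nrows - 1, : img_ncols - 1, :] - x[:, 1:, : img_ncols - 1, :])
    # b: difference between each pixel and its neighbour one step to the right.
    b = tf.square(x[:, : img_nrows - 1, : img_ncols - 1, :] - x[:, : img_nrows - 1, 1:, :])
    return tf.reduce_sum(tf.pow(a + b, 1.25))
```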
Once we have these three losses, we just combine all three and we get the resultant total loss, and then it's time to start training the model. So then comes the training phase. So far we have computed the losses: we have computed the Gram matrices, based on which we have computed the style loss; we have computed the content loss by taking into consideration the content image as well as the generated image; and we have calculated the total variation loss.
Now the model can be trained by taking this loss into account, and we need to minimize this loss, so we'll be using an optimization technique here. Essentially, neural style transfer relies on a quasi-Newton
numerical optimization technique called BFGS, and the L stands for limited memory, meaning we can constrain it on the basis of resources. So L-BFGS is the limited-memory BFGS algorithm, a numerical optimization algorithm, and what it does is find a local minimum of an objective function based on the gradient of that objective function.
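The classic Keras example wraps this in a small Evaluator class and hands it to SciPy's L-BFGS-B routine; a minimal sketch of that pattern is below, where `eval_loss_and_grads` is assumed to be defined elsewhere and to return the scalar loss and the flattened gradients for a given image vector (in the original example it is built from the combined loss with a Keras backend function):

```python
import numpy as np
from scipy.optimize import fmin_l_bfgs_b

class Evaluator:
    """Caches loss and gradients so one forward/backward pass serves both
    callbacks that fmin_l_bfgs_b expects."""
    def __init__(self):
        self.loss_value = None
        self.grad_values = None

    def loss(self, x):
        # `eval_loss_and_grads(x)` is assumed to exist (see lead-in above).
        loss_value, grad_values = eval_loss_and_grads(x)
        self.loss_value = loss_value
        self.grad_values = grad_values
        return self.loss_value

    def grads(self, x):
        grad_values = np.copy(self.grad_values)
        self.loss_value = None
        self.grad_values = None
        return grad_values

evaluator = Evaluator()
# x is the flattened generated image; each call nudges its pixels along the
# negative gradient of the combined loss:
# for i in range(iterations):
#     x, min_val, info = fmin_l_bfgs_b(evaluator.loss, x.flatten(),
#                                      fprime=evaluator.grads, maxfun=20)
```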
So essentially what we need to do is minimize this computed loss over iterations using gradient-based descent, and what we'll be doing is updating the value of each pixel by an amount which is proportional to the negative of the gradient
that comes from this loss function. So let's dive into the code here. Okay. It's pretty much the same as I showed you in the snippets. We have the base image path, the style reference image path, and the weights defined here.
We have defined the total variation weight, the style weight, and the content weight. We process the images, specify the dimensions of the generated image that we want, and specify the number of iterations. For this demo I have used only 50 iterations, but in a real scenario
you would use something like 4,000 or 5,000 iterations to actually see the result that I showed you in the slides. So let me go back to the slide once and show you the end result: this is the end result we should see, and this is after 4,000 iterations of training.
We have the pre-processing of the images: it just opens and resizes the image, applies the image processing functions, and then we basically get a tensor out of it. We also have a deprocess_image function, which converts a tensor back to an image.
Nothing much is done there: a reshape is applied, we clip the NumPy array for any overflow that's happening, and then we just pass the base image and the style image to the pre-process function that we have created and get the tensor representations of the two images.
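Those two helpers, roughly as they appear in the Keras example the talk adapts (`img_nrows`/`img_ncols` are the target dimensions):

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.applications import vgg19

def preprocess_image(image_path, img_nrows, img_ncols):
    img = keras.preprocessing.image.load_img(image_path, target_size=(img_nrows, img_ncols))
    img = keras.preprocessing.image.img_to_array(img)
    img = np.expand_dims(img, axis=0)          # add the batch dimension
    img = vgg19.preprocess_input(img)          # RGB -> BGR + ImageNet mean subtraction
    return tf.convert_to_tensor(img)

def deprocess_image(x, img_nrows, img_ncols):
    x = x.reshape((img_nrows, img_ncols, 3))
    # Add back the ImageNet channel means removed by vgg19.preprocess_input.
    x[:, :, 0] += 103.939
    x[:, :, 1] += 116.779
    x[:, :, 2] += 123.68
    x = x[:, :, ::-1]                          # BGR -> RGB
    return np.clip(x, 0, 255).astype("uint8")  # clip any overflow
```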
Once we get the tensor representations of the two images, we have all three images: the combined image, which we are treating as a placeholder image for now, the style image, which is pre-processed, and the base image, which is pre-processed. These three are ready to be
concatenated into one tensor, which we feed into the VGG19 pre-trained network that we imported at the top. Then we have the model loaded, and we also have the key layers
that we want to match against collected in a dictionary here, mapping layer names to their outputs. Now it's time to compute the neural style loss that we talked about earlier. Again, this is a similar function
to what I showed you earlier, slightly modified from the one in the slide. There's the gram_matrix function, which I have already explained: it calculates a Gram matrix for the tensor that is fed in. Then there's the style loss, where the Gram matrices of the combined image
and the style image are generated and we calculate the L2 error. Then we have the content loss, which is simply the squared error between the generated and base image features. And the total variation loss, as I told you, shifts by one pixel to the right and one pixel downward,
takes the sum of both, and calculates the error from that. Then we take all three of these losses and add them to form the main loss.
The next thing you should see is this: we have the gradients computed for this particular loss, and we feed them to the Evaluator class that we have created.
The Evaluator class returns the loss and the gradient values at each stage, and we have the iteration loop set up. So my network is optimizing, as you can see. Over, say, 4,000 iterations, what we do is take this loss, pass it to the Evaluator class,
get the loss and gradients, and then update the value of each pixel by the negative of this gradient. Ultimately, what we see is this combined image. So that's all for the main part of this talk. I'll be posting links to all of this code.
So next we have Pix2Pix; I'll just talk a bit about it. It's used for image-to-image translation. You can have, say, a hand-drawn sketch or a schematic diagram being translated into an image which looks real, a blueprint-like representation being translated into an actual building,
silhouettes being translated into images, Google Maps street view being translated to map view, and so on. And then we have CycleGAN, which is again an advanced GAN for style transfer. Here we essentially have two GANs:
one trains on the first input and one trains on the other input, but both are cyclically dependent on each other. So that's another application, but discussing it is out of the scope of this talk, since we restricted ourselves to neural style transfer. This is something I'm pointing you towards: you can explore CycleGAN
as well if you are more inclined towards GANs. So these three are the popular networks out there for creating stylistic artefacts. I have linked to each one of them, and wherever I used resources, I used code references from David Foster's Generative Deep Learning
and Jakub Langr's GANs in Action. These are the two reference books that I consulted. So that concludes my talk. We are also hiring at OLX Group, so feel free to reach out to me or just drop by the careers section and apply for the roles.
And yep, don't forget to follow me on Twitter and LinkedIn. We can get connected, and we can have your questions answered in Discord too, and later on we can connect on these platforms. So thanks a lot for listening.