We are all acquainted with the VGG network for image classification: it uses a multi-layered convolutional network to learn the features required for classification. This article focuses on another property of these multi-layered convolutional networks: their ability to transfer the semantic content of one image into different styles. This is the algorithm behind applications like Prisma, Lucid, Ostagram, NeuralStyler and The Deep Forger.
Based on these findings, they devised an algorithm for style transfer: start from random noise as the initial result, then change the pixel values iteratively through backpropagation until the stylized image simultaneously matches the content representation of the content image and the style representation of the style image.
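As a toy illustration of this pixel-optimization loop (not the paper's actual VGG pipeline), we can stand in for the network with a fixed random linear "feature extractor" and minimize a content-matching loss by gradient descent on the pixels; `W`, the image sizes, and the learning rate here are all assumptions for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "network": a fixed random linear map from pixels to features.
# (The real algorithm uses the hidden activations of VGG instead.)
W = rng.standard_normal((64, 256))

content_img = rng.standard_normal(256)   # flattened "content photograph"
content_feats = W @ content_img          # target content representation

x = rng.standard_normal(256)             # start from random noise
lr = 1e-3
for _ in range(500):
    diff = W @ x - content_feats         # feature-space mismatch
    grad = W.T @ diff                    # gradient of 0.5*||Wx - Wp||^2 w.r.t. pixels
    x -= lr * grad                       # update the pixels, not the network weights

loss = 0.5 * np.sum((W @ x - content_feats) ** 2)
```

The key point the sketch shows: the network is frozen, and backpropagation delivers gradients with respect to the image itself.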
Their loss function consists of two types of losses: a content loss and a style loss.
L_content(p, x, l) = (1/2) Σ_{i,j} (F(i,j) − P(i,j))²

where l stands for a layer in the conv network, F(i,j) is the activation of the ith filter at position j for the stylized image x, and P(i,j) is the activation of the ith filter at position j for the content image p. This is just the squared error over the corresponding representations of the two images. (Note: we are talking about representations from the hidden layers of a CNN, here VGG.)
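Under these definitions, the content loss is just a sum of squared differences over one layer's activations. A minimal NumPy sketch, treating the feature maps as plain arrays (the toy values are made up for the example):

```python
import numpy as np

def content_loss(F, P):
    """0.5 * sum of squared differences between the stylized image's
    features F and the content image's features P at one layer."""
    return 0.5 * np.sum((F - P) ** 2)

# Toy feature maps: 3 filters, 4 spatial positions each.
P = np.array([[1., 2., 3., 4.],
              [0., 1., 0., 1.],
              [2., 2., 2., 2.]])
F = P + 1.0  # every activation is off by exactly 1
print(content_loss(F, P))  # 0.5 * 12 * 1^2 = 6.0
```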
Here G is called the Gram matrix, and it contains the correlations between the filter responses. G(i,j) at layer l is the inner product between the vectorized feature maps of the ith filter and the jth filter:

G(i,j) = Σ_k F(i,k) · F(j,k)

These correlations capture the texture information; for a more detailed analysis of image style transfer using neural networks, look into this paper by Gatys.
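A minimal sketch of the Gram matrix for one layer, with each feature map flattened into a row vector (the shapes are assumptions for the example):

```python
import numpy as np

def gram_matrix(F):
    """F has shape (N_filters, M_positions): each row is one vectorized
    feature map. G[i, j] is the inner product of filters i and j."""
    return F @ F.T

# Toy layer: 2 filters, 3 spatial positions.
F = np.array([[1., 0., 2.],
              [0., 3., 1.]])
G = gram_matrix(F)
# G[0, 1] = 1*0 + 0*3 + 2*1 = 2, and G is symmetric.
```

Note that flattening over spatial positions is what throws away the content's layout and keeps only which filters fire together, i.e. the texture.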
Now, suppose we have Gram matrix G for the stylized image and Gram matrix A for the image whose style representation we want to capture. Let N(l) be the number of distinct filters in layer l and M(l) be the size of each feature map (height times width). Then the contribution of layer l to the loss is:

E(l) = (1 / (4 · N(l)² · M(l)²)) Σ_{i,j} (G(i,j) − A(i,j))²
And now, taking a weighted sum over all the layers, we have the total style loss:

L_style(a, x) = Σ_l w(l) · E(l)
where w(l) is the weight assigned to the style loss in layer l.
The total loss is a weighted sum of the content loss and the style loss:

L_total(p, a, x) = α · L_content(p, x) + β · L_style(a, x)

where p is the photograph from which we want to capture the content, a is the artwork from which we want to capture the style, and x is the generated image. α and β are the weighting factors for content and style reconstruction respectively.
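The combination itself is a one-liner; a sketch with α and β as the tunable knobs discussed below (the default values here are illustrative assumptions, not the paper's settings):

```python
def total_loss(content_l, style_l, alpha=1.0, beta=1000.0):
    """Weighted sum of the two losses. Only the ratio alpha/beta
    matters for the look of the result; the absolute scale just
    rescales the gradients."""
    return alpha * content_l + beta * style_l

# Example: content loss 2.0, style loss 3.0, alpha=1, beta=10.
print(total_loss(2.0, 3.0, alpha=1.0, beta=10.0))  # 1*2 + 10*3 = 32.0
```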
In the paper, they matched the content representation on layer conv4_2 of the VGG net, the second convolutional layer of the fourth block. The style representation was matched on conv1_1, conv2_1, conv3_1, conv4_1 and conv5_1, i.e. correlations from the first convolutional layer of each block.
So, if you are planning to build your own neural artistic style transfer algorithm, take the representation for the content loss from the middle-to-late layers, and for the style loss do not ignore the early layers.
As we already know, the ratio α/β determines the balance between content and style in the generated image. Let's see what happens as we decrease it (i.e. weight the style image more).
The optimization starts with an image containing noise and reaches a decent stylized image:
If we give more importance to style, i.e. decrease α/β, we get:
Decreasing α/β further, we get:
So you see, it is necessary to tune the α/β ratio manually in the default Gatys et al. algorithm for an aesthetically pleasing output.