Neural Style Transfer

Nishant Nikhil
10th May, 2018

We are all acquainted with the VGG network for image classification: it uses a multi-layered convolutional network to learn the features required for classification. This article focuses on another property of these multi-layered convolutional networks, namely the ability to migrate the semantic content of one image into different styles. This is the algorithm behind applications like Prisma, Lucid, Ostagram, NeuralStyler and The Deep Forger.

The algorithm takes a content image and a style image, and produces a style-transferred image that re-renders the content in the chosen style.
This is called style transfer in the world of deep learning. Art generation with neural style transfer started with Gatys et al. (2015), who found that:

  1. The image content and style were separable from the image representation derived from CNNs.
  2. Higher layers in the network capture the high-level content in terms of objects and their arrangement in the input image. (Content Representation)
  3. By including the feature correlations of multiple layers, a stationary, multi-scale representation of the input image is obtained, which captures its texture information but not the global arrangement. (Style Representation)

Based on these findings, they devised an algorithm for style transfer: start from random noise as the initial result, then iteratively change the pixel values through backpropagation until the generated image simultaneously matches the content representation of the content image and the style representation of the style image.
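To make the idea concrete, here is a minimal PyTorch sketch (the framework choice is mine, not the paper's) of optimizing the pixels themselves. The real objective combines the content and style losses defined below, so the MSE to a random target here is only a runnable stand-in:

```python
import torch
import torch.nn.functional as F

target = torch.rand(1, 3, 224, 224)                          # stand-in for "what we want to match"
generated = torch.randn(1, 3, 224, 224, requires_grad=True)  # start from random noise

optimizer = torch.optim.Adam([generated], lr=0.05)           # the pixels are the parameters
for step in range(300):
    optimizer.zero_grad()
    loss = F.mse_loss(generated, target)                     # placeholder for the total loss defined below
    loss.backward()                                          # gradients flow into the pixel values
    optimizer.step()                                         # update the image, not any network weights
```

The only thing being trained is the image; the network that produces the representations stays frozen throughout.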

Loss Function

Their loss function consists of two types of losses:

1. Content loss:

L_content(p, x, l) = (1/2) Σ_{i,j} ( F(i,j) − P(i,j) )²

where l stands for a layer in the conv network, F(i,j) is the activation of the ith filter at position j for the stylized image, and P(i,j) is the activation of the ith filter at position j for the content image, both taken at layer l. This is just the MSE over the corresponding representations of the two images. (Note: we are talking about representations from the hidden layers of a CNN, here VGG.)
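A minimal sketch of this loss in PyTorch, where `gen_features` and `content_features` stand for the layer-l feature maps F and P:

```python
import torch

def content_loss(gen_features: torch.Tensor, content_features: torch.Tensor) -> torch.Tensor:
    """Half the summed squared difference between the feature maps of the
    generated image (F) and the content image (P) at one layer."""
    return 0.5 * torch.sum((gen_features - content_features) ** 2)

# Example with dummy feature maps shaped (channels, height, width):
f = torch.randn(256, 32, 32)
p = torch.randn(256, 32, 32)
loss = content_loss(f, p)
```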

2. Style loss:

Here we first need the gram matrix G, which contains the correlations between the filter responses:

G(i,j) at layer l = Σ_k F(i,k) · F(j,k)

i.e. the inner product between the vectorized feature map of the ith filter and that of the jth filter. This captures the texture information; for a more detailed analysis of image style transfer using neural networks, look into the paper by Gatys et al.
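A minimal PyTorch sketch of the gram matrix, assuming a single image's feature maps with shape (channels, height, width):

```python
import torch

def gram_matrix(features: torch.Tensor) -> torch.Tensor:
    """Gram matrix of one layer's feature maps: G[i, j] is the inner product
    between the vectorized feature maps of filters i and j."""
    c, h, w = features.shape
    flat = features.reshape(c, h * w)   # one row per filter
    return flat @ flat.t()              # (c, c) matrix of filter correlations
```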

Now, let G be the gram matrix of the stylized image and A the gram matrix of the image whose style we want to capture. Let N(l) be the number of distinct filters in layer l and M(l) the size of each feature map (height times width). The contribution of layer l to the style loss is then:

E_l = (1 / (4 · N(l)² · M(l)²)) Σ_{i,j} ( G(i,j) − A(i,j) )²

And now, taking a weighted sum over all the layers, we have the total style loss:

L_style(a, x) = Σ_l w(l) · E_l

where w(l) is the weight assigned to the style loss in layer l.
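Putting the pieces together, here is a hedged PyTorch sketch of the style loss: `layer_style_loss` is the per-layer contribution E_l above, and `total_style_loss` takes the weighted sum over a dictionary of chosen style layers. The dictionary layout and the reuse of the `gram_matrix` helper from the earlier sketch are my assumptions:

```python
import torch

def layer_style_loss(gen_gram, style_gram, n_filters, map_size):
    """E_l: squared gram-matrix difference, normalised by 1 / (4 N(l)^2 M(l)^2)."""
    return torch.sum((gen_gram - style_gram) ** 2) / (4 * n_filters**2 * map_size**2)

def total_style_loss(gen_feats, style_feats, layer_weights):
    """Weighted sum of E_l over the chosen style layers, i.e. sum over l of w(l) * E_l."""
    loss = torch.zeros(())
    for name, w in layer_weights.items():
        c, h, wd = gen_feats[name].shape                   # (channels, height, width)
        g = gram_matrix(gen_feats[name])                   # gram_matrix from the sketch above
        a = gram_matrix(style_feats[name])
        loss = loss + w * layer_style_loss(g, a, n_filters=c, map_size=h * wd)
    return loss
```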

3. Total Loss

The total loss is a weighted sum of the content loss and the style loss:

L_total(p, a, x) = α · L_content(p, x) + β · L_style(a, x)

where p is the photograph from which we want to capture the content, a is the artwork from which we want to capture the style, and x is the generated image. α and β are the weighting factors for content and style reconstruction respectively.
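As a sketch, the combination itself is a one-liner; the default α and β below are purely illustrative assumptions, not values from the paper:

```python
def total_loss(content_term, style_term, alpha=1.0, beta=1000.0):
    # alpha/beta is the content-style trade-off discussed further below;
    # in practice this ratio is the main knob to tune.
    return alpha * content_term + beta * style_term
```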

Values used in the paper

In the paper, they matched the content representation on layer conv4_2 of the VGG net, a single layer fairly deep in the network, and the style representation on conv1_1, conv2_1, conv3_1, conv4_1 and conv5_1, i.e. correlations taken from the first convolutional layer of every block.

So, if you are planning to build your own neural artistic style transfer algorithm, take the content representation from the middle-to-late layers, and for the style loss do not ignore the early layers.
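Below is a sketch of extracting exactly those activations with a pretrained VGG-19 from a recent torchvision. The layer indices map the named layers conv1_1 … conv5_1 and conv4_2 to positions inside `vgg19().features`; both the indices and the use of torchvision are my assumptions, not part of the original post:

```python
import torch
from torchvision.models import vgg19, VGG19_Weights

# Positions of the named conv layers inside vgg19().features
LAYER_INDICES = {"conv1_1": 0, "conv2_1": 5, "conv3_1": 10,
                 "conv4_1": 19, "conv4_2": 21, "conv5_1": 28}

def extract_features(image: torch.Tensor, model=None) -> dict:
    """Run `image` (shape 1 x 3 x H x W, ImageNet-normalised) through VGG-19
    and collect the activations at the layers used in the paper."""
    model = model or vgg19(weights=VGG19_Weights.IMAGENET1K_V1).features.eval()
    wanted = {v: k for k, v in LAYER_INDICES.items()}
    feats, x = {}, image
    for idx, layer in enumerate(model):
        x = layer(x)
        if idx in wanted:
            feats[wanted[idx]] = x
    return feats
```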


The content style trade-off

As we have seen, the ratio α/β determines how much content is preserved and how strongly the style shows in the generated image. Let's see what happens when we decrease it (i.e. give the style image more weight).

Starting from an image containing noise, the optimization reaches a decent stylized image. If we give more importance to style, i.e. decrease α/β, the style starts to dominate the result, and decreasing α/β even further pushes this still more. So, in the default Gatys et al. algorithm, the α/β ratio has to be tuned manually to get an aesthetically pleasing output.

Improvements

  1. Though the original algorithm uses the VGG net for the representations, the same approach can be applied to other networks trained for object recognition (e.g. ResNet).
  2. Gatys et al.'s algorithm requires manual parameter tuning; Risser et al. (2017) tune the parameters automatically using gradient information, so as to prevent extreme gradient values.
  3. Risser et al. also introduce a new histogram loss that improves stability.
  4. Patch-based style loss: the style loss of Gatys et al. captures only feature correlations and does not constrain the spatial layout, but for images the local correlation between pixels is important for visual aesthetics, so Li and Wand introduce a patch-based loss.
  5. Fast Neural Style: train an equivalent feedforward generator network for each specific style; then, at runtime, only a single forward pass is required (Johnson et al.).
  6. Dumoulin et al. train a conditional feedforward generator network, where a single network is able to generate multiple styles (a minimal sketch of this idea follows the list).
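As an illustration of the last point, here is a minimal sketch (my own, not code from Dumoulin et al.) of conditional instance normalization, the mechanism that lets one generator hold many styles: the convolution weights are shared across styles, and each style only owns a per-channel scale and shift.

```python
import torch
import torch.nn as nn

class ConditionalInstanceNorm2d(nn.Module):
    """Instance normalization whose affine parameters are selected per style."""
    def __init__(self, num_channels: int, num_styles: int):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_channels, affine=False)
        self.gamma = nn.Embedding(num_styles, num_channels)   # per-style scale
        self.beta = nn.Embedding(num_styles, num_channels)    # per-style shift
        nn.init.ones_(self.gamma.weight)
        nn.init.zeros_(self.beta.weight)

    def forward(self, x: torch.Tensor, style_id: torch.Tensor) -> torch.Tensor:
        g = self.gamma(style_id).unsqueeze(-1).unsqueeze(-1)  # (N, C, 1, 1)
        b = self.beta(style_id).unsqueeze(-1).unsqueeze(-1)
        return g * self.norm(x) + b

# Usage with hypothetical generator features:
layer = ConditionalInstanceNorm2d(num_channels=64, num_styles=32)
features = torch.randn(1, 64, 128, 128)
out = layer(features, style_id=torch.tensor([3]))             # render in style number 3
```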