Important note: this page, and this repository more generally, are constantly updated as experiments finish and become ready to post, and so will change considerably as new material is added. As of January 9th, 2023, this project is ongoing.
In this repository and on this page I will address the following questions: how can one transform an image (e.g. a random photograph) like the one on the left below[^1] to look like it was painted in the style of the middle image (Van Gogh's "The Starry Night"), perhaps obtaining a gif[^2] of the transformation like the one in the image below on the right? How long does it take to produce such results on regular laptop hardware? What kind of results are even possible?
Here are a few more samples, from runs of experiments that succeeded as well as ones that failed!
Brief description. In what follows I will report, in long form and with various Python scripts and Jupyter Notebooks, on experiments with transfer learning and more precisely neural style transfer. I will restrict attention to the scenario of transforming a photograph to look like a famous painting (Van Gogh's The Starry Night unless otherwise noted). The interest is at once
- practical: running times of various algorithms, optimization schemes, etc. on normal/old hardware
- and abstract: how pleasant and usable the results are.
I will use and report these two aspects as soft metrics throughout.
Idea, use case scenario, and some questions. Suppose an artist, owning normal or even past-its-prime hardware (perhaps a 7 or 10 year-old Macbook Pro/Air, perhaps a 3-4 year-old PC), would like to add neural style transfer to their set of techniques. Is this possible given the constraints? Is this feasible? Can subjectively interesting results be achieved given enough time? Is the investment in learning neural networks and using a machine learning platform worth it despite the possible impracticality of the actual algorithms and uselessness of the results generated? That is, can transfer learning be achieved at the human level and the techniques be useful somewhere else down the pipeline? (The last question obviously transcends whether the user is an artist.) Is it sustainable? (Is it environmentally sustainable? Is electricity pricing making this prohibitive to do at home? Is it resource intensive to the point one has to dedicate a computer solely to the task?)
Objectives. Here are two important objectives guiding the experiments:
- achieve interesting (for some reasonable definition of interesting) results on a human-scale budget, on middle-to-low-level hardware (could be years old), perhaps without using GPU training or even CPU parallelization;
- achieve human-scale displayable and printable images; note that there is a large difference between the two image scales here, and my experiments will for the most part use images of size 533x400 (width by height) pixels, which is the display scale. The printable scale is two orders of magnitude (roughly 100 times) bigger (10x in each direction): for a 30x40 cm quality print, one is looking at images of 20+ megapixels, or gigantic resolutions on the order of 6000x4500 pixels. The printable scale is perhaps beyond any reasonable machine learning algorithm for the moment (late 2022), at least without serious upsampling or other heavy pre/post-processing image manipulation tricks, and this scale is certainly beyond what any human-scale computer (i.e. one as described in the previous paragraph) can currently do in a matter of hours or a few days.
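For concreteness (assuming a standard print resolution of 300 dpi, which is my assumption rather than a hard requirement), the gap between the two scales works out roughly as

$$533 \times 400 \approx 0.21 \text{ MP}, \qquad \left(\tfrac{30}{2.54} \cdot 300\right) \times \left(\tfrac{40}{2.54} \cdot 300\right) \approx 3543 \times 4724 \approx 16.7 \text{ MP},$$

so even a conservative print needs roughly 80 times the pixel count of the display-scale images, and the 6000x4500 (27 MP) figure is about 125 times, in line with the two-orders-of-magnitude estimate above.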
On the software side, the following standard packages are used alongside Python 3:
- `tensorflow`
- `numpy`
- `PIL` (the Python Imaging Library)
- `matplotlib`
- `imagemagick` (to make gifs or do other image conversions)
On the scientific side, I assume enough knowledge of machine learning and linear algebra to start experimentation. A quick basic tutorial is this great website by Harish Narayanan. A very brief introduction preaching to the choir (i.e. for people already familiar with machine learning) is given below.
On the artistic side, I assume nothing, and what perhaps looks interesting to me may well look completely bizarre to a million other randomly sampled people.
The following hardware has been used for the experiments:
- 2019 HP EliteBook 735 G6 14", 32 GB of RAM, AMD Ryzen 5 Pro 3500U CPU, Ubuntu 22.04
- 2015 Macbook Pro Retina 13", 8GB of RAM, Intel Core i5 CPU, MacOS Big Sur
and in practice the HP laptop has run the code more often than not, as it is slightly faster and less needed in day-to-day activities (in no small part due to its poor battery life).
Remark. Things may change here, and these changes will be reported as I upgrade computers, or perhaps migrate some experiments to the cloud, etc.
To quote François Chollet from the Keras webpage explaining the subject,
> transfer learning consists of taking features learned on one problem, and leveraging them on a new, similar problem. For instance, features from a model that has learned to identify racoons may be useful to kick-start a model meant to identify tanukis.
For us, models mean (convolutional) neural networks, usually trained on the ImageNet database. Two such examples are the VGG16 and VGG19 networks of Simonyan and Zisserman referenced below. They can be found in the Keras applications API.
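For reference, loading such a pre-trained network through the Keras applications API is a one-liner; a minimal sketch (the notebooks and scripts below may configure the model slightly differently):

```python
from tensorflow import keras

# VGG19 pre-trained on ImageNet, without its classification head:
# for style transfer only the convolutional feature maps are needed.
model = keras.applications.vgg19.VGG19(weights="imagenet", include_top=False)
model.summary()
```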
Neural style transfer (NST) is a form of transfer learning originally introduced in the paper of Gatys et al referenced below. For a detailed textbook description, see Chollet's Deep learning with Python, second edition, from Manning Press (referenced below). See also the article Convolutional neural networks for artistic style transfer by Harish Narayanan (from his website, referenced below).
In its simplest form, NST takes a content image $C$ and a style image $S$, and produces a generated image $X$ by minimizing a total loss function of the form

$$L(X) = \alpha \, L_{\text{content}}(C, X) + \beta \, L_{\text{style}}(S, X) + \gamma \, L_{\text{tv}}(X)$$

with $\alpha, \beta, \gamma \geq 0$ weights balancing the three components.

The total variation loss function is the easiest to explain: it looks at pixels which are nearby in $X$ and penalizes large differences between them, acting as a spatial smoothness regularizer:

$$L_{\text{tv}}(X) = \sum_{i,j} \left( (X_{i+1,j} - X_{i,j})^2 + (X_{i,j+1} - X_{i,j})^2 \right)$$

where $X_{i,j}$ denotes the pixel of $X$ at position $(i,j)$ (summed over the colour channels).

The content loss compares the activations of a fixed "content" layer of a pre-trained network when fed $C$ and when fed $X$:

$$L_{\text{content}}(C, X) = \frac{1}{2} \sum_{i,j,k} \left( a_{ijk}(C) - a_{ijk}(X) \right)^2$$

where $a_{ijk}(\cdot)$ is the activation at spatial position $(i,j)$ and channel $k$ of that layer.

The style loss compares correlations between channels (Gram matrices) of the activations:

$$L_{\text{style}}(S, X) = \sum_{\ell} \delta_\ell \sum_{k, k'} \left( Gr(S)_{kk'} - Gr(X)_{kk'} \right)^2$$

where the sum is over a bunch of layers $\ell$ of the network (the Gram matrices being computed from the activations of layer $\ell$), the $\delta_\ell$ are per-layer style weights, and the Gram matrix of a layer is

$$Gr(X)_{kk'} = \frac{1}{(2 n_c n_h n_w)^2} \sum_{i=1}^{n_h} \sum_{j=1}^{n_w} a_{ijk}(X) a_{ijk'}(X)$$

with $n_h, n_w, n_c$ the height, width, and number of channels of the layer in question.
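To make the formulas concrete, here is a minimal TensorFlow sketch of the three loss components for activations of shape `(n_h, n_w, n_c)`; the function names and normalization details are illustrative and the actual notebooks/scripts differ:

```python
import tensorflow as tf

def gram_matrix(a):
    # a: activations of one layer, shape (n_h, n_w, n_c)
    n_h, n_w, n_c = a.shape
    feats = tf.reshape(a, (n_h * n_w, n_c))            # flatten the spatial dimensions
    gram = tf.matmul(feats, feats, transpose_a=True)   # (n_c, n_c): sums of a_ijk * a_ijk'
    return gram / tf.cast((2 * n_c * n_h * n_w) ** 2, a.dtype)  # normalization as in the formula above

def content_loss(a_content, a_generated):
    # squared difference of activations at the chosen "content" layer
    return 0.5 * tf.reduce_sum(tf.square(a_content - a_generated))

def style_loss(style_acts, generated_acts, deltas):
    # weighted sum, over the chosen style layers, of Gram-matrix differences
    loss = 0.0
    for a_s, a_g, delta in zip(style_acts, generated_acts, deltas):
        loss += delta * tf.reduce_sum(tf.square(gram_matrix(a_s) - gram_matrix(a_g)))
    return loss

def tv_loss(x):
    # x: generated image of shape (1, height, width, 3); squared differences
    # between vertically and horizontally neighbouring pixels
    dh = x[:, 1:, :, :] - x[:, :-1, :, :]
    dw = x[:, :, 1:, :] - x[:, :, :-1, :]
    return tf.reduce_sum(tf.square(dh)) + tf.reduce_sum(tf.square(dw))
```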
Remark. The pre-trained neural network here (for the sake of clarity, one of VGG16 or VGG19) is used as a distance function computing machine. That is, it is used to compute the loss function, and it is sometimes referred to as the loss neural network.
Remark. In the description above, NST takes $C$ and $S$ as fixed inputs; the quantity optimized by gradient descent is the generated image $X$ itself (starting from some initial image, often the content image), not the weights of the network, which remain frozen throughout.
The first experiment, a baseline as it were, follows François Chollet's detailed treatment of the Gatys et al paper as given in Section 12.3 of his book Deep learning with Python (second edition). The relevant file is the Jupyter notebook `notebooks/nst_orig_chollet_book.ipynb`. The network used in the text, and the one in the code, is the VGG19 network of Simonyan and Zisserman. All parameters for the original run, including the number of iterations (4000), were left unchanged. In particular the weights for style, content, and total variation are as follows:
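(The values below are the defaults of Chollet's published Keras example, which the notebook follows as far as I can tell; worth double-checking against `notebooks/nst_orig_chollet_book.ipynb`.)

```python
# weights of the three loss components in Chollet's reference code
total_variation_weight = 1e-6
style_weight = 1e-6
content_weight = 2.5e-8
```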
Other runs were made at 1000 training step iterations, with both VGG19 and VGG16 (changing from one to the other is a matter of simply swapping 19 and 16 in the cell loading the network). The optimizer was always `SGD` with a massive (to my intuition) learning rate of 100, decaying every 100 steps.
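In code, that optimizer setup looks roughly as follows (the decay rate of 0.96 is the value I recall from the Keras example and is an assumption here):

```python
from tensorflow import keras

# SGD whose learning rate starts at 100 and decays exponentially every 100 steps
optimizer = keras.optimizers.SGD(
    keras.optimizers.schedules.ExponentialDecay(
        initial_learning_rate=100.0, decay_steps=100, decay_rate=0.96
    )
)
```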
In terms of timing I can report the following:
- the difference between VGG16 and VGG19 is somewhat significant:
Time per 100 training steps | Network |
---|---|
approx 15 min | VGG16 |
approx 18 min | VGG19 |
- otherwise, absent any parallelization, things are slow, at least measured against the human perception of "computers being fast":
Total time | Network | Number of training steps | Run no. |
---|---|---|---|
approx 12h | VGG19 | 4000 | 1 |
approx 2h 30min | VGG16 | 1000 | 2 |
approx 2h 42min | VGG16 | 1000 | 3 |
approx 2h 43min | VGG16 | 1000 | 4 |
Runs 3 and 4 had the weights $\alpha, \beta, \gamma$ changed from their original values.
Here are some remarks:
- that this algorithm is slow absent any parallelization has been reported in every source I could find on the matter (and in almost all of the cited references). Just how slow things were was unclear until now;
- parameter tuning seems out of reach with this method;
- it is difficult to decrease the learning rate and still have the gradient descent algorithm actually decrease the loss function, so it seems this choice of parameters $\alpha, \beta, \gamma$ in combination with the learning rate (initially 100) is rather rigid;
- visually, results similar to the final result seem to be achieved after fewer than 1000 iterations, perhaps as low as 500 or 600.
How interesting are the results? First, despite the vast difference in running times between run 1 and run 4, the results are similar. Nevertheless one can indeed see more of the style after a larger number of iterations, as exemplified below: side by side are the results of VGG19 at 4000 iterations (run 1 above, on the left) and of VGG16 at 1000 iterations with weights different from the original ones (run 4 above, on the right).
Experiment 1, by the book, is rather slow to run, and results are interesting but not particularly pleasant to the eye. Chollet acknowledges both points in his book, and further claims one should expect no miracles with these hyperparameters. He also likens the approach more to a signal-processing one (noising, denoising, sharpening, etc.) than to a true deep learning approach, and certainly the results do not contradict his claims. However there are many hyperparameters and blocks of code that can be swapped, and the results are promising enough to pursue experimentation.
Finally, in the course of testing various hyperparameters that failed to make the training iterations converge (minimize the loss function), I accidentally saved one of the resulting images after just one iteration. The result is below. Now this is interesting, much more so than the above, to my eyes (except perhaps for the red tint, easily removable in post-processing).
The relevant file for this experiment is the Python script `src/nst_var_1.py` (for a change I decided on a Python script; a Jupyter notebook might also appear at some point). The approach is similar to that of Chollet but borrows ideas from the implementations of Narayanan and of GitHub user Log0 (see the references below), as well as from the Coursera course on convolutional neural networks. The main differences between the two approaches are:
- the Adam optimizer with a learning rate of 0.01 (Kingma and Ba, see reference) is used, as opposed to stochastic gradient descent (with its whopping initial learning rate of 100) used in the original approach;
- there is no total variation cost/loss in the overall loss function (in the above, $\gamma = 0$);
- the weights $\alpha, \beta$ as well as the style weights $\delta_\ell$ for each style layer $\ell$ considered are different, and more importantly they are hand-coded for this situation (for experiment 1, the per-layer style weights were set automatically depending on the layer); in particular the values hand-coded in the script already seem to give promising results;
- the training step and the cost functions are wrapped in a `@tf.function()` decorator for faster performance (see the sketch after this list);
- the input image is not the original content image, but rather a blend of uniform noise and the content image (i.e. a noise image correlated to some extent with the content image).
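To illustrate how these choices fit together, here is a minimal sketch of the training loop; `compute_loss`, `blend_with_noise`, `noise_ratio`, and the variable names are illustrative placeholders, and `src/nst_var_1.py` differs in the details:

```python
import tensorflow as tf

def blend_with_noise(content_image, noise_ratio=0.6):
    # start from a mix of uniform noise and the content image, so the generated
    # image is correlated with the content from the very first step
    noise = tf.random.uniform(tf.shape(content_image), -0.25, 0.25)
    return noise * noise_ratio + content_image * (1.0 - noise_ratio)

def compute_loss(generated_image, content_targets, style_targets):
    # placeholder: the real script combines the content loss and the weighted
    # per-layer style losses computed from VGG activations (no TV term here)
    return tf.reduce_mean(tf.square(generated_image))

optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)

@tf.function()  # trace the step so repeated calls run faster
def train_step(generated_image, content_targets, style_targets):
    with tf.GradientTape() as tape:
        loss = compute_loss(generated_image, content_targets, style_targets)
    grads = tape.gradient(loss, generated_image)
    optimizer.apply_gradients([(grads, generated_image)])
    return loss

# the generated image must be a tf.Variable so the optimizer can update it
content = tf.random.uniform((1, 400, 533, 3))  # stand-in for the preprocessed photo
generated = tf.Variable(blend_with_noise(content))
for step in range(3):  # a real run uses 1000+ steps
    loss = train_step(generated, content_targets=None, style_targets=None)
```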
The combination of hyperparameters, dropping the total variation loss, and using a noisy image for the start (as advertised already in the original paper of Gatys et al) makes this approach significantly faster for achieving the same types of results. Alternatively, one can wait similar times to those from experiment one but achieve much more "blended" images. No significant difference has been observed using this approach between the use of VGG19 and VGG16. Below are some runs of the program:
Total time | Network | Number of training steps | Run no. |
---|---|---|---|
approx 4h 45min | VGG19 | 5000 | 1 |
approx 1h 06min | VGG16 | 1000 | 2 |
approx 2h 50min | VGG19 | 2500 | 3 |
I suspect the 2x or 2.5x speedup from the previous setting is due to a variety of factors, and here testing each one individually is necessary to draw a better conclusion. They are:
- use of noise in the input image
- use of better hyperparameters (for example the choice of the per-layer style weights $\delta_\ell$)
- use of another optimization algorithm (Adam)
- drop of the total variation component of the loss function
- more aggressive use of the `@tf.function()` decorator (probably the least significant of the speedups)
For the same number of training steps, to my eye, this approach gives much better results. Below is a gif of running the algorithm for 5000 training steps (run 1 described in the table of the previous section). Anecdotally, it seemed that about 3500 training steps would have sufficed for a similar result.
Finally changing the amount of noise present in the input image can significantly alter the result. An example of "tripling" the amount of noise from one run to another is given below.
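In terms of the sketch from the previous section, this just means scaling the (illustrative) `noise_ratio` parameter of the `blend_with_noise` helper, for example:

```python
# lower- vs. "tripled"-noise starting points (values purely illustrative)
start_low = blend_with_noise(content, noise_ratio=0.2)
start_high = blend_with_noise(content, noise_ratio=0.6)
```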
- Chollet, Deep learning with Python, second edition, publisher's website and GitHub repository
- Chollet, Neural style transfer, Keras.io GitHub examples repository
- Chollet, Transfer learning and fine-tuning, Keras website tutorial
- Gatys, Ecker, Bethge, Image style transfer using convolutional neural networks, link to PDF, link to arXiv
- Johnson, Alahi, Fei-Fei, Perceptual losses for real-time style transfer and super-resolution, link to arXiv, publication and supplementary material
- Kingma, Ba, Adam: a method for stochastic optimization, link to arXiv
- Log0, Neural style painting, GitHub repository
- Narayanan, Convolutional neural networks for artistic style transfer, website and GitHub repository
- Simonyan, Zisserman, Very deep convolutional networks for large-scale image recognition, link to arXiv
[^1]: For the sake of reproducibility I am using the same source image as in (Section 12.3 of) the book by François Chollet, Deep learning with Python, Manning Press, second edition. It is one of the standard texts on deep learning algorithms with Python and Tensorflow/Keras, and the only one I know of discussing the techniques in this report.

[^2]: If the gif is not animated, please click on it.