Musical Timbre Transfer using the Pix2Pix architecture. Video documentation for the challenge submission can be found here (in Spanish):
- Introduction
- Quick reference
- Methodology
- Dataset
- Training
- Results
- Conclusion
- Future work
- Acknowledgements
- Contact
- License
The Pix2Pix architecture has proven effective for natural images, and the authors of the original paper claim that it can perform well on the general problem of image-to-image translation. However, synthetic images may present a more challenging use case.
In this work, we use the Pix2Pix architecture for a substantially different application: generating audio, in a fashion similar to the style transfer problem.
Musical timbre transfer consists of obtaining a melody played by a target instrument, given the same melody played by the original instrument. In other words, the process changes the style from one instrument to another while preserving the semantic content of the song.
The best way to understand the problem is to listen to some audio examples.
In particular, the 4 examples in the /docs/examples folder of this repository should give a good idea, and they are the ones used throughout this documentation.
Keyboard acoustic | Guitar acoustic | String acoustic | Synth Lead synthetic |
---|---|---|---|
To make listening easier while viewing this documentation, the reader can access them from the links above, which point to the Google Drive folder where they are stored.
The following table shows one STFT spectrogram frame of the same melody from the above examples, played by the 4 different instruments considered in this work. These images serve as input and output for the Pix2Pix network.
A more detailed explanation about spectrograms can be found in section Methodology.
Keyboard acoustic | Guitar acoustic | String acoustic | Synth Lead synthetic |
The objective of this project is to train a network able to perform image translation between any pair of instruments in this set. For simplicity, the Keyboard is considered the canonical instrument, so the translations presented here have Keyboard as origin and any of the remaining 3 instruments as target.
Clone this repository to your system.
$ git clone https://github.com/hmartelb/Pix2Pix-Timbre-Transfer.git
Make sure that you have Python 3 installed on your system. It is recommended to create a virtual environment to install the dependencies. Open a new terminal in the master directory and install the dependencies from requirements.txt by executing this command:
$ pip install -r requirements.txt
Download the NSynth Dataset and the Classical Music MIDI Dataset.
- The NSynth Dataset, “A large-scale and high-quality dataset of annotated musical notes.” https://magenta.tensorflow.org/datasets/nsynth
- Classical Music MIDI Dataset, from Kaggle: https://www.kaggle.com/soumikrakshit/classical-music-midi
Generate the audios and the features with the following scripts. Optional arguments are displayed in brackets “[ ]”.
$ python synthesize_audios.py --nsynth_path <NSYNTH_PATH>
--midi_path <MIDI_PATH>
--audios_path <AUDIOS_PATH>
[--playback_speed <PLAYBACK_SPEED>]
[--duration_rate <DURATION_RATE>]
[--transpose <TRANSPOSE>]
$ python compute_features.py --audios_path <AUDIOS_PATH>
--features_path <FEATURES_PATH>
Train the Pix2Pix network with the train.py
script, specifying the instrument pair to convert from origin to target, and the path where the dataset is located.
$ python train.py --dataset_path <DATASET_PATH>
--origin <ORIGIN>
--target <TARGET>
[--gpu <GPU>]
[--epochs <EPOCHS>]
[--epoch_offset <EPOCH_OFFSET>]
[--batch_size <BATCH_SIZE>]
[--gen_lr <GENERATOR_LEARNING_RATE>]
[--disc_lr <DISCRIMINATOR_LEARNING_RATE>]
[--validation_split <VALIDATION_SPLIT>]
[--findlr <FINDLR>]
The train_multitarget.py script allows for multitarget training instead of a fixed instrument pair. This means that the same origin can be conditioned on an additional input to obtain different targets. To use it, specify the origin, a list of targets, and the path where the dataset is located.
$ python train_multitarget.py --dataset_path <DATASET_PATH>
--origin <ORIGIN>
--target <LIST_OF_TARGETS>
[--gpu <GPU>]
[--epochs <EPOCHS>]
[--epoch_offset <EPOCH_OFFSET>]
[--batch_size <BATCH_SIZE>]
[--gen_lr <GENERATOR_LEARNING_RATE>]
[--disc_lr <DISCRIMINATOR_LEARNING_RATE>]
[--validation_split <VALIDATION_SPLIT>]
[--findlr <FINDLR>]
It is also possible to train only the generator network with the train_generator.py
script, specifying the instrument pair to convert from origin to target, and the path where the dataset is located.
$ python train_generator.py --dataset_path <DATASET_PATH>
--origin <ORIGIN>
--target <TARGET>
[--gpu <GPU>]
[--epochs <EPOCHS>]
[--epoch_offset <EPOCH_OFFSET>]
[--batch_size <BATCH_SIZE>]
[--lr <LEARNING_RATE>]
[--validation_split <VALIDATION_SPLIT>]
[--findlr <FINDLR>]
The /models
folder of this repository contains the training history and the learning rate search results in separate directories for each instrument pair.
Since the weights of the trained models are too large for the GitHub repository, this alternative link to Google Drive is provided.
Individual models
- keyboard_acoustic_2_guitar_acoustic
- keyboard_acoustic_2_string_acoustic
- keyboard_acoustic_2_synth_lead_synthetic
- keyboard_acoustic_2_any
To use a pretrained model simply run the predict.py
script specifying the path to the trained generator model (i.e. .h5 weights file), the location of the input audio and the name of the output audio.
$ python predict.py --model <GENERATOR_WEIGHTS>
--input <INPUT_AUDIO>
--output <OUTPUT_AUDIO>
Additionally, in the case of a multitarget model the style must be specified. Run the predict_multitarget.py
script instead.
$ python predict_multitarget.py --model <GENERATOR_WEIGHTS>
--input <INPUT_AUDIO>
--style <TARGET_STYLE_AUDIO>
--output <OUTPUT_AUDIO>
The Pix2Pix architecture has been designed for image processing tasks, but in this case the format of the data is audio. Therefore, a preprocessing step to convert a 1D signal (audio) into a 2D signal (image) is required.
Audio applications using Machine Learning typically work better in the frequency domain than in the time domain. If an appropriate time-frequency transform, such as the Short Time Fourier Transform (STFT), is applied to the time-domain signal, the result is a 2D representation called a Spectrogram, where the axes correspond to time (horizontal) and frequency (vertical).
Example of the Keyboard visualized in Adobe Audition.
Top: Time domain (Waveform), Bottom: Frequency domain (Spectrogram, STFT)
The spectrograms are computed from the audios using the librosa.stft()
function with a Hanning window of size 1024 and an overlap of 50% (hop size of 512), which gives a resolution of 513 frequency bins. The Sampling Rate of the input audio is 44.1kHz. These parameters have been found to provide a reasonable time-frequency compromise for this application.
The original 16 kHz sampling rate of the NSynth dataset means the spectrograms contain no content above 8 kHz, according to the Nyquist-Shannon sampling theorem. Since the spectrograms here are computed up to 22.05 kHz (using a 44.1 kHz sampling rate, as in professional audio), it is safe to trim away the upper half of the image corresponding to high frequencies, because there is no content there (i.e. the magnitude is all zeros in this region).
Strictly speaking, the values of the Spectrogram returned by the STFT operation are complex numbers. Therefore, for the network to process the data, it needs to be decomposed further. The magnitude of the signal is the modulus of the Spectrogram, namely np.abs(S), and the phase of the signal is the angle, obtained as np.angle(S).
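As a minimal sketch of this preprocessing step (a hedged example assuming librosa and numpy, not the repository's exact code):

```python
import numpy as np
import librosa

# Load at 44.1 kHz and compute the STFT: 1024-sample Hann window, 50% overlap (hop of 512).
y, sr = librosa.load('input.wav', sr=44100)
S = librosa.stft(y, n_fft=1024, hop_length=512, window='hann')   # shape: (513, n_frames), complex

mag = np.abs(S)      # magnitude: the only component fed to the network
phase = np.angle(S)  # phase: kept aside for the reconstruction step

# NSynth-derived audio has no content above 8 kHz, so the upper half of the
# frequency axis can be discarded before building the network input frames.
mag = mag[:256, :]
```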
The component that carries the most relevant information is the magnitude, and it is the only one passed to the network, as shown in the following diagrams:
The network performs the timbre transfer operation in a fixed instrument pair setting. The task is to learn the differences between a given origin and target and apply them to the input to generate the prediction. This means that the origin and target instruments are always expected to be the same pair, and the input audio is converted from origin to target.
Diagram of the end-to-end audio processing pipeline for a fixed instrument pair.
The STFT and iSTFT correspond to the forward and inverse Short Time Fourier Transforms respectively. The magnitude is processed at the Pix2Pix block, which returns a magnitude estimation as output. The phase is processed at the Phase estimator block, with one of the implementations discussed below.
The proposed setting is similar to the neural style transfer problem. Instead of having a fixed transformation, the network is conditioned to generate any instrument of the user’s choice. This is achieved by passing random notes played by the target instrument as an additional input. The task is not just to perform a predetermined transformation, but to analyze input and target simultaneously to generate the prediction.
Diagram of the end-to-end audio processing pipeline for the conditioned multitarget model.
The Pix2Pix block now receives 2 magnitude inputs and generates a magnitude with the content of the input audio as played by the instrument in the target style audio.
Both magnitude and phase are required to reconstruct the audio from a Spectrogram, so we need to estimate the phase in some way.
Generating flat or random phases does not produce acceptable results, so a more sophisticated phase estimation method is necessary. The following can be implemented in the “Phase estimator” block as possible solutions:
- Griffin-Lim algorithm
- Reconstruction using the input phase (the phase estimator is the identity function, commonly used in audio source separation)
- Use another Pix2Pix network to learn the phase
- Pass magnitude and phase as 2 channels to a single Pix2Pix network
Some authors in the research literature claim that (1) may not converge to an acceptable result for this particular problem [i, ii], and that the proposals in (3) and (4) are error-prone since they will likely produce inconsistent spectrograms that are not invertible into a time-domain signal [ii, iii].
Consequently, (2) has been chosen as the option with the lowest computational cost, the fewest sources of error, and the best perceptual output quality.
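For illustration, a hedged sketch of options (1) and (2), assuming librosa, a predicted magnitude spectrogram mag_pred (restored to the full 513 frequency bins) and the phase of the input audio phase_in:

```python
import numpy as np
import librosa

# Option (2), used in this work: reuse the phase of the input audio (identity phase estimator).
audio_identity = librosa.istft(mag_pred * np.exp(1j * phase_in), hop_length=512, window='hann')

# Option (1), for comparison: estimate the phase from the magnitude alone with Griffin-Lim.
audio_griffinlim = librosa.griffinlim(mag_pred, n_iter=32, hop_length=512, window='hann')
```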
References of this section
Given the description of the problem, the dataset must contain the same audios played by different instruments. Unfortunately, this is very difficult to achieve with human performances because of time alignment, differences in note intensity, or even changes in instrument tuning due to their physical construction.
For this reason, the audios of the dataset have been synthesized from MIDI files to obtain coherent and reliable data from different instruments. By doing this we ensure that the only change between two audios is the timbre, although this has its own limitations.
The dataset has been created using a combination of two publicly available datasets:
- Classical Music MIDI, from Kaggle: https://www.kaggle.com/soumikrakshit/classical-music-midi
- The NSynth Dataset, “A large-scale and high-quality dataset of annotated musical notes”, Magenta Project (Google AI): https://magenta.tensorflow.org/datasets/nsynth
The MAESTRO Dataset contains more than 200 hours of music in MIDI format and can be used to generate an even larger collection of synthesized music. Although the resulting size of the synthesized dataset made it impractical for the scope of this project, the author encourages other researchers with more computing resources to try this option as well.
- The MAESTRO Dataset “MIDI and Audio Edited for Synchronous TRacks and Organization”, Magenta Project (Google AI): https://magenta.tensorflow.org/datasets/maestro
The audios are generated from these 2 datasets by loading the notes from the MIDI file as a sequence of (pitch, velocity, start_time, end_time). Then, the corresponding note from the NSynth dataset is loaded, modified to the note duration, and placed into an audio file. After repeating these two steps for all the notes in the sequence, the piece from the MIDI file is synthesized as illustrated in this diagram:
Audio synthesizer block diagram. The notes from the MIDI file and the notes from NSynth are combined into a synthesized output audio.
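As a rough illustration of this procedure (a simplified sketch, not the actual synthesize_audios.py implementation; it assumes pretty_midi for MIDI parsing and a hypothetical nsynth_note_path() helper that maps an instrument, pitch and velocity to the corresponding NSynth .wav file):

```python
import numpy as np
import librosa
import pretty_midi
import soundfile as sf

SR = 44100

def synthesize(midi_file, instrument, nsynth_note_path, output_file):
    midi = pretty_midi.PrettyMIDI(midi_file)
    out = np.zeros(int(midi.get_end_time() * SR) + SR, dtype=np.float32)
    for note in midi.instruments[0].notes:                         # (pitch, velocity, start, end)
        sample, _ = librosa.load(nsynth_note_path(instrument, note.pitch, note.velocity), sr=SR)
        n = int((note.end - note.start) * SR)                      # adapt the note to its duration
        sample = sample[:n] if len(sample) >= n else np.pad(sample, (0, n - len(sample)))
        start = int(note.start * SR)
        out[start:start + n] += sample                             # place the note in the output audio
    sf.write(output_file, out / max(np.abs(out).max(), 1e-9), SR)  # normalize to avoid clipping
```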
The procedure has been done with all the MIDI files in Classical Music MIDI and with the following instruments from The NSynth Dataset in the note quality 0 (Bright):
- keyboard_acoustic
- guitar_acoustic
- string_acoustic
- synth_lead_synthetic
The Magnitude Spectrograms are converted from the linear domain to the logarithmic domain using the function amplitude_to_db() in the data.py module, inspired by librosa but adapted to avoid zero-valued regions. As a result, the magnitudes are expressed in decibels (dB), and the distribution of the magnitude values is closer to how humans perceive loudness.
The minimum magnitude considered to be greater than zero is amin, expressed as the minimum increment of a 16-bit representation (-96 dB).
amin = 1 / (2**16)                    # smallest non-zero magnitude: one 16-bit step (~ -96 dB)
mag_db = 20 * np.log1p(mag / amin)    # linear magnitude -> logarithmic (dB-like) scale, log1p avoids log(0)
mag_db /= 20 * np.log1p(1 / amin)     # Normalization to [0, 1]
Finally, the range is normalized in [-1,1] instead of [0,1] using the following conversion:
mag_db = mag_db * 2 - 1
To recover the audio, the inverse operations must be performed. Denormalize to [0,1], convert from logarithmic to linear using the function db_to_amplitude()
from data.py
, and then compute the inverse STFT using librosa.istft()
with the magnitude and the phase estimations. The complex spectrogram and the final audio can be obtained from the magnitude and phase as:
S = mag * np.exp(1j * phase)
audio = librosa.istft(S,...)
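Putting the inverse chain together, a minimal sketch (assuming the same amin as above, the network output mag_db and a phase estimate phase; the repository's db_to_amplitude() in data.py is the reference implementation):

```python
import numpy as np
import librosa

amin = 1 / (2**16)

def db_to_amplitude(mag_db):
    # Inverse of the normalized log scaling above: mag_db in [0, 1] -> linear magnitude.
    return amin * np.expm1(mag_db * np.log1p(1 / amin))

mag_db = (mag_db + 1) / 2                      # denormalize from [-1, 1] back to [0, 1]
mag = db_to_amplitude(mag_db)                  # logarithmic (dB) -> linear magnitude
S = mag * np.exp(1j * phase)                   # combine with the estimated phase
audio = librosa.istft(S, hop_length=512, window='hann')
```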
The adversarial networks have been trained on a single GTX 1080 Ti GPU for 100 epochs, using magnitude spectrograms of dimensions (256,256,1), a validation split of 0.1, 22875 examples per instrument pair, the Adam optimizer, and Lambda = 100 as in the original Pix2Pix paper.
After some inconclusive experiments setting the batch size to 1, 2 and 4, the best convergence has been achieved using a batch size of 8. This gives a total of 2859 iterations per epoch.
In the case of the conditioned model the number of training examples is 68625, which gives 8578 iterations per epoch.
The learning rate has been searched using the Learning Rate Finder method mentioned in this blog post from Towards Data Science and this paper.
The search was performed separately for the generator, the discriminator and the joint adversarial system. The best learning rate is not the lowest loss, but the one with the steepest slope. This example shows the results for keyboard_acoustic_2_guitar_acoustic:
Generator MAE | Discriminator loss | Joint GAN loss |
Not only has the learning rate been found to be orders of magnitude lower than expected, but it also differs between the Generator and the Discriminator depending on the instrument pair. The optimal values found with this method are the following:
Origin | Target | Generator LR | Discriminator LR |
---|---|---|---|
keyboard_acoustic | guitar_acoustic | 5e-5 | 5e-6 |
keyboard_acoustic | string_acoustic | 1e-5 | 1e-5 |
keyboard_acoustic | synth_lead_synthetic | 1e-4 | 1e-5 |
keyboard_acoustic | any | 1e-5 | 5e-6 |
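As a rough sketch of the LR range test described above (assuming a compiled Keras model and an iterator of training batches; not the repository's implementation):

```python
import numpy as np
import tensorflow as tf

def lr_range_test(model, batches, lr_min=1e-7, lr_max=1e-1, steps=200):
    """Sweep the learning rate exponentially over training batches and record the loss."""
    lrs = np.geomspace(lr_min, lr_max, steps)
    losses = []
    for lr, (x, y) in zip(lrs, batches):
        tf.keras.backend.set_value(model.optimizer.learning_rate, lr)
        loss = model.train_on_batch(x, y)
        losses.append(float(np.ravel(loss)[0]))
    # Pick the LR on the steepest descending part of the loss curve, not at its minimum.
    return lrs, losses
```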
The training history is displayed below for the 100 training epochs, using all the instrument pairs with keyboard_acoustic as origin.
Generator MAE | Discriminator loss | Joint GAN loss |
---|---|---|
(best = 0.0192, last = 0.0192) | (best = 1.3131, last = 1.3708) | (best = 2.6338, last = 2.6338) |

Generator MAE | Discriminator loss | Joint GAN loss |
---|---|---|
(best = 0.0553, last = 0.0553) | (best = 0.6853, last = 1.0921) | (best = 6.4461, last = 6.5735) |

Generator MAE | Discriminator loss | Joint GAN loss |
---|---|---|
(best = 0.0222, last = 0.0225) | (best = 1.3097, last = 1.3426) | (best = 2.9503, last = 2.9925) |

Generator MAE | Discriminator loss | Joint GAN loss |
---|---|---|
(best = 0.0437, last = 0.0437) | (best = 1.0173, last = 1.2794) | (best = 5.1990, last = 5.2048) |
The numeric value of the loss can serve as a performance metric during training, but the most important part of this applied work is to observe the results subjectively. This section showcases the results both visually and with audios.
At the end of every training epoch the same audio file has been used to generate a spectrogram frame and the corresponding audio with the target timbre.
Input spectrogram | Prediction over 100 epochs | True target |
Input spectrogram | Prediction over 100 epochs | True target |
Input spectrogram | Prediction over 100 epochs | True target |
Input spectrogram | Prediction over 100 epochs | True target |
The following tables are provided to listen to the results one by one. The results comprise 6 samples, presented in terms of Input, Output and Target: 4 samples are from the Training set and 2 from the Validation set.
NOTE: It is highly recommended to download the test_results folder from this Google Drive link to get all the audios at once.
Sample name | Set | Input | Output | Target |
---|---|---|---|---|
appass_3.wav | Training | Listen | Listen | Listen |
burg_gewitter.wav | Training | Listen | Listen | Listen |
debussy_cc_1.wav | Training | Listen | Listen | Listen |
mond_3.wav | Training | Listen | Listen | Listen |
schuim-3.wav | Validation | Listen | Listen | Listen |
ty_august.wav | Validation | Listen | Listen | Listen |
Sample name | Set | Input | Output | Target |
---|---|---|---|---|
appass_3.wav | Training | Listen | Listen | Listen |
burg_gewitter.wav | Training | Listen | Listen | Listen |
debussy_cc_1.wav | Training | Listen | Listen | Listen |
mond_3.wav | Training | Listen | Listen | Listen |
schuim-3.wav | Validation | Listen | Listen | Listen |
ty_august.wav | Validation | Listen | Listen | Listen |
Sample name | Set | Input | Output | Target |
---|---|---|---|---|
appass_3.wav | Training | Listen | Listen | Listen |
burg_gewitter.wav | Training | Listen | Listen | Listen |
debussy_cc_1.wav | Training | Listen | Listen | Listen |
mond_3.wav | Training | Listen | Listen | Listen |
schuim-3.wav | Validation | Listen | Listen | Listen |
ty_august.wav | Validation | Listen | Listen | Listen |
Sample name | Set | Input | Output (guitar), Output (string), Output (synth_lead) | Target (guitar), Target (string), Target (synth_lead) |
---|---|---|---|---|
appass_3.wav | Training | Listen | Listen, Listen, Listen | Listen, Listen, Listen |
burg_gewitter.wav | Training | Listen | Listen, Listen, Listen | Listen, Listen, Listen |
debussy_cc_1.wav | Training | Listen | Listen, Listen, Listen | Listen, Listen, Listen |
mond_3.wav | Training | Listen | Listen, Listen, Listen | Listen, Listen, Listen |
schuim-3.wav | Validation | Listen | Listen, Listen, Listen | Listen, Listen, Listen |
ty_august.wav | Validation | Listen | Listen, Listen, Listen | Listen, Listen, Listen |
NOTE: Again, it is highly recommended to download the real_world_results folder from this Google Drive link to get all the audios at once.
Since the audios presented here are real-world recordings, there is no target audio to compare the output with. Therefore, this section is intended entirely for subjective evaluation. In this case, the entire piece has been exported instead of 10-second excerpts, to allow the listener to check the audio quality in different parts of the piece.
Sample name | Length | Input | Output (guitar) | Output (string) | Output (synth_lead) |
---|---|---|---|---|---|
Chopin - Nocturne.mp3 | 04:13 | Listen | Listen | Listen | Listen |
Pachelbel Canon in D - Solo Piano.mp3 | 03:22 | Listen | Listen | Listen | Listen |
Sunrise In Meteora - Piano Solo.mp3 | 03:52 | Listen | Listen | Listen | Listen |
Sweet Memories - Piano Solo.mp3 | 03:34 | Listen | Listen | Listen | Listen |
The system presented in this work can address the Timbre Transfer problem and achieve reasonable results. However, the system clearly has limitations, and the results are still far from being usable in a professional music production environment. In this section, the Results presented above are discussed.
The audios from the dataset are analyzed here according to the intermediate and final results. For this, epochs 1, 50 and 100 are considered.
The first training epochs sound like an interpolation between the original instrument and the target, but present very noticeable distortion that is unpleasant for the listener. This is especially true in the case of keyboard_acoustic_2_string_acoustic.
By epoch 50, the output sounds similar to the target, although notable artifacts remain. The models have learned characteristics of the target instrument that are present in the output. For example, the model keyboard_acoustic_2_guitar_acoustic introduces the sound of the guitar strings even in places where the original keyboard melody does not produce such noises. Also, in the case of keyboard_acoustic_2_synth_lead_synthetic, the output sustains the high-frequency harmonics for a longer period of time than in the original melody.
The changes between epoch 50 and epoch 100 are minimal, but their perceptual implications are noticeable. In all cases, some of the undesired noises and distortions from earlier training epochs are reduced. As a result, the output sounds perceptually more natural, but still presents significant artifacts when compared to the ground-truth target.
Due to the limited instrument diversity during training (only one type of keyboard), a subjective listening test on the real-world audios reveals that even the same instrument recorded in a different setting can present a challenging scenario. That being said, the networks generalize well enough to perform the Timbre Transfer operation from real-world pianos to the 3 instruments considered in this work.
Artifacts appear especially when multiple notes are played at once and when there are sudden intensity changes. In general, the target instrument can be recognized, and the audio quality is reasonably similar to that of the output audios from the dataset.
Further research applying one or more proposals from the section Future work may be required to refine the results.
There are some aspects of this work which have a considerable margin for improvement with further research. In this section, the author’s intention is to highlight some of the main lines to be followed in the future in the hope that they will help the research in this field.
As mentioned in the section Dataset, the single notes contained in the NSynth Dataset were used to synthesize the audios. In particular, the entire training has been performed using the note quality 0 (bright) of each instrument pair. However, it may be interesting to experiment with other note qualities.
Another way could be to create a custom dataset with the same structure as NSynth using a SoundFont synthesizer for each new instrument (.sf, .sf2 files).
Generate different versions of the audios by changing synthesis parameters, transposition, tempo, note length, etc. or applying audio effects used in professional audio productions such as Reverb, EQ or Delay.
Alternatively, consider using the MAESTRO Dataset as mentioned in the section Dataset if you have more time and resources for your research.
The scope of this project has been limited to explore 3 instrument pairs, having only one pair fixed for each model. In other words, the model converts a specific origin into a specific target and cannot perform the timbre transfer operation properly if the origin or target instruments change.
The any_2_any architecture would aim to be instrument-independent and achieve comparable performance in the Timbre Transfer problem.
I would like to thank Carlos from the YouTube channel DotCSV for organizing the Pix2Pix challenge and creating a tutorial on how to implement and train this architecture. The code and the tutorial were used as a starting point and adapted to the needs of this problem. Also, special mention to NVIDIA Corporation for providing the prize as a sponsorship for the challenge.
The challenge has been a major motivation to do this research on the topic of Timbre Transfer and to develop this code. Regardless of the outcome of the challenge, I hope this work will be helpful in some way for further Machine Listening research.
Finally, thank you to various faculty members from the Music Technology Group (MTG) at Universitat Pompeu Fabra in Barcelona (Spain) for their valuable feedback and continuous support to my career since I was an undergraduate student there.
Please do not hesitate to reach out to me if you find any issue with the code or if you have any questions.
- Personal email: [email protected]
- LinkedIn profile: https://www.linkedin.com/in/hmartelb/
MIT License
Copyright (c) 2019 Héctor Martel
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.