A Convolutional Recurrent Neural Network (CRNN) trained on 332 annotated songs is used to make binary predictions of whether a meter in a song belongs to a chorus or not, achieving an F1 score of 0.876 (Precision: 0.884, Recall: 0.869) on an unseen testset of 50 songs. This repository contains various Python notebooks, scripts, and resources that support the end-to-end data science process for chorus detection, including data collection, exploratory data analysis, digital signal processing techniques, and modeling. Additionally, it includes a user-friendly command-line tool that leverages the audio processing pipeline and pre-trained CRNN model to predict chorus locations in songs from YouTube links.
The dataset consists of 332 manually labeled songs, predominantly from electronic music genres. The data wrangling steps included:
- Audio preprocessing - formatting songs uniformly, processing at consistent sampling rate, trimming silence, extracting metadata using Spotify's API. See Jupyter Notebook
- Manual chorus labeling - label start/end timestamp of choruses, skipping ambiguous songs. More details on the annotation process can be found in the Mixin Annotation Guide
The EDA aimed to uncover insights and patterns to inform model development. First, we examined the distribution of chorus labels across out dataset and how choruses might differ from non-choruses. Audio features pertaining to the spectral, rhythmic, tonal, and energy properties of music were extracted, quantified, and visualized to better understand the characteristics of music that are relevant to the task of chorus detection. Below are examples of audio feature visualizations of a song with 3 choruses (highlighted in green).
More details on the EDA process can be found in the following notebooks:
- EDA Part 1: Label distribution, label validity, A/B testing of feature extraction methods, audio feature visualization
- EDA Part 2: Finding optimal number of components for decomposing audio features, quantifying audio features and comparing between chorus vs. non-chorus
- Extracted features include Root Mean Squared energy, key-invariant chromagrams, Melspectrograms, MFCCs, and tempograms. The latter four features are decomposed using Non-negative Matrix Factorization with their optimal number of components derived during EDA.
- Songs are segmented into timesteps based on musical meters (computed using tempo and time-signature). This introduces an inductive bias to help the CRNN learn more relevant features and patterns.
- Positional encoding is applied to every meter in a song and every audio frame in a meter.
- Songs and labels are uniformly padded and split into train/validation/test sets. Data is processed into batch sizes of 32 using a custom generator.
The CRNN model consists of:
- Three 1D convolutional layers with ReLU and max-pooling to extract local patterns
- A Bidirectional LSTM layer to model long-range temporal dependencies
- A TimeDistributed Dense output layer with sigmoid activation for meter-wise predictions
def create_crnn_model(max_frames_per_meter, max_meters, n_features):
"""
Args:
max_frames_per_meter (int): Maximum number of frames per meter.
max_meters (int): Maximum number of meters.
n_features (int): Number of features per frame.
"""
frame_input = layers.Input(shape=(max_frames_per_meter, n_features))
conv1 = layers.Conv1D(filters=128, kernel_size=3, activation='relu', padding='same')(frame_input)
pool1 = layers.MaxPooling1D(pool_size=2, padding='same')(conv1)
conv2 = layers.Conv1D(filters=256, kernel_size=3, activation='relu', padding='same')(pool1)
pool2 = layers.MaxPooling1D(pool_size=2, padding='same')(conv2)
conv3 = layers.Conv1D(filters=256, kernel_size=3, activation='relu', padding='same')(pool2)
pool3 = layers.MaxPooling1D(pool_size=2, padding='same')(conv3)
frame_features = layers.Flatten()(pool3)
frame_feature_model = Model(inputs=frame_input, outputs=frame_features)
meter_input = layers.Input(shape=(max_meters, max_frames_per_meter, n_features))
time_distributed = layers.TimeDistributed(frame_feature_model)(meter_input)
masking_layer = layers.Masking(mask_value=0.0)(time_distributed)
lstm_out = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(masking_layer)
output = layers.TimeDistributed(layers.Dense(1, activation='sigmoid'))(lstm_out)
model = Model(inputs=meter_input, outputs=output)
model.compile(optimizer='adam', loss=custom_binary_crossentropy, metrics=[custom_accuracy])
return model
- Custom loss and accuracy functions handle padded values
- Callbacks to save best model based on minimal validation loss, reduce learning rate on plateau, and early stopping
- Trained for 50 epochs (stopped early after 18 epochs). Training/Validation Loss and Accuracy plotted below:
The model achieved strong results on the held-out test set as shown in the summary table. Visualizations of the predictions on sample test songs are also provided and can be found in the test_predictions folder.
Metric | Score |
---|---|
Loss | 0.278 |
Accuracy | 0.891 |
Precision | 0.831 |
Recall | 0.900 |
F1 Score | 0.864 |
While the model demonstrates promising results, it's important to note limitations such as its potential biases towards the predominantly electronic music genre in the dataset. Future work could explore the application of semi-supervised learning techniques to leverage unlabeled data, expand the dataset to include a wider variety of genres, and explore alternative architectures or attention mechanisms that could further enhance model performance, generalizeability, and interpretability. More empirical testing is needed to determine whether the hierarchical positional encoding and segmentation techniques are effective.
-
Final Project Write-up: For a more in-depth analysis, see the Final Project Write-up.
-
Data Annotation: Details on the manual song labeling process are in the Mixin Data Annotation Guide.
-
Model Metrics: Key performance metrics for the CRNN model are summarized in the model_metrics.csv.
-
Notebooks:
- Preprocessing: Audio formatting, trimming, metadata extraction
- EDA: Exploratory analysis and visualizations of audio features
- Modeling: CRNN model preprocessing, architecture, training, evaluation
git clone https://github.com/dennisvdang/chorus-detection.git
cd chorus-detection
pip install virtualenv
virtualenv venv
venv\Scripts\activate # for MacOS/Linux use 'source venv/bin/activate'
pip install -r requirements.txt
conda create --name myenv python=3.8
conda activate myenv
python src/chorus_finder.py --url "https://www.youtube.com/watch?v=example"
This project includes a Makefile
to simplify the setup and execution process. Below are the steps to use the Makefile
to run the CLI.
git clone https://github.com/dennisvdang/chorus-detection.git
cd chorus-detection
Run the following command to set up the Python environment and install all necessary dependencies:
make setup
This command will create a virtual environment, activate it, and install the dependencies listed in requirements.txt
.
To run the chorus_finder.py
script using the Makefile
, use the following command, replacing "https://www.youtube.com/watch?v=example"
with the actual YouTube URL:
make run URL="https://www.youtube.com/watch?v=example"
This command will execute the script with the specified URL.
To clean up the environment and remove temporary files, you can use:
make clean