This repository contains the implementation of the UGLF model, a deep learning model for the action spotting task on the SoccerNet-v2 dataset. Investigating much of the current research, we found that most works focus only on the global feature (the whole frame) without considering local features (objects). From that insight, we propose UGLF, which unifies the global and local features.
Our proposed model
You can download the SoccerNet-v2 dataset from the official repository of the challenge after signing the NDA form.
To prepare the required libraries, you can either install them in your local environment or use a virtual environment with conda. Run the following command to install all the needed libraries:
pip install -r requirements.txt
In the repository, you can use the code in /downloader
to download the data.
Specifically, you will receive the password after signing the NDA; substitute it into the following command:
python3 download.py --password <password> \
--directory <download_path> \
--low_quality
You can replace the --low_quality
flag with one of the following options (an example follows the list):
- label: Download labels
- baidu: Download baidu feature
- high_quality: Download video 720p
- low_quality: Download video 224p
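For instance, to download only the labels to a local folder (the directory path below is just a placeholder), you could run:
python3 download.py --password <password> \
--directory "/data/soccernet" \
--label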
From the downloaded videos, you need to use the frames_as_jpg_soccernet
script to extract frames:
python frames_as_jpg_soccernet.py <video_dir> \
--out_dir <output_dir>
By default, it extracts frames at 2 fps. To change the sampling rate or the number of workers, use:
python frames_as_jpg_soccernet.py <video_dir> \
--out_dir <output_dir> \
--sample_fps <fps> \
--num_workers <n_workers>
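For example, a run that keeps the default 2 fps but uses 8 workers could look like this (the input and output directories are placeholders):
python frames_as_jpg_soccernet.py "/data/soccernet_videos" \
--out_dir "/data/soccernet_720p_2fps" \
--sample_fps 2 \
--num_workers 8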
Before training the models, run the parse_soccernet
script to convert the labels to the appropriate format:
python parse_soccernet.py <label_dir> \
<frame_dir> \
--out_dir <out_dir>
As a result, the parser script generates frame-level labels for each split. For instance, the output may look like:
[
{
"events": [
{
"comment": "away; visible",
"frame": 5509,
"label": "Foul"
},
{
"comment": "home; visible",
"frame": 5598,
"label": "Indirect free-kick"
}
],
"fps": 2.0833333333333335,
"height": 224,
"num_events": 65,
"num_frames": 5625,
"video": "england_epl/2014-2015/2015-05-17 - 18-00 Manchester United 1 - 1 Arsenal/1",
"width": 398
},
...
]
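If you want to sanity-check the generated labels, a minimal Python sketch like the following can summarize them (the file name labels.json is a placeholder; point it at whichever file the parser writes for your split):
import json
from collections import Counter

# Load the per-frame label file produced by parse_soccernet.py (assumed file name).
with open("labels.json") as f:
    videos = json.load(f)

# Count how many annotated events each class has across all videos.
counts = Counter(event["label"] for video in videos for event in video["events"])

print(f"{len(videos)} videos, {sum(counts.values())} events")
for label, num in counts.most_common():
    print(f"{label}: {num}")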
To train the model, please use GPUs to accelerate the training process. Follow the script below, replacing these parameters:
- feature_architecture: the global context feature extractor
  - ResNet
  - RegNet-Y
  - ConvNeXt
- temporal_architecture: the temporal reasoning module
  - GRU
  - AS-Former
  - Transformer encoder
- label_type: the label encoding type and loss function
  - integer: integer encoding (when not using mixup) with cross-entropy loss
  - one-hot: one-hot encoding with focal loss
export CUDA_VISIBLE_DEVICES=<list_of_gpu_ids>
python3 train_e2e.py <dataset_name> \
<frame_dir> \
--save_dir <save_dir> \
--feature_arch <feature_architecture> \
--temporal_arch <temporal_architecture> \
--glip_dir <local_feature_dir> \
--learning_rate <learning_rate> \
--num_epochs <n_epochs> \
--start_val_epoch <start_validate_epoch> \
--batch_size <batch_size> \
--clip_len <snippet_length> \
--crop_dim <crop_dimension> \
--label_type <label_type> \
--num_workers <n_workers> \
--mixup \
--gpu_parallel
Here is an example:
export CUDA_VISIBLE_DEVICES=1,2
python3 train_e2e.py "soccernet_dataset" \
"/data/soccernet_720p_2fps" \
--save_dir "results/800MF_GRU_GSM_FOCAL_GLIP" \
--glip_dir "/ext_drive/data/glip_feat" \
--feature_arch "rny008_gsm" \
--temporal_arch "gru" \
--learning_rate 1e-3 \
--num_epochs 150 \
--start_val_epoch 149 \
--warm_up_epochs 3 \
--batch_size 8 \
--clip_len 100 \
--crop_dim -1 \
--label_type "one-hot" \
--num_workers 4 \
--mixup \
--gpu_parallel
After training the model, you can use it to run inference on the other splits of the dataset.
In addition, unlike the original E2E-Spot script, we add a recall_thresh
argument to tune the high-recall filter threshold.
Use the following command to run inference:
export CUDA_VISIBLE_DEVICES=<list_of_gpu_ids>
python3 test_e2e.py <save_dir> \
<frame_dir> \
--glip_dir <local_feature_dir> \
--split <data_split> \
--recall_thresh <recall_threshold> \
--criterion_key "val" \
--save
For SoccerNet-v2, you can choose one of these four splits (an example invocation is given after the list):
- train
- val
- test
- challenge
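As an example, the following sketch runs inference on the test split using the save directory from the training example above (the paths are placeholders and the recall threshold of 0.01 is only an illustrative value, not a recommended setting):
export CUDA_VISIBLE_DEVICES=1,2
python3 test_e2e.py "results/800MF_GRU_GSM_FOCAL_GLIP" \
"/data/soccernet_720p_2fps" \
--glip_dir "/ext_drive/data/glip_feat" \
--split "test" \
--recall_thresh 0.01 \
--criterion_key "val" \
--save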
If you need to run post-processing (NMS) and evaluation (on all splits except the challenge set), use the eval_soccernetv2.py
script:
python3 eval_soccernetv2.py <output_file> \
--split <data_split> \
--eval_dir <output_dir> \
--soccernet_path <label_path> \
--nms_window <nms_window> \
--filter_score <filter_score> \
--allow_remove
We have added two arguments: filter_score,
which removes every prediction whose confidence (score) is below the provided threshold, and --allow_remove,
which automatically removes the output folder if it already exists.
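A concrete evaluation on the test split might look like the following (the NMS window and filter score are illustrative values you should tune, and the paths are placeholders):
python3 eval_soccernetv2.py <output_file> \
--split "test" \
--eval_dir "results/eval_test" \
--soccernet_path "/data/soccernet_labels" \
--nms_window 25 \
--filter_score 0.2 \
--allow_remove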
Regarding the challenge set, please submit your predictions to the eval.ai challenge.
To monitor the training process, you can use the loss_visualize.py
script to generate a training curve from the loss.json
file produced during training.
python3 loss_visualize.py --input <loss_file> \
--output <output_image_file>
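For example, assuming loss.json is written inside the training save directory used above (an assumption about where the training script stores it):
python3 loss_visualize.py --input "results/800MF_GRU_GSM_FOCAL_GLIP/loss.json" \
--output "results/800MF_GRU_GSM_FOCAL_GLIP/loss_curve.png"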
A single model may not perform well on all 17 classes, so you may sometimes want to merge the predictions of multiple models. To do so, use the merge_prediction
script as follows:
python3 merge_prediction.py <first_prediction_dir> \
<second_prediction_dir> \
<output_dir> \
--either <list_of_either_class> \
--both <list_of_both_class> \
--first <list_of_first_class> \
--second <list_of_second_class>
For example, to keep the card predictions
from the second model
and the penalty predictions
from either model:
python3 merge_prediction.py "prediction_1.json" \
"prediction_2.json" \
"prediction_merge.json" \
--either "Penalty" \
--second "Red card,Yellow card,Yellow->red card"
To analyze the predictions, you can use the view
script.
You can also pass the --nms
flag to run NMS with a score filter threshold of 0.2.
python view.py <data_name> \
<prediction_folder> \
<frame_folder> \
--nms
As a result, a website will be hosted at localhost:8000.
Given a video, you can use the visualize_result
tool to watch the video and select the event you want to navigate to.
First, place the prediction file in the same folder as the video:
|- match_name
|- 1_720.mkv
|- 2_720.mkv
|- results_spotting.json
We recommend using Anaconda to create a virtual environment for this application:
conda create -n annotation python=3.8
conda activate annotation
pip install --upgrade pip
pip install pyqt5
Then, run the application with:
cd visualize_result/src
python3 main.py
By combining our UGLF model with the E2E-Spot model, we achieve the top-1 result on the SoccerNet-v2 dataset:
| Method | Test (Tight) | Test (Loose) | Challenge (Tight) | Challenge (Loose) |
|---|---|---|---|---|
| CALF | - | - | 15.33 | 42.22 |
| CALF-calib | - | 46.80 | 15.83 | 46.39 |
| RMS-Net | 28.83 | 63.49 | 27.69 | 60.92 |
| NetVLAD++ | - | - | 43.99 | 74.63 |
| Zhou et al. | 47.05 | 73.77 | 49.56 | 74.84 |
| Soares et al. | 65.07 | 78.59 | 67.81* | 78.05* |
| E2E-Spot (baseline) | 61.82 | 74.05 | 66.73* | 73.26* |
| UGLF-Combine (ours) | 62.49 | 73.98 | 69.38* | 76.14* |
The project is implemented by:
Under the guidance of our mentors:
We also gratefully thank the following public research works, which have supported our implementation:
- Spotting Temporally Precise, Fine-Grained Events in Video
- Grounded Language-Image Pre-training
- AOE-Net
- SoccerNet
UNDER REVIEW