UGLF - Unifying global local features

Introduction

This repository contains the implementation of the UGLF model, a deep learning model for the action spotting task on the SoccerNet-v2 dataset. By investigating many recent works, we found that most of them focus only on the global feature (the whole frame) without considering local features (objects). From that insight, we propose UGLF, which unifies the global and local features.

Gallery

Our proposed model

Dataset

You can download the SoccerNet-v2 dataset from the official repository of the challenge after signing the NDA form.

Usage

Libraries installation

To install the required libraries, you can either install them in your local environment or use a virtual environment with conda. Run the following command to install all the needed libraries:

pip install -r requirements.txt
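
If you prefer conda, a minimal setup could look like the following sketch (the environment name uglf is arbitrary, and Python 3.8 matches the version used for the visualizer later in this README):

conda create -n uglf python=3.8
conda activate uglf
pip install --upgrade pip
pip install -r requirements.txt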

Data downloader

In the repository, you can use the code in /downloader to download the data. In particular, you will receive a password after signing the NDA; replace it in the following command:

python3 download.py --password <password> \
    --directory <download_path> \
    --low_quality

You can replace the --low_quality flag with one of the following (an example is shown after the list):

  • label: Download labels
  • baidu: Download baidu feature
  • high_quality: Download video 720p
  • low_quality: Download video 224p
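
For example, to download only the labels, substitute that flag (this sketch assumes the flags above are passed in the same way as --low_quality):

python3 download.py --password <password> \
    --directory <download_path> \
    --label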

Frame extracting

From the downloaded videos, you need to use the frames_as_jpg_soccernet script to extract frames:

python frames_as_jpg_soccernet.py <video_dir> \
    --out_dir <output_dir>

By default, it extracts the videos at 2 FPS and uses cpu_count / 4 workers. If you need to tune these values, use the following command:

python frames_as_jpg_soccernet.py <video_dir> \
    --out_dir <output_dir> \
    --sample_fps <fps> \
    --num_workers <n_workers>
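
For example, to extract frames at the default 2 FPS with 8 workers (the paths and worker count here are illustrative):

python frames_as_jpg_soccernet.py "/data/soccernet_videos" \
    --out_dir "/data/soccernet_720p_2fps" \
    --sample_fps 2 \
    --num_workers 8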

Parse labels

Before training the models, run the parse_soccernet script to convert the labels to the appropriate format:

python parse_soccernet.py <label_dir> \
	<frame_dir> \
	--out_dir <out_dir>

As a result, the parser script will generate frame-level labels for each dataset split. For instance, the output may look like:

[
    {
        "events": [
            {
                "comment": "away; visible",
                "frame": 5509,
                "label": "Foul"
            },
            {
                "comment": "home; visible",
                "frame": 5598,
                "label": "Indirect free-kick"
            }
        ],
        "fps": 2.0833333333333335,
        "height": 224,
        "num_events": 65,
        "num_frames": 5625,
        "video": "england_epl/2014-2015/2015-05-17 - 18-00 Manchester United 1 - 1 Arsenal/1",
        "width": 398
    },
    ...
]
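
To quickly sanity-check the parsed labels, a tool such as jq can summarize an entry (the file name train.json is only a guess; use whichever JSON file the parser actually writes to <out_dir>):

jq '.[0] | {video, num_frames, num_events}' <out_dir>/train.json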

Train model

To train the model, we recommend using GPUs to accelerate the training process. Follow the script below, replacing the parameters:

  • feature_architecture: The global context feature extractor
    • ResNet
    • RegNet-Y
    • ConvNeXt
  • temporal_architecture: The temporal reasoning module
    • GRU
    • ASFormer
    • Transformer encoder
  • label_type: The label encoding type and loss function
    • integer: Integer encoding with cross-entropy loss (when not using mixup)
    • one-hot: One-hot encoding with focal loss

export CUDA_VISIBLE_DEVICES=<list_of_gpu_ids>

python3 train_e2e.py <dataset_name> \
	<frame_dir> \
	--save_dir <save_dir> \
	--feature_arch <feature_architecture> \
	--temporal_arch <temporal_architecture> \
	--glip_dir <local_feature_dir> \
	--learning_rate <learning_rate> \
	--num_epochs <n_epochs> \
	--start_val_epoch <start_validate_epoch> \
	--batch_size <batch_size> \
	--clip_len <snippet_length> \
	--crop_dim <crop_dimension> \
	--label_type <label_type> \
	--num_workers <n_workers> \
	--mixup \
	--gpu_parallel

Here is an example:

export CUDA_VISIBLE_DEVICES=1,2
python3 train_e2e.py "soccernet_dataset" \
	"/data/soccernet_720p_2fps" \
	--save_dir "results/800MF_GRU_GSM_FOCAL_GLIP" \
	--glip_dir "/ext_drive/data/glip_feat" \
	--feature_arch "rny008_gsm" \
	--temporal_arch "gru" \
	--learning_rate 1e-3 \
	--num_epochs 150 \
	--start_val_epoch 149 \
	--warm_up_epochs 3 \
	--batch_size 8 \
	--clip_len 100 \
	--crop_dim -1 \
	--label_type "one-hot" \
	--num_workers 4 \
	--mixup \
    --gpu_parallel

Test model

After training the model, you can use it to run inference on the other splits of the dataset. Unlike the original E2E-Spot script, we have added a recall_thresh argument to tune the high-recall filter threshold. Use the following command to run inference:

export CUDA_VISIBLE_DEVICES=<list_of_gpu_ids>

python3 test_e2e.py <save_dir> \
	<frame_dir> \
    --glip_dir <local_feature_dir> \
	--split <data_split> \
	--recall_thresh <recall_threshold> \
	--criterion_key "val" \
	--save

From SoccerNet-v2, you can choose one of these four splits (an example command follows the list):

  • train
  • val
  • test
  • challenge
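
For example, to run inference on the test split with the model trained above (the recall threshold here is illustrative):

export CUDA_VISIBLE_DEVICES=1,2

python3 test_e2e.py "results/800MF_GRU_GSM_FOCAL_GLIP" \
    "/data/soccernet_720p_2fps" \
    --glip_dir "/ext_drive/data/glip_feat" \
    --split "test" \
    --recall_thresh 0.01 \
    --criterion_key "val" \
    --save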

Post-processing & evaluate

If you need to run post-processing (NMS) and evaluation (excluding the challenge set), use the eval_soccernetv2.py script:

python3 eval_soccernetv2.py <output_file> \
	--split <data_split> \
	--eval_dir <output_dir> \
	--soccernet_path <label_path> \
	--nms_window <nms_window> \
	--filter_score <filter_score> \
	--allow_remove

We have added two arguments: filter_score, which filters out all predictions whose confidence (score) is below the provided threshold, and --allow_remove, which automatically removes the output folder if it already exists. An example invocation is shown below.
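
For instance, evaluating the test-split predictions might look like this (the prediction file name, label path, and threshold values are illustrative; check your save directory for the actual file produced by test_e2e.py --save):

python3 eval_soccernetv2.py "results/800MF_GRU_GSM_FOCAL_GLIP/pred-test.json" \
    --split "test" \
    --eval_dir "results/800MF_GRU_GSM_FOCAL_GLIP/eval" \
    --soccernet_path "/path/to/soccernet_labels" \
    --nms_window 2 \
    --filter_score 0.01 \
    --allow_remove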

Regarding the challenge set, please submit your predictions on the eval.ai challenge.

Loss visualization

To monitor the training process, you can use the loss_visualize.py script to generate a training curve from the loss.json file produced during training.

python3 loss_visualize.py --input <loss_file> \
    --output <output_image_file> 
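
For example, assuming the training run above wrote loss.json into its save directory (the exact location may differ):

python3 loss_visualize.py --input "results/800MF_GRU_GSM_FOCAL_GLIP/loss.json" \
    --output "results/800MF_GRU_GSM_FOCAL_GLIP/loss.png"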

Merge predictions

A single model may not perform well on all 17 classes. Sometimes, you may want to merge the predictions from multiple models. To do so, use the merge_prediction script as follows:

python3 merge_prediction.py <first_prediction_dir> \
    <second_prediction_dir> \
    <output_dir> \
    --either <list_of_either_class> \
    --both <list_of_both_class> \
    --first <list_of_first_class> \
    --second <list_of_second_class>

For example, to keep the card predictions from the second model and the penalty predictions from both models:

python3 merge_prediction.py "prediction_1.json" \
    "prediction_2.json" \
    "prediction_merge.json" \
    --either "Penalty" \
    --second "Red card,Yellow card,Yellow->red card"

Prediction analysis

To analyze the predictions, you can use the view script. You can also pass the --nms flag to run NMS with a score filter threshold of 0.2.

python view.py <data_name> \
  <prediction_folder> \
  <frame_folder> \
  --nms

As a result, a website will be hosted at localhost:8000.


Prediction video visualization

With a given video, you can use the visualize_result tool to watch the video and select the event you want to navigate to. First, place the prediction in the same folder as the video:

|- match_name
   |- 1_720.mkv
   |- 2_720.mkv
   |- results_spotting.json

We recommend using Anaconda to create a virtual environment for this application:

conda create -n annotation python=3.8
conda activate annotation
pip install --upgrade pip
pip install pyqt5

Then, run the application with:

cd visualize_result/src
python3 main.py

Prediction visualize

Result

By combining our UGLF model with the E2E-Spot model, we achieve the top-1 result on the SoccerNet-v2 dataset:

Method                 Test set            Challenge set
                       Tight     Loose     Tight     Loose
CALF                   -         -         15.33     42.22
CALF-calib             -         46.80     15.83     46.39
RMS-Net                28.83     63.49     27.69     60.92
NetVLAD++              -         -         43.99     74.63
Zhou et al.            47.05     73.77     49.56     74.84
Soares et al.          65.07     78.59     67.81*    78.05*
E2E-Spot (baseline)    61.82     74.05     66.73*    73.26*
UGLF-Combine (ours)    62.49     73.98     69.38*    76.14*

Contribution

The project is implemented by:

Under the guidance of our mentors:

We would also like to gratefully thank these public research works, which supported our implementation:

Citation

UNDER REVIEW
