
Project : Mobile Agent For Smart Home Using Multimodal Learning <2023-11-13>

Project Field : Multimodal Learning, ViT, Transformer, Image captioning

Project Description :

This project aims to develop a multimodal model that can be embedded in robots designed to detect emergency situations for elderly individuals living alone. By combining a Vision Transformer with image captioning, we intend to address cases that previously could not be detected. According to previous studies, "it was impossible to determine whether a person lying on the floor in a still image had fallen or was simply lying down." This difficulty shows up in our single-modality results: after training on one modality at a time, the test accuracy was only 0.70 for ViT (images) and 0.875 for BERT (captions). Compared to a single modality, a vision-language model trained on the dataset combining both modalities performs significantly better, with a test accuracy of about 0.92.


Methodologies

1. BERT Implementation BERT_MODEL_CODE

This project used a general pretrained BERT model and employed the TensorFlow Keras library for model training and dataset management.
Each sentence is tokenized, and the resulting token embeddings are combined with positional embeddings before entering the multi-head attention layers.
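The actual training code is in the linked BERT_MODEL_CODE; as a rough, self-contained sketch of the tokenization and pooled-output step described above, the snippet below uses the Hugging Face transformers TensorFlow classes. The checkpoint name bert-base-uncased, the example caption, and the 32-token padding length are illustrative assumptions, not values taken from the project.

```python
from transformers import BertTokenizer, TFBertModel  # assumed Hugging Face TF classes

# Hypothetical caption describing a still frame (not taken from the project data).
caption = "A man is lying on the ground in the hallway."

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = TFBertModel.from_pretrained("bert-base-uncased")

# Tokenize: WordPiece tokens -> input IDs, attention mask, token type IDs.
inputs = tokenizer(caption, padding="max_length", truncation=True,
                   max_length=32, return_tensors="tf")

# Inside the model, token embeddings plus positional embeddings feed the
# multi-head attention stack; pooler_output summarizes the [CLS] token.
outputs = bert(**inputs)
pooled = outputs.pooler_output  # shape: (1, 768)
print(pooled.shape)
```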

The token “man” strongly attends to “hallway”, and “lying” similarly focuses on “ground”.
This highlights how the model prioritizes certain tokens over others, revealing how information is centralized around specific words.
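One way to reproduce this kind of token-to-token attention inspection is to request the attention weights from the model. The sketch below is an illustration under the same assumptions as above (hypothetical checkpoint and caption), not the project's actual visualization code.

```python
import tensorflow as tf
from transformers import BertTokenizer, TFBertModel  # assumed Hugging Face TF classes

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = TFBertModel.from_pretrained("bert-base-uncased")

# Same hypothetical caption as above.
inputs = tokenizer("A man is lying on the ground in the hallway.", return_tensors="tf")

# output_attentions=True returns one (batch, heads, seq_len, seq_len) tensor per layer.
outputs = bert(**inputs, output_attentions=True)
attn = tf.reduce_mean(outputs.attentions[-1], axis=1)[0]  # average heads -> (seq_len, seq_len)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].numpy().tolist())
query = tokens.index("lying")
top = tf.argsort(attn[query], direction="DESCENDING")[:3]
print([tokens[i] for i in top.numpy()])  # tokens that "lying" attends to most strongly
```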


2. Transformer Background Study (24.03.10 - 24.03.13)



3. Vision Transformer Architecture (ViT) Implementation (23.03.27 - 23.04.02) ViT_MODEL_CODE

ViT Mechanism

Similar to the way BERT handles word tokens, image patches are processed as individual image tokens. Each patch token passes through a linear projection and is then augmented with positional embeddings.
This step enables the Transformer encoder to process each patch. The self-attention mechanism within each Transformer block allows the model to prioritize and weigh the patches independently of their spatial positions, capturing global dependencies within the image.
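A minimal Keras sketch of this patch-embedding step (linear projection plus learned positional embeddings) is shown below. The projection dimension of 64 and the layer structure follow the Keras ViT tutorial listed under Code Reference and are assumptions rather than the project's exact settings.

```python
import tensorflow as tf
from tensorflow import keras

class PatchEmbedding(keras.layers.Layer):
    """Project flattened patches linearly and add learned positional embeddings."""

    def __init__(self, num_patches=196, projection_dim=64, **kwargs):
        super().__init__(**kwargs)
        self.num_patches = num_patches
        self.projection = keras.layers.Dense(projection_dim)  # linear projection per patch
        self.position_embedding = keras.layers.Embedding(
            input_dim=num_patches, output_dim=projection_dim)  # one embedding per position

    def call(self, patches):
        positions = tf.range(start=0, limit=self.num_patches, delta=1)
        return self.projection(patches) + self.position_embedding(positions)

# Usage: 196 flattened 16x16x3 patches per image -> 196 projected tokens per image.
dummy_patches = tf.random.uniform([1, 196, 16 * 16 * 3])
print(PatchEmbedding()(dummy_patches).shape)  # (1, 196, 64)
```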

Image Splitting

The fundamental mechanism begins with dividing the image into patches of a fixed patch size, here 16x16.
The image is first resized to 224 x 224 and then split into 16x16 patches, giving (224/16)^2 = 196 patches in total.
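As a sketch of this splitting step (assuming a 224x224 RGB input and 16x16 patches, as described above), tf.image.extract_patches can produce the 196 flattened patches:

```python
import tensorflow as tf

PATCH_SIZE = 16   # per the description above
IMAGE_SIZE = 224

def split_into_patches(images):
    """Split a batch of images into flattened 16x16 patches."""
    batch_size = tf.shape(images)[0]
    patches = tf.image.extract_patches(
        images=images,
        sizes=[1, PATCH_SIZE, PATCH_SIZE, 1],
        strides=[1, PATCH_SIZE, PATCH_SIZE, 1],
        rates=[1, 1, 1, 1],
        padding="VALID",
    )
    # (batch, 14, 14, 16*16*3) -> (batch, 196, 768): one row per flattened patch
    return tf.reshape(patches, [batch_size, -1, PATCH_SIZE * PATCH_SIZE * 3])

dummy = tf.random.uniform([1, IMAGE_SIZE, IMAGE_SIZE, 3])
print(split_into_patches(dummy).shape)  # (1, 196, 768)
```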


4. Multimodal Deep Learning Implementation (24.03.14 - 24.04.15) LATE_FUTION_CODE

The Late Fusion Model combines textual and visual inputs along with their labels. The data passes through modality-specific layers: a BERT layer processes the textual data and produces a 768-dimensional pooled feature vector, while a Vision Transformer (ViT) layer processes the visual data and produces a 1000-dimensional output. These two modality-specific feature vectors are then fused for the final classification.
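A minimal sketch of such a fusion head is shown below. Concatenating the two pooled vectors and the sizes of the dense layers are assumptions for illustration, not the project's exact fusion layers; only the 768- and 1000-dimensional inputs and the binary (Fall vs. Sleep) output come from the description above.

```python
from tensorflow import keras

# Inputs are assumed to be the per-sample features from the two pretrained encoders.
text_features = keras.Input(shape=(768,), name="bert_pooled_output")
image_features = keras.Input(shape=(1000,), name="vit_output")

# Late fusion: concatenate the modality-specific vectors, then classify Fall vs. Sleep.
fused = keras.layers.Concatenate()([text_features, image_features])  # 1768-dim
x = keras.layers.Dense(256, activation="relu")(fused)
x = keras.layers.Dropout(0.3)(x)
output = keras.layers.Dense(1, activation="sigmoid", name="fall_probability")(x)

late_fusion = keras.Model([text_features, image_features], output)
late_fusion.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
late_fusion.summary()
```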


Dataset

Image Dataset

  • Train data


Fusion Dataset


Results

Test evaluations can vary significantly from the figures in the table below. The quantity and quality of the dataset are the most crucial factors for any AI model, and since this project was trained and tested with only 400 training samples and 40 test samples, results on other data may differ substantially.

  • Table

Method                    Accuracy (%)
Image (ViT)               70
Text (BERT)               87.4
Late Fusion (BERT & ViT)  92.5
  • Visualization


Reference List

[1] P. S. Sase and S. H. Bhandari, "Human fall detection using depth videos," Department of Computer Science and Engineering, Walchand College of Engineering, Sangli, India.
[2] J. Zhang and C. Zong, "Neural machine translation: Challenges, progress and future," Science China Technological Sciences, vol. 63, no. 10, pp. 2028-2050, Oct. 2020, doi: 10.1007/s11431-020-1632-x.
[3] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems (NeurIPS), 2017. [Online]. Available: https://arxiv.org/abs/1706.03762
[4] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, "An image is worth 16x16 words: Transformers for image recognition at scale," in International Conference on Learning Representations (ICLR), 2021. [Online]. Available: https://arxiv.org/abs/2010.11929
[5] P. Xu, X. Zhu, and D. A. Clifton, "Multimodal learning with transformers: A survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 10, pp. 12113-12132, Oct. 2023, doi: 10.1109/TPAMI.2023.3275156.

Code Reference

https://www.tensorflow.org/tutorials/images/cnn?hl=ko
https://keras.io/examples/vision/image_classification_with_vision_transformer/
https://medium.com/@konstantinos.gyftodimos/vision-transformer-for-binary-classification-of-custom-dataset-hands-on-fdcd162e605e

Dataset Reference

Elderly Set: https://gram.web.uah.es/data/datasets/fpds/index.html


License

This project is released under the Apache License, Version 2.0.
