diff --git a/index.html b/index.html
index 72ac94d..a381044 100644
--- a/index.html
+++ b/index.html
🎓M3AV
A Multimodal, Multigenre, and Multipurpose Audio-Visual Academic Lecture Dataset

Zhe Chen1, Heyang Liu1, Wenyi Yu2, Guangzhi Sun3, Hongcheng Liu1, Ji Wu2, Chao Zhang2✉️, Yu Wang1,4✉️, Yanfeng Wang1,4

1Department of Electronic Engineering, Shanghai Jiao Tong University
2Department of Electronic Engineering, Tsinghua University
3Department of Engineering, University of Cambridge
4Shanghai AI Laboratory

ACL 2024 main conference

Abstract

Publishing open-source academic video recordings is an emergent and prevalent approach to sharing knowledge online. Such videos carry rich multimodal information, including the speech, the facial and body movements of the speakers, the text and pictures in the slides, and possibly even the corresponding papers. Although multiple academic video datasets have been constructed and released, few of them support both multimodal content recognition and understanding tasks, which is partially due to the lack of high-quality human annotations.

In this paper, we propose a novel multimodal, multigenre, and multipurpose audio-visual academic lecture dataset (🎓M3AV), which comprises almost 367 hours of videos from five sources covering computer science, mathematics, medicine, and biology. With high-quality human annotations of the slide text and spoken words, in particular high-value named entities, the dataset can be used for multiple audio-visual recognition and understanding tasks.

Evaluations performed on contextual speech recognition, speech synthesis, and slide and script generation tasks demonstrate that the diversity of 🎓M3AV makes it a challenging dataset.

🎓M3AV Dataset

Overview

The overview of our 🎓M3AV dataset is shown below. The first component is the slides, annotated with simple and complex blocks that are then merged according to a set of rules. The second component is the speech, containing special vocabulary, spoken and written forms, and word-level timestamps. The third component is the paper corresponding to each video. The asterisk (*) denotes that only the computer science videos have corresponding papers.
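To make the three components concrete, the sketch below shows what a single annotation record could look like. It is a hypothetical Python illustration: the field names, values, and file layout are our own assumptions and are not taken from the released 🎓M3AV annotation format.

# Hypothetical single-record sketch; the schema is illustrative,
# not the actual 🎓M3AV release format.
record = {
    "video_id": "example_0001",                       # made-up identifier
    "slides": {
        "simple_blocks": ["Tree-constrained Pointer Generator"],
        "complex_blocks": ["p(y) = g * P_ptr(y) + (1 - g) * P_model(y)"],  # e.g. formulas
    },
    "speech": {
        "spoken_form": "so t c p gen biases the decoder towards rare words",
        "written_form": "So, TCPGen biases the decoder towards rare words.",
        "special_vocabulary": ["TCPGen"],
        "word_timestamps": [("so", 0.00, 0.18), ("TCPGen", 0.18, 0.95)],   # (word, start_s, end_s)
    },
    "paper": "corresponding_paper.pdf",               # (*) computer science videos only
}

print(record["speech"]["written_form"])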
Statistics

Figure: statistics of the 🎓M3AV dataset.

Comparison with Related Work

The 🎓M3AV dataset contains the most complete and human-annotated resources of slides, speech, and papers, thus supporting not only the recognition of multimodal content but also the comprehension of high-level academic knowledge. At the same time, the dataset is relatively large while remaining accessible.

Figure: Comparison with other academic lecture-based datasets in terms of data types and designed tasks. "A" denotes fully automated processing and "M" denotes fully or partially manual labelling.
+ context +

Figure: Comparison with other academic lecture-based datasets in terms of data size and availability.

Benchmark Systems

ASR & Contextual ASR

End-to-end models struggle with rare word recognition, as reflected by the biased word error rate (BWER), which is more than twice as high as the overall WER. By using TCPGen with the OCR information (contextual ASR), we achieve relative BWER reductions of 37.8% and 34.2% on the dev and test sets, respectively.

Table: Evaluation results on ASR and CASR tasks.
Table: Evaluation results on ASR and CASR tasks. -

-
+

-
-
- -
+ + -
-
- -
+
+

Spontaneous TTS

The MQTTS model shows the best performance across all evaluation metrics, which indicates that the real speech in our dataset can drive AI systems to simulate more natural speech.

Table: Evaluation results on the Spontaneous TTS task. “GT” denotes the ground truth.

Slide and Script Generation

(1) The open-source models (LLaMA-2, InstructBLIP) show only a limited performance improvement when scaled from 7B to 13B, and their performance falls far behind that of the closed-source models (GPT-4 and GPT-4V). We believe that high-quality pre-training data, e.g., informative corpora and visual QA data that encapsulate multimodal information, is required to enhance their SSG performance beyond simply boosting the model size.
(2) The latest LMM (GPT-4V) already exceeds the cascaded pipeline composed of unimodal expert models. This suggests that the LMM not only retains the ability to process textual information but also possesses multi-sensory capabilities, such as the perception and recognition of slides.

Table: Evaluation results on SSG tasks. The upper part of “Slide→Script” shows cascaded pipelines, while the lower part shows integrated systems.
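As a rough illustration of the two system styles compared above, the sketch below contrasts a cascaded pipeline, where an OCR expert model feeds a text-only LLM, with an integrated system, where a large multimodal model perceives the slide directly. Both run_ocr and generate are hypothetical stand-ins returning canned strings; they are not real model APIs.

# Hypothetical contrast between cascaded and integrated SSG systems.
def run_ocr(slide_image: str) -> str:
    """Stand-in for a unimodal OCR expert model."""
    return "TCPGen: tree-constrained pointer generator"

def generate(prompt: str) -> str:
    """Stand-in for a text-only LLM or a large multimodal model (LMM)."""
    return "[generated script for] " + prompt

def cascaded(slide_image: str) -> str:
    # The LLM only ever sees the OCR text, so layout and figures are lost.
    slide_text = run_ocr(slide_image)
    return generate("Write the presenter's script for this slide:\n" + slide_text)

def integrated(slide_image: str) -> str:
    # The LMM receives the slide image itself alongside the instruction.
    return generate("<image:" + slide_image + "> Write the presenter's script for this slide.")

print(cascaded("slide_001.png"))
print(integrated("slide_001.png"))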

(3) RAG substantially enhances the generation, as shown by the improvement after introducing the paper information.

Table: Performance improvements of LLaMA-2 7B brought by retrieving paper information. “Subset” denotes that only the Computer Science videos are contained in all sets, since they are the only ones with downloadable papers.
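A minimal sketch of the retrieval-augmented generation idea, assuming scikit-learn is available: rank the paper passages by TF-IDF similarity to the slide text and prepend the best ones to the prompt given to the language model. This is our simplified illustration, not the authors' pipeline, and the passages are invented.

# Simplified RAG-style prompt construction; not the authors' pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve(slide_text, paper_passages, k=2):
    """Return the k paper passages most similar to the slide text (TF-IDF)."""
    vectorizer = TfidfVectorizer().fit(paper_passages + [slide_text])
    passage_vectors = vectorizer.transform(paper_passages)
    query_vector = vectorizer.transform([slide_text])
    scores = cosine_similarity(query_vector, passage_vectors)[0]
    best = scores.argsort()[::-1][:k]
    return [paper_passages[i] for i in best]

def build_prompt(slide_text, paper_passages):
    context = "\n".join(retrieve(slide_text, paper_passages))
    return ("Relevant passages from the paper:\n" + context + "\n\n"
            "Slide content:\n" + slide_text + "\n\n"
            "Write the speech script the presenter would say for this slide.")

# Invented inputs; the resulting prompt would then be fed to LLaMA-2 7B.
passages = ["We annotate simple and complex text blocks on every slide.",
            "TCPGen biases the ASR decoder towards rare words from the slides."]
print(build_prompt("Slide: TCPGen for contextual speech recognition", passages))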

Conclusion

We release the Multimodal, Multigenre, and Multipurpose Audio-Visual Dataset with Academic Lectures (🎓M3AV), covering a range of academic fields. The dataset contains manually annotated speech transcriptions, slide text, and additional extracted papers, providing a basis for evaluating AI models on recognizing multimodal content and understanding academic knowledge. We detail the creation pipeline and conduct various analyses of the dataset. Furthermore, we build benchmarks and conduct experiments around the dataset. We find that there is still large room for existing models to improve their perception and understanding of academic lecture videos.

BibTeX

@article{chen2024m3av,
       title={{M\textsuperscript{3}AV}: A Multimodal, Multigenre, and Multipurpose Audio-Visual Academic Lecture Dataset},
       author={Chen, Zhe and Liu, Heyang and Yu, Wenyi and Sun, Guangzhi and Liu, Hongcheng and Wu, Ji and Zhang, Chao and Wang, Yu and Wang, Yanfeng},
       journal={arXiv preprint arXiv:2403.14168},
       year={2024}
 }