From 0641b026e96a754c956bdc24dc6286b2cb081943 Mon Sep 17 00:00:00 2001
From: Jack-ZC8 <73177056+Jack-ZC8@users.noreply.github.com>
Date: Sat, 1 Jun 2024 22:00:42 +0800
Subject: [PATCH] Update index.html
Publishing open-source academic video recordings is an emergent and prevalent approach to sharing knowledge online. Such videos carry rich multimodal information including speech, the facial and body movements of the speakers, as well as the text and pictures in the slides and possibly even the papers. Although multiple academic video datasets have been constructed and released, few of them support both multimodal content recognition and understanding tasks, which is partially due to the lack of high-quality human annotations.

In this paper, we propose a novel multimodal, multigenre, and multipurpose audio-visual academic lecture dataset (M³AV), which has almost 367 hours of videos from five sources covering computer science, mathematics, medicine, and biology. With high-quality human annotations of the slide text and spoken words, in particular high-value named entities, the dataset can be used for multiple audio-visual recognition and understanding tasks.

Evaluations performed on contextual speech recognition, speech synthesis, and slide and script generation tasks demonstrate that the diversity of M³AV makes it a challenging dataset.
The overview of our M³AV dataset is shown below. The first component is slides annotated with simple and complex blocks, which are merged according to a set of rules. The second component is speech containing special vocabulary, spoken and written forms, and word-level timestamps. The third component is the paper corresponding to the video. The asterisk (*) denotes that only computer science videos have corresponding papers.
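To make the three components concrete, here is a rough sketch of how one video's annotations could be laid out in Python; the class and field names (SlideBlock, WordSegment, paper_text, and so on) are illustrative assumptions rather than the dataset's actual schema or file format.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SlideBlock:
    # A text block on one slide; "simple" blocks may be merged into
    # "complex" blocks according to the merging rules.
    text: str
    kind: str            # "simple" or "complex"
    bbox: tuple          # (x0, y0, x1, y1) position on the slide

@dataclass
class WordSegment:
    # One spoken word with word-level timestamps and both forms.
    spoken_form: str     # e.g. "twenty twenty four"
    written_form: str    # e.g. "2024"
    start: float         # seconds
    end: float           # seconds

@dataclass
class LectureRecord:
    video_id: str
    field_of_study: str               # e.g. "computer science"
    slides: List[List[SlideBlock]]    # one list of blocks per slide
    speech: List[WordSegment]
    paper_text: Optional[str] = None  # only CS videos have papers (*)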
The M³AV dataset contains the most complete, human-annotated resources of slides, speech, and papers, thus supporting not only the recognition of multimodal content but also the comprehension of high-level academic knowledge. At the same time, our dataset is relatively large in size while remaining accessible.
Figure: Comparison with other academic lecture-based datasets in terms of data types and designed tasks. "A" denotes fully automated processing and "M" denotes fully or partially manual labelling.
Figure: Comparison with other academic lecture-based datasets in terms of data size and availability.
End-to-end models suffer from rare word recognition, as reflected by the BWER: a more than two-fold increase in error rate is observed when comparing BWER to WER. By using TCPGen with the OCR information (contextual ASR), we achieve relative BWER reductions of 37.8% and 34.2% on the dev and test sets, respectively; a sketch of how such a relative reduction is computed follows the table below.
Table: Evaluation results on ASR and CASR tasks.
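To make the reported numbers concrete, the following sketch computes WER via Levenshtein alignment and the relative reduction used to express gains such as the 37.8% BWER decrease; BWER restricts the same measure to the rare/biasing words (here, those extracted by OCR from the slides), and the inputs below are placeholder values, not numbers from our tables.

# Minimal sketch: WER via edit distance, plus the relative reduction
# used to report results such as a "37.8% relative BWER decrease".
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

def relative_reduction(baseline: float, improved: float) -> float:
    # e.g. baseline BWER 0.500 improved to 0.311 gives ~0.378 (37.8% relative)
    return (baseline - improved) / baseline

print(word_error_rate("tcpgen biases rare words", "tcpgen biases real words"))  # 0.25
print(relative_reduction(0.500, 0.311))  # ~0.378, i.e. a 37.8% relative decrease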
The MQTTS model shows the best performance across all evaluation metrics, indicating that the real speech in our dataset can drive AI systems to synthesize more natural speech.
Table: Evaluation results on the Spontaneous TTS task. "GT" denotes the ground truth.
(1) The open-source models (LLaMA-2, InstructBLIP) show limited performance improvement when scaled from 7B to 13B, and their performance is far from that of the closed-source models (GPT-4 and GPT-4V). We believe that high-quality pre-training data, e.g., informative corpora and visual QA data that encapsulate multimodal information, is required to enhance their SSG performance beyond merely boosting model size.
(2) The latest LMM (GPT-4V) has already exceeded the cascaded pipeline composed of unimodal expert models. This suggests that the LMM not only maintains the ability to process textual information but also possesses multi-sensory capabilities, such as the perception and recognition of slides; a minimal sketch of the cascaded setting follows the table below.
Table: Evaluation results on SSG tasks. The upper part of "Slide→Script" shows cascading pipelines, while the lower part shows integrated systems.
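For readers unfamiliar with the cascaded setting, here is a minimal sketch of such a pipeline, assuming pytesseract as the OCR expert and a placeholder generate_script function standing in for any text-only generator; neither is the exact model used in our experiments, and an integrated LMM would instead consume the slide image directly.

# Hypothetical cascaded SSG baseline: unimodal OCR followed by a
# text-only generator.
from PIL import Image
import pytesseract  # stand-in OCR expert model

def generate_script(prompt: str) -> str:
    # Placeholder for any text-only LLM call (LLaMA-2, GPT-4, ...).
    return f"[speech script generated from a prompt of {len(prompt)} chars]"

def cascaded_slide_to_script(slide_image_path: str) -> str:
    # Step 1: OCR extracts the slide text from the image.
    slide_text = pytesseract.image_to_string(Image.open(slide_image_path))
    # Step 2: a text-only model writes the speech script from that text.
    prompt = ("You are a lecturer. Write the spoken script for a slide "
              "containing the following text:\n" + slide_text)
    return generate_script(prompt)

# Example usage (assumes a slide image exists on disk):
# print(cascaded_slide_to_script("slide_001.png"))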
(3) Retrieval-augmented generation (RAG) substantially enhances generation quality, as shown by the improvement obtained after introducing the paper information; a sketch of one possible retrieval step follows the table below.
Table: Performance improvements of LLaMA-2 7B brought by retrieving paper information. "Subset" denotes that only Computer Science videos are contained in all sets, as they are the only ones with downloadable papers.
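As an illustration of the retrieval step, the sketch below ranks paper passages against the slide text with TF-IDF cosine similarity and prepends the top matches to the generation prompt; the scikit-learn retriever and the prompt wording are assumptions made for illustration, not the configuration used in our experiments.

# Hypothetical RAG step: pick the paper passages most similar to the
# slide text and prepend them to the prompt given to the generator.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_passages(slide_text: str, paper_passages: list, top_k: int = 2) -> list:
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform([slide_text] + paper_passages)
    scores = cosine_similarity(matrix[0:1], matrix[1:]).ravel()
    ranked = scores.argsort()[::-1][:top_k]
    return [paper_passages[i] for i in ranked]

def build_rag_prompt(slide_text: str, paper_passages: list) -> str:
    context = "\n".join(retrieve_passages(slide_text, paper_passages))
    return ("Relevant excerpts from the paper:\n" + context +
            "\n\nWrite the speech script for a slide containing:\n" + slide_text)

# Example usage with placeholder passages:
passages = ["We propose TCPGen for contextual ASR ...",
            "Our dataset contains 367 hours of lectures ...",
            "Ablation studies show the effect of OCR biasing lists ..."]
print(build_rag_prompt("TCPGen with OCR biasing words", passages))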
We release the Multimodal, Multigenre, and Multipurpose Audio-Visual Dataset with Academic Lectures (M³AV), covering a range of academic fields. This dataset contains manually annotated speech transcriptions, slide text, and additionally extracted papers, providing a basis for evaluating AI models on recognizing multimodal content and understanding academic knowledge. We detail the creation pipeline and conduct various analyses of the dataset. Furthermore, we build benchmarks and conduct experiments around the dataset. We find that there is still large room for existing models to improve their perception and understanding of academic lecture videos.
BibTeX

@article{chen2024m3av,
title={{M\textsuperscript{3}AV}: A Multimodal, Multigenre, and Multipurpose Audio-Visual Academic Lecture Dataset},
author={Chen, Zhe and Liu, Heyang and Yu, Wenyi and Sun, Guangzhi and Liu, Hongcheng and Wu, Ji and Zhang, Chao and Wang, Yu and Wang, Yanfeng},
journal={arXiv preprint arXiv:2403.14168},
year={2024}
}