BDAI Key Laboratory Graduate Salon, Session 31: BMU-MoCo: Bidirectional Momentum Update for Continual Video-Language Modeling
Date: 2022-10-14

The graduate salon of the Beijing Key Laboratory of Big Data Management and Analysis Methods (BDAI) is held regularly and organized by faculty and students of the Gaoling School of Artificial Intelligence, Renmin University of China. At the session on October 19, Yizhao Gao, a PhD student advised by Prof. Zhiwu Lu, and Wenke Xia, a PhD student advised by tenure-track Assistant Professor Di Hu, will each present their own research. All students are warmly welcome to attend and join the discussion!


Talk title: BMU-MoCo: Bidirectional Momentum Update for Continual Video-Language Modeling

Speaker: Yizhao Gao, third-year PhD student

Advisor: Zhiwu Lu

Research interests: multi-modal pre-training, continual learning

Abstract: Video-language models suffer from forgetting old/learned knowledge when trained with streaming data. In this work, we thus propose a continual video-language modeling (CVLM) setting, where models are supposed to be sequentially trained on five widely-used video-text datasets with different data distributions. Although most existing continual learning methods have achieved great success by exploiting extra information (e.g., memory data of past tasks) or dynamically extended networks, they cause enormous resource consumption when transferred to our CVLM setting. To overcome the challenges (i.e., catastrophic forgetting and heavy resource consumption) in CVLM, we propose a novel cross-modal MoCo-based model with bidirectional momentum update (BMU), termed BMU-MoCo. Concretely, our BMU-MoCo has two core designs: (1) Different from conventional MoCo, we apply the momentum update not only to the momentum encoders but also to the encoders (i.e., bidirectionally) at each training step, which enables the model to review the learned knowledge retained in the momentum encoders. (2) To further enhance our BMU-MoCo by utilizing earlier knowledge, we additionally maintain a pair of global momentum encoders (initialized only at the very beginning) with the same BMU strategy. Extensive results show that our BMU-MoCo remarkably outperforms recent competitors w.r.t. video-text retrieval performance and forgetting rate, even without using any extra data or dynamic networks.
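For readers less familiar with MoCo-style training, the following minimal PyTorch-style sketch illustrates the bidirectional momentum update (BMU) described above: at each step the momentum encoder is pulled towards the encoder as in conventional MoCo, and the encoder is in turn pulled back towards the momentum encoder so it can review previously learned knowledge. The function name, momentum coefficients, and the single-call structure are assumptions made for illustration, not the paper's exact implementation.

```python
import torch


@torch.no_grad()
def bidirectional_momentum_update(encoder, momentum_encoder,
                                  m_forward=0.999, m_backward=0.999):
    """Hypothetical sketch of the bidirectional momentum update (BMU).

    Conventional MoCo only performs the forward EMA step
    (momentum encoder <- encoder). BMU additionally performs a backward
    EMA step (encoder <- momentum encoder), letting the encoder "review"
    knowledge retained in the momentum encoder. The coefficients here are
    illustrative placeholders, not the values used in the paper.
    """
    for p, p_m in zip(encoder.parameters(), momentum_encoder.parameters()):
        # Compute both updates from the pre-update parameters.
        new_p_m = m_forward * p_m.data + (1.0 - m_forward) * p.data
        new_p = m_backward * p.data + (1.0 - m_backward) * p_m.data
        p_m.data.copy_(new_p_m)
        p.data.copy_(new_p)
```

In a CVLM-style training loop one would presumably call this after every optimizer step, once for the text encoder pair and once for the video encoder pair; the global momentum encoders mentioned in the abstract would form a second pair, initialized only once at the start of the task sequence and updated with the same rule.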

Talk title: Robust Cross-Modal Knowledge Distillation for Videos in the Wild

Speaker: Wenke Xia, first-year PhD student

Advisor: Di Hu

Research interests: cross-modal knowledge distillation, multi-modal perception

Abstract: Cross-modal distillation has been widely used to transfer knowledge across different modalities, enriching the representation of the target unimodal one. Recent studies closely tie the temporal synchronization that exists between vision and sound in the same scene to the semantic consistency needed for cross-modal distillation. However, such semantic consistency derived from synchronization is hard to guarantee for videos in the wild, due to irrelevant modality noise and differentiated semantic correlation, and may therefore fail to transfer useful knowledge. To this end, we present a robust cross-modal distillation approach for unconstrained videos. Concretely, we first propose a Modality Noise Filter (MNF) module to erase the irrelevant noise in the teacher modality with cross-modal context. After this purification, we then design a Contrastive Semantic Calibration (CSC) module to adaptively distill useful knowledge for the target modality, by referring to the differentiated sample-wise semantic correlation in a contrastive fashion. By cascading these modules, our approach alleviates the influence of irrelevant modality noise and differentiated semantic correlation in an end-to-end manner. Extensive experiments validate that our method brings a performance boost over other distillation methods on both visual action recognition and video retrieval tasks. We also extend our method to audio tagging to demonstrate the generalization of our cross-modal distillation approach.
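As a rough illustration of the idea behind the Contrastive Semantic Calibration (CSC) module, the PyTorch-style sketch below weights a per-sample feature-distillation loss by a contrastive score of how strongly each teacher feature matches its paired student feature, so weakly correlated (noisy) pairs are distilled less. All names, the cosine-distance loss, and the softmax weighting are assumptions made for illustration; they are not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def calibrated_distillation_loss(student_feat, teacher_feat, temperature=0.07):
    """Hypothetical sketch of sample-wise calibrated cross-modal distillation.

    student_feat: (B, D) features from the target modality (e.g. vision).
    teacher_feat: (B, D) features from the teacher modality (e.g. audio).
    Pairs whose cross-modal semantic correlation is weak receive a smaller
    weight, so irrelevant teacher information is transferred less strongly.
    """
    s = F.normalize(student_feat, dim=-1)
    t = F.normalize(teacher_feat, dim=-1)

    # Contrastive similarities between every student/teacher pair in the batch.
    logits = s @ t.t() / temperature                        # (B, B)
    # For each sample, how strongly its own teacher stands out among all teachers.
    pair_weight = torch.softmax(logits, dim=1).diagonal()   # (B,)

    # Plain feature-matching loss (cosine distance), weighted per sample.
    per_sample_loss = 1.0 - (s * t).sum(dim=-1)              # (B,)
    return (pair_weight.detach() * per_sample_loss).mean()
```

The weighting is detached so that only the distillation term, not the calibration score itself, drives the gradient; this keeps the sketch a plain re-weighted feature-matching loss.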
