EDUCATION

2014-2019 Ph.D. Candidate, Northwestern Polytechnical University

2010-2014 B.S. Honors College, Northwestern Polytechnical University

WORK EXPERIENCE

2020-Present Assistant Professor, Renmin University of China

2019-2020 Research Scientist, Baidu Research

RESEARCH INTERESTS

Machine Multimodal Perception and Learning: exploring the open problems and methods of multimodal signals (such as image, sound, and touch) for machine perception, reasoning, and understanding, with the goal of equipping machines with “multisensory cognitive ability”.

PROSPECTIVE STUDENTS/STAFF

Curious about the surrounding world, self-driven, and aiming to do interesting, meaningful, and valuable research.

PUBLICATIONS

2025

Adaptive Unimodal Regulation for Balanced Multimodal Information Acquisition

Chengxiang Huang, Yake Wei, Zequn Yang, Di Hu

Computer Vision and Pattern Recognition (CVPR)


Phoenix: A Motion-based Self-Reflection Framework for Fine-grained Robotic Action Correction

Wenke Xia, Ruoxuan Feng, Dong Wang, Di Hu

Computer Vision and Pattern Recognition (CVPR)


Crab: A Unified Audio-Visual Scene Understanding Model with Explicit Cooperation

Henghui Du, Guangyao Li, Chang Zhou, Chunjie Zhang, Alan Zhao, Di Hu

Computer Vision and Pattern Recognition (CVPR)


Patch Matters: Training-free Fine-grained Image Caption Enhancement via Local Perception

Ruotian Peng, Haiying He, Yake Wei, Yandong Wen, Di Hu

Computer Vision and Pattern Recognition (CVPR)


AnyTouch: Learning Unified Static-Dynamic Representation across Multiple Visuo-tactile Sensors

Ruoxuan Feng, Jiangyu Hu, Wenke Xia, Tianci Gao, Ao Shen, Yuhao Sun, Bin Fang*, Di Hu*

International Conference on Learning Representations (ICLR)


2024

On-the-fly Modulation for Balanced Multimodal Learning

Yake Wei, Di Hu*, Henghui Du, and Ji-Rong Wen

IEEE Trans. Pattern Analysis and Machine Intelligence (TPAMI)


Play to the Score: Stage-Guided Dynamic Multi-Sensory Fusion for Robotic Manipulation (Oral)

Ruoxuan Feng, Di Hu*, Wenke Ma, Xuelong Li

Conference on Robot Learning (CoRL)


KOI: Accelerating Online Imitation Learning via Hybrid Key-state Guidance

Jingxian Lu, Wenke Xia, Dong Wang, Zhigang Wang, Bin Zhao, Di Hu*, and Xuelong Li

Conference on Robot Learning (CoRL)


Diagnosing and Re-learning for Balanced Multimodal Learning

Yake Wei, Siwei Li, Ruoxuan Feng, and Di Hu*

European Conference on Computer Vision (ECCV)


Stepping Stones: A Progressive Training Strategy for Audio-Visual Semantic Segmentation

Juncheng Ma, Peiwen Sun, Yaoting Wang, and Di Hu*

European Conference on Computer Vision (ECCV)


Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes

Yaoting Wang†, Peiwen Sun†, Dongzhan Zhou, Guangyao Li, Honggang Zhang, and Di Hu*

European Conference on Computer Vision (ECCV)


Can Textual Semantics Mitigate Sounding Object Segmentation Preference?

Yaoting Wang†, Peiwen Sun†, Yuanchao Li, Honggang Zhang, and Di Hu*

European Conference on Computer Vision (ECCV)


Boosting Audio Visual Question Answering via Key Semantic-Aware Cues

Guangyao Li, Henghui Du, and Di Hu

ACM Conference on Multimedia (ACMMM)


Unveiling and Mitigating Bias in Audio Visual Segmentation (Oral)

Peiwen Sun, Honggang Zhang, and Di Hu

ACM Conference on Multimedia (ACMMM)


Depth Helps: Improving Pre-trained RGB-based Policy with Depth Information Injection

Xincheng Pang†, Wenke Xia†, Zhigang Wang, Bin Zhao, Di Hu*, Dong Wang, and Xuelong Li

International Conference on Intelligent Robots and Systems (IROS)


Learning Manipulation by Predicting Interaction

Jia Zeng, Qingwen Bu, Bangjun Wang, Wenke Xia, Li Chen, Hao Dong, Haoming Song, Dong Wang, Di Hu, Ping Luo, Heming Cui, Bin Zhao, Xuelong Li, Yu Qiao, and Hongyang Li

Robotics: Science and Systems Conference (RSS)


MMPareto: Innocent Uni-modal Assistance for Enhanced Multi-modal Learning

Yake Wei, Di Hu

International Conference on Machine Learning (ICML)


Enhancing Multi-modal Cooperation via Fine-grained Modality Valuation

Yake Wei, Ruoxuan Feng, Zihe Wang, Di Hu

Computer Vision and Pattern Recognition (CVPR)


Quantifying and Enhancing Multi-modal Robustness with Modality Preference

Zequn Yang, Yake Wei, Ce Liang, Di Hu

The Twelfth International Conference on Learning Representations (ICLR)


SphereDiffusion: Spherical Geometry-aware Distortion Resilient Diffusion Model

Tao Wu, Xuewei Li, Zhongang Qi, Di Hu, Xintao Wang, Ying Shan, Xi Li

The 38th Annual AAAI Conference on Artificial Intelligence


Prompting Segmentation with Sound is Generalizable Audio-Visual Source Localizer

Yaoting Wang*, Weisong Liu*, Guangyao Li, Jian Ding, Di Hu, Xi Li

The 38th Annual AAAI Conference on Artificial Intelligence


Geometric-Inspired Graph-based Incomplete Multi-view Clustering

Zequn Yang, Han Zhang, Yake Wei, Zheng Wang, Feiping Nie, Di Hu

Pattern Recognition


Kinematic-aware Prompting for Generalizable Articulated Object Manipulation with LLMs

Wenke Xia, Dong Wang, Xincheng Pang, Zhigang Wang, Bin Zhao, Di Hu, Xuelong Li

IEEE International Conference on Robotics and Automation (ICRA)


2023

TikTalk: A Video-Based Dialogue Dataset for Multi-Modal Chitchat in Real World

Hongpeng Lin*, Ludan Ruan*, Wenke Xia*, Peiyu Liu, Jingyuan Wen, Yixin Xu, Di Hu, Ruihua Song, Wayne Xin Zhao, Qin Jin, Zhiwu Lu

ACM Multimedia (ACM MM)


Progressive Spatio-temporal Perception for Audio-Visual Question Answering

Guangyao Li, Wenxuan Hou, Di Hu

ACM Multimedia (ACM MM)


Towards Inadequately Pre-trained Models in Transfer Learning

Andong Deng, Xingjian Li, Di Hu, Tianyang Wang, Haoyi Xiong, Chengzhong Xu

International Conference on Computer Vision (ICCV)


Balanced Audiovisual Dataset for Imbalance Analysis

Wenke Xia*, Xu Zhao*, Xincheng Pang, Changqing Zhang, Di Hu

Computer Vision and Pattern Recognition (CVPR) Workshop


Multi-Scale Attention for Audio Question Answering

Guangyao Li, Yixin Xu, Di Hu

Interspeech


Supervised Knowledge May Hurt Novel Class Discovery Performance

ZiYun Li, Jona Otholt, Ben Dai, Di Hu, Christoph Meinel, Haojin Yang

Transactions on Machine Learning Research (TMLR)


Revisiting Pre-training in Audio-Visual Learning

Ruoxuan Feng, Wenke Xia, Di Hu

arXiv:2302.03533


MMCosine: Multi-Modal Cosine Loss Towards Balanced Audio-Visual Fine-Grained Learning

Ruize Xu, Ruoxuan Feng, Shi-xiong Zhang, Di Hu

ICASSP


2022

SeCo: Separating Unknown Musical Visual Sounds with Consistency Guidance

Xinchi Zhou, Dongzhan Zhou, Wanli Ouyang, Hang Zhou, Di Hu

IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)


Exploiting Visual Context Semantics for Sound Source Localization

Xinchi Zhou, Dongzhan Zhou, Di Hu, Hang Zhou, Wanli Ouyang

IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)


Self-supervised Learning for Heterogeneous Audiovisual Scene Analysis

Di Hu, Zheng Wang, Feiping Nie, Rong Wang, Xuelong Li

TMM


Learning to Answer Questions in Dynamic Audio-Visual Scenarios

Guangyao Li*, Yake Wei*, Yapeng Tian*, Chenliang Xu, Ji-Rong Wen, Di Hu

CVPR (ORAL)


Balanced Multimodal Learning via On-the-fly Gradient Modulation

Xiaokang Peng*, Yake Wei*, Andong Deng, Dong Wang, Di Hu

CVPR (ORAL)


SepFusion: Finding Optimal Fusion Structures for Visual Sound Separation

Dongzhan Zhou, Xinchi Zhou, Di Hu*, Hang Zhou, Lei Bai, Ziwei Liu, Wanli Ouyang

AAAI


Visual Sound Localization in-the-Wild by Cross-Modal Interference Erasing

Xian Liu, Rui Qian, Hang Zhou, Di Hu, Weiyao Lin, Ziwei Liu, Bolei Zhou, Xiaowei Zhou

AAAI


2021

Class-aware Sounding Objects Localization via Audiovisual Correspondence

Di Hu, Yake Wei, Rui Qian, Weiyao Lin, Ruihua Song, Ji-Rong Wen

TPAMI


Cyclic Co-Learning of Sounding Object Visual Grounding and Sound Separation

Yapeng Tian, Di Hu*, Chenliang Xu*

CVPR


Unsupervised Multi-Source Domain Adaptation for Person Re-Identification

Zechen Bai, Zhigang Wang, Jian Wang, Di Hu*, Errui Ding*

CVPR


Temporal Relational Modeling with Self-Supervision for Action Segmentation

Dong Wang, Di Hu*, Xingjian Li, Dejing Dou

AAAI


2020

Discriminative Sounding Objects Localization via Self-supervised Audiovisual Matching

Di Hu, Rui Qian, Minyue Jiang, Xiao Tan, Shilei Wen, Errui Ding, Weiyao Lin, Dejing Dou

NeurIPS


A Two-Stage Framework for Multiple Sound-Source Localization

Rui Qian, Di Hu, Heinrich Dinkel, Mengyue Wu, Ning Xu, Weiyao Lin

In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshop (CVPRW), 2020.


Co-Learn Sounding Object Visual Grounding and Visually Indicated Sound Separation in A Cycle

Yapeng Tian, Di Hu, Chenliang Xu

In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshop (CVPRW), 2020.


Does Ambient Sound Help? - Audiovisual Crowd Counting

Di Hu, Lichao Mou, Qingzhong Wang, Junyu Gao, Yuansheng Hua, Dejing Dou, and Xiaoxiang Zhu

In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshop (CVPRW), 2020.


Heterogeneous Scene Analysis via Self-supervised Audiovisual Learning

Di Hu, Zheng Wang, Haoyi Xiong, Dong Wang, Feiping Nie, and Dejing Dou

In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshop (CVPRW), 2020.


Multiple Sound Sources Localization from Coarse to Fine

Rui Qian, Di Hu, Heinrich Dinkel, Mengyue Wu, Ning Xu, and Weiyao Lin

In Proceedings of the European Conference on Computer Vision (ECCV), 2020.


Cross-Task Transfer for Multimodal Aerial Scene Recognition

Di Hu, Xuhong Li, Lichao Mou, Pu Jin, Dong Chen, Liping Jing, Xiaoxiang Zhu, and Dejing Dou

In Proceedings of the European Conference on Computer Vision (ECCV), 2020.


2019

Dense Multimodal Fusion for Hierarchically Joint Representation

Di Hu, Chengze Wang, Feiping Nie, and Xuelong Li

In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.


Listen to the Image

Di Hu, Dong Wang, Feiping Nie, Qi Wang, and Xuelong Li

In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019. (CCF A)


Deep Multimodal Clustering for Unsupervised Audiovisual Learning

Di Hu, Feiping Nie, and Xuelong Li

In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019. (CCF A)


Deep Linear Discriminant Analysis Hashing

Di Hu, Feiping Nie, and Xuelong Li

Sci Sin Inform, 2019. (CCF A)


2018

Deep Binary Reconstruction for Cross-modal Hashing

Di Hu, Feiping Nie, and Xuelong Li

IEEE Trans. Multimedia (TMM), 2018.


Discrete Spectral Hashing for Efficient Similarity Retrieval

Di Hu, Feiping Nie, and Xuelong Li

IEEE Trans. Image Processing (TIP), 2018. (CCF A)


2017

Large Graph Hashing with Spectral Rotation

Xuelong Li, Di Hu, and Feiping Nie

In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2017. (CCF A)


Deep Binary Reconstruction for Cross-modal Hashing

Xuelong Li, Di Hu, and Feiping Nie

In Proceedings of the ACM Conference on Multimedia (ACMMM), 2017. (CCF A)


Image2song: Song Retrieval via Bridging Image Content and Lyric Words

Xuelong Li, Di Hu, and Xiaoqiang Lu

In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017. (CCF A)


2016

Temporal Multimodal Learning in Audiovisual Speech Recognition

Di Hu, Xuelong Li, and Xiaoqiang Lu

In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. (CCF A)


Multimodal Learning via Exploring Deep Semantic Similarity

Di Hu, Xiaoqiang Lu, and Xuelong Li

In Proceedings of the ACM Conference on Multimedia (ACMMM), 2016. (CCF A)

HONORS AND AWARDS

2020.9 Won the 2020 CAAI Outstanding Doctoral Dissertation Award

2019.8 Selected by the『AIDU』Talent Recruitment Project of Baidu

2019.8 Won the 2019 ACM Xi'an Doctoral Dissertation Award

2019.5 Selected by the CVPR 2019 Doctoral Consortium

SERVICES

1. Journal Reviewer: TIP, TKDE, TMM, Neurocomputing

2. Conference Program Committee: NeurIPS 2020, CVPR 2018 2020, ICCV 2019, ECCV 2020, AAAI 2018 2020, ACCV 2018 2020

3. Co-organizer: ICDM 2019 Tutorial on Automated Deep Learning: Theory, Algorithms, Platforms, and Applications

CONTACT

Email: dihu[at]ruc.edu.cn

Website: https://dtaoo.github.io/