Talk title: Context Autoencoder for Scalable Self-Supervised Representation Pretraining
Host: Di Hu (胡迪), Tenure-Track Assistant Professor, Gaoling School of Artificial Intelligence, Renmin University of China
Abstract: Self-supervised representation pretraining aims to learn an encoder from unlabeled images, such that the encoded representations carry semantics and benefit downstream tasks. In this talk, I present a novel masked image modeling approach, the context autoencoder (CAE), for scalable self-supervised representation pretraining. The core ideas are twofold: first, predictions are made in the latent representation space, from visible patches to masked patches; second, the encoder is used only for representation learning, and representation learning is done only by the encoder. I also discuss why masked image modeling potentially outperforms contrastive pretraining (e.g., SimCLR, MoCo) and why contrastive learning performs on par with supervised pretraining on ImageNet. In addition, I show that linear probing and its extended version, attentive probing, are more suitable than fine-tuning on ImageNet for evaluating pretraining.
Speaker: Jingdong Wang (王井东), Chief Scientist for Computer Vision, Baidu AI
Jingdong Wang is a Chief Scientist for computer vision with the Artificial Intelligence Group at Baidu. His team focuses on product-driven, cutting-edge research in computer vision, deep learning, and AI, and on developing practical computer vision applications. Before joining Baidu, he was a Senior Principal Researcher at Microsoft Research Asia. His areas of interest are computer vision, deep learning, and multimedia search. His representative works include the deep high-resolution network (HRNet), discriminative regional feature integration (DRFI) for supervised saliency detection, and neighborhood graph search (NGS, SPTAG) for large-scale similarity search. He has served as an Associate Editor of IEEE TPAMI, IJCV, IEEE TMM, and IEEE TCSVT, and as an area chair of leading conferences in vision, multimedia, and AI, such as CVPR, ICCV, ECCV, ACM MM, IJCAI, and AAAI. He was elected an ACM Distinguished Member, a Fellow of IAPR, and a Fellow of IEEE for his contributions to visual content understanding and retrieval.