On February 27, the acceptance results of CVPR 2023, a Class A international academic conference recommended by the China Computer Federation (CCF), were announced, and 6 papers by faculty and students of the Gaoling School of Artificial Intelligence, Renmin University of China, were accepted. The Conference on Computer Vision and Pattern Recognition (CVPR), organized by IEEE, is a top-tier conference in the field of computer vision and pattern recognition and is held annually at locations around the world. The acceptance rate of CVPR 2023 was 25.78%.
Paper Introduction 1
Paper Title: HOTNAS: Hierarchical Optimal Transport for Neural Architecture Search
Authors: Jiechao Yang, Yong Liu, Hongteng Xu
Corresponding Author: Yong Liu
Paper Overview: Instead of searching the entire network directly, current neural architecture search (NAS) approaches increasingly search for multiple relatively small cells to reduce search costs. A major challenge is to jointly measure the similarity of cell micro-architectures and the difference in macro-architectures between different cell-based networks. Recently, optimal transport (OT) has been successfully applied to NAS, as it can capture both the operation and structure similarity of different networks. Unfortunately, existing OT-based NAS methods either ignore the similarity between cells or focus on searching for a single cell architecture. To solve these problems, we propose a hierarchical optimal transport metric called HOTNN for measuring the similarity of different networks. The cell-level similarity computes the OT distance between cells in different networks by considering the similarity of each node in different cells and the differences in the information flow costs between node pairs within each cell, in terms of operation and structure information. The network-level similarity calculates the OT distance between networks by considering both the cell-level similarity and the difference in the global position of each cell in its respective network. We then explore HOTNN in a Bayesian optimization framework called HOTNAS to show its effectiveness across different tasks. Extensive experiments demonstrate that HOTNAS can find network architectures with better performance in multiple modular cell-based search spaces.
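As a rough illustration of the node-level matching idea only, the Python sketch below computes an exact OT distance between two small cells. It assumes each cell is reduced to a list of operation labels carrying uniform mass, in which case OT reduces to an assignment problem. The 0/1 `op_cost`, the function names, and the example cells are hypothetical; the information-flow (structure) cost and the network-level OT described in the overview are not modeled here.

```python
from itertools import permutations


def op_cost(op_a, op_b):
    """Hypothetical ground cost: 0 for identical operations, 1 otherwise."""
    return 0.0 if op_a == op_b else 1.0


def cell_ot_distance(cell_a, cell_b):
    """Exact OT distance between two equal-sized cells with uniform node mass.

    With uniform marginals and equal sizes, an optimal transport plan can be
    chosen as a permutation, so brute force over permutations is enough for
    tiny cells (it does not scale to large ones).
    """
    n = len(cell_a)
    if n != len(cell_b):
        raise ValueError("this sketch assumes cells with the same number of nodes")
    best_total = min(
        sum(op_cost(cell_a[i], cell_b[perm[i]]) for i in range(n))
        for perm in permutations(range(n))
    )
    return best_total / n


# Hypothetical cells described only by their node operations.
cell_1 = ["conv3x3", "skip", "maxpool"]
cell_2 = ["skip", "conv3x3", "conv3x3"]

# "maxpool" has no free match in cell_2, so one of three nodes pays cost 1.
distance = cell_ot_distance(cell_1, cell_2)
```

Reordering the nodes of a cell leaves this distance at zero, which is the kind of permutation invariance that makes OT attractive for comparing architectures.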
Paper Introduction 2
Paper Title: Fair Scratch Tickets: Finding Fair Sparse Networks without Weight Training
Authors: Pengwei Tang*, Wei Yao*, Zhicong Li, Yong Liu
Corresponding Author: Yong Liu
Paper Overview: Recent studies suggest that computer vision models come at the risk of compromising fairness. There is extensive work on alleviating unfairness in computer vision using pre-processing, in-processing, and post-processing methods. In this paper, we introduce a novel fairness-aware learning paradigm for in-processing methods through the lens of the lottery ticket hypothesis (LTH) in the context of computer vision fairness. We randomly initialize a dense neural network and find appropriate binary masks for the weights to obtain fair sparse subnetworks without any weight training. To the best of our knowledge, we are the first to discover that such sparse subnetworks with inborn fairness exist in randomly initialized networks, achieving an accuracy-fairness trade-off comparable to that of dense neural networks trained with existing fairness-aware in-processing approaches. We term these fair subnetworks Fair Scratch Tickets (FSTs). We also provide theoretical fairness and accuracy guarantees for them. In our experiments, we investigate the existence of FSTs on various datasets, target attributes, random initialization methods, sparsity patterns, and fairness surrogates. We also find that FSTs can transfer across datasets and investigate other properties of FSTs.
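The idea of masking frozen random weights can be sketched as follows. This is a minimal illustration, not the paper's procedure: it assumes a score-based top-k selection, which is one common way to pick subnetworks without weight training, and the `scores` here are random placeholders standing in for scores that would be learned against an accuracy and fairness objective. All names are hypothetical.

```python
import random


def top_k_mask(scores, sparsity):
    """Binary mask keeping the highest-scoring (1 - sparsity) fraction of weights."""
    keep = round(len(scores) * (1.0 - sparsity))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = set(order[:keep])
    return [1 if i in kept else 0 for i in range(len(scores))]


rng = random.Random(0)

# Randomly initialized weights; they stay frozen and are never trained.
weights = [rng.gauss(0.0, 1.0) for _ in range(8)]

# Placeholder scores (assumption): in practice these would be optimized.
scores = [rng.random() for _ in range(8)]

mask = top_k_mask(scores, sparsity=0.5)

# The sparse subnetwork reuses the original random values wherever mask == 1.
subnetwork = [w * m for w, m in zip(weights, mask)]
```

Only the mask changes during the search; every surviving weight keeps its initial random value.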
Paper Introduction 3
Paper Title: Modeling Video as Stochastic Processes for Fine-Grained Video Representation Learning
Authors: Heng Zhang, Daqing Liu, Qi Zheng, Bing Su
Corresponding Author: Bing Su
Paper Overview: A meaningful video is semantically coherent and changes smoothly. However, most existing fine-grained video representation learning methods learn frame-wise features by aligning frames across videos or exploring relevance between multiple views, neglecting the inherent dynamic process of each video. In this paper, we propose to learn video representations by modeling Video as Stochastic Processes (VSP) via a novel process-based contrastive learning framework, which aims to discriminate between video processes while capturing the temporal dynamics within each process. Specifically, we enforce the embeddings of the frame sequence of interest to approximate a goal-oriented stochastic process, namely a Brownian bridge, in the latent space via a process-based contrastive loss. To construct the Brownian bridge, we adapt specialized sampling strategies under different annotations for both self-supervised and weakly-supervised learning. Experimental results on four datasets show that VSP achieves state-of-the-art performance on various video understanding tasks, including phase progression, phase classification, and frame retrieval.
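For readers unfamiliar with the Brownian bridge, a standard bridge pinned at an embedding z_0 at time 0 and z_T at time T has, at an intermediate time t, mean (1 - t/T) z_0 + (t/T) z_T and per-dimension variance t(T - t)/T. The sketch below computes these two quantities; the function names and the toy embeddings are hypothetical, and how the paper's contrastive loss uses them is not shown.

```python
def bridge_mean(z_start, z_end, t, T):
    """Mean of a Brownian bridge at time t: linear interpolation of the endpoints."""
    alpha = t / T
    return [(1.0 - alpha) * s + alpha * e for s, e in zip(z_start, z_end)]


def bridge_variance(t, T):
    """Per-dimension variance at time t: zero at both ends, largest at the midpoint."""
    return t * (T - t) / T


# Toy 2-D embeddings of the first and last frame (hypothetical values).
z_first = [0.0, 0.0]
z_last = [2.0, 4.0]

mid_mean = bridge_mean(z_first, z_last, t=5, T=10)
mid_var = bridge_variance(t=5, T=10)
```

The vanishing variance at both endpoints is what makes the process "goal-oriented": intermediate frames may wander, but the trajectory is tied to where the sequence starts and ends.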
Paper Introduction 4
Paper Title: Transfer Knowledge from Head to Tail: Uncertainty Calibration under Long-tailed Distribution
Authors: Jiahao Chen, Bing Su
Corresponding Author: Bing Su
Paper Overview: Estimating the uncertainty of a given model is a crucial problem. Current calibration techniques treat different classes equally and thus implicitly assume that the distribution of training data is balanced, ignoring the fact that real-world data often follows a long-tailed distribution. In this paper, we explore the problem of calibrating a model trained on a long-tailed distribution. Because of the difference between the imbalanced training distribution and the balanced test distribution, existing calibration methods such as temperature scaling cannot generalize well to this problem. Calibration methods designed for domain adaptation are also not applicable because they rely on unlabeled target domain instances, which are not available. Models trained on a long-tailed distribution tend to be more overconfident on head classes. To address this, we propose a novel knowledge-transferring-based calibration method that estimates importance weights for samples of tail classes to realize long-tailed calibration. Our method models the distribution of each class as a Gaussian distribution and views the source statistics of head classes as a prior to calibrate the target distributions of tail classes. We adaptively transfer knowledge from head classes to obtain the target probability density of tail classes. The importance weight is estimated as the ratio of the target probability density to the source probability density. Extensive experiments on the CIFAR-10-LT, MNIST-LT, CIFAR-100-LT, and ImageNet-LT datasets demonstrate the effectiveness of our method.
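The density-ratio step can be illustrated with a small Python sketch. It is a simplification under stated assumptions: features are one-dimensional, each distribution is given as a hypothetical `(mean, std)` pair, and the adaptive transfer of head-class statistics that produces the target distribution is not modeled.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian with the given mean and standard deviation."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def importance_weight(x, source, target):
    """Ratio of the target density to the source density at feature value x."""
    return gaussian_pdf(x, *target) / gaussian_pdf(x, *source)


# Hypothetical statistics for one tail class: (mean, std).
source_stats = (0.0, 1.0)  # estimated from the few tail-class training samples
target_stats = (0.0, 2.0)  # assumed calibrated target, e.g. widened using head-class priors

# At the shared mean, the wider target has half the density of the source.
weight_at_mean = importance_weight(0.0, source_stats, target_stats)

# Far from the mean, the wider target places more mass, so the weight exceeds 1.
weight_in_tail = importance_weight(3.0, source_stats, target_stats)
```

Samples that the source distribution under-represents relative to the target receive weights above 1 and therefore count more during calibration.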
Paper Introduction 5
Paper Title: All are Worth Words: A ViT Backbone for Diffusion Models
Authors: Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Hang Su, Chongxuan Li, Jun Zhu
Corresponding Authors: Chongxuan Li, Jun Zhu
Paper Overview: Vision transformers (ViT) have shown promise in various vision tasks, while the U-Net based on a convolutional neural network (CNN) remains dominant in diffusion models. We design a simple and general ViT-based architecture (named U-ViT) for image generation with diffusion models. U-ViT is characterized by treating all inputs, including the time, condition, and noisy image patches, as tokens and by employing long skip connections between shallow and deep layers. We evaluate U-ViT in unconditional and class-conditional image generation, as well as text-to-image generation tasks, where U-ViT is comparable if not superior to a CNN-based U-Net of a similar size. In particular, latent diffusion models with U-ViT achieve record-breaking FID scores of 2.29 in class-conditional image generation on ImageNet 256×256, and 5.48 in text-to-image generation on MS-COCO, among methods that do not access large external datasets during the training of generative models.
Paper Introduction 6
Paper Title: Compacting Binary Neural Networks by Sparse Kernel Selection
Authors: Yikai Wang, Wenbing Huang, Yinpeng Dong, Fuchun Sun, Anbang Yao
Paper Overview: A Binary Neural Network (BNN) represents convolution weights with 1-bit values, which improves the efficiency of both storage and computation. This paper is motivated by a previously revealed phenomenon that the binary kernels in successful BNNs are nearly power-law distributed: their values are mostly clustered into a small number of codewords. This phenomenon encourages us to compact typical BNNs and obtain close performance by learning non-repetitive kernels within a subspace of all possible values. Specifically, we regard the binarization process as kernel grouping in terms of a binary codebook, and our task lies in learning to select a smaller subset of codewords from the full codebook. We then leverage the Gumbel-Sinkhorn technique to approximate the codeword selection process, and develop the Permutation Straight-Through Estimator (PSTE), which not only optimizes the selection process end-to-end but also maintains the non-repetitive occupancy of the selected codewords. Experiments on image classification and object detection verify that our method reduces both the model size and bit-wise computational costs, with only a slight accuracy drop compared with state-of-the-art BNNs.
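The "binarization as kernel grouping" view can be sketched in a few lines of Python. The example assumes 3×3 kernels, so the full codebook has 2^9 = 512 codewords, and it assigns a real-valued kernel to the codeword with the largest inner product. The sub-codebook below is an arbitrary slice chosen for illustration, not a learned selection; the Gumbel-Sinkhorn selection and PSTE from the overview are not implemented, and all names are hypothetical.

```python
from itertools import product


def full_codebook(kernel_size=3):
    """All binary kernels with entries in {-1, +1}, flattened to tuples."""
    return list(product((-1, 1), repeat=kernel_size * kernel_size))


def nearest_codeword(kernel, codebook):
    """Codeword with the largest inner product with the real-valued kernel.

    All codewords have the same norm, so this is also the codeword closest
    to the kernel in Euclidean distance.
    """
    return max(codebook, key=lambda code: sum(w * b for w, b in zip(kernel, code)))


codebook = full_codebook()  # 512 codewords for 3x3 kernels

# A hypothetical real-valued 3x3 kernel, flattened row by row.
kernel = [0.4, -0.2, 0.1, -0.7, 0.3, 0.9, -0.5, 0.6, -0.1]

# With the full codebook, grouping coincides with ordinary sign binarization.
full_match = nearest_codeword(kernel, codebook)

# Arbitrary 8-codeword subset (assumption): stands in for a learned selection.
sub_codebook = codebook[::64]
sub_match = nearest_codeword(kernel, sub_codebook)
```

Restricting kernels to 8 codewords would let each kernel be stored as a 3-bit index instead of 9 bits, which is the kind of saving a compact codebook aims for.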