Title: Self-supervised video pretraining yields human-aligned visual representations
Authors: Nikhil Parthasarathy, S. M. Ali Eslami, João Carreira, Olivier J. Hénaff
Journal / Conference: Preprint
Background and Objectives:
The paper asks how a neural network can learn to represent objects in a self-supervised manner that aligns more closely with human perception (for the definition of alignment used here, see the Methods section below). The authors propose that self-supervised contrastive learning on video data can achieve more human-like object perception than previous pretraining methods based on static images.
Dataset: "[...] we hypothesized that collecting a minimally-curated video dataset matched to the rough properties of ImageNet would be beneficial for learning a more general visual model from videos." To validate this hypothesis, the authors developed a data curation pipeline (VideoNet) to selectively filter online videos. The aim was to obtain video training data that more accurately reflects the distribution of categories found in ImageNet.
Self-supervised contrastive learning: As in other contrastive SSL algorithms, the definition of positive and negative pairs is crucial. Positive pairs are frames sampled from the same 2.56-second video clip within VideoNet, while negative pairs are frames drawn from different clips. In addition to the natural temporal augmentation that comes from sampling frames across a clip, the authors apply a series of standard image augmentations. They also introduce a multi-scale contrastive attention pooling mechanism that aggregates spatial features within each view before the contrastive comparison.
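The core of this setup can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the single-query attention pooling below is a simplifying assumption (VITO's pooling is multi-scale), and the loss is written as a generic InfoNCE objective where frames from the same clip sit on the diagonal as positives and all other rows act as negatives.

```python
import numpy as np

def attention_pool(features, w):
    """Pool a spatial feature map into one vector via learned attention.

    features: (num_patches, dim) spatial features from one view.
    w: (dim,) attention query vector (hypothetical single-head form;
       the paper's pooling is multi-scale, not shown here).
    """
    scores = features @ w
    alphas = np.exp(scores - scores.max())   # numerically stable softmax
    alphas /= alphas.sum()
    return alphas @ features                 # attention-weighted average, (dim,)

def info_nce(z1, z2, temperature=0.1):
    """Generic InfoNCE loss over a batch of pooled embeddings.

    z1, z2: (batch, dim); z1[i] and z2[i] are two views (frames) from
    the same clip (positive pair), other rows serve as negatives.
    """
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)   # cosine similarity
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature                      # (batch, batch)
    logits = logits - logits.max(axis=1, keepdims=True)   # stability shift
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))   # positives lie on the diagonal
```

With orthogonal embeddings, matched pairs yield a near-zero loss while shuffled (mismatched) pairs yield a large one, which is the behavior the contrastive objective rewards.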
The proposed model, VITO, demonstrates superior generalization to other tasks compared to existing video pretrained models, and it stands competitive with models based on static images.
VITO is found to be more resilient to distribution shifts, a key criterion used to measure its alignment with human vision. This robustness is demonstrated on two benchmarks (ImageNet-Vid-Robust and ImageNet-A), where VITO outperforms other models by a considerable margin, particularly under the most challenging condition (IN-Vid pm10).
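For context on the pm10 condition: as I understand the ImageNet-Vid-Robust protocol, a prediction on an anchor frame only counts as correct under pm-k if it also holds on every frame within k frames of the anchor, so pm10 is the strictest setting reported. A minimal sketch of that scoring rule (function and variable names are illustrative):

```python
def pm_k_correct(preds_by_frame, anchor_idx, label, k=10):
    """pm-k scoring sketch: the anchor prediction counts as correct only
    if the correct label is predicted on every frame within +/- k frames
    of the anchor (my reading of the ImageNet-Vid-Robust protocol).

    preds_by_frame: per-frame predicted labels for one video snippet.
    """
    lo = max(0, anchor_idx - k)
    hi = min(len(preds_by_frame), anchor_idx + k + 1)
    return all(pred == label for pred in preds_by_frame[lo:hi])
```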
VITO also exhibits a closer resemblance to the spatial pattern of human attention during object recognition, even surpassing Harmonized, a model specifically trained to emulate human attention maps.
Evaluations conducted using a subset of the dataset proposed by Geirhos et al. (2021) show that VITO displays a significantly greater bias towards shape than other models in the comparison. In contrast, R3M, another contrastive learning model that uses video-language alignment on the Ego4D dataset, performs significantly worse, highlighting the significance of the training dataset.
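The shape-bias metric used in this kind of evaluation (following Geirhos et al.) is the fraction of shape-consistent decisions among all decisions that match either cue in a shape/texture cue-conflict image; trials matching neither cue are excluded. A minimal sketch (function name is illustrative):

```python
def shape_bias(decisions):
    """Shape bias as defined by Geirhos et al.: among cue-conflict images
    where the model's prediction matched either the shape or the texture
    cue, the fraction that matched shape.

    decisions: list of 'shape', 'texture', or 'other' labels, one per
    cue-conflict image; 'other' trials are ignored.
    """
    shape = decisions.count('shape')
    texture = decisions.count('texture')
    return shape / (shape + texture)
```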
Ablation experiments confirm that all of the proposed components contribute to the results, including the content of the training data, the natural temporal augmentation in the videos, and the multi-scale attention pooling. At least in terms of generalization to semantic segmentation (the PASCAL benchmark), the dataset content has the largest effect.
Robustness to distributional shifts, the ability to generalize to new tasks, and alignment with human cognition (as gauged by attention map alignments and shape bias) may all be attainable through the appropriate pre-training paradigm. This paper posits that both the natural temporal augmentations found within videos and the specific content of the training data play significant roles in achieving these attributes.