Overview

In this challenge, we present a novel but challenging task: video understanding in a multi-modal noisy environment. While image-based object detection and recognition have improved significantly in the last decade, the same progress has not been replicated in the video domain. Unlike images, videos have proven difficult to understand due to the added complexity of the temporal dimension and the complicated sequences of motion they contain. One of the major obstacles to learning better techniques for understanding complex actions in videos is the sheer lack of large-scale annotated data. Large annotated datasets such as ImageNet and Open Images have propelled research in image understanding. Doing the same for videos is hindered by the massive cost of annotating millions of videos, which is why only a handful of large-scale video datasets have been produced, such as Kinetics 400/600/700, AVA, and YouTube-8M. An alternative to annotating such large video datasets is to accumulate data from the web using specific search queries. However, this automatic annotation comes at the cost of variable noise in the data and annotations. As such, there is an ever-growing need for better techniques for video action understanding that can learn from such noisy datasets.

Task Descriptions

In this grand challenge we propose three different tasks.

Task 1: Multi-modal Noisy Learning for Action Recognition

The first task is to use the noisy action dataset for video action understanding. Participants can use all available modalities of the data (raw video and metadata) to perform noisy visual representation learning. The goal is to develop robust learning methods that produce meaningful visual representations for video action recognition, and participants are free to use any combination of the modalities made available to them. Participants must first pre-train their methods on the noisy action dataset and then transfer this learning to a smaller target dataset; we will use UCF-101 as the target dataset in this challenge. Participants can use the UCF-101 dataset itself for fine-tuning the pre-trained models, and the final evaluation will be done on the UCF-101 test set. This transfer-learning setup demonstrates how well the learned visual representations generalize and how suitable the noisy dataset is for visual understanding tasks. Two splits, 25K and 100K, are provided for algorithmic development and training at scale; a minimal sketch of the pre-train/fine-tune protocol follows the split descriptions below.

Mini-set task This sub-task is intended for quick algorithm development and faster experimentation. It contains only 25K videos covering about 200 action classes.

Full-set task This sub-task focuses on large-scale algorithm development: participants train their methods on 100K videos, which allows them to verify their approaches on a wider distribution.
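To make the Task 1 protocol concrete, below is a minimal sketch in PyTorch. It is only illustrative: the toy backbone, the loss choices (multi-label BCE on the weak concept labels during pre-training, cross-entropy on UCF-101 during fine-tuning), and all shapes are our assumptions, not an official baseline.

```python
# Minimal sketch of the Task 1 pre-train/fine-tune protocol (assumptions:
# PyTorch, a toy 3D-conv backbone; not the official baseline).
import torch
import torch.nn as nn

class VideoBackbone(nn.Module):
    """Toy 3D-conv encoder standing in for any video backbone (e.g. R(2+1)D)."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.conv = nn.Conv3d(3, 16, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool3d(1)
        self.proj = nn.Linear(16, feat_dim)

    def forward(self, clips):                      # clips: (B, 3, T, H, W)
        x = self.pool(torch.relu(self.conv(clips))).flatten(1)
        return self.proj(x)                        # (B, feat_dim)

def pretrain_step(backbone, concept_head, clips, noisy_labels, opt):
    """One pre-training step on the noisy set: multi-label BCE over weak
    concept labels (noisy_labels is a float multi-hot tensor)."""
    logits = concept_head(backbone(clips))
    loss = nn.functional.binary_cross_entropy_with_logits(logits, noisy_labels)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

def finetune_step(backbone, classifier, clips, labels, opt):
    """One fine-tuning step on the target dataset (UCF-101), single-label CE."""
    logits = classifier(backbone(clips))
    loss = nn.functional.cross_entropy(logits, labels)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```

In this sketch the same backbone is reused across both stages; only the head is swapped (e.g. `nn.Linear(512, num_concepts)` during pre-training and `nn.Linear(512, 101)` for UCF-101 fine-tuning).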

Task 2: Multi-modal Noisy Learning for Video Retrieval

In this task, participants will be required to train models to perform text-to-video retrieval. Participants can use all available modalities of the data (raw video and metadata) to perform noisy visual representation learning. The goal is to develop robust learning methods that produce meaningful representations for text-to-video retrieval, and participants are free to use any combination of the modalities made available to them. Representation quality will be tested by fine-tuning on the MSR-VTT dataset, and the final evaluation will be done on the MSR-VTT test set. As in the first task, two different splits will be available to participants.
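As an illustration of the retrieval setup, the sketch below ranks candidate videos for a text query by cosine similarity between embeddings. The encoders that produce those embeddings are left abstract; this formulation is an assumption on our part, not a required approach.

```python
# Sketch of text-to-video retrieval scoring (assumption: text/video embeddings
# have already been computed by some encoders; not the official baseline).
import torch

def retrieve(text_emb, video_embs):
    """Rank videos for one text query by cosine similarity (best first)."""
    text_emb = torch.nn.functional.normalize(text_emb, dim=-1)      # (D,)
    video_embs = torch.nn.functional.normalize(video_embs, dim=-1)  # (N, D)
    sims = video_embs @ text_emb                                    # (N,)
    return torch.argsort(sims, descending=True)                     # video indices
```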

Task 3: Multi-modal Noisy Video Understanding

We invite researchers and participants to demonstrate the effectiveness of learning from multi-modal noisy datasets on video understanding tasks other than those mentioned above. This task will not count towards the leaderboard statistics; however, novel and outstanding approaches will receive a special mention. Participants can use any split of the provided dataset and should submit a paper, which will be reviewed.

About The Dataset

We are releasing three splits, Full, 100K, and 25K, along with their metadata.

Note:

- The metadata file includes the following information: (weak) class labels (concepts), title, description, comments, and tags.
- Around 39,000 of the videos have geo-location information.
- The samples are multi-labeled, with an average of 6.825 labels per video; all examples together cover roughly 7,000 labels.
- Each sample has on average 11.97 tags, with about 1.4 million tags in the whole dataset. The average number of comments per video is 5.57.
- The 25K split covers around 5,000 action labels. It is intended for researchers with limited computational resources, or for benchmarking methods before running on the bigger splits.
- Task 1 will be evaluated on the test set of UCF-101, and Task 2 on the test set of MSR-VTT.
- We will also provide features for the videos in the 25K and 100K splits, extracted using an R(2+1)D-d network trained on the Kinetics dataset.
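For illustration only, a single metadata record might look like the hypothetical example below; the actual file format and field names may differ from what is assumed here.

```python
# Hypothetical metadata record, purely to illustrate the fields listed above;
# the real file format, field names, and values may differ.
sample = {
    "video_id": "video_000001",          # placeholder identifier
    "concepts": ["running", "outdoor"],  # weak multi-labels (avg. ~6.8 per video)
    "title": "Morning run in the park",
    "description": "Quick jog before work.",
    "comments": ["Nice pace!", "Which park is this?"],   # avg. ~5.6 per video
    "tags": ["run", "fitness", "park"],  # avg. ~12 tags per video
    "geo": {"lat": 28.6, "lon": -81.2},  # present for ~39,000 videos only
}
```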

Evaluation

For Task 1, we will use accuracy, i.e., whether the ground-truth label matches the predicted class. For Task 2, we will use recall at rank K (R@K), median rank (MdR), and mean rank (MnR). R@K is the percentage of queries for which the correct result is found in the top K retrieved results. MdR is the median of the rank of the first relevant result, and MnR is the mean of the rank of the first relevant result.
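The retrieval metrics can be computed directly from the rank at which each query's ground-truth video appears. The small sketch below assumes 1-based ranks as input and is only an illustrative implementation of the definitions above, not the official evaluation script.

```python
# Illustrative computation of R@K, MdR, and MnR from 1-based ground-truth ranks.
import statistics

def recall_at_k(ranks, k):
    """Percentage of queries whose correct video appears in the top K results."""
    return 100.0 * sum(r <= k for r in ranks) / len(ranks)

def median_rank(ranks):
    return statistics.median(ranks)

def mean_rank(ranks):
    return statistics.mean(ranks)

# Example: ranks of the ground-truth video for five queries.
ranks = [1, 3, 12, 2, 7]
print(recall_at_k(ranks, 5), median_rank(ranks), mean_rank(ranks))  # 80.0 3 5.0
```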

Important dates

All deadlines are at midnight (23:59), anywhere on Earth.

Instructions

We use the same formatting template as ACM Multimedia 2021. Submissions may range from 4 to 8 pages, plus additional pages for references. There is no distinction between long and short papers; all papers will undergo the same review process and review period. All contributions must be submitted through CMT. Submission links for papers will be available soon.

Leaderboard

The submission links will be up soon.

Organizers

Mubarak Shah
University of Central Florida (UCF)

Mohan S Kankanhalli
National University of Singapore

Shin’ichi Satoh
National Institute of Informatics

Yogesh Rawat
University of Central Florida (UCF)

Rajiv Ratn Shah
Indraprastha Institute of Information Technology Delhi

Roger Zimmermann
National University of Singapore

Shruti Vyas
Center for Research in Computer Vision, University of Central Florida

Mohit Sharma
Indraprastha Institute of Information Technology Delhi

Aayush Rana
Center for Research in Computer Vision, University of Central Florida


Volunteers

Raj Aryaman Patra
National Institute of Technology Rourkela

Harshal Desai
National Institute of Technology Jamshedpur