Overview

In this challenge, we present a novel but challenging task of video understanding in a multi- modal noisy environment. While image based object detection and recognition has improved significantly in the last decade, the same has not been replicated in the video domain. Unlike images, understanding videos has proven to be difficult due to the added complexities of extra dimension and complicated sequence of motions. One of the major obstacle in learning better techniques that can understand complex actions in videos is the sheer lack of large scale annotated data. Large annotated datasets such as ImageNet and Open Image have propelled research in image understanding. The issue with doing this for videos is the massive cost of annotating millions of videos, which is why only a handful of large scale video datasets have been produced such as Kinetics 400/600/700, AVA and Youtube8M. An alternative to annotating such large video datasets is to accumulate the data from the web using specific search queries. However, this automatic annotation comes at a cost of variable noise in the data and annotation. As such, there is an ever growing need to generate better techniques for video action understanding based on such noisy datasets.

Task Descriptions

In this grand challenge we propose three different tasks.

The first task is to use the noisy action dataset to perform video action understanding. For this task, participants can use all available modalities of the data (raw video data and the meta data) to perform noisy visual representation learning. The goal of this task is to generate robust learning methods that can generate meaningful visual representations for video action recognition. The participants are free to utilize any combination of the modalities made available to them. In this task, the participants will have to pre-train their methods using the noisy action dataset, and then transfer this learning to a smaller target dataset. We will use UCF-101 dataset as target in this challenge. The participants can use the UCF-101 dataset itself for fine-tuning the pre-trained models. The final evaluation will be done on the UCF-101 test set. The transfer learning will demonstrate how generalized the visual understanding technique is and how reasonable the noisy dataset is for visual understanding tasks. There will be two sets for algorithmic development and training at scale, 25k and 100k splits.

Mini-set task This sub-task is for quick algorithm development and faster testing for the participants. This sub-task will only contain 25K videos with about 200 action classes.

Full-set task This sub-task is focused on a large scale algorithm development where the participants will train their methods on 100K videos. This task will allow participants to verify their approach on a wider distribution.

In this task, the participants will be required train models to perform text to video retrieval. For this task, participants can use all available modalities of the data (raw video data and the meta data) to perform noisy visual representation learning. The goal of this task is to generate robust learning methods that can generate meaningful visual representations for text to video retrieval. The participants are free to utilize any combination of the modalities made available to them. The participants will finally test the representation quality by fine tuning on the MSR-VTT dataset. The final evaluation will be done on its test set. Similar to the first task, there will be two different splits which the participants can utilize.

We invite researchers and participants to demonstrate effectiveness of learning from multi- modal noisy datasets for tasks in video understanding other than those mentioned above. This task will not be counted towards the leaderboard statistics, however novel and extraordinary approaches will receive a special mention. Participants can use any split of the dataset provided. For this task, participants can submit a paper which will be reviewed.

About The Dataset

We are releasing three splits: Full, 100k and 25k, along with its meta data.

Full Split: Metadata, Tags
100k Split: Metadata, Tags, Features
25k Split: Metadata, Tags, Features

Note:

To obtain the videos for a particular split, the participants will have to fill the following request form.
By filling up the request form, the participants acknowledge that these videos are a property of the original uploaders and the organizers of the challenge, in no way own these videos. Moreover, these videos can only be used for research purposes.
For the full split of 2.4M, participants will have to download the videos themselves from Flickr.
Tags json file contains tags associated with each video, as the main metadata file only contains tag IDs.
Files are compressed using bzip2.

The metadata file will include the following meta information: (Weak) class labels (Concepts), title, description, comments and tags. Around 39,000 of these videos have geo-location information. The samples will have multi-labels with average number of labels per video as 6.825 and all examples cover roughly 7000 labels. Each sample also has on an average 11.97 tags assigned to it, with the total number of tags being 1.4 million in the whole dataset. The average number of comments per video is 5.57. The 25K split covers around 5000 action labels. This split is intended for interested researchers with limited computational resources, or to benchmark your methods before running on the bigger split. The evaluation of task 1 will be done on the test set of UCF-101 and the task 2 will be evaluated on the test set of MSR-VTT. We will also provide features for videos in the 25K and the 100K split, extracted using R(2+1)D-d network trained on the Kinetics dataset.

Evaluation

For task 1, we will use accuracy, which is essentially whether the ground truth label matches the predicted class. For task 2, we will use recall at rank K (R@K), median rank (M dR) and mean rank(MnR). For R@K, we look at the percentage of samples for which the correct result for found in the top K results. M dR is the median of the rank of the first relevant result. Similarly, MnR is the mean of the rank of the first relevant result.

Important dates

Challenge Starts: August, 2021
Evaluation Starts: September, 2021
Paper Submission: 1st October, 2021
Notification to Authors: 15th October, 2021

All deadlines are at midnight(23:59) anywhere on Earth.

Instructions

We use the same formatting template as ACM Multimedia 2021. Submissions can be of varying length from 4 to 8 pages, plus additional pages for the reference pages. There is no distinction between long and short papers. All papers will undergo the same review process and review period. All contributions must be submitted through CMT. Submission links for papers will be up soon.