Overview
Task Descriptions
In this grand challenge we propose three different tasks.
Task 1: Multi-modal Noisy Learning for Action Recognition
The first task is to use the noisy action dataset for video action understanding. For this task, participants can use all available modalities of the data (raw video data and the metadata) to perform noisy visual representation learning. The goal is to develop robust learning methods that produce meaningful visual representations for video action recognition. Participants are free to utilize any combination of the modalities made available to them. They will have to pre-train their methods on the noisy action dataset and then transfer this learning to a smaller target dataset; in this challenge, the target is UCF-101. Participants may use the UCF-101 dataset itself for fine-tuning the pre-trained models, and the final evaluation will be done on the UCF-101 test set. This transfer-learning setup will demonstrate how well the visual understanding technique generalizes and how suitable the noisy dataset is for visual understanding tasks. There will be two splits, of 25k and 100k videos, for algorithmic development and training at scale (a fine-tuning sketch is given after the split descriptions below).
Mini-set task: This sub-task is intended for quick algorithm development and faster testing by the participants. It contains only 25K videos covering about 200 action classes.
Full-set task: This sub-task is focused on large-scale algorithm development, where participants will train their methods on 100K videos. It will allow participants to verify their approach on a wider distribution.
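To make the transfer-learning protocol concrete, here is a minimal sketch of the fine-tuning stage in PyTorch. It is only an illustration, not a prescribed baseline: the choice of torchvision's r3d_18 backbone, the hypothetical checkpoint name, and the dummy batch stand in for whatever pre-trained model and UCF-101 data pipeline a participant actually uses.

```python
# Minimal transfer-learning sketch for Task 1 (PyTorch; all names here are illustrative).
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

NUM_UCF101_CLASSES = 101

# 1) Backbone. In the challenge this would first be pre-trained on the noisy split;
#    loading such a checkpoint would look like the commented lines below
#    ("noisy_pretrain.pth" is a hypothetical file name).
model = r3d_18()
# state = torch.load("noisy_pretrain.pth", map_location="cpu")
# model.load_state_dict(state, strict=False)  # ignore the pre-training head

# 2) Replace the classification head for the 101 UCF-101 classes and fine-tune.
model.fc = nn.Linear(model.fc.in_features, NUM_UCF101_CLASSES)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

# 3) One illustrative training step on a dummy clip batch
#    (a real UCF-101 loader would supply batches of shape B x C x T x H x W).
clips = torch.randn(2, 3, 16, 112, 112)
labels = torch.randint(0, NUM_UCF101_CLASSES, (2,))

model.train()
loss = criterion(model(clips), labels)
loss.backward()
optimizer.step()
```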
Task 2: Multi-modal Noisy Learning for Video Retrieval
In this task, participants will be required to train models to perform text-to-video retrieval. For this task, participants can use all available modalities of the data (raw video data and the metadata) to perform noisy visual representation learning. The goal is to develop robust learning methods that produce meaningful visual representations for text-to-video retrieval. Participants are free to utilize any combination of the modalities made available to them. They will finally test the representation quality by fine-tuning on the MSR-VTT dataset, and the final evaluation will be done on its test set. Similar to the first task, there will be two different splits which participants can utilize.
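One common way to train such retrieval models is a dual-encoder setup optimized with a CLIP-style symmetric contrastive loss over paired clip and caption embeddings. The sketch below shows that loss only; it is one possible approach, not a method mandated by the challenge, and the video and text encoders producing the embeddings are omitted.

```python
# Sketch of a CLIP-style symmetric contrastive loss for text-video retrieval training.
# This is only one common choice; the challenge does not prescribe it.
import torch
import torch.nn.functional as F

def contrastive_loss(video_emb, text_emb, temperature=0.07):
    """video_emb, text_emb: (B, D) embeddings of paired clips and captions."""
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(video_emb.size(0))          # matching pairs lie on the diagonal
    loss_v2t = F.cross_entropy(logits, targets)        # retrieve text given video
    loss_t2v = F.cross_entropy(logits.t(), targets)    # retrieve video given text
    return 0.5 * (loss_v2t + loss_t2v)

# Dummy usage: 4 paired video/caption embeddings of dimension 256.
video_emb = torch.randn(4, 256)
text_emb = torch.randn(4, 256)
print(contrastive_loss(video_emb, text_emb).item())
```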
Task 3: Multi-modal Noisy Video Understanding
About The Dataset
We are releasing three splits: Full, 100k, and 25k, along with their metadata.
Note:
- To obtain the videos for a particular split, the participants will have to fill out the following request form.
- By filling out the request form, the participants acknowledge that these videos are the property of the original uploaders and that the organizers of the challenge in no way own these videos. Moreover, these videos may only be used for research purposes.
- For the full split of 2.4M videos, participants will have to download the videos themselves from Flickr.
- The tags JSON file contains the tags associated with each video, as the main metadata file only contains tag IDs (a small reading sketch is given after this list).
- Files are compressed using bzip2.
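For reference, bzip2-compressed metadata can be read directly with Python's standard library. The file names and field names in the sketch below ("tags.json.bz2", "metadata.json.bz2", "tag_ids", "video_id") are illustrative placeholders, not the actual schema of the release.

```python
# Sketch: reading bzip2-compressed metadata and resolving tag IDs to tag strings.
# File names and JSON field names here are placeholders, not the real schema.
import bz2
import json

with bz2.open("tags.json.bz2", "rt", encoding="utf-8") as f:
    tags = json.load(f)          # assumed mapping: tag ID -> tag string

with bz2.open("metadata.json.bz2", "rt", encoding="utf-8") as f:
    metadata = json.load(f)      # assumed list of per-video records with "tag_ids"

for record in metadata[:5]:
    resolved = [tags.get(str(t), "<unknown>") for t in record.get("tag_ids", [])]
    print(record.get("video_id"), resolved)
```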
Evaluation
For task 1, we will use accuracy, i.e., whether the ground-truth label matches the predicted class. For task 2, we will use recall at rank K (R@K), median rank (MdR), and mean rank (MnR). For R@K, we look at the percentage of samples for which the correct result is found in the top K results. MdR is the median of the rank of the first relevant result. Similarly, MnR is the mean of the rank of the first relevant result.
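The sketch below shows one way to compute R@K, MdR, and MnR from a text-to-video similarity matrix in which query i is paired with video i. It follows the definitions above but is not the official evaluation script.

```python
# Sketch: R@K, median rank (MdR) and mean rank (MnR) from a similarity matrix
# where row i is a text query whose ground-truth video is column i.
# Mirrors the metric definitions above; not the official evaluation code.
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10)):
    order = np.argsort(-sim, axis=1)                    # videos sorted by decreasing score
    gt = np.arange(sim.shape[0])[:, None]
    ranks = np.argmax(order == gt, axis=1) + 1          # 1-based rank of the correct video
    metrics = {f"R@{k}": float(np.mean(ranks <= k) * 100) for k in ks}
    metrics["MdR"] = float(np.median(ranks))
    metrics["MnR"] = float(np.mean(ranks))
    return metrics

# Dummy usage with 100 queries and 100 candidate videos.
sim = np.random.randn(100, 100)
print(retrieval_metrics(sim))
```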
Important dates
- Challenge Starts: August, 2021
- Evaluation Starts: September, 2021
- Paper Submission: 1st October, 2021
- Notification to Authors: 15th October, 2021
All deadlines are at midnight (23:59) anywhere on Earth.
Instructions
Leaderboard
The submission links will be up soon.
Organizers