This is the official repo for our ACM Multimedia 2022 paper AVQA: A Dataset for Audio-Visual Question Answering on Videos.
Dataset website: https://mn.cs.tsinghua.edu.cn/avqa. Download links: Baidu Netdisk, OneDrive.
AVQA is an audio-visual question answering dataset for multimodal understanding of audio-visual objects and activities in real-life videos. AVQA provides diverse sets of questions specially designed to require both audio and visual information, involving various relationships between objects and activities.
We collect 57,015 videos of daily audio-visual activities and 57,335 specially designed question-answer pairs whose answers rely on clues from both the audio and visual modalities. More detailed information is listed on the Dataset Website.
- ./data: data directory;
- ./preprocess: code and scripts for data preprocessing and feature extraction;
- ./HAVF: our proposed HAVF model to reproduce the results.
- Clone this repo: `git clone git@github.com:AlyssaYoung/AVQA.git`
- Download data

You can download the raw videos and extract features according to your needs, or directly use the features we provide. More detailed information on downloading the data can be found in the Downloads section of the Dataset Website.
- Data preprocessing and feature extraction
Extract audio waveforms. We provide a shell script, `extract_audio.sh`, for extracting audio waveforms. Fix the directory path of the raw videos in the script file and run the commands:

    cd data
    mkdir audio
    cd ..
    bash preprocess/extract_audio.sh

The extracted wav files are stored under `./data/audio`.
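The exact contents of `extract_audio.sh` depend on your directory layout, but conceptually it walks the raw-video folder and converts each video's audio track to wav. Below is a minimal Python sketch of that idea using ffmpeg; the paths, the mono channel, and the 32 kHz sample rate are illustrative assumptions, not necessarily the script's settings:

```python
# Minimal sketch: dump a mono wav track from every video with ffmpeg.
# Paths and the 32 kHz sample rate are assumptions; adjust to match
# extract_audio.sh and your raw-video directory.
import subprocess
from pathlib import Path

video_dir = Path("data/raw_videos")   # placeholder for the raw-video folder
audio_dir = Path("data/audio")
audio_dir.mkdir(parents=True, exist_ok=True)

for video in sorted(video_dir.glob("*.mp4")):
    wav_path = audio_dir / (video.stem + ".wav")
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(video),
         "-vn",          # drop the video stream
         "-ac", "1",     # mono
         "-ar", "32000", # 32 kHz, a common rate for PANNs-style audio models
         str(wav_path)],
        check=True,
    )
```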
Extract audio features. We provide the script `extract_audio_feat.sh` for audio feature extraction. Create a new Python virtual environment and run the following commands:

    cd preprocess/preprocess_audio/
    conda create -n preprocess_audio python=3.7
    conda activate preprocess_audio
    pip install -r requirements.txt
    sh extract_audio_feat.sh
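The resulting audio feature is a single 2048-d vector per video (see the table below), extracted with PANNs. If you want to prototype outside the provided scripts, a hedged sketch with the `panns_inference` package looks roughly like this; the wav path is a placeholder and this is not necessarily the exact pipeline in `extract_audio_feat.sh`:

```python
# Hedged sketch of a clip-level PANNs embedding; not the exact code in
# extract_audio_feat.sh. Requires: pip install panns-inference librosa
import librosa
from panns_inference import AudioTagging

# Load one extracted wav file (path is a placeholder).
waveform, _ = librosa.load("data/audio/example.wav", sr=32000, mono=True)
waveform = waveform[None, :]                     # (batch, samples)

# AudioTagging downloads a pretrained CNN14 checkpoint by default.
model = AudioTagging(checkpoint_path=None, device="cpu")
clipwise_output, embedding = model.inference(waveform)

print(embedding.shape)                           # (1, 2048) per audio clip
```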
Extract visual frames. Each video is segmented into 8 clips, and each clip contains 16 frames by default (following the setting in HCRN). Step-by-step instructions:
- Create a new Python virtual environment and run the following commands:

      cd preprocess/preprocess_visual/
      sh create_virtualenv.sh
- Extract appearance features: fix the file paths in `extract_appearance_feat.sh` and run (see the shape-level sketch after this list):

      sh extract_appearance_feat.sh
- Extract motion features: download the pretrained ResNeXt-101 model (resnext-101-kinetics.pth), fix the file-path arguments in `extract_motion_feat.sh`, and run:

      sh extract_motion_feat.sh
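To make the expected outputs concrete, here is a shape-level sketch of the appearance branch: 8 clips × 16 frames per video, with each frame encoded by an ImageNet ResNet-101 into a 2048-d vector. Frame decoding, sampling, and normalization follow the HCRN preprocessing code, so treat the snippet as an illustration under assumed inputs rather than the repo's implementation; the motion branch analogously feeds each 16-frame clip through the 3-D ResNeXt-101 (resnext-101-kinetics.pth) and pools it to one 2048-d vector per clip.

```python
# Shape-level sketch of appearance features, assuming frames are already
# decoded into a uniform tensor; the real scripts handle video decoding,
# sampling, and normalization themselves.
import torch
import torchvision.models as models

NUM_CLIPS, FRAMES_PER_CLIP = 8, 16

# Stand-in for the sampled RGB frames of one video:
# (clips, frames, channels, height, width)
frames = torch.randn(NUM_CLIPS, FRAMES_PER_CLIP, 3, 224, 224)

# ResNet-101 with the classification head removed -> 2048-d per frame.
resnet = models.resnet101(weights=models.ResNet101_Weights.IMAGENET1K_V1)
backbone = torch.nn.Sequential(*list(resnet.children())[:-1]).eval()

with torch.no_grad():
    flat = frames.view(-1, 3, 224, 224)          # (8*16, 3, 224, 224)
    feats = backbone(flat).flatten(1)            # (8*16, 2048)

appearance = feats.view(NUM_CLIPS, FRAMES_PER_CLIP, 2048)
print(appearance.shape)  # torch.Size([8, 16, 2048]); stacking videos gives (#num_videos, 8, 16, 2048)
# The motion branch runs a 3-D ResNeXt-101 over each 16-frame clip and
# pools to a single 2048-d vector, giving (8, 2048) per video.
```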
Preprocess questions.
- Download the GloVe pretrained 300d word vectors to `data/glove/` and process them into a pickle file (a sketch of this conversion follows this list):

      cd data/glove
      python txt2pickle.py
      cd ../..
- Preprocess train/val questions: fix the file paths in `preprocess_text_feat.sh` and run:

      cd preprocess/preprocess_text
      sh preprocess_text_feat.sh
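For reference, converting the GloVe text file into a pickle and embedding a question can be as simple as the sketch below. The file name `glove.840B.300d.txt`, the output pickle name, and the whitespace tokenization are illustrative assumptions; `txt2pickle.py` and `preprocess_text_feat.sh` may differ in details.

```python
# Hedged sketch of GloVe handling; not necessarily what txt2pickle.py does.
import pickle
import numpy as np

# Parse "word v1 v2 ... v300" lines into a {word: vector} dict.
glove = {}
with open("data/glove/glove.840B.300d.txt", "r", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        if len(parts) <= 300:
            continue                              # skip malformed lines
        word = " ".join(parts[:-300])             # tolerate keys containing spaces
        glove[word] = np.asarray(parts[-300:], dtype=np.float32)

with open("data/glove/glove_300d.pkl", "wb") as f:
    pickle.dump(glove, f)

# Embedding a question then amounts to tokenizing and stacking word vectors,
# with out-of-vocabulary words mapped to zeros.
question = "What is making the sound in the video?"
tokens = question.lower().rstrip("?").split()
embedded = np.stack([glove.get(t, np.zeros(300, dtype=np.float32)) for t in tokens])
print(embedded.shape)                             # (num_tokens, 300)
```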
Finally, the dimensions of the extracted features are as follows:
| Feature | Dimension |
| --- | --- |
| Audio features | (#num_videos, 2048) |
| Appearance features | (#num_videos, 8, 16, 2048) |
| Motion features | (#num_videos, 8, 2048) |
Note: You can also develop data preprocessing and feature extraction methods in your own original and innovative ways. Here we just provide one possible way to utilize the audio and visual data :)
- HCRN+HAVF: Please refer to HAVF/hcrn_havf/README.md.
- PSAC+HAVF
- HME+HAVF
- LADNet+HAVF
- ACRTransformer+HAVF
- HGA+HAVF
- Anaconda3
- Pip
To improve code readability, we have recently rebuilt our code. You may encounter some bugs or notice performance differences compared with the results reported in the paper. Please feel free to contact us if you have any questions or suggestions. Both issues and emails (pinci_yang@outlook.com) are welcome.
If you find our paper or code useful, please cite our paper using the following bibtex:
@inproceedings{yang2022avqa,
title={AVQA: A Dataset for Audio-Visual Question Answering on Videos},
author={Yang, Pinci and Wang, Xin and Duan, Xuguang and Chen, Hong and Hou, Runze and Jin, Cong and Zhu, Wenwu},
booktitle={Proceedings of the 30th ACM International Conference on Multimedia},
pages={3480--3491},
year={2022}
}
- For audio feature extraction, we adapt PANNs from this repo. Thanks to @qiuqiangkong for releasing the code and the pretrained models.
- We refer to this repo to preprocess visual frames and extract appearance and motion features. Thanks to @thaolmk54 for this excellent work.
- In this work, we conduct our experiments on six VideoQA backbone models. The original repositories are listed here: