Knowledge-Constrained Answer Generation for Open-Ended Video Question Answering

Yao Jin; Guocheng Niu; Xinyan Xiao; Jian Zhang; Xi Peng; Jun Yu

doi:10.1609/aaai.v37i7.25983

Authors

Yao Jin Hangzhou Dianzi University
Guocheng Niu Baidu Inc.
Xinyan Xiao Baidu Inc.
Jian Zhang Zhejiang International Studies University
Xi Peng College of Computer Science, Sichuan University
Jun Yu Hangzhou Dianzi University

Keywords:

ML: Multi-Instance/Multi-View Learning, ML: Multimodal Learning

Abstract

Open-ended video question answering (open-ended VideoQA) aims to understand video content and question semantics in order to generate correct answers. Most of the best-performing models treat the problem as a discriminative multi-label classification task. In real-world scenarios, however, it is difficult to define a candidate set that includes all possible answers. In this paper, we propose a Knowledge-constrained Generative VideoQA Algorithm (KcGA) with an encoder-decoder pipeline, which enables out-of-domain answer generation through an adaptive external knowledge module and a multi-stream information control mechanism. We use ClipBERT to extract video-question features. To form the external knowledge features, we extract frame-wise object-level external knowledge from a commonsense knowledge base and compute context-aware episode memory units via an attention-based GRU. We then use the multi-stream information control mechanism to fuse the video-question and external knowledge features so that they complement and align with each other semantically. We evaluate our model on two open-ended benchmark datasets to demonstrate that it can effectively and robustly generate high-quality answers without being restricted by the training data.
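
The abstract does not spell out the attention or fusion equations, so the following is only a toy illustration of the general idea of attending over per-frame knowledge vectors and then gating two feature streams together. It is not the KcGA architecture: the function names (`attend`, `gated_fuse`), the dot-product scoring, the scalar sigmoid gate, and the two-dimensional feature vectors are all assumptions made for this sketch.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, facts):
    """Soft attention (assumed dot-product scoring): weight each
    knowledge vector by its similarity to the query vector."""
    weights = softmax([dot(query, f) for f in facts])
    dim = len(facts[0])
    summary = [sum(w * f[i] for w, f in zip(weights, facts)) for i in range(dim)]
    return weights, summary

def gated_fuse(video_question, knowledge, gate_weights, gate_bias=0.0):
    """Hypothetical scalar gate: a sigmoid over the concatenated streams
    decides how much of each stream enters the fused vector."""
    z = dot(gate_weights, video_question + knowledge) + gate_bias
    g = 1.0 / (1.0 + math.exp(-z))
    fused = [g * v + (1.0 - g) * k for v, k in zip(video_question, knowledge)]
    return g, fused

# Toy inputs: one video-question vector and two per-frame knowledge vectors.
vq = [1.0, 0.0]
facts = [[1.0, 0.0], [0.0, 1.0]]
weights, summary = attend(vq, facts)

# Zero gate weights give a gate of exactly 0.5, i.e. an even blend.
g, fused = gated_fuse(vq, summary, gate_weights=[0.0, 0.0, 0.0, 0.0])
```

In this toy setup the first knowledge vector receives the larger attention weight because it matches the query, and the fused vector is an even blend of the two streams. A learned model would replace the fixed gate weights with trained parameters and would typically use a vector-valued gate.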
