A Dataset and Baselines for Visual Question Answering on Art

Garcia, Noa; Ye, Chentao; Liu, Zihua; Hu, Qingtao; Otani, Mayu; Chu, Chenhui; Nakashima, Yuta; Mitamura, Teruko

doi:10.1007/978-3-030-66096-3_8

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12536)

Included in the following conference series:

European Conference on Computer Vision

Abstract

Answering questions related to art pieces (paintings) is a difficult task, as it implies the understanding of not only the visual information that is shown in the picture, but also the contextual knowledge that is acquired through the study of the history of art. In this work, we introduce our first attempt towards building a new dataset, coined AQUA (Art QUestion Answering). The question-answer (QA) pairs are automatically generated using state-of-the-art question generation methods based on paintings and comments provided in an existing art understanding dataset. The QA pairs are cleansed by crowdsourcing workers with respect to their grammatical correctness, answerability, and answers’ correctness. Our dataset inherently consists of visual (painting-based) and knowledge (comment-based) questions. We also present a two-branch model as baseline, where the visual and knowledge questions are handled independently. We extensively compare our baseline model against the state-of-the-art models for question answering, and we provide a comprehensive study about the challenges and potential future directions for visual question answering on art.
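The two-branch idea can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract only states that visual and knowledge questions are handled independently, so the routing rule and both branch functions below are hypothetical placeholders, as are all names (`Sample`, `make_two_branch`, `toy_is_visual`, and so on).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Sample:
    question: str
    painting_id: str  # hypothetical identifier standing in for the image
    comment: str      # textual comment associated with the painting


def make_two_branch(
    is_visual: Callable[[str], bool],
    visual_branch: Callable[[Sample], str],
    knowledge_branch: Callable[[Sample], str],
) -> Callable[[Sample], str]:
    """Build an answerer that sends each question to exactly one branch."""

    def answer(sample: Sample) -> str:
        # The two branches never interact: a question is answered either
        # from the painting or from the comment, depending on the router.
        if is_visual(sample.question):
            return visual_branch(sample)
        return knowledge_branch(sample)

    return answer


# Toy stand-ins (assumptions for illustration only, not trained models).
def toy_is_visual(question: str) -> bool:
    # Placeholder rule; a real system would more likely learn this decision.
    return question.lower().startswith(("what color", "how many"))


def toy_visual_branch(sample: Sample) -> str:
    return f"visual:{sample.painting_id}"


def toy_knowledge_branch(sample: Sample) -> str:
    return f"knowledge:{len(sample.comment.split())}"


answerer = make_two_branch(toy_is_visual, toy_visual_branch, toy_knowledge_branch)
```

Passing the router and the branches in as plain callables keeps them independent of each other, so either branch could be replaced (for example by a VQA model or a text reading-comprehension model) without touching the other.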

Notes

1. https://github.com/noagarcia/ArtVQA
2. The code is reproduced by ourselves, and we confirmed a similar performance to that of the original paper.
3. https://aws.amazon.com/rekognition/
4. https://github.com/facebookresearch/pythia
5. http://www.mturk.com
6. https://www.nltk.org/
7. We used XLNet instead of BERT as XLNet shows better performance on the popular Stanford question answering dataset (SQuAD2.0).

Acknowledgment

This work was partly supported by JSPS KAKENHI Nos. 18H03264 and 20K19822, and JST ACT-I.

Author information

Authors and Affiliations

Osaka University, Suita, Japan
Noa Garcia, Chenhui Chu & Yuta Nakashima
Carnegie Mellon University, Pittsburgh, USA
Chentao Ye, Zihua Liu, Qingtao Hu & Teruko Mitamura
CyberAgent, Inc., Tokyo, Japan
Mayu Otani

Corresponding author

Correspondence to Noa Garcia.

Editor information

Editors and Affiliations

University of Clermont Auvergne, Clermont Ferrand, France
Adrien Bartoli
Università degli Studi di Udine, Udine, Italy
Andrea Fusiello

About this paper

Cite this paper

Garcia, N. et al. (2020). A Dataset and Baselines for Visual Question Answering on Art. In: Bartoli, A., Fusiello, A. (eds) Computer Vision – ECCV 2020 Workshops. ECCV 2020. Lecture Notes in Computer Science, vol 12536. Springer, Cham. https://doi.org/10.1007/978-3-030-66096-3_8

DOI: https://doi.org/10.1007/978-3-030-66096-3_8
Published: 03 January 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-66095-6
Online ISBN: 978-3-030-66096-3
eBook Packages: Computer Science, Computer Science (R0)
