Abstract
Answering questions about paintings is difficult: it requires understanding not only the visual content of the picture but also the contextual knowledge acquired by studying the history of art. In this work, we present a first step towards a new dataset, named AQUA (Art QUestion Answering). The question-answer (QA) pairs are generated automatically by applying state-of-the-art question generation methods to the paintings and comments of an existing art understanding dataset. Crowdsourcing workers then filter the QA pairs for grammatical correctness, answerability, and correctness of the answer. By construction, the dataset contains both visual (painting-based) and knowledge (comment-based) questions. We also present a two-branch baseline model in which visual and knowledge questions are handled independently. We compare this baseline extensively against state-of-the-art question answering models and provide a comprehensive study of the challenges and potential future directions for visual question answering on art.
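As a rough illustration of what handling visual and knowledge questions independently could look like, the sketch below routes each question to one of two branches. It is not the model described in the paper: the abstract does not say how questions are routed, so the keyword-based `is_visual` check, the `Sample` fields, and the two stub branches are all hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Sample:
    question: str
    painting_id: str  # hypothetical identifier for the painting image
    comment: str      # textual comment associated with the painting


# Hypothetical cue words; a real system would more likely use a trained classifier.
VISUAL_CUES = ("color", "colour", "how many", "wearing", "holding")


def is_visual(question: str) -> bool:
    """Guess whether a question can be answered from the painting alone."""
    q = question.lower()
    return any(cue in q for cue in VISUAL_CUES)


def answer(
    sample: Sample,
    visual_branch: Callable[[str, str], str],
    knowledge_branch: Callable[[str, str], str],
) -> str:
    """Send the question to exactly one branch; the branches never interact."""
    if is_visual(sample.question):
        return visual_branch(sample.question, sample.painting_id)
    return knowledge_branch(sample.question, sample.comment)


# Stub branches standing in for a VQA model and a text QA model.
def visual_stub(question: str, painting_id: str) -> str:
    return "visual"


def knowledge_stub(question: str, comment: str) -> str:
    return "knowledge"
```

With these stubs, a question such as "What color is the dress?" goes to the visual branch, while "Who commissioned this painting?" goes to the knowledge branch. Either stub can be replaced by a real model without touching the other.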
Notes
- 2. We reproduced the code ourselves and confirmed performance similar to that reported in the original paper.
- 7. We used XLNet instead of BERT because XLNet performs better on the popular Stanford Question Answering Dataset (SQuAD 2.0).
Acknowledgment
This work was partly supported by JSPS KAKENHI Nos. 18H03264 and 20K19822, and JST ACT-I.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Garcia, N. et al. (2020). A Dataset and Baselines for Visual Question Answering on Art. In: Bartoli, A., Fusiello, A. (eds) Computer Vision – ECCV 2020 Workshops. ECCV 2020. Lecture Notes in Computer Science, vol 12536. Springer, Cham. https://doi.org/10.1007/978-3-030-66096-3_8
DOI: https://doi.org/10.1007/978-3-030-66096-3_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-66095-6
Online ISBN: 978-3-030-66096-3
eBook Packages: Computer Science, Computer Science (R0)