Afouras, Owens, Chung, Zisserman, 2020. Self-Supervised Learning of Audio-Visual Objects from Video.