20,915 Hits in 7.1 sec

Supervised Fine-tuning Evaluation for Long-term Visual Place Recognition [article]

Farid Alijani, Esa Rahtu
2022 arXiv   pre-print
certain thresholds, for the visual place recognition tasks.  ...  in an end-to-end manner for visual place recognition tasks in challenging conditions, including seasonal and illumination variations.  ...  First, the benefit of supervised fine-tuning of DCNN architectures, originally trained for classification problems, is comprehensively studied on two real-world datasets designed for place  ... 
arXiv:2211.07696v1 fatcat:rsfjiwxfjvgudjvppdzwrpi2c4

Learning and Verification of Task Structure in Instructional Videos [article]

Medhini Narasimhan, Licheng Yu, Sean Bell, Ning Zhang, Trevor Darrell
2023 arXiv   pre-print
We also introduce a long-term forecasting benchmark, where the goal is to predict long-range future steps from a given step.  ...  Additionally, we evaluate VideoTaskformer on 3 existing benchmarks -- procedural activity recognition, step classification, and step forecasting -- and demonstrate on each that our method outperforms existing  ...  We would like to thank Suvir Mirchandani for his help with experiments and paper writing.  ... 
arXiv:2303.13519v1 fatcat:z7ptqyv5qnehffxz3obn7o5kca

Learning Effective RGB-D Representations for Scene Recognition

Xinhang Song, Shuqiang Jiang, Luis Herranz, Chengpeng Chen
2019 IEEE Transactions on Image Processing  
Deep convolutional neural networks (CNNs) can achieve impressive results on RGB scene recognition thanks to large datasets such as Places.  ...  Focusing on this scenario, we introduce the ISIA RGB-D video dataset to evaluate RGB-D scene recognition with videos.  ...  In particular, the recurrent neural networks are implemented using Long Short-Term Memory (LSTM) units.  ... 
doi:10.1109/tip.2018.2872629 pmid:30281448 fatcat:lnhv5g46s5dpngmstriivjncgy

SeasonDepth: Cross-Season Monocular Depth Prediction Dataset and Benchmark under Multiple Environments [article]

Hanjiang Hu, Baoquan Yang, Zhijian Qiao, Shiqi Liu, Jiacheng Zhu, Zuxin Liu, Wenhao Ding, Ding Zhao, Hesheng Wang
2023 arXiv   pre-print
We show that long-term monocular depth prediction is still challenging and believe our work can boost further research on the long-term robustness and generalization for outdoor visual perception.  ...  Different environments pose a great challenge to robust outdoor visual perception for long-term autonomous driving, and the generalization of learning-based algorithms on different environments is  ...  Fine-grained segmentation networks: Self-supervised segmentation for improved long-term visual localization.  ... 
arXiv:2011.04408v7 fatcat:2brvlaft6vcjtfhms2f5p2sn3a

LipSound2: Self-Supervised Pre-Training for Lip-to-Speech Reconstruction and Lip Reading [article]

Leyuan Qu, Cornelius Weber, Stefan Wermter
2021 arXiv   pre-print
The aim of this work is to investigate the impact of crossmodal self-supervised pre-training for speech reconstruction (video-to-audio) by leveraging the natural co-occurrence of audio and visual streams  ...  Lastly, we train the cascaded lip reading (video-to-text) system by fine-tuning the generated audio on a pre-trained speech recognition system and achieve state-of-the-art performance on both English  ...  neural networks [43] , long short-term memory [44] , or residual networks [45] .  ... 
arXiv:2112.04748v1 fatcat:nkecrtplr5h3laiwpsd6gxjnqu

Self-Supervised Visual Place Recognition Learning in Mobile Robots [article]

Sudeep Pillai, John Leonard
2019 arXiv   pre-print
that is specifically geared for visual place recognition in mobile robots.  ...  In this work, we develop a self-supervised approach to place recognition in robots.  ...  SELF-SUPERVISED METRIC LEARNING FOR PLACE RECOGNITION  ... 
arXiv:1905.04453v1 fatcat:vvtf7wnbwnbflkrjbr6kckthwa

Self-supervised Visual Feature Learning with Deep Neural Networks: A Survey [article]

Longlong Jing, Yingli Tian
2019 arXiv   pre-print
Next, the main components and evaluation metrics of self-supervised learning methods are reviewed, followed by the commonly used image and video datasets and the existing self-supervised visual feature  ...  Finally, this paper is concluded and lists a set of promising future directions for self-supervised visual feature learning.  ...  In the self-supervised scenario, the self-supervised models are fine-tuned on the dataset to evaluate the quality of the learned video features.  ... 
arXiv:1902.06162v1 fatcat:wwc3nenj3vbybcrd7gx2jytlte

Self-Supervised Domain Calibration and Uncertainty Estimation for Place Recognition [article]

Pierre-Yves Lajoie, Giovanni Beltrame
2022 arXiv   pre-print
Moreover, we leverage the procedure to improve uncertainty estimation for place recognition matches, which is important in safety-critical applications.  ...  Thus, to achieve top performance, it is sometimes necessary to fine-tune the networks to the target environment.  ...  Visual Place Recognition (VPR) remains one of the core problems of autonomous driving and long-term robot localization.  ... 
arXiv:2203.04446v2 fatcat:qik2ig2hrbemdayetroeqtatfi

Towards Long-Form Video Understanding [article]

Chao-Yuan Wu, Philipp Krähenbühl
2021 arXiv   pre-print
We show that existing state-of-the-art short-term models are limited for long-form tasks.  ...  In this paper, we study long-form video understanding. We introduce a framework for modeling long-form videos and develop evaluation protocols on large-scale datasets.  ...  IIS-1845485 and IIS-2006820, and the NSF Institute for Foundations of Machine Learning. Chao-Yuan was supported by the Facebook PhD Fellowship.  ... 
arXiv:2106.11310v1 fatcat:vxw7ugggwbeibd43ubkaavuw3y

Multichannel AV-wav2vec2: A Framework for Learning Multichannel Multi-Modal Speech Representation [article]

Qiushi Zhu, Jie Zhang, Yu Gu, Yuchen Hu, Lirong Dai
2024 arXiv   pre-print
(ASR), visual speech recognition (VSR) and audio-visual speaker diarization (AVSD) tasks.  ...  The efficacy of self-supervised learning for far-field multichannel and multi-modal speech processing has not been well explored.  ...  Fine-tuning and Evaluation After pre-training, we fine-tune the pre-trained model or use the pre-trained model as a feature extractor.  ... 
arXiv:2401.03468v1 fatcat:npsfnnlzsbarhhfz3454sfiimq

Unifying Local and Global Multimodal Features for Place Recognition in Aliased and Low-Texture Environments [article]

Alberto García-Hernández, Riccardo Giubilato, Klaus H. Strobl, Javier Civera, Rudolph Triebel
2024 arXiv   pre-print
Perceptual aliasing and weak textures pose significant challenges to the task of place recognition, hindering the performance of Simultaneous Localization and Mapping (SLAM) systems.  ...  Since our work aims to enhance the reliability of SLAM in all situations, we also explore its performance on the widely used RobotCar dataset, for broader applicability.  ...  Downstream task fine-tuning for place recognition uses a triplet margin loss and batch-hard negative mining, following the MinkLoc approach [33] .  ... 
arXiv:2403.13395v1 fatcat:sdphbxqqs5ee3bzjv6as4cjiwa
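
The snippet above mentions fine-tuning for place recognition with a triplet margin loss, following MinkLoc. As a minimal illustration (omitting the batch-hard negative mining step, and using hypothetical embedding inputs rather than the authors' pipeline), the loss can be sketched as:

```python
import numpy as np

def triplet_margin_loss(anchor, positive, negative, margin=0.3):
    """Triplet margin loss on embedding vectors: push the positive (same
    place) to be closer to the anchor than the negative (different place)
    by at least `margin`. Shapes: (batch, dim). Illustrative sketch only;
    batch-hard mining would pick the hardest positive/negative per batch."""
    d_pos = np.linalg.norm(anchor - positive, axis=-1)  # anchor-positive distance
    d_neg = np.linalg.norm(anchor - negative, axis=-1)  # anchor-negative distance
    return np.maximum(d_pos - d_neg + margin, 0.0).mean()
```

The loss is zero once every negative is at least `margin` farther from the anchor than its positive, so only violating triplets contribute gradients.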

LPT: Long-tailed Prompt Tuning for Image Classification [article]

Bowen Dong, Pan Zhou, Shuicheng Yan, Wangmeng Zuo
2023 arXiv   pre-print
For long-tailed classification, most works pretrain a big model on a large-scale dataset and then fine-tune the whole model to adapt it to long-tailed data.  ...  In phase 1, we train the shared prompt via supervised prompt tuning to adapt a pretrained model to the desired long-tailed domain.  ...  Figure 4: Statistical visualization of prompt matching proportion for classes in Places-LT. Figure 5: LDA visualization of LPT.  ... 
arXiv:2210.01033v2 fatcat:ith2du4wnzgbhhixedrtukvw5m

A Simple Framework for Contrastive Learning of Visual Representations [article]

Ting Chen, Simon Kornblith, Mohammad Norouzi, Geoffrey Hinton
2020 arXiv   pre-print
This paper presents SimCLR: a simple framework for contrastive learning of visual representations.  ...  When fine-tuned on only 1% of the labels, we achieve 85.8% top-5 accuracy, outperforming AlexNet with 100X fewer labels.  ...  We are also grateful for general support from Google Research teams in Toronto and elsewhere.  ... 
arXiv:2002.05709v3 fatcat:faxwkmq2vbcenbvqyi4lqfq4du
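
The SimCLR entry above centers on a contrastive objective over pairs of augmented views. A minimal NumPy sketch of the NT-Xent (normalized temperature-scaled cross-entropy) loss used by SimCLR, assuming precomputed embeddings rather than the paper's full augmentation-and-encoder pipeline, might look like:

```python
import numpy as np

def nt_xent_loss(z, temperature=0.5):
    """NT-Xent loss over a batch of 2N embeddings, where rows 2k and 2k+1
    are the two augmented views of example k. Hedged sketch, not the
    authors' reference implementation."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # L2-normalise embeddings
    sim = z @ z.T / temperature                       # scaled cosine similarities
    n = z.shape[0]
    np.fill_diagonal(sim, -np.inf)                    # exclude self-similarity
    pos = np.arange(n) ^ 1                            # positive pair: 0<->1, 2<->3, ...
    # cross-entropy of each row's positive against all other candidates
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(n), pos].mean()
```

The loss is small when the two views of each example are near-identical in embedding space and large when positives are no closer than the negatives.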

Detecting the Moment of Completion: Temporal Models for Localising Action Completion [article]

Farnoosh Heidarivincheh, Majid Mirmehdi, Dima Damen
2017 arXiv   pre-print
We use a supervised approach, where annotations of pre-completion and post-completion frames are available per action, and fine-tuned CNN features are used to train temporal models.  ...  In this work, we assess the ability of two temporal models, namely Hidden Markov Models (HMM) and Long Short-Term Memory (LSTM), to localise completion for six object interactions: switch, plug, open,  ...  We test handcrafted, pre-trained and fine-tuned CNN features for per-frame representations.  ... 
arXiv:1710.02310v1 fatcat:e3zywl26lvcrxkuzukvoib5jlm

Multimodal Multi-Stream Deep Learning for Egocentric Activity Recognition

Sibo Song, Vijay Chandrasekhar, Bappaditya Mandal, Liyuan Li, Joo-Hwee Lim, Giduthuri Sateesh Babu, Phyo Phyo San, Ngai-Man Cheung
2016 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)  
Second, we propose a multi-stream Long Short-Term Memory architecture to learn the features from multiple sensor streams (accelerometer, gyroscope, etc.).  ...  In this paper, we propose a multimodal multi-stream deep learning framework to tackle the egocentric activity recognition problem, using both the video and sensor data.  ...  [1, 4] introduce Recurrent Neural Networks or Long Short-Term Memory to take advantage of long-term temporal information, since most of the existing ConvNets are incapable of capturing long-term sequential  ... 
doi:10.1109/cvprw.2016.54 dblp:conf/cvpr/SongCMLLBSC16 fatcat:myz44dahlbavvio5syy4nlopby
Showing results 1 — 15 out of 20,915 results