Supervised Fine-tuning Evaluation for Long-term Visual Place Recognition
[article]
2022
arXiv
pre-print
certain thresholds, for the visual place recognition tasks. ...
in an end-to-end manner for the visual place recognition task in challenging conditions, including seasonal and illumination variations. ...
First, the effectiveness of supervised fine-tuning of DCNN architectures, originally trained purely for classification problems, is comprehensively studied on two real-world datasets designed for place ...
arXiv:2211.07696v1
fatcat:rsfjiwxfjvgudjvppdzwrpi2c4
Learning and Verification of Task Structure in Instructional Videos
[article]
2023
arXiv
pre-print
We also introduce a long-term forecasting benchmark, where the goal is to predict long-range future steps from a given step. ...
Additionally, we evaluate VideoTaskformer on 3 existing benchmarks -- procedural activity recognition, step classification, and step forecasting -- and demonstrate on each that our method outperforms existing ...
We would like to thank Suvir Mirchandani for his help with experiments and paper writing. ...
arXiv:2303.13519v1
fatcat:z7ptqyv5qnehffxz3obn7o5kca
Learning Effective RGB-D Representations for Scene Recognition
2019
IEEE Transactions on Image Processing
Deep convolutional networks (CNN) can achieve impressive results on RGB scene recognition thanks to large datasets such as Places. ...
Focusing on this scenario, we introduce the ISIA RGB-D video dataset to evaluate RGB-D scene recognition with videos. ...
Particularly, the recurrent neural networks are implemented using Long Short-Term Memory (LSTM) units. ...
doi:10.1109/tip.2018.2872629
pmid:30281448
fatcat:lnhv5g46s5dpngmstriivjncgy
SeasonDepth: Cross-Season Monocular Depth Prediction Dataset and Benchmark under Multiple Environments
[article]
2023
arXiv
pre-print
We show that long-term monocular depth prediction is still challenging and believe our work can boost further research on the long-term robustness and generalization for outdoor visual perception. ...
Different environments pose a great challenge to the outdoor robust visual perception for long-term autonomous driving, and the generalization of learning-based algorithms on different environments is ...
Fine-grained segmentation networks: Self-supervised segmentation for improved long-term visual localization. ...
arXiv:2011.04408v7
fatcat:2brvlaft6vcjtfhms2f5p2sn3a
LipSound2: Self-Supervised Pre-Training for Lip-to-Speech Reconstruction and Lip Reading
[article]
2021
arXiv
pre-print
The aim of this work is to investigate the impact of crossmodal self-supervised pre-training for speech reconstruction (video-to-audio) by leveraging the natural co-occurrence of audio and visual streams ...
Lastly, we train the cascaded lip reading (video-to-text) system by fine-tuning the generated audios on a pre-trained speech recognition system and achieve state-of-the-art performance on both English ...
neural networks [43], long short-term memory [44], or residual networks [45]. ...
arXiv:2112.04748v1
fatcat:nkecrtplr5h3laiwpsd6gxjnqu
Self-Supervised Visual Place Recognition Learning in Mobile Robots
[article]
2019
arXiv
pre-print
that is specifically geared for visual place recognition in mobile robots. ...
In this work, we develop a self-supervised approach to place recognition in robots. ...
SELF-SUPERVISED METRIC LEARNING FOR PLACE RECOGNITION A. ...
arXiv:1905.04453v1
fatcat:vvtf7wnbwnbflkrjbr6kckthwa
Self-supervised Visual Feature Learning with Deep Neural Networks: A Survey
[article]
2019
arXiv
pre-print
Next, the main components and evaluation metrics of self-supervised learning methods are reviewed followed by the commonly used image and video datasets and the existing self-supervised visual feature ...
At last, this paper is concluded and lists a set of promising future directions for self-supervised visual feature learning. ...
In the self-supervised scenario, the self-supervised models are fine-tuned on the dataset to evaluate the quality of the learned video features. ...
arXiv:1902.06162v1
fatcat:wwc3nenj3vbybcrd7gx2jytlte
Self-Supervised Domain Calibration and Uncertainty Estimation for Place Recognition
[article]
2022
arXiv
pre-print
Moreover, we leverage the procedure to improve uncertainty estimation for place recognition matches which is important in safety critical applications. ...
Thus, to achieve top performance, it is sometimes necessary to fine-tune the networks to the target environment. ...
I Visual Place Recognition (VPR) remains one of the core problems of autonomous driving and long-term robot localization. ...
arXiv:2203.04446v2
fatcat:qik2ig2hrbemdayetroeqtatfi
Towards Long-Form Video Understanding
[article]
2021
arXiv
pre-print
We show that existing state-of-the-art short-term models are limited for long-form tasks. ...
In this paper, we study long-form video understanding. We introduce a framework for modeling long-form videos and develop evaluation protocols on large-scale datasets. ...
IIS-1845485 and IIS-2006820, and the NSF Institute for Foundations of Machine Learning. Chao-Yuan was supported by the Facebook PhD Fellowship. ...
arXiv:2106.11310v1
fatcat:vxw7ugggwbeibd43ubkaavuw3y
Multichannel AV-wav2vec2: A Framework for Learning Multichannel Multi-Modal Speech Representation
[article]
2024
arXiv
pre-print
(ASR), visual speech recognition (VSR) and audio-visual speaker diarization (AVSD) tasks. ...
The efficacy of self-supervised learning for far-field multichannel and multi-modal speech processing has not been well explored. ...
Fine-tuning and Evaluation After pre-training, we fine-tune the pre-trained model or use the pre-trained model as a feature extractor. ...
arXiv:2401.03468v1
fatcat:npsfnnlzsbarhhfz3454sfiimq
Unifying Local and Global Multimodal Features for Place Recognition in Aliased and Low-Texture Environments
[article]
2024
arXiv
pre-print
Perceptual aliasing and weak textures pose significant challenges to the task of place recognition, hindering the performance of Simultaneous Localization and Mapping (SLAM) systems. ...
Since our work aims to enhance the reliability of SLAM in all situations, we also explore its performance on the widely used RobotCar dataset, for broader applicability. ...
Downstream task fine-tuning for place recognition uses a triplet margin loss and batch-hard negative mining, following the MinkLoc approach [33] . ...
arXiv:2403.13395v1
fatcat:sdphbxqqs5ee3bzjv6as4cjiwa
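The snippet above mentions fine-tuning for place recognition with a triplet margin loss and batch-hard negative mining, following MinkLoc. As a hedged illustration only (a NumPy sketch, not the paper's or MinkLoc's actual implementation; the function name and margin value are assumptions), the batch-hard variant can be written as:

```python
import numpy as np

def batch_hard_triplet_loss(embeddings, labels, margin=0.2):
    """Batch-hard triplet margin loss sketch: for each anchor, pick the
    farthest same-label sample (hardest positive) and the closest
    different-label sample (hardest negative), then apply the margin."""
    # pairwise Euclidean distances between all embeddings
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    dists = np.sqrt((diff ** 2).sum(-1) + 1e-12)
    same = labels[:, None] == labels[None, :]
    losses = []
    for i in range(len(labels)):
        pos_mask = same[i].copy()
        pos_mask[i] = False          # an anchor is not its own positive
        neg_mask = ~same[i]
        if not pos_mask.any() or not neg_mask.any():
            continue                 # no valid triplet for this anchor
        hardest_pos = dists[i][pos_mask].max()
        hardest_neg = dists[i][neg_mask].min()
        losses.append(max(hardest_pos - hardest_neg + margin, 0.0))
    return float(np.mean(losses)) if losses else 0.0
```

When the classes are well separated by more than the margin, every per-anchor term clips to zero, which is the intended fine-tuning signal.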
LPT: Long-tailed Prompt Tuning for Image Classification
[article]
2023
arXiv
pre-print
For long-tailed classification, most works often pretrain a big model on a large-scale dataset, and then fine-tune the whole model for adapting to long-tailed data. ...
In phase 1, we train the shared prompt via supervised prompt tuning to adapt a pretrained model to the desired long-tailed domain. ...
Figure 4: Visualization of prompt-matching proportion statistics for classes in Places-LT.
Figure 5: LDA visualization of LPT. ...
arXiv:2210.01033v2
fatcat:ith2du4wnzgbhhixedrtukvw5m
A Simple Framework for Contrastive Learning of Visual Representations
[article]
2020
arXiv
pre-print
This paper presents SimCLR: a simple framework for contrastive learning of visual representations. ...
When fine-tuned on only 1% of the labels, we achieve 85.8% top-5 accuracy, outperforming AlexNet with 100X fewer labels. ...
We are also grateful for general support from Google Research teams in Toronto and elsewhere. ...
arXiv:2002.05709v3
fatcat:faxwkmq2vbcenbvqyi4lqfq4du
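The SimCLR entry above describes contrastive learning over augmented views. A minimal NumPy sketch of its NT-Xent (normalized temperature-scaled cross-entropy) objective might look like the following; this is an illustrative reconstruction, not the paper's reference code, and the function name and default temperature are assumptions:

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent loss sketch for two augmented views z1, z2 of the same
    batch: each sample's positive is its other view; all remaining
    samples in the doubled batch act as negatives."""
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # L2-normalize
    n = len(z1)
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)   # exclude self-similarity
    # positive pairs: row i matches row i+n, and vice versa
    targets = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    pos = sim[np.arange(2 * n), targets]
    return float(np.mean(logsumexp - pos))
```

Minimizing this pulls the two views of each sample together while pushing apart all other samples in the batch, which is the core mechanism behind the linear-evaluation and 1%-label fine-tuning results quoted in the snippet.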
Detecting the Moment of Completion: Temporal Models for Localising Action Completion
[article]
2017
arXiv
pre-print
We use a supervised approach, where annotations of pre-completion and post-completion frames are available per action, and fine-tuned CNN features are used to train temporal models. ...
In this work, we assess the ability of two temporal models, namely Hidden Markov Models (HMM) and Long Short-Term Memory (LSTM), to localise completion for six object interactions: switch, plug, open, ...
We test handcrafted, pre-trained and fine-tuned CNN features for per-frame representations. ...
arXiv:1710.02310v1
fatcat:e3zywl26lvcrxkuzukvoib5jlm
Multimodal Multi-Stream Deep Learning for Egocentric Activity Recognition
2016
2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
Second, we propose a multistream Long Short-Term Memory architecture to learn the features from multiple sensor streams (accelerometer, gyroscope, etc.). ...
In this paper, we propose a multimodal multi-stream deep learning framework to tackle the egocentric activity recognition problem, using both the video and sensor data. ...
[1, 4] introduce Recurrent Neural Networks or Long Short-Term Memory to take advantage of long-term temporal information since most of the existing ConvNets are incapable of capturing long-term sequential ...
doi:10.1109/cvprw.2016.54
dblp:conf/cvpr/SongCMLLBSC16
fatcat:myz44dahlbavvio5syy4nlopby