6,138 Hits in 5.9 sec

Semi Supervised Meta Learning for Spatiotemporal Learning [article]

Faraz Waseem, Pratyush Muthukumar
2023 arXiv   pre-print
Specifically, we first experiment with applying a pre-trained MAE and fine-tuning on our small-scale spatiotemporal dataset for video reconstruction tasks.  ...  Next, we experiment with training an MAE encoder and applying a classification head for action classification tasks.  ...  That is, we scale down the vision transformer (ViT) backbone within the existing representation learning architecture for training on our custom small-scale video dataset.  ... 
arXiv:2308.01916v1 fatcat:wfe3o2otfzhbbjlmuxgwbjfydq
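
The entry above describes attaching a classification head to an MAE-style encoder for action classification. A minimal sketch of that fine-tuning setup, with a stand-in encoder and invented sizes (dim, num_classes, and token counts are placeholders, not the paper's):

    import torch
    import torch.nn as nn

    dim, num_classes = 128, 10                 # placeholder sizes
    encoder = nn.TransformerEncoder(           # stand-in for a pre-trained MAE encoder
        nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
    head = nn.Linear(dim, num_classes)         # classification head added on top

    clip_tokens = torch.randn(4, 196, dim)     # stand-in tokens from a video clip
    logits = head(encoder(clip_tokens).mean(dim=1))  # mean-pool tokens, then classify
    labels = torch.randint(0, num_classes, (4,))
    loss = nn.functional.cross_entropy(logits, labels)
    loss.backward()                            # fine-tunes encoder and head jointly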

Advancing Plain Vision Transformer Towards Remote Sensing Foundation Model [article]

Di Wang, Qiming Zhang, Yufei Xu, Jing Zhang, Bo Du, Dacheng Tao, Liangpei Zhang
2022 arXiv   pre-print
Large-scale vision foundation models have made significant progress in visual tasks on natural images, with vision transformers being the primary choice due to their good scalability and representation  ...  In this paper, we resort to plain vision transformers with about 100 million parameters and make the first attempt to propose large vision models tailored to RS tasks and investigate how such large models  ...  Vision transformers have also experienced rapid development towards large-scale foundation models thanks to their great potential in scalability and structure flexibility, e.g., the model size can be easily  ... 
arXiv:2208.03987v4 fatcat:4skyb2ytbbffvbk7njlgwmsf2a

Mimic before Reconstruct: Enhancing Masked Autoencoders with Feature Mimicking [article]

Peng Gao, Renrui Zhang, Rongyao Fang, Ziyi Lin, Hongyang Li, Hongsheng Li, Qiao Yu
2023 arXiv   pre-print
Masked Autoencoders (MAE) have been popular paradigms for large-scale vision representation pre-training.  ...  On ImageNet-1K, the MR-MAE base pre-trained for only 400 epochs achieves 85.8% top-1 accuracy after fine-tuning, surpassing the 1600-epoch MAE base by +2.2% and the previous state-of-the-art BEiT V2 base  ...  MCMAE [25], UM-MAE [26], MixMIM [27] and GreenMIM [28] explore efficient and effective MIM frameworks with hierarchical vision transformers [29, 30, 31, 32].  ...
arXiv:2303.05475v1 fatcat:du6oqf4orzc5dmbgew6rl2ihgq
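
As a rough illustration of the "mimic before reconstruct" idea named in this entry: visible tokens regress a frozen teacher's features while masked tokens are reconstructed, and the two losses are summed. The teacher choice, tensor shapes, and loss weight below are assumptions for illustration, not MR-MAE's actual configuration:

    import torch

    enc_visible = torch.randn(8, 49, 128)     # encoder outputs for visible tokens
    teacher_feats = torch.randn(8, 49, 128)   # frozen teacher features, same patches (assumed)
    pred_pixels = torch.randn(8, 147, 768)    # decoder predictions for masked patches
    true_pixels = torch.randn(8, 147, 768)    # ground-truth pixels for those patches

    mimic_loss = ((enc_visible - teacher_feats) ** 2).mean()  # feature mimicking (visible)
    recon_loss = ((pred_pixels - true_pixels) ** 2).mean()    # pixel reconstruction (masked)
    loss = recon_loss + 1.0 * mimic_loss                      # weight is a placeholder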

A Survey on Masked Autoencoder for Self-supervised Learning in Vision and Beyond [article]

Chaoning Zhang, Chenshuang Zhang, Junha Song, John Seon Keun Yi, Kang Zhang, In So Kweon
2022 arXiv   pre-print
Masked autoencoders are scalable vision learners, as the title of MAE suggests, and self-supervised learning (SSL) in vision might undertake a similar trajectory as in NLP.  ...  As a milestone to bridge the gap with BERT in NLP, the masked autoencoder has attracted unprecedented attention for SSL in vision and beyond.  ...  Towards improving efficiency: despite the impressive performance, a significant bottleneck of masked autoencoders for visual SSL is that they require large amounts of computation.  ...
arXiv:2208.00173v1 fatcat:d2bxvpzcabg3lei4mcnsts5wqe

One for All: Toward Unified Foundation Models for Earth Vision [article]

Zhitong Xiong, Yi Wang, Fahong Zhang, Xiao Xiang Zhu
2024 arXiv   pre-print
Foundation models characterized by extensive parameters and trained on large-scale datasets have demonstrated remarkable efficacy across various downstream tasks for remote sensing data.  ...  Then the backbone model can be used in different downstream tasks, thus forging a path towards a unified foundation backbone model in Earth vision.  ...  Characterized by their extensive parameters and pre-trained on large-scale datasets, these models have greatly enhanced the performance on different downstream tasks.  ... 
arXiv:2401.07527v2 fatcat:h3sutjvognfmxlnu6ipee45dfq
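
The entry above points toward one backbone serving many remote sensing modalities. A hedged sketch of that shape: per-modality patch-embedding stems feeding a single shared Transformer. Modality names and channel counts below are invented for illustration:

    import torch
    import torch.nn as nn

    dim = 128
    shared_backbone = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)

    # One patch-embedding stem per data modality (channel counts are examples).
    stems = nn.ModuleDict({
        "optical": nn.Conv2d(3, dim, kernel_size=16, stride=16),   # RGB imagery
        "sar":     nn.Conv2d(2, dim, kernel_size=16, stride=16),   # dual-pol SAR
    })

    def forward(modality, image):
        tokens = stems[modality](image).flatten(2).transpose(1, 2)  # (B, N, dim)
        return shared_backbone(tokens)

    feats = forward("optical", torch.randn(2, 3, 224, 224))
    print(feats.shape)  # torch.Size([2, 196, 128])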

Masked Autoencoder for Self-Supervised Pre-training on Lidar Point Clouds [article]

Georg Hess, Johan Jaxing, Elias Svensson, David Hagerman, Christoffer Petersson, Lennart Svensson
2022 arXiv   pre-print
Code available at https://github.com/georghess/voxel-mae  ...  Masked autoencoding has become a successful pretraining paradigm for Transformer models for text, images, and, recently, point clouds.  ...  Experiments on a large-scale automotive dataset show that Voxel-MAE learns useful point cloud representations from raw lidar point clouds.  ... 
arXiv:2207.00531v2 fatcat:r7vvgaqt4jhq3nzqb4rdrtn6ta
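
A hedged sketch of the masked-voxel idea: discretize a lidar sweep into voxels, then hide a large fraction of the non-empty ones so a model can be trained to reconstruct them. The grid size and mask ratio are invented here; see the linked voxel-mae repository for the authors' actual pipeline:

    import numpy as np

    rng = np.random.default_rng(0)
    points = rng.uniform(-50, 50, size=(20000, 3))     # stand-in lidar sweep (x, y, z)

    voxel_size = 0.5
    coords = np.floor(points / voxel_size).astype(np.int64)
    occupied = np.unique(coords, axis=0)               # non-empty voxel coordinates

    mask_ratio = 0.7                                   # hide 70% of occupied voxels
    masked_idx = rng.choice(len(occupied), size=int(len(occupied) * mask_ratio),
                            replace=False)
    visible = np.delete(occupied, masked_idx, axis=0)

    # A Transformer encoder would embed only `visible` voxels; the pre-training
    # task is to reconstruct occupancy/points inside the masked voxels.
    print(len(occupied), "occupied voxels,", len(visible), "visible after masking")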

Fast Training of Diffusion Models with Masked Transformers [article]

Hongkai Zheng, Weili Nie, Arash Vahdat, Anima Anandkumar
2024 arXiv   pre-print
We propose an efficient approach to train large diffusion models with masked transformers.  ...  Thus, our method shows a promising way of efficiently training large transformer-based diffusion models without sacrificing the generative performance.  ...  This opens up new opportunities for efficient training of large transformer-based diffusion models.  ... 
arXiv:2306.09305v2 fatcat:2dalxgb25refzgdigs6jxo4ghy
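
A hedged sketch of why masking helps here: if the denoising backbone only processes a random subset of patch tokens at each step, the quadratic attention cost drops accordingly. The toy noise schedule, model, and 50% mask ratio below are placeholders, not the paper's design:

    import torch
    import torch.nn as nn

    patches, dim = 64, 128
    backbone = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
    head = nn.Linear(dim, dim)                 # predicts the noise per visible patch

    x0 = torch.randn(8, patches, dim)          # clean latents, one token per patch
    noise = torch.randn_like(x0)
    alpha = 0.7                                # placeholder for a real noise schedule
    xt = alpha ** 0.5 * x0 + (1 - alpha) ** 0.5 * noise

    keep = torch.rand(patches).argsort()[: patches // 2]  # drop half the tokens
    pred = head(backbone(xt[:, keep]))                    # backbone sees visible tokens only
    loss = ((pred - noise[:, keep]) ** 2).mean()          # denoising loss on visible tokens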

Masked Autoencoders Are Scalable Vision Learners [article]

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick
2021 arXiv   pre-print
This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision.  ...  Coupling these two designs enables us to train large models efficiently and effectively: we accelerate training (by 3x or more) and improve accuracy.  ...
arXiv:2111.06377v3 fatcat:4d7762easfdcniz4jvqedqizqy
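
The "two designs" this snippet refers to are the high masking ratio and the asymmetric encoder-decoder: the encoder sees only the visible patches, and a light decoder reconstructs the rest. A minimal sketch of that recipe, following the paper's public description but with toy sizes:

    import torch
    import torch.nn as nn

    def random_masking(x, mask_ratio=0.75):
        """Keep a random subset of patch tokens; return kept tokens, a binary
        mask (1 = masked), and indices to restore the original order."""
        n, l, d = x.shape
        len_keep = int(l * (1 - mask_ratio))
        ids_shuffle = torch.argsort(torch.rand(n, l), dim=1)  # random permutation
        ids_restore = torch.argsort(ids_shuffle, dim=1)
        ids_keep = ids_shuffle[:, :len_keep]
        x_kept = torch.gather(x, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d))
        mask = torch.ones(n, l)
        mask[:, :len_keep] = 0
        mask = torch.gather(mask, 1, ids_restore)
        return x_kept, mask, ids_restore

    dim, patch_dim = 128, 16 * 16 * 3
    proj = nn.Linear(patch_dim, dim)
    encoder = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
    decoder = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=1)
    pixel_head = nn.Linear(dim, patch_dim)
    mask_token = nn.Parameter(torch.zeros(1, 1, dim))

    patches = torch.randn(8, 196, patch_dim)        # a batch of 14x14 patch grids
    x_kept, mask, ids_restore = random_masking(proj(patches))
    latent = encoder(x_kept)                        # encoder sees only ~25% of tokens
    filler = mask_token.expand(8, 196 - latent.shape[1], dim)
    full = torch.gather(torch.cat([latent, filler], dim=1), 1,
                        ids_restore.unsqueeze(-1).expand(-1, -1, dim))
    pred = pixel_head(decoder(full))                # reconstruct pixels per patch
    loss = (((pred - patches) ** 2).mean(-1) * mask).sum() / mask.sum()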

Surface Analysis with Vision Transformers [article]

Simon Dahan, Logan Z. J. Williams, Abdulah Fawaz, Daniel Rueckert, Emma C. Robinson
2022 arXiv   pre-print
Code available at https://github.com/metrics-lab/surface-vision-transformers  ...  Recent state-of-the-art performance of Vision Transformers (ViTs) demonstrates that a general-purpose architecture, which implements self-attention, could replace the local feature learning operations  ...  To mitigate the lack of inductive biases in the architecture, transformers typically require large training datasets or efficient (pre-)training strategies [1, 15, 49, 51].  ...
arXiv:2205.15836v1 fatcat:b3zoftc7kzazxp7bj2zbhcqafe

Masked Image Residual Learning for Scaling Deeper Vision Transformers [article]

Guoxi Huang, Hongtao Fu, Adrian G. Bors
2023 arXiv   pre-print
Deeper Vision Transformers (ViTs) are more challenging to train. We expose a degradation problem in deeper layers of ViT when using masked image modeling (MIM) for pre-training.  ...  The deeper ViT-S-54, costing 3× less than ViT-Large, achieves performance on par with ViT-Large.  ...  Table 1: Details of Vision Transformer scaling along the depth dimension.  ...
arXiv:2309.14136v3 fatcat:7kqnbdz22fc5vjvjhqi7m4swqu

Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language [article]

Alexei Baevski, Arun Babu, Wei-Ning Hsu, Michael Auli
2022 arXiv   pre-print
Current self-supervised learning algorithms are often modality-specific and require large amounts of computational resources.  ...  To address these issues, we increase the training efficiency of data2vec, a learning objective that generalizes across several modalities.  ...  In computer vision, there has been a shift towards Vision Transformer architectures (ViT; Dosovitskiy et al. 2020) and masked prediction methods that can be very efficient by not encoding masked patches  ...
arXiv:2212.07525v1 fatcat:5hgg3nndb5e43bkndkvu5d2zzm
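
A hedged sketch of the teacher-student scheme behind data2vec: an exponential-moving-average (EMA) copy of the model produces contextualized targets from the full input, and the student regresses them from a masked view. The sizes, the crude every-other-token mask, and the momentum value are illustrative:

    import copy
    import torch
    import torch.nn as nn

    dim = 128
    student = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
    teacher = copy.deepcopy(student)            # EMA teacher, never trained directly
    for p in teacher.parameters():
        p.requires_grad_(False)

    def ema_update(teacher, student, tau=0.999):
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(tau).add_(ps.detach(), alpha=1 - tau)

    tokens = torch.randn(4, 32, dim)
    with torch.no_grad():
        targets = teacher(tokens)               # contextualized targets, unmasked input
    masked = tokens.clone()
    masked[:, ::2] = 0                          # crude mask of every other token
    loss = ((student(masked) - targets) ** 2).mean()
    loss.backward()
    ema_update(teacher, student)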

Denoising Diffusion Autoencoders are Unified Self-supervised Learners [article]

Weilai Xiang, Hongyu Yang, Di Huang, Yunhong Wang
2023 arXiv   pre-print
Transfer learning from ImageNet also confirms the suitability of DDAE for Vision Transformers, suggesting the potential to scale DDAEs as unified foundation models.  ...  Table 3 and Table 4 show that the scaled DiT-XL/2 outperforms the smaller MAE ViT-B/16 under all settings by large margins except for linear probing on CIFAR-10.  ...  However, DDAE is not as efficient as SimCLR on this dataset: a slightly larger ResNet-50 can surpass our linear probe result with fewer parameters.  ...
arXiv:2303.09769v2 fatcat:yg7pimlk5ne3fi4ingzfxmofbu
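
The linear-probing comparisons this entry cites follow the standard protocol: freeze the pre-trained network, treat activations from a chosen layer (and, for diffusion models, a chosen timestep) as features, and train only a linear classifier on top. A generic sketch with stand-in features and placeholder sizes:

    import torch
    import torch.nn as nn

    feat_dim, num_classes = 512, 10
    frozen_features = torch.randn(1024, feat_dim)   # stand-in for diffusion activations
    labels = torch.randint(0, num_classes, (1024,))

    probe = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.SGD(probe.parameters(), lr=0.1)
    for _ in range(100):                            # train only the linear probe
        opt.zero_grad()
        loss = nn.functional.cross_entropy(probe(frozen_features), labels)
        loss.backward()
        opt.step()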

MAE-DFER: Efficient Masked Autoencoder for Self-supervised Dynamic Facial Expression Recognition

Licai Sun, Zheng Lian, Bin Liu, Jianhua Tao
2023 Proceedings of the 31st ACM International Conference on Multimedia  
Since the vanilla Vision Transformer (ViT) employed in VideoMAE requires substantial computation during fine-tuning, MAE-DFER develops an efficient local-global interaction Transformer (LGI-Former) as  ...  (e.g., VideoMAE), this paper proposes MAE-DFER, a novel self-supervised method which leverages large-scale self-supervised pre-training on abundant unlabeled data to largely advance the development of DFER.  ...  This is largely attributed to the limited training samples in current DFER datasets, since large vision Transformers are data-hungry and training them typically requires more than million-level labeled  ...
doi:10.1145/3581783.3612365 fatcat:3fz23bs2nvdj7cvjzzaeokxwfa
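
The snippet does not spell out the real LGI-Former design; below is only a generic local-global attention pattern in that spirit: tokens attend within local regions, and one summary token per region exchanges information globally, which cuts the cost of full self-attention. All sizes are invented:

    import torch
    import torch.nn as nn

    dim, regions, per_region, B = 128, 4, 16, 2
    local_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
    global_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    x = torch.randn(B, regions * per_region, dim)
    local = x.view(B * regions, per_region, dim)
    local, _ = local_attn(local, local, local)            # attention within each region
    region_tok = local.mean(dim=1).view(B, regions, dim)  # one summary token per region
    region_tok, _ = global_attn(region_tok, region_tok, region_tok)  # regions interact
    out = (local.view(B, regions, per_region, dim)        # broadcast global info back
           + region_tok.unsqueeze(2)).view(B, regions * per_region, dim)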

FastMIM: Expediting Masked Image Modeling Pre-training for Vision [article]

Jianyuan Guo, Kai Han, Han Wu, Yehui Tang, Yunhe Wang, Chang Xu
2022 arXiv   pre-print
Equipped with FastMIM, all kinds of vision backbones can be pre-trained in an efficient way.  ...  The combination of transformers and masked image modeling (MIM) pre-training framework has shown great potential in various vision tasks.  ...  becoming a generic MIM framework for various vision backbones [12, 21, 33, 42].  ...
arXiv:2212.06593v1 fatcat:ebdoruemjzg5plcnw3js2c5fp4
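
The snippet does not say where the speedup comes from; in the FastMIM paper the main levers are pre-training at reduced input resolution and predicting HOG features rather than raw pixels. The resolution lever is easy to show (the HOG target is omitted here):

    import torch
    import torch.nn.functional as F

    imgs = torch.randn(8, 3, 224, 224)
    low = F.interpolate(imgs, size=(112, 112), mode="bilinear", align_corners=False)
    # 112x112 inputs yield 4x fewer 16x16 patches than 224x224 (49 vs. 196 tokens),
    # shrinking both attention cost and memory during pre-training.
    print(low.shape)  # torch.Size([8, 3, 112, 112])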

Masked Image Modeling with Denoising Contrast [article]

Kun Yi, Yixiao Ge, Xiaotong Li, Shusheng Yang, Dian Li, Jianping Wu, Ying Shan, Xiaohu Qie
2023 arXiv   pre-print
MIM recently dominates this line of research with state-of-the-art performance on vision Transformers (ViTs), where the core is to enhance the patch-level visual context capturing of the network via denoising  ...  ConMIM-pretrained models with various scales achieve competitive results on downstream image classification, semantic segmentation, object detection, and instance segmentation tasks, e.g., on ImageNet-  ...  Towards this goal, the recent work MAE (He et al., 2022) proposes to mask a large proportion of patches.  ... 
arXiv:2205.09616v2 fatcat:3x5a5vcibngqzmkqyidp73au2e
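
As a rough illustration of "denoising contrast" at the patch level: the prediction for each masked patch should match that patch's target feature from an unmasked view, with the other patches serving as negatives. This is a generic per-patch InfoNCE, not ConMIM's exact formulation; the temperature is a placeholder:

    import torch
    import torch.nn.functional as F

    pred = F.normalize(torch.randn(196, 128), dim=-1)    # predictions for masked patches
    target = F.normalize(torch.randn(196, 128), dim=-1)  # per-patch targets, unmasked view
    logits = pred @ target.t() / 0.07                    # temperature-scaled similarities
    labels = torch.arange(196)                           # each patch matches itself
    loss = F.cross_entropy(logits, labels)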
Showing results 1 — 15 out of 6,138 results