6,138 Hits in 5.9 sec

Semi Supervised Meta Learning for Spatiotemporal Learning [article]

Faraz Waseem, Pratyush Muthukumar
2023 arXiv   pre-print
Specifically, we first experiment with applying a pre-trained MAE and fine-tuning on our small-scale spatiotemporal dataset for video reconstruction tasks.  ...  Next, we experiment with training an MAE encoder and applying a classification head for action classification tasks.  ...  That is, we scale down the vision transformer (ViT) backbone within the existing representation learning architecture for training on our custom small-scale video dataset.  ... 
arXiv:2308.01916v1 fatcat:wfe3o2otfzhbbjlmuxgwbjfydq
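
The entry above describes attaching a classification head to an MAE-style encoder for action classification. A minimal sketch of that fine-tuning setup, with a stand-in encoder and invented sizes (dim, num_classes, and token counts are placeholders, not the paper's):

    import torch
    import torch.nn as nn

    dim, num_classes = 128, 10                 # placeholder sizes
    encoder = nn.TransformerEncoder(           # stand-in for a pre-trained MAE encoder
        nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
    head = nn.Linear(dim, num_classes)         # classification head added on top

    clip_tokens = torch.randn(4, 196, dim)     # stand-in tokens from a video clip
    logits = head(encoder(clip_tokens).mean(dim=1))  # mean-pool tokens, then classify
    labels = torch.randint(0, num_classes, (4,))
    loss = nn.functional.cross_entropy(logits, labels)
    loss.backward()                            # fine-tunes encoder and head jointly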

Advancing Plain Vision Transformer Towards Remote Sensing Foundation Model [article]

Di Wang, Qiming Zhang, Yufei Xu, Jing Zhang, Bo Du, Dacheng Tao, Liangpei Zhang
2022 arXiv   pre-print
Large-scale vision foundation models have made significant progress in visual tasks on natural images, with vision transformers being the primary choice due to their good scalability and representation  ...  In this paper, we resort to plain vision transformers with about 100 million parameters and make the first attempt to propose large vision models tailored to RS tasks and investigate how such large models  ...  Vision transformers have also experienced rapid development towards large-scale foundation models thanks to their great potential in scalability and structure flexibility, e.g., the model size can be easily  ... 
arXiv:2208.03987v4 fatcat:4skyb2ytbbffvbk7njlgwmsf2a

Mimic before Reconstruct: Enhancing Masked Autoencoders with Feature Mimicking [article]

Peng Gao, Renrui Zhang, Rongyao Fang, Ziyi Lin, Hongyang Li, Hongsheng Li, Qiao Yu
2023 arXiv   pre-print
Masked Autoencoders (MAE) have been popular paradigms for large-scale vision representation pre-training.  ...  On ImageNet-1K, the MR-MAE base pre-trained for only 400 epochs achieves 85.8% top-1 accuracy after fine-tuning, surpassing the 1600-epoch MAE base by +2.2% and the previous state-of-the-art BEiT V2 base  ...  MCMAE [25], UM-MAE [26], MixMIM [27] and GreenMIM [28] explore efficient and effective MIM frameworks with hierarchical vision transformers [29, 30, 31, 32].  ...
arXiv:2303.05475v1 fatcat:du6oqf4orzc5dmbgew6rl2ihgq
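
As a rough illustration of the "mimic before reconstruct" idea named in this entry: visible tokens regress a frozen teacher's features while masked tokens are reconstructed, and the two losses are summed. The teacher choice, tensor shapes, and loss weight below are assumptions for illustration, not MR-MAE's actual configuration:

    import torch

    enc_visible = torch.randn(8, 49, 128)     # encoder outputs for visible tokens
    teacher_feats = torch.randn(8, 49, 128)   # frozen teacher features, same patches (assumed)
    pred_pixels = torch.randn(8, 147, 768)    # decoder predictions for masked patches
    true_pixels = torch.randn(8, 147, 768)    # ground-truth pixels for those patches

    mimic_loss = ((enc_visible - teacher_feats) ** 2).mean()  # feature mimicking (visible)
    recon_loss = ((pred_pixels - true_pixels) ** 2).mean()    # pixel reconstruction (masked)
    loss = recon_loss + 1.0 * mimic_loss                      # weight is a placeholder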

A Survey on Masked Autoencoder for Self-supervised Learning in Vision and Beyond [article]

Chaoning Zhang, Chenshuang Zhang, Junha Song, John Seon Keun Yi, Kang Zhang, In So Kweon
2022 arXiv   pre-print
Masked autoencoders are scalable vision learners, as the title of MAE suggests, and self-supervised learning (SSL) in vision might undertake a similar trajectory as in NLP.  ...  As a milestone to bridge the gap with BERT in NLP, the masked autoencoder has attracted unprecedented attention for SSL in vision and beyond.  ...  Towards improving efficiency: despite the impressive performance, a significant bottleneck of masked autoencoders for visual SSL is that they require large amounts of computation.  ...
arXiv:2208.00173v1 fatcat:d2bxvpzcabg3lei4mcnsts5wqe

One for All: Toward Unified Foundation Models for Earth Vision [article]

Zhitong Xiong, Yi Wang, Fahong Zhang, Xiao Xiang Zhu
2024 arXiv   pre-print
Foundation models characterized by extensive parameters and trained on large-scale datasets have demonstrated remarkable efficacy across various downstream tasks for remote sensing data.  ...  Then the backbone model can be used in different downstream tasks, thus forging a path towards a unified foundation backbone model in Earth vision.  ...  Characterized by their extensive parameters and pre-trained on large-scale datasets, these models have greatly enhanced the performance on different downstream tasks.  ... 
arXiv:2401.07527v2 fatcat:h3sutjvognfmxlnu6ipee45dfq
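
The entry above points toward one backbone serving many remote sensing modalities. A hedged sketch of that shape: per-modality patch-embedding stems feeding a single shared Transformer. Modality names and channel counts below are invented for illustration:

    import torch
    import torch.nn as nn

    dim = 128
    shared_backbone = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)

    # One patch-embedding stem per data modality (channel counts are examples).
    stems = nn.ModuleDict({
        "optical": nn.Conv2d(3, dim, kernel_size=16, stride=16),   # RGB imagery
        "sar":     nn.Conv2d(2, dim, kernel_size=16, stride=16),   # dual-pol SAR
    })

    def forward(modality, image):
        tokens = stems[modality](image).flatten(2).transpose(1, 2)  # (B, N, dim)
        return shared_backbone(tokens)

    feats = forward("optical", torch.randn(2, 3, 224, 224))
    print(feats.shape)  # torch.Size([2, 196, 128])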

Masked Autoencoder for Self-Supervised Pre-training on Lidar Point Clouds [article]

Georg Hess, Johan Jaxing, Elias Svensson, David Hagerman, Christoffer Petersson, Lennart Svensson
2022 arXiv   pre-print
Code available at https://github.com/georghess/voxel-mae  ...  Masked autoencoding has become a successful pretraining paradigm for Transformer models for text, images, and, recently, point clouds.  ...  Experiments on a large-scale automotive dataset show that Voxel-MAE learns useful point cloud representations from raw lidar point clouds.  ... 
arXiv:2207.00531v2 fatcat:r7vvgaqt4jhq3nzqb4rdrtn6ta
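
A hedged sketch of the masked-voxel idea: discretize a lidar sweep into voxels, then hide a large fraction of the non-empty ones so a model can be trained to reconstruct them. The grid size and mask ratio are invented here; see the linked voxel-mae repository for the authors' actual pipeline:

    import numpy as np

    rng = np.random.default_rng(0)
    points = rng.uniform(-50, 50, size=(20000, 3))     # stand-in lidar sweep (x, y, z)

    voxel_size = 0.5
    coords = np.floor(points / voxel_size).astype(np.int64)
    occupied = np.unique(coords, axis=0)               # non-empty voxel coordinates

    mask_ratio = 0.7                                   # hide 70% of occupied voxels
    masked_idx = rng.choice(len(occupied), size=int(len(occupied) * mask_ratio),
                            replace=False)
    visible = np.delete(occupied, masked_idx, axis=0)

    # A Transformer encoder would embed only `visible` voxels; the pre-training
    # task is to reconstruct occupancy/points inside the masked voxels.
    print(len(occupied), "occupied voxels,", len(visible), "visible after masking")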

Fast Training of Diffusion Models with Masked Transformers [article]

Hongkai Zheng, Weili Nie, Arash Vahdat, Anima Anandkumar
2024 arXiv   pre-print
We propose an efficient approach to train large diffusion models with masked transformers.  ...  Thus, our method shows a promising way of efficiently training large transformer-based diffusion models without sacrificing the generative performance.  ...  This opens up new opportunities for efficient training of large transformer-based diffusion models.  ... 
arXiv:2306.09305v2 fatcat:2dalxgb25refzgdigs6jxo4ghy
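
A hedged sketch of why masking helps here: if the denoising backbone only processes a random subset of patch tokens at each step, the quadratic attention cost drops accordingly. The toy noise schedule, model, and 50% mask ratio below are placeholders, not the paper's design:

    import torch
    import torch.nn as nn

    patches, dim = 64, 128
    backbone = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
    head = nn.Linear(dim, dim)                 # predicts the noise per visible patch

    x0 = torch.randn(8, patches, dim)          # clean latents, one token per patch
    noise = torch.randn_like(x0)
    alpha = 0.7                                # placeholder for a real noise schedule
    xt = alpha ** 0.5 * x0 + (1 - alpha) ** 0.5 * noise

    keep = torch.rand(patches).argsort()[: patches // 2]  # drop half the tokens
    pred = head(backbone(xt[:, keep]))                    # backbone sees visible tokens only
    loss = ((pred - noise[:, keep]) ** 2).mean()          # denoising loss on visible tokens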

Masked Autoencoders Are Scalable Vision Learners [article]

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick
2021 arXiv   pre-print
This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision.  ...  Coupling these two designs enables us to train large models efficiently and effectively: we accelerate training (by 3x or more) and improve accuracy.  ...
arXiv:2111.06377v3 fatcat:4d7762easfdcniz4jvqedqizqy
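
The "two designs" this snippet refers to are the high masking ratio and the asymmetric encoder-decoder: the encoder sees only the visible patches, and a light decoder reconstructs the rest. A minimal sketch of that recipe, following the paper's public description but with toy sizes:

    import torch
    import torch.nn as nn

    def random_masking(x, mask_ratio=0.75):
        """Keep a random subset of patch tokens; return kept tokens, a binary
        mask (1 = masked), and indices to restore the original order."""
        n, l, d = x.shape
        len_keep = int(l * (1 - mask_ratio))
        ids_shuffle = torch.argsort(torch.rand(n, l), dim=1)  # random permutation
        ids_restore = torch.argsort(ids_shuffle, dim=1)
        ids_keep = ids_shuffle[:, :len_keep]
        x_kept = torch.gather(x, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d))
        mask = torch.ones(n, l)
        mask[:, :len_keep] = 0
        mask = torch.gather(mask, 1, ids_restore)
        return x_kept, mask, ids_restore

    dim, patch_dim = 128, 16 * 16 * 3
    proj = nn.Linear(patch_dim, dim)
    encoder = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
    decoder = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=1)
    pixel_head = nn.Linear(dim, patch_dim)
    mask_token = nn.Parameter(torch.zeros(1, 1, dim))

    patches = torch.randn(8, 196, patch_dim)        # a batch of 14x14 patch grids
    x_kept, mask, ids_restore = random_masking(proj(patches))
    latent = encoder(x_kept)                        # encoder sees only ~25% of tokens
    filler = mask_token.expand(8, 196 - latent.shape[1], dim)
    full = torch.gather(torch.cat([latent, filler], dim=1), 1,
                        ids_restore.unsqueeze(-1).expand(-1, -1, dim))
    pred = pixel_head(decoder(full))                # reconstruct pixels per patch
    loss = (((pred - patches) ** 2).mean(-1) * mask).sum() / mask.sum()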

Surface Analysis with Vision Transformers [article]

Simon Dahan, Logan Z. J. Williams, Abdulah Fawaz, Daniel Rueckert, Emma C. Robinson
2022 arXiv   pre-print
Code available at https://github.com/metrics-lab/surface-vision-transformers  ...  Recent state-of-the-art performance of Vision Transformers (ViTs) demonstrates that a general-purpose architecture, which implements self-attention, could replace the local feature learning operations  ...  To mitigate the lack of inductive biases in the architecture, transformers typically require large training datasets or efficient (pre-)training strategies [1, 15, 49, 51].  ...
arXiv:2205.15836v1 fatcat:b3zoftc7kzazxp7bj2zbhcqafe

Masked Image Residual Learning for Scaling Deeper Vision Transformers [article]

Guoxi Huang, Hongtao Fu, Adrian G. Bors
2023 arXiv   pre-print
Deeper Vision Transformers (ViTs) are more challenging to train. We expose a degradation problem in deeper layers of ViT when using masked image modeling (MIM) for pre-training.  ...  The deeper ViT-S-54, costing 3× less than ViT-Large, achieves performance on par with ViT-Large.  ...  Table 1: Details of Vision Transformer scaling along the depth dimension.  ...
arXiv:2309.14136v3 fatcat:7kqnbdz22fc5vjvjhqi7m4swqu

Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language [article]

Alexei Baevski, Arun Babu, Wei-Ning Hsu, Michael Auli
2022 arXiv   pre-print
Current self-supervised learning algorithms are often modality-specific and require large amounts of computational resources.  ...  To address these issues, we increase the training efficiency of data2vec, a learning objective that generalizes across several modalities.  ...  In computer vision, there has been a shift towards Vision Transformer architectures (ViT; Dosovitskiy et al. 2020) and masked prediction methods that can be very efficient by not encoding masked patches  ...
arXiv:2212.07525v1 fatcat:5hgg3nndb5e43bkndkvu5d2zzm
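
A hedged sketch of the teacher-student scheme behind data2vec: an exponential-moving-average (EMA) copy of the model produces contextualized targets from the full input, and the student regresses them from a masked view. The sizes, the crude every-other-token mask, and the momentum value are illustrative:

    import copy
    import torch
    import torch.nn as nn

    dim = 128
    student = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
    teacher = copy.deepcopy(student)            # EMA teacher, never trained directly
    for p in teacher.parameters():
        p.requires_grad_(False)

    def ema_update(teacher, student, tau=0.999):
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(tau).add_(ps.detach(), alpha=1 - tau)

    tokens = torch.randn(4, 32, dim)
    with torch.no_grad():
        targets = teacher(tokens)               # contextualized targets, unmasked input
    masked = tokens.clone()
    masked[:, ::2] = 0                          # crude mask of every other token
    loss = ((student(masked) - targets) ** 2).mean()
    loss.backward()
    ema_update(teacher, student)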

Denoising Diffusion Autoencoders are Unified Self-supervised Learners [article]

Weilai Xiang, Hongyu Yang, Di Huang, Yunhong Wang
2023 arXiv   pre-print
Transfer learning from ImageNet also confirms the suitability of DDAE for Vision Transformers, suggesting the potential to scale DDAEs as unified foundation models.  ...  Table 3 and Table 4 show that the scaled DiT-XL/2 outperforms the smaller MAE ViT-B/16 under all settings by large margins except for linear probing on CIFAR-10.  ...  However, DDAE is not as efficient as SimCLR on this dataset: a slightly larger ResNet-50 can surpass our linear probe result with fewer parameters.  ...
arXiv:2303.09769v2 fatcat:yg7pimlk5ne3fi4ingzfxmofbu
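
The linear-probing comparisons this entry cites follow the standard protocol: freeze the pre-trained network, treat activations from a chosen layer (and, for diffusion models, a chosen timestep) as features, and train only a linear classifier on top. A generic sketch with stand-in features and placeholder sizes:

    import torch
    import torch.nn as nn

    feat_dim, num_classes = 512, 10
    frozen_features = torch.randn(1024, feat_dim)   # stand-in for diffusion activations
    labels = torch.randint(0, num_classes, (1024,))

    probe = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.SGD(probe.parameters(), lr=0.1)
    for _ in range(100):                            # train only the linear probe
        opt.zero_grad()
        loss = nn.functional.cross_entropy(probe(frozen_features), labels)
        loss.backward()
        opt.step()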

MAE-DFER: Efficient Masked Autoencoder for Self-supervised Dynamic Facial Expression Recognition

Licai Sun, Zheng Lian, Bin Liu, Jianhua Tao
2023 Proceedings of the 31st ACM International Conference on Multimedia  
Since the vanilla Vision Transformer (ViT) employed in VideoMAE requires substantial computation during fine-tuning, MAE-DFER develops an efficient local-global interaction Transformer (LGI-Former) as  ...  (e.g., VideoMAE), this paper proposes MAE-DFER, a novel self-supervised method which leverages large-scale self-supervised pre-training on abundant unlabeled data to largely advance the development of DFER.  ...  This is largely attributed to the limited training samples in current DFER datasets, since large vision Transformers are data-hungry and training them typically requires more than million-level labeled  ...
doi:10.1145/3581783.3612365 fatcat:3fz23bs2nvdj7cvjzzaeokxwfa
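
The snippet does not spell out the real LGI-Former design; below is only a generic local-global attention pattern in that spirit: tokens attend within local regions, and one summary token per region exchanges information globally, which cuts the cost of full self-attention. All sizes are invented:

    import torch
    import torch.nn as nn

    dim, regions, per_region, B = 128, 4, 16, 2
    local_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
    global_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    x = torch.randn(B, regions * per_region, dim)
    local = x.view(B * regions, per_region, dim)
    local, _ = local_attn(local, local, local)            # attention within each region
    region_tok = local.mean(dim=1).view(B, regions, dim)  # one summary token per region
    region_tok, _ = global_attn(region_tok, region_tok, region_tok)  # regions interact
    out = (local.view(B, regions, per_region, dim)        # broadcast global info back
           + region_tok.unsqueeze(2)).view(B, regions * per_region, dim)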

FastMIM: Expediting Masked Image Modeling Pre-training for Vision [article]

Jianyuan Guo, Kai Han, Han Wu, Yehui Tang, Yunhe Wang, Chang Xu
2022 arXiv   pre-print
Equipped with FastMIM, all kinds of vision backbones can be pre-trained in an efficient way.  ...  The combination of transformers and masked image modeling (MIM) pre-training framework has shown great potential in various vision tasks.  ...  becoming a generic MIM framework for various vision backbones [12, 21, 33, 42].  ...
arXiv:2212.06593v1 fatcat:ebdoruemjzg5plcnw3js2c5fp4
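
The snippet does not say where the speedup comes from; in the FastMIM paper the main levers are pre-training at reduced input resolution and predicting HOG features rather than raw pixels. The resolution lever is easy to show (the HOG target is omitted here):

    import torch
    import torch.nn.functional as F

    imgs = torch.randn(8, 3, 224, 224)
    low = F.interpolate(imgs, size=(112, 112), mode="bilinear", align_corners=False)
    # 112x112 inputs yield 4x fewer 16x16 patches than 224x224 (49 vs. 196 tokens),
    # shrinking both attention cost and memory during pre-training.
    print(low.shape)  # torch.Size([8, 3, 112, 112])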

Masked Image Modeling with Denoising Contrast [article]

Kun Yi, Yixiao Ge, Xiaotong Li, Shusheng Yang, Dian Li, Jianping Wu, Ying Shan, Xiaohu Qie
2023 arXiv   pre-print
MIM recently dominates this line of research with state-of-the-art performance on vision Transformers (ViTs), where the core is to enhance the patch-level visual context capturing of the network via denoising  ...  ConMIM-pretrained models with various scales achieve competitive results on downstream image classification, semantic segmentation, object detection, and instance segmentation tasks, e.g., on ImageNet-  ...  Towards this goal, the recent work MAE (He et al., 2022) proposes to mask a large proportion of patches.  ... 
arXiv:2205.09616v2 fatcat:3x5a5vcibngqzmkqyidp73au2e
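
As a rough illustration of "denoising contrast" at the patch level: the prediction for each masked patch should match that patch's target feature from an unmasked view, with the other patches serving as negatives. This is a generic per-patch InfoNCE, not ConMIM's exact formulation; the temperature is a placeholder:

    import torch
    import torch.nn.functional as F

    pred = F.normalize(torch.randn(196, 128), dim=-1)    # predictions for masked patches
    target = F.normalize(torch.randn(196, 128), dim=-1)  # per-patch targets, unmasked view
    logits = pred @ target.t() / 0.07                    # temperature-scaled similarities
    labels = torch.arange(196)                           # each patch matches itself
    loss = F.cross_entropy(logits, labels)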
Showing results 1 — 15 out of 6,138 results