A copy of this work was available on the public web and has been preserved in the Wayback Machine. The capture dates from 2023; you can also visit the original URL.
The file type is application/pdf.
Semi Supervised Meta Learning for Spatiotemporal Learning [article] · 2023 · arXiv pre-print
Specifically, we first experiment with applying a pre-trained MAE and fine-tuning on our small-scale spatiotemporal dataset for video reconstruction tasks. ...
Next, we experiment with training an MAE encoder and applying a classification head for action classification tasks. ...
That is, we scale down the vision transformer (ViT) backbone within the existing representation learning architecture for training on our custom small-scale video dataset. ...
arXiv:2308.01916v1
fatcat:wfe3o2otfzhbbjlmuxgwbjfydq
Advancing Plain Vision Transformer Towards Remote Sensing Foundation Model [article] · 2022 · arXiv pre-print
Large-scale vision foundation models have made significant progress in visual tasks on natural images, with vision transformers being the primary choice due to their good scalability and representation ...
In this paper, we resort to plain vision transformers with about 100 million parameters and make the first attempt to propose large vision models tailored to RS tasks and investigate how such large models ...
Vision transformers have also experienced rapid development towards large-scale foundation models thanks to their great potential in scalability and structure flexibility, e.g., the model size can be easily ...
arXiv:2208.03987v4
fatcat:4skyb2ytbbffvbk7njlgwmsf2a
Mimic before Reconstruct: Enhancing Masked Autoencoders with Feature Mimicking [article] · 2023 · arXiv pre-print
Masked Autoencoders (MAE) have been popular paradigms for large-scale vision representation pre-training. ...
On ImageNet-1K, the MR-MAE base pre-trained for only 400 epochs achieves 85.8% top-1 accuracy after fine-tuning, surpassing the 1600-epoch MAE base by +2.2% and the previous state-of-the-art BEiT V2 base ...
MCMAE [25] , UM-MAE [26] , MixMIM [27] and GreenMIM [28] explore efficient and effective MIM frameworks with hierarchical vision transformers [29] [30] [31] [32] . ...
arXiv:2303.05475v1
fatcat:du6oqf4orzc5dmbgew6rl2ihgq
A Survey on Masked Autoencoder for Self-supervised Learning in Vision and Beyond [article] · 2022 · arXiv pre-print
Masked autoencoders are scalable vision learners, as the title of MAE suggests, indicating that self-supervised learning (SSL) in vision might undertake a similar trajectory as in NLP. ...
As a milestone to bridge the gap with BERT in NLP, masked autoencoder has attracted unprecedented attention for SSL in vision and beyond. ...
Towards improving efficiency: despite their impressive performance, a significant bottleneck of masked autoencoders for visual SSL is that they require large amounts of computation. ...
arXiv:2208.00173v1
fatcat:d2bxvpzcabg3lei4mcnsts5wqe
One for All: Toward Unified Foundation Models for Earth Vision [article] · 2024 · arXiv pre-print
Foundation models characterized by extensive parameters and trained on large-scale datasets have demonstrated remarkable efficacy across various downstream tasks for remote sensing data. ...
Then the backbone model can be used in different downstream tasks, thus forging a path towards a unified foundation backbone model in Earth vision. ...
Characterized by their extensive parameters and pre-trained on large-scale datasets, these models have greatly enhanced the performance on different downstream tasks. ...
arXiv:2401.07527v2
fatcat:h3sutjvognfmxlnu6ipee45dfq
Masked Autoencoder for Self-Supervised Pre-training on Lidar Point Clouds [article] · 2022 · arXiv pre-print
Code available at https://github.com/georghess/voxel-mae ...
Masked autoencoding has become a successful pretraining paradigm for Transformer models for text, images, and, recently, point clouds. ...
Experiments on a large-scale automotive dataset show that Voxel-MAE learns useful point cloud representations from raw lidar point clouds. ...
arXiv:2207.00531v2
fatcat:r7vvgaqt4jhq3nzqb4rdrtn6ta
Fast Training of Diffusion Models with Masked Transformers [article] · 2024 · arXiv pre-print
We propose an efficient approach to train large diffusion models with masked transformers. ...
Thus, our method shows a promising way of efficiently training large transformer-based diffusion models without sacrificing the generative performance. ...
This opens up new opportunities for efficient training of large transformer-based diffusion models. ...
arXiv:2306.09305v2
fatcat:2dalxgb25refzgdigs6jxo4ghy
Masked Autoencoders Are Scalable Vision Learners [article] · 2021 · arXiv pre-print
This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. ...
Coupling these two designs enables us to train large models efficiently and effectively: we accelerate training (by 3x or more) and improve accuracy. ...
Towards robust vision transformer. arXiv:2105.07926, 2021. ...
arXiv:2111.06377v3
fatcat:4d7762easfdcniz4jvqedqizqy
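The MAE entry above describes the core design: mask a large fraction of image patches and run the encoder only on the visible remainder, which is where the reported 3x-or-more training speedup comes from. A minimal NumPy sketch of that random-masking step, for illustration only; the function name, shapes, and the 75% mask ratio are assumptions based on the paper's description, not the authors' code:

```python
import numpy as np

def random_masking(patches, mask_ratio=0.75, seed=0):
    """MAE-style random masking: keep a random subset of patches visible.

    patches: (num_patches, dim) array of flattened image patches.
    Returns (visible_patches, visible_idx, masked_idx).
    """
    rng = np.random.default_rng(seed)
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))          # patches the encoder will see
    perm = rng.permutation(n)                   # random shuffle of patch indices
    visible_idx = np.sort(perm[:n_keep])
    masked_idx = np.sort(perm[n_keep:])         # targets for the reconstruction loss
    return patches[visible_idx], visible_idx, masked_idx

# Example: 196 patches (a 14x14 grid) of dimension 768, as in a ViT-B/16 setup.
patches = np.zeros((196, 768))
visible, vis_idx, mask_idx = random_masking(patches)
print(visible.shape)  # (49, 768): only 25% of patches go through the encoder
```

With a 75% mask ratio the encoder processes only a quarter of the tokens, which is consistent with the snippet's claim of accelerating training by 3x or more; the decoder then reconstructs the patches at `masked_idx`.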
Surface Analysis with Vision Transformers [article] · 2022 · arXiv pre-print
Code available at https://github.com/metrics-lab/surface-vision-transformers ...
Recent state-of-the-art performance of Vision Transformers (ViTs) demonstrates that a general-purpose architecture, which implements self-attention, could replace the local feature learning operations ...
Optimisation: To mitigate the lack of inductive biases in the architecture, transformers typically require large training datasets or efficient (pre-)training strategies [1, 15, 49, 51]. ...
arXiv:2205.15836v1
fatcat:b3zoftc7kzazxp7bj2zbhcqafe
Masked Image Residual Learning for Scaling Deeper Vision Transformers [article] · 2023 · arXiv pre-print
Deeper Vision Transformers (ViTs) are more challenging to train. We expose a degradation problem in deeper layers of ViT when using masked image modeling (MIM) for pre-training. ...
The deeper ViT-S-54, costing 3× less than ViT-Large, achieves performance on par with ViT-Large. ...
Table 1: Details of Vision Transformer scaling along the depth dimension. ...
arXiv:2309.14136v3
fatcat:7kqnbdz22fc5vjvjhqi7m4swqu
Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language [article] · 2022 · arXiv pre-print
Current self-supervised learning algorithms are often modality-specific and require large amounts of computational resources. ...
To address these issues, we increase the training efficiency of data2vec, a learning objective that generalizes across several modalities. ...
In computer vision, there has been a shift towards Vision Transformer architectures (ViT; Dosovitskiy et al. 2020 ) and masked prediction methods that can be very efficient by not encoding masked patches ...
arXiv:2212.07525v1
fatcat:5hgg3nndb5e43bkndkvu5d2zzm
Denoising Diffusion Autoencoders are Unified Self-supervised Learners [article] · 2023 · arXiv pre-print
Transfer learning from ImageNet also confirms the suitability of DDAE for Vision Transformers, suggesting the potential to scale DDAEs as unified foundation models. ...
Table 3 and Table 4 show that the scaled DiT-XL/2 outperforms the smaller MAE ViT-B/16 under all settings by large margins except for linear probing on CIFAR-10. ...
However, DDAE is not as efficient as SimCLR on this dataset: a slightly larger ResNet-50 can surpass our linear probe result with fewer parameters. Transfer learning with Vision Transformers. ...
arXiv:2303.09769v2
fatcat:yg7pimlk5ne3fi4ingzfxmofbu
MAE-DFER: Efficient Masked Autoencoder for Self-supervised Dynamic Facial Expression Recognition · 2023 · Proceedings of the 31st ACM International Conference on Multimedia
Since the vanilla Vision Transformer (ViT) employed in VideoMAE requires substantial computation during fine-tuning, MAE-DFER develops an efficient local-global interaction Transformer (LGI-Former) as ...
(e.g., VideoMAE), this paper proposes MAE-DFER, a novel self-supervised method which leverages large-scale self-supervised pre-training on abundant unlabeled data to largely advance the development of DFER. ...
This is largely attributed to the limited training samples in current DFER datasets since large vision Transformers are data-hungry and training them typically requires more than million-level labeled ...
doi:10.1145/3581783.3612365
fatcat:3fz23bs2nvdj7cvjzzaeokxwfa
FastMIM: Expediting Masked Image Modeling Pre-training for Vision [article] · 2022 · arXiv pre-print
Equipped with FastMIM, all kinds of vision backbones can be pre-trained in an efficient way. ...
The combination of transformers and masked image modeling (MIM) pre-training framework has shown great potential in various vision tasks. ...
(N/A: MAE is not suitable for Swin Transformer.) ... becoming a generic MIM framework for various vision backbones [12, 21, 33, 42]. ...
arXiv:2212.06593v1
fatcat:ebdoruemjzg5plcnw3js2c5fp4
Masked Image Modeling with Denoising Contrast [article] · 2023 · arXiv pre-print
MIM recently dominates this line of research with state-of-the-art performance on vision Transformers (ViTs), where the core is to enhance the patch-level visual context capturing of the network via denoising ...
ConMIM-pretrained models with various scales achieve competitive results on downstream image classification, semantic segmentation, object detection, and instance segmentation tasks, e.g., on ImageNet- ...
Towards this goal, the recent work MAE (He et al., 2022) proposes to mask a large proportion of patches. ...
arXiv:2205.09616v2
fatcat:3x5a5vcibngqzmkqyidp73au2e
Showing results 1–15 of 6,138 results.