15 Hits in 2.7 sec

mSLAM: Massively multilingual joint pre-training for speech and text [article]

Ankur Bapna, Colin Cherry, Yu Zhang, Ye Jia, Melvin Johnson, Yong Cheng, Simran Khanuja, Jason Riesa, Alexis Conneau
2022 arXiv   pre-print
We present mSLAM, a multilingual Speech and LAnguage Model that learns cross-lingual cross-modal representations of speech and text by pre-training jointly on large amounts of unlabeled speech and text  ...  We evaluate mSLAM on several downstream speech understanding tasks and find that joint pre-training with text improves quality on speech translation, speech intent classification and speech language-ID  ...  Table 11: Speech recognition — BABEL ASR baselines in five languages, reporting WER.  ... 
arXiv:2202.01374v1 fatcat:hwphwa7dvbbx5ozjc6qip4jrwe

Mu^2SLAM: Multitask, Multilingual Speech and Language Models [article]

Yong Cheng, Yu Zhang, Melvin Johnson, Wolfgang Macherey, Ankur Bapna
2022 arXiv   pre-print
We present Mu^2SLAM, a multilingual sequence-to-sequence model pre-trained jointly on unlabeled speech, unlabeled text and supervised data spanning Automatic Speech Recognition (ASR), Automatic Speech  ...  model for all speech and text understanding tasks.  ...  Acknowledgements We would like to give our special thanks to Yuan Cao and Zhehuai Chen for insightful discussions.  ... 
arXiv:2212.09553v1 fatcat:zqc5ix4swfgjfd3cm4njzn27a4

MAESTRO: Matched Speech Text Representations through Modality Matching [article]

Zhehuai Chen, Yu Zhang, Andrew Rosenberg, Bhuvana Ramabhadran, Pedro Moreno, Ankur Bapna, Heiga Zen
2022 arXiv   pre-print
We present Maestro, a self-supervised training method to unify representations learnt from speech and text modalities.  ...  Learning aligned representations from unpaired speech and text sequences is a challenging task.  ...  We apply Maestro to train both monolingual and massively multilingual models of speech and text.  ... 
arXiv:2204.03409v2 fatcat:groj6z3xxvfffb626kkm4stmve

FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech [article]

Alexis Conneau, Min Ma, Simran Khanuja, Yu Zhang, Vera Axelrod, Siddharth Dalmia, Jason Riesa, Clara Rivera, Ankur Bapna
2022 arXiv   pre-print
In this paper, we provide baselines for the tasks based on multilingual pre-trained models like mSLAM.  ...  FLEURS can be used for a variety of speech tasks, including Automatic Speech Recognition (ASR), Speech Language Identification (Speech LangID), Translation and Retrieval.  ...  More recently, mSLAM [10] , a joint speech and text multilingual pretrained model, outperformed XLS-R on speech translation and ASR and improved over speech-only baselines on Speech-LangID.  ... 
arXiv:2205.12446v1 fatcat:lasevczdsneblgqsmlutymscju

Leveraging unsupervised and weakly-supervised data to improve direct speech-to-speech translation [article]

Ye Jia, Yifan Ding, Ankur Bapna, Colin Cherry, Yu Zhang, Alexis Conneau, Nobuyuki Morioka
2022 arXiv   pre-print
In this work, we explore multiple approaches for leveraging much more widely available unsupervised and weakly-supervised speech and text data to improve the performance of direct S2ST based on Translatotron  ...  Our comparative studies suggest future research directions for S2ST and speech representation learning.  ...  Acknowledgments The authors thank Benjamin Lee for his help on improving the multi-task learning infrastructure in the Lingvo framework and Chung-Cheng Chiu for helpful feedback.  ... 
arXiv:2203.13339v2 fatcat:eyykz44scre7lnb66nma6x7hrq

Modular Speech-to-Text Translation for Zero-Shot Cross-Modal Transfer [article]

Paul-Ambroise Duquenne, Holger Schwenk, Benoît Sagot
2023 arXiv   pre-print
Recent research has shown that independently trained encoders and decoders, combined through a shared fixed-size representation, can achieve competitive performance in speech-to-text translation.  ...  In this work, we show that this type of approach can be further improved with multilingual training.  ...  Recently, we saw the emergence of joint fixed-size representations for speech and text for mining purposes [17, 18, 19]: speech and text utterances are encoded in a shared sentence embedding space, then  ... 
arXiv:2310.03724v1 fatcat:epeoqee3wfgvtjh3rhgjxtvpby

T-Modules: Translation Modules for Zero-Shot Cross-Modal Machine Translation [article]

Paul-Ambroise Duquenne, Hongyu Gong, Benoît Sagot, Holger Schwenk
2022 arXiv   pre-print
We present a new approach to perform zero-shot cross-modal transfer between speech and text for translation tasks. Multilingual speech and text are encoded in a joint fixed-size representation space.  ...  Incorporating a speech decoder in our framework, we introduce the first results for zero-shot direct speech-to-speech and text-to-speech translation.  ...  pre-trained model for speech.  ... 
arXiv:2205.12216v2 fatcat:mf3mrnwxoffrbau24pcvvvev6q

Improving Massively Multilingual ASR With Auxiliary CTC Objectives [article]

William Chen, Brian Yan, Jiatong Shi, Yifan Peng, Soumi Maiti, Shinji Watanabe
2023 arXiv   pre-print
Multilingual Automatic Speech Recognition (ASR) models have extended the usability of speech technologies to a wide variety of languages.  ...  Trained models and reproducible recipes are available at https://github.com/espnet/espnet/tree/master/egs2/fleurs/asr1 .  ...  INTRODUCTION Recent advancements in multilingual speech processing have shown great promise towards building speech systems for all, expanding language coverage beyond the high-resource languages [1, 2, 3]  ... 
arXiv:2302.12829v2 fatcat:wfjnzekbpzg35bt6fpjkwd6c6a

Maestro-U: Leveraging joint speech-text representation learning for zero supervised speech ASR [article]

Zhehuai Chen, Ankur Bapna, Andrew Rosenberg, Yu Zhang, Bhuvana Ramabhadran, Pedro Moreno, Nanxin Chen
2022 arXiv   pre-print
In this work, we demonstrate that a modality-matched joint speech and text model can be leveraged to train a massively multilingual ASR model without any supervised (manually transcribed) speech for some  ...  This paper explores the use of jointly learnt speech and text representations in a massively multilingual, zero supervised speech, real-world setting to expand the set of languages covered by ASR with  ...  To improve the joint speech and text representation learning for this setting we propose the following: • Building on the FLEURS benchmark [21] , we define a massively multilingual zero-supervised-speech  ... 
arXiv:2210.10027v2 fatcat:hldkutvw2fcthdrgdkcoqj2afy

token2vec: A Joint Self-Supervised Pre-training Framework Using Unpaired Speech and Text [article]

Xianghu Yue and Junyi Ao and Xiaoxue Gao and Haizhou Li
2022 arXiv   pre-print
In this paper, we take the idea of self-supervised pre-training one step further and propose token2vec, a novel joint pre-training framework for unpaired speech and text based on discrete representations  ...  The question is whether we are able to perform a speech-text joint pre-training on unpaired speech and text.  ...  SLAM [15] introduces two alignment losses, e.g. translation language modeling (TLM) and speech-text matching (STM) on paired data, to align speech and text. mSLAM [18] is a multilingual version of  ... 
arXiv:2210.16755v1 fatcat:w6jqm4qzmzhq3gnarqtp4pfnrq

Robust Speech Recognition via Large-Scale Weak Supervision [article]

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever
2022 arXiv   pre-print
When compared to humans, the models approach their accuracy and robustness. We are releasing models and inference code to serve as a foundation for further work on robust speech processing.  ...  We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet.  ...  We are also grateful to the Acceleration and Supercomputing teams at OpenAI for their critical work on software and hardware infrastructure this project used.  ... 
arXiv:2212.04356v1 fatcat:uqrsf4qeb5cz3fpldlt64iekou

The Decades Progress on Code-Switching Research in NLP: A Systematic Survey on Trends and Challenges [article]

Genta Indra Winata, Alham Fikri Aji, Zheng-Xin Yong, Thamar Solorio
2022 arXiv   pre-print
Finally, we summarize the trends and findings and conclude with a discussion for future direction and open questions for further investigation.  ...  Code-Switching, a common phenomenon in written text and conversation, has been studied over decades by the natural language processing (NLP) research community.  ...  Acknowledgements Thanks to Igor Malioutov for the insightful discussion on the paper.  ... 
arXiv:2212.09660v1 fatcat:tfmt3kqxfre6dkpuj76drc7kky

MAESTRO: Matched Speech Text Representations through Modality Matching

Zhehuai Chen, Yu Zhang, Andrew Rosenberg, Bhuvana Ramabhadran, Pedro J. Moreno, Ankur Bapna, Heiga Zen
2022 Interspeech 2022   unpublished
We present Maestro, a self-supervised training method to unify representations learnt from speech and text modalities.  ...  Learning aligned representations from unpaired speech and text sequences is a challenging task.  ...  We apply Maestro to train both monolingual and massively multilingual models of speech and text.  ... 
doi:10.21437/interspeech.2022-10937 fatcat:pm7opbzeqnfmrbrpwc4ylqmc5q

Modular Speech-to-Text Translation for Zero-Shot Cross-Modal Transfer

Paul-Ambroise Duquenne, Holger Schwenk, Benoît Sagot
2023 INTERSPEECH 2023   unpublished
Recent research has shown that independently trained encoders and decoders, combined through a shared fixed-size representation, can achieve competitive performance in speech-to-text translation.  ...  In this work, we show that this type of approach can be further improved with multilingual training.  ...  Recently, we saw the emergence of joint fixed-size representations for speech and text for mining purposes [17, 18]: speech and text utterances are encoded in a shared sentence embedding space, then distances  ... 
doi:10.21437/interspeech.2023-2484 fatcat:d2eevgzmknednlutglfx7oeiyy

The Decades Progress on Code-Switching Research in NLP: A Systematic Survey on Trends and Challenges

Genta Winata, Alham Fikri Aji, Zheng Xin Yong, Thamar Solorio
2023 Findings of the Association for Computational Linguistics: ACL 2023   unpublished
Finally, we summarize the trends and findings and conclude with a discussion for future direction and open questions for further investigation.  ...  Code-Switching, a common phenomenon in written text and conversation, has been studied over decades by the natural language processing (NLP) research community.  ...  Acknowledgements Thanks to Igor Malioutov for the insightful discussion on the paper.  ... 
doi:10.18653/v1/2023.findings-acl.185 fatcat:xyfu7z4uwjesvmk7zwvamrtlke