15 Hits in 2.7 sec

mSLAM: Massively multilingual joint pre-training for speech and text [article]

Ankur Bapna, Colin Cherry, Yu Zhang, Ye Jia, Melvin Johnson, Yong Cheng, Simran Khanuja, Jason Riesa, Alexis Conneau
2022 arXiv   pre-print
We present mSLAM, a multilingual Speech and LAnguage Model that learns cross-lingual cross-modal representations of speech and text by pre-training jointly on large amounts of unlabeled speech and text  ...  We evaluate mSLAM on several downstream speech understanding tasks and find that joint pre-training with text improves quality on speech translation, speech intent classification and speech language-ID  ...  Table 11: Speech recognition — BABEL ASR baselines in five languages, reporting WER.  ... 
arXiv:2202.01374v1 fatcat:hwphwa7dvbbx5ozjc6qip4jrwe

Mu^2SLAM: Multitask, Multilingual Speech and Language Models [article]

Yong Cheng, Yu Zhang, Melvin Johnson, Wolfgang Macherey, Ankur Bapna
2022 arXiv   pre-print
We present Mu^2SLAM, a multilingual sequence-to-sequence model pre-trained jointly on unlabeled speech, unlabeled text and supervised data spanning Automatic Speech Recognition (ASR), Automatic Speech  ...  model for all speech and text understanding tasks.  ...  Acknowledgements We would like to give our special thanks to Yuan Cao and Zhehuai Chen for insightful discussions.  ... 
arXiv:2212.09553v1 fatcat:zqc5ix4swfgjfd3cm4njzn27a4

MAESTRO: Matched Speech Text Representations through Modality Matching [article]

Zhehuai Chen, Yu Zhang, Andrew Rosenberg, Bhuvana Ramabhadran, Pedro Moreno, Ankur Bapna, Heiga Zen
2022 arXiv   pre-print
We present Maestro, a self-supervised training method to unify representations learnt from speech and text modalities.  ...  Learning aligned representations from unpaired speech and text sequences is a challenging task.  ...  We apply Maestro to train both monolingual and massively multilingual models of speech and text.  ... 
arXiv:2204.03409v2 fatcat:groj6z3xxvfffb626kkm4stmve

FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech [article]

Alexis Conneau, Min Ma, Simran Khanuja, Yu Zhang, Vera Axelrod, Siddharth Dalmia, Jason Riesa, Clara Rivera, Ankur Bapna
2022 arXiv   pre-print
In this paper, we provide baselines for the tasks based on multilingual pre-trained models like mSLAM.  ...  FLEURS can be used for a variety of speech tasks, including Automatic Speech Recognition (ASR), Speech Language Identification (Speech LangID), Translation and Retrieval.  ...  More recently, mSLAM [10] , a joint speech and text multilingual pretrained model, outperformed XLS-R on speech translation and ASR and improved over speech-only baselines on Speech-LangID.  ... 
arXiv:2205.12446v1 fatcat:lasevczdsneblgqsmlutymscju

Leveraging unsupervised and weakly-supervised data to improve direct speech-to-speech translation [article]

Ye Jia, Yifan Ding, Ankur Bapna, Colin Cherry, Yu Zhang, Alexis Conneau, Nobuyuki Morioka
2022 arXiv   pre-print
In this work, we explore multiple approaches for leveraging much more widely available unsupervised and weakly-supervised speech and text data to improve the performance of direct S2ST based on Translatotron  ...  Our comparative studies suggest future research directions for S2ST and speech representation learning.  ...  Acknowledgments The authors thank Benjamin Lee for his help on improving the multi-task learning infrastructure in the Lingvo framework and Chung-Cheng Chiu for helpful feedback.  ... 
arXiv:2203.13339v2 fatcat:eyykz44scre7lnb66nma6x7hrq

Modular Speech-to-Text Translation for Zero-Shot Cross-Modal Transfer [article]

Paul-Ambroise Duquenne, Holger Schwenk, Benoît Sagot
2023 arXiv   pre-print
Recent research has shown that independently trained encoders and decoders, combined through a shared fixed-size representation, can achieve competitive performance in speech-to-text translation.  ...  In this work, we show that this type of approach can be further improved with multilingual training.  ...  Recently, we saw the emergence of joint fixed-size representations for speech and text for mining purposes [17, 18, 19]: speech and text utterances are encoded in a shared sentence embedding space, then  ... 
arXiv:2310.03724v1 fatcat:epeoqee3wfgvtjh3rhgjxtvpby

T-Modules: Translation Modules for Zero-Shot Cross-Modal Machine Translation [article]

Paul-Ambroise Duquenne, Hongyu Gong, Benoît Sagot, Holger Schwenk
2022 arXiv   pre-print
We present a new approach to perform zero-shot cross-modal transfer between speech and text for translation tasks. Multilingual speech and text are encoded in a joint fixed-size representation space.  ...  Incorporating a speech decoder in our framework, we introduce the first results for zero-shot direct speech-to-speech and text-to-speech translation.  ...  pre-trained model for speech.  ... 
arXiv:2205.12216v2 fatcat:mf3mrnwxoffrbau24pcvvvev6q

Improving Massively Multilingual ASR With Auxiliary CTC Objectives [article]

William Chen, Brian Yan, Jiatong Shi, Yifan Peng, Soumi Maiti, Shinji Watanabe
2023 arXiv   pre-print
Multilingual Automatic Speech Recognition (ASR) models have extended the usability of speech technologies to a wide variety of languages.  ...  Trained models and reproducible recipes are available at https://github.com/espnet/espnet/tree/master/egs2/fleurs/asr1 .  ...  INTRODUCTION Recent advancements in multilingual speech processing have shown great promise towards building speech systems for all, expanding language coverage beyond the high-resource languages [1, 2, 3]  ... 
arXiv:2302.12829v2 fatcat:wfjnzekbpzg35bt6fpjkwd6c6a

Maestro-U: Leveraging joint speech-text representation learning for zero supervised speech ASR [article]

Zhehuai Chen, Ankur Bapna, Andrew Rosenberg, Yu Zhang, Bhuvana Ramabhadran, Pedro Moreno, Nanxin Chen
2022 arXiv   pre-print
In this work, we demonstrate that a modality-matched joint speech and text model can be leveraged to train a massively multilingual ASR model without any supervised (manually transcribed) speech for some  ...  This paper explores the use of jointly learnt speech and text representations in a massively multilingual, zero supervised speech, real-world setting to expand the set of languages covered by ASR with  ...  To improve the joint speech and text representation learning for this setting we propose the following: • Building on the FLEURS benchmark [21] , we define a massively multilingual zero-supervised-speech  ... 
arXiv:2210.10027v2 fatcat:hldkutvw2fcthdrgdkcoqj2afy

token2vec: A Joint Self-Supervised Pre-training Framework Using Unpaired Speech and Text [article]

Xianghu Yue and Junyi Ao and Xiaoxue Gao and Haizhou Li
2022 arXiv   pre-print
In this paper, we take the idea of self-supervised pre-training one step further and propose token2vec, a novel joint pre-training framework for unpaired speech and text based on discrete representations  ...  The question is whether we are able to perform a speech-text joint pre-training on unpaired speech and text.  ...  SLAM [15] introduces two alignment losses, e.g. translation language modeling (TLM) and speech-text matching (STM) on paired data, to align speech and text. mSLAM [18] is a multilingual version of  ... 
arXiv:2210.16755v1 fatcat:w6jqm4qzmzhq3gnarqtp4pfnrq

Robust Speech Recognition via Large-Scale Weak Supervision [article]

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever
2022 arXiv   pre-print
When compared to humans, the models approach their accuracy and robustness. We are releasing models and inference code to serve as a foundation for further work on robust speech processing.  ...  We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet.  ...  We are also grateful to the Acceleration and Supercomputing teams at OpenAI for their critical work on software and hardware infrastructure this project used.  ... 
arXiv:2212.04356v1 fatcat:uqrsf4qeb5cz3fpldlt64iekou

The Decades Progress on Code-Switching Research in NLP: A Systematic Survey on Trends and Challenges [article]

Genta Indra Winata, Alham Fikri Aji, Zheng-Xin Yong, Thamar Solorio
2022 arXiv   pre-print
Finally, we summarize the trends and findings and conclude with a discussion for future direction and open questions for further investigation.  ...  Code-Switching, a common phenomenon in written text and conversation, has been studied over decades by the natural language processing (NLP) research community.  ...  Acknowledgements Thanks to Igor Malioutov for the insightful discussion on the paper.  ... 
arXiv:2212.09660v1 fatcat:tfmt3kqxfre6dkpuj76drc7kky

MAESTRO: Matched Speech Text Representations through Modality Matching

Zhehuai Chen, Yu Zhang, Andrew Rosenberg, Bhuvana Ramabhadran, Pedro J. Moreno, Ankur Bapna, Heiga Zen
2022 Interspeech 2022   unpublished
We present Maestro, a self-supervised training method to unify representations learnt from speech and text modalities.  ...  Learning aligned representations from unpaired speech and text sequences is a challenging task.  ...  We apply Maestro to train both monolingual and massively multilingual models of speech and text.  ... 
doi:10.21437/interspeech.2022-10937 fatcat:pm7opbzeqnfmrbrpwc4ylqmc5q

Modular Speech-to-Text Translation for Zero-Shot Cross-Modal Transfer

Paul-Ambroise Duquenne, Holger Schwenk, Benoît Sagot
2023 INTERSPEECH 2023   unpublished
Recent research has shown that independently trained encoders and decoders, combined through a shared fixed-size representation, can achieve competitive performance in speech-to-text translation.  ...  In this work, we show that this type of approach can be further improved with multilingual training.  ...  Recently, we saw the emergence of joint fixed-size representations for speech and text for mining purposes [17, 18]: speech and text utterances are encoded in a shared sentence embedding space, then distances  ... 
doi:10.21437/interspeech.2023-2484 fatcat:d2eevgzmknednlutglfx7oeiyy

The Decades Progress on Code-Switching Research in NLP: A Systematic Survey on Trends and Challenges

Genta Winata, Alham Fikri Aji, Zheng Xin Yong, Thamar Solorio
2023 Findings of the Association for Computational Linguistics: ACL 2023   unpublished
Finally, we summarize the trends and findings and conclude with a discussion for future direction and open questions for further investigation.  ...  Code-Switching, a common phenomenon in written text and conversation, has been studied over decades by the natural language processing (NLP) research community.  ...  Acknowledgements Thanks to Igor Malioutov for the insightful discussion on the paper.  ... 
doi:10.18653/v1/2023.findings-acl.185 fatcat:xyfu7z4uwjesvmk7zwvamrtlke