Sound
- [1] arXiv:2406.05464 [pdf, ps, html, other]
-
Title: DAISY: Data Adaptive Self-Supervised Early Exit for Speech Representation ModelsComments: Accepted by Interspeech 2024Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Self-supervised speech models have shown to be useful for various tasks, but their large size limits the use in devices with low computing power and memory. In this work, we explore early exit, an approach for reducing latency by exiting the forward process of a network early. Most approaches of early exit need a separate early exit model for each task, with some even requiring fine-tuning of the entire pretrained model. We introduce Data Adaptive Self-Supervised Early Exit (DAISY), an approach that decides when to exit based on the self-supervised loss, eliminating the need for multiple round of training and fine-tuning. DAISY matches the performance of HuBERT on the MiniSUPERB benchmark, but with much faster inference times. Our analysis on the adaptivity of DAISY shows that the model exits early (using fewer layers) on clean data while exits late (using more layers) on noisy data, dynamically adjusting the computational cost of inference based on the noise level of each sample.
- [2] arXiv:2406.05515 [pdf, ps, html, other]
-
Title: Mmm whatcha say? Uncovering distal and proximal context effects in first and second-language word perception using psychophysical reverse correlationPaige Tuttösí, H. Henny Yeung, Yue Wang, Fenqi Wang, Guillaume Denis, Jean-Julien Aucouturier, Angelica LimComments: Accepted to INTERSPEECH 2024Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Acoustic context effects, where surrounding changes in pitch, rate or timbre influence the perception of a sound, are well documented in speech perception, but how they interact with language background remains unclear. Using a reverse-correlation approach, we systematically varied the pitch and speech rate in phrases around different pairs of vowels for second language (L2) speakers of English (/i/-/I/) and French (/u/-/y/), thus reconstructing, in a data-driven manner, the prosodic profiles that bias their perception. Testing English and French speakers (n=25), we showed that vowel perception is in fact influenced by conflicting effects from the surrounding pitch and speech rate: a congruent proximal effect 0.2s pre-target and a distal contrastive effect up to 1s before; and found that L1 and L2 speakers exhibited strikingly similar prosodic profiles in perception. We provide a novel method to investigate acoustic context effects across stimuli, timescales, and acoustic domain.
- [3] arXiv:2406.05547 [pdf, ps, html, other]
-
Title: Exploring the Benefits of Tokenization of Discrete Acoustic UnitsComments: Interspeech 2024Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Tokenization algorithms that merge the units of a base vocabulary into larger, variable-rate units have become standard in natural language processing tasks. This idea, however, has been mostly overlooked when the vocabulary consists of phonemes or Discrete Acoustic Units (DAUs), an audio-based representation that is playing an increasingly important role due to the success of discrete language-modeling techniques. In this paper, we showcase the advantages of tokenization of phonetic units and of DAUs on three prediction tasks: grapheme-to-phoneme, grapheme-to-DAUs, and unsupervised speech generation using DAU language modeling. We demonstrate that tokenization yields significant improvements in terms of performance, as well as training and inference speed, across all three tasks. We also offer theoretical insights to provide some explanation for the superior performance observed.
- [4] arXiv:2406.05653 [pdf, ps, html, other]
-
Title: Heart Sound Segmentation Using Deep Learning TechniquesSubjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Heart disease remains a leading cause of mortality worldwide. Auscultation, the process of listening to heart sounds, can be enhanced through computer-aided analysis using Phonocardiogram (PCG) signals. This paper presents a novel approach for heart sound segmentation and classification into S1 (LUB) and S2 (DUB) sounds. We employ FFT-based filtering, dynamic programming for event detection, and a Siamese network for robust classification. Our method demonstrates superior performance on the PASCAL heart sound dataset compared to existing approaches.
- [5] arXiv:2406.05681 [pdf, ps, html, other]
-
Title: Towards Expressive Zero-Shot Speech Synthesis with Hierarchical Prosody ModelingComments: 5 pages, 2 figures, accepted by Interspeech2024Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Recent research in zero-shot speech synthesis has made significant progress in speaker similarity. However, current efforts focus on timbre generalization rather than prosody modeling, which results in limited naturalness and expressiveness. To address this, we introduce a novel speech synthesis model trained on large-scale datasets, including both timbre and hierarchical prosody modeling. As timbre is a global attribute closely linked to expressiveness, we adopt a global vector to model speaker timbre while guiding prosody modeling. Besides, given that prosody contains both global consistency and local variations, we introduce a diffusion model as the pitch predictor and employ a prosody adaptor to model prosody hierarchically, further enhancing the prosody quality of the synthesized speech. Experimental results show that our model not only maintains comparable timbre quality to the baseline but also exhibits better naturalness and expressiveness.
- [6] arXiv:2406.05692 [pdf, ps, html, other]
-
Title: SPA-SVC: Self-supervised Pitch Augmentation for Singing Voice ConversionComments: Accepted by Interspeech 2024Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Diffusion-based singing voice conversion (SVC) models have shown better synthesis quality compared to traditional methods. However, in cross-domain SVC scenarios, where there is a significant disparity in pitch between the source and target voice domains, the models tend to generate audios with hoarseness, posing challenges in achieving high-quality vocal outputs. Therefore, in this paper, we propose a Self-supervised Pitch Augmentation method for Singing Voice Conversion (SPA-SVC), which can enhance the voice quality in SVC tasks without requiring additional data or increasing model parameters. We innovatively introduce a cycle pitch shifting training strategy and Structural Similarity Index (SSIM) loss into our SVC model, effectively enhancing its performance. Experimental results on the public singing datasets M4Singer indicate that our proposed method significantly improves model performance in both general SVC scenarios and particularly in cross-domain SVC scenarios.
- [7] arXiv:2406.05784 [pdf, ps, html, other]
-
Title: Optimizing Multi-Stuttered Speech Classification: Leveraging Whisper's Encoder for Efficient Parameter Reduction in Automated AssessmentSubjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
The automated classification of stuttered speech has significant implications for timely assessments providing assistance to speech language pathologists. Despite notable advancements in the field, the cases in which multiple disfluencies occur in speech require attention. We have taken a progressive approach to fill this gap by classifying multi-stuttered speech more efficiently. The problem has been addressed by firstly curating a dataset of multi-stuttered disfluencies from SEP-28k audio clips. Secondly, employing Whisper, a state-of-the-art speech recognition model has been leveraged by using its encoder and taking the problem as multi-label classification. Thirdly, using a 6 encoder layer Whisper and experimenting with various layer freezing strategies, a computationally efficient configuration of the model was identified. The proposed configuration achieved micro, macro, and weighted F1- scores of 0.88, 0.85, and 0.87, correspondingly on an external test dataset i.e. Fluency-Bank. In addition, through layer freezing strategies, we were able to achieve the aforementioned results by fine-tuning a single encoder layer, consequently, reducing the model's trainable parameters from 20.27 million to 3.29 million. This research study unveils the contribution of the last encoder layer in the identification of disfluencies in stuttered speech. Consequently, it has led to a computationally efficient approach which makes the model more adaptable for various dialects and languages.
- [8] arXiv:2406.05863 [pdf, ps, other]
-
Title: Source -Free Domain Adaptation for Speaker Verification in Data-Scarce Languages and Noisy ChannelsSubjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Domain adaptation is often hampered by exceedingly small target datasets and inaccessible source data. These conditions are prevalent in speech verification, where privacy policies and/or languages with scarce speech resources limit the availability of sufficient data. This paper explored techniques of sourcefree domain adaptation unto a limited target speech dataset for speaker verificationin data-scarce languages. Both language and channel mis-match between source and target were investigated. Fine-tuning methods were evaluated and compared across different sizes of labeled target data. A novel iterative cluster-learn algorithm was studied for unlabeled target datasets.
- [9] arXiv:2406.05923 [pdf, ps, html, other]
-
Title: Contrastive Learning from Synthetic Audio DoppelgangersComments: 17 pages, 6 figuresSubjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Learning robust audio representations currently demands extensive datasets of real-world sound recordings. By applying artificial transformations to these recordings, models can learn to recognize similarities despite subtle variations through techniques like contrastive learning. However, these transformations are only approximations of the true diversity found in real-world sounds, which are generated by complex interactions of physical processes, from vocal cord vibrations to the resonance of musical instruments. We propose a solution to both the data scale and transformation limitations, leveraging synthetic audio. By randomly perturbing the parameters of a sound synthesizer, we generate audio doppelgängers-synthetic positive pairs with causally manipulated variations in timbre, pitch, and temporal envelopes. These variations, difficult to achieve through transformations of existing audio, provide a rich source of contrastive information. Despite the shift to randomly generated synthetic data, our method produces strong representations, competitive with real data on standard audio classification benchmarks. Notably, our approach is lightweight, requires no data storage, and has only a single hyperparameter, which we extensively analyze. We offer this method as a complement to existing strategies for contrastive learning in audio, using synthesized sounds to reduce the data burden on practitioners.
- [10] arXiv:2406.06086 [pdf, ps, html, other]
-
Title: RawBMamba: End-to-End Bidirectional State Space Model for Audio Deepfake DetectionYujie Chen, Jiangyan Yi, Jun Xue, Chenglong Wang, Xiaohui Zhang, Shunbo Dong, Siding Zeng, Jianhua Tao, Lv Zhao, Cunhang FanComments: Accepted by Interspeech 2024Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Fake artefacts for discriminating between bonafide and fake audio can exist in both short- and long-range segments. Therefore, combining local and global feature information can effectively discriminate between bonafide and fake audio. This paper proposes an end-to-end bidirectional state space model, named RawBMamba, to capture both short- and long-range discriminative information for audio deepfake detection. Specifically, we use sinc Layer and multiple convolutional layers to capture short-range features, and then design a bidirectional Mamba to address Mamba's unidirectional modelling problem and further capture long-range feature information. Moreover, we develop a bidirectional fusion module to integrate embeddings, enhancing audio context representation and combining short- and long-range information. The results show that our proposed RawBMamba achieves a 34.1\% improvement over Rawformer on ASVspoof2021 LA dataset, and demonstrates competitive performance on other datasets.
- [11] arXiv:2406.06097 [pdf, ps, other]
-
Title: StreamAtt: Direct Streaming Speech-to-Text Translation with Attention-based Audio History SelectionComments: Accepted at ACL 2024 main conferenceSubjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Streaming speech-to-text translation (StreamST) is the task of automatically translating speech while incrementally receiving an audio stream. Unlike simultaneous ST (SimulST), which deals with pre-segmented speech, StreamST faces the challenges of handling continuous and unbounded audio streams. This requires additional decisions about what to retain of the previous history, which is impractical to keep entirely due to latency and computational constraints. Despite the real-world demand for real-time ST, research on streaming translation remains limited, with existing works solely focusing on SimulST. To fill this gap, we introduce StreamAtt, the first StreamST policy, and propose StreamLAAL, the first StreamST latency metric designed to be comparable with existing metrics for SimulST. Extensive experiments across all 8 languages of MuST-C v1.0 show the effectiveness of StreamAtt compared to a naive streaming baseline and the related state-of-the-art SimulST policy, providing a first step in StreamST research.
- [12] arXiv:2406.06139 [pdf, ps, html, other]
-
Title: Thunder : Unified Regression-Diffusion Speech Enhancement with a Single Reverse Step using Brownian BridgeComments: 5 pages, 3 figures, 4 tables, This paper will be submitted in the interspeech conferenceSubjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Diffusion-based speech enhancement has shown promising results, but can suffer from a slower inference time. Initializing the diffusion process with the enhanced audio generated by a regression-based model can be used to reduce the computational steps required. However, these approaches often necessitate a regression model, further increasing the system's complexity. We propose Thunder, a unified regression-diffusion model that utilizes the Brownian bridge process which can allow the model to act in both modes. The regression mode can be accessed by setting the diffusion time step closed to 1. However, the standard score-based diffusion modeling does not perform well in this setup due to gradient instability. To mitigate this problem, we modify the diffusion model to predict the clean speech instead of the score function, achieving competitive performance with a more compact model size and fewer reverse steps.
- [13] arXiv:2406.06208 [pdf, ps, html, other]
-
Title: Quantifying the effect of speech pathology on automatic and human speaker verificationBence Mark Halpern, Thomas Tienkamp, Wen-Chin Huang, Lester Phillip Violeta, Teja Rebernik, Sebastiaan de Visscher, Max Witjes, Martijn Wieling, Defne Abur, Tomoki TodaComments: 5 pages, 2 figures, 2 tables. Accepted to Interspeech 2024Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
This study investigates how surgical intervention for speech pathology (specifically, as a result of oral cancer surgery) impacts the performance of an automatic speaker verification (ASV) system. Using two recently collected Dutch datasets with parallel pre and post-surgery audio from the same speaker, NKI-OC-VC and SPOKE, we assess the extent to which speech pathology influences ASV performance, and whether objective/subjective measures of speech severity are correlated with the performance. Finally, we carry out a perceptual study to compare judgements of ASV and human listeners. Our findings reveal that pathological speech negatively affects ASV performance, and the severity of the speech is negatively correlated with the performance. There is a moderate agreement in perceptual and objective scores of speaker similarity and severity, however, we could not clearly establish in the perceptual study, whether the same phenomenon also exists in human perception.
- [14] arXiv:2406.06295 [pdf, ps, html, other]
-
Title: Zero-Shot Audio Captioning Using Soft and Hard PromptsComments: Submitted to IEEE/ACM Transactions on Audio, Speech and Language ProcessingSubjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
In traditional audio captioning methods, a model is usually trained in a fully supervised manner using a human-annotated dataset containing audio-text pairs and then evaluated on the test sets from the same dataset. Such methods have two limitations. First, these methods are often data-hungry and require time-consuming and expensive human annotations to obtain audio-text pairs. Second, these models often suffer from performance degradation in cross-domain scenarios, i.e., when the input audio comes from a different domain than the training set, which, however, has received little attention. We propose an effective audio captioning method based on the contrastive language-audio pre-training (CLAP) model to address these issues. Our proposed method requires only textual data for training, enabling the model to generate text from the textual feature in the cross-modal semantic this http URL the inference stage, the model generates the descriptive text for the given audio from the audio feature by leveraging the audio-text alignment from CLAP.We devise two strategies to mitigate the discrepancy between text and audio embeddings: a mixed-augmentation-based soft prompt and a retrieval-based acoustic-aware hard prompt. These approaches are designed to enhance the generalization performance of our proposed model, facilitating the model to generate captions more robustly and accurately. Extensive experiments on AudioCaps and Clotho benchmarks show the effectiveness of our proposed method, which outperforms other zero-shot audio captioning approaches for in-domain scenarios and outperforms the compared methods for cross-domain scenarios, underscoring the generalization ability of our method.
- [15] arXiv:2406.06310 [pdf, ps, html, other]
-
Title: Unsupervised Improved MVDR Beamforming for Sound EnhancementSubjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Neural networks have recently become the dominant approach to sound separation. Their good performance relies on large datasets of isolated recordings. For speech and music, isolated single channel data are readily available; however the same does not hold in the multi-channel case, and with most other sound classes. Multi-channel methods have the potential to outperform single channel approaches as they can exploit both spatial and spectral features, but the lack of training data remains a challenge. We propose unsupervised improved minimum variation distortionless response (UIMVDR), which enables multi-channel separation to leverage in-the-wild single-channel data through unsupervised training and beamforming. Results show that UIMVDR generalizes well and improves separation performance compared to supervised models, particularly in cases with limited supervised data. By using data available online, it also reduces the effort required to gather data for multi-channel approaches.
- [16] arXiv:2406.06332 [pdf, ps, html, other]
-
Title: An automatic analysis of ultrasound vocalisations for the prediction of interaction context in captive Egyptian fruit batsComments: Accepted at EUSIPCO 2024Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Prior work in computational bioacoustics has mostly focused on the detection of animal presence in a particular habitat. However, animal sounds contain much richer information than mere presence; among others, they encapsulate the interactions of those animals with other members of their species. Studying these interactions is almost impossible in a naturalistic setting, as the ground truth is often lacking. The use of animals in captivity instead offers a viable alternative pathway. However, most prior works follow a traditional, statistics-based approach to analysing interactions. In the present work, we go beyond this standard framework by attempting to predict the underlying context in interactions between captive \emph{Rousettus Aegyptiacus} using deep neural networks. We reach an unweighted average recall of over 30\% -- more than thrice the chance level -- and show error patterns that differ from our statistical analysis. This work thus represents an important step towards the automatic analysis of states in animals from sound.
- [17] arXiv:2406.06339 [pdf, ps, html, other]
-
Title: Audio-based Step-count Estimation for Running -- Windowing and Neural Network BaselinesComments: Accepted at EUSIPCO 2024Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
In recent decades, running has become an increasingly popular pastime activity due to its accessibility, ease of practice, and anticipated health benefits. However, the risk of running-related injuries is substantial for runners of different experience levels. Several common forms of injuries result from overuse -- extending beyond the recommended running time and intensity. Recently, audio-based tracking has emerged as yet another modality for monitoring running behaviour and performance, with previous studies largely concentrating on predicting runner fatigue. In this work, we investigate audio-based step count estimation during outdoor running, achieving a mean absolute error of 1.098 in window-based step-count differences and a Pearson correlation coefficient of 0.479 when predicting the number of steps in a 5-second window of audio. Our work thus showcases the feasibility of audio-based monitoring for estimating important physiological variables and lays the foundations for further utilising audio sensors for a more thorough characterisation of runner behaviour.
- [18] arXiv:2406.06341 [pdf, ps, html, other]
-
Title: Predicting Heart Activity from Speech using Data-driven and Knowledge-based featuresComments: Accepted at Interspeech 2024Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
Accurately predicting heart activity and other biological signals is crucial for diagnosis and monitoring. Given that speech is an outcome of multiple physiological systems, a significant body of work studied the acoustic correlates of heart activity. Recently, self-supervised models have excelled in speech-related tasks compared to traditional acoustic methods. However, the robustness of data-driven representations in predicting heart activity remained unexplored. In this study, we demonstrate that self-supervised speech models outperform acoustic features in predicting heart activity parameters. We also emphasize the impact of individual variability on model generalizability. These findings underscore the value of data-driven representations in such tasks and the need for more speech-based physiological data to mitigate speaker-related challenges.
- [19] arXiv:2406.06375 [pdf, ps, html, other]
-
Title: MOSA: Music Motion with Semantic Annotation Dataset for Cross-Modal Music ProcessingYu-Fen Huang, Nikki Moran, Simon Coleman, Jon Kelly, Shun-Hwa Wei, Po-Yin Chen, Yun-Hsin Huang, Tsung-Ping Chen, Yu-Chia Kuo, Yu-Chi Wei, Chih-Hsuan Li, Da-Yu Huang, Hsuan-Kai Kao, Ting-Wei Lin, Li SuComments: IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024. 14 pages, 7 figures. Dataset is available on: this https URL and this https URLSubjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
In cross-modal music processing, translation between visual, auditory, and semantic content opens up new possibilities as well as challenges. The construction of such a transformative scheme depends upon a benchmark corpus with a comprehensive data infrastructure. In particular, the assembly of a large-scale cross-modal dataset presents major challenges. In this paper, we present the MOSA (Music mOtion with Semantic Annotation) dataset, which contains high quality 3-D motion capture data, aligned audio recordings, and note-by-note semantic annotations of pitch, beat, phrase, dynamic, articulation, and harmony for 742 professional music performances by 23 professional musicians, comprising more than 30 hours and 570 K notes of data. To our knowledge, this is the largest cross-modal music dataset with note-level annotations to date. To demonstrate the usage of the MOSA dataset, we present several innovative cross-modal music information retrieval (MIR) and musical content generation tasks, including the detection of beats, downbeats, phrase, and expressive contents from audio, video and motion data, and the generation of musicians' body motion from given music audio. The dataset and codes are available alongside this publication (this https URL).
New submissions for Tuesday, 11 June 2024 (showing 19 of 19 entries )
- [20] arXiv:2406.05199 (cross-list from eess.AS) [pdf, ps, html, other]
-
Title: XANE: eXplainable Acoustic Neural EmbeddingsSri Harsha Dumpala, Dushyant Sharma, Chandramouli Shama Sastri, Stanislav Kruchinin, James Fosburgh, Patrick A. NaylorSubjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
We present a novel method for extracting neural embeddings that model the background acoustics of a speech signal. The extracted embeddings are used to estimate specific parameters related to the background acoustic properties of the signal in a non-intrusive manner, which allows the embeddings to be explainable in terms of those parameters. We illustrate the value of these embeddings by performing clustering experiments on unseen test data and show that the proposed embeddings achieve a mean F1 score of 95.2\% for three different tasks, outperforming significantly the WavLM based signal embeddings. We also show that the proposed method can explain the embeddings by estimating 14 acoustic parameters characterizing the background acoustics, including reverberation and noise levels, overlapped speech detection, CODEC type detection and noise type detection with high accuracy and a real-time factor 17 times lower than an external baseline method.
- [21] arXiv:2406.05286 (cross-list from eess.AS) [pdf, ps, html, other]
-
Title: Signal processing algorithm effective for sound quality of hearing loss simulatorsComments: This paper has been accepted for publication in Interspeech 2024Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
Hearing loss (HL) simulators, which allow normal hearing (NH) listeners to experience HL, have been used in speech intelligibility experiments, but not in sound quality experiments due to perceptible distortion. If they produced less distortion, they might be useful for NH listeners to evaluate the sound quality of, for example, hearing aids. We conducted perceptual sound quality experiments to compare the Cambridge version of HL simulator (CamHLS) and the Wakayama version of the HL simulator (WHIS), which has the two algorithms of filterbank analysis synthesis (FBAS) and direct time-varying filter (DTVF). The experimental results showed that WHIS with DTVF produces less perceptible distortion in speech sounds than CamHLS and WHIS with FBAS, even when the nonlinear process is working. This advantage is mainly due to the use of the DTVF algorithm, which could be applied to various signal synthesis applications with filterbank analysis.
- [22] arXiv:2406.05325 (cross-list from eess.AS) [pdf, ps, html, other]
-
Title: LDM-SVC: Latent Diffusion Model Based Zero-Shot Any-to-Any Singing Voice Conversion with Singer GuidanceComments: Accepted by Interspeech 2024Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
Any-to-any singing voice conversion (SVC) is an interesting audio editing technique, aiming to convert the singing voice of one singer into that of another, given only a few seconds of singing data. However, during the conversion process, the issue of timbre leakage is inevitable: the converted singing voice still sounds like the original singer's voice. To tackle this, we propose a latent diffusion model for SVC (LDM-SVC) in this work, which attempts to perform SVC in the latent space using an LDM. We pretrain a variational autoencoder structure using the noted open-source So-VITS-SVC project based on the VITS framework, which is then used for the LDM training. Besides, we propose a singer guidance training method based on classifier-free guidance to further suppress the timbre of the original singer. Experimental results show the superiority of the proposed method over previous works in both subjective and objective evaluations of timbre similarity.
- [23] arXiv:2406.05341 (cross-list from eess.AS) [pdf, ps, html, other]
-
Title: Diversifying and Expanding Frequency-Adaptive Convolution Kernels for Sound Event DetectionComments: Accepted to INTERSPEECH 2024Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
Frequency dynamic convolution (FDY conv) has shown the state-of-the-art performance in sound event detection (SED) using frequency-adaptive kernels obtained by frequency-varying combination of basis kernels. However, FDY conv lacks an explicit mean to diversify frequency-adaptive kernels, potentially limiting the performance. In addition, size of basis kernels is limited while time-frequency patterns span larger spectro-temporal range. Therefore, we propose dilated frequency dynamic convolution (DFD conv) which diversifies and expands frequency-adaptive kernels by introducing different dilation sizes to basis kernels. Experiments showed advantages of varying dilation sizes along frequency dimension, and analysis on attention weight variance proved dilated basis kernels are effectively diversified. By adapting class-wise median filter with intersection-based F1 score, proposed DFD-CRNN outperforms FDY-CRNN by 3.12% in terms of polyphonic sound detection score (PSDS).
- [24] arXiv:2406.05359 (cross-list from eess.AS) [pdf, ps, html, other]
-
Title: Towards Lightweight Speaker Verification via Adaptive Neural Network QuantizationComments: submitted to IEEE/ACM Transactions on Audio Speech and Language Processing (Under Review)Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
Modern speaker verification (SV) systems typically demand expensive storage and computing resources, thereby hindering their deployment on mobile devices. In this paper, we explore adaptive neural network quantization for lightweight speaker verification. Firstly, we propose a novel adaptive uniform precision quantization method which enables the dynamic generation of quantization centroids customized for each network layer based on k-means clustering. By applying it to the pre-trained SV systems, we obtain a series of quantized variants with different bit widths. To enhance the performance of low-bit quantized models, a mixed precision quantization algorithm along with a multi-stage fine-tuning (MSFT) strategy is further introduced. Unlike uniform precision quantization, mixed precision approach allows for the assignment of varying bit widths to different network layers. When bit combination is determined, MSFT is employed to progressively quantize and fine-tune network in a specific order. Finally, we design two distinct binary quantization schemes to mitigate performance degradation of 1-bit quantized models: the static and adaptive quantizers. Experiments on VoxCeleb demonstrate that lossless 4-bit uniform precision quantization is achieved on both ResNets and DF-ResNets, yielding a promising compression ratio of around 8. Moreover, compared to uniform precision approach, mixed precision quantization not only obtains additional performance improvements with a similar model size but also offers the flexibility to generate bit combination for any desirable model size. In addition, our suggested 1-bit quantization schemes remarkably boost the performance of binarized models. Finally, a thorough comparison with existing lightweight SV systems reveals that our proposed models outperform all previous methods by a large margin across various model size ranges.
- [25] arXiv:2406.05370 (cross-list from cs.CL) [pdf, ps, html, other]
-
Title: VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech SynthesizersSubjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
This paper introduces VALL-E 2, the latest advancement in neural codec language models that marks a milestone in zero-shot text-to-speech synthesis (TTS), achieving human parity for the first time. Based on its predecessor, VALL-E, the new iteration introduces two significant enhancements: Repetition Aware Sampling refines the original nucleus sampling process by accounting for token repetition in the decoding history. It not only stabilizes the decoding but also circumvents the infinite loop issue. Grouped Code Modeling organizes codec codes into groups to effectively shorten the sequence length, which not only boosts inference speed but also addresses the challenges of long sequence modeling. Our experiments on the LibriSpeech and VCTK datasets show that VALL-E 2 surpasses previous systems in speech robustness, naturalness, and speaker similarity. It is the first of its kind to reach human parity on these benchmarks. Moreover, VALL-E 2 consistently synthesizes high-quality speech, even for sentences that are traditionally challenging due to their complexity or repetitive phrases. The advantages of this work could contribute to valuable endeavors, such as generating speech for individuals with aphasia or people with amyotrophic lateral sclerosis. Demos of VALL-E 2 will be posted to this https URL.
- [26] arXiv:2406.05401 (cross-list from eess.AS) [pdf, ps, html, other]
-
Title: Should you use a probabilistic duration model in TTS? Probably! Especially for spontaneous speechComments: 5 pages, 2 figures. Final version, accepted to Interspeech 2024Subjects: Audio and Speech Processing (eess.AS); Human-Computer Interaction (cs.HC); Sound (cs.SD)
Converting input symbols to output audio in TTS requires modelling the durations of speech sounds. Leading non-autoregressive (NAR) TTS models treat duration modelling as a regression problem. The same utterance is then spoken with identical timings every time, unlike when a human speaks. Probabilistic models of duration have been proposed, but there is mixed evidence of their benefits. However, prior studies generally only consider speech read aloud, and ignore spontaneous speech, despite the latter being both a more common and a more variable mode of speaking. We compare the effect of conventional deterministic duration modelling to durations sampled from a powerful probabilistic model based on conditional flow matching (OT-CFM), in three different NAR TTS approaches: regression-based, deep generative, and end-to-end. Across four different corpora, stochastic duration modelling improves probabilistic NAR TTS approaches, especially for spontaneous speech. Please see this https URL for audio and resources.
- [27] arXiv:2406.05551 (cross-list from eess.AS) [pdf, ps, html, other]
-
Title: Autoregressive Diffusion Transformer for Text-to-Speech SynthesisSubjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
Audio language models have recently emerged as a promising approach for various audio generation tasks, relying on audio tokenizers to encode waveforms into sequences of discrete symbols. Audio tokenization often poses a necessary compromise between code bitrate and reconstruction accuracy. When dealing with low-bitrate audio codes, language models are constrained to process only a subset of the information embedded in the audio, which in turn restricts their generative capabilities. To circumvent these issues, we propose encoding audio as vector sequences in continuous space $\mathbb R^d$ and autoregressively generating these sequences using a decoder-only diffusion transformer (ARDiT). Our findings indicate that ARDiT excels in zero-shot text-to-speech and exhibits performance that compares to or even surpasses that of state-of-the-art models. High-bitrate continuous speech representation enables almost flawless reconstruction, allowing our model to achieve nearly perfect speech editing. Our experiments reveal that employing Integral Kullback-Leibler (IKL) divergence for distillation at each autoregressive step significantly boosts the perceived quality of the samples. Simultaneously, it condenses the iterative sampling process of the diffusion model into a single step. Furthermore, ARDiT can be trained to predict several continuous vectors in one step, significantly reducing latency during sampling. Impressively, one of our models can generate $170$ ms of $24$ kHz speech per evaluation step with minimal degradation in performance. Audio samples are available at this http URL .
- [28] arXiv:2406.05629 (cross-list from cs.CV) [pdf, ps, html, other]
-
Title: Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and LanguageComments: Computer Vision and Pattern Recognition 2024Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
We present DenseAV, a novel dual encoder grounding architecture that learns high-resolution, semantically meaningful, and audio-visually aligned features solely through watching videos. We show that DenseAV can discover the ``meaning'' of words and the ``location'' of sounds without explicit localization supervision. Furthermore, it automatically discovers and distinguishes between these two types of associations without supervision. We show that DenseAV's localization abilities arise from a new multi-head feature aggregation operator that directly compares dense image and audio representations for contrastive learning. In contrast, many other systems that learn ``global'' audio and video representations cannot localize words and sound. Finally, we contribute two new datasets to improve the evaluation of AV representations through speech and sound prompted semantic segmentation. On these and other datasets we show DenseAV dramatically outperforms the prior art on speech and sound prompted semantic segmentation. DenseAV outperforms the previous state-of-the-art, ImageBind, on cross-modal retrieval using fewer than half of the parameters. Project Page: \href{this https URL}{this https URL}
- [29] arXiv:2406.05806 (cross-list from cs.CL) [pdf, ps, html, other]
-
Title: Do Prompts Really Prompt? Exploring the Prompt Understanding Capability of WhisperComments: In progressSubjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
This research explores the interaction between Whisper, a high-performing speech recognition model, and information in prompts. Our results unexpectedly show that Whisper may not fully grasp textual prompts as anticipated. Additionally, we find that performance improvement is not guaranteed even with stronger adherence to the topic information in textual prompts. It is also noted that English prompts generally outperform Mandarin ones on datasets of both languages, likely due to differences in training data distributions for these languages. Conversely, we discover that Whisper exhibits awareness of misleading information in language tokens by effectively ignoring incorrect language tokens and focusing on the correct ones. In summary, this work raises questions about Whisper's prompt understanding capability and encourages further studies.
- [30] arXiv:2406.05876 (cross-list from cs.CL) [pdf, ps, html, other]
-
Title: Zero-Shot End-To-End Spoken Question Answering In Medical DomainComments: Accepted to INTERSPEECH 2024Journal-ref: InterSpeech 2024Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
In the rapidly evolving landscape of spoken question-answering (SQA), the integration of large language models (LLMs) has emerged as a transformative development. Conventional approaches often entail the use of separate models for question audio transcription and answer selection, resulting in significant resource utilization and error accumulation. To tackle these challenges, we explore the effectiveness of end-to-end (E2E) methodologies for SQA in the medical domain. Our study introduces a novel zero-shot SQA approach, compared to traditional cascade systems. Through a comprehensive evaluation conducted on a new open benchmark of 8 medical tasks and 48 hours of synthetic audio, we demonstrate that our approach requires up to 14.7 times fewer resources than a combined 1.3B parameters LLM with a 1.55B parameters ASR model while improving average accuracy by 0.5\%. These findings underscore the potential of E2E methodologies for SQA in resource-constrained contexts.
- [31] arXiv:2406.05914 (cross-list from eess.AS) [pdf, ps, html, other]
-
Title: Soundscape Captioning using Sound Affective Quality Network and Large Language ModelComments: Code: this https URLSubjects: Audio and Speech Processing (eess.AS); Sound (cs.SD); Signal Processing (eess.SP)
We live in a rich and varied acoustic world, which is experienced by individuals or communities as a soundscape. Computational auditory scene analysis, disentangling acoustic scenes by detecting and classifying events, focuses on objective attributes of sounds, such as their category and temporal characteristics, ignoring the effect of sounds on people and failing to explore the relationship between sounds and the emotions they evoke within a context. To fill this gap and to automate soundscape analysis, which traditionally relies on labour-intensive subjective ratings and surveys, we propose the soundscape captioning (SoundSCap) task. SoundSCap generates context-aware soundscape descriptions by capturing the acoustic scene, event information, and the corresponding human affective qualities. To this end, we propose an automatic soundscape captioner (SoundSCaper) composed of an acoustic model, SoundAQnet, and a general large language model (LLM). SoundAQnet simultaneously models multi-scale information about acoustic scenes, events, and perceived affective qualities, while LLM generates soundscape captions by parsing the information captured by SoundAQnet to a common language. The soundscape caption's quality is assessed by a jury of 16 audio/soundscape experts. The average score (out of 5) of SoundSCaper-generated captions is lower than the score of captions generated by two soundscape experts by 0.21 and 0.25, respectively, on the evaluation set and the model-unknown mixed external dataset with varying lengths and acoustic properties, but the differences are not statistically significant. Overall, SoundSCaper-generated captions show promising performance compared to captions annotated by soundscape experts. The models' code, LLM scripts, human assessment data and instructions, and expert evaluation statistics are all publicly available.
- [32] arXiv:2406.06111 (cross-list from eess.AS) [pdf, ps, html, other]
-
Title: JenGAN: Stacked Shifted Filters in GAN-Based Speech SynthesisComments: Accepted to Interspeech 2024Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD); Signal Processing (eess.SP)
Non-autoregressive GAN-based neural vocoders are widely used due to their fast inference speed and high perceptual quality. However, they often suffer from audible artifacts such as tonal artifacts in their generated results. Therefore, we propose JenGAN, a new training strategy that involves stacking shifted low-pass filters to ensure the shift-equivariant property. This method helps prevent aliasing and reduce artifacts while preserving the model structure used during inference. In our experimental evaluation, JenGAN consistently enhances the performance of vocoder models, yielding significantly superior scores across the majority of evaluation metrics.
- [33] arXiv:2406.06185 (cross-list from eess.AS) [pdf, ps, html, other]
-
Title: EARS: An Anechoic Fullband Speech Dataset Benchmarked for Speech Enhancement and DereverberationJulius Richter, Yi-Chiao Wu, Steven Krenn, Simon Welker, Bunlong Lay, Shinji Watanabe, Alexander Richard, Timo GerkmannComments: Accepted at Interspeech 2024Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
We release the EARS (Expressive Anechoic Recordings of Speech) dataset, a high-quality speech dataset comprising 107 speakers from diverse backgrounds, totaling in 100 hours of clean, anechoic speech data. The dataset covers a large range of different speaking styles, including emotional speech, different reading styles, non-verbal sounds, and conversational freeform speech. We benchmark various methods for speech enhancement and dereverberation on the dataset and evaluate their performance through a set of instrumental metrics. In addition, we conduct a listening test with 20 participants for the speech enhancement task, where a generative method is preferred. We introduce a blind test set that allows for automatic online evaluation of uploaded data. Dataset download links and automatic evaluation server can be found online.
- [34] arXiv:2406.06220 (cross-list from eess.AS) [pdf, ps, html, other]
-
Title: Label-Looping: Highly Efficient Decoding for TransducersSubjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
This paper introduces a highly efficient greedy decoding algorithm for Transducer inference. We propose a novel data structure using CUDA tensors to represent partial hypotheses in a batch that supports parallelized hypothesis manipulations. During decoding, our algorithm maximizes GPU parallelism by adopting a nested-loop design, where the inner loop consumes all blank predictions, while non-blank predictions are handled in the outer loop. Our algorithm is general-purpose and can work with both conventional Transducers and Token-and-Duration Transducers. Experiments show that the label-looping algorithm can bring a speedup up to 2.0X compared to conventional batched decoding algorithms when using batch size 32, and can be combined with other compiler or GPU call-related techniques to bring more speedup. We will open-source our implementation to benefit the research community.
- [35] arXiv:2406.06293 (cross-list from eess.AS) [pdf, ps, html, other]
-
Title: Sample Rate Independent Recurrent Neural Networks for Audio Effects ProcessingComments: Accepted for publication in Proc. DAFx24, Guildford, UK, September 2024Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD); Signal Processing (eess.SP)
In recent years, machine learning approaches to modelling guitar amplifiers and effects pedals have been widely investigated and have become standard practice in some consumer products. In particular, recurrent neural networks (RNNs) are a popular choice for modelling non-linear devices such as vacuum tube amplifiers and distortion circuitry. One limitation of such models is that they are trained on audio at a specific sample rate and therefore give unreliable results when operating at another rate. Here, we investigate several methods of modifying RNN structures to make them approximately sample rate independent, with a focus on oversampling. In the case of integer oversampling, we demonstrate that a previously proposed delay-based approach provides high fidelity sample rate conversion whilst additionally reducing aliasing. For non-integer sample rate adjustment, we propose two novel methods and show that one of these, based on cubic Lagrange interpolation of a delay-line, provides a significant improvement over existing methods. To our knowledge, this work provides the first in-depth study into this problem.
- [36] arXiv:2406.06371 (cross-list from cs.CL) [pdf, ps, html, other]
-
Title: mHuBERT-147: A Compact Multilingual HuBERT ModelComments: Extended version of the Interspeech 2024 paper of same nameSubjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
We present mHuBERT-147, the first general-purpose massively multilingual HuBERT speech representation model trained on 90K hours of clean, open-license data. To scale up the multi-iteration HuBERT approach, we use faiss-based clustering, achieving 5.2x faster label assignment over the original method. We also apply a new multilingual batching up-sampling strategy, leveraging both language and dataset diversity. After 3 training iterations and with only 95M parameters, mHuBERT-147 outperforms larger models trained on substantially more data. We rank second and first on the ML-SUPERB 10min/1h leaderboards respectively, with SOTA scores for all LID tasks. Across ASR/LID tasks, our model consistently surpasses XLS-R (300M params; 436K hours) and demonstrates strong competitiveness against the much larger MMS (1B params; 491K hours). Our findings suggest that mHuBERT-147 is a promising model for multilingual speech processing tasks, offering an unprecedented balance between high performance and parameter efficiency.
- [37] arXiv:2406.06403 (cross-list from cs.CL) [pdf, ps, html, other]
-
Title: Meta Learning Text-to-Speech Synthesis in over 7000 LanguagesFlorian Lux, Sarina Meyer, Lyonel Behringer, Frank Zalkow, Phat Do, Matt Coler, Emanuël A. P. Habets, Ngoc Thang VuComments: accepted at Interspeech 2024Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
In this work, we take on the challenging task of building a single text-to-speech synthesis system that is capable of generating speech in over 7000 languages, many of which lack sufficient data for traditional TTS development. By leveraging a novel integration of massively multilingual pretraining and meta learning to approximate language representations, our approach enables zero-shot speech synthesis in languages without any available data. We validate our system's performance through objective measures and human evaluation across a diverse linguistic landscape. By releasing our code and models publicly, we aim to empower communities with limited linguistic resources and foster further innovation in the field of speech technology.
- [38] arXiv:2406.06406 (cross-list from cs.CL) [pdf, ps, html, other]
-
Title: Controlling Emotion in Text-to-Speech with Natural Language PromptsComments: accepted at Interspeech 2024Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
In recent years, prompting has quickly become one of the standard ways of steering the outputs of generative machine learning models, due to its intuitive use of natural language. In this work, we propose a system conditioned on embeddings derived from an emotionally rich text that serves as prompt. Thereby, a joint representation of speaker and prompt embeddings is integrated at several points within a transformer-based architecture. Our approach is trained on merged emotional speech and text datasets and varies prompts in each training iteration to increase the generalization capabilities of the model. Objective and subjective evaluation results demonstrate the ability of the conditioned synthesis system to accurately transfer the emotions present in a prompt to speech. At the same time, precise tractability of speaker identities as well as overall high speech quality and intelligibility are maintained.
- [39] arXiv:2406.06438 (cross-list from cs.CL) [pdf, ps, html, other]
-
Title: Multimodal Contextualized Semantic Parsing from SpeechComments: 10 Pages, 3 figures, ACL 2024 MainSubjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
We introduce Semantic Parsing in Contextual Environments (SPICE), a task designed to enhance artificial agents' contextual awareness by integrating multimodal inputs with prior contexts. SPICE goes beyond traditional semantic parsing by offering a structured, interpretable framework for dynamically updating an agent's knowledge with new information, mirroring the complexity of human communication. We develop the VG-SPICE dataset, crafted to challenge agents with visual scene graph construction from spoken conversational exchanges, highlighting speech and visual data integration. We also present the Audio-Vision Dialogue Scene Parser (AViD-SP) developed for use on VG-SPICE. These innovations aim to improve multimodal information processing and integration. Both the VG-SPICE dataset and the AViD-SP model are publicly available.
Cross submissions for Tuesday, 11 June 2024 (showing 20 of 20 entries )
- [40] arXiv:2304.05600 (replaced) [pdf, ps, html, other]
-
Title: Looking Similar, Sounding Different: Leveraging Counterfactual Cross-Modal Pairs for Audiovisual Representation LearningComments: Accepted to CVPR 2024Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Audiovisual representation learning typically relies on the correspondence between sight and sound. However, there are often multiple audio tracks that can correspond with a visual scene. Consider, for example, different conversations on the same crowded street. The effect of such counterfactual pairs on audiovisual representation learning has not been previously explored. To investigate this, we use dubbed versions of movies and television shows to augment cross-modal contrastive learning. Our approach learns to represent alternate audio tracks, differing only in speech, similarly to the same video. Our results, from a comprehensive set of experiments investigating different training strategies, show this general approach improves performance on a range of downstream auditory and audiovisual tasks, without majorly affecting linguistic task performance overall. These findings highlight the importance of considering speech variation when learning scene-level audiovisual correspondences and suggest that dubbed audio can be a useful augmentation technique for training audiovisual models toward more robust performance on diverse downstream tasks.
- [41] arXiv:2305.17834 (replaced) [pdf, ps, html, other]
-
Title: Streaming Audio Transformers for Online Audio TaggingComments: Interspeech2024Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Transformers have emerged as a prominent model framework for audio tagging (AT), boasting state-of-the-art (SOTA) performance on the widely-used Audioset dataset. However, their impressive performance often comes at the cost of high memory usage, slow inference speed, and considerable model delay, rendering them impractical for real-world AT applications. In this study, we introduce streaming audio transformers (SAT) that combine the vision transformer (ViT) architecture with Transformer-Xl-like chunk processing, enabling efficient processing of long-range audio signals. Our proposed SAT is benchmarked against other transformer-based SOTA methods, achieving significant improvements in terms of mean average precision (mAP) at a delay of 2s and 1s, while also exhibiting significantly lower memory usage and computational overhead. Checkpoints are publicly available this https URL.
- [42] arXiv:2402.09508 (replaced) [pdf, ps, html, other]
-
Title: Arrange, Inpaint, and Refine: Steerable Long-term Music Audio Generation and Editing via Content-based ControlsSubjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Controllable music generation plays a vital role in human-AI music co-creation. While Large Language Models (LLMs) have shown promise in generating high-quality music, their focus on autoregressive generation limits their utility in music editing tasks. To address this gap, we propose a novel approach leveraging a parameter-efficient heterogeneous adapter combined with a masking training scheme. This approach enables autoregressive language models to seamlessly address music inpainting tasks. Additionally, our method integrates frame-level content-based controls, facilitating track-conditioned music refinement and score-conditioned music arrangement. We apply this method to fine-tune MusicGen, a leading autoregressive music generation model. Our experiments demonstrate promising results across multiple music editing tasks, offering more flexible controls for future AI-driven music editing tools. The source codes and a demo page showcasing our work are available at this https URL.
- [43] arXiv:2406.02178 (replaced) [pdf, ps, html, other]
-
Title: Audio Mamba: Selective State Spaces for Self-Supervised Audio RepresentationsComments: Accepted at INTERSPEECH 2024Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Despite its widespread adoption as the prominent neural architecture, the Transformer has spurred several independent lines of work to address its limitations. One such approach is selective state space models, which have demonstrated promising results for language modelling. However, their feasibility for learning self-supervised, general-purpose audio representations is yet to be investigated. This work proposes Audio Mamba, a selective state space model for learning general-purpose audio representations from randomly masked spectrogram patches through self-supervision. Empirical results on ten diverse audio recognition downstream tasks show that the proposed models, pretrained on the AudioSet dataset, consistently outperform comparable self-supervised audio spectrogram transformer (SSAST) baselines by a considerable margin and demonstrate better performance in dataset size, sequence length and model size comparisons.
- [44] arXiv:2406.02315 (replaced) [pdf, ps, html, other]
-
Title: An Independence-promoting Loss for Music Generation with Language ModelsComments: Accepted to ICML 2024Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Music generation schemes using language modeling rely on a vocabulary of audio tokens, generally provided as codes in a discrete latent space learnt by an auto-encoder. Multi-stage quantizers are often employed to produce these tokens, therefore the decoding strategy used for token prediction must be adapted to account for multiple codebooks: either it should model the joint distribution over all codebooks, or fit the product of the codebook marginal distributions. Modelling the joint distribution requires a costly increase in the number of auto-regressive steps, while fitting the product of the marginals yields an inexact model unless the codebooks are mutually independent. In this work, we introduce an independence-promoting loss to regularize the auto-encoder used as the tokenizer in language models for music generation. The proposed loss is a proxy for mutual information based on the maximum mean discrepancy principle, applied in reproducible kernel Hilbert spaces. Our criterion is simple to implement and train, and it is generalizable to other multi-stream codecs. We show that it reduces the statistical dependence between codebooks during auto-encoding. This leads to an increase in the generated music quality when modelling the product of the marginal distributions, while generating audio much faster than the joint distribution model.
- [45] arXiv:2406.02897 (replaced) [pdf, ps, html, other]
-
Title: LiveSpeech: Low-Latency Zero-shot Text-to-Speech via Autoregressive Modeling of Audio Discrete CodesSubjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Prior works have demonstrated zero-shot text-to-speech by using a generative language model on audio tokens obtained via a neural audio codec. It is still challenging, however, to adapt them to low-latency scenarios. In this paper, we present LiveSpeech - a fully autoregressive language model-based approach for zero-shot text-to-speech, enabling low-latency streaming of the output audio. To allow multiple token prediction within a single decoding step, we propose (1) using adaptive codebook loss weights that consider codebook contribution in each frame and focus on hard instances, and (2) grouping codebooks and processing groups in parallel. Experiments show our proposed models achieve competitive results to state-of-the-art baselines in terms of content accuracy, speaker similarity, audio quality, and inference speed while being suitable for low-latency streaming applications.
- [46] arXiv:2406.03240 (replaced) [pdf, ps, html, other]
-
Title: Generalized Source Tracing: Detecting Novel Audio Deepfake Algorithm with Real Emphasis and Fake Dispersion StrategyYuankun Xie, Ruibo Fu, Zhengqi Wen, Zhiyong Wang, Xiaopeng Wang, Haonnan Cheng, Long Ye, Jianhua TaoComments: Accepted by INTERSPEECH 2024Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
With the proliferation of deepfake audio, there is an urgent need to investigate their attribution. Current source tracing methods can effectively distinguish in-distribution (ID) categories. However, the rapid evolution of deepfake algorithms poses a critical challenge in the accurate identification of out-of-distribution (OOD) novel deepfake algorithms. In this paper, we propose Real Emphasis and Fake Dispersion (REFD) strategy for audio deepfake algorithm recognition, demonstrating its effectiveness in discriminating ID samples while identifying OOD samples. For effective OOD detection, we first explore current post-hoc OOD methods and propose NSD, a novel OOD approach in identifying novel deepfake algorithms through the similarity consideration of both feature and logits scores. REFD achieves 86.83% F1-score as a single system in Audio Deepfake Detection Challenge 2023 Track3, showcasing its state-of-the-art performance.
- [47] arXiv:2406.03247 (replaced) [pdf, ps, html, other]
-
Title: Genuine-Focused Learning using Mask AutoEncoder for Generalized Fake Audio DetectionXiaopeng Wang, Ruibo Fu, Zhengqi Wen, Zhiyong Wang, Yuankun Xie, Yukun Liu, Jianhua Tao, Xuefei Liu, Yongwei Li, Xin Qi, Yi Lu, Shuchen ShiComments: Accepted by INTERSPEECH 2024Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
The generalization of Fake Audio Detection (FAD) is critical due to the emergence of new spoofing techniques. Traditional FAD methods often focus solely on distinguishing between genuine and known spoofed audio. We propose a Genuine-Focused Learning (GFL) framework guided, aiming for highly generalized FAD, called GFL-FAD. This method incorporates a Counterfactual Reasoning Enhanced Representation (CRER) based on audio reconstruction using the Mask AutoEncoder (MAE) architecture to accurately model genuine audio features. To reduce the influence of spoofed audio during training, we introduce a genuine audio reconstruction loss, maintaining the focus on learning genuine data features. In addition, content-related bottleneck (BN) features are extracted from the MAE to supplement the knowledge of the original audio. These BN features are adaptively fused with CRER to further improve robustness. Our method achieves state-of-the-art performance with an EER of 0.25% on ASVspoof2019 LA.
- [48] arXiv:2309.07648 (replaced) [pdf, ps, html, other]
-
Title: Incorporating Class-based Language Model for Named Entity Recognition in Factorized Neural TransducerComments: Accepted in INTERSPEECH 2024Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Despite advancements of end-to-end (E2E) models in speech recognition, named entity recognition (NER) is still challenging but critical for semantic understanding. Previous studies mainly focus on various rule-based or attention-based contextual biasing algorithms. However, their performance might be sensitive to the biasing weight or degraded by excessive attention to the named entity list, along with a risk of false triggering. Inspired by the success of the class-based language model (LM) in NER in conventional hybrid systems and the effective decoupling of acoustic and linguistic information in the factorized neural Transducer (FNT), we propose C-FNT, a novel E2E model that incorporates class-based LMs into FNT. In C-FNT, the LM score of named entities can be associated with the name class instead of its surface form. The experimental results show that our proposed C-FNT significantly reduces error in named entities without hurting performance in general word recognition.
- [49] arXiv:2312.01053 (replaced) [pdf, ps, html, other]
-
Title: End-to-End Speech-to-Text Translation: A SurveyComments: 75 pagesSubjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Speech-to-text translation pertains to the task of converting speech signals in a language to text in another language. It finds its application in various domains, such as hands-free communication, dictation, video lecture transcription, and translation, to name a few. Automatic Speech Recognition (ASR), as well as Machine Translation(MT) models, play crucial roles in traditional ST translation, enabling the conversion of spoken language in its original form to written text and facilitating seamless cross-lingual communication. ASR recognizes spoken words, while MT translates the transcribed text into the target language. Such disintegrated models suffer from cascaded error propagation and high resource and training costs. As a result, researchers have been exploring end-to-end (E2E) models for ST translation. However, to our knowledge, there is no comprehensive review of existing works on E2E ST. The present survey, therefore, discusses the work in this direction. Our attempt has been to provide a comprehensive review of models employed, metrics, and datasets used for ST tasks, providing challenges and future research direction with new insights. We believe this review will be helpful to researchers working on various applications of ST models.
- [50] arXiv:2401.03689 (replaced) [pdf, ps, html, other]
-
Title: LUPET: Incorporating Hierarchical Information Path into Multilingual ASRComments: Accepted by Interspeech 2024Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
Toward high-performance multilingual automatic speech recognition (ASR), various types of linguistic information and model design have demonstrated their effectiveness independently. They include language identity (LID), phoneme information, language-specific processing modules, and cross-lingual self-supervised speech representation. It is expected that leveraging their benefits synergistically in a unified solution would further improve the overall system performance. This paper presents a novel design of a hierarchical information path, named LUPET, which sequentially encodes, from the shallow layers to deep layers, multiple aspects of linguistic and acoustic information at diverse granularity scales. The path starts from LID prediction, followed by acoustic unit discovery, phoneme sharing, and finally token recognition routed by a mixture-of-expert. ASR experiments are carried out on 10 languages in the Common Voice corpus. The results demonstrate the superior performance of LUPET as compared to the baseline systems. Most importantly, LUPET effectively mitigates the issue of performance compromise of high-resource languages with low-resource ones in the multilingual setting.
- [51] arXiv:2403.17864 (replaced) [pdf, ps, html, other]
-
Title: Synthetic training set generation using text-to-audio models for environmental sound classificationSubjects: Audio and Speech Processing (eess.AS); Sound (cs.SD); Signal Processing (eess.SP)
In the past few years, text-to-audio models have emerged as a significant advancement in automatic audio generation. Although they represent impressive technological progress, the effectiveness of their use in the development of audio applications remains uncertain. This paper aims to investigate these aspects, specifically focusing on the task of classification of environmental sounds. This study analyzes the performance of two different environmental classification systems when data generated from text-to-audio models is used for training. Two cases are considered: a) when the training dataset is augmented by data coming from two different text-to-audio models; and b) when the training dataset consists solely of synthetic audio generated. In both cases, the performance of the classification task is tested on real data. Results indicate that text-to-audio models are effective for dataset augmentation, whereas the performance of the models drops when relying on only generated audio.
- [52] arXiv:2405.19334 (replaced) [pdf, ps, html, other]
-
Title: LLMs Meet Multimodal Generation and Editing: A SurveyYingqing He, Zhaoyang Liu, Jingye Chen, Zeyue Tian, Hongyu Liu, Xiaowei Chi, Runtao Liu, Ruibin Yuan, Yazhou Xing, Wenhai Wang, Jifeng Dai, Yong Zhang, Wei Xue, Qifeng Liu, Yike Guo, Qifeng ChenComments: 52 Pages with 16 Figures, 12 Tables, and 545 References. GitHub Repository at: this https URLSubjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD)
With the recent advancement in large language models (LLMs), there is a growing interest in combining LLMs with multimodal learning. Previous surveys of multimodal large language models (MLLMs) mainly focus on multimodal understanding. This survey elaborates on multimodal generation and editing across various domains, comprising image, video, 3D, and audio. Specifically, we summarize the notable advancements with milestone works in these fields and categorize these studies into LLM-based and CLIP/T5-based methods. Then, we summarize the various roles of LLMs in multimodal generation and exhaustively investigate the critical technical components behind these methods and the multimodal datasets utilized in these studies. Additionally, we dig into tool-augmented multimodal agents that can leverage existing generative models for human-computer interaction. Lastly, we discuss the advancements in the generative AI safety field, investigate emerging applications, and discuss future prospects. Our work provides a systematic and insightful overview of multimodal generation and processing, which is expected to advance the development of Artificial Intelligence for Generative Content (AIGC) and world models. A curated list of all related papers can be found at this https URL
- [53] arXiv:2406.02859 (replaced) [pdf, ps, other]
-
Title: ConPCO: Preserving Phoneme Characteristics for Automatic Pronunciation Assessment Leveraging Contrastive Ordinal RegularizationBi-Cheng Yan, Wei-Cheng Chao, Jiun-Ting Li, Yi-Cheng Wang, Hsin-Wei Wang, Meng-Shin Lin, Berlin ChenComments: This paper has been withdrawn because the authors aim to achieve better organization in writing and more detailed experimental analysisSubjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
Automatic pronunciation assessment (APA) manages to evaluate the pronunciation proficiency of a second language (L2) learner in a target language. Existing efforts typically draw on regression models for proficiency score prediction, where the models are trained to estimate target values without explicitly accounting for phoneme-awareness in the feature space. In this paper, we propose a contrastive phonemic ordinal regularizer (ConPCO) tailored for regression-based APA models to generate more phoneme-discriminative features while considering the ordinal relationships among the regression targets. The proposed ConPCO first aligns the phoneme representations of an APA model and textual embeddings of phonetic transcriptions via contrastive learning. Afterward, the phoneme characteristics are retained by regulating the distances between inter- and intra-phoneme categories in the feature space while allowing for the ordinal relationships among the output targets. We further design and develop a hierarchical APA model to evaluate the effectiveness of our method. Extensive experiments conducted on the speechocean762 benchmark dataset suggest the feasibility and efficacy of our approach in relation to some cutting-edge baselines.