Noise and acoustic modeling with waveform generator in text-to-speech and neutral speech conversion

Mohammed Salah Al-Radhi; Tamás Gábor Csapó; Géza Németh

doi:10.1007/s11042-020-09783-9

This article focuses on developing a system for high-quality synthesized and converted speech by addressing three fundamental principles. First, because the noise-like component in state-of-the-art parametric vocoders (for example, STRAIGHT) is often not modeled accurately enough, a novel analytical approach for modeling unvoiced excitations using a temporal envelope is proposed. Discrete All Pole, Frequency Domain Linear Prediction, Low Pass Filter, and True envelopes are first studied and applied to the noise excitation signal in our continuous vocoder. Second, we build a deep-learning-based text-to-speech (TTS) model that converts written text into human-like speech using a feed-forward model and several sequence-to-sequence models (long short-term memory, gated recurrent unit, and a hybrid model). Third, a new voice conversion system is proposed that uses a continuous fundamental frequency to provide accurate time-aligned voiced segments. The results were evaluated with objective measures and subjective listening tests. Experimental results showed that the proposed models achieved the highest speaker similarity and better quality than conventional methods.

fatcat:5we3ryq6arb4xdxiblymuqwqlu

Open Access

