slash-trombone.github.io

Pink Trombone audios

Authors: Mateo Cámara, Zhiyuan Xu, Yisu Zong, and David Südholt.

Data: This website is a preview of some of the examples we mention in the paper. Six methods were applied to regenerate human sounds, and each was tried with four different sound representations. If you are interested in all the regenerated audio samples, you can find them here. If you are looking for the Pink Trombone generated dataset for training neural networks, you can find it here.

Pink Trombone audios without variation (4.1)

All methods perform very well at predicting articulatory parameters; the results are shown in the table below. Since the performance of each method can be computed objectively, we present only a few examples here.

original 1: regenerated 1:

original 2: regenerated 2:
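Because these samples are synthetic, the true control parameters are known, so a prediction can be scored directly against them. The sketch below shows one possible way to do that with a mean absolute error. The metric choice, the parameter names, and the values are all assumptions made for illustration; they are not taken from the paper.

```python
def mean_absolute_error(true_params, predicted_params):
    """Average absolute difference over parameters present in both dicts."""
    if true_params.keys() != predicted_params.keys():
        raise ValueError("parameter names do not match")
    errors = [abs(true_params[name] - predicted_params[name])
              for name in true_params]
    return sum(errors) / len(errors)


# Hypothetical parameter names and made-up values, for illustration only.
true_params = {"tongue_index": 20.0, "tongue_diameter": 2.5, "lip_diameter": 1.0}
predicted_params = {"tongue_index": 20.5, "tongue_diameter": 2.25, "lip_diameter": 1.0}

error = mean_absolute_error(true_params, predicted_params)
print(f"mean absolute error: {error:.3f}")
```

If the parameters span very different ranges, normalizing each one to a common range before averaging would probably give a fairer score.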

Pink Trombone audios without variation, with Gaussian noise (4.1)

In this part, different amounts of Gaussian white noise were added to the original signal, and we regenerated the Pink Trombone audio from the predicted articulatory parameters that define the signal. PLEASE MIND THE VOLUME: IT COULD BE VERY LOUD.
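As a rough illustration of how noise at a given level can be mixed into a signal, the sketch below scales Gaussian white noise to a target signal-to-noise ratio. It assumes the dB values on this page are SNRs; the function names, the test tone, and the sample rate are hypothetical and are not the code used in the paper. It uses only the standard library so it can be run as is.

```python
import math
import random


def add_white_noise(signal, snr_db, seed=None):
    """Return signal plus Gaussian white noise at the requested SNR in dB."""
    rng = random.Random(seed)
    signal_power = sum(x * x for x in signal) / len(signal)
    # SNR_dB = 10 * log10(P_signal / P_noise), solved for P_noise.
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def measured_snr_db(clean, noisy):
    """Estimate the SNR actually obtained, given the clean reference."""
    signal_power = sum(c * c for c in clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy))
    return 10.0 * math.log10(signal_power / noise_power)


# Hypothetical test tone: one second of a 220 Hz sine at 16 kHz.
sample_rate = 16000
tone = [math.sin(2 * math.pi * 220 * n / sample_rate) for n in range(sample_rate)]

for snr_db in (40, 20, 0, -10):
    noisy = add_white_noise(tone, snr_db, seed=0)
    print(snr_db, round(measured_snr_db(tone, noisy), 2))
```

At 0 dB the noise carries as much power as the signal, and at -10 dB it carries ten times as much, which is why the low-SNR samples can sound very loud.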

40dB 20dB 0dB -10dB
original
GA_mel
GA_mfcc
GA_stft
GA_multiscale
PSO_mel
PSO_mfcc
PSO_stft
PSO_multiscale
TRF_mel
TRF_mfcc
TRF_stft
TRF_multiscale
NM_mel
NM_mfcc
NM_stft
NM_multiscale
CMA-ES_mel
CMA-ES_mfcc
CMA-ES_stft
CMA-ES_multiscale
NN_mel
NN_mfcc
NN_stft
NN_multiscale
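
The row labels above follow a method_representation pattern, with six methods and four representations giving 24 regenerated versions per sample. A minimal sketch of how those labels could be enumerated, for example to build file names, is shown below; the helper name `sample_labels` is hypothetical, and the abbreviations are copied from the rows as they appear on this page.

```python
from itertools import product

# Abbreviations exactly as they appear in the row labels on this page.
METHODS = ["GA", "PSO", "TRF", "NM", "CMA-ES", "NN"]
REPRESENTATIONS = ["mel", "mfcc", "stft", "multiscale"]


def sample_labels(methods=METHODS, representations=REPRESENTATIONS):
    """Return every method_representation label, grouped by method."""
    return [f"{method}_{rep}" for method, rep in product(methods, representations)]


labels = sample_labels()
print(len(labels))
print(labels[:4])
```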

Pink Trombone audios with variation (4.2)

We tried several perceptual metrics (PESQ, PEAQ, ViSQOL, WARP-Q, and STOI) to measure how similar the sounds generated by the synthesizer are to human-produced ones. The results for each perceptual metric are given in Table 3 of our paper. You are also welcome and encouraged to judge the quality yourself, which is why this site exists.

vowel /a/ vowel /o/ yawn 1
original
GA_mel
GA_mfcc
GA_stft
GA_multiscale
PSO_mel
PSO_mfcc
PSO_stft
PSO_multiscale
TRF_mel
TRF_mfcc
TRF_stft
TRF_multiscale
NM_mel
NM_mfcc
NM_stft
NM_multiscale
CMA-ES_mel
CMA-ES_mfcc
CMA-ES_stft
CMA-ES_multiscale
NN_mel
NN_mfcc
NN_stft
NN_multiscale

Some interesting trials

Trial One

Based on the outputs of the optimization, which are the values of each control parameter, we not only fed them back to Pink Trombone to generate audio but also generated GIF files to show how each part of the vocal tract moves.

For example, the GIF of a yawn is shown below. (Movements have been slowed down to show the details.)

Trial Two

Original version: Regenerated version: