This repository implements a cascaded speech-to-speech pipeline made of four consecutive parts:
- Voice Activity Detection (VAD): silero VAD v5
- Speech to Text (STT): Whisper checkpoints (including distilled versions)
- Language Model (LM): Any instruct model available on the Hugging Face Hub! 🤗
- Text to Speech (TTS): Parler-TTS🤗
The pipeline aims to provide a fully open and modular approach, leveraging models available in the Transformers library via the Hugging Face Hub. The level of modularity intended for each part is as follows:
- VAD: Uses the implementation from Silero's repo.
- STT: Uses Whisper models exclusively; however, any Whisper checkpoint can be used, enabling options like Distil-Whisper and French Distil-Whisper.
- LM: This part is fully modular and can be changed by simply modifying the Hugging Face Hub model ID. Select an instruct model, since the pipeline interacts with it in a chat format.
- TTS: The mini architecture of Parler-TTS is standard, but different checkpoints, including fine-tuned multilingual checkpoints, can be used.
The code is designed to facilitate easy modification. Each component is implemented as a class and can be re-implemented to match specific needs.
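As a rough illustration of this class-per-component design, here is a minimal sketch in which each stage reads from an input queue and writes to an output queue. The names `BaseHandler`, `UppercaseLM`, `BracketTTS`, and `run_pipeline` are hypothetical and are not taken from the repository; the two stages are trivial stand-ins for real models.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the stream


class BaseHandler:
    """Hypothetical base class: reads items from one queue, writes results to another."""

    def __init__(self, queue_in, queue_out):
        self.queue_in = queue_in
        self.queue_out = queue_out

    def process(self, item):
        raise NotImplementedError

    def run(self):
        while True:
            item = self.queue_in.get()
            if item is SENTINEL:
                self.queue_out.put(SENTINEL)  # propagate shutdown downstream
                break
            self.queue_out.put(self.process(item))


class UppercaseLM(BaseHandler):
    """Stand-in for a language-model stage; any other subclass can replace it."""

    def process(self, item):
        return item.upper()


class BracketTTS(BaseHandler):
    """Stand-in for a text-to-speech stage."""

    def process(self, item):
        return f"<audio:{item}>"


def run_pipeline(inputs):
    q_text, q_reply, q_audio = queue.Queue(), queue.Queue(), queue.Queue()
    handlers = [UppercaseLM(q_text, q_reply), BracketTTS(q_reply, q_audio)]
    threads = [threading.Thread(target=h.run) for h in handlers]
    for t in threads:
        t.start()
    for item in inputs:
        q_text.put(item)
    q_text.put(SENTINEL)
    outputs = []
    while True:
        out = q_audio.get()
        if out is SENTINEL:
            break
        outputs.append(out)
    for t in threads:
        t.join()
    return outputs
```

Swapping a stage then amounts to writing another subclass with a different `process` method and wiring it to the same queues.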
Clone the repository:
git clone https://github.com/huggingface/speech-to-speech.git
cd speech-to-speech
Install the required dependencies using uv:
uv pip install -r requirements.txt
The pipeline can be run in two ways:
- Server/Client approach: Models run on a server, and audio input/output are streamed from a client.
- Local approach: Models and audio input/output all run on the same machine.
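To run the server with Docker, first install the NVIDIA Container Toolkit:
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
Then start the container:
docker compose up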
To run the pipeline on the server:
python s2s_pipeline.py --recv_host 0.0.0.0 --send_host 0.0.0.0
Then run the client locally to handle sending microphone input and receiving generated audio:
python listen_and_play.py --host <IP address of your server>
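The server/client split can be pictured as fixed-size audio chunks travelling over a socket in both directions. The sketch below is an assumption-laden toy, not the repository's protocol: `CHUNK_SIZE`, `recv_exact`, `echo_server`, and `stream_through_server` are hypothetical names, the "server" only echoes chunks back, and a local `socket.socketpair()` stands in for a real network connection.

```python
import socket
import threading

CHUNK_SIZE = 1024  # bytes per chunk; an illustrative value


def recv_exact(conn, size):
    """Read exactly `size` bytes, or fewer if the peer closes the connection."""
    data = b""
    while len(data) < size:
        packet = conn.recv(size - len(data))
        if not packet:
            break
        data += packet
    return data


def echo_server(conn):
    """Stand-in for the server side: returns every chunk unchanged."""
    while True:
        chunk = recv_exact(conn, CHUNK_SIZE)
        if not chunk:
            break
        conn.sendall(chunk)
    conn.close()


def stream_through_server(audio_bytes):
    """Send audio in fixed-size chunks and collect what the server sends back."""
    client, server = socket.socketpair()
    worker = threading.Thread(target=echo_server, args=(server,))
    worker.start()
    received = b""
    for start in range(0, len(audio_bytes), CHUNK_SIZE):
        # Pad the final chunk with zero bytes so every chunk has the same size.
        chunk = audio_bytes[start:start + CHUNK_SIZE].ljust(CHUNK_SIZE, b"\x00")
        client.sendall(chunk)
        received += recv_exact(client, CHUNK_SIZE)
    client.close()
    worker.join()
    return received
```

Fixed-size chunks keep the framing trivial: both sides always know how many bytes to read next, at the cost of padding the last chunk.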
To run on Mac, we recommend setting the flag --local_mac_optimal_settings:
python s2s_pipeline.py --local_mac_optimal_settings
You can also pass --device mps to have all the models set to device mps.
The --local_mac_optimal_settings flag sets the mode to local, as explained above, and changes the models to:
- LightningWhisperMLX
- MLX LM
- MeloTTS
Leverage Torch Compile for Whisper and Parler-TTS:
python s2s_pipeline.py \
--recv_host 0.0.0.0 \
--send_host 0.0.0.0 \
--lm_model_name microsoft/Phi-3-mini-4k-instruct \
--init_chat_role system \
--stt_compile_mode reduce-overhead \
--tts_compile_mode default
For the moment, modes capturing CUDA Graphs (reduce-overhead, max-autotune) are not compatible with streaming Parler-TTS.
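One way to guard against this incompatibility is to validate the compile mode before building the TTS part. The helper below is a hypothetical sketch: `validate_tts_compile_mode` and `CUDA_GRAPH_MODES` are illustrative names and are not part of the repository.

```python
# Compile modes named above as capturing CUDA Graphs.
CUDA_GRAPH_MODES = {"reduce-overhead", "max-autotune"}


def validate_tts_compile_mode(mode, streaming=True):
    """Return `mode` unchanged, or raise if it cannot be used with streaming TTS."""
    if streaming and mode in CUDA_GRAPH_MODES:
        raise ValueError(
            f"compile mode {mode!r} captures CUDA Graphs and is not "
            "compatible with streaming Parler-TTS; use 'default' instead"
        )
    return mode
```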
model_name, torch_dtype, and device are exposed for each part leveraging the Transformers' implementations: Speech to Text, Language Model, and Text to Speech. Specify the targeted pipeline part with the corresponding prefix:
- stt (Speech to Text)
- lm (Language Model)
- tts (Text to Speech)
For example:
--lm_model_name google/gemma-2b-it
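The prefix convention can be sketched with the standard library's `argparse`. This is only an illustration of the naming scheme, assuming one flag per part and field; `build_parser` and `group_by_part` are hypothetical helpers, and the repository's own argument handling may differ.

```python
import argparse

PARTS = ("stt", "lm", "tts")
FIELDS = ("model_name", "torch_dtype", "device")


def build_parser():
    """Register one flag per (part, field) pair, e.g. --lm_model_name."""
    parser = argparse.ArgumentParser()
    for part in PARTS:
        for field in FIELDS:
            parser.add_argument(f"--{part}_{field}", default=None)
    return parser


def group_by_part(namespace):
    """Strip the prefix and collect the explicitly set values for each part."""
    grouped = {part: {} for part in PARTS}
    for key, value in vars(namespace).items():
        part, _, field = key.partition("_")
        if part in grouped and value is not None:
            grouped[part][field] = value
    return grouped
```

For instance, parsing `--lm_model_name google/gemma-2b-it --stt_device cuda` and passing the result to `group_by_part` yields one dictionary per part, with the prefix removed from each key.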
Other generation parameters of the model's generate method can be set using the part's prefix + _gen_, e.g., --stt_gen_max_new_tokens 128. These parameters can be added to the pipeline part's arguments class if not already exposed (see LanguageModelHandlerArguments for example).
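A minimal sketch of how the `_gen_` convention might be unpacked, assuming the parsed arguments are available as a flat dictionary. `split_gen_kwargs` is a hypothetical helper and the model name in the usage note is a placeholder.

```python
def split_gen_kwargs(prefix, args):
    """Separate a part's generation parameters from its other settings.

    Keys starting with `<prefix>_gen_` go to the generation kwargs; other keys
    starting with `<prefix>_` go to the init kwargs. Keys belonging to other
    parts are ignored. Both prefixes are stripped from the returned keys.
    """
    gen_prefix = f"{prefix}_gen_"
    part_prefix = f"{prefix}_"
    init_kwargs, gen_kwargs = {}, {}
    for key, value in args.items():
        if key.startswith(gen_prefix):
            gen_kwargs[key[len(gen_prefix):]] = value
        elif key.startswith(part_prefix):
            init_kwargs[key[len(part_prefix):]] = value
    return init_kwargs, gen_kwargs
```

With `{"stt_model_name": "some/whisper-checkpoint", "stt_gen_max_new_tokens": 128}` and the prefix `stt`, the second returned dictionary is `{"max_new_tokens": 128}`, which could then be forwarded to the model's generate method.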
- --thresh: Threshold value to trigger voice activity detection.
- --min_speech_ms: Minimum duration of detected voice activity to be considered speech.
- --min_silence_ms: Minimum length of silence intervals for segmenting speech, balancing sentence cutting and latency reduction.
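To see how these three parameters interact, consider a simplified segmenter over per-frame speech probabilities. This is a sketch under stated assumptions, not Silero VAD's actual algorithm: `segment_speech` is a hypothetical function, and `probs` and `frame_ms` stand for a list of frame-level probabilities and the duration of each frame.

```python
def segment_speech(probs, frame_ms, thresh, min_speech_ms, min_silence_ms):
    """Return (start_ms, end_ms) speech segments from frame-level probabilities.

    A frame counts as speech when its probability reaches `thresh`. A segment
    is closed once the trailing silence reaches `min_silence_ms`, and it is
    kept only if it spans at least `min_speech_ms`.
    """
    segments = []
    start = None        # index of the first frame of the current candidate
    last_speech = None  # index of the most recent speech frame

    def close_segment():
        duration_ms = (last_speech - start + 1) * frame_ms
        if duration_ms >= min_speech_ms:
            segments.append((start * frame_ms, (last_speech + 1) * frame_ms))

    for i, prob in enumerate(probs):
        if prob >= thresh:
            if start is None:
                start = i
            last_speech = i
        elif start is not None:
            silence_ms = (i - last_speech) * frame_ms
            if silence_ms >= min_silence_ms:
                close_segment()
                start = None
    if start is not None:
        close_segment()  # flush a segment still open at the end of the input
    return segments
```

A larger `min_silence_ms` tolerates longer pauses inside a sentence but delays the moment a segment is handed to the next stage, which is the cutting versus latency trade-off mentioned above.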
- --init_chat_role: Defaults to None. Sets the initial role in the chat template, if applicable. Refer to the model's card to set this value (e.g. for Phi-3-mini-4k-instruct you have to set --init_chat_role system).
- --init_chat_prompt: Defaults to "You are a helpful AI assistant." Required when setting --init_chat_role.
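The relationship between the two options can be sketched as follows, assuming the chat history is kept as a list of role/content dictionaries in the style commonly used with chat templates. `build_initial_chat` is a hypothetical helper, not the repository's code.

```python
DEFAULT_PROMPT = "You are a helpful AI assistant."


def build_initial_chat(init_chat_role=None, init_chat_prompt=DEFAULT_PROMPT):
    """Return the starting chat history for the language model.

    With no initial role the history starts empty. When a role is given,
    a non-empty prompt is required to fill that first message.
    """
    if init_chat_role is None:
        return []
    if not init_chat_prompt:
        raise ValueError("init_chat_prompt is required when init_chat_role is set")
    return [{"role": init_chat_role, "content": init_chat_prompt}]
```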
- --description: Sets the description for the Parler-TTS generated voice. Defaults to: "A female speaker with a slightly low-pitched voice delivers her words quite expressively, in a very confined sounding environment with clear audio quality. She speaks very fast."
- --play_steps_s: Specifies the duration of the first chunk sent during streaming output from Parler-TTS, impacting readiness and decoding steps.
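The link between --play_steps_s and decoding steps can be sketched as a simple conversion, assuming the duration in seconds is multiplied by the audio codec's frame rate (decoder steps per second of audio). `play_steps_from_seconds` is a hypothetical helper, and the frame rate of 86 used in the usage note is an illustrative value rather than one taken from this document.

```python
def play_steps_from_seconds(play_steps_s, frame_rate):
    """Convert a chunk duration in seconds into a whole number of decoder steps."""
    return int(frame_rate * play_steps_s)
```

Under that assumption, `play_steps_from_seconds(0.5, 86)` gives 43 steps: a smaller value makes the first audio chunk available sooner, while a larger one means more decoding steps before anything is played.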
@misc{Silero VAD,
author = {Silero Team},
title = {Silero VAD: pre-trained enterprise-grade Voice Activity Detector (VAD), Number Detector and Language Classifier},
year = {2021},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/snakers4/silero-vad}},
commit = {insert_some_commit_here},
email = {hello@silero.ai}
}
@misc{gandhi2023distilwhisper,
title={Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling},
author={Sanchit Gandhi and Patrick von Platen and Alexander M. Rush},
year={2023},
eprint={2311.00430},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
@misc{lacombe-etal-2024-parler-tts,
author = {Yoach Lacombe and Vaibhav Srivastav and Sanchit Gandhi},
title = {Parler-TTS},
year = {2024},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/huggingface/parler-tts}}
}