By Xupeng Chen, Ran Wang, Amirhossein Khalilian-Gourtani, Leyao Yu, Patricia Dugan, Daniel Friedman, Werner Doyle, Orrin Devinsky, Yao Wang, Adeen Flinker
Paper Published in Nature Machine Intelligence
Check our Demo Page
Our ECoG to Speech decoding framework is initially described in A Neural Speech Decoding Framework Leveraging Deep Learning and Speech Synthesis. We present a novel deep learning-based neural speech decoding framework that includes an ECoG Decoder that translates electrocorticographic (ECoG) signals from the cortex into interpretable speech parameters and a novel differentiable Speech Synthesizer that maps speech parameters to spectrograms. We develop a companion audio-to-audio auto-encoder consisting of a Speech Encoder and the same Speech Synthesizer to generate reference speech parameters to facilitate the training of the ECoG Decoder. This framework generates natural-sounding speech and is highly reproducible across a large cohort of participants (N = 48). We provide two-stage training pipeline with visualization tools.
-
Install
CUDA 11.0
withcuDNN 8
following the official installation guide of CUDA and cuDNN. -
Setup conda environment:
conda create -n Neural_Speech python=3.7 -y
conda activate Neural_Speech
# Install requirements
conda install pytorch torchvision torchaudio cudatoolkit=11 -c pytorch -y
# Clone NSD
git clone https://github.com/flinkerlab/neural_speech_decoding
cd neural_speech_decoding
# Install other requirements
pip install -r requirements.txt
Prepare the data in HDF5 format, refer to the prepare_data
Optionally, if you want to provide extra supervision for pitch and formant, please refer to the formant_pitch_label_extraction
Example Data is available HERE, you could download and put it in the example_data
folder.
Fill in the config files including configs/a2a_production.yaml,configs/e2a_production.yaml,configs/AllSubjectInfo.json,configs/train_param_production.json
following the example of participants HB02
usage: train_a2a.py [-h] [-c FILE] [--DENSITY DENSITY] [--wavebased WAVEBASED]
[--bgnoise_fromdata BGNOISE_FROMDATA] [--ignore_loading IGNORE_LOADING]
[--finetune FINETUNE] [--learnedmask LEARNEDMASK]
[--dynamicfiltershape DYNAMICFILTERSHAPE] [--formant_supervision FORMANT_SUPERVISION]
[--pitch_supervision PITCH_SUPERVISION] [--intensity_supervision INTENSITY_SUPERVISION]
[--n_filter_samples N_FILTER_SAMPLES] [--n_fft N_FFT] [--reverse_order REVERSE_ORDER]
[--lar_cap LAR_CAP] [--intensity_thres INTENSITY_THRES]
[--RNN_COMPUTE_DB_LOUDNESS RNN_COMPUTE_DB_LOUDNESS]
[--BIDIRECTION BIDIRECTION][--MAPPING_FROM_ECOG MAPPING_FROM_ECOG] [--OUTPUT_DIR OUTPUT_DIR]
[--COMPONENTKEY COMPONENTKEY][--trainsubject TRAINSUBJECT] [--testsubject TESTSUBJECT]
[--reshape RESHAPE][--ld_loss_weight LD_LOSS_WEIGHT] [--alpha_loss_weight ALPHA_LOSS_WEIGHT]
[--consonant_loss_weight CONSONANT_LOSS_WEIGHT] [--batch_size BATCH_SIZE]
[--param_file PARAM_FILE][--pretrained_model_dir PRETRAINED_MODEL_DIR] [--causal CAUSAL]
[--anticausal ANTICAUSAL][--rdropout RDROPOUT] [--epoch_num EPOCH_NUM] [--use_stoi USE_STOI]
[--use_denoise USE_DENOISE][--noise_db NOISE_DB]
Example usage:
python train_a2a.py --OUTPUT_DIR output/a2a/HB02 --trainsubject HB02 --testsubject HB02 \
--param_file configs/a2a_production.yaml --batch_size 16 --reshape 1 --DENSITY "HB" \
--wavebased 1 --n_filter_samples 80 --n_fft 256 --formant_supervision 1 \
--intensity_thres -1 --epoch_num 60
Same arguments as train_a2a.py
Example usage:
python train_e2a.py --OUTPUT_DIR output/e2a/resnet_HB02 --trainsubject HB02 \
--testsubject HB02 --param_file configs/e2a_production.yaml --batch_size 16 \
--MAPPING_FROM_ECOG ECoGMapping_ResNet --reshape 1 --DENSITY "HB" --wavebased 1 \
--dynamicfiltershape 0 --n_filter_samples 80 --n_fft 256 --formant_supervision 1 \
--intensity_thres -1 --epoch_num 60 --pretrained_model_dir output/a2a/HB02 --causal 0
We train 60 epochs for Speech to Speech and ECoG to Speech. Running on one A100 GPU usually take 6 hours for Speech to Speech and 10 hours for ECoG to Speech
Based on Observation and assessment of acoustic contamination of electrophysiological brain signals during speech production and sound perception. We did contamination analysis in contamination_analysis
In model.py
- GHMR (
class
): Gradient Harmonized Single-stage Detector, used for balance data distributions - spectrogram_loss (
func
): given ground truth and decoded spectrogram, calculate difference loss in multi scales - run_a2a_loss (
func
): loss function for Speech to Speech decoding - run_components_loss (
func
): loss function for ECoG to Speech decoding
In networks.py
- FormantEncoder (
class
registered as EncoderFormant): the Speech Encoder - FormantSysth (
class
registered as GeneratorFormant): the Speech Synthesizer - ECoGMapping_Bottleneck (
class
registered as ECoGMapping_ResNet): the ECoG Decoder using ResNet as backbone - ECoGMappingRNN (
class
registered as ECoGMapping_RNN): the ECoG Decoder using RNN as backbone - ECoGMapping_3D_SWIN (
class
registered as ECoGMapping_3D_SWIN): the ECoG Decoder using 3D SWIN as backbone
@article{chen2023neural,
title={A Neural Speech Decoding Framework Leveraging Deep Learning and Speech Synthesis},
author={Chen, Xupeng and Wang, Ran and Khalilian-Gourtani, Amirhossein and Yu, Leyao and Dugan, Patricia and Friedman, Daniel and Doyle, Werner and Devinsky, Orrin and Wang, Yao and Flinker, Adeen},
journal={Nature Machine Intelligence},
year={2024},
publisher={Nature Publishing Group UK London}
}