
**Paper Title**: CCC-wav2vec 2.0: Clustering aided Cross Contrastive Self-Supervised Learning of Speech Representations. Published at IEEE SLT 2022 (arXiv link).

ccc-wav2vec 2.0 is a pre-training mechanism that uses clustering and an augmentation-based cross-contrastive loss as its self-supervised objective. Through the clustering module, we scale down the influence of negative examples that are highly similar to the positive. The cross-contrastive loss is computed between the encoder output of the original sample and the quantizer output of its augmentation, and vice versa, which makes the pre-training strategy more robust. The sketch below illustrates the cross-contrastive objective.
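The following is a minimal, illustrative PyTorch sketch of the cross-contrastive idea, not the repository's implementation: the tensor shapes, the temperature, and the `alpha`/`beta`/`gamma` weights (intended to mirror the paper's α, β, γ, configured via `cc_weights`) are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def info_nce(context, positives, negatives, temperature=0.1):
    """wav2vec 2.0-style contrastive loss over masked timesteps.

    context:   (T, D) context-network outputs
    positives: (T, D) quantized targets for the same timesteps
    negatives: (T, K, D) K distractors per timestep
    """
    pos = F.cosine_similarity(context, positives, dim=-1).unsqueeze(-1)  # (T, 1)
    neg = F.cosine_similarity(context.unsqueeze(1), negatives, dim=-1)   # (T, K)
    logits = torch.cat([pos, neg], dim=-1) / temperature                 # (T, 1+K)
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)  # the positive sits at index 0

def ccc_objective(c_orig, q_orig, n_orig, c_aug, q_aug, n_aug,
                  alpha=1.0, beta=0.5, gamma=0.5):
    """Cross-contrastive combination; the weights are placeholders."""
    loss = alpha * info_nce(c_orig, q_orig, n_orig)   # standard wav2vec 2.0 term
    loss += beta * info_nce(c_orig, q_aug, n_aug)     # encoder(orig) vs quantizer(aug)
    loss += gamma * info_nce(c_aug, q_orig, n_orig)   # encoder(aug) vs quantizer(orig)
    return loss
```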

**Primary Contributions:**

- We introduce an augmentation of the original sample and use its representations to add an auxiliary cross-contrastive loss to the existing contrastive loss of wav2vec 2.0.
- We demonstrate the usefulness of a clustering module that segregates the negative examples and thereby controls the effect of weak, non-informative negatives in the contrastive learning task (see the sketch after this list).
- Combining the two modules yields ccc-wav2vec 2.0, a robust pre-training approach that consistently outperforms wav2vec 2.0 on tasks such as ASR, domain adaptation, and zero-shot decoding.
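The clustering idea can be sketched just as briefly. This is an illustration under assumptions, not the repository's code: how `cluster_factor` and `scale_factor` actually enter the loss in the repo may differ.

```python
import torch

def cluster_weights(pos_labels, neg_labels, scale_factor=0.1):
    """Per-negative weights: 1.0 by default, `scale_factor` when a negative
    falls into the same cluster as the positive (a likely near-duplicate).

    pos_labels: (T,)   cluster id of each positive target
    neg_labels: (T, K) cluster id of each sampled negative
    """
    same = neg_labels == pos_labels.unsqueeze(-1)        # (T, K) boolean mask
    weights = torch.ones_like(same, dtype=torch.float32)
    weights[same] = scale_factor                         # down-weight near-duplicates
    return weights

def weighted_info_nce(pos_sim, neg_sim, weights, temperature=0.1):
    """InfoNCE where each negative's term in the denominator is weighted.

    pos_sim: (T,) similarity to the positive; neg_sim, weights: (T, K)
    """
    pos = torch.exp(pos_sim / temperature)
    neg = (weights * torch.exp(neg_sim / temperature)).sum(dim=-1)
    return -torch.log(pos / (pos + neg)).mean()
```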

## SUPERB Benchmark

The ccc-wav2vec 2.0 BASE model pre-trained on LibriSpeech-960h has been evaluated on multiple downstream tasks of the SUPERB benchmark. The proposed method comprehensively outperforms the baseline wav2vec 2.0 BASE model across the array of downstream tasks presented in SUPERB.

## Models

The reported WERs are without the use of any language model.

| Model | Pre-training data | Fine-tuning data | Model Links | WER (test-clean \| test-other) |
| --- | --- | --- | --- | --- |
| wav2vec 2.0 Base | LibriSpeech-360h | No fine-tuning | fairseq \| huggingface | --- |
| wav2vec 2.0 Base | LibriSpeech-360h | LibriSpeech-100h | fairseq \| huggingface | 12.8 \| 31.7 |
| ccc-wav2vec 2.0 Base | LibriSpeech-360h | No fine-tuning | fairseq \| huggingface | --- |
| ccc-wav2vec 2.0 Base | LibriSpeech-360h | LibriSpeech-100h | fairseq \| huggingface | 10.8 \| 27.7 |
| ccc-wav2vec 2.0 Base | LibriSpeech-960h | No fine-tuning | fairseq \| huggingface | --- |
| ccc-wav2vec 2.0 Base | LibriSpeech-960h | LibriSpeech-100h | fairseq \| huggingface | 5.5 \| 12.4 |
| ccc-wav2vec 2.0 Base SUPERB | LibriSpeech-960h | No fine-tuning | fairseq SUPERB model \| huggingface SUPERB model | --- |
- Pre-training and fine-tuning procedures can be found here.

## Requirements and Installation

- PyTorch version >= 1.10.0
- Python version >= 3.8
- For training new models, you'll also need an NVIDIA GPU and NCCL
- To install fairseq with ccc-wav2vec 2.0 and develop locally:

```bash
git clone https://github.com/Speech-Lab-IITM/CCC-wav2vec-2.0
cd CCC-wav2vec-2.0
pip install --editable ./
```
- For faster training install NVIDIA's apex library:

```bash
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" \
  --global-option="--deprecated_fused_adam" --global-option="--xentropy" \
  --global-option="--fast_multihead_attn" ./
```
- For large datasets install PyArrow: `pip install pyarrow`

- If you use Docker, make sure to increase the shared memory size, either with `--ipc=host` or `--shm-size`, as command-line options to `nvidia-docker run`.

- For augmentations to work, install torchaudio-augmentations:

```bash
git clone https://github.com/Speech-Lab-IITM/torchaudio-augmentations
cd torchaudio-augmentations
pip install --editable ./
```
- The clustering module runs on the GPU and needs fast-pytorch-kmeans to be installed: `pip install fast-pytorch-kmeans` (a usage sketch follows this list).
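A minimal usage sketch for fast-pytorch-kmeans on GPU-resident features; the feature shape and cluster count below are placeholders, not the repository's settings.

```python
import torch
from fast_pytorch_kmeans import KMeans

feats = torch.randn(2048, 256, device="cuda")  # placeholder feature vectors
kmeans = KMeans(n_clusters=16, mode="cosine")  # k-means with cosine similarity
labels = kmeans.fit_predict(feats)             # (2048,) cluster id per vector, on GPU
```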

## Parameters of interest

- The $\alpha$, $\beta$ and $\gamma$ values from the paper can be calibrated through the `cc_weights` parameter in the `criterion` section of the pre-training config.
- The `cluster_factor` and `scale_factor` parameters can be modified in the `model` section of the pre-training config (a hypothetical excerpt follows this list).
- The augmentations used for ccc-wav2vec 2.0 require the noise set of the MUSAN dataset. Its path must be specified in the `path_to_musan_noise_set` variable of the `__getitem__` method in the `raw_audio_dataset` file.
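For orientation, here is a hypothetical excerpt showing where these keys sit in a Hydra-style pre-training config. The key names follow this README, but the surrounding structure and all values are placeholders; consult the repository's actual config files.

```yaml
criterion:
  cc_weights: [1.0, 0.5, 0.5]  # assumed ordering: alpha, beta, gamma

model:
  cluster_factor: 16           # placeholder value
  scale_factor: 0.1            # placeholder value
```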

## Reference Code

1. fairseq: Facebook AI Research Sequence-to-Sequence Toolkit written in Python.