- Uses transfer learning with the ECAPA-TDNN model pre-trained on the VoxCeleb2 dataset.
- Intra-voice assistant comparisons: Achieved accuracies of 83.33% (iOS) and 66.67% (Alexa) for text-independent samples and 50% for text-dependent samples.
- Inter-voice assistant comparisons (Alexa, Siri, Google Assistant, Cortana): 100% accuracy for text-independent, 80% for text-dependent.
- Demonstrates the effectiveness of transfer learning and ECAPA-TDNN model for secure speaker verification across speech assistant versions.
- Valuable insights for enhancing speaker verification in the context of speech assistants.
- Speaker verification relies on speech characteristics such as pitch, formants, spectral envelope, MFCCs, and prosody.
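As an illustration of two of these features, the minimal sketch below extracts MFCCs and Mel-filterbank energies with torchaudio; the file name `sample.wav` is a placeholder, and this is not code from the study.

```python
import torchaudio

# Load a placeholder recording; shape is [channels, samples]
waveform, sample_rate = torchaudio.load("sample.wav")

# 13 Mel-frequency cepstral coefficients per frame
mfcc = torchaudio.transforms.MFCC(sample_rate=sample_rate, n_mfcc=13)(waveform)

# Mel filterbank energies (a mel spectrogram), a common alternative front-end
mel = torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate, n_mels=80)(waveform)

print(mfcc.shape, mel.shape)  # [channels, 13, frames], [channels, 80, frames]
```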
- "Voice prints" represent a speaker's unique vocal qualities.
- Two types of speaker verification methods: text-dependent and text-independent.
- Transfer learning employs pre-trained models to improve performance when labeled data is scarce.
- The ECAPA-TDNN model from the SpeechBrain toolkit is used in this study for transfer learning on virtual assistants.
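A minimal sketch of loading this pre-trained ECAPA-TDNN model through SpeechBrain's inference interface; the HuggingFace source string is SpeechBrain's standard release, the two audio paths are placeholders, and the import path follows older SpeechBrain versions (`speechbrain.pretrained`; newer releases expose the same class under `speechbrain.inference`).

```python
from speechbrain.pretrained import SpeakerRecognition

# Download/cache the pre-trained ECAPA-TDNN verification model
verifier = SpeakerRecognition.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained_models/spkrec-ecapa-voxceleb",
)

# Returns a cosine-similarity score and a same-speaker decision for two files
score, prediction = verifier.verify_files("enrol.wav", "test.wav")
print(f"score={score.item():.3f}, same_speaker={bool(prediction)}")
```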
- A custom audio dataset was created with a subset selected for analysis.
- Organized into (a pair-enumeration sketch follows this list):
  - Intra-pair Comparisons:
    - Siri Versions (iOS 9 vs iOS 10 vs iOS 11)
    - Alexa Versions (3rd gen vs 4th gen vs 5th gen)
  - Inter-pair Comparisons:
    - Alexa
    - Siri
    - Google Assistant
    - Cortana
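The pair-enumeration sketch referenced above; the recording names are hypothetical placeholders for the custom dataset, and the grouping rule (prefix before the underscore) is an assumption made for illustration.

```python
from itertools import combinations

# Hypothetical recordings from the custom dataset, keyed by assistant + version
recordings = [
    "siri_ios9", "siri_ios10", "siri_ios11",
    "alexa_gen3", "alexa_gen4", "alexa_gen5",
    "google_assistant", "cortana",
]

def assistant(name: str) -> str:
    """Assistant family, e.g. 'siri_ios9' -> 'siri'."""
    return name.split("_")[0]

pairs = list(combinations(recordings, 2))
intra_pairs = [(a, b) for a, b in pairs if assistant(a) == assistant(b)]  # same assistant, different versions
inter_pairs = [(a, b) for a, b in pairs if assistant(a) != assistant(b)]  # different assistants

print(f"{len(intra_pairs)} intra-pair trials, {len(inter_pairs)} inter-pair trials")
```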
- Uses the ECAPA-TDNN model, a state-of-the-art speaker recognition architecture that extends the TDNN design with a multi-layer feature aggregation (MFA) mechanism, Squeeze-and-Excitation (SE) blocks, and residual (Res2Net-style) blocks.
- Hyperparameters are specified in a YAML file (SpeechBrain's HyperPyYAML format).
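A minimal sketch of loading such a YAML file with HyperPyYAML; the file name `train_ecapa.yaml` and the keys printed at the end are assumptions about what the file defines.

```python
from hyperpyyaml import load_hyperpyyaml

# Parse the recipe's hyperparameters (objects referenced in the YAML are instantiated)
with open("train_ecapa.yaml") as f:
    hparams = load_hyperpyyaml(f)

# Typical keys such as batch size and learning rate, assuming the YAML defines them
print(hparams.get("batch_size"), hparams.get("lr"))
```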
- Data Loading makes use of a PyTorch dataset interface.
- Batching includes extracting speech features like spectrograms and MFCCs.
- SpeechBrain's `Brain` class simplifies the neural model training process (see the sketch below).
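A hedged sketch of how the `Brain` class can wrap the training loop; the module and hyperparameter names used below (`compute_features`, `mean_var_norm`, `embedding_model`, `classifier`, `compute_cost`, and the `sig`/`spk_id_encoded` batch keys) are assumptions modeled on typical SpeechBrain recipes, not the authors' exact code.

```python
import speechbrain as sb

class SpeakerBrain(sb.Brain):
    """Minimal training wrapper: define forward pass + loss, SpeechBrain runs the loop."""

    def compute_forward(self, batch, stage):
        batch = batch.to(self.device)
        wavs, lens = batch.sig                         # padded waveforms + relative lengths
        feats = self.hparams.compute_features(wavs)    # e.g. 80-dim filterbank features
        feats = self.modules.mean_var_norm(feats, lens)
        embeddings = self.modules.embedding_model(feats)
        return self.modules.classifier(embeddings), lens

    def compute_objectives(self, predictions, batch, stage):
        preds, lens = predictions
        spk_id, _ = batch.spk_id_encoded               # encoded speaker labels
        return self.hparams.compute_cost(preds, spk_id, lens)

# Typical usage, assuming hparams/run_opts come from the YAML file and the CLI:
# brain = SpeakerBrain(modules=hparams["modules"], opt_class=hparams["opt_class"],
#                      hparams=hparams, run_opts=run_opts)
# brain.fit(hparams["epoch_counter"], train_data, valid_data)
```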
- SpeechBrain can produce embeddings and verification outputs directly from pre-trained models such as ECAPA-TDNN.
- Data preprocessing: extract 80-dimensional filterbank features (see the sketch after this list).
- Model initialization: 5 TDNN layers, an attention mechanism, and an MLP classifier.
- Hyperparameter setting: epochs, batch size, learning rate, etc.
- Training: Trained on the VoxCeleb2 dataset.
- Validation and Testing: Evaluate on a validation set.
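The sketch referenced in the preprocessing step above, showing the pipeline pieces end to end: 80-dimensional filterbank extraction, an ECAPA-TDNN embedding model, and cosine-similarity scoring for evaluation. The constructor arguments, file names, and the 0.25 decision threshold are illustrative assumptions; in practice the pre-trained VoxCeleb weights would be loaded rather than using a randomly initialized model.

```python
import torch
import torchaudio
from speechbrain.lobes.features import Fbank
from speechbrain.lobes.models.ECAPA_TDNN import ECAPA_TDNN

# 80-dimensional filterbank front-end (matches the preprocessing step above)
compute_fbank = Fbank(n_mels=80)

# ECAPA-TDNN embedding model; randomly initialized here, pre-trained weights assumed in practice
embedding_model = ECAPA_TDNN(input_size=80, lin_neurons=192)
embedding_model.eval()

def embed(path: str) -> torch.Tensor:
    signal, _ = torchaudio.load(path)            # assumed 16 kHz mono audio
    feats = compute_fbank(signal)                # [batch, frames, 80]
    with torch.no_grad():
        return embedding_model(feats).squeeze()  # [192]

emb_a = embed("siri_ios9.wav")                   # placeholder file names
emb_b = embed("siri_ios11.wav")

score = torch.nn.functional.cosine_similarity(emb_a, emb_b, dim=0)
print(f"cosine score = {score.item():.3f}, same voice = {bool(score > 0.25)}")  # assumed threshold
```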
- Intra-pair text-dependent speaker verification (TDSV) analysis shows similarities across all versions of the same assistant, raising potential security concerns.
- Inter-pair TDSV analysis found matches involving Cortana, Google Assistant, and Alexa.
- Text-independent speaker verification (TISV) achieved higher accuracy than TDSV, owing to the model's capability to differentiate across varying texts.
- For better performance, additional training on a broader dataset of synthetic voices is recommended.
- The study emphasizes the potential of transfer learning and SpeechBrain for speaker verification, while acknowledging the challenges posed by synthetic voices.