Compositional Embeddings and Their Applications in Speaker Diarization

Li, Zeqian

ETD

Public Deposited

This dissertation focuses on the task of speaker diarization. Our contributions are as follows: 1) We compare loss functions used to train speaker embedding models and find that ArcFace loss performs best with a large training set, while cross-entropy loss performs best with a small training set; 2) We propose a novel method called compositional embedding that enables an embedding to represent a set of classes instead of just one; integrated into a speaker diarization pipeline, this approach achieves a state-of-the-art result on a public benchmark, AMI-Headset Mix, with a diarization error rate (DER) of 22.19% compared with the previous 23.82%; 3) We introduce the problem of compositional clustering, which both partitions data into clusters and models their compositional relationships; the proposed method achieves 96.2% accuracy on a dataset of 15,000 samples created from LibriSpeech, outperforming the best baseline's 88.4%, and 96.6% accuracy on a dataset created from OmniGlot, outperforming the best baseline's 88.7%; 4) We improve an enrollment-based speaker diarization system by using a speaker embedding model trained in the compositional embedding manner and fine-tuning it with ArcFace loss.

Creator