C047

SPEECH EMOTION RECOGNITION WITH CONSTANT-Q TRANSFORM AND MULTI-AXIS VISION TRANSFORMER

Ong Kah Liang, Dr. Lee Chin Poo, Prof. Lim Heng Siong, Dr. Lim Kian Ming, Takeki Mukaida

AFFILIATION
Faculty of Information Science & Technology, Multimedia University

Description of Invention

An approach named "SCQT-MaxViT" is proposed for speech emotion recognition, combining signal processing, computer vision, and deep learning techniques. The method uses the Constant-Q Transform (CQT) to convert speech waveforms into spectrograms that capture the fine-grained details of emotional expression. The Multi-Axis Vision Transformer (MaxViT) is then employed for representation learning and classification of the spectrograms. MaxViT incorporates a multi-axis self-attention mechanism that facilitates both local and global interactions within the network, allowing it to learn meaningful features. The dataset is augmented using random time masking to enhance the generalization capability of the model. The proposed SCQT-MaxViT method achieves superior accuracy across three datasets.
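
The random time masking step mentioned above can be illustrated with a minimal sketch. It is not the inventors' implementation: the function name `random_time_mask`, the parameters `max_width` and `num_masks`, and the choice of zero as the fill value are all assumptions made for illustration. The spectrogram is assumed to be a 2-D array of shape (frequency bins, time frames).

```python
import numpy as np

def random_time_mask(spec, max_width=20, num_masks=2, rng=None):
    """Return a copy of `spec` with randomly chosen time spans set to zero.

    spec      : 2-D array, shape (freq_bins, time_frames)
    max_width : assumed upper bound on the width of each masked span
    num_masks : assumed number of spans to mask
    """
    rng = np.random.default_rng() if rng is None else rng
    masked = spec.copy()  # leave the input spectrogram untouched
    n_frames = masked.shape[1]
    for _ in range(num_masks):
        # Draw a span width in [0, max_width], capped at the clip length.
        width = int(rng.integers(0, min(max_width, n_frames) + 1))
        # Draw a start frame so the span fits inside the spectrogram.
        start = int(rng.integers(0, n_frames - width + 1))
        # Zero every frequency bin over the chosen frames (fill value assumed).
        masked[:, start:start + width] = 0.0
    return masked

# Example with a dummy spectrogram of 84 bins by 100 frames.
spec = np.ones((84, 100))
augmented = random_time_mask(spec, rng=np.random.default_rng(0))
```

In practice, the input would be a CQT magnitude spectrogram, which could be computed with a library routine such as `librosa.cqt`. Each call produces a differently masked copy, so one recording can yield several training examples.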