Irfan Chaugule and Satish R Sankaye
Speech Emotion Recognition (SER) is a pivotal technology for advancing human-computer interaction, with applications ranging from clinical diagnosis to customer-service analytics. While deep learning (DL) models have significantly improved SER accuracy by automatically learning feature representations from speech, they often produce high-dimensional, redundant, and computationally expensive feature sets. This complexity can hinder the deployment of SER systems in real-time applications and may not always yield the most discriminative information for classification. This paper proposes a novel hybrid framework that combines deep learning architectures for comprehensive feature extraction with metaheuristic algorithms for optimal feature selection. The primary objective is to engineer a feature space that is both maximally discriminative and minimally redundant, thereby improving emotion classification accuracy and computational efficiency on established benchmark datasets. We conceptualize a framework in which a hybrid Convolutional Recurrent Neural Network (CRNN) with an attention mechanism extracts a rich set of high-level features from speech spectrograms. Subsequently, a Genetic Algorithm (GA) is employed to navigate the complex search space and identify an optimal feature subset. The GA's fitness function is designed to balance classification performance against feature-set compactness. We anticipate that this hybrid approach will achieve higher classification accuracy and significantly faster inference than baseline models that use the full, unselected feature set or traditional dimensionality-reduction techniques. The conceptual validation is detailed for several benchmark corpora, including IEMOCAP, RAVDESS, and EMO-DB, with the aim of establishing a standard for developing high-performance, efficient SER models.
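The abstract describes the GA fitness function only at a high level. One common way to encode the stated accuracy/compactness trade-off in GA-based feature selection is a weighted sum of a cross-validated accuracy term and a subset-size penalty; the sketch below illustrates that idea under stated assumptions. The weight alpha, the SVM proxy classifier, and all identifiers are illustrative choices, not the authors' implementation.

```python
# Illustrative sketch (not the authors' implementation): a GA fitness function
# that trades classification accuracy against feature-subset size, as the
# abstract describes. The alpha weight and the SVM proxy classifier are
# assumptions for demonstration only.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def fitness(chromosome, X, y, alpha=0.9):
    """Score a binary chromosome that selects columns of the deep feature matrix X.

    fitness = alpha * mean CV accuracy on the selected features
              + (1 - alpha) * (1 - |selected| / |total|)   # reward compactness
    """
    mask = chromosome.astype(bool)
    if not mask.any():                      # an empty subset is invalid
        return 0.0
    acc = cross_val_score(SVC(kernel="rbf"), X[:, mask], y, cv=5).mean()
    compactness = 1.0 - mask.sum() / mask.size
    return alpha * acc + (1 - alpha) * compactness

# Usage with random stand-in data (in practice X would hold CRNN+attention
# features and y the emotion labels):
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 128))             # 200 utterances x 128 deep features
y = rng.integers(0, 4, size=200)            # 4 emotion classes
candidate = rng.integers(0, 2, size=128)    # one GA individual (bit mask)
print(fitness(candidate, X, y))
```

With this formulation, a larger alpha favors raw classification accuracy, while a smaller alpha pushes the GA toward smaller feature subsets and faster inference.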
Pages: 1231-1240