Knowing how people engage with your content is an increasingly valuable capability. Emotions are the invisible mental states associated with thoughts and feelings; because they cannot be observed directly, they must be inferred from observable behavior such as speech, gestures, and sounds.
Emotion Recognition in Conversations (ERC) analyzes textual, visual, and auditory information to identify the emotion expressed in each utterance of a dialogue. ERC is rapidly gaining importance as a way to analyze and mine multimedia content, with applications in AI-driven interviews, personalized conversational interfaces, user sentiment analysis, and contextualizing material on social media platforms such as YouTube, Facebook, and Twitter.
Many state-of-the-art methods for robust ERC rely solely on text-based processing, ignoring the large amount of information available in the auditory and visual channels.
The media analytics group at Sony Research India argues that fusing the three modalities present in ERC data (text, visual, and auditory) can enhance the performance and robustness of existing systems. An ERC system takes emotional expressions across the three modalities as input and predicts the corresponding emotion for each utterance.
Their new work introduces a Multimodal Fusion Network (M2FNet) that uses a novel multi-head fusion attention layer to take full advantage of modality-specific diversity. Audio and visual features are mapped into the latent space of the text features, enabling the network to produce emotionally rich and relevant representations. Using all three modalities already improves accuracy, and the proposed fusion mechanism improves it further.
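To make the idea concrete, here is a minimal PyTorch sketch of a cross-modal multi-head attention block in which text features act as queries and the projected audio and visual features act as keys and values. The module name, dimensions, and residual layout are illustrative assumptions, not the exact M2FNet layer.

```python
import torch
import torch.nn as nn

class FusionAttention(nn.Module):
    """Illustrative cross-modal fusion: text features attend over audio and
    visual features projected into the same latent space.
    (Hypothetical sketch, not the authors' exact M2FNet layer.)"""
    def __init__(self, text_dim=768, audio_dim=512, visual_dim=512, n_heads=8):
        super().__init__()
        # Project audio/visual features into the text latent space.
        self.audio_proj = nn.Linear(audio_dim, text_dim)
        self.visual_proj = nn.Linear(visual_dim, text_dim)
        self.attn = nn.MultiheadAttention(text_dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(text_dim)

    def forward(self, text_feats, audio_feats, visual_feats):
        # text_feats:   (batch, n_utterances, text_dim)
        # audio_feats:  (batch, n_utterances, audio_dim)
        # visual_feats: (batch, n_utterances, visual_dim)
        kv = torch.cat([self.audio_proj(audio_feats),
                        self.visual_proj(visual_feats)], dim=1)
        fused, _ = self.attn(query=text_feats, key=kv, value=kv)
        return self.norm(text_feats + fused)  # residual connection

# Quick shape check with random tensors
fusion = FusionAttention()
t, a, v = torch.randn(2, 10, 768), torch.randn(2, 10, 512), torch.randn(2, 10, 512)
print(fusion(t, a, v).shape)  # torch.Size([2, 10, 768])
```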
The approach has two important phases:
- The utterance level performs feature extraction for each individual utterance (intra-speaker) and each modality.
- The dialog level captures inter-speaker features and contextual information across the conversation.
Once the relationships between modalities are captured, the final emotion labels are estimated, as in the sketch below.
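A minimal sketch of the two-phase idea, assuming each utterance has already been reduced to a single fused feature vector (for example by a block like the one above): a dialog-level transformer encoder models inter-speaker context before a per-utterance classifier predicts the emotion. All names and dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class TwoStageERC(nn.Module):
    """Hypothetical two-stage ERC pipeline: utterance-level feature
    extraction followed by dialog-level context modelling."""
    def __init__(self, feat_dim=768, n_emotions=7, n_heads=8, n_layers=2):
        super().__init__()
        # Stage 1 (utterance level) is assumed done upstream: each utterance
        # arrives as one fused feature vector (see FusionAttention above).
        # Stage 2 (dialog level): a transformer encoder over the utterance
        # sequence captures inter-speaker and contextual dependencies.
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=n_heads,
                                           batch_first=True)
        self.dialog_encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.classifier = nn.Linear(feat_dim, n_emotions)

    def forward(self, utterance_feats):
        # utterance_feats: (batch, n_utterances, feat_dim)
        context = self.dialog_encoder(utterance_feats)
        return self.classifier(context)  # per-utterance emotion logits

model = TwoStageERC()
logits = model(torch.randn(2, 12, 768))
print(logits.shape)  # torch.Size([2, 12, 7])
```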
Previous studies have shown that treating audio as an image, by plotting its mel-spectrogram (a time-frequency representation), improves the accuracy of emotion recognition. Inspired by this, M2FNet extracts features from speech in much the same way it extracts them from images. To pull more emotion-related information from video, M2FNet uses a dual network that captures context by considering not only the emotion on a person's face but also the whole frame.
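For the audio branch, one common way to treat a waveform as an image is to compute its mel-spectrogram and replicate it across three channels so a standard image backbone can consume it. The parameters below are assumptions for illustration, not the configuration used in the paper.

```python
import torch
import torchaudio.transforms as T

# Illustrative only: turn a waveform into a mel-spectrogram "image"
# that an image-style CNN can consume.
waveform = torch.randn(1, 16000)          # 1 second of 16 kHz audio (dummy)
mel = T.MelSpectrogram(sample_rate=16000, n_fft=1024,
                       hop_length=256, n_mels=128)(waveform)
mel_db = T.AmplitudeToDB()(mel)           # log scale, as usually plotted

# Replicate to 3 channels so a standard image backbone accepts it.
image_like = mel_db.unsqueeze(0).repeat(1, 3, 1, 1)
print(image_like.shape)                   # (1, 3, 128, time_frames)
```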
The team also proposes a new feature-extractor model. It is trained with a new adaptive margin-based triplet loss function that helps the proposed extractor learn accurate representations.
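The exact adaptive-margin formulation is described in the paper; the sketch below shows one simplified variant of the idea, where the margin grows for negatives that sit close to the anchor so the extractor is pushed harder on difficult cases.

```python
import torch
import torch.nn.functional as F

def adaptive_margin_triplet_loss(anchor, positive, negative,
                                 base_margin=0.2, scale=0.3):
    """Simplified, illustrative triplet loss with an adaptive margin.
    NOT the exact formulation from the M2FNet paper, just a sketch."""
    d_pos = F.pairwise_distance(anchor, positive)   # (batch,)
    d_neg = F.pairwise_distance(anchor, negative)   # (batch,)
    # Adaptive margin: larger when the negative lies close to the anchor.
    margin = base_margin + scale * torch.exp(-d_neg)
    return F.relu(d_pos - d_neg + margin).mean()

# Dummy embeddings standing in for the feature extractor's outputs
a, p, n = (torch.randn(8, 256) for _ in range(3))
print(adaptive_margin_triplet_loss(a, p, n))
```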
The team notes that no single embedding improves accuracy on its own, which demonstrates the importance of scene context, in addition to facial expressions, for recognizing emotions. This motivated the dual network, which merges the emotional content of the whole scene with that of the individual people in it. They also observe that recent ERC approaches, despite success on benchmark datasets such as IEMOCAP, degrade on more complex datasets such as MELD.
MELD consists of over 1,400 dialogues and 13,000 utterances from the TV series “Friends.” Each utterance carries one of seven emotion labels: anger, disgust, sadness, joy, surprise, fear, or neutral. The existing train/validation splits are used as-is.
IEMOCAP is a conversational database with six emotion labels (happy, sad, neutral, angry, excited, and frustrated). In this experiment, 10% of the training data was randomly selected and used for hyperparameter tuning.
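A minimal example of such a hold-out split using scikit-learn is shown below; the labels, seed, and stratification are assumptions for illustration, not details stated in the article.

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the IEMOCAP training labels.
train_labels = ["happy", "sad", "neutral", "angry", "excited", "frustrated"] * 100

# Hold out a random 10% of training utterances for hyperparameter tuning.
train_idx, val_idx = train_test_split(
    list(range(len(train_labels))),
    test_size=0.10,
    stratify=train_labels,   # keep emotion proportions similar (assumption)
    random_state=42,
)
print(len(train_idx), len(val_idx))   # 540 60
```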
The team evaluated the proposed network against existing text-based and multimodal ERC techniques to verify its robustness, comparing weighted average F1 scores on the MELD and IEMOCAP datasets. The results show that M2FNet significantly outperforms its competitors on this metric and suggest that it effectively exploits multimodal features to improve the accuracy of emotion recognition.
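For reference, the weighted average F1 score used for the comparison weights each class by its support and can be computed with scikit-learn as follows (the labels below are dummy values for illustration only).

```python
from sklearn.metrics import f1_score

# Dummy predictions just to show the metric: weighted average F1
# weights each emotion class by its number of true samples.
y_true = [0, 1, 2, 2, 4, 3, 0, 6, 5, 2]
y_pred = [0, 1, 2, 3, 4, 3, 0, 6, 5, 1]
print(f1_score(y_true, y_pred, average="weighted"))
```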
Check out the paper. All credit for this research goes to the researchers of this project. Also, don’t forget to join our 13k+ ML SubReddit, Discord channel, and email newsletter, where we share the latest AI research news, cool AI projects, and more.
Tanushree Shenwai is a consulting intern at MarktechPost. She is currently pursuing her bachelor’s degree at the Indian Institute of Technology (IIT), Bhubaneswar. She is a data science enthusiast with a keen interest in the scope of artificial intelligence applications in various fields. Her passion lies in exploring new advancements in technology and their practical applications.