A person’s tone of voice can tell you a lot about how they’re feeling. Not surprisingly, emotion recognition is an increasingly popular conversational-AI research topic.
Emotion recognition has a wide range of applications: it can aid in health monitoring; it can make conversational-AI systems more engaging; and it can provide implicit customer feedback that could help voice agents like Alexa learn from their mistakes.
Typically, emotion classification systems are neural networks trained in a supervised fashion: training data is labeled according to the speaker’s emotional state, and the network learns to predict the labels from the data. At this year’s International Conference on Acoustics, Speech, and Signal Processing, my colleagues and I presented an alternative approach, in which we used a publicly available data set to train a neural network known as an adversarial autoencoder.
An adversarial autoencoder is an encoder-decoder neural network: one component of the network, the encoder, learns to produce a compact representation of input speech; the decoder reconstructs the input from the compact representation. The adversarial learning forces the encoder’s representations to conform to a desired probability distribution.
The compact representation — or “latent” representation — encodes all properties of the training example. In our model, we explicitly dedicate part of the latent representation to the speaker’s emotional state and assume that the remaining part captures all other input characteristics.
Our latent emotion representation consists of three network nodes, one for each of three emotional measures: valence, or whether the speaker’s emotion is positive or negative; activation, or whether the speaker is alert and engaged or passive; and dominance, or whether the speaker feels in control of the situation. The remaining part of the latent representation is much larger, 100 nodes.
Figure: The architecture of our adversarial autoencoder. The latent representation has two components (emotion classes and style), whose outputs feed into two adversarial discriminators.
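To make the split latent representation concrete, here is a minimal sketch in plain Python. It is not the implementation from the paper: the single linear layers, the random weights, and the tiny input size (`INPUT_DIM`) are assumptions chosen for brevity. Only the 3-node and 100-node latent sizes come from the description above.

```python
import random

EMOTION_DIM = 3    # valence, activation, dominance (from the article)
STYLE_DIM = 100    # remaining latent nodes (from the article)
INPUT_DIM = 8      # hypothetical feature size, kept tiny for illustration


def init_layer(n_out, n_in, rng):
    """Return (weights, biases) for a dense layer with small random weights."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(layer, x):
    """Apply a linear layer: y = W x + b."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def encode(encoder, features):
    """Map input features to the (emotion, style) parts of the latent vector."""
    latent = dense(encoder, features)
    return latent[:EMOTION_DIM], latent[EMOTION_DIM:]


def decode(decoder, emotion, style):
    """Reconstruct the input from the concatenated latent parts."""
    return dense(decoder, emotion + style)


def reconstruction_error(original, reconstructed):
    """Mean squared error between the input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)


rng = random.Random(0)
encoder = init_layer(EMOTION_DIM + STYLE_DIM, INPUT_DIM, rng)
decoder = init_layer(INPUT_DIM, EMOTION_DIM + STYLE_DIM, rng)

features = [rng.gauss(0.0, 1.0) for _ in range(INPUT_DIM)]
emotion, style = encode(encoder, features)
reconstruction = decode(decoder, emotion, style)
print(len(emotion), len(style), len(reconstruction))  # prints: 3 100 8
```

A real model would use deeper, nonlinear networks trained by gradient descent. The point here is only that one encoder output is sliced into a small emotion part and a larger style part, and that the decoder sees both.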
We conduct training in three phases. In the first phase, we train the encoder and decoder using data without labels. In the second phase, we use adversarial training to tune the encoder.
Each latent representation — the three-node representation and the 100-node representation — passes to an adversarial discriminator. The adversarial discriminators are neural networks that attempt to distinguish real data representations, produced by the encoder, from artificial representations generated in accord with particular probability distributions. The encoder, in turn, attempts to fool the adversarial discriminator.
In so doing, the encoder learns to produce representations that fit the probability distributions. This discourages overfitting, that is, relying too heavily on statistical properties of the training data that don’t represent speech data in general.
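The tug-of-war between discriminator and encoder can be written as a pair of loss functions. The sketch below uses the standard binary cross-entropy formulation common in adversarial training. That choice is an assumption, as are the function names and the convention that a high score means "looks like a sample from the target distribution"; the article does not specify the exact objective.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


def discriminator_loss(score_prior, score_encoded):
    """Binary cross-entropy for one pair of discriminator scores.

    score_prior:   raw score for a sample drawn from the target distribution
    score_encoded: raw score for a representation produced by the encoder
    The loss is low when the prior sample scores high and the encoded one low.
    """
    # log(1 - sigmoid(s)) equals log_sigmoid(-s).
    return -(log_sigmoid(score_prior) + log_sigmoid(-score_encoded))


def encoder_fooling_loss(score_encoded):
    """Loss the encoder minimizes: low when the discriminator scores the
    encoder's output as if it came from the target distribution."""
    return -log_sigmoid(score_encoded)


# A confident, correct discriminator: small discriminator loss, large encoder loss.
print(discriminator_loss(4.0, -4.0), encoder_fooling_loss(-4.0))
# An undecided discriminator (scores of 0): the losses are 2*ln(2) and ln(2).
print(discriminator_loss(0.0, 0.0), encoder_fooling_loss(0.0))
```

In a setup like the one described above, there would be one such pair of losses per discriminator: one for the three-node emotion representation and one for the 100-node representation.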
In the third phase, we tune the encoder to ensure that the latent emotion representation predicts the emotional labels of the training data. We repeat all three training phases until we converge on the model with the best performance.
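The overall schedule can be sketched as a loop that cycles through the three phases and stops when a validation score no longer improves. Everything named here is hypothetical: the step callables stand in for real parameter updates, and running the phases once per batch, the patience-based stopping rule, and the default values are assumptions. The article says only that the phases are repeated until the best-performing model is found.

```python
from typing import Callable, Sequence


def train_three_phase(
    batches: Sequence,
    reconstruction_step: Callable,
    adversarial_step: Callable,
    supervised_step: Callable,
    validation_score: Callable[[], float],
    max_epochs: int = 50,
    patience: int = 5,
) -> float:
    """Cycle through the three phases until the validation score stops improving."""
    best_score = float("-inf")
    epochs_without_gain = 0
    for _ in range(max_epochs):
        for batch in batches:
            reconstruction_step(batch)  # phase 1: encoder + decoder, labels unused
            adversarial_step(batch)     # phase 2: discriminators vs. encoder
            supervised_step(batch)      # phase 3: emotion nodes vs. emotion labels
        score = validation_score()
        if score > best_score:
            best_score, epochs_without_gain = score, 0
        else:
            epochs_without_gain += 1
            if epochs_without_gain >= patience:
                break
    return best_score


# Stand-in steps that only record the order in which they are called.
calls = []
scores = iter([0.50, 0.55, 0.54, 0.53])
best = train_three_phase(
    batches=["batch-0", "batch-1"],
    reconstruction_step=lambda b: calls.append(("reconstruct", b)),
    adversarial_step=lambda b: calls.append(("adversarial", b)),
    supervised_step=lambda b: calls.append(("supervised", b)),
    validation_score=lambda: next(scores),
    max_epochs=4,
    patience=2,
)
print(best, len(calls))  # prints: 0.55 24
```

Because only the third step needs labels, the first two steps could in principle be fed batches that have no emotion annotations at all.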
For training, we used a public data set containing 10,000 utterances from 10 different speakers, labeled according to valence, activation, and dominance. We compared the performance of the proposed learning method and the fully supervised learning baseline and observed marginal improvements.
In tests in which the inputs to our network were sentence-level feature vectors hand-engineered to capture relevant information about a speech signal, our network was 3% more accurate than a conventionally trained network in assessing valence.
When the input to the network was a sequence of vectors representing the acoustic characteristics of 20-millisecond frames, or audio snippets, the improvement was 4%. This suggests that our approach could be useful for end-to-end spoken-language-understanding systems, which dispense with hand-engineered features and rely entirely on neural networks.
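For readers unfamiliar with frame-level input, the sketch below shows one simple way to turn a waveform into a sequence of per-frame vectors. The 20-millisecond frame length comes from the text. The 16 kHz sample rate, the non-overlapping frames, and the single log-energy feature are assumptions for illustration, not the acoustic features used in the experiments.

```python
import math

SAMPLE_RATE_HZ = 16_000   # assumed; the article does not state a sample rate
FRAME_MS = 20             # frame length from the article


def split_into_frames(samples, sample_rate_hz=SAMPLE_RATE_HZ, frame_ms=FRAME_MS):
    """Cut a waveform into consecutive, non-overlapping frames.

    A trailing partial frame is dropped.
    """
    frame_len = sample_rate_hz * frame_ms // 1000
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


def log_energy(frame, floor=1e-10):
    """One illustrative acoustic feature: log of the frame's mean squared amplitude."""
    return math.log(max(sum(s * s for s in frame) / len(frame), floor))


# One second of a 440 Hz tone as stand-in audio.
waveform = [math.sin(2 * math.pi * 440 * n / SAMPLE_RATE_HZ) for n in range(SAMPLE_RATE_HZ)]
frames = split_into_frames(waveform)
feature_sequence = [[log_energy(f)] for f in frames]  # one vector per frame
print(len(frames), len(frames[0]))  # prints: 50 320
```

Production front ends typically use overlapping frames and richer per-frame features, but the shape of the result is the same: one vector per short slice of audio, passed to the network in order.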
Moreover, unlike conventional neural nets, adversarial autoencoders can benefit from training with unlabeled data. In our tests, for purposes of benchmarking, we used the same data sets to train both our network and the baseline network. But it’s likely that using additional unlabeled data in the first and second training phases can improve the network’s performance.
Viktor Rozgic is a senior applied scientist in the Alexa Speech group.
Paper: “Improving Emotion Classification through Variational Inference of Latent Variables”
Acknowledgments: Srinivas Parthasarathy, Ming Sun, Chao Wang