Ming Sun and Bowen Shi cowrote this post with Chieh-Chi Kao.
Neural networks are responsible for most recent advances in artificial intelligence, including many of Alexa’s latest capabilities. But neural networks tend to be large and unwieldy, and in recent years, the Alexa team has been investigating techniques for making them efficient enough to run on-device.
At this year’s Interspeech, we and our colleagues are presenting two papers that describe techniques for reducing the complexity of networks that process audio data. One of the networks recognizes individual spoken words; the other does acoustic-event detection.
Acoustic-event detection is the technology behind Alexa Guard, a feature that customers can enable on Echo devices to detect and notify them about the sound of smoke and carbon monoxide alarms or glass breaking while they’re away from home. With Guard, running a detector on-device helps protect customer privacy, ensuring that only highly suspicious sounds pass for confirmation to a more powerful detector running in the cloud.
Both models rely on convolutional neural networks, although in different ways. Originally developed for image processing, convolutional neural nets, or CNNs, repeatedly apply the same “filter” to small chunks of input data. For object recognition, for instance, a CNN might step through an image file in eight-by-eight blocks of pixels, inspecting each block for patterns associated with particular objects. That way, the network can detect the objects no matter where in the image they’re located.
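The sliding-filter idea can be illustrated with a minimal NumPy sketch. The function name `apply_filter`, the 6-by-6 toy images, and the 2-by-2 filter are assumptions chosen for brevity, not the eight-by-eight blocks mentioned above or anything from the papers.

```python
import numpy as np

def apply_filter(image, kernel, stride=1):
    """Slide one kernel over a 2-D array and return the response map."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            block = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(block * kernel)  # same weights at every position
    return out

# Toy "object": a 2x2 bright square placed at two different locations.
pattern = np.ones((2, 2))
img_a = np.zeros((6, 6))
img_a[0:2, 0:2] = pattern
img_b = np.zeros((6, 6))
img_b[3:5, 2:4] = pattern

resp_a = apply_filter(img_a, pattern)
resp_b = apply_filter(img_b, pattern)

# The strongest response marks where the pattern sits in each image.
peak_a = np.unravel_index(np.argmax(resp_a), resp_a.shape)
peak_b = np.unravel_index(np.argmax(resp_b), resp_b.shape)
```

Because the same filter weights are reused at every position, the peak response simply moves with the pattern, which is the location independence described above.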
Like images, audio signals can be represented as two-dimensional data. In speech recognition, for instance, it’s standard to represent signals using mel-frequency cepstral coefficients, or MFCCs. A signal’s cepstral coefficients are a sequence of numbers that describe its frequency characteristics; cepstral connotes a transformation of spectral properties. “Mel” means that the frequency bands are chosen to concentrate data in frequency ranges that humans are particularly sensitive to. Mapping cepstral coefficients against time produces a 2-D snapshot of an acoustic signal.
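To make the “mel” idea concrete, here is a small sketch using one widely used mel-scale formula. The post does not say which variant the papers use, and the 8 kHz upper limit and the number of band edges are assumptions for illustration.

```python
import numpy as np

def hz_to_mel(f):
    """A common mel-scale formula (one of several variants in use)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Ten band edges evenly spaced on the mel scale between 0 Hz and 8 kHz.
edges_hz = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(8000.0), 10))

# Band widths in Hz: narrow at low frequencies, wide at high frequencies.
widths = np.diff(edges_hz)
```

Bands that are equally wide on the mel scale become progressively wider in hertz, so more of the representation is spent on the lower frequencies, where human hearing resolves pitch more finely.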
In object recognition, a CNN will typically apply a number of filters to each image block, each filter representing a different possible orientation of an object’s edge. Our system, too, applies a number of different filters, each attuned to characteristics of particular words. In our case, however, each filter is relevant only to some cepstral coefficients, not to all.
We exploit this difference to increase network efficiency. Our network architecture applies each filter only to the relevant cepstral coefficients, reducing the total number of operations required to identify a particular word. In experiments, we compared it to a traditional CNN and found that, when we held the output accuracy fixed, it reduced the computational load (measured in FLOPs, or floating-point operations) by 39.7% on command classification tasks and 49.3% on number recognition tasks.
A traditional CNN (left) and our more-efficient CNN, which applies filters (Conv1_1 through Conv1_3) only to the relevant cepstral coefficients. Note that in the signal representations, time is the y-axis.
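A rough sketch of why restricting filters to sub-bands saves work follows. The helper names `conv_time` and `subband_conv`, the input size, the filter counts, and the band widths are all hypothetical, and the resulting saving is a property of this toy setup, not a reproduction of the figures reported above.

```python
import numpy as np

def conv_time(x, filters):
    """Convolve along time only. x: (T, F); filters: (n, kt, F) -> (T-kt+1, n)."""
    n, kt, f = filters.shape
    steps = x.shape[0] - kt + 1
    out = np.empty((steps, n))
    for t in range(steps):
        # Each filter sees kt frames across all f coefficients it is given.
        out[t] = np.tensordot(filters, x[t:t + kt], axes=([1, 2], [0, 1]))
    macs = steps * n * kt * f  # multiply-accumulate operations performed
    return out, macs

def subband_conv(x, band_filters):
    """Apply each group of filters only to its own slice of coefficients."""
    outs, total, start = [], 0, 0
    for filt in band_filters:
        width = filt.shape[2]
        out, macs = conv_time(x[:, start:start + width], filt)
        outs.append(out)
        total += macs
        start += width
    return np.concatenate(outs, axis=1), total

rng = np.random.default_rng(0)
x = rng.standard_normal((100, 40))        # 100 frames x 40 coefficients (assumed)
full = rng.standard_normal((12, 5, 40))   # 12 filters spanning every coefficient
bands = [rng.standard_normal((4, 5, w)) for w in (10, 10, 20)]  # 3 sub-bands

full_out, full_macs = conv_time(x, full)
sub_out, sub_macs = subband_conv(x, bands)
```

Both versions produce 12 output channels per time step, but in this toy configuration the sub-band version performs one-third as many multiply-accumulates, because each filter touches only its own slice of the coefficients.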
In our other paper, we combine two different techniques to improve the efficiency of a sound detection network: distillation and quantization. Distillation is a technique in which the outputs of a large, powerful neural network (in this case, a CNN) are used to train a leaner, more efficient network (in this case, a shallow long short-term memory network, or LSTM).
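One common way to set up distillation is to train the student against the teacher's "soft" output probabilities. The sketch below assumes a softmax output with a temperature parameter and a cross-entropy loss. The function names, the temperature value, and the toy logits are illustrative assumptions, and the paper's exact loss may differ.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student's distribution against the teacher's soft targets."""
    soft_targets = softmax(teacher_logits, temperature)
    student_log_probs = np.log(softmax(student_logits, temperature))
    return float(-(soft_targets * student_log_probs).sum(axis=-1).mean())

# Toy logits for two examples and three classes (made-up numbers).
teacher = np.array([[4.0, 1.0, 0.5], [0.2, 3.0, 0.1]])
good_student = teacher.copy()                               # agrees with the teacher
poor_student = np.array([[0.5, 1.0, 4.0], [3.0, 0.2, 0.1]])  # disagrees

loss_good = distillation_loss(good_student, teacher)
loss_poor = distillation_loss(poor_student, teacher)
```

Minimizing this loss pulls the student's outputs toward the teacher's, so the smaller network inherits information about how the larger one ranks the alternatives, not just the single correct label.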
Quantization is the process of considering the full range of values that a particular variable can take on and splitting it into a fixed number of intervals. All the values within a given interval are then approximated by a single number.
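As a minimal sketch of that definition, the function below splits a range into equal intervals and replaces every value with the midpoint of its interval. The range, the number of levels, and the choice of the midpoint as the representative value are assumptions for illustration.

```python
import numpy as np

def quantize(values, lo, hi, n_levels):
    """Map each value in [lo, hi] to the midpoint of one of n_levels equal intervals."""
    step = (hi - lo) / n_levels
    idx = np.clip(np.floor((values - lo) / step), 0, n_levels - 1).astype(int)
    return lo + (idx + 0.5) * step, idx

x = np.array([-1.0, -0.6, -0.1, 0.1, 0.49, 0.99])
q, idx = quantize(x, lo=-1.0, hi=1.0, n_levels=4)
# idx -> [0, 0, 1, 2, 2, 3]
# q   -> [-0.75, -0.75, -0.25, 0.25, 0.25, 0.75]
```

Only the interval index needs to be stored for each value; with four levels, that is two bits instead of a full floating-point number.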
The typical neural network consists of a large number of simple processing nodes, each of which receives data from several other nodes and passes data to several more. Connections between nodes have associated weights, which indicate how big a role the outputs of one node play in the computation performed by the next. Training a neural network is largely a matter of adjusting its connection weights.
As storing a neural network in memory essentially amounts to storing its weights, quantizing those weights can dramatically reduce the network’s memory footprint.
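The arithmetic behind the memory saving can be sketched as follows. The bit widths here (32-bit floats reduced to 4-bit codes) and the 1,000-weight layer are assumptions chosen for illustration, since the post does not state the precision the paper uses, and the sketch ignores the few bytes needed to store the range.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.standard_normal(1000).astype(np.float32)  # hypothetical layer

# Quantize to 16 levels (4 bits) over the observed range of the weights.
lo, hi = weights.min(), weights.max()
codes = np.round((weights - lo) / (hi - lo) * 15).astype(np.uint8)

# Pack two 4-bit codes into each byte.
packed = (codes[0::2] << 4) | codes[1::2]

float_bytes = weights.nbytes   # 1000 weights * 4 bytes
packed_bytes = packed.nbytes   # 1000 weights * 0.5 bytes
ratio = float_bytes // packed_bytes
```

Under these assumptions the packed weights occupy 500 bytes instead of 4,000, and the original codes can be recovered with `packed >> 4` and `packed & 15`.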
In our case, we quantize not only the weights of our smaller network (the LSTM) but also its input values. An LSTM processes sequences of data in order, and the output corresponding to each input factors in the inputs and outputs that preceded it. We quantize not only the original inputs to the LSTM but also each output, which in turn becomes an input at the next processing step.
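The recurrence described here can be sketched with a plain NumPy LSTM step. Everything in the sketch is an assumption for illustration: the layer sizes, the 16-level grid on [-1, 1], the random weights, and the choice to leave the internal cell state unquantized.

```python
import numpy as np

def quantize(x, n_levels=16, lo=-1.0, hi=1.0):
    """Snap values to n_levels evenly spaced points in [lo, hi]."""
    step = (hi - lo) / (n_levels - 1)
    return lo + np.round((np.clip(x, lo, hi) - lo) / step) * step

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, b):
    """One standard LSTM step; W maps [x, h] to the four gate pre-activations."""
    z = np.concatenate([x, h]) @ W + b
    i, f, o, g = np.split(z, 4)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W = quantize(rng.standard_normal((n_in + n_hid, 4 * n_hid)) * 0.5)  # quantized weights
b = np.zeros(4 * n_hid)

h, c = np.zeros(n_hid), np.zeros(n_hid)
outputs = []
for x in rng.standard_normal((5, n_in)):
    h, c = lstm_step(quantize(x), h, c, W, b)  # quantize each original input
    h = quantize(h)                            # quantize the output fed to the next step
    outputs.append(h)
outputs = np.array(outputs)
```

Because each output is snapped to the grid before it is fed back, every value the network multiplies by a weight at the next step is already a low-precision number.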
Furthermore, we quantize the LSTM during training, not afterward. Rather than fully training the LSTM and only then quantizing its weights for storage, we force it to select quantized weights during training. This means that the training process tunes the network to the quantized weights, not to continuous values that the quantizations merely approximate.
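One widely used way to train with quantized weights is the straight-through estimator: the forward pass uses the quantized weights, and the resulting gradient is applied to a full-precision copy. The post does not say which method the paper uses, so the sketch below is only an illustration on a toy linear model; the step size, learning rate, and data are made up.

```python
import numpy as np

def quantize(w, step=0.25):
    """Round each weight to the nearest multiple of step."""
    return np.round(w / step) * step

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
true_w = np.array([0.5, -0.75, 0.25])  # chosen to lie on the quantization grid
y = X @ true_w

latent = np.zeros(3)  # full-precision copy that accumulates updates
for _ in range(200):
    w_q = quantize(latent)                     # forward pass uses quantized weights
    grad = 2 * X.T @ (X @ w_q - y) / len(X)    # gradient of the MSE w.r.t. w_q
    latent -= 0.05 * grad                      # straight-through: update the latent copy

w_q = quantize(latent)  # the weights that would actually be stored
```

The loss is always evaluated with the quantized weights, so training settles on values that work well after rounding instead of values that are good only before rounding.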
When we compare our distillation-trained and quantized LSTM to an LSTM with the same number of nodes trained directly on the same data, we find that it not only has a much smaller memory footprint — one-eighth the size — but also demonstrates a 15% improvement in accuracy, a result of the distillation training.
Chieh-Chi Kao is an applied scientist, Ming Sun a senior speech scientist, and Bowen Shi a summer intern (from the Toyota Technological Institute at Chicago), all in the Alexa Speech group.
“Sub-band Convolutional Neural Networks for Small-footprint Spoken Term Classification”
“Compression of Acoustic Event Detection Models With Quantized Distillation”
Acknowledgments: Yixin Gao, Shiv Vitaladevuni, Viktor Rozgic, Spyros Matsoukas, Chao Wang