Maarten Van Segbroeck, an applied scientist in the Alexa International group and first author on the associated paper, cowrote this post with Zaid Ahmed.
Amazon today announced the public release of a new data set that will help speech scientists address the difficult problem of separating speech signals in reverberant rooms with multiple speakers.
In the field of automatic speech recognition, this problem is known as the “cocktail party” or “dinner party” problem; accordingly, we call our data set the Dinner Party Corpus, or DiPCo. The dinner party problem is widely studied: it was, for instance, the subject of the fifth and most recent CHiME challenge in speech separation and recognition sponsored by the International Speech Communication Association.
We hope that the availability of a high-quality public data set will both promote research on this topic and make that research more productive.
We created our data set with the assistance of Amazon volunteers, who simulated the dinner-party scenario in the lab. We conducted multiple sessions, each involving four participants. At the beginning of each session, participants served themselves food from a buffet table. Most of each session took place at a dining table, and at fixed points in several sessions, we piped music into the room to reproduce a noise source common in real-world environments.
Each participant was outfitted with a headset microphone, which captured a clear, speaker-specific signal. Five array devices, each with seven microphones, were also dispersed around the room; they fed their audio signals directly to an administrator’s laptop.
The layout of the space in which we captured audio from simulated dinner parties. The numbered circles indicate the placement of the five microphone arrays.
The data set we are releasing includes both the raw audio from each of the seven microphones in each device and the headset signals. The headset signals provide speaker-specific references that can be used to gauge the success of speech separation systems acting on the signals from the microphone arrays. The data set also includes transcriptions of the headset signals.
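One common way to gauge separation quality against such speaker-specific references is the scale-invariant signal-to-distortion ratio (SI-SDR). The post does not prescribe a particular metric, so this is only an illustrative sketch: it scores a separated array-channel estimate against a headset reference, in dB, using NumPy.

```python
import numpy as np

def si_sdr(reference: np.ndarray, estimate: np.ndarray) -> float:
    """Scale-invariant signal-to-distortion ratio (SI-SDR) in dB.

    `reference` would be a headset (close-talk) signal for one speaker;
    `estimate` would be a separation system's output for that speaker.
    Both are 1-D sample arrays of equal length.
    """
    # Remove DC offset so the projection is on the zero-mean signals.
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()

    # Project the estimate onto the reference; the projection is the
    # "target" component, the remainder is distortion plus noise.
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference
    residual = estimate - target

    # Ratio of target energy to residual energy, in decibels.
    return 10.0 * np.log10(np.dot(target, target) / np.dot(residual, residual))
```

Because the metric projects out the optimal gain, rescaling the estimate leaves the score unchanged, which is why SI-SDR is preferred over plain SNR when a separation system's output level is arbitrary.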
The division of the data into segments with and without background music lets researchers combine clean and noisy training data in whatever proportions best suit their machine learning systems.
The DiPCo data set has been released under the CDLA-Permissive license and can be downloaded here. We have also posted a paper detailing the work. DiPCo’s release follows on Amazon’s recent releases of three other public data sets, two for the development of conversational AI systems and the other for fact verification.
Zaid Ahmed is a senior technical program manager in the Alexa Speech group.
Paper: “DiPCo — Dinner Party Corpus”
Acknowledgments: Maarten Van Segbroeck, Roland Maas