Today I am happy to announce the Topical Chat dataset, a corpus of human-human social conversations collected from crowd workers, which will be released publicly on September 17, 2019.
The dataset was developed for teams competing in the Alexa Prize Socialbot Grand Challenge 3; the application period closes May 14, 2019, and the competition launches September 9, 2019 (apply and learn more here). Teams competing in the Alexa Prize will have access to an expanded version of this dataset, the Extended Topical Chat dataset, which includes the results of ongoing collections and annotations, in addition to the many other resources exclusive to Alexa Prize participants.
The Topical Chat dataset will consist of more than 210,000 utterances (over 4,100,000 words), making it the largest social conversation and knowledge dataset publicly available to the research community and supporting the publication of high-quality, repeatable research.
Each conversation (and each turn of the conversation) in this dataset is linked to knowledge provided to crowd workers. The knowledge is collected from a variety of unstructured or loosely structured text resources, and each conversation refers to a related set of entities. None of these conversations are interactions with Alexa customers.
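To make the turn-level linking concrete, here is a minimal sketch in Python of how a conversation and its knowledge might be represented. It is only an illustration: the class names, field names, identifiers, and the sample text are all hypothetical and are not the released dataset's schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Turn:
    """One utterance in a conversation (hypothetical structure)."""
    speaker: str
    text: str
    # IDs of the knowledge snippets this turn is linked to (assumed field).
    knowledge_ids: List[str] = field(default_factory=list)


@dataclass
class Conversation:
    """A conversation plus the knowledge shown to crowd workers (hypothetical)."""
    entities: List[str]            # related entities the conversation refers to
    knowledge: Dict[str, str]      # snippet ID -> snippet text
    turns: List[Turn] = field(default_factory=list)

    def knowledge_for_turn(self, index: int) -> List[str]:
        """Return the knowledge snippets linked to the turn at `index`."""
        turn = self.turns[index]
        return [self.knowledge[k] for k in turn.knowledge_ids]


def example_conversation() -> Conversation:
    """Build a toy conversation; all content here is invented placeholder text."""
    return Conversation(
        entities=["topic A"],
        knowledge={"k1": "Placeholder fact about topic A."},
        turns=[
            Turn("worker_1", "Did you know this about topic A?", ["k1"]),
            Turn("worker_2", "I did not, that is interesting!"),
        ],
    )


if __name__ == "__main__":
    conv = example_conversation()
    for i, turn in enumerate(conv.turns):
        print(turn.speaker, "->", conv.knowledge_for_turn(i))
```

In this sketch a turn with no linked snippet simply yields an empty list; the real dataset may represent ungrounded turns differently.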
The goal of this collection is to enable the next steps of research in knowledge-grounded neural response generation systems, tackling hard challenges in natural conversation that are not addressed by other publicly available datasets. This will allow researchers to focus on the way humans transition between topics, on knowledge selection and enrichment, and on the integration of fact and opinion into dialogue.
Visit www.alexaprize.com to learn more and stay up to date.