This repository aggregates datasets that can be used for the development of conversational AI techniques.

wangxieric/Conversational-AI-Datasets

Conversational-AI-datasets

This repository aggregates datasets that can be used to develop conversational AI techniques. In this repository, we cover the research tasks of open-domain conversation, conversational recommendation and conversational search.

Datasets

Conversational Question Answering Datasets

| Dataset | #dialogues | collection | year | download |
| --- | --- | --- | --- | --- |
| QuAC | 13,569 | Crowdsourcing | 2018 | Download |
| MANtIS | 80,324 | Stack Exchange | 2019 | Download |
| CoQA | 8,399 | Crowdsourcing | 2019 | Download |
| ShARC | 948 | Crowdsourcing | 2018 | Download |
| MSDialog | 2,199 | Microsoft Community | 2018 | Download |

Conversational Search Datasets

| Dataset | #dialogues | Corpus Size | collection | year | download |
| --- | --- | --- | --- | --- | --- |
| CAsT-19,20,21,22 | 30 - 50 | 38,426,252 | Crowdsourcing | 2019 | Download |
| OR-QuAC | 5,644 | 11,377,951 | Update QuAC for self-containment | 2020 | Download |

Conversational Recommendation Datasets

| Dataset | #dialogues | #utterances | domain | collection | language | year | download |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ReDial | 10,006 | 182,150 | Movie | Amazon Mechanical Turk (AMT) | ENG | 2018 | Download |
| OpenDialKG | 12,320 | 71,873 | Movies & Books | KG-walk Crowdsourcing | ENG | 2019 | Download |
| INSPIRED | 1,001 | 35,811 | Movie | Social-encouraged crowdsourcing (AMT) | ENG | 2020 | Download |
| TG-ReDial | 10,000 | 129,392 | Movie | Topic-driven generation, crowdsourcing | CHN | 2020 | Download |
| DuRecDial2.0 | 16,482 | 255,346 | Movie, music, star, food, restaurant, weather | translation from DuRecDial (crowdsourced) | ENG, CHN | 2021 | Download |
| INSPIRED2 | 1,001 | 35,811 | Movie | clean & augment INSPIRED | ENG | 2022 | Download |
| U-NEED | 7,698 | 53,712 | e-commerce | pre-sale dialogues from Taobao | CHN | 2023 | Download |
| PEARL | 57,277 | 548,061 | Movie | review-based synthetic dialogues | ENG | 2024 | Download |
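
The dialogue and utterance counts in the table above can be combined into a rough "average utterances per dialogue" figure, which gives a quick sense of how long conversations are in each dataset. The snippet below is a minimal sketch: the numbers are copied from the table, while the names `CRS_STATS` and `avg_utterances_per_dialogue` are illustrative and not part of this repository or of any dataset's tooling. INSPIRED2 is left out because its counts match INSPIRED.

```python
# (dialogues, utterances) pairs copied from the table above.
# CRS_STATS and avg_utterances_per_dialogue are hypothetical names
# used only for this illustration.
CRS_STATS = {
    "ReDial": (10_006, 182_150),
    "OpenDialKG": (12_320, 71_873),
    "INSPIRED": (1_001, 35_811),
    "TG-ReDial": (10_000, 129_392),
    "DuRecDial2.0": (16_482, 255_346),
    "U-NEED": (7_698, 53_712),
    "PEARL": (57_277, 548_061),
}


def avg_utterances_per_dialogue(stats):
    """Return {dataset: utterances / dialogues}, rounded to one decimal."""
    return {
        name: round(utterances / dialogues, 1)
        for name, (dialogues, utterances) in stats.items()
    }


if __name__ == "__main__":
    averages = avg_utterances_per_dialogue(CRS_STATS)
    # List datasets from longest to shortest average conversation.
    for name, avg in sorted(averages.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name:<14}{avg:>6.1f}")
```

These averages only reflect the aggregate counts listed here; they say nothing about how turn lengths are distributed within a dataset.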

Task-oriented Dialogue System Datasets

| Dataset | #dialogues | #utterances | #domain | collection | language | year | download |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MultiWOZ | 8,438 | 113,556 | 7 | Wizard-of-Oz | EN | 2018 | Download |
| SGD | 16,142 | 329,964 | 16 | outline simulation then crowdsourced paraphrasing | EN | 2020 | Download |

Multi-Task Conversational Datasets

| Dataset | Paper | Link |
| --- | --- | --- |
| MG-ShopDial | MG-ShopDial: A Multi-Goal Conversational Dataset for e-Commerce | link |

Cross-domain Conversational Datasets

| Dataset | Paper | Link |
| --- | --- | --- |
| DialogStudio | DialogStudio: Towards Richest and Most Diverse Unified Dataset Collection for Conversational AI | link |
