This project was born from a desire to create a space where developers and researchers can easily access large dialogue datasets.
Here you will find everything you need to download, extract, and process the data.
This project is made possible thanks to the Grenoble Computer Science Laboratory (LIG) and the MIAI institute.
| Dataset | Language | # Dialogues | # Messages | # Words | Download | Updated Date |
|---|---|---|---|---|---|---|
| French Reddit | fr | 2 699 832 | 7 076 356 | 335 203 782 | Link ? | 10/12/2021 |
| DiaBLa | fr | 144 | 5 748 | 50 998 | Link | 15/12/2021 |
| ... | ... | ... | ... | ... | ... | ... |
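Once you have downloaded and decompressed one of the datasets above, you can inspect it with standard tools. Below is a minimal sketch using pandas; the file name `dialogues.csv` is only a placeholder for whatever file the download actually contains, and no particular column layout is assumed.

```python
# Minimal sketch: inspect a downloaded dialogue dataset.
# "dialogues.csv" is a placeholder; replace it with the file you extracted.
import pandas as pd

df = pd.read_csv("dialogues.csv")

# Look at the schema and a few rows to see how the dialogues are organised.
print(df.columns.tolist())
print(df.head())
print(f"{len(df)} rows loaded")
```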
To start using the project, you first need to install some system dependencies:
- Clone the repo
git clone git@github.com:Torilen/Create-Dialogue-Dataset.git
- Go into it
cd Create-Dialogue-Dataset
- Run installation script
sh install.sh
python getAndConcatData.py --help
usage: getAndConcatData.py [-h] [--languageThreshold LANGUAGETHRESHOLD]
[--decompressedSourceFilePath DECOMPRESSEDSOURCEFILEPATH]
[--listSubredditFilePath LISTSUBREDDITFILEPATH]
[--maxCommentProcessed MAXCOMMENTPROCESSED]
[--useSubredditFilter USESUBREDDITFILTER]
[--downloadData DOWNLOADDATA]
[--languageToExtract LANGUAGETOEXTRACT]
Data acquisition and processing
optional arguments:
-h, --help show this help message and exit
--languageThreshold LANGUAGETHRESHOLD
Lowest ratio of language to non-language text. Enter a
value between 0 and 1.
--decompressedSourceFilePath DECOMPRESSEDSOURCEFILEPATH
Path to the source file downloaded and decompressed by
download.sh
--listSubredditFilePath LISTSUBREDDITFILEPATH
Path to the file that contains the list of accepted
subreddits
--maxCommentProcessed MAXCOMMENTPROCESSED
Maximum number of comments processed
--useSubredditFilter USESUBREDDITFILTER
Use the subreddit filter file?
--downloadData DOWNLOADDATA
Is the data source already downloaded?
--languageToExtract LANGUAGETOEXTRACT
The language you want to extract from reddit ["fr",
"en", "es", etc]
Some example commands:
python getAndConcatData.py --languageThreshold 0.5 --useSubredditFilter False --downloadData False --languageToExtract "fr"
python getAndConcatData.py --languageThreshold 0.7 --useSubredditFilter True --languageToExtract "fr" --listSubredditFilePath "./data/acceptedSubbredit.txt"
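The `--languageThreshold` option is described as the lowest acceptable ratio of target-language text, a value between 0 and 1. As a rough illustration of that idea (not the project's actual implementation), a comment filter could look like the sketch below; it uses the `langdetect` package, and the helper name `keep_comment` is made up for this example.

```python
# Rough sketch of a language filter in the spirit of --languageThreshold.
# This is an illustration, NOT the code used by getAndConcatData.py.
from langdetect import detect_langs, LangDetectException

def keep_comment(text, language="fr", threshold=0.5):
    """Keep a comment if the detected probability of `language` is >= `threshold`."""
    try:
        candidates = detect_langs(text)
    except LangDetectException:
        return False  # empty or undetectable text
    for candidate in candidates:
        if candidate.lang == language:
            return candidate.prob >= threshold
    return False

print(keep_comment("Bonjour, quelqu'un a vu le match hier soir ?"))  # usually True
print(keep_comment("Hello, did anyone watch the game last night?"))  # usually False
```

When `--useSubredditFilter True` is set, the file passed with `--listSubredditFilePath` (e.g. `./data/acceptedSubbredit.txt` in the example above) presumably lists the accepted subreddits, likely one name per line; check the file shipped with the repository for the exact format.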
python constructDialogueDataset.py --help
usage: constructDialogueDataset.py [-h]
[--extractedPreprocessCsvFilePath EXTRACTEDPREPROCESSCSVFILEPATH]
Construction of the dialogue data
optional arguments:
-h, --help show this help message and exit
--extractedPreprocessCsvFilePath EXTRACTEDPREPROCESSCSVFILEPATH
Path to the source file preprocessed by
getAndConcatData.py
Example:
python constructDialogueDataset.py --extractedPreprocessCsvFilePath "./reddit_source_fr_preprocessed.csv"
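Conceptually, building dialogues from the preprocessed Reddit comments means following each comment's parent link back to the root so that a chain of replies becomes one conversation. The sketch below only illustrates that idea; the column names `id`, `parent_id`, and `body` are assumptions and may differ from the CSV that getAndConcatData.py actually produces.

```python
# Conceptual sketch of chaining Reddit comments into dialogues by following
# parent links (NOT the project's actual implementation).
# The column names "id", "parent_id" and "body" are assumptions.
import pandas as pd

comments = pd.DataFrame(
    [
        {"id": "c1", "parent_id": None, "body": "Quelqu'un a vu le match hier ?"},
        {"id": "c2", "parent_id": "c1", "body": "Oui, quelle fin incroyable !"},
        {"id": "c3", "parent_id": "c2", "body": "Je l'ai raté, c'était si bien que ça ?"},
    ]
)

by_id = comments.set_index("id")

def thread_for(comment_id):
    """Walk up the parent chain and return the dialogue from root to leaf."""
    turns = []
    current = comment_id
    while current is not None and current in by_id.index:
        row = by_id.loc[current]
        turns.append(row["body"])
        current = row["parent_id"]
    return list(reversed(turns))

print(thread_for("c3"))
```

One natural design, sketched here, is to emit one dialogue per leaf comment of the reply tree, with one turn per message; the actual script may make different choices (e.g. how it handles deleted comments or branching threads).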
- Multi-language Support
- Add more data sources
This work is licensed under the Creative Commons Attribution-NonCommercial 3.0 France License.
Distributed under this license. See LICENSE.md for more information.
Ilyes Aniss Bentebib - [email protected]
Project Link: https://github.com/Torilen/Create-Dialogue-Dataset