This project was born from a desire to create a space where developers and researchers can easily access large dialogue datasets.
Here you will find everything you need to download, extract, and process the data.
This project is made possible thanks to the Grenoble Computer Science Laboratory (LIG) and the MIAI institute.
| Dataset | Language | # Dialogues | # Messages | # Words | Download | Updated Date |
|---|---|---|---|---|---|---|
| French Reddit | fr | 2 699 832 | 7 076 356 | 335 203 782 | Link ? | 10/12/2021 |
| DiaBLa | fr | 144 | 5 748 | 50 998 | Link | 15/12/2021 |
| ... | ... | ... | ... | ... | ... | ... |
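Once you have downloaded and decompressed one of the datasets above, you can inspect it with standard tools. Below is a minimal sketch using pandas; the file name `dialogues.csv` is only a placeholder for whatever file the download actually contains, and no particular column layout is assumed.

```python
# Minimal sketch: inspect a downloaded dialogue dataset.
# "dialogues.csv" is a placeholder; replace it with the file you extracted.
import pandas as pd

df = pd.read_csv("dialogues.csv")

# Look at the schema and a few rows to see how the dialogues are organised.
print(df.columns.tolist())
print(df.head())
print(f"{len(df)} rows loaded")
```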
To start using the project, you first need to install some system dependencies:
- Clone the repo
git clone git@github.com:Torilen/Create-Dialogue-Dataset.git
- Go into it
cd Create-Dialogue-Dataset
- Run installation script
sh install.sh
python getAndConcatData.py --help
usage: getAndConcatData.py [-h] [--languageThreshold LANGUAGETHRESHOLD]
[--decompressedSourceFilePath DECOMPRESSEDSOURCEFILEPATH]
[--listSubredditFilePath LISTSUBREDDITFILEPATH]
[--maxCommentProcessed MAXCOMMENTPROCESSED]
[--useSubredditFilter USESUBREDDITFILTER]
[--downloadData DOWNLOADDATA]
[--languageToExtract LANGUAGETOEXTRACT]
Data acquisition and processing
optional arguments:
-h, --help show this help message and exit
--languageThreshold LANGUAGETHRESHOLD
Lowest ratio of language to non-language text. Enter a
value between 0 and 1.
--decompressedSourceFilePath DECOMPRESSEDSOURCEFILEPATH
Path to the source file downloaded and decompressed by
download.sh
--listSubredditFilePath LISTSUBREDDITFILEPATH
Path to the file that contains the list of accepted
subreddits
--maxCommentProcessed MAXCOMMENTPROCESSED
Maximum number of comments processed
--useSubredditFilter USESUBREDDITFILTER
Use the subreddit filter file?
--downloadData DOWNLOADDATA
Is the data source already downloaded?
--languageToExtract LANGUAGETOEXTRACT
The language you want to extract from reddit ["fr",
"en", "es", etc]
Some example commands:
python getAndConcatData.py --languageThreshold 0.5 --useSubredditFilter False --downloadData False --languageToExtract "fr"
python getAndConcatData.py --languageThreshold 0.7 --useSubredditFilter True --languageToExtract "fr" --listSubredditFilePath "./data/acceptedSubbredit.txt"
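The `--languageThreshold` option is described as the lowest acceptable ratio of target-language text, a value between 0 and 1. As a rough illustration of that idea (not the project's actual implementation), a comment filter could look like the sketch below; it uses the `langdetect` package, and the helper name `keep_comment` is made up for this example.

```python
# Rough sketch of a language filter in the spirit of --languageThreshold.
# This is an illustration, NOT the code used by getAndConcatData.py.
from langdetect import detect_langs, LangDetectException

def keep_comment(text, language="fr", threshold=0.5):
    """Keep a comment if the detected probability of `language` is >= `threshold`."""
    try:
        candidates = detect_langs(text)
    except LangDetectException:
        return False  # empty or undetectable text
    for candidate in candidates:
        if candidate.lang == language:
            return candidate.prob >= threshold
    return False

print(keep_comment("Bonjour, quelqu'un a vu le match hier soir ?"))  # usually True
print(keep_comment("Hello, did anyone watch the game last night?"))  # usually False
```

When `--useSubredditFilter True` is set, the file passed with `--listSubredditFilePath` (e.g. `./data/acceptedSubbredit.txt` in the example above) presumably lists the accepted subreddits, likely one name per line; check the file shipped with the repository for the exact format.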
python constructDialogueDataset.py --help
usage: constructDialogueDataset.py [-h]
[--extractedPreprocessCsvFilePath EXTRACTEDPREPROCESSCSVFILEPATH]
Construction of the dialogue data
optional arguments:
-h, --help show this help message and exit
--extractedPreprocessCsvFilePath EXTRACTEDPREPROCESSCSVFILEPATH
Path to the source file preprocessed by
getAndConcatData.py
Example:
python constructDialogueDataset.py --extractedPreprocessCsvFilePath "./reddit_source_fr_preprocessed.csv"
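Conceptually, building dialogues from the preprocessed Reddit comments means following each comment's parent link back to the root so that a chain of replies becomes one conversation. The sketch below only illustrates that idea; the column names `id`, `parent_id`, and `body` are assumptions and may differ from the CSV that getAndConcatData.py actually produces.

```python
# Conceptual sketch of chaining Reddit comments into dialogues by following
# parent links (NOT the project's actual implementation).
# The column names "id", "parent_id" and "body" are assumptions.
import pandas as pd

comments = pd.DataFrame(
    [
        {"id": "c1", "parent_id": None, "body": "Quelqu'un a vu le match hier ?"},
        {"id": "c2", "parent_id": "c1", "body": "Oui, quelle fin incroyable !"},
        {"id": "c3", "parent_id": "c2", "body": "Je l'ai raté, c'était si bien que ça ?"},
    ]
)

by_id = comments.set_index("id")

def thread_for(comment_id):
    """Walk up the parent chain and return the dialogue from root to leaf."""
    turns = []
    current = comment_id
    while current is not None and current in by_id.index:
        row = by_id.loc[current]
        turns.append(row["body"])
        current = row["parent_id"]
    return list(reversed(turns))

print(thread_for("c3"))
```

One natural design, sketched here, is to emit one dialogue per leaf comment of the reply tree, with one turn per message; the actual script may make different choices (e.g. how it handles deleted comments or branching threads).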
- Multi-language Support
- Add more data sources
This work is licensed under the Creative Commons Attribution-NonCommercial 3.0 France License.
Distributed under this license. See LICENSE.md for more information.
Ilyes Aniss Bentebib - [email protected]
Project Link: https://github.com/Torilen/Create-Dialogue-Dataset