ia-bentebib/DiaGene


Dialogue Dataset Creator

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Roadmap
  5. Contributing
  6. License
  7. Contact
  8. Acknowledgments

About The Project

This project grew out of a desire to give developers and researchers easy access to large dialogue datasets.

The repository contains everything you need to download, extract, and process the data.

This project is made possible by the Grenoble computer science laboratory (LIG) and the MIAI institute.

(back to top)

Built With

(back to top)

Datasets

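| Dataset | Language | # Dialogues | # Messages | # Words | Download | Updated Date |
| --- | --- | --- | --- | --- | --- | --- |
| French Reddit | fr | 2 699 832 | 7 076 356 | 335 203 782 | Link ? | 10/12/2021 |
| DiaBLa | fr | 144 | 5 748 | 50 998 | Link | 15/12/2021 |
| ... | ... | ... | ... | ... | ... | ... |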
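
The counts in the table also give a rough idea of how the dialogues are shaped. As a small sketch that uses only the figures listed above, the following Python snippet derives the average number of messages per dialogue and words per message for each dataset. The `DATASETS` dictionary and the `averages` helper are illustrative names, not part of the repository.

```python
# Figures copied from the datasets table: (dialogues, messages, words).
DATASETS = {
    "French Reddit": (2_699_832, 7_076_356, 335_203_782),
    "DiaBLa": (144, 5_748, 50_998),
}


def averages(dialogues, messages, words):
    """Return (messages per dialogue, words per message)."""
    return messages / dialogues, words / messages


for name, counts in DATASETS.items():
    per_dialogue, per_message = averages(*counts)
    print(f"{name}: {per_dialogue:.1f} messages/dialogue, "
          f"{per_message:.1f} words/message")
```

On these numbers, the Reddit dialogues are short exchanges of long messages, while the DiaBLa dialogues are long exchanges of short messages.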

Getting Started

To start using the project, you first need to install some system dependencies.

Installation

  1. Clone the repo
    git clone git@github.com:Torilen/Create-Dialogue-Dataset.git
  2. Go into it
    cd Create-Dialogue-Dataset
  3. Run installation script
    sh install.sh

(back to top)

Usage

First step: Data acquisition

python getAndConcatData.py --help
usage: getAndConcatData.py [-h] [--languageThreshold LANGUAGETHRESHOLD]
                        [--decompressedSourceFilePath DECOMPRESSEDSOURCEFILEPATH]
                        [--listSubredditFilePath LISTSUBREDDITFILEPATH]
                        [--maxCommentProcessed MAXCOMMENTPROCESSED]
                        [--useSubredditFilter USESUBREDDITFILTER]
                        [--downloadData DOWNLOADDATA]
                        [--languageToExtract LANGUAGETOEXTRACT]

Acquisition et traitement des données

optional arguments:
-h, --help            show this help message and exit
--languageThreshold LANGUAGETHRESHOLD
                     Lowest ratio of language to non-language text. Enter a
                     value between 0 and 1.
--decompressedSourceFilePath DECOMPRESSEDSOURCEFILEPATH
                     Path to the source file downloaded and decompressed by
                     download.sh
--listSubredditFilePath LISTSUBREDDITFILEPATH
                     Path to the file contains the list of accepted
                     subreddit
--maxCommentProcessed MAXCOMMENTPROCESSED
                     Maximum number of comment processed
--useSubredditFilter USESUBREDDITFILTER
                     Use subreddit file ?
--downloadData DOWNLOADDATA
                     The data source are already downloaded ?
--languageToExtract LANGUAGETOEXTRACT
                     The language you want to extract from reddit ["fr",
                     "en", "es", etc]

Some example commands:

python getAndConcatData.py --languageThreshold 0.5 --useSubredditFilter False --downloadData False --languageToExtract "fr"
python getAndConcatData.py --languageThreshold 0.7 --useSubredditFilter True --languageToExtract "fr" --listSubredditFilePath "./data/acceptedSubbredit.txt"
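
If you drive the acquisition step from another script, it can help to assemble the command line programmatically. The sketch below is a hypothetical helper (`build_acquisition_command` is not part of the repository) that uses only the flags listed in the help output above. It passes booleans as the strings `True` and `False`, as the example commands do. How the script itself parses those values is not documented here, so treat that as an assumption.

```python
import sys


def build_acquisition_command(language, threshold, subreddit_file=None,
                              download_data=None, max_comments=None):
    """Build the argv list for getAndConcatData.py (hypothetical helper)."""
    # The help text asks for a value between 0 and 1.
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    cmd = [sys.executable, "getAndConcatData.py",
           "--languageThreshold", str(threshold),
           "--languageToExtract", language]
    if subreddit_file is None:
        cmd += ["--useSubredditFilter", "False"]
    else:
        cmd += ["--useSubredditFilter", "True",
                "--listSubredditFilePath", subreddit_file]
    if download_data is not None:
        cmd += ["--downloadData", str(download_data)]
    if max_comments is not None:
        cmd += ["--maxCommentProcessed", str(max_comments)]
    return cmd


# Mirrors the second example command above.
print(" ".join(build_acquisition_command(
    "fr", 0.7, subreddit_file="./data/acceptedSubbredit.txt")))
```

To actually launch the step, pass the returned list to `subprocess.run(cmd, check=True)` from the repository root.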

(back to top)

Second step: Data recomposition

python constructDialogueDataset.py --help
usage: constructDialogueDataset.py [-h]
                                   [--extractedPreprocessCsvFilePath EXTRACTEDPREPROCESSCSVFILEPATH]

Construction des données dialogues

optional arguments:
  -h, --help            show this help message and exit
  --extractedPreprocessCsvFilePath EXTRACTEDPREPROCESSCSVFILEPATH
                        Path to the source file preprocessed by
                        getAndConcatData.py

Example:

python constructDialogueDataset.py --extractedPreprocessCsvFilePath "./reddit_source_fr_preprocessed.csv"
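
This README does not describe how `constructDialogueDataset.py` rebuilds dialogues, so the following is only an illustration of the general idea of recomposing threaded comments into dialogues. It is not the script's actual method. It assumes each comment carries an identifier, a reference to its parent, and a text body. The keys `id`, `parent_id`, and `body` and the function `build_dialogues` are hypothetical and may not match the real CSV columns. Each root-to-leaf reply chain becomes one dialogue.

```python
from collections import defaultdict


def build_dialogues(comments):
    """Return one list of message bodies per root-to-leaf reply chain.

    `comments` is an iterable of dicts with the hypothetical keys
    "id", "parent_id" and "body".
    """
    comments = list(comments)
    by_id = {c["id"]: c for c in comments}
    children = defaultdict(list)
    roots = []
    for c in comments:
        parent = c["parent_id"]
        if parent in by_id:
            children[parent].append(c["id"])
        else:
            # Parent missing or None: treat the comment as a thread start.
            roots.append(c["id"])

    dialogues = []
    # Iterative depth-first walk, so deep threads cannot hit the recursion limit.
    stack = [(r, [by_id[r]["body"]]) for r in reversed(roots)]
    while stack:
        comment_id, path = stack.pop()
        replies = children.get(comment_id, [])
        if not replies:
            dialogues.append(path)
            continue
        for child_id in reversed(replies):
            stack.append((child_id, path + [by_id[child_id]["body"]]))
    return dialogues


sample = [
    {"id": "a", "parent_id": None, "body": "Salut"},
    {"id": "b", "parent_id": "a", "body": "Comment vas-tu ?"},
    {"id": "c", "parent_id": "b", "body": "Bien, merci"},
    {"id": "d", "parent_id": "a", "body": "Bonjour"},
]
for dialogue in build_dialogues(sample):
    print(dialogue)
```

With this approach a comment that has several replies appears in several dialogues, one per branch. A real pipeline might instead keep only the longest branch or drop single-message chains.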

Roadmap

  • Multi-language Support
  • Add some other data sources

(back to top)

License

This work is made available under the terms of the Creative Commons Attribution - NonCommercial 3.0 France License. See LICENSE.md for more information.

(back to top)

Contact

Ilyes Aniss Bentebib - [email protected]

Project Link: https://github.com/Torilen/Create-Dialogue-Dataset

(back to top)

Acknowledgments

(back to top)
