Sources for the genre "Conversation" #276

ximenina · 2023-01-27T03:53:53Z

ximenina
Jan 27, 2023
Maintainer

Our starting point for face-to-face conversations is CHILDES. However, other talkbanks are available on the website where CHILDES is hosted.

Which talkbanks are the most appropriate to be included in TeDDi?
Do you know of other sources from which we can extract text for this genre and that cover many languages, including low-resource languages?

@tsamardzic @christianbentz @bambooforest

ximenina · 2023-01-31T01:32:13Z

ximenina
Jan 31, 2023
Maintainer Author

From talkbank.org, I think the following corpora are suitable:

Conversation Banks

Child Language Banks

We just have to check the annotation specifications to be able to extract the text and filter out unwanted tags and annotations (as much as possible).
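To make the filtering step concrete, here is a minimal sketch of what extracting plain text from a CHAT-style transcript could look like. It assumes only a small, simplified subset of the CHAT conventions (headers starting with `@`, speaker tiers starting with `*`, dependent tiers starting with `%`, bracketed codes, `&` tokens, pauses, time bullets); the function names and the sample transcript are made up for illustration, and the real annotation manual would have to be checked before relying on any of these patterns.

```python
import re

# Simplified patterns for a subset of CHAT codes (assumed, not exhaustive).
PATTERNS = [
    re.compile(r"\x15[^\x15]*\x15"),        # media time bullets
    re.compile(r"\[[^\]]*\]"),              # bracketed codes, e.g. [/], [: target]
    re.compile(r"(?<!\S)&\S+"),             # fillers and events, e.g. &-uh, &=laughs
    re.compile(r"\(\.+\)"),                 # pauses: (.), (..), (...)
    re.compile(r"(?<!\S)\+\S+"),            # special terminators, e.g. +..., +/.
    re.compile(r"\b(?:xxx|yyy|www)\b"),     # unintelligible or untranscribed material
    re.compile(r"@\w+"),                    # word-form suffixes, e.g. cookie@c
]


def clean_utterance(text: str) -> str:
    """Strip the annotation codes listed above from one utterance."""
    for pattern in PATTERNS:
        text = pattern.sub(" ", text)
    text = re.sub(r"[<>()]", "", text)          # scoping marks; (be)cause -> because
    text = " ".join(text.split())               # normalise whitespace
    return re.sub(r"\s+([.?!])$", r"\1", text)  # attach the final terminator


def extract_utterances(chat_text: str) -> list[tuple[str, str]]:
    """Return (speaker, cleaned text) pairs from the speaker tiers only."""
    raw = []
    current = None  # [speaker, text] while inside a speaker tier, else None
    for line in chat_text.split("\n"):
        if line.startswith("*"):
            if current is not None:
                raw.append(current)
            speaker, _, text = line[1:].partition(":")
            current = [speaker.strip(), text]
        elif line.startswith("\t") and current is not None:
            current[1] += " " + line.strip()    # continuation of a speaker tier
        else:
            # Header (@), dependent tier (%), or blank line ends the tier.
            if current is not None:
                raw.append(current)
            current = None
    if current is not None:
        raw.append(current)
    cleaned = [(spk, clean_utterance(text)) for spk, text in raw]
    # Drop utterances that contain no words after cleaning.
    return [(s, t) for s, t in cleaned if any(ch.isalnum() for ch in t)]


# Invented sample transcript, for illustration only.
SAMPLE = (
    "@Begin\n"
    "@Participants:\tCHI Target_Child, MOT Mother\n"
    "*MOT:\twhat do you want ? \x151000_2000\x15\n"
    "%mor:\tpro:int|what mod|do pro:per|you v|want ?\n"
    "*CHI:\t&-uh more [/] more cookie@c (.) please .\n"
    "*CHI:\txxx .\n"
    "@End\n"
)

for speaker, text in extract_utterances(SAMPLE):
    print(speaker, text)
```

Repeated words are kept on purpose here (the `[/]` code is removed but both copies of the word stay); whether repetitions and retracings should be dropped is a decision we would still have to make per corpus.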

ximenina · 2023-01-31T01:53:43Z

ximenina
Jan 31, 2023
Maintainer Author

Spoken Corpora from the Clarin project

It comprises several sources with transcriptions of spontaneous and planned speech.

It covers 16 languages: Arabic, Czech, Dutch, Estonian, Finnish, French, German, Hungarian, Italian, Nepali, Norwegian, Polish, Skolt Saami, Slovenian, Spanish, and Swedish.

We just need to check that each resource is downloadable and that its genre is suitable (for example, spontaneous speech, interviews, conversations, etc.), and be aware of the particular tags and annotations used in each corpus.
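One way to keep track of these checks while going through the list is a small screening record per corpus. This is only a sketch: the class, the field names, the genre labels, and the example entries are all hypothetical and not taken from any of the actual resources.

```python
from dataclasses import dataclass

# Genre labels based on the examples above; the exact label set is an assumption.
SUITABLE_GENRES = {"spontaneous speech", "interview", "conversation"}


@dataclass
class CorpusCandidate:
    name: str
    downloadable: bool
    genre: str
    annotation_notes: str = ""  # free-text notes on the tags used in the corpus


def passes_screening(candidate: CorpusCandidate) -> bool:
    """A corpus is shortlisted if it can be downloaded and its genre fits."""
    return candidate.downloadable and candidate.genre.lower() in SUITABLE_GENRES


# Placeholder entries, for illustration only.
candidates = [
    CorpusCandidate("example-corpus-a", True, "Interview", "XML, pause tags"),
    CorpusCandidate("example-corpus-b", True, "read speech"),
    CorpusCandidate("example-corpus-c", False, "conversation"),
]

shortlist = [c.name for c in candidates if passes_screening(c)]
print(shortlist)
```

The annotation notes do not affect the screening itself; they are only there so that the tag inventory of each shortlisted corpus is recorded before we write the extraction step.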

ximenina · 2023-02-01T03:38:08Z

ximenina
Feb 1, 2023
Maintainer Author

Other potential resources:

OpenSLR is a site that hosts speech and language resources. Some of them might be useful especially when they have orthographic transcription available (and the corpus corresponds to the Conversation genre). So far, I've identified these resources:

Multilingual TEDx
MAGICDATA Mandarin Chinese Conversational Speech Corpus

There are many other sources of spoken corpora. Unfortunately, many of them are not truly conversational, i.e., the text comes from Wikipedia or from predefined sentences that the speakers read aloud. In any case, some of these multilingual sources might be useful for obtaining text for under-resourced languages (although in a different genre):

CommonVoice Hundreds of languages, many low-resourced

https://www.openslr.org/79/ Kannada critically endangered
https://www.openslr.org/126/ Kannada critically endangered

ximenina · 2023-02-01T03:39:41Z

ximenina
Feb 1, 2023
Maintainer Author

I updated the potential sources for conversational corpora, in case you want to have a look. @vukbatanovic @tsamardzic @christianbentz

tsamardzic · 2023-02-06T17:03:54Z

tsamardzic
Feb 6, 2023
Maintainer

There might be some data for us here too: https://www.amazon.science/blog/amazon-releases-51-language-dataset-for-language-understanding (might not be for conversation)
