Replies: 5 comments
-
From talkbank.org, I think the following corpora are suitable: Conversation Banks and Child Language Banks.
-
Spoken Corpora from the CLARIN project. It comprises several sources with transcriptions of spontaneous and planned speech. It covers 16 languages: Arabic, Czech, Dutch, Estonian, Finnish, French, German, Hungarian, Italian, Nepali, Norwegian, Polish, Skolt Saami, Slovenian, Spanish, and Swedish.
-
Other potential resources:
Multilingual TEDx
Common Voice: hundreds of languages, many low-resourced
https://www.openslr.org/79/ (Kannada)
critically endangered
-
-
There might be some data for us here too: https://www.amazon.science/blog/amazon-releases-51-language-dataset-for-language-understanding (though it might not be conversational data).
-
Our starting point for face-to-face conversations is CHILDES. However, other talkbanks are available on the website where CHILDES is hosted.
Which talkbanks are the most appropriate to be included in TeDDi?
Do you know of any other sources from which we can extract text for this genre, covering many languages, including low-resourced ones?
@tsamardzic @christianbentz @bambooforest