The TeDDi repository is a collaborative effort to collect, curate, and analyze corpora.
Here you will find reports on our practices for transcribing and annotating texts:
Each file must contain a correctly formatted header. An example is given in the header_template.tsv file.
A description of the sampling algorithm is on Overleaf:
The teddi_sample/Crawlers directory contains the web crawlers.
We use GitHub to maintain the corpus and code. You can read more about what GitHub is here:
If you haven't used GitHub before, you will need to create a free account here:
and you can start working through the tutorials here:
Please use the GitHub workflow, described in detail here:
When adding or updating data or code in this repository, please use the issue tracker and submit a pull request from your branch.
We follow the standard practice for collaborative development on GitHub, the fork and pull request model:
There are a number of good GUIs for working with GitHub. For example:
If you are the command line type, these documents contain more information about setting up a fork and pull request workflow:
- https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/configuring-a-remote-for-a-fork
- https://gist.github.com/Chaser324/ce0505fbed06b947d962
- https://www.gun.io/blog/how-to-github-fork-branch-and-pull-request
- https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/creating-a-pull-request-from-a-fork
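The command line workflow described in the documents above can be sketched roughly as follows. This is a minimal illustration, not the project's prescribed procedure: it uses local bare repositories as stand-ins for the shared repository and your fork so that it runs without network access, and the branch name `add-data`, the file `example.txt`, and the demo identity are hypothetical. With GitHub, the two paths would be replaced by the URL of the shared repository and the URL of your own fork.

```shell
#!/bin/sh
set -eu

# Scratch directory holding the local stand-ins (assumption: on GitHub
# these would be the shared repository URL and your fork's URL).
work=$(mktemp -d)
upstream="$work/upstream.git"
fork="$work/fork.git"

# Create the stand-in for the shared repository and seed it with one commit.
git init -q --bare "$upstream"
git -C "$upstream" symbolic-ref HEAD refs/heads/main
git init -q "$work/seed"
git -C "$work/seed" symbolic-ref HEAD refs/heads/main
git -C "$work/seed" -c user.name=demo -c user.email=demo@example.org \
    commit -q --allow-empty -m "Initial commit"
git -C "$work/seed" push -q "$upstream" main

# "Forking" on GitHub is done in the web interface; here a bare clone plays that role.
git clone -q --bare "$upstream" "$fork"

# 1. Clone your fork and register the shared repository as the "upstream" remote.
git clone -q "$fork" "$work/local"
cd "$work/local"
git remote add upstream "$upstream"

# 2. Fetch the latest upstream state and start a topic branch from it.
git fetch -q upstream
git checkout -q -b add-data upstream/main

# 3. Commit your changes on the topic branch.
echo "example" > example.txt
git add example.txt
git -c user.name=demo -c user.email=demo@example.org \
    commit -q -m "Add example file"

# 4. Push the branch to your fork.
git push -q -u origin add-data

# Show that the branch now exists on the fork.
git ls-remote --heads origin add-data
```

Once the branch is on your fork, the pull request itself is opened on GitHub, from `add-data` on the fork to the shared repository's main branch.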