The Definition Relation Extractor is a set of tools to pre-annotate natural language dictionary definitions, post-process their classification, and generate an RDF graph from them, following the conceptual model proposed in this work:
Vivian S. Silva, Siegfried Handschuh and André Freitas. Categorization of Semantic Roles for Dictionary Definitions. Cognitive Aspects of the Lexicon (CogALex-V), Workshop at the 26th International Conference on Computational Linguistics (COLING), Osaka, 2016.
The definitions are pre-annotated based on syntactic patterns. Sample data generated by the pre-annotation can be manually curated to feed a machine learning classifier that can, in turn, classify a whole linguistic resource. This final classified data can then be converted into an RDF graph.
The WordNetGraph is an example of a graph generated by the Definition Relation Extractor. Pre-annotated data was curated with the help of the Brat annotation tool and then used to train an RNN model. The trained model was used to classify all of WordNet's noun and verb definitions, which were later post-processed to fix some mistakes in the sequence of labels and finally converted to an RDF graph.
Reads a list of natural language definitions and identifies the semantic roles in each definition.
Input:
List of definitions: one per line in the format id|POS|word_list|def, where:
- id: the synset id (an integer, starting from 1)
- POS: noun or verb
- word_list: a comma-separated list of words that compose the synset (1 to n)
- def: the definition text
Output:
Pre-annotated data file: definitions classified in IOB format
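As an illustration of the input format described above, the following is a minimal Python sketch of how one line in the id|POS|word_list|def format could be parsed. It is not the tool's own parser: the `Definition` type, the `parse_definition` function, and the sample line are hypothetical names and data introduced only for this example.

```python
from typing import List, NamedTuple


class Definition(NamedTuple):
    """One input record (hypothetical container, not part of the tool)."""
    synset_id: int
    pos: str
    words: List[str]
    text: str


def parse_definition(line: str) -> Definition:
    # Split on the first three pipes only, so a "|" inside the
    # definition text is kept as part of the text.
    synset_id, pos, word_list, text = line.rstrip("\n").split("|", 3)
    # word_list is comma-separated and holds 1 to n words.
    words = [w.strip() for w in word_list.split(",") if w.strip()]
    return Definition(int(synset_id), pos.strip(), words, text.strip())


# Illustrative line, made up for this example.
record = parse_definition("7|noun|dog, domestic dog|a domesticated mammal")
print(record.words)
```

Splitting with a maximum of three separators is a design choice that keeps the definition text intact even if it happens to contain the delimiter.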
Reads classified data generated by a machine learning classifier and prepares it to be converted into an RDF graph.
Input:
List of definitions: one per line in the format id|POS|word_list|def, where:
- id: the synset id (an integer, starting from 1)
- POS: noun or verb
- word_list: a comma-separated list of words that compose the synset (1 to n)
- def: the definition text
Classified data: file in IOB format (returned by the RNN classifier)
Note: the definitions must appear in the same order in both files
Output:
Fixed classified data: file in IOB format with all classifications fixed (missing supertypes added and inconsistent IOB sequences adjusted)
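To make the idea of adjusting inconsistent IOB sequences concrete, here is a minimal Python sketch of one common repair: an I- tag that does not continue a span of the same role is promoted to a B- tag. This is an assumption about what such a fix can look like, not the routine's actual rules; the function name `fix_iob` and the role labels are hypothetical, and adding missing supertypes is not covered.

```python
from typing import List, Optional


def fix_iob(tags: List[str]) -> List[str]:
    """Promote orphan I- tags to B- tags (one possible repair)."""
    fixed: List[str] = []
    prev_role: Optional[str] = None  # role of the span currently open, if any
    for tag in tags:
        if tag == "O":
            fixed.append(tag)
            prev_role = None
            continue
        prefix, _, role = tag.partition("-")
        # An I- tag is only valid right after a B-/I- tag of the same role.
        if prefix == "I" and role != prev_role:
            tag = "B-" + role
        fixed.append(tag)
        prev_role = role
    return fixed


# Role names below are placeholders for illustration.
print(fix_iob(["O", "I-SUPERTYPE", "I-SUPERTYPE", "O", "I-QUALITY"]))
```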
Input:
List of definitions: one per line in the format id|POS|word_list|def, where:
- id: the synset id (an integer, starting from 1)
- POS: noun or verb
- word_list: a comma-separated list of words that compose the synset (1 to n)
- def: the definition text
Classified data: file in IOB format (preferably the one returned by the PostProcessing routine)
Note: the definitions must appear in the same order in both files
Output:
RDF files in XML and/or N-TRIPLES format (options must be set in the configuration file params.txt in the conf folder)
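The conversion from IOB labels to triples can be pictured with the rough Python sketch below. It groups labelled tokens into segments and writes one N-Triples statement per segment. The namespace `http://example.org/def/`, the function names, the role labels, and the flat one-triple-per-segment layout are all assumptions made for illustration; the graph structure actually produced follows the conceptual model of the cited paper and may differ.

```python
from typing import List, Tuple
from urllib.parse import quote

BASE = "http://example.org/def/"  # hypothetical namespace


def segments(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group tokens into (role, text) segments from well-formed IOB tags."""
    out = []
    open_role = None
    for token, tag in zip(tokens, tags):
        prefix, _, role = tag.partition("-")
        if prefix == "B":
            out.append([role, [token]])
            open_role = role
        elif prefix == "I" and role == open_role:
            out[-1][1].append(token)
        else:
            # "O" or an inconsistent I- tag closes the open segment.
            open_role = None
    return [(role, " ".join(words)) for role, words in out]


def to_ntriples(word: str, role: str, text: str) -> str:
    """Serialize one segment as an N-Triples line with a plain literal."""
    literal = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'<{BASE}{quote(word)}> <{BASE}{role.lower()}> "{literal}" .'


tokens = ["a", "large", "wild", "cat"]
tags = ["O", "B-QUALITY", "I-QUALITY", "B-SUPERTYPE"]
for role, text in segments(tokens, tags):
    print(to_ntriples("tiger", role, text))
```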
Auxiliary routines to convert data between different formats.
Generates a file in the standoff format to be read by the Brat annotation tool.
Input:
Classified data: file in IOB format
Output:
File in standoff format as defined by the Brat tool
Reads the standoff file generated by the Brat tool after data annotation and converts it back to IOB format.
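For orientation, Brat text-bound annotations take the form `ID<TAB>TYPE START END<TAB>TEXT`, with character offsets into the annotated text. The Python sketch below shows one way IOB-tagged tokens could be turned into such lines. It assumes the text is the tokens joined by single spaces, and the function name `iob_to_standoff` and the role labels are hypothetical; the routine's own tokenization and offsets may differ.

```python
from typing import List


def iob_to_standoff(tokens: List[str], tags: List[str]) -> List[str]:
    """Build Brat text-bound annotation lines from IOB-tagged tokens."""
    spans = []   # each span is [role, start_offset, end_offset]
    span = None  # span currently open, if any
    offset = 0
    for token, tag in zip(tokens, tags):
        start, end = offset, offset + len(token)
        prefix, _, role = tag.partition("-")
        if prefix == "B":
            span = [role, start, end]
            spans.append(span)
        elif prefix == "I" and span is not None and span[0] == role:
            span[2] = end  # extend the open span to cover this token
        else:
            span = None
        offset = end + 1  # assumes exactly one space between tokens
    text = " ".join(tokens)
    return [
        f"T{i}\t{role} {s} {e}\t{text[s:e]}"
        for i, (role, s, e) in enumerate(spans, 1)
    ]


for line in iob_to_standoff(
    ["a", "large", "wild", "cat"],
    ["O", "B-QUALITY", "I-QUALITY", "B-SUPERTYPE"],
):
    print(line)
```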
Input:
List of definitions: one per line in the format id|POS|word_list|def, where:
- id: the synset id (an integer, starting from 1)
- POS: noun or verb
- word_list: a comma-separated list of words that compose the synset (1 to n)
- def: the definition text
Standoff file: file generated by the Brat tool
Note: the sequence in the list of definitions must be the same as in the one sent as input to the Brat tool
Output:
File in IOB format
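The reverse direction can be sketched in the same spirit: read the text-bound annotation lines, tokenize the text on whitespace, and label each token by the span that contains it. This is only an illustration under stated assumptions: spans are contiguous (Brat's discontinuous `start end;start end` form is not handled), tokens are whitespace-delimited, and `standoff_to_iob` is a hypothetical name.

```python
import re
from typing import List, Tuple


def standoff_to_iob(text: str, ann_lines: List[str]) -> List[Tuple[str, str]]:
    """Label whitespace tokens of `text` using Brat text-bound annotations."""
    spans = []
    for line in ann_lines:
        if not line.startswith("T"):
            continue  # skip relations, attributes, notes, etc.
        _ann_id, type_and_offsets, _covered = line.rstrip("\n").split("\t", 2)
        role, start, end = type_and_offsets.split(" ")
        spans.append((int(start), int(end), role))

    result = []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, role in spans:
            if start <= match.start() and match.end() <= end:
                # First token of a span gets B-, later ones get I-.
                tag = ("B-" if match.start() == start else "I-") + role
                break
        result.append((match.group(), tag))
    return result


annotations = ["T1\tQUALITY 2 12\tlarge wild", "T2\tSUPERTYPE 13 16\tcat"]
for token, tag in standoff_to_iob("a large wild cat", annotations):
    print(token, tag)
```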
Generates a Python script that creates the dataset to be used as input for the RNN model.
Input:
Classified data: file in IOB format
Output:
A Python script that generates a pickle file to feed the RNN model
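The kind of dataset such a script might produce can be sketched as follows. The layout is an assumption, not the generated script's actual structure: the IOB file is taken to hold one whitespace-separated `token tag` pair per line with blank lines between definitions, and the function names and dictionary keys (`x`, `y`, `word2idx`, `label2idx`) are hypothetical.

```python
import pickle
from typing import Dict, Iterable, List, Tuple

Sentence = List[Tuple[str, str]]


def read_iob(lines: Iterable[str]) -> List[Sentence]:
    """Group 'token tag' lines into sentences separated by blank lines."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in lines:
        line = line.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        token, tag = line.rsplit(None, 1)  # tag is the last field
        current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences


def build_dataset(sentences: List[Sentence]) -> Dict[str, object]:
    """Map words and labels to integer indices, as sequence models expect."""
    words = sorted({w for s in sentences for w, _ in s})
    labels = sorted({t for s in sentences for _, t in s})
    word2idx = {w: i for i, w in enumerate(words)}
    label2idx = {t: i for i, t in enumerate(labels)}
    return {
        "x": [[word2idx[w] for w, _ in s] for s in sentences],
        "y": [[label2idx[t] for _, t in s] for s in sentences],
        "word2idx": word2idx,
        "label2idx": label2idx,
    }


sample = ["a O", "cat B-SUPERTYPE", "", "to O", "run B-SUPERTYPE"]
dataset = build_dataset(read_iob(sample))
payload = pickle.dumps(dataset)  # use pickle.dump(dataset, file) to write to disk
```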