A set of tools to pre-annotate, post-process and generate an RDF graph from natural language dictionary definitions.

ssvivian/DefRelExtractor

Definition Relation Extractor

The Definition Relation Extractor is a set of tools to pre-annotate, post-process and generate an RDF graph from natural language dictionary definitions, following the conceptual model proposed in:

Vivian S. Silva, Siegfried Handschuh and André Freitas. Categorization of Semantic Roles for Dictionary Definitions. Cognitive Aspects of the Lexicon (CogALex-V), Workshop at the 26th International Conference on Computational Linguistics, (COLING), Osaka, 2016.

The definitions are pre-annotated based on syntactic patterns. Sample data generated by the pre-annotation can be manually curated to feed a machine learning classifier that can, in turn, classify a whole linguistic resource. This final classified data can then be converted into an RDF graph.

The WordNetGraph is an example of a graph generated by the Definition Relation Extractor. Pre-annotated data was curated with the help of the Brat annotation tool and then used to train an RNN model. The trained model was used to classify all of WordNet's noun and verb definitions, which were then post-processed to fix some mistakes in the label sequences and finally converted into an RDF graph.

Dependencies

Pre-annotation

Class extraction.RoleExtractor

Reads a list of natural language definitions and identifies the semantic roles in each of them

Input:

List of definitions: one per line in the format id|POS|word_list|def, where:

  • id: the synset id (an integer, starting from 1)
  • POS: noun or verb
  • word_list: a comma-separated list of words that compose the synset (1 to n)
  • def: the definition text
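The input format above can be read with a few lines of code. The following is a minimal Python sketch, not part of the tool itself; the names `Definition` and `parse_definition_line` and the sample line are invented for illustration. It assumes that only the first three pipes are field separators, so that a pipe inside the definition text is preserved.

```python
from typing import List, NamedTuple


class Definition(NamedTuple):
    """One input line, split into its four fields (hypothetical helper type)."""
    synset_id: int
    pos: str
    words: List[str]
    text: str


def parse_definition_line(line: str) -> Definition:
    # Split on the first three pipes only: anything after them is definition text.
    parts = line.rstrip("\n").split("|", 3)
    if len(parts) != 4:
        raise ValueError("expected id|POS|word_list|def, got: %r" % line)
    synset_id, pos, word_list, text = parts
    # The word list is comma-separated and must contain at least one word.
    words = [w.strip() for w in word_list.split(",") if w.strip()]
    if not words:
        raise ValueError("empty word list in: %r" % line)
    return Definition(int(synset_id), pos.strip(), words, text.strip())


# Invented sample line, for illustration only.
print(parse_definition_line("7|noun|poet, bard|a writer of poems"))
```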

Output:

Pre-annotated data file: definitions classified in IOB format
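In IOB format, each token receives a tag: `B-<role>` for the first token of a role segment, `I-<role>` for the following tokens of the same segment, and `O` for tokens outside any segment. The Python sketch below illustrates the encoding only; the role names `SUPERTYPE` and `DIFFERENTIA` are placeholders, and the tool's actual label set and file layout may differ.

```python
def spans_to_iob(tokens, spans):
    """Encode labelled token spans as IOB tags.

    spans: list of (start_token, end_token_exclusive, label) tuples.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label          # first token of the segment
        for i in range(start + 1, end):
            tags[i] = "I-" + label          # remaining tokens of the segment
    return tags


tokens = "a writer of poems".split()
# Placeholder role labels, chosen for illustration.
tags = spans_to_iob(tokens, [(1, 2, "SUPERTYPE"), (2, 4, "DIFFERENTIA")])
for token, tag in zip(tokens, tags):
    print(token, tag)
```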

Post-Processing

Class extraction.PostProcessing

Reads classified data generated by a machine learning classifier and prepares it to be converted into an RDF graph

Input:

List of definitions: one per line in the format id|POS|word_list|def, where:

  • id: the synset id (an integer, starting from 1)
  • POS: noun or verb
  • word_list: a comma-separated list of words that compose the synset (1 to n)
  • def: the definition text

Classified data: file in IOB format (returned by the RNN classifier)

Note: sequence of definitions in both files must match
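Because the two files are matched by position, a quick alignment check before running the tools can catch shifted or missing entries. The Python sketch below rests on assumptions not stated in this README: that the IOB file holds one `token tag` pair per line with a blank line between definitions, and that definitions are tokenized on whitespace. The function names are hypothetical.

```python
def read_iob_blocks(text):
    """Split IOB text into one list of (token, tag) pairs per definition."""
    blocks, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current definition (assumed layout).
            if current:
                blocks.append(current)
                current = []
            continue
        fields = line.split()
        current.append((fields[0], fields[-1]))
    if current:
        blocks.append(current)
    return blocks


def check_alignment(definition_lines, iob_blocks):
    """Return the 0-based indices where definition text and IOB tokens differ."""
    if len(definition_lines) != len(iob_blocks):
        raise ValueError("files contain a different number of definitions")
    mismatches = []
    for i, (line, block) in enumerate(zip(definition_lines, iob_blocks)):
        text = line.rstrip("\n").split("|", 3)[3]
        if text.split() != [token for token, _ in block]:
            mismatches.append(i)
    return mismatches
```

An empty result means every definition lines up with its tagged token sequence under these assumptions.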

Output:

Fixed classified data: file in IOB format with all classifications fixed (missing supertypes added and inconsistent IOB sequences adjusted)
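One common kind of inconsistent IOB sequence is an `I-` tag that does not continue a segment with the same label. The Python sketch below shows one plausible repair, promoting such a tag to `B-`. It is an illustration only: the rules actually applied by PostProcessing are not documented here, and the addition of missing supertypes is not covered.

```python
def repair_iob(tags):
    """Promote any I-<label> that does not continue a <label> segment to B-<label>."""
    fixed = []
    prev_label = None  # label of the segment the previous token belongs to
    for tag in tags:
        if tag == "O":
            fixed.append(tag)
            prev_label = None
            continue
        prefix, _, label = tag.partition("-")
        if prefix == "I" and prev_label != label:
            tag = "B-" + label  # orphan continuation: start a new segment
        fixed.append(tag)
        prev_label = label
    return fixed


# Placeholder labels, for illustration.
print(repair_iob(["O", "I-SUPERTYPE", "I-SUPERTYPE", "O", "B-X", "I-Y"]))
```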

RDF Model Construction

Class model.ModelBuilder

Input:

List of definitions: one per line in the format id|POS|word_list|def, where:

  • id: the synset id (an integer, starting from 1)
  • POS: noun or verb
  • word_list: a comma-separated list of words that compose the synset (1 to n)
  • def: the definition text

Classified data: file in IOB format (preferably the one returned by the PostProcessing routine)

Note: sequence of definitions in both files must match

Output:

RDF files in XML and/or N-TRIPLES format (options must be set in the configuration file params.txt in the conf folder)
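For readers unfamiliar with N-Triples, each line holds one subject, predicate and object followed by a period. The Python sketch below writes such lines by hand to show the shape of the output. The `http://example.org/` namespace and the `has_supertype` predicate are invented for illustration and are not the vocabulary used by ModelBuilder; IRIs are not validated and only the most common literal escapes are handled.

```python
def escape_literal(value):
    """Escape backslashes, quotes and line breaks for an N-Triples string literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def to_ntriples(triples):
    """Serialize (subject, predicate, object, object_is_literal) tuples."""
    lines = []
    for subject, predicate, obj, is_literal in triples:
        if is_literal:
            obj_term = '"%s"' % escape_literal(obj)
        else:
            obj_term = "<%s>" % obj
        lines.append("<%s> <%s> %s ." % (subject, predicate, obj_term))
    return "\n".join(lines) + "\n"


EX = "http://example.org/"  # hypothetical namespace
print(to_ntriples([
    (EX + "poet", EX + "has_supertype", EX + "writer", False),
    (EX + "poet", EX + "definition", "a writer of poems", True),
]), end="")
```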

Utils

Auxiliary routines to convert data between different formats

Class util.IOBtoStandoff

Generates a file in the standoff format to be read by the Brat annotation tool

Input:

Classified data: file in IOB format

Output:

File in standoff format as defined by the Brat tool
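In Brat standoff format, a text-bound annotation is a line of the form `T1<TAB>Label start end<TAB>covered text`, where the offsets are character positions in the accompanying text file. The Python sketch below shows how IOB tags can be turned into such lines. It is not the tool's implementation: it assumes tokens are joined by single spaces, and the labels are placeholders.

```python
def iob_to_standoff(tokens, tags):
    """Return (text, annotation_lines) for one IOB-tagged token sequence."""
    text = " ".join(tokens)
    spans = []   # each entry is [label, start_offset, end_offset]
    offset = 0
    for token, tag in zip(tokens, tags):
        start, end = offset, offset + len(token)
        offset = end + 1  # skip the single space after the token
        if tag == "O":
            continue
        prefix, _, label = tag.partition("-")
        continues = (prefix == "I" and spans
                     and spans[-1][0] == label
                     and spans[-1][2] == start - 1)
        if continues:
            spans[-1][2] = end            # extend the open segment
        else:
            spans.append([label, start, end])
    lines = ["T%d\t%s %d %d\t%s" % (i, label, s, e, text[s:e])
             for i, (label, s, e) in enumerate(spans, 1)]
    return text, lines


text, lines = iob_to_standoff("a writer of poems".split(),
                              ["O", "B-SUPERTYPE", "B-DIFF", "I-DIFF"])
print(text)
print("\n".join(lines))
```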

Class util.StandofftoIOB

Reads the standoff file generated by the Brat tool after data annotation and converts it back to IOB format

Input:

List of definitions: one per line in the format id|POS|word_list|def, where:

  • id: the synset id (an integer, starting from 1)
  • POS: noun or verb
  • word_list: a comma-separated list of words that compose the synset (1 to n)
  • def: the definition text

Standoff file: file generated by the Brat tool

Note: the sequence in the list of definitions must be the same as in the one sent as input to the Brat tool

Output:

File in IOB format
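The reverse conversion can be sketched in the same way: read the text-bound annotation lines, then tag each token according to the span that contains it. The Python sketch below is an illustration under assumptions, not the tool's code: it tokenizes on whitespace, handles only contiguous annotations (no `;`-separated discontinuous offsets), and ignores every line that is not a text-bound (`T`) annotation.

```python
import re


def standoff_to_iob(text, annotation_lines):
    """Return (tokens, tags) for a text and its Brat text-bound annotations."""
    spans = []
    for line in annotation_lines:
        if not line.startswith("T"):
            continue  # skip relations, attributes, notes, etc.
        fields = line.rstrip("\n").split("\t")
        label, start, end = fields[1].split(" ")
        spans.append((int(start), int(end), label))

    tokens, tags = [], []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, label in spans:
            if start <= match.start() and match.end() <= end:
                # The token that opens the span gets B-, the rest get I-.
                tag = ("B-" if match.start() == start else "I-") + label
                break
        tokens.append(match.group())
        tags.append(tag)
    return tokens, tags


print(standoff_to_iob("a writer of poems",
                      ["T1\tSUPERTYPE 2 8\twriter", "T2\tDIFF 9 17\tof poems"]))
```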

Class util.DataScriptBuilder

Generates a Python script for creating the dataset to be sent as input to the RNN model

Input:

Classified data: file in IOB format

Output:

A Python script to generate a pickle file to feed the RNN model
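As a rough idea of what such a dataset script might do, the Python sketch below maps tokens and tags to integer indices and serializes the result with `pickle`. The dictionary layout (`x`, `y`, `word2idx`, `label2idx`) is an assumption made for illustration; the structure actually expected by the RNN model is not specified in this README.

```python
import pickle


def build_dataset(sentences):
    """Index tokens and tags of IOB-tagged sentences (lists of (token, tag))."""
    word2idx, label2idx = {}, {}
    xs, ys = [], []
    for sentence in sentences:
        # setdefault assigns the next free index the first time a key is seen.
        xs.append([word2idx.setdefault(token.lower(), len(word2idx))
                   for token, _ in sentence])
        ys.append([label2idx.setdefault(tag, len(label2idx))
                   for _, tag in sentence])
    return {"x": xs, "y": ys, "word2idx": word2idx, "label2idx": label2idx}


dataset = build_dataset([
    [("a", "O"), ("writer", "B-SUPERTYPE")],   # placeholder label
    [("a", "O")],
])
payload = pickle.dumps(dataset)       # bytes that could be written to a file
restored = pickle.loads(payload)
print(restored["x"], restored["y"])
```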
