Lexicological framework for pipeline text processing
RAPOSA processes texts word by word, applying different filters in a conveyor-belt-like fashion.
You define a pipeline with its tokenization method and the different tubes through which the tokens will travel. A tube may modify a token, discard it, tag it, or do any combination of the three. Some basic pipelines and tubes are included, but every case is different, so customization was the key guiding principle. As such, we encourage you to check the demo.py file and the code itself to learn how to create and combine your own derived classes.
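The sketch below illustrates the conveyor-belt idea only; the class names (Token, Tube, Pipeline) and the example tubes are hypothetical stand-ins, not RAPOSA's actual API, which lives in demo.py and the source.

```python
# Illustrative sketch only: RAPOSA's real class names and signatures may differ.
# A tokenizer feeds tokens through a chain of tubes; each tube may modify,
# discard, or tag a token before passing it on.

class Token:
    def __init__(self, text):
        self.text = text
        self.tags = set()

class Tube:
    def process(self, token):
        """Return the (possibly modified) token, or None to discard it."""
        return token

class LowercaseTube(Tube):
    def process(self, token):
        token.text = token.text.lower()          # modify the token
        return token

class StopwordTube(Tube):
    def __init__(self, stopwords):
        self.stopwords = set(stopwords)
    def process(self, token):
        # discard the token if it is a stopword
        return None if token.text in self.stopwords else token

class UnknownWordTagTube(Tube):
    def __init__(self, lexicon):
        self.lexicon = set(lexicon)
    def process(self, token):
        if token.text not in self.lexicon:
            token.tags.add("candidate-neologism")  # tag the token
        return token

class Pipeline:
    def __init__(self, tokenizer, tubes):
        self.tokenizer = tokenizer
        self.tubes = tubes
    def run(self, text):
        for token in self.tokenizer(text):
            for tube in self.tubes:
                token = tube.process(token)
                if token is None:
                    break                          # token was discarded
            else:
                yield token                        # token survived all tubes

if __name__ == "__main__":
    pipeline = Pipeline(
        tokenizer=lambda text: (Token(w) for w in text.split()),
        tubes=[LowercaseTube(),
               StopwordTube({"a", "o", "e"}),
               UnknownWordTagTube({"palabra", "texto"})],
    )
    for tok in pipeline.run("O texto ten unha palabra descoñecida"):
        print(tok.text, sorted(tok.tags))
```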
The intended use case for RAPOSA is lexicological analysis, and it is particularly convenient for neology, lexicography, and morphology, but its open-endedness and customizability allow for many other kinds of purposes. For this reason, it also includes many other NLP/CompLing goodies.
RAPOSA is not tied to any specific language, though it currently ships filters for only a few languages due to development time constraints. RAPOSA is especially proud to support minorized and minority languages.
Contributions are always warmly welcome and appreciated!
As the software is under development, look at demo.py for examples until proper docs are in place.
The initial version of this package was developed under a research scholarship from the Deputación da Coruña for the year 2016.
The software is released under an MIT License (see the LICENSE file in the root folder for details), except for the following resources, which are derivative works:
- Module langs.gl.stemmer is an adaptation of the code at http://bvg.udc.es/recursos_lingua/stemming.jsp, copyright 2006 Biblioteca Virtual Galega.
- The data in langs/es/data/drae_2011.dat is taken from the lemmas for the Diccionario de la Real Academia Española as released at http://dirae.es/
- The data in langs/es/data/nombres_2016.dat and langs/es/data/apellidos_2016.dat is taken from official data from the Spanish Instituto Nacional de Estadística: http://www.ine.es/dyngs/INEbase/es/operacion.htm?c=Estadistica_C&cid=1254736177009&menu=resultados&idp=1254734710990
- The data in langs/gl/data/corga_1.7.dat is taken from the frequency list of the Corpus de Referencia do Galego Actual: http://corpus.cirp.es/corga/ As such, this modified lexicon is released, like the original, under the terms of the Lesser General Public License For Linguistic Resources. See the LICENSE file in that folder for details.
- The data in langs/gl/data/xiada_2.6.dat is taken from the XIADA project by the Centro Ramón Piñeiro para a Investigación en Humanidades: http://corpus.cirp.es/xiada/ As such, this modified lexicon is released, like the original, under the terms of the Lesser General Public License For Linguistic Resources. See the LICENSE file in that folder for details.
- The data in langs/gl/data/estraviz_09_2017.dat is taken from the sitemaps for the Dicionário Estraviz: http://estraviz.org/
- The data in langs/gl/data/toponimia_2013.dat is taken from official data from the Xunta de Galicia: http://abertos.xunta.gal/catalogo/territorio-vivienda-transporte/-/dataset/0159/microtoponimia-galicia