Arthropod gene and species Named Entity Recognition

This project aims to collect a literature corpus of abstracts from the arthropod sciences, with automatically or manually labeled entities, to serve as our training and testing data.

We then use spaCy and computational linguistics algorithms to make inferences and gain insights from that data.

  • TeamTat - a collaborative, web-based text annotation tool that can also be set up locally,
    equipped to manage team annotation projects engagingly and efficiently.

  • spaCy - a free, open-source library for advanced Natural Language Processing (NLP) in Python,
    designed specifically for production use. It helps you build applications that process and “understand” large volumes of text, and can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning.

Requirements

Python Anaconda and Jupyter Notebook
Python 3.3 or greater is required to install the Jupyter Notebook.

  • Anaconda and Jupyter Notebook Install Instructions - Windows
  • How to install Python 3.6 and run the Spyder Integrated Development Environment (IDE) or the Jupyter Notebook (video)

Installation

1. Convert data format from BioC XML to spaCy

User guide

The BioC XML format of TeamTat's annotations

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE collection SYSTEM "BioC.dtd">
<collection>
  <source>PubTator</source>
  <date/>
  <key>BioC.key</key>
  <document>
    <id>3392027</id>
    <infon key="tt_curatable">no</infon>
    <infon key="tt_version">0</infon>
    <infon key="tt_round">0</infon>
    <passage>
      <infon key="type">title</infon>
      <offset>0</offset>
      <text>(title)Primary structure of apolipophorin-III from the migratory locust, Locusta migratoria. Potential amphipathic structures and molecular evolution of an insect apolipoprotein.</text>
      <annotation id="1">
        <infon key="identifier"></infon>
        <infon key="type">Gene</infon>
        <infon key="updated_at">1980-01-01T00:00:00Z</infon>
        <location offset="21" length="17"/>
        <text>apolipophorin-III</text>
      </annotation>
      <annotation id="2">
        <infon key="identifier">7004</infon>
        <infon key="type">Species</infon>
        <infon key="updated_at">1980-01-01T00:00:00Z</infon>
        <location offset="48" length="16"/>
        <text>migratory locust</text>
      </annotation>
    </passage>
  </document>
</collection>

spaCy's entity annotation format

train_data = [
    ('Primary structure of ....', {'entities': [(21, 38, 'Gene'), (48, 64, 'Species'), (66, 84, 'Species'), (225, 242, 'Gene'), (248, 266, 'Species'), (423, 440, 'Gene'), (450, 466, 'Species'), (468, 481, 'Species'), (597, 610, 'Species'), (969, 978, 'Species'), (1159, 1168, 'Species'), (1234, 1243, 'Species')]}),
    # (ex2), (ex3): further examples in the same format
]
# Each example: ("Example text or content" (string), {"entities": [(start_position (int), end_position (int), "label_name" (string))]})
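The mapping between the two formats above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's BioCxml2spacy.py: the function name `bioc_to_spacy` and the `SAMPLE` string are hypothetical, the standard-library `xml.etree.ElementTree` is used in place of lxml so the snippet runs without extra installs, and the sample text leaves out the "(title)" marker so that the annotation offsets line up with the string. It also assumes that each annotation offset is counted from the start of the document, so the passage offset is subtracted to get positions within the passage text.

```python
import xml.etree.ElementTree as ET

# Reduced copy of the BioC XML example (XML declaration and DOCTYPE omitted).
SAMPLE = """<collection>
  <document>
    <id>3392027</id>
    <passage>
      <infon key="type">title</infon>
      <offset>0</offset>
      <text>Primary structure of apolipophorin-III from the migratory locust, Locusta migratoria.</text>
      <annotation id="1">
        <infon key="type">Gene</infon>
        <location offset="21" length="17"/>
        <text>apolipophorin-III</text>
      </annotation>
      <annotation id="2">
        <infon key="type">Species</infon>
        <location offset="48" length="16"/>
        <text>migratory locust</text>
      </annotation>
    </passage>
  </document>
</collection>"""


def bioc_to_spacy(xml_string):
    """Return a list of (text, {"entities": [(start, end, label), ...]}) tuples."""
    root = ET.fromstring(xml_string)
    train_data = []
    for passage in root.iter("passage"):
        text = passage.findtext("text", default="")
        # Assumption: annotation offsets are document-level, so shift by the passage offset.
        base = int(passage.findtext("offset", default="0"))
        entities = []
        for annotation in passage.findall("annotation"):
            label = annotation.findtext("infon[@key='type']")
            location = annotation.find("location")
            if label is None or location is None:
                continue  # skip annotations without a label or a position
            start = int(location.get("offset")) - base
            end = start + int(location.get("length"))
            entities.append((start, end, label))
        train_data.append((text, {"entities": sorted(entities)}))
    return train_data


if __name__ == "__main__":
    for text, annotations in bioc_to_spacy(SAMPLE):
        print(annotations["entities"])
        for start, end, label in annotations["entities"]:
            print(label, repr(text[start:end]))
```

The end position is computed as offset plus length, which is why the Gene annotation with offset 21 and length 17 becomes (21, 38, 'Gene') in the spaCy format.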

Prerequisites

  • Python 3.x (x > 4, i.e. Python 3.5 or later)
    Check with the command python --version
  • pip (package manager)
    Check with the command pip --version
  • lxml module
    Install with the command pip install lxml
    Run pip list to check whether lxml is installed
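The same checks can also be run from inside Python, which may be convenient in a Jupyter Notebook. The helper below is a hypothetical sketch (the name `check_prerequisites` is not part of the repository); it only reports whether the interpreter version is high enough and whether the lxml module can be found.

```python
import importlib.util
import sys


def check_prerequisites(min_version=(3, 5), modules=("lxml",)):
    """Report whether the Python version and the required modules are available."""
    report = {"python_ok": sys.version_info[:2] >= min_version}
    for name in modules:
        # find_spec returns None when the module is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for item, ok in check_prerequisites().items():
        print(item, "OK" if ok else "MISSING")
```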

Getting started step by step

  • Step1 - Clone the repo to local path
cd `Your path`
git clone https://github.com/ShangYuChiang/NER.git
  • Step2 - Run BioCxml2spacy.py
cd NER/BioCxml2spacy
python BioCxml2spacy.py
  • Step3 - The results are shown in the output file
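Before using converted data for training, it can be worth checking the entity spans, since spaCy's NER expects spans that lie inside the text and do not overlap. The function below is a hypothetical helper, not part of the repository, and it assumes the data has already been loaded into the (text, entities) form shown earlier; how the output file is read is left open here.

```python
def find_span_problems(text, entities):
    """Return a list of (start, end, label, reason) for suspicious entity spans."""
    problems = []
    previous_end = 0
    for start, end, label in sorted(entities):
        if not (0 <= start < end <= len(text)):
            problems.append((start, end, label, "out of range"))
            continue
        if start < previous_end:
            problems.append((start, end, label, "overlaps previous span"))
        if text[start:end] != text[start:end].strip():
            problems.append((start, end, label, "leading or trailing whitespace"))
        previous_end = max(previous_end, end)
    return problems


if __name__ == "__main__":
    title = "Primary structure of apolipophorin-III from the migratory locust"
    spans = [(21, 38, "Gene"), (48, 64, "Species")]
    print(find_span_problems(title, spans))  # an empty list means no problems were found
```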

2. Basic spaCy NER example and implementation

  • Follow the README.md tutorial at NER_GS Github

3. spaCy Arthropod gene Named Entity Recognition example and experiment

Other

Convert annotation format from BioC XML to spaCy: Github Link
Convert text file into BioC XML format: Github Link
NER using the PySysrev Gene Annotations dataset: NER with PySysrev dataset.ipynb

Reference

XML Tutorial : w3schools.com
spaCy : Training spaCy’s Statistical Models
GitHub : BioC-JSON, Spacy-ner-annotator