Home

Kiri Wagstaff edited this page Mar 25, 2022 · 6 revisions

Welcome to the MTE wiki!

The MTE system has three components:

- The MTE database (stored as an SQLite database following the MTE Database Schema)
- The MTE ingestion pipeline (which populates the database)
- The MTE user interface (website) - used internally at JPL; not part of this repository
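
Because the database is a plain SQLite file, it can be inspected with any SQLite client. The following is a minimal sketch using Python's standard `sqlite3` module. The helper name `list_tables` is hypothetical, and no table names are assumed here, since those are defined by the MTE Database Schema.

```python
import sqlite3


def list_tables(db_path):
    """Return the names of all tables in the SQLite file at db_path.

    Note: sqlite3.connect() creates an empty file if db_path does not exist,
    so pass the path of an existing MTE database.
    """
    conn = sqlite3.connect(db_path)
    try:
        # sqlite_master is SQLite's built-in catalog of tables, indexes, etc.
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

Calling `list_tables` with the path to a generated database file should return the table names defined by the schema, which is a quick way to confirm that ingestion produced a usable database.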

Ingestion Pipeline

The pipeline takes a PDF file as input and applies these steps:
- Convert PDF to text (using Tika)
- Obtain document information such as title, authors, etc. (looked up via the ADS API, falling back to extraction from the text content using Grobid)
- Extract (recognize) named entities, including Targets, Elements, Minerals, and Properties
  - Custom CoreNLP NER model
  - Custom Python CRFSuite model
- Extract relations between entities (e.g., "contains") (using jSRE)
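
The order of the steps above, including the metadata fallback, can be sketched as a small driver function. This is an illustration only and not the repository's actual code: `Document`, `run_pipeline`, and every stage parameter are hypothetical names, and each stage is passed in as a callable so that the real tools (Tika, the ADS API, Grobid, CoreNLP or CRFSuite, jSRE) could be wrapped behind them. What the ADS lookup takes as input is also an assumption; here it receives the PDF path.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Document:
    """Accumulates the output of each pipeline stage for one PDF."""
    pdf_path: str
    text: str = ""
    metadata: Dict[str, str] = field(default_factory=dict)
    # (surface text, entity type), e.g. ("iron", "Element")
    entities: List[Tuple[str, str]] = field(default_factory=list)
    # (first entity, relation name, second entity)
    relations: List[Tuple[str, str, str]] = field(default_factory=list)


def run_pipeline(pdf_path, to_text, lookup_metadata, extract_metadata,
                 find_entities, find_relations):
    """Run the stages in order; all stage callables are hypothetical wrappers."""
    doc = Document(pdf_path)
    # Step 1: convert the PDF to text (Tika in the real pipeline).
    doc.text = to_text(pdf_path)
    # Step 2: try the lookup first (ADS); if it returns nothing,
    # fall back to extracting metadata from the text (Grobid).
    doc.metadata = lookup_metadata(pdf_path) or extract_metadata(doc.text)
    # Step 3: recognize named entities (CoreNLP or CRFSuite model).
    doc.entities = find_entities(doc.text)
    # Step 4: extract relations between the recognized entities (jSRE).
    doc.relations = find_relations(doc.text, doc.entities)
    return doc
```

Passing the stages as arguments keeps the ordering logic testable without any of the external services running; stub functions can stand in for each tool.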

To generate an MTE database: see detailed instructions

MTE Wishlist - ideas for future improvements

Howto

Update CoreNLP NER and restart the CoreNLP server