Home
Kiri Wagstaff edited this page Mar 25, 2022 · 6 revisions
The MTE system has three components:
- The MTE database (stored as an SQLite database following the MTE Database Schema)
- The MTE ingestion pipeline (which populates the database)
- The MTE user interface (website) - used internally at JPL; not part of this repository
The pipeline takes in a PDF file and applies these steps:
- Convert the PDF to text (using Tika)
- Obtain document information such as title and authors (look up via the ADS API, falling back to extraction from the text content using Grobid)
- Extract (recognize) named entities, including Targets, Elements, Minerals, and Properties
  - Custom CoreNLP NER model
  - Custom Python CRFSuite model
- Extract relations between entities (e.g., "contains") (using jSRE)
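The flow of the steps above can be sketched roughly as follows. This is only an illustrative outline, not the repository's actual code: every name here (`run_pipeline`, `get_metadata`, `ParsedDocument`, `Entity`, `Relation`) is hypothetical, and the individual tools (Tika, ADS, Grobid, CoreNLP, CRFSuite, jSRE) are represented by plain callables that the caller supplies. The sketch also assumes that the outputs of all supplied entity recognizers are simply concatenated; this page does not say how the two NER models are combined.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Entity:
    text: str
    label: str  # e.g. "Target", "Element", "Mineral", "Property"


@dataclass
class Relation:
    label: str  # e.g. "contains"
    source: Entity
    target: Entity


@dataclass
class ParsedDocument:
    text: str
    metadata: dict
    entities: List[Entity] = field(default_factory=list)
    relations: List[Relation] = field(default_factory=list)


def get_metadata(text: str,
                 primary_lookup: Callable[[str], dict],
                 fallback_extract: Callable[[str], dict]) -> dict:
    """Try the primary lookup; use the fallback if it fails or finds nothing."""
    try:
        metadata = primary_lookup(text)
    except Exception:
        metadata = None
    return metadata if metadata else fallback_extract(text)


def run_pipeline(pdf_path: str,
                 to_text: Callable[[str], str],
                 primary_lookup: Callable[[str], dict],
                 fallback_extract: Callable[[str], dict],
                 recognizers: List[Callable[[str], List[Entity]]],
                 extract_relations: Callable[[str, List[Entity]], List[Relation]],
                 ) -> ParsedDocument:
    # Step 1: PDF -> text (stand-in for the Tika step)
    text = to_text(pdf_path)
    # Step 2: document information (stand-in for ADS lookup, then Grobid)
    metadata = get_metadata(text, primary_lookup, fallback_extract)
    # Step 3: named entities (assumption: concatenate all recognizer outputs)
    entities: List[Entity] = []
    for recognize in recognizers:
        entities.extend(recognize(text))
    # Step 4: relations between entities (stand-in for the jSRE step)
    relations = extract_relations(text, entities)
    return ParsedDocument(text, metadata, entities, relations)
```

Passing the tools in as callables keeps the sketch independent of any particular library API; in a real run, each callable would wrap the corresponding tool, and the resulting `ParsedDocument` would be written to the SQLite database.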
- To generate an MTE database: see detailed instructions
- MTE Wishlist - ideas for future improvements
- Update CoreNLP NER and restart the CoreNLP server