
Celery-based worker to add quotes to a DB of story text via external Stanford CoreNLP server


dataculturegroup/Quote-Annotator

 
 


Media Cloud Story Quote Extractor

A helper that extracts quotes from a database of Media Cloud stories. It starts with a Mongo database full of stories, where each document in the database is a story that has a story_text property.

Requirements:

  • Python 3 - we use pyenv to manage different versions
  • Stanford CoreNLP Server - you need to be running a copy of the Stanford CoreNLP Server (here is my fork of the Docker install, with some tweaks for the annotators we use for quote extraction)
  • Redis - we use this via celery as a queue for parallel processing
  • Mongo - this holds the story information
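
To make the CoreNLP requirement concrete, here is a minimal sketch of how a client might ask the server for quotes over HTTP. It is not taken from this repository: the server address, the annotator list, and the function names are assumptions, and the `quotes`, `text`, and `speaker` fields are what CoreNLP's JSON output is expected to contain when the quote annotator runs.

```python
import json
from urllib import parse, request

# Assumed address; the CoreNLP server listens on port 9000 by default.
CORENLP_URL = "http://localhost:9000"

# Assumed pipeline; the annotators configured in your server may differ.
ANNOTATORS = "tokenize,ssplit,pos,lemma,ner,depparse,coref,quote"


def build_request(text, base_url=CORENLP_URL):
    """Build a POST request that carries the story text as the body."""
    props = {"annotators": ANNOTATORS, "outputFormat": "json"}
    query = parse.urlencode({"properties": json.dumps(props)})
    return request.Request(
        base_url + "/?" + query, data=text.encode("utf-8"), method="POST"
    )


def extract_quotes(annotation):
    """Reduce a parsed CoreNLP JSON response to quote text and speaker."""
    return [
        {"text": q.get("text"), "speaker": q.get("speaker")}
        for q in annotation.get("quotes", [])
    ]


def annotate(text, base_url=CORENLP_URL, timeout=60):
    """Send text to the server and return its quotes (needs a live server)."""
    with request.urlopen(build_request(text, base_url), timeout=timeout) as resp:
        return extract_quotes(json.loads(resp.read().decode("utf-8")))
```

Keeping `build_request` and `extract_quotes` separate from the network call means both can be checked without a running server.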

Dev Installation

Install the dependencies: pip install -r requirements.txt

Configuration

Copy the .env.template to .env and then edit it.
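
The keys expected in .env are listed in .env.template and are not reproduced here. As a rough illustration of the file format only, the sketch below parses KEY=VALUE lines with the standard library. The key names `MONGO_DSN` and `BROKER_URL` are hypothetical placeholders, and the project itself may load the file with a dedicated library instead.

```python
def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        settings[key.strip()] = value.strip().strip('"').strip("'")
    return settings


# Hypothetical keys, not copied from .env.template.
SAMPLE = """
# connection settings
MONGO_DSN=mongodb://localhost:27017/
BROKER_URL="redis://localhost:6379/0"
"""

settings = parse_env(SAMPLE)
```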

Use

Open one terminal window and start the workers: celery -A quoteworker worker -l info. Watch the log to see whether stories are being processed.

In another window, start filling the queue with python queue-stories-from-db.py.
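
The queueing step can be pictured with the sketch below. It is not the contents of queue-stories-from-db.py: the function name, the `enqueue` callback, and the `batch_limit` parameter are assumptions for illustration. Only the story_text property comes from the description above, and `_id` is the standard Mongo document key.

```python
def queue_stories(stories, enqueue, batch_limit=None):
    """Hand every story that has text to `enqueue`; return how many were queued."""
    queued = 0
    for story in stories:
        text = story.get("story_text")
        if not text:
            # Nothing to annotate for this document.
            continue
        if batch_limit is not None and queued >= batch_limit:
            break
        enqueue(story["_id"], text)
        queued += 1
    return queued


# Stand-in data and callback so the sketch runs without Mongo or Redis.
example_stories = [
    {"_id": 1, "story_text": 'She said "hello".'},
    {"_id": 2},
    {"_id": 3, "story_text": 'He replied "goodbye".'},
]
sent = []
count = queue_stories(example_stories, lambda sid, text: sent.append(sid))
```

In a real setup, `stories` could be a Mongo cursor and `enqueue` could wrap a Celery task's `.delay(...)` call, so each story becomes one job on the Redis-backed queue.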

Notes

  • To empty your queue of jobs, run redis-cli FLUSHALL. Note that this deletes every key in every database on that Redis server, not just the queue.
  • Run a few quick sanity tests to make sure you are connected to the NLP server: test.sh
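
If you want a quick reachability check from Python before running anything heavier, a sketch like the one below may help. It is not what test.sh does. It only confirms that something accepts TCP connections at the given address, and the host and port in the usage comment are assumptions.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example (assumed address of a local CoreNLP server):
# can_connect("localhost", 9000)
```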

