An application that crawls a text corpus of Rotten Tomatoes movie reviews, serves as a search engine for querying that corpus, and performs text classification and clustering.
This repo is structured into four main folders:
- TomatoCrawler
- TomatoClassifier
- TomatoSearch
- OkTomato
TomatoCrawler is the crawling module, implemented in Node.js.
To install the dependencies,
$ npm install
To run the crawler,
$ node TomatoCrawler/main.js
For TomatoClassifier, first install the following dependencies manually, because the installation process is not consistent across platforms:
- Install Matplotlib
- Install SciPy
- Install NumPy
- Install scikit-learn
To run the classifier,
$ python3 main.py
It tries several classifiers and reports the precision of each. The parameters for each classifier can be tweaked in main.py.
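As a rough illustration of what "reports the precision of each" means, the sketch below compares two stand-in classifiers on a handful of labeled reviews. It is not the code in main.py: the helper names (`precision`, `compare`), the two toy classifiers, and the sample reviews are all hypothetical and exist only to show the metric, which is true positives divided by all predicted positives.

```python
from typing import Callable, Dict, Sequence

# A classifier here is any function mapping review text to a label
# (1 = positive review, 0 = negative review). Hypothetical interface.
Classifier = Callable[[str], int]


def precision(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """Return TP / (TP + FP) for the `positive` label, or 0.0 if nothing was predicted positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    return tp / (tp + fp) if (tp + fp) else 0.0


def compare(classifiers: Dict[str, Classifier],
            texts: Sequence[str],
            labels: Sequence[int]) -> Dict[str, float]:
    """Run every classifier over the same texts and collect its precision."""
    return {
        name: precision(labels, [clf(text) for text in texts])
        for name, clf in classifiers.items()
    }


# Toy stand-ins for real models (assumptions, for illustration only).
def always_positive(text: str) -> int:
    return 1


def keyword_rule(text: str) -> int:
    negative_words = ("dull", "boring", "mess")
    return 0 if any(word in text.lower() for word in negative_words) else 1


# Invented sample reviews, not taken from the crawled corpus.
texts = [
    "A moving, beautifully shot film.",
    "Dull and overlong.",
    "A boring mess.",
    "Sharp writing and great performances.",
]
labels = [1, 0, 0, 1]

scores = compare(
    {"always_positive": always_positive, "keyword_rule": keyword_rule},
    texts,
    labels,
)
for name, score in scores.items():
    print(f"{name}: precision = {score:.2f}")
```

With real models, the same loop shape applies: fit each classifier on the training split, predict on a held-out split, and compare the resulting precision values side by side.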
To label all the data using the classifier,
$ python3 label_data.py
TomatoSearch has two folders, config and website, which contain the code for indexing and for the website respectively.
Instructions for each can be found in the respective folder.
The OkTomato folder is mainly used to download the entities from Elasticsearch and upload them to Wit.ai.
In the OkTomato directory:
- To download the entities, run
$ python data/populate_data.py
- To upload to Wit.ai, run
$ python upload_entities.py