junyi/TomatoEngine

TomatoEngine

An application that crawls a text corpus of Rotten Tomatoes movie reviews, serves as a search engine for querying that corpus, and performs text classification and clustering.

This repo is structured into four main folders:

  • TomatoCrawler
  • TomatoClassifier
  • TomatoSearch
  • OkTomato

TomatoCrawler

A crawling module implemented in Node.js.

To install the dependencies,

$ npm install

To run the crawler,

$ node TomatoCrawler/main.js

TomatoClassifier

First, install the following dependencies manually, because the installation process is not consistent across platforms:

  1. Install Matplotlib
  2. Install SciPy
  3. Install NumPy
  4. Install scikit-learn
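
Because these packages are installed by hand, it can help to confirm that all four are importable before running the classifier. The following is a minimal sketch, not part of the repository; the helper name `missing_modules` is hypothetical, and the import names assume the standard ones (scikit-learn imports as `sklearn`).

```python
import importlib.util

# Import names for the four packages listed above.
# Note that scikit-learn is imported as "sklearn".
REQUIRED = ["matplotlib", "scipy", "numpy", "sklearn"]


def missing_modules(names):
    """Return the module names that cannot be found by this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All classifier dependencies are importable.")
```

Run it with the same `python3` interpreter that will run `main.py`, since packages installed for one interpreter may not be visible to another.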

To run the classifier,

$ python3 main.py

It tries several classifiers and reports the precision of each. The parameters for each classifier can be tweaked in main.py.
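
To illustrate what comparing classifiers by precision means, here is a small self-contained sketch. It is not the code in main.py: the two rule-based "classifiers", the `fresh`/`rotten` labels, and the four sample reviews are all made up for illustration. With scikit-learn installed, `sklearn.metrics.precision_score` computes the same quantity.

```python
def precision(y_true, y_pred, positive="fresh"):
    """Precision = true positives / all predicted positives."""
    # True labels of the items that were predicted positive.
    predicted_pos = [t for t, p in zip(y_true, y_pred) if p == positive]
    if not predicted_pos:
        return 0.0  # no positive predictions; avoid dividing by zero
    true_pos = sum(1 for t in predicted_pos if t == positive)
    return true_pos / len(predicted_pos)


# Hypothetical stand-ins for real classifiers: each maps review text to a label.
def always_fresh(text):
    return "fresh"


def keyword_rule(text):
    return "rotten" if "boring" in text.lower() else "fresh"


CLASSIFIERS = {"always_fresh": always_fresh, "keyword_rule": keyword_rule}

# Tiny made-up evaluation set (text, true label), for illustration only.
REVIEWS = [
    ("A sharp, funny script.", "fresh"),
    ("Boring and far too long.", "rotten"),
    ("Beautifully shot.", "fresh"),
    ("A lazy sequel.", "rotten"),
]


def compare(classifiers, reviews):
    """Return {classifier name: precision} over the labeled reviews."""
    texts = [text for text, _ in reviews]
    labels = [label for _, label in reviews]
    return {
        name: precision(labels, [clf(text) for text in texts])
        for name, clf in classifiers.items()
    }


if __name__ == "__main__":
    for name, score in compare(CLASSIFIERS, REVIEWS).items():
        print(f"{name}: precision = {score:.2f}")
```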

To label all the data using the classifier,

$ python3 label_data.py
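
Conceptually, labeling the data means applying a trained classifier to every review and storing the prediction alongside it. The sketch below shows that idea only; it is not the contents of label_data.py, and the record fields (`id`, `text`, `label`) and the toy classifier are assumptions made for illustration.

```python
import json


def label_reviews(reviews, classify):
    """Return copies of the review dicts with a predicted 'label' field added."""
    return [{**review, "label": classify(review["text"])} for review in reviews]


def toy_classifier(text):
    # Hypothetical placeholder for a trained model's predict step.
    return "rotten" if "boring" in text.lower() else "fresh"


if __name__ == "__main__":
    sample = [
        {"id": 1, "text": "So boring I left early."},
        {"id": 2, "text": "A delight from start to finish."},
    ]
    print(json.dumps(label_reviews(sample, toy_classifier), indent=2))
```

The input records are copied rather than modified, so the unlabeled data stays intact if labeling needs to be rerun with a different classifier.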

TomatoSearch

There are two folders, config and website, which contain the code for indexing and for the website respectively. The instructions can be found as follows:

OkTomato

This folder is mainly used to download the entities from Elasticsearch and upload them to Wit.ai.

In the OkTomato directory:

  • To download the entities, run
$ python data/populate_data.py
  • To upload to Wit.ai, run
$ python upload_entities.py
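
The download step amounts to pulling distinct field values out of Elasticsearch search results so they can be sent on as entities. The sketch below is not the repository's code. It relies on the standard Elasticsearch search response layout (`hits.hits[]._source`); the field name `movie_title` and the payload shape returned by `build_entity` are assumptions, and the real request format should be checked against the Wit.ai HTTP API documentation.

```python
def extract_values(search_response, field):
    """Collect distinct, non-empty values of `field` from Elasticsearch hits."""
    seen = []
    for hit in search_response.get("hits", {}).get("hits", []):
        value = hit.get("_source", {}).get(field)
        if value and value not in seen:
            seen.append(value)  # keep first-seen order, skip duplicates
    return seen


def build_entity(name, values):
    # Hypothetical payload shape for illustration only; this is NOT a
    # confirmed Wit.ai request format.
    return {"name": name, "values": [{"value": v} for v in values]}


if __name__ == "__main__":
    # Made-up response in the standard Elasticsearch search layout.
    response = {
        "hits": {
            "hits": [
                {"_source": {"movie_title": "Inception"}},
                {"_source": {"movie_title": "Up"}},
                {"_source": {"movie_title": "Inception"}},
            ]
        }
    }
    titles = extract_values(response, "movie_title")
    print(build_entity("movie_title", titles))
```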
