The goal of our project is to understand the impact of social media posts on the future prices of individual stocks. We examined the impact of posts on reddit.com, within investment-focused subreddits, using textual analysis techniques.
The files found here can be used to clean Reddit posts and comments, apply sentiment scores to them, and build regression models from the resulting data. We also supply the code for the interactive web application that visualizes real-time sentiment and predictions for specific stock tickers.
All analysis and data collection code is found within CODE/Analysis.
Obtain Reddit comments and posts from Google BigQuery: https://bigquery.cloud.google.com/dataset/fh-bigquery:reddit_posts?pli=1
Example queries for May 2018. Each table represents 1 month. Repeat as necessary.
SELECT * FROM [fh-bigquery:reddit_posts.2018_05] WHERE subreddit = 'wallstreetbets';
SELECT * FROM [fh-bigquery:reddit_comments.2018_05] WHERE subreddit = 'wallstreetbets';
Place the .csv files exported from BigQuery into their respective folders.
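Since each table covers one month, the queries have to be repeated per month. The following is a minimal sketch that generates them for a date range; the helper names `month_tables` and `build_queries` are hypothetical and not part of this repository. It emits the legacy SQL bracketed table form `[project:dataset.table]`; standard SQL would use backticks and dots instead.

```python
from datetime import date


def month_tables(start: date, end: date):
    """Yield 'YYYY_MM' table suffixes for every month from start to end, inclusive."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield f"{year}_{month:02d}"
        month += 1
        if month > 12:
            year, month = year + 1, 1


def build_queries(dataset: str, subreddit: str, start: date, end: date):
    """Return one SELECT statement per monthly table in the given dataset."""
    return [
        f"SELECT * FROM [fh-bigquery:{dataset}.{suffix}] "
        f"WHERE subreddit = '{subreddit}';"
        for suffix in month_tables(start, end)
    ]


# Print the post queries for March through May 2018.
for query in build_queries("reddit_posts", "wallstreetbets",
                           date(2018, 3, 1), date(2018, 5, 1)):
    print(query)
```

Passing "reddit_comments" as the dataset produces the matching comment queries.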
Run the following files in order.
CleanData.rmd > RedditSentiment.rmd > getStockData.rmd > Python Sentiment/vaderSentiment.py > FinalModel.Rmd > FinalModel_2.Rmd
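The run order above could be scripted. This is only a sketch under assumptions: it presumes `Rscript` with the `rmarkdown` package and a `python` executable are on the PATH, and that it is started from CODE/Analysis. The names `PIPELINE`, `command_for`, and `run_pipeline` are hypothetical. By default it only returns the commands without running them.

```python
import subprocess
from pathlib import Path

# Files in the order listed in this README.
PIPELINE = [
    "CleanData.rmd",
    "RedditSentiment.rmd",
    "getStockData.rmd",
    "Python Sentiment/vaderSentiment.py",
    "FinalModel.Rmd",
    "FinalModel_2.Rmd",
]


def command_for(script: str):
    """Build the command line for one pipeline file based on its extension."""
    suffix = Path(script).suffix.lower()
    if suffix == ".rmd":
        # Assumption: R Markdown files are knitted with rmarkdown::render.
        return ["Rscript", "-e", f"rmarkdown::render('{script}')"]
    if suffix == ".py":
        return ["python", script]
    raise ValueError(f"Unsupported file type: {script}")


def run_pipeline(scripts=PIPELINE, dry_run=True):
    """Return the commands in order; execute them only when dry_run is False."""
    commands = [command_for(s) for s in scripts]
    if not dry_run:
        for cmd in commands:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(cmd, check=True)
    return commands


for cmd in run_pipeline():
    print(cmd)
```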
- Export the comments collection (before sentiment was calculated):
    mongoexport --db liztd -c comments --out comments.csv --type csv --fields "author_flair_css_class,distinguished,ups,subreddit,body,score_hidden,archived,name,author,author_flair_text,downs,created_utc,subreddit_id,link_id,parent_id,score,retrieved_on,controversiality,gilded,id"
- Export the submissions collection (before sentiment was calculated):
    mongoexport --db liztd -c submissions --out submissions.csv --type csv --fields "created_utc,subreddit,author,domain,url,num_comments,score,ups,downs,title,selftext,saved,id,from_kind,gilded,from,stickied,retrieved_on,over_18,thumbnail,subreddit_id,hide_score,link_flair_css_class,author_flair_css_class,archived,is_self,from_id,permalink,name,author_flair_text,quarantine,link_flair_text,distinguished"
- Export the aggregated sentiments collection:
    mongoexport --db liztd -c reddit_sentiments --out sentiments.csv --type csv --fields "date,ticker,sumCompound,count,close,pct2,pred"
- Import into the local database:
    mongoimport --host mongodb://dvafinalproject-anotq.mongodb.net/liztd -c reddit_submissions --type csv --headerline --file submissions_with_sentiments.csv
    mongoimport --db liztd -c reddit_comments --type csv --headerline --file comments_with_sentiments.csv
- Import to the cloud.mongodb.com shard:
    mongoimport --host dvafinalproject-shard-0/dvafinalproject-shard-00-00-anotq.mongodb.net:27017,dvafinalproject-shard-00-01-anotq.mongodb.net:27017,dvafinalproject-shard-00-02-anotq.mongodb.net:27017 --ssl --username arvnan52 --password --authenticationDatabase admin --db liztd --collection reddit_submissions --type csv --file submissions_with_sentiments.csv --headerline
    mongoimport --host dvafinalproject-shard-0/dvafinalproject-shard-00-00-anotq.mongodb.net:27017,dvafinalproject-shard-00-01-anotq.mongodb.net:27017,dvafinalproject-shard-00-02-anotq.mongodb.net:27017 --ssl --username arvnan52 --password --authenticationDatabase admin --db liztd --collection reddit_comments --type csv --file comments_with_sentiments.csv --headerline
    mongoimport --host dvafinalproject-shard-0/dvafinalproject-shard-00-00-anotq.mongodb.net:27017,dvafinalproject-shard-00-01-anotq.mongodb.net:27017,dvafinalproject-shard-00-02-anotq.mongodb.net:27017 --ssl --username arvnan52 --password --authenticationDatabase admin --db liztd --collection sentiments --type csv --file sentiments.csv --headerline
- Connect to the cloud.mongodb.com cluster (arvnan52/hp..)
- The connection is enabled from only two IPs: (a) my laptop and (b) the DigitalOcean server.
mongo "mongodb+srv://dvafinalproject-anotq.mongodb.net/liztd" --username --password
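The three cloud import commands differ only in collection and file name, so they could be generated rather than typed out. A minimal sketch follows; `mongoimport_args` and `all_import_commands` are hypothetical helpers, and the username is passed in by the caller. As an assumption about how the password is supplied, `--password` is left off entirely so that mongoimport prompts for it.

```python
SHARD_HOST = (
    "dvafinalproject-shard-0/"
    "dvafinalproject-shard-00-00-anotq.mongodb.net:27017,"
    "dvafinalproject-shard-00-01-anotq.mongodb.net:27017,"
    "dvafinalproject-shard-00-02-anotq.mongodb.net:27017"
)

# Collection name -> CSV file, as used in the import commands in this README.
IMPORTS = {
    "reddit_submissions": "submissions_with_sentiments.csv",
    "reddit_comments": "comments_with_sentiments.csv",
    "sentiments": "sentiments.csv",
}


def mongoimport_args(collection, csv_file, username):
    """Build the mongoimport argument list for one collection."""
    return [
        "mongoimport",
        "--host", SHARD_HOST,
        "--ssl",
        "--username", username,
        # No --password here: the tool is expected to prompt for it.
        "--authenticationDatabase", "admin",
        "--db", "liztd",
        "--collection", collection,
        "--type", "csv",
        "--file", csv_file,
        "--headerline",
    ]


def all_import_commands(username):
    """Return one argument list per collection in IMPORTS."""
    return [mongoimport_args(c, f, username) for c, f in IMPORTS.items()]
```

Each returned list can be handed to `subprocess.run(args, check=True)` on a machine whose IP is allowed to connect.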
a. reddit_submissions
Indexes: db.reddit_submissions.createIndex({title: "text", selftext: "text", id: 1, created_utc: 1})
b. reddit_comments
Indexes: db.reddit_comments.createIndex({parent_id: 'text', body: 'text', created_utc: 1, id: 1})
c. sentiments
This collection aggregates stock prices with the Reddit sentiment analysis and the final prediction.
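To illustrate the shape of the sentiments collection, here is a hedged sketch of how per-post scores might be rolled up into the `sumCompound` and `count` fields per date and ticker. It assumes `sumCompound` is the sum of VADER compound scores and `count` is the number of scored items; the input key `compound` and the function name `aggregate_sentiment` are hypothetical, and the project's actual aggregation may differ.

```python
from collections import defaultdict


def aggregate_sentiment(rows):
    """Sum compound scores and count items for each (date, ticker) pair."""
    totals = defaultdict(lambda: {"sumCompound": 0.0, "count": 0})
    for row in rows:
        key = (row["date"], row["ticker"])
        totals[key]["sumCompound"] += row["compound"]
        totals[key]["count"] += 1
    # Emit one record per pair, sorted by date and then ticker.
    return [
        {"date": day, "ticker": ticker, **values}
        for (day, ticker), values in sorted(totals.items())
    ]


example = [
    {"date": "2018-05-01", "ticker": "AAPL", "compound": 0.5},
    {"date": "2018-05-01", "ticker": "AAPL", "compound": 0.25},
    {"date": "2018-05-01", "ticker": "TSLA", "compound": -0.5},
]
for record in aggregate_sentiment(example):
    print(record)
```

The remaining fields of the collection (`close`, `pct2`, `pred`) come from the stock data and the model and are not covered by this sketch.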
hostname: ubuntu-s-1vcpu-1gb-nyc1-01: 159.89.232.113
liztd.com
The following functionalities were hosted on a single Ubuntu server provided by DigitalOcean.
The Python script under CODE/liztd_python_load is set up as a cron job that runs every night.
Live data collection is handled by the Python script inside the CODE/liztd_python_stream folder. This script keeps an open connection to the Reddit 'wallstreetbets' stream and uploads incoming items into the MongoDB database.
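The streaming upload described above might look roughly like the sketch below. It is not the script from CODE/liztd_python_stream: the function names are hypothetical, the comment objects are assumed to expose `id`, `body`, and `created_utc` attributes, and the collection is assumed to offer an `insert_one` method as pymongo collections do.

```python
from typing import Any, Dict, Iterable, Optional


def comment_to_document(comment: Any) -> Dict[str, Any]:
    """Copy a few fields from a comment object into a plain dict."""
    return {
        "id": comment.id,
        "body": comment.body,
        "created_utc": comment.created_utc,
        "subreddit": "wallstreetbets",
    }


def upload_stream(comments: Iterable[Any], collection: Any,
                  limit: Optional[int] = None) -> int:
    """Insert each streamed comment into the collection; return how many were stored."""
    stored = 0
    for comment in comments:
        collection.insert_one(comment_to_document(comment))
        stored += 1
        # The optional limit makes it possible to stop an otherwise endless stream.
        if limit is not None and stored >= limit:
            break
    return stored
```

If the stream came from PRAW, the first argument could be `reddit.subreddit("wallstreetbets").stream.comments()`; whether the original script uses PRAW is an assumption.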
PM2 - PM2 is a process management tool that is set up to keep the data collection jobs running at their scheduled times.
API:
The Python web server, based on bottlepy, serves as the API between the database and the frontend. The project is located in CODE/liztd_python_api.
UI:
The UI is built with ReactJS, using the evergreen library for UI components and Recharts for charting components. The scripts necessary to run the web UI are described in the CODE/liztd_ui/readme.md file.