# -*- coding: utf-8 -*-
"""CISC 684 Final Project.ipynb
Automatically generated by Colaboratory.
Original file is located at
https://colab.research.google.com/drive/1ZdAk8aKH1fH8qZ2V8mwlKAyzQ4UgwXl7
# Premise
**Background**
---
The stock market is a platform through which securities (also referred to as shares, stocks, equity, etc.) are traded daily. When an individual or corporation buys a share of a stock, they are purchasing a fraction of ownership of the publicly traded company the stock belongs to.
The price of a single share changes throughout the business hours of the stock exchange it is traded on, normally 9:30 am to 4 pm Eastern Time, Monday to Friday, on the major United States exchanges. During these hours, the market is *open*, and securities can be bought and sold freely. When the market closes, no securities can be traded. The price of a share can fluctuate between the close of the market and the next time it opens; news or earnings reports about a company often arrive after market hours and can have a major impact on the value of a stock. Investors are unable to trade shares of any company until the market opens, by which time the opening share price will already reflect the new information revealed since the previous day's close. Some institutions are given the ability to trade pre- and post-hours, but not the typical investor.
There are many economic metrics used to measure the value of a security. There are *momentum* indicators, which measure whether the price of a share is rising or falling, how quickly, and for how long. There are *value* metrics, which compare the *actual* price of a security with a mathematical calculation of what the price *should* be. This calculation varies for every analyst, though some common metrics take into account a company's revenue, profit, cash flow, liabilities, etc.
**Goal**
---
The goal of this project is to create an algorithm that increases the profit realized by investors.
**Methods**
---
This goal will be achieved by using two models, a *Logistic Regression* and a *Neural Network*.
The data was scraped from the internet for individual stocks. This includes daily price and trading data, as well as earnings data for every quarter since 1995.
The scraped data was manipulated to calculate a variety of value- and momentum-based metrics.
For each trial, a testing date is given. The training data includes a 1 or 2 year period (chosen as a hyperparameter of the model) occurring at least *n* days before the testing date. The test data comprises the 6 month period following the testing date.
*n* corresponds to the classification of an instance. This data has a binary classification defined by whether or not the price of the security increased *n* business days after the current date. It is for this reason that there must be a gap of *n* days between the training and test data. This value is also selected as a hyperparameter of the model. Predicting the price of a stock 1 day into the future (which would practically yield the highest returns) is nearly impossible. It is unrealistic to do so given the feature space used in this project, as no geopolitical or media information was gleaned.
*Note: This model does not take dividend payouts into account, and all stock prices are split-adjusted.*
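
A minimal sketch of the labeling rule above, on a made-up price series. The helper name `label_instances` and the toy prices are illustrative assumptions, not part of the project code.

```python
import pandas as pd

def label_instances(close, n):
    # True when the close n rows (business days) ahead is higher than today's close.
    future = close.shift(-n)
    labels = future > close
    # The last n rows have no future price yet, so they cannot be labeled.
    return labels[future.notna()]

close = pd.Series([10.0, 11.0, 9.0, 12.0, 8.0, 13.0])
labels = label_instances(close, n=2)
print(labels.tolist())
```

The last *n* rows have no future price to compare against, which is one way to see why the training window has to end *n* business days before the testing date.
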
**Performance**
---
Economic climates can vary by month, quarter, and year. To gain a more accurate understanding of a model's performance, every 6 month period was tested from the period beginning on January 1st, 2003 to the period beginning on January 1st, 2016 (which ends on July 1st, 2016). The total returns of the model are then aggregated and compared with the total returns of the *unengaged investor*.
The *unengaged investor* is a control group to compare the model against. Many analysts try to predict the market, and end up making less money than they would have if they had done nothing. The *unengaged investor* mimics the returns that would be realized if a stock was bought on the first day of a period and kept until the final day of the period (also referred to as the *static return*). The model buys and sells a security each day depending on the predicted classification of the test data.
When an instance is classified as True (meaning the model expects the share price to increase in the future), the model buys a share of the stock. It then holds the stock until it classifies an instance as False (meaning the model expects the share price to decrease in the future). The maximum returns would be realized if the model were to buy a security the day before it increases, and sell it the day before it decreases, for every day in the test period. The inverse is also possible, which would be the worst case.
The performance of this model will not be measured by the accuracy of its predictions. Rather, the most practical measurement of this model is to compare its returns to those of the *unengaged investor*. If the model produces higher returns than an investor who did not use the model and merely held a security for the duration of the period, then utilizing the model to guide investment decisions will have been worthwhile.
*Note: Other measures of success could have also been selected. For example, one could have optimized the model to reduce volatility or minimize losses. This project, however, optimizes the model by maximizing returns without regard for volatility or short-term losses.*
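
The comparison described above can be sketched as follows. This is only an illustration under stated assumptions: the position held on day *t* is the model's prediction for day *t*, trades execute at the closing price, and the helper name `static_and_model_returns` is hypothetical. The project's own return functions may differ in detail.

```python
import numpy as np

def static_and_model_returns(prices, predictions):
    prices = np.asarray(prices, dtype=float)
    held = np.asarray(predictions, dtype=bool)
    # Unengaged investor: buy on the first day, sell on the last day.
    static_return = prices[-1] / prices[0] - 1.0
    # Growth factor from each day to the next.
    daily_growth = prices[1:] / prices[:-1]
    # The model only earns the growth from day t to t+1 if it held the share on day t.
    model_growth = np.where(held[:-1], daily_growth, 1.0)
    model_return = model_growth.prod() - 1.0
    return static_return, model_return

prices = [100.0, 110.0, 99.0, 108.9]
predictions = [True, False, True, False]
static_r, model_r = static_and_model_returns(prices, predictions)
print(static_r, model_r)
```

In this toy series the model sits out the one losing day, so its return is higher than the static return.
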
# Import packages
"""
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import time as t
import requests  # used by getTickers
from bs4 import BeautifulSoup  # used by getTickers
from pandas_datareader import data as wb
from datetime import date, timedelta
import warnings
warnings.filterwarnings('ignore')
"""# Define code for model
## Define code for pulling individual stock data
"""
sectorInfo = pd.read_csv('https://datahub.io/core/s-and-p-500-companies/r/constituents.csv').set_index('Symbol')
def getTickers(step):
    '''
    Input: integer
    Output: list (of ticker symbols)
    This function returns a list of ticker symbols. On www.stockpup.com, there are 500+ data sets.
    This function returns every step-th ticker symbol associated with the available datasets.
    '''
    url = 'http://www.stockpup.com/data/'
    site = BeautifulSoup(requests.get(url).text, "html.parser")
    files = site.findAll('a')[22:-1][::step]
    tickers = []
    for f in files:
        if f['href'][-3:] == 'csv':
            tickers.append(f['href'][6:].split('_')[0])
    return tickers

def getPriceData(ticker, daysIntoFuture):
    '''
    Input: string (ticker symbol), integer (value used in determining the classification of an instance)
    Output: pandas dataframe (historical stock trading data)
    This function creates a dataframe containing the historical trading data for a given stock.
    All trading data from January 1st, 1995 is pulled from Yahoo.
    Features calculated from this data include:
    - Ticker
    - 52 Week High
    - 52 Week Low
    - 10 Day Moving Average
    - 50 Day Moving Average
    - 200 Day Moving Average
    - 1 Day Change
    - 5 Day Change
    - 30 Day Change
    - Class
    The class feature is determined by whether or not the price of the stock increased
    n business days into the future, where n is given by the input variable daysIntoFuture.
    '''
    startDate = '1995-01-01'
    dataSource = 'yahoo'
    ticker_data = wb.DataReader(ticker, data_source = dataSource, start = startDate)
    pData = pd.DataFrame(ticker_data)[['Close', 'Volume']].copy()
    pData['Ticker'] = ticker
    # Rolling windows are measured in trading days (5 per week)
    pData['52 Week High'] = pData['Close'].rolling(window = 5*52).max()
    pData['52 Week Low'] = pData['Close'].rolling(window = 5*52).min()
    pData['10 Day Moving Average'] = pData['Close'].rolling(window = 10).mean()
    pData['50 Day Moving Average'] = pData['Close'].rolling(window = 50).mean()
    pData['200 Day Moving Average'] = pData['Close'].rolling(window = 200).mean()
    # Fractional price change over the previous 1, 5, and 30 trading days
    pData['1 Day Change'] = (pData['Close'] - pData['Close'].shift(1)) / pData['Close'].shift(1)
    pData['5 Day Change'] = (pData['Close'] - pData['Close'].shift(5)) / pData['Close'].shift(5)
    pData['30 Day Change'] = (pData['Close'] - pData['Close'].shift(30)) / pData['Close'].shift(30)
    # True when the close daysIntoFuture rows ahead exceeds today's close.
    # The final daysIntoFuture rows compare against NaN and are therefore labelled False.
    pData['Class'] = pData['Close'].shift(-daysIntoFuture) > pData['Close']
    return pData.reset_index()
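
"""A minimal sketch of the rolling-window features built in getPriceData, on a made-up closing-price series with a 3-day window instead of the 10/50/200-day windows used above. The variable names are illustrative assumptions.

```python
import pandas as pd

close = pd.Series([10.0, 12.0, 11.0, 13.0, 14.0])

# Moving average: undefined (NaN) until a full window of prices is available.
moving_avg = close.rolling(window=3).mean()

# 1-day fractional change, equivalent to (close - close.shift(1)) / close.shift(1).
one_day_change = close.pct_change(1)

# Ratio of the price to its moving average, as used for the "Moving Average Ratio" features.
ratio = close / moving_avg

print(pd.DataFrame({'close': close, 'ma3': moving_avg, 'chg1': one_day_change, 'ratio': ratio}))
```

The leading NaN rows are the warm-up period of each window; in the project they are removed by the dropna() call at the end of pullStockData.
"""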

def pullStockData(tick, daysIntoFuture = 30):
    '''
    Inputs: string (ticker symbol), integer (value used in determining the classification of an instance)
    Output: pandas dataframe (historical stock trading and earnings data)
    This function creates a dataframe containing the historical trading and earnings data for a given stock.
    All trading data from January 1st, 1995 is pulled from Yahoo and www.stockpup.com
    Features calculated from this data include:
    TRADING METRICS
    - Close (price at the close of the business day)
    - Price / 52 Week High
    - Price / 52 Week Low
    - 10 Day Moving Average Ratio
    - 50 Day Moving Average Ratio
    - 200 Day Moving Average Ratio
    - 1 Day Change
    - 5 Day Change
    - 30 Day Change
    FUNDAMENTALS/EARNINGS METRICS
    - Market / Book Ratio
    - P/E (price to earnings ratio)
    - Debt / Equity Ratio
    - Free Cash Flow Yield
    The class feature is determined by whether or not the price of the stock increased
    n business days into the future, where n is given by the input variable daysIntoFuture.
    '''
    url = 'http://www.stockpup.com/data/'
    filePath = '_quarterly_financial_data.csv'
    fData = pd.read_csv(url + tick + filePath)
    # Get fundamentals data for stock
    fData = fData[['Quarter end',
                   'Cash at end of period',
                   'Shares split adjusted',
                   'Cash from operating activities',
                   'Capital expenditures',
                   'Assets',
                   'Liabilities',
                   'EPS basic']].copy()
    fData['Ticker'] = tick
    fData['Cash from operating activities'] = pd.to_numeric(fData['Cash from operating activities'], errors = 'coerce')
    fData['EPS basic'] = pd.to_numeric(fData['EPS basic'], errors = 'coerce')
    fData['Quarter end'] = pd.to_datetime(fData['Quarter end'])
    fData = fData.reset_index().drop(columns = ['index']).sort_values(by = 'Quarter end', ascending = True)
    # Get price data for stock
    pData = getPriceData(tick, daysIntoFuture)
    # Merge price and fundamentals data and build attributes.
    # Each trading day is matched to the most recent quarter that ended strictly before it.
    stockData = pd.merge_asof(pData, fData, left_on = 'Date', right_on = 'Quarter end', by = 'Ticker', direction = 'backward', allow_exact_matches = False)
    try:
        stockData['Sector'] = sectorInfo['Sector'][tick]
    except KeyError:
        # Ticker is not listed in the S&P 500 constituents file
        stockData['Sector'] = 'Unknown'
    bookValue = stockData['Assets'] - stockData['Liabilities']
    marketCap = stockData['Close'] * stockData['Shares split adjusted']
    stockData['Market / Book Ratio'] = marketCap / bookValue
    stockData['P/E'] = stockData['Close'] / stockData['EPS basic']
    stockData['Debt / Equity Ratio'] = stockData['Liabilities'] / bookValue
    stockData['Free Cash Flow Yield'] = (stockData['Cash from operating activities'] - stockData['Capital expenditures']) / (marketCap + stockData['Liabilities'] - stockData['Cash at end of period'])
    stockData['Price / 52 Week High'] = stockData['Close'] / stockData['52 Week High']
    stockData['Price / 52 Week Low'] = stockData['Close'] / stockData['52 Week Low']
    stockData['10 Day Moving Average Ratio'] = stockData['Close'] / stockData['10 Day Moving Average']
    stockData['50 Day Moving Average Ratio'] = stockData['Close'] / stockData['50 Day Moving Average']
    stockData['200 Day Moving Average Ratio'] = stockData['Close'] / stockData['200 Day Moving Average']
    stockData = stockData[['Ticker',
                           'Sector',
                           'Date',
                           'Close',
                           'Price / 52 Week High',
                           'Price / 52 Week Low',
                           '10 Day Moving Average Ratio',
                           '50 Day Moving Average Ratio',
                           '200 Day Moving Average Ratio',
                           '1 Day Change',
                           '5 Day Change',
                           '30 Day Change',
                           'Volume',
                           'Market / Book Ratio',
                           'P/E',
                           'Debt / Equity Ratio',
                           'Free Cash Flow Yield',
                           'Class']]
    return stockData.dropna()
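
"""A minimal sketch of how the merge_asof call in pullStockData attaches quarterly fundamentals to daily prices. The ticker 'XYZ' and all numbers are made up for illustration.

```python
import pandas as pd

prices = pd.DataFrame({
    'Date': pd.to_datetime(['2003-03-31', '2003-04-01', '2003-07-15']),
    'Ticker': 'XYZ',
    'Close': [10.0, 10.5, 12.0],
})
fundamentals = pd.DataFrame({
    'Quarter end': pd.to_datetime(['2002-12-31', '2003-03-31', '2003-06-30']),
    'Ticker': 'XYZ',
    'EPS basic': [0.40, 0.50, 0.60],
})

# direction='backward' picks the latest quarter at or before each trading day;
# allow_exact_matches=False excludes a quarter that ends on the trading day itself.
merged = pd.merge_asof(prices, fundamentals,
                       left_on='Date', right_on='Quarter end', by='Ticker',
                       direction='backward', allow_exact_matches=False)
print(merged[['Date', 'Quarter end', 'EPS basic']])
```

The price on 2003-03-31 is paired with the quarter ending 2002-12-31 rather than the one ending that same day, so a quarter's figures only become visible from the following trading day.
"""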
allFeatures = ['Price / 52 Week High', 'Price / 52 Week Low',
'10 Day Moving Average Ratio', '50 Day Moving Average Ratio',
'200 Day Moving Average Ratio', 'Volume',
'Market / Book Ratio', 'P/E', 'Debt / Equity Ratio',
'Free Cash Flow Yield']
"""## Code for the Logistic Regression"""
class LogisticRegression():
    def __init__(self, learningRate = 0.01, numIter = 5000, thresh = 0.5, fitIntercept = True, printUpdates = True):
        self.learningRate = learningRate
        self.numIter = numIter
        self.thresh = thresh
        self.fitIntercept = fitIntercept
        self.printUpdates = printUpdates
        self.errors = []
        self.average = .5

    def addIntercept(self, X):
        # Prepend a column of ones so that theta[0] acts as the intercept
        intercept = np.ones((X.shape[0], 1))
        return np.concatenate((intercept, X), axis = 1)

    def sigmoid(self, z):
        return 1 / (1 + np.exp(-z))

    def loss(self, Y, h):
        # Mean binary cross-entropy
        return (-Y * np.log(h) - (1 - Y)*np.log(1 - h)).mean()

    def fit(self, X, Y):
        '''
        Inputs: pandas dataframe (training instances), series (training classifications)
        Outputs: None
        This function takes in the training data and fits a logistic regression to it.
        '''
        self.trainingData = (X, Y)
        if self.fitIntercept:
            X = self.addIntercept(X)
        # Initialize weights
        self.theta = np.zeros(X.shape[1])
        for i in range(1, self.numIter + 1):
            z = np.dot(X, self.theta)
            h = self.sigmoid(z)
            # Batch gradient descent step on the cross-entropy loss
            gradient = np.dot(X.T, (h - Y)) / Y.size
            self.theta -= self.learningRate * gradient
            self.errors.append(self.loss(Y, h))
            if (i % 100 == 0 and self.printUpdates):
                print('Iteration %d Error: %.4f' % (i, self.errors[-1]))
        #self.threshold = sum(Y) / len(Y)

    def predictProb(self, X):
        '''
        Inputs: pandas dataframe (testing instances)
        Output: array (of floats)
        This function takes in a testing dataset (without classifications) and uses the trained
        model to predict the probabilities associated with each instance. The returned
        floats correspond to the probabilities that each instance is classified as True.
        '''
        if X.shape[1] < self.theta.shape[0]:
            X = self.addIntercept(X)
        return self.sigmoid(np.dot(X, self.theta))

    def predict(self, X):
        '''
        Inputs: pandas dataframe (testing instances)
        Output: array (of binary classifications)
        This function takes in a testing dataset (without classifications) and uses the trained
        model to predict the probabilities associated with each instance. The returned
        binary values correspond to the binary classifications predicted by the model.
        '''
        return self.predictProb(X) > self.thresh

    def printThresholds(self, cols):
        print('These are the weights for each column for this model:')
        print('%30.30s: %.2f' % ('Intercept', self.theta[0]))
        for i in range(1, len(self.theta)):
            print('%30.30s: %.2f' % (cols[i -1], self.theta[i]))
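
"""A minimal standalone sketch of the batch gradient-descent update used in LogisticRegression.fit, on a tiny made-up data set. The learning rate, iteration count, and data are illustrative assumptions, not the project's settings.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One feature, four instances; the classes are separable around x = 1.5.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

# Prepend a column of ones so that theta[0] is the intercept.
Xb = np.hstack([np.ones((X.shape[0], 1)), X])
theta = np.zeros(Xb.shape[1])
losses = []

for _ in range(2000):
    h = sigmoid(Xb @ theta)
    # Mean binary cross-entropy, recorded before the update.
    losses.append(float((-y * np.log(h) - (1 - y) * np.log(1 - h)).mean()))
    # Gradient of the loss with respect to theta, then one descent step.
    theta -= 0.5 * (Xb.T @ (h - y)) / y.size

predictions = sigmoid(Xb @ theta) > 0.5
print(theta, predictions)
```

The loss falls over the iterations and the fitted boundary separates the two classes, which is the behaviour the errors list in the class is meant to let you monitor.
"""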
"""## Code for the Neural Network"""
class ANN:
    def __init__(self, size, momentum = 0.0, nIters = 1000, lr = 0.25, thresh = 0.01, printUpdates = True):
        """
        ---
        Network constructor for multi-class classification
        ---
        size: network size as an array of number of units in input, hidden, output layers
        i.e. [100,300,10,5] indicates 100 neurons in input layer, 300 and 10 neurons in subsequent
        hidden layers and 5 neurons in output layer
        momentum: a constant, to avoid local minima
        nIters: number of iterations
        lr: learning rate
        thresh: training stops early once the error changes by less than this amount between iterations
        printUpdates: if True, the error is printed every 10 iterations
        """
        self.size = size
        self.nHidden = len(self.size) - 2 #except input and output layer
        self.momentum = momentum
        self.lr = lr
        self.thresh = thresh
        self.nIters = nIters
        self.errors = []
        self.printUpdates = printUpdates
        self.weights = {}
        self.biases = {}
        # Initialize weights and biases
        for k in range(self.nHidden + 1):
            self.weights[k+1] = np.random.randn(self.size[k], self.size[k+1]) / np.sqrt(self.size[k])
            self.biases[k+1] = np.random.rand(1,self.size[k+1])

    def feedFwd(self, inputs):
        """ Propagate the network forward """
        if not isinstance(inputs, np.ndarray):
            inputs = inputs.to_numpy()
        self.pre_activations = {} #Linear, i.e. w.x+b
        self.hidden = {} # after activation, i.e. sigmoid(w.x+b)
        self.hidden[0] = np.array([inputs])
        for k in range(self.nHidden + 1):
            self.pre_activations[k+1] = np.dot(self.hidden[k], self.weights[k+1]) + self.biases[k+1]
            self.hidden[k+1] = sigmoid(self.pre_activations[k+1])
        output = self.hidden[self.nHidden + 1]
        return output

    def backProp(self, inputs, targets):
        """ Compute the weight and bias updates for a single training instance """
        self.d_weights = {}
        self.d_biases = {}
        self.d_hidden = {}
        # Note: the gradients are reset to zero here on every call, so the
        # momentum terms below currently have no effect.
        for k in range(self.nHidden + 1):
            self.d_weights[k+1] = np.zeros((np.shape(self.weights[k+1])))
            self.d_biases[k+1] = np.zeros((np.shape(self.biases[k+1])))
        self.output = self.feedFwd(inputs)
        #error = (1/2)*np.sum((self.output - targets)**2)
        self.d_hidden[self.nHidden + 1] = (targets - self.output)*grad_sigmoid(self.output) # delta_o
        # compute deltas
        for k in range(self.nHidden + 1, 0, -1):
            self.d_hidden[k-1] = grad_sigmoid(self.hidden[k-1])*(np.dot(self.d_hidden[k],np.transpose(self.weights[k])))
            self.d_weights[k] = np.transpose(np.dot(np.transpose(self.d_hidden[k]),self.hidden[k-1])) + self.momentum*self.d_weights[k]
            self.d_biases[k] = self.d_hidden[k] + self.momentum*self.d_biases[k]

    def train(self, X, Y):
        '''
        Inputs: pandas dataframe (training instances), series (training classifications)
        Outputs: None
        This function takes in the training data and trains the neural network on it,
        updating the weights after every training instance.
        '''
        X = X.to_numpy()
        Y = Y.to_numpy()
        pred = np.empty(Y.shape)
        for i in range(self.nIters+1):
            for k, (x, y) in enumerate(zip(X,Y)):
                self.backProp(x,y)
                pred[k] = self.feedFwd(x).item()
                # weight updates
                for layer in range(self.nHidden + 1):
                    self.weights[layer+1] += self.lr*self.d_weights[layer+1]
                    self.biases[layer+1] += self.lr*self.d_biases[layer+1]
            # Signed sum of (prediction - target) over the training set
            self.errors.append(np.sum(pred-Y))
            if (i % 10 == 0 and self.printUpdates):
                print('Iteration %d Error: %.4f' % (i, self.errors[-1]))
            # Stop early once the error has stopped changing
            if len(self.errors) > 1:
                if abs(self.errors[-1] - self.errors[-2]) < self.thresh:
                    break

    def predictProb(self, X):
        '''
        Inputs: pandas dataframe (testing instances)
        Output: array (of floats)
        This function takes in a testing dataset (without classifications) and uses the trained
        model to predict the probabilities associated with each instance. The returned
        floats correspond to the probabilities that each instance is classified as True.
        '''
        return self.feedFwd(X)[0,:,0]

    def predict(self, X):
        '''
        Inputs: pandas dataframe (testing instances)
        Output: array (of binary classifications)
        This function takes in a testing dataset (without classifications) and uses the trained
        model to predict the probabilities associated with each instance. The returned
        binary values correspond to the binary classifications predicted by the model.
        '''
        return self.predictProb(X) > .5

def sigmoid(z):
    return 1/(1 + np.exp(-z))

def grad_sigmoid(z):
    # Derivative of the sigmoid expressed in terms of its output:
    # z is expected to be an activation sigmoid(x), not a pre-activation x.
    return z*(1 - z)
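
"""A small numerical check of the convention used by grad_sigmoid above: it takes the sigmoid's output (the activation), not its input. The sketch redefines both helpers so it is self-contained; the test points and step size are arbitrary choices.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def grad_sigmoid(a):
    # a is an activation, i.e. a = sigmoid(z), not the pre-activation z.
    return a * (1 - a)

z = np.array([-2.0, 0.0, 1.5])
eps = 1e-6

# Central finite-difference estimate of d(sigmoid)/dz.
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)

# Analytic derivative, evaluated on the activations.
analytic = grad_sigmoid(sigmoid(z))

print(numeric, analytic)
```

This is why backProp passes self.output and self.hidden[k-1] (activations) to grad_sigmoid rather than the stored pre-activations.
"""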
"""## Code for running the model"""
startMidDate = '01-01-2003'
finalMidDate = '01-01-2016'
def normalize(X):
    '''
    Input: Matrix or dataframe (containing only floats)
    Output: Matrix or dataframe
    This function normalizes the input matrix or dataframe along each column and across all rows.
    Each column is rescaled so that its minimum maps to 0 and its maximum maps to 1.
    '''
    mins = np.min(X, axis = 0)
    maxs = np.max(X, axis = 0)
    rng = maxs - mins
    norm_X = 1 - ((maxs - X)/rng)
    return norm_X
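
"""A minimal sketch showing that the expression used in normalize, 1 - (max - X) / (max - min), is ordinary column-wise min-max scaling, (X - min) / (max - min). The matrix is made up for illustration.

```python
import numpy as np

X = np.array([[1.0, 10.0],
              [2.0, 30.0],
              [4.0, 20.0]])

mins, maxs = X.min(axis=0), X.max(axis=0)

# Form used in normalize above.
as_written = 1 - (maxs - X) / (maxs - mins)

# Textbook min-max scaling.
min_max = (X - mins) / (maxs - mins)

print(as_written)
```

Both forms divide by the column range, so a constant column (range of zero) would produce NaN or infinite values in either case.
"""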

def runModel(ticker, midDate, trainDataYears = 2, modelType = 'LogReg', daysIntoFuture = 30, lr = 0.01, numIter = 5000, updates = True, printResults = True, cols = allFeatures, thresh = 0.5):
    '''
    Inputs: string (ticker symbol), string (midDate, of format 'mm-dd-yyyy'), **kwargs
    **kwargs:
    - trainDataYears: int
    This int corresponds to the length of the training data, in years
    - modelType: string, {'LogReg', 'NeuralNet'}
    This defines whether the model is a logistic regression or a neural network
    - daysIntoFuture: int, used to define the classification of each instance
    - lr: float, learning rate of the model
    - numIter: int, number of iterations used in fitting the model
    - updates: Boolean
    If True, updates about the model's progress will be printed when fitting it to the training data
    - printResults: Boolean
    If True, the results (accuracy, returns) of the model will be printed after testing the model
    - cols: list of strings, where each element is a feature that is to be included in the model
    - thresh: float, the cutoff between predicting True and False when predicting classifications
    Output: model, pricesTrain, predictionsTrain, accuracyTrain, pricesTest, predictionsTest, accuracyTest
    This function creates a model for a specific stock over a given training period and tests it over a 6 month period.
    '''
    np.random.seed(0)
    data = pullStockData(ticker, daysIntoFuture)
    # The training window ends before midDate and spans trainDataYears years
    startTrainData = subtractNDays(midDate, daysIntoFuture - 1)
    for i in range(trainDataYears):
        startTrainData = subtractOneYear(startTrainData)
    # Get training and test data
    trainData, testData = getTrainTest(data, midDate, startDate = startTrainData, cols = cols, daysIntoFuture = daysIntoFuture)
    # Create inputs and outputs for the training data
    X = trainData[cols]
    Y = trainData['Class']
    pricesTrain = trainData['Close']
    # Create inputs and outputs for the test data
    xTest = testData[cols]
    yTest = testData['Class']
    pricesTest = testData['Close']
    if modelType == 'LogReg':
        # Create and fit the model
        model = LogisticRegression(learningRate = lr, numIter = numIter, printUpdates = updates, thresh = thresh)
        model.fit(X, Y)
    elif modelType == 'NeuralNet':
        nHidden = [250]
        size = [X.shape[1]] + nHidden+[1] # last array [1] corresponds to output layer
        # Create a model instance
        model = ANN(size, printUpdates = updates, nIters = numIter)
        model.train(X, Y)
    else:
        raise ValueError("modelType must be 'LogReg' or 'NeuralNet'")
    # Predict the classifications of the training and test data
    predictionsTrain = model.predict(X)
    predictionsTest = model.predict(xTest)
    # Get the accuracy of the model on the training and test data
    accuracyTrain = getAccuracy(predictionsTrain, Y)
    accuracyTest = getAccuracy(predictionsTest, yTest)
    if printResults:
        print('\nThe accuracy of the model on the training data from %s to %s was %.2f%%' % (trainData.Date[0].strftime("%b %d, %Y"), trainData.Date.to_list()[-1].strftime("%b %d, %Y"), 100*accuracyTrain))
        print('The accuracy of the model on %s stock from %s to %s was %.2f%%\n' % (ticker, testData.Date[0].strftime("%b %d, %Y"), testData.Date.to_list()[-1].strftime("%b %d, %Y"), 100*accuracyTest))
        if modelType == 'LogReg':
            model.printThresholds(cols)
        printReturns(pricesTrain, predictionsTrain)
        printReturns(pricesTest, predictionsTest)
    return model, pricesTrain, predictionsTrain, accuracyTrain, pricesTest, predictionsTest, accuracyTest

def runAllPeriods(ticker, startMidDate, finalMidDate, modelType = 'LogReg', cols = allFeatures, daysIntoFuture = 30, trainDataYears = 2, printResults = False, printYears = True, numIter = 5000, thresh = 0.5):
    '''
    Inputs: string (ticker symbol), string (startMidDate, of format 'mm-dd-yyyy'), string (finalMidDate, of format 'mm-dd-yyyy'), **kwargs
    **kwargs:
    - modelType: string, {'LogReg', 'NeuralNet'}
    This defines whether each model is a logistic regression or a neural network
    - cols: list of strings, where each element is a feature that is to be included in the models
    - daysIntoFuture: int, used to define the classification of each instance
    - trainDataYears: int
    This int corresponds to the length of each training set, in years
    - printResults: Boolean
    If True, the results (accuracy, returns) of each model will be printed after testing each model
    - printYears: {True, 'some', False}
    If True, the starting and ending date of every period will be printed as it is being tested.
    If 'some', the starting and ending date of every 4th period will be printed as it is being tested.
    If False, no updates will be printed.
    - numIter: int, number of iterations used in fitting each model
    - thresh: float, the cutoff between predicting True and False when predicting classifications
    Output: list of objects (models), list of floats (staticReturns), list of floats (modelReturns), list of floats (accuracies), list of strings (midDates)
    This function creates a model for every 6 month period between the two input dates.
    Each model is trained based on training data relative to the start of its testing period.
    Each model is then tested over the 6 month period following the start of its testing period.
    '''
    midDates, models, staticReturns, modelReturns, accuracies = [], [], [], [], []
    count = 0
    midDate = startMidDate
    while yearsBetween(midDate, finalMidDate) >= 0:
        if printYears == True or (printYears == 'some' and count % 4 == 0):
            print('Testing period from %s to %s' % (pd.to_datetime(midDate).strftime("%b %d, %Y"), pd.to_datetime(addSixMonths(midDate)).strftime("%b %d, %Y")))
        midDates.append(midDate)
        if count == 0:
            midDate = addOneDay(midDate)
        model, pricesTrain, predictionsTrain, accuracyTrain, pricesTest, predictionsTest, accuracyTest = runModel(ticker, subtractOneDay(midDate), trainDataYears = trainDataYears, modelType = modelType, daysIntoFuture = daysIntoFuture, updates = False, printResults = printResults, cols = cols, numIter = numIter, thresh = thresh)
        sReturn, mReturn = getReturns(pricesTest, predictionsTest)
        models.append(model)
        staticReturns.append(sReturn)
        modelReturns.append(mReturn)
        accuracies.append(accuracyTest)
        midDate = addSixMonths(midDate) # Test data is a six month period
        count += 1
    plotReturns(midDates, staticReturns, modelReturns)
    printTotalReturns(midDates, staticReturns, modelReturns)
    printAccuracies(accuracies)
    return models, staticReturns, modelReturns, accuracies, midDates
"""## Date manipulation functions"""
def addThreeMonths(date):
    '''
    Input: string (date, of format 'mm-dd-yyyy')
    Output: string (date, of format 'mm-dd-yyyy')
    This function adds three months to a date and returns the new date as a string.
    '''
    month = int(date[:2]) + 3
    year = int(date[-4:])
    if month > 12: # Roll over into the following year
        month -= 12
        year += 1
    return '%02d' % month + date[2:-4] + '%04d' % year

def addSixMonths(date):
    '''
    Input: string (date, of format 'mm-dd-yyyy')
    Output: string (date, of format 'mm-dd-yyyy')
    This function advances a date to the next half-year boundary and returns the new date as a string.
    Dates in January map to July 1st of the same year, dates in December map to July 1st
    of the following year, and all other dates map to January 1st of the following year.
    '''
    if date[:2] in ['01', '12']:
        year = getYear(date)
        if date[:2] == '12':
            year += 1
        return '07-01-%04d' % year
    else:
        return '01-01-%04d' % (getYear(date) + 1)

def addOneYear(date):
    '''
    Input: string (date, of format 'mm-dd-yyyy')
    Output: string (date, of format 'mm-dd-yyyy')
    This function adds one year to a date and returns the new date as a string.
    '''
    # Use the full four-digit year so that century boundaries are handled correctly
    return date[:-4] + '%04d' % (int(date[-4:]) + 1)

def subtractOneDay(date):
    '''
    Input: string (date, of format 'mm-dd-yyyy')
    Output: string (date, of format 'mm-dd-yyyy')
    This function subtracts one day from the given date.
    If the new date is not a business day, it continues to subtract days until
    it reaches a business day.
    The new date is returned as a string.
    '''
    adate = pd.to_datetime(date)
    adate -= timedelta(days=1)
    while adate.weekday() > 4: # Mon-Fri are 0-4
        adate -= timedelta(days=1)
    return adate.strftime("%m-%d-%Y")

def addOneDay(date):
    '''
    Input: string (date, of format 'mm-dd-yyyy')
    Output: string (date, of format 'mm-dd-yyyy')
    This function adds one day to the given date.
    If the new date is not a business day, it continues to add days until
    it reaches a business day.
    The new date is returned as a string.
    '''
    adate = pd.to_datetime(date)
    adate += timedelta(days=1)
    while adate.weekday() > 4: # Mon-Fri are 0-4
        adate += timedelta(days=1)
    return adate.strftime("%m-%d-%Y")

def subtractSixMonths(date):
    '''
    Input: string (date, of format 'mm-dd-yyyy')
    Output: string (date, of format 'mm-dd-yyyy')
    This function subtracts six months from a date and returns the new date as a string.
    It is intended for half-year boundaries: a July date maps to January 1st of the same year,
    and any other date (normally January 1st) maps to July 1st of the previous year.
    '''
    year = int(date[-4:])
    if date[:2] == '07':
        date = '01-01-%04d' % year
    else:
        date = '07-01-%04d' % (year - 1)
    return date

def subtractOneYear(date):
    '''
    Input: string (date, of format 'mm-dd-yyyy')
    Output: string (date, of format 'mm-dd-yyyy')
    This function subtracts one year from a date and returns the new date as a string.
    '''
    # Use the full four-digit year so that years ending in '00' are handled correctly
    return date[:-4] + '%04d' % (int(date[-4:]) - 1)

def getYear(date):
'''
Input: string (date, of format 'mm-dd-yyyy')
Output: int (year)
This function returns the year of a given date.
'''
return int(date[-4:])
def getMonth(date):
'''
Input: string (date, of format 'mm-dd-yyyy')
Output: int (year)
This function returns the month of a given date.
'''
return int(date[:2])
def yearsBetween(d1, d2):
    '''
    Input: string (startDate, of format 'mm-dd-yyyy'), string (endDate, of format 'mm-dd-yyyy')
    Output: float (number of years)
    This function returns the number of years between two dates.
    Each month counts as 1/12 of a year; the day of the month is ignored.
    '''
    nYears = getYear(d2) - getYear(d1)
    nYears += (getMonth(d2) - getMonth(d1)) / 12
    return nYears
def busDaysBetween(startDate, endDate):
    '''
    Input: string (startDate, of format 'mm-dd-yyyy'), string (endDate, of format 'mm-dd-yyyy')
    Output: int (number of business days)
    This function returns the number of business days between two dates.
    The start date is included in the count and the end date is excluded.
    '''
    # Convert 'mm-dd-yyyy' strings to date objects (year, month, day)
    startDate = [int(x) for x in startDate.split('-')]
    startDate = date(startDate[2], startDate[0], startDate[1])
    endDate = [int(x) for x in endDate.split('-')]
    endDate = date(endDate[2], endDate[0], endDate[1])
    days = np.busday_count(startDate, endDate)
    return days
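"""`np.busday_count` counts business days in the half-open interval from the start date (included) to the end date (excluded). The same count can be sketched with the standard library alone; `count_weekdays` is a hypothetical name used only for illustration, and it treats Monday to Friday as business days without any holiday calendar.

```python
from datetime import date, timedelta

def count_weekdays(start, end):
    # Count Monday-Friday days in the half-open interval [start, end).
    days = 0
    current = start
    while current < end:
        if current.weekday() < 5:  # Mon-Fri are 0-4
            days += 1
        current += timedelta(days=1)
    return days

# Monday 2020-01-06 up to (but not including) Monday 2020-01-13
print(count_weekdays(date(2020, 1, 6), date(2020, 1, 13)))  # 5
```

Since the end date is excluded, passing the same date twice gives 0.
"""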
def subtractNDays(date, n):
    '''
    Input: string (date, of format 'mm-dd-yyyy'), int (number of business days to subtract)
    Output: string (date, of format 'mm-dd-yyyy')
    This function subtracts n business days from the given date.
    After each step, if the new date is not a business day, it continues to
    subtract days until it reaches a business day.
    The new date is returned as a string.
    '''
    adate = pd.to_datetime(date)
    for _ in range(n):
        adate -= timedelta(days=1)
        while adate.weekday() > 4: # Mon-Fri are 0-4
            adate -= timedelta(days=1)
    return adate.strftime("%m-%d-%Y")
def addNDays(date, n):
    '''
    Input: string (date, of format 'mm-dd-yyyy'), int (number of business days to add)
    Output: string (date, of format 'mm-dd-yyyy')
    This function adds n business days to the given date.
    After each step, if the new date is not a business day, it continues to
    add days until it reaches a business day.
    The new date is returned as a string.
    '''
    adate = pd.to_datetime(date)
    for _ in range(n):
        adate += timedelta(days=1)
        while adate.weekday() > 4: # Mon-Fri are 0-4
            adate += timedelta(days=1)
    return adate.strftime("%m-%d-%Y")
"""## Economic functions"""
def annualizedReturn(totalReturn, startDate, endDate):
    '''
    Input: float (cumulative return), string (date, of format 'mm-dd-yyyy'), string (date, of format 'mm-dd-yyyy')
    Output: float (annualized return)
    This function converts a cumulative return to an annualized return.
    The two date inputs are used to determine the number of years over which the cumulative return is realized.
    '''
    years = yearsBetween(startDate, endDate)
    return (totalReturn + 1)**(1/years) - 1
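"""A worked example of the annualization formula, (1 + R)^(1 / years) - 1, may help. This is a self-contained sketch: `years_between` and `annualized_return` are hypothetical names that mirror the logic of `yearsBetween` and `annualizedReturn`, and the input numbers are made up for illustration.

```python
def years_between(start, end):
    # Whole years plus whole months as twelfths; the day of the month is ignored.
    years = int(end[-4:]) - int(start[-4:])
    years += (int(end[:2]) - int(start[:2])) / 12
    return years

def annualized_return(total_return, years):
    # Geometric annualization: (1 + R)^(1 / years) - 1
    return (1 + total_return) ** (1 / years) - 1

years = years_between("01-01-2018", "01-01-2020")  # 2 years
print(annualized_return(0.21, years))              # roughly 0.10
```

A 21% cumulative return over two years corresponds to about 10% per year, because 1.10 * 1.10 = 1.21. Note that when both dates fall in the same month of the same year, the number of years is 0 and the exponent 1 / years raises a ZeroDivisionError.
"""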
def cumulativeReturn(periodReturns):
    '''
    Input: list of floats (returns for each period)
    Output: float (cumulative return)
    This function converts a list of consecutive returns into a single cumulative return.
    '''
    cumulativeReturn = 1
    for periodReturn in periodReturns:
        cumulativeReturn *= (1 + periodReturn)
return cumulativeReturn - 1
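"""Consecutive returns compound multiplicatively rather than add. The sketch below shows the same calculation with `math.prod` (Python 3.8+); the three period returns are made-up numbers.

```python
import math

period_returns = [0.10, -0.05, 0.20]

# Multiply the growth factors (1 + r) and subtract the initial 1.
growth = math.prod(1 + r for r in period_returns)
cumulative = growth - 1

print(round(cumulative, 4))  # 0.254
```

Simply summing the three returns would give 0.25, while compounding gives 1.10 * 0.95 * 1.20 - 1 = 0.254.
"""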
def getReturns(prices, predictions):
    '''
    Input: pandas Series of floats (consecutive price data), list of Booleans (predicted classifications)
    Output: float (static return), float (model return)
    This function calculates the static and model returns over a period.
    Static return: this takes the difference between the final price and the initial price (total gains)
        and normalizes it by dividing the difference by the initial price.
    Model return: this uses the Boolean predictions as a guide to buy or sell the stock.
        When the prediction is True and no share is held, it buys one share; when the prediction
        is False and a share is held, it sells it. Any share still held at the end of the period
        is sold at the final price. The total gains are also normalized by the initial stock price.
    '''
    # Returns if the stock was bought at the beginning of the period and not sold until the end of the period
    staticReturn = prices.iloc[-1] / prices.iloc[0]
    # Returns if the stock was bought when the model predicted the price would increase, and sold when the model predicted it would decrease
    buying = True
    balance = prices.iloc[0]
    for i in range(len(prices)):
        if predictions[i] and buying:
            balance -= prices.iloc[i]
            buying = False
        if (not predictions[i]) and (not buying):
            balance += prices.iloc[i]
            buying = True
    # Sell any share that is still held at the end of the period
    if not buying:
        balance += prices.iloc[-1]
    modelReturn = balance / prices.iloc[0]
    return staticReturn - 1, modelReturn - 1
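"""The trading rule in `getReturns` can be traced by hand on a small example. The sketch below re-implements the same rule on plain Python lists; `simulate_returns` is a hypothetical name, and the prices and predictions are made up for illustration.

```python
def simulate_returns(prices, predictions):
    # Buy-and-hold return over the whole period.
    static_return = prices[-1] / prices[0] - 1

    # Start with just enough cash to buy one share at the first price.
    balance = prices[0]
    holding = False
    for price, predicted_up in zip(prices, predictions):
        if predicted_up and not holding:
            balance -= price   # buy one share
            holding = True
        elif not predicted_up and holding:
            balance += price   # sell the share
            holding = False
    if holding:
        balance += prices[-1]  # close any open position at the final price

    model_return = balance / prices[0] - 1
    return static_return, model_return

static_return, model_return = simulate_returns(
    [100.0, 110.0, 105.0, 120.0], [True, False, True, False])
print(static_return, model_return)
```

In this example the model buys at 100, sells at 110, buys again at 105, and sells at 120, ending with a balance of 125 (a 25% return), while buying and holding returns about 20%. Only one share is traded at a time, so gains from earlier trades are not reinvested.
"""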
def printReturns(prices, predictions):
    '''
    Input: pandas Series of floats (consecutive price data), list of Booleans (predicted classifications)
    Output: None
    This function calculates the static and model returns over a period, and prints them out nicely.
    '''
    staticReturn, modelReturn = getReturns(prices, predictions)
    print('\n\033[1mTotal Static Return:\033[0m %.1f%%' % (100*staticReturn))
    print('\033[1mTotal Model Return:\033[0m %.1f%%' % (100*modelReturn))
def plotReturns(midDates, staticReturns, modelReturns):
    '''
    Inputs: list of strings (midDates), list of floats (staticReturns), list of floats (modelReturns)
    Output: None
    This function plots the static and model returns for all periods.
    The midDates represent the starting date of each test period.
    '''
    # Plot the static and model returns
    ind = np.arange(len(midDates))
    width = 0.4
    ax = plt.figure().add_subplot(1, 1, 1)
    ax.bar(ind, staticReturns, width, label = 'Static Returns')
    ax.bar(ind + width, modelReturns, width, label = 'Model Returns')
    ax.legend()
    # Clean up axis labels
    ax.axhline(y = 0, color = 'k', linewidth = 0.5)
    ax.set_yticklabels(['{:,.2%}'.format(tick) for tick in ax.get_yticks()])
    ax.set_xticks(ind[::3] + width / 2)
    ax.set_xticklabels(midDates[::3])
    for tick in ax.get_xticklabels():
        tick.set_rotation(45)
    ax.set_title("Returns from %s to %s" % (midDates[0], midDates[-1]))
def printTotalReturns(midDates, staticReturns, modelReturns):
    '''
    Inputs: list of strings (midDates), list of floats (staticReturns), list of floats (modelReturns)
    Output: None
    This function calculates the static and model returns over all periods, and prints them out nicely.
    '''
    totalStaticReturn = cumulativeReturn(staticReturns)
    totalModelReturn = cumulativeReturn(modelReturns)
    startDate, endDate = midDates[0], addSixMonths(midDates[-1])
    print('\nFrom the period %s to %s, the total returns were:' % (startDate, endDate))
    print('\tStatic Returns: %.2f%% (%.2f%% /yr)' % (100*totalStaticReturn, 100*annualizedReturn(totalStaticReturn, startDate, endDate)))
    print('\tModel Returns: %.2f%% (%.2f%% /yr)' % (100*totalModelReturn, 100*annualizedReturn(totalModelReturn, startDate, endDate)))
    if totalStaticReturn > totalModelReturn:
        msg = "holding onto the stock for the entire period."
    else:
        msg = "using the model to buy/sell the stock throughout the period."
    print('\nBased on these results, an investor would have been better off\033[1m', msg, '\033[0m')
"""## Dataframe manipulation functions"""
def splitByDate(data, startDate, endDate):
    '''
    Inputs: pandas dataframe (data), string (date, of format 'mm-dd-yyyy'), string (date, of format 'mm-dd-yyyy')
    Output: pandas dataframe
    This function returns a queried subset of a dataframe of all instances between the two given dates
    (startDate is included, endDate is excluded).
    The index of the queried dataframe is reset.
    '''
    startDate = pd.to_datetime(startDate)
    endDate = pd.to_datetime(endDate)
    splitData = data[data['Date'] >= startDate]
    splitData = splitData[splitData['Date'] < endDate]
    return splitData.reset_index(drop = True)
def getTrainTest(data, midDate, startDate = None, cols = allFeatures, daysIntoFuture = 1):
    '''
    Inputs: pandas dataframe (data), string (midDate, of format 'mm-dd-yyyy'), optional keyword arguments
    Keyword arguments:
    - startDate: string (of format 'mm-dd-yyyy')
    - cols: list of strings, where each element is a feature that is to be included in the returned dataframes
    - daysIntoFuture: int, used to define the classification of each instance
    Output: pandas dataframe (trainData), pandas dataframe (testData)
    This function creates two dataframes.
    TRAINING DATA
        This is a queried subset of a dataframe of all instances between the two dates.
        The starting date of this dataset is startDate. If startDate = None, it is set
        equal to midDate minus 1 year minus (daysIntoFuture - 1) business days.
        The ending date of this dataset is midDate minus daysIntoFuture business days.
    TEST DATA
        This is a queried subset of a dataframe of all instances between the two dates.
        The starting date of this dataset is midDate.
        The ending date of this dataset is midDate plus 6 months.
    The indices of both datasets are reset.
    '''
    if not startDate:
        startDate = subtractNDays(subtractOneYear(midDate), daysIntoFuture - 1)
    endDate = addSixMonths(midDate)
    # Restrict the data to the full window, then normalize the selected features
    data = splitByDate(data, startDate, endDate)
    data[cols] = normalize(data[cols])
    # Get splits
    trainData = splitByDate(data, startDate, subtractNDays(midDate, daysIntoFuture))
    testData = splitByDate(data, midDate, endDate)
    return trainData, testData
def rankFeatures(ticker, startDate, endDate):
    '''
    Inputs: string (ticker symbol), string (date, of format 'mm-dd-yyyy'), string (date, of format 'mm-dd-yyyy')
    Output: series (negated absolute correlation values, sorted so the strongest correlation comes first)
    This function returns the correlation values between the classification and each feature of the stock data.
    '''
    df = pullStockData(ticker)
    df = splitByDate(df, startDate, endDate)
    return (-abs(df.corr())).sort_values(by = 'Class').Class
"""## Accuracy functions"""
def getAccuracy(predictions, Y):
    '''
    Inputs: list of Booleans (predicted classifications), list of Booleans (actual classifications)
    Output: float
    This function returns the fraction of predictions that were correctly classified.
    '''
    return (predictions == Y).mean()
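"""The elementwise comparison in `getAccuracy` relies on the inputs being NumPy arrays or pandas Series; with plain Python lists, `predictions == Y` evaluates to a single bool, which has no `.mean()` method. A minimal sketch of the same calculation for plain lists follows; `accuracy` is a hypothetical name used only for illustration.

```python
def accuracy(predictions, actual):
    # Fraction of positions where the prediction matches the actual class.
    matches = sum(p == a for p, a in zip(predictions, actual))
    return matches / len(actual)

print(accuracy([True, False, True, True], [True, True, True, False]))  # 0.5
```

Two of the four predictions match, so the accuracy is 0.5.
"""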
def printAccuracies(accuracies):
    '''Prints the mean, minimum, median, and maximum of a list of accuracies.'''
    print('The mean accuracy of the model was %.2f%% (min %.2f%%, median %.2f%%, max %.2f%%)' % (np.mean(accuracies), min(accuracies), np.median(accuracies), max(accuracies)))
"""# Case Studies
## Case Study: Amazon.com, Inc. (AMZN)
Static Returns: 30.55% /yr
**Logistic Regression Returns: 31.51% /yr**
Neural Network Returns: 21.95% /yr
"""
logCols = ['50 Day Moving Average Ratio',
'Volume',
'Market / Book Ratio',
'P/E',
'Debt / Equity Ratio',
'Free Cash Flow Yield']
results = runAllPeriods('AMZN', startMidDate, finalMidDate, modelType = 'LogReg', cols = logCols, daysIntoFuture = 30, printYears = 'some', numIter = 1000)
netCols = ['5 Day Change',
'10 Day Moving Average Ratio',
'50 Day Moving Average Ratio',
'200 Day Moving Average Ratio',
'Volume',
'Market / Book Ratio',
'P/E',
'Debt / Equity Ratio',
'Free Cash Flow Yield']
results = runAllPeriods('AMZN', startMidDate, finalMidDate, modelType = 'NeuralNet', cols = netCols, trainDataYears = 1, daysIntoFuture = 130, printYears = 'some', numIter = 1000)
"""## Case Study: DuPont de Nemours, Inc. (DD)
Static Returns: 3.69% /yr
**Logistic Regression Returns: 5.66% /yr**
**Neural Network Returns: 7.93% /yr**
"""
logCols = ['Free Cash Flow Yield',
'Debt / Equity Ratio',
'Market / Book Ratio',
'P/E']
results = runAllPeriods('DD', startMidDate, finalMidDate, modelType = 'LogReg', cols = logCols, trainDataYears = 1, daysIntoFuture = 65, printYears = 'some', numIter = 1000)
logCols = ['Free Cash Flow Yield',