Regression & Regularization
##Statsmodels
Statsmodels is a relatively new package, but it provides better utilities for investigating the results of a model. It uses Patsy to provide R-style formula syntax.
A formula allows you to write a functional relationship between variables.
Example:
Y ~ X1 + X2 + X3
It automatically assumes there is an intercept term. You can make this explicit by using
Y ~ 1 + X1 + X2 + X3
As you can see, `+` is not acting as an addition operator but as a separator between variables. Other operators also lose their algebraic meaning in a formula: `:` adds the interaction of two variables, and `*` adds the original terms as well as their interaction effect.
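To make the operators concrete, here is a minimal sketch of the design matrices the three formulas correspond to, built by hand with NumPy rather than with Patsy. It assumes two numeric predictors; the names `x1` and `x2` and their values are made up for illustration.

```python
import numpy as np

# Hypothetical toy predictors; the names x1 and x2 are illustrative only.
x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = np.array([0.0, 1.0, 0.0, 1.0])
intercept = np.ones_like(x1)

# "y ~ x1 + x2": intercept plus the two main effects.
X_additive = np.column_stack([intercept, x1, x2])

# "y ~ x1:x2": intercept plus only the interaction (elementwise product).
X_interaction_only = np.column_stack([intercept, x1 * x2])

# "y ~ x1 * x2": shorthand for "y ~ x1 + x2 + x1:x2".
X_full = np.column_stack([intercept, x1, x2, x1 * x2])

print(X_full)
```

For categorical variables the interaction columns are built from dummy variables instead, so this sketch only covers the numeric case.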
import statsmodels.formula.api as sm
import pandas as pd
# Whitespace-delimited file; the raw string avoids an invalid-escape warning
data = pd.read_csv("http://data.princeton.edu/wws509/datasets/salary.dat", sep=r'\s+')
model = sm.ols(formula="sl ~ yr", data=data).fit()
model.summary()
model = sm.ols(formula="sl ~ sx + yr + rk", data=data).fit()
model.summary()
from patsy import dmatrices
y, X = dmatrices('sl ~ sx + yr + rk', data=data, return_type='dataframe')
###Sklearn
Scikit-learn offers the same linear regression, and it also provides regularization options and more robust methods.
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model = model.fit(X,y)
model.score(X,y)
from sklearn import linear_model
model = linear_model.Ridge(alpha = .5)
model.fit(X,y)
print(model.coef_)
from sklearn import linear_model
model = linear_model.RidgeCV(alphas=[0.1, 1.0, 10.0])
model.fit(X,y)
print(model.coef_)
print(model.alpha_)
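To see what the `alpha` parameter controls, the following is a minimal sketch of the ridge closed-form solution in plain NumPy. It is not how scikit-learn implements `Ridge` internally: it assumes no intercept term, and the data are synthetic.

```python
import numpy as np

# Synthetic data with known coefficients (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_coef = np.array([2.0, -1.0, 0.5])
y = X @ true_coef + rng.normal(scale=0.1, size=50)


def ridge_coefficients(X, y, alpha):
    """Solve (X'X + alpha * I) beta = X'y for beta."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


# Larger alpha means a stronger penalty, which shrinks the coefficients.
for alpha in [0.0, 1.0, 10.0, 100.0]:
    beta = ridge_coefficients(X, y, alpha)
    print(alpha, np.round(beta, 3), round(float(np.linalg.norm(beta)), 3))
```

With `alpha = 0` this reduces to ordinary least squares. `RidgeCV` chooses among the candidate `alphas` by cross-validation instead of fixing one value.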
Also please submit a commented Python file with some of the things you tried.
##Assignment
- Split the data into training and test sets by assigning a random sample of 25% of the data to a test dataset.
- Build a simple linear regression on the NCAA basketball dataset to predict the score margin (home team score - away team score). Try adding and dropping parameters and see if they improve the model. Try adding interaction effects to improve your model. (Note: beware of the computational overhead.) Compare both R-squared and MAE on your test set.
- Use the regularization options (Ridge and Lasso) available to attempt to improve the model.
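The workflow in the steps above could look roughly like the sketch below. It uses synthetic data as a stand-in for the NCAA dataset, so the feature matrix, the coefficients, and the `alpha` values are all assumptions to replace with your own.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the NCAA data: 200 games, 4 made-up features.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = X @ np.array([3.0, -2.0, 0.0, 1.0]) + rng.normal(scale=2.0, size=200)

# Hold out a random 25% of the rows as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The alpha values here are arbitrary starting points, not tuned.
models = {
    "ols": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.1),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    # Score on the held-out rows only.
    results[name] = {
        "r2": r2_score(y_test, predictions),
        "mae": mean_absolute_error(y_test, predictions),
    }
    print(name, results[name])
```

Scoring on the held-out rows rather than the training rows is what makes the comparison between models meaningful.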
***Bonus: Let's try doing the same model using a faster package
git clone git://github.com/JohnLangford/vowpal_wabbit.git
cd vowpal_wabbit
make install
Create a training file for vowpal wabbit using your favorite scripting language, the format should be as follows:
<salary> | This is a text feature | Some other feature | Job Title |
Use the `|` to separate features. Vowpal Wabbit will handle turning these into dummy variables automatically. Let's assume you've named your new files train.vw and test.vw.
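As one way to produce such a file, here is a minimal Python sketch. The rows and the helper name `to_vw_line` are hypothetical. The characters `:` and `|` are stripped from the text because both have special meaning in the Vowpal Wabbit input format.

```python
import re


def to_vw_line(label, fields):
    """Format one example as '<label> | field one | field two |'."""
    cleaned = []
    for field in fields:
        # ':' and '|' are special in the VW input format, so replace them.
        text = re.sub(r"[:|]", " ", str(field))
        # Collapse whitespace (including newlines) so each example stays on one line.
        cleaned.append(" ".join(text.split()))
    return "{} | {} |".format(label, " | ".join(cleaned))


# Hypothetical rows: (salary, free-text description, location, job title).
rows = [
    (35000, "Entry level: data analyst", "London", "Analyst"),
    (52000, "Senior engineer | backend", "Leeds", "Engineer"),
]

with open("train.vw", "w") as handle:
    for salary, *features in rows:
        handle.write(to_vw_line(salary, features) + "\n")
```

The same function can be reused to write the test file from the held-out rows.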
Vowpal Wabbit Examples Vowpal Wabbit Input Validator
Try the following:
vw -c -k train.vw --loss_function squared -f model
vw -c -k train.vw --loss_function squared -f model --l1 0.0001 # L1 regularization
vw -c -k train.vw --loss_function squared -f model --l2 0.0001 # L2 regularization
vw -c -k -t test.vw -i model -p test.predictions
####Links