Regression & Regularization

##Statsmodels

Statsmodels is a relatively new package, but it provides richer utilities for investigating the results of a fitted model, such as a summary table of coefficients, standard errors, and p-values. It uses Patsy to provide R-style formula syntax.

A formula allows you to write a functional relationship between variables.
Example:

Y ~ X1 + X2 + X3

The formula automatically includes an intercept term. You can make this explicit by writing:

Y ~ 1 + X1 + X2 + X3

As you can see, `+` is not acting as an addition operator but as a separator between variables. Other operators also lose their algebraic meaning in a formula. `:` adds the interaction of two variables. `*` adds the original terms as well as their interaction effect, so `X1 * X2` is shorthand for `X1 + X2 + X1:X2`.
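To see what these operators actually produce, here is a minimal sketch that asks Patsy for the design-matrix column names of a few formulas. The variables `x1` and `x2` and their values are made up for illustration, and `columns` is a hypothetical helper, not part of Patsy. The `- 1` term, which drops the intercept, is standard Patsy syntax that the text above does not cover.

```python
import numpy as np
from patsy import dmatrix

# Made-up numeric predictors, purely for illustration.
data = {
    "x1": np.array([1.0, 2.0, 3.0, 4.0]),
    "x2": np.array([0.5, 1.5, 0.0, 2.0]),
}


def columns(formula):
    """Hypothetical helper: column names Patsy builds for a right-hand side."""
    return dmatrix(formula, data).design_info.column_names


plus_cols = columns("x1 + x2")               # intercept and main effects
colon_cols = columns("x1 + x2 + x1:x2")      # main effects plus the interaction
star_cols = columns("x1 * x2")               # shorthand for the line above
no_intercept_cols = columns("x1 + x2 - 1")   # "- 1" removes the intercept

for name, cols in [("+", plus_cols), (":", colon_cols),
                   ("*", star_cols), ("- 1", no_intercept_cols)]:
    print(name, cols)
```

The `*` and explicit `:` versions should list the same columns, which is a quick way to convince yourself that `*` is only a shorthand.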

import statsmodels.formula.api as sm
import pandas as pd

# The file is whitespace-delimited, so split columns on runs of whitespace
data = pd.read_csv("http://data.princeton.edu/wws509/datasets/salary.dat", sep=r'\s+')

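# Simple regression with a single predictor
model = sm.ols(formula="sl ~ yr", data=data).fit()
print(model.summary())

# Multiple regression with several predictors
model = sm.ols(formula="sl ~ sx + yr + rk", data=data).fit()
print(model.summary())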
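The fitted results object exposes more than the printed summary. The sketch below pulls out a few of its attributes on made-up data: the column names `sl` and `yr` mirror the salary example, but the numbers are synthetic and generated only so the snippet runs without downloading anything.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as sm

# Synthetic stand-in for the salary data (assumption: not the real dataset).
rng = np.random.default_rng(0)
yr = rng.integers(0, 25, size=50)
sl = 18000 + 700 * yr + rng.normal(0, 500, size=50)
toy = pd.DataFrame({"sl": sl, "yr": yr})

result = sm.ols(formula="sl ~ yr", data=toy).fit()

print(result.params)      # fitted intercept and slope
print(result.rsquared)    # R-squared of the fit
print(result.pvalues)     # p-value for each coefficient
print(result.conf_int())  # 95% confidence intervals by default
```

These attributes are handy when you want to compare several candidate formulas in a loop instead of reading each summary table by eye.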

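from patsy import dmatrices

# Build the response (y) and the design matrix (X) from the same formula,
# so they can be passed to libraries that do not understand formulas
y, X = dmatrices('sl ~ sx + yr + rk', data=data, return_type='dataframe')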
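It helps to look at what `dmatrices` returns before handing it to another library. The sketch below uses a tiny made-up frame (the values are invented, only the column names follow the salary example) to show that Patsy adds an `Intercept` column and turns a text column into a dummy variable.

```python
import pandas as pd
from patsy import dmatrices

# Invented rows, just enough to show the shape of the output.
toy = pd.DataFrame({
    "sl": [30000, 28000, 35000, 26000],
    "sx": ["male", "female", "male", "female"],
    "yr": [10, 8, 15, 3],
})

y, X = dmatrices("sl ~ sx + yr", data=toy, return_type="dataframe")

print(list(X.columns))  # includes an Intercept column and a dummy for sx
print(list(y.columns))  # the response column
print(X)
```

Because `X` already contains a column of ones, an estimator that fits its own intercept will effectively see the intercept twice. With scikit-learn you can either drop the `Intercept` column or pass `fit_intercept=False`.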

##Sklearn

Scikit-learn offers the same linear regression, and it also provides regularized variants and more robust methods.

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model = model.fit(X, y)
print(model.score(X, y))  # R-squared on the data used for fitting
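For a regressor, `score` returns R-squared. The sketch below checks that on made-up data by comparing it with `r2_score` computed from the predictions. The arrays `X_toy` and `y_toy` are synthetic and only stand in for real features and targets.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic features and target (assumption: illustrative values only).
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(40, 2))
y_toy = 1.0 + 2.0 * X_toy[:, 0] - 0.5 * X_toy[:, 1] + rng.normal(scale=0.1, size=40)

model = LinearRegression().fit(X_toy, y_toy)

score = model.score(X_toy, y_toy)                # built-in score
manual = r2_score(y_toy, model.predict(X_toy))   # R-squared from predictions

print(score, manual)
```

Scoring on the same data used for fitting is optimistic, which is why the assignment asks for metrics on a held-out test set.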

from sklearn import linear_model

model = linear_model.Ridge(alpha=0.5)  # alpha sets the strength of the L2 penalty
model.fit(X, y)

print(model.coef_)
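To get a feel for what `alpha` does, the sketch below fits Ridge with a few penalty strengths on made-up data and records the size of the coefficient vector each time. The data and the particular alpha values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data (assumption: illustrative only).
rng = np.random.default_rng(2)
X_toy = rng.normal(size=(30, 5))
y_toy = X_toy @ np.array([3.0, -2.0, 0.0, 1.0, 0.5]) + rng.normal(scale=0.5, size=30)

norms = {}
for alpha in [0.01, 1.0, 100.0]:
    coef = Ridge(alpha=alpha).fit(X_toy, y_toy).coef_
    norms[alpha] = float(np.linalg.norm(coef))  # Euclidean length of the coefficients
    print(alpha, norms[alpha])
```

Larger values of `alpha` pull the coefficients toward zero, trading a little bias for lower variance.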

from sklearn import linear_model

model = linear_model.RidgeCV(alphas=[0.1, 1.0, 10.0])  # candidate penalties to cross-validate over
model.fit(X, y)

print(model.coef_)
print(model.alpha_)  # the alpha selected by cross-validation
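`RidgeCV` picks one of the candidate alphas for you. By default (no `cv` argument) it uses an efficient form of leave-one-out cross-validation. A minimal sketch on made-up data, where `X_toy`, `y_toy`, and the candidate list are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Synthetic data (assumption: illustrative only).
rng = np.random.default_rng(4)
X_toy = rng.normal(size=(60, 4))
y_toy = X_toy @ np.array([2.0, 0.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=60)

alphas = [0.1, 1.0, 10.0]
cv_model = RidgeCV(alphas=alphas).fit(X_toy, y_toy)

chosen = float(cv_model.alpha_)  # always one of the candidates supplied
print(chosen)
print(cv_model.coef_)
```

If the selected alpha sits at the edge of the candidate list, it is usually worth widening the list and refitting.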

##Assignment

1. Split the data into training and test sets by assigning a random sample of 25% of the data to the test set.
2. Build a simple linear regression on the NCAA basketball dataset to predict the score margin (home team score - away team score). Try adding and dropping predictors to see whether they improve the model, and try adding interaction effects as well. (Note: beware of the computational overhead, since the number of interaction terms grows quickly.) Compare both R-squared and MAE (mean absolute error) on your test set.
3. Use the regularization options available (Ridge and Lasso) to try to improve the model.

Please also submit a commented Python file with some of the things you tried.
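The overall workflow could be sketched as follows. This is only an outline under stated assumptions: the feature matrix and `margin` target are synthetic stand-ins for the NCAA data, and the alpha values are arbitrary starting points rather than tuned choices.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real features and score margin (assumption).
rng = np.random.default_rng(3)
X_all = rng.normal(size=(200, 6))
true_coef = np.array([4.0, -3.0, 0.0, 0.0, 1.5, 0.0])
margin = X_all @ true_coef + rng.normal(scale=2.0, size=200)

# Hold out a random 25% of the rows as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X_all, margin, test_size=0.25, random_state=0
)

candidates = {
    "ols": LinearRegression(),
    "ridge": Ridge(alpha=1.0),   # arbitrary penalty, tune it yourself
    "lasso": Lasso(alpha=0.1),   # arbitrary penalty, tune it yourself
}

results = {}
for name, estimator in candidates.items():
    estimator.fit(X_train, y_train)
    predictions = estimator.predict(X_test)
    results[name] = {
        "r2": r2_score(y_test, predictions),
        "mae": mean_absolute_error(y_test, predictions),
    }
    print(name, results[name])
```

Fitting only on the training rows and scoring only on the test rows keeps the comparison between models honest.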

***Bonus:*** Let's try fitting the same model using a faster package, Vowpal Wabbit.

git clone https://github.com/JohnLangford/vowpal_wabbit.git
cd vowpal_wabbit
make install

Create a training file for Vowpal Wabbit using your favorite scripting language. The format should be as follows:

<salary> | This is a text feature | Some other feature | Job Title |

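Use `|` to separate groups of features. Within each group, Vowpal Wabbit treats every whitespace-separated token as its own feature and turns it into a dummy variable automatically, so no manual encoding is needed. Let's assume you've named your new files train.vw and test.vw.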
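A file in this format can be produced with a few lines of Python. The sketch below is one possible way to do it: `clean`, `to_vw_line`, and `write_vw_file` are hypothetical helper names, and the example rows are invented. Because `|` and `:` have special meaning in the Vowpal Wabbit input format, the sketch replaces them with spaces inside feature text.

```python
import re


def clean(text):
    """Replace characters that are special in the VW format with spaces."""
    return re.sub(r"[|:\s]+", " ", str(text)).strip()


def to_vw_line(label, feature_groups):
    """Format one example as '<label> | group one | group two'."""
    parts = [str(label)] + [clean(group) for group in feature_groups]
    return " | ".join(parts)


def write_vw_file(path, rows):
    """Write (label, feature_groups) pairs to path, one example per line."""
    with open(path, "w") as handle:
        for label, groups in rows:
            handle.write(to_vw_line(label, groups) + "\n")


# Invented examples: the label is a score margin, followed by text features.
rows = [
    (7, ["home Duke", "away: UNC|x"]),
    (-3, ["home Kansas", "away Kentucky"]),
]
lines = [to_vw_line(label, groups) for label, groups in rows]
for line in lines:
    print(line)
```

Calling `write_vw_file("train.vw", rows)` would then write the training file used in the commands that follow.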

See also: Vowpal Wabbit Examples, Vowpal Wabbit Input Validator

Try the following:

vw -c -k train.vw --loss_function squared -f model
vw -c -k train.vw --loss_function squared -f model --l1 0.0001  # with L1 regularization
vw -c -k train.vw --loss_function squared -f model --l2 0.0001  # with L2 regularization

vw -c -k -t test.vw -i model -p test.predictions
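Once the predictions are written, you can compare them with the labels in the test file. The sketch below computes MAE in plain Python. It assumes each line of test.vw starts with a numeric label followed by a space, and that each line of the predictions file starts with the predicted value. Both function names are hypothetical.

```python
def first_token_as_float(line):
    """Read the leading number from a line (the label or the prediction)."""
    return float(line.split()[0])


def mae_from_lines(label_lines, prediction_lines):
    """Mean absolute error between labels and predictions, line by line."""
    labels = [first_token_as_float(l) for l in label_lines if l.strip()]
    preds = [first_token_as_float(l) for l in prediction_lines if l.strip()]
    if len(labels) != len(preds):
        raise ValueError("label and prediction counts differ")
    return sum(abs(a - b) for a, b in zip(labels, preds)) / len(labels)


# Invented lines standing in for test.vw and test.predictions.
example_labels = ["3 | home Duke | away UNC", "-1 | home Kansas | away Kentucky"]
example_predictions = ["2.5", "0.0"]
mae = mae_from_lines(example_labels, example_predictions)
print(mae)
```

With real files you could pass `open("test.vw")` and `open("test.predictions")` as the two arguments, since both are iterated line by line.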

####Links
