SoniSiddharth/ML-Linear-Regression-from-scratch

Latest commit a5c5f95 · May 23, 2021

Linear Regression ⭐⭐

Directory Structure πŸ“

β”‚   collinear_dataset.py     
β”‚   compare_time.py
β”‚   contour_plot.gif
β”‚   degreevstheta.py
β”‚   gif1.gif
β”‚   gif2.gif
β”‚   linear_regression_test.py
β”‚   line_plot.gif
β”‚   Makefile
β”‚   metrics.py
β”‚   Normal_regression.py     
β”‚   plot_contour.py
β”‚   poly_features_test.py    
β”‚   README.md
β”‚   surface_plot.gif
β”‚
β”œβ”€β”€β”€images
β”‚       q5plot.png
β”‚       q6plot.png
β”‚       q8features.png       
β”‚       q8samples.png
β”‚
β”œβ”€β”€β”€linearRegression
β”‚   β”‚   linearRegression.py
β”‚   β”‚   __init__.py
β”‚   β”‚
β”‚   └───__pycache__
β”‚           linearRegression.cpython-37.pyc
β”‚           __init__.cpython-37.pyc
β”‚
β”œβ”€β”€β”€preprocessing
β”‚   β”‚   polynomial_features.py
β”‚   β”‚   __init__.py
β”‚   β”‚
β”‚   └───__pycache__
β”‚           polynomial_features.cpython-37.pyc
β”‚           __init__.cpython-37.pyc
β”‚
β”œβ”€β”€β”€temp_images
└───__pycache__
        metrics.cpython-37.pyc

Instructions to run πŸƒ

make help
make regression
make polynomial_features
make normal_regression
make poly_theta
make contour
make compare_time
make collinear

Stochastic GD (Batch size = 1) ☝️

  • Learning rate type = constant: RMSE 0.9119624181584616, MAE 0.7126923090787688

  • Learning rate type = inverse: RMSE 0.9049599308106121, MAE 0.7098334683036919

Vanilla GD (Batch size = N) βœ‹

  • Learning rate type = constant: RMSE 0.9069295672718122, MAE 0.7108301179089876

  • Learning rate type = inverse: RMSE 0.9607329070540364, MAE 0.7641616657610887

Mini Batch GD (Batch size between 1 and N; here 5) 🤘

  • Learning rate type = constant: RMSE 0.9046502501334435, MAE 0.7102161700019564

  • Learning rate type = inverse: RMSE 0.9268357442221973, MAE 0.7309246821952116
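The three variants above differ only in how many samples feed each gradient step, and the two learning-rate types differ only in whether the step size decays. The sketch below is a minimal NumPy version, not the repository's `linearRegression.py` implementation; the function name, default learning rate, and iteration count are illustrative assumptions.

```python
import numpy as np

def gradient_descent(X, y, batch_size=5, lr=0.01, lr_type="constant",
                     n_iter=1000, seed=0):
    """Mini-batch gradient descent for linear regression (MSE loss).

    batch_size=1 gives stochastic GD, batch_size=len(y) gives vanilla GD.
    lr_type="inverse" decays the step size as lr / t.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for t in range(1, n_iter + 1):
        idx = rng.choice(n, size=batch_size, replace=False)
        Xb, yb = X[idx], y[idx]
        # Gradient of mean squared error on the mini-batch
        grad = (2.0 / batch_size) * Xb.T @ (Xb @ theta - yb)
        step = lr if lr_type == "constant" else lr / t
        theta -= step * grad
    return theta
```

With the inverse schedule the steps shrink over time, which tames the noise of small batches but can slow convergence for full-batch GD, consistent with the RMSE pattern above.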

Polynomial Feature Transformation πŸ”°

  • The output for [[1, 2]] is [[1, 1, 2, 1, 2, 4]]

  • The output for [[1, 2, 3]] is [[1, 1, 2, 3, 1, 2, 3, 4, 6, 9]]

  • The outputs match those of sklearn's PolynomialFeatures fit_transform
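The transformation enumerates every monomial of the input features up to the given degree, bias term first. A minimal pure-Python sketch (not the repository's `preprocessing/polynomial_features.py`, whose API may differ) that reproduces the listed outputs:

```python
from itertools import combinations_with_replacement

def polynomial_features(X, degree=2):
    """All monomials of each row up to `degree`, in sklearn's
    PolynomialFeatures order: bias, then degree-1 terms, then degree-2, ..."""
    out = []
    for row in X:
        features = []
        for d in range(degree + 1):
            # combinations_with_replacement yields e.g. (x1, x1), (x1, x2), (x2, x2)
            for combo in combinations_with_replacement(row, d):
                prod = 1
                for v in combo:
                    prod *= v
                features.append(prod)
        out.append(features)
    return out

print(polynomial_features([[1, 2]]))     # [[1, 1, 2, 1, 2, 4]]
print(polynomial_features([[1, 2, 3]]))  # [[1, 1, 2, 3, 1, 2, 3, 4, 6, 9]]
```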

Theta vs degree πŸ“ˆ


  • Conclusion: As the degree of the polynomial increases, the norm of theta increases because of overfitting.

L2 Norm of Theta vs Degree of Polynomial for varying Sample size πŸ“ˆ


Conclusion

  • As the degree increases, the magnitude of theta increases due to overfitting of the data.
  • At the same degree, as the number of samples increases, the magnitude of theta decreases, because more samples reduce the overfitting to some extent.
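Both effects are easy to reproduce with a plain least-squares fit. In this sketch the target function, noise level, and sample sizes are illustrative assumptions, not taken from the repository:

```python
import numpy as np

def theta_norm(n_samples, degree, seed=0):
    """L2 norm of the least-squares polynomial coefficients for a noisy sine."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1, 1, n_samples)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, n_samples)
    X = np.vander(x, degree + 1, increasing=True)  # columns 1, x, x^2, ...
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.linalg.norm(theta)
```

Typically `theta_norm` grows quickly with `degree` at a fixed small sample size, and shrinks again when `n_samples` is increased at a fixed degree.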

Linear Regression line fit πŸ”₯


Linear Regression Surface plot πŸ”₯


Linear Regression Contour plot πŸ”₯


Time Complexities ⏳

  • Theoretical time complexity of the normal equation: O(N·D^2 + D^3)
  • Theoretical time complexity of gradient descent: O((t + N)·D^2)
  • Here N is the number of samples, D the number of features, and t the number of iterations.
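The O(N·D^2) term comes from forming XᵀX and the O(D^3) term from solving the resulting D×D system. A minimal sketch of the closed-form solution (not the repository's `Normal_regression.py`; it uses `np.linalg.solve` instead of an explicit matrix inverse, which computes the same theta more stably):

```python
import numpy as np

def normal_equation(X, y):
    """Closed-form least squares: theta = (X^T X)^{-1} X^T y.

    Forming X^T X costs O(N * D^2); solving the D x D system costs O(D^3).
    """
    return np.linalg.solve(X.T @ X, X.T @ y)
```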

Time vs Number of Features β³πŸ“Š


When the number of samples is kept constant, the normal-equation solution takes more time because its cost has a D^3 factor, whereas gradient descent's cost has only a D^2 factor.

Time vs Number of Samples β³πŸ“Š


When the number of features is kept constant and the number of samples is varied, the normal equation still takes longer than gradient descent because of its higher computational cost.

Multicollinearity in Dataset ❗ ❗

  • The gradient descent implementation still works in the presence of multicollinearity.
  • But as the multiplication factor between the collinear features increases, the RMSE and MAE values shoot up sharply.
  • Multicollinearity also reduces the precision of the estimated coefficients.
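One way to see why coefficient precision degrades: as one feature approaches an exact multiple of another, the condition number of XᵀX explodes, so tiny changes in the data produce large swings in the fitted coefficients. A hypothetical sketch (the repository's `collinear_dataset.py` may construct its data differently; the noise scale and factor here are assumptions):

```python
import numpy as np

def collinear_design(n=100, factor=2.0, noise=1.0, seed=0):
    """Design matrix [1, x1, x2] where x2 = factor * x1 + perturbation."""
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=n)
    x2 = factor * x1 + noise * rng.normal(size=n)
    return np.c_[np.ones(n), x1, x2]

# Smaller perturbation -> nearly collinear columns -> worse conditioning
for noise in (1.0, 1e-3, 1e-6):
    X = collinear_design(noise=noise)
    print(noise, np.linalg.cond(X.T @ X))
```

The printed condition numbers grow by orders of magnitude as the perturbation shrinks, which is exactly the regime where RMSE, MAE, and coefficient estimates become unstable.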