Basic Econometrics - OLS via Stochastic Gradient Descent

3 minute read

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import statsmodels.api as sm


OLS with Stochastic Gradient Descent
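
The model is simple linear regression, $y_i = \alpha + \beta x_i + \varepsilon_i$. Stochastic gradient descent fits it one observation at a time: after each sample, every coefficient takes a small step against the gradient of that sample's squared error,

$$b_0 \leftarrow b_0 - \eta\,(\hat{y} - y), \qquad b_1 \leftarrow b_1 - \eta\,(\hat{y} - y)\,x,$$

where $\hat{y} = b_0 + b_1 x$ and $\eta$ is the learning rate (l_rate in the code below). We first simulate data and fit OLS analytically, then repeat the fit with SGD.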

# Generate data
n = 100
np.random.seed(0)
X = 2.5 * np.random.randn(n) + 1.5   # Array of n values with mean = 1.5, stddev = 2.5
res = 0.5 * np.random.randn(n)       # Generate n residual terms
y = 2 + 0.3 * X + res                  # Actual values of Y

# Create pandas dataframe to store our X and y values
df = pd.DataFrame(
    {'X': X,
     'y': y}
)

# Show the first five rows of our dataframe
df.head()
          X         y
0  5.910131  4.714615
1  2.500393  2.076238
2  3.946845  2.548811
3  7.102233  4.615368
4  6.168895  3.264107
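
Under this data-generating process, $y_i = 2 + 0.3\,x_i + \varepsilon_i$ with $x_i \sim \mathcal{N}(1.5,\,2.5^2)$ and $\varepsilon_i \sim \mathcal{N}(0,\,0.5^2)$, so a good estimator should recover an intercept near 2 and a slope near 0.3.
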
# Calculate the mean of X and y
xmean = np.mean(X)
ymean = np.mean(y)

# Calculate the terms needed for the numerator and denominator of beta
df['xycov'] = (df['X'] - xmean) * (df['y'] - ymean)
df['xvar'] = (df['X'] - xmean)**2

# Calculate beta and alpha
beta = df['xycov'].sum() / df['xvar'].sum()
alpha = ymean - (beta * xmean)
print(f'alpha = {alpha}')
print(f'beta = {beta}')
alpha = 2.0031670124623426
beta = 0.3229396867092763
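
These are the closed-form OLS estimators, which the cell above computes directly:

$$\hat{\beta} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad \hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x}.$$
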
ypred = alpha + beta * X

# Plot regression against actual data
plt.figure(figsize=(12, 6))
plt.plot(X, ypred)     # regression line
plt.plot(X, y, 'ro')   # scatter plot showing actual data
plt.title('Actual vs Predicted')
plt.xlabel('X')
plt.ylabel('y')

plt.show()

[Figure: 'Actual vs Predicted' - fitted regression line over the scatter of actual data]

# Make a prediction with the current coefficients.
# Each row stores the features followed by the target, so the last
# element is skipped when forming the prediction.
def predict(row, coefficients):
    yhat = coefficients[0]                    # start from the intercept
    for i in range(len(row) - 1):
        yhat += coefficients[i + 1] * row[i]  # add slope * feature
    return yhat

# Estimate linear regression coefficients with stochastic gradient descent:
# after every observation, nudge each coefficient against the gradient
# of that observation's squared error.
def coefficients_sgd(train, l_rate, n_epoch):
    coef = [0.0 for _ in range(len(train[0]))]
    for epoch in range(n_epoch):
        sum_error = 0
        for row in train:
            yhat = predict(row, coef)
            error = yhat - row[-1]            # prediction error for this row
            sum_error += error**2
            coef[0] = coef[0] - l_rate * error
            for i in range(len(row) - 1):
                coef[i + 1] = coef[i + 1] - l_rate * error * row[i]
        print('>epoch=%d, lrate=%.3f, error=%.3f' % (epoch, l_rate, sum_error))
    return coef

# Calculate coefficients on the simulated data
dataset = np.array(df[['X', 'y']])
l_rate = 0.001
n_epoch = 50
coef = coefficients_sgd(dataset, l_rate, n_epoch)
print(coef)
>epoch=0, lrate=0.001, error=468.733
>epoch=1, lrate=0.001, error=258.425
>epoch=2, lrate=0.001, error=206.738
>epoch=3, lrate=0.001, error=180.854
>epoch=4, lrate=0.001, error=160.966
>epoch=5, lrate=0.001, error=143.946
>epoch=6, lrate=0.001, error=129.113
>epoch=7, lrate=0.001, error=116.156
>epoch=8, lrate=0.001, error=104.837
>epoch=9, lrate=0.001, error=94.950
>epoch=10, lrate=0.001, error=86.314
>epoch=11, lrate=0.001, error=78.772
>epoch=12, lrate=0.001, error=72.184
>epoch=13, lrate=0.001, error=66.430
>epoch=14, lrate=0.001, error=61.404
>epoch=15, lrate=0.001, error=57.015
>epoch=16, lrate=0.001, error=53.181
>epoch=17, lrate=0.001, error=49.832
>epoch=18, lrate=0.001, error=46.907
>epoch=19, lrate=0.001, error=44.352
>epoch=20, lrate=0.001, error=42.120
>epoch=21, lrate=0.001, error=40.171
>epoch=22, lrate=0.001, error=38.468
>epoch=23, lrate=0.001, error=36.980
>epoch=24, lrate=0.001, error=35.681
>epoch=25, lrate=0.001, error=34.545
>epoch=26, lrate=0.001, error=33.554
>epoch=27, lrate=0.001, error=32.687
>epoch=28, lrate=0.001, error=31.930
>epoch=29, lrate=0.001, error=31.269
>epoch=30, lrate=0.001, error=30.691
>epoch=31, lrate=0.001, error=30.186
>epoch=32, lrate=0.001, error=29.745
>epoch=33, lrate=0.001, error=29.359
>epoch=34, lrate=0.001, error=29.022
>epoch=35, lrate=0.001, error=28.728
>epoch=36, lrate=0.001, error=28.471
>epoch=37, lrate=0.001, error=28.246
>epoch=38, lrate=0.001, error=28.049
>epoch=39, lrate=0.001, error=27.877
>epoch=40, lrate=0.001, error=27.727
>epoch=41, lrate=0.001, error=27.596
>epoch=42, lrate=0.001, error=27.481
>epoch=43, lrate=0.001, error=27.381
>epoch=44, lrate=0.001, error=27.293
>epoch=45, lrate=0.001, error=27.216
>epoch=46, lrate=0.001, error=27.149
>epoch=47, lrate=0.001, error=27.090
>epoch=48, lrate=0.001, error=27.039
>epoch=49, lrate=0.001, error=26.994
[1.9381705148304995, 0.335948469233473]
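
As a minimal sketch (the helper name sgd_simple is mine, not part of the original code), the same procedure can be written directly against the two NumPy arrays, which makes the per-sample update rule easier to read:

def sgd_simple(X, y, l_rate=0.001, n_epoch=50):
    # Per-sample SGD for y ~ b0 + b1 * x; same updates as coefficients_sgd.
    b0, b1 = 0.0, 0.0
    for _ in range(n_epoch):
        for xi, yi in zip(X, y):
            error = (b0 + b1 * xi) - yi  # prediction error for this sample
            b0 -= l_rate * error         # intercept step
            b1 -= l_rate * error * xi    # slope step
    return b0, b1

Because it performs the same updates in the same order, sgd_simple(X, y) should return the same pair printed above.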

Compare with the coefficient estimates from statsmodels:

mod = sm.OLS(df.y, sm.add_constant(df['X']))
results = mod.fit()
results.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.715
Model:                            OLS   Adj. R-squared:                  0.712
Method:                 Least Squares   F-statistic:                     245.5
Date:                Thu, 10 Dec 2020   Prob (F-statistic):           1.93e-28
Time:                        15:28:41   Log-Likelihood:                -75.359
No. Observations:                 100   AIC:                             154.7
Df Residuals:                      98   BIC:                             159.9
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          2.0032      0.062     32.273      0.000       1.880       2.126
X              0.3229      0.021     15.669      0.000       0.282       0.364
==============================================================================
Omnibus:                        5.184   Durbin-Watson:                   1.995
Prob(Omnibus):                  0.075   Jarque-Bera (JB):                3.000
Skew:                           0.210   Prob(JB):                        0.223
Kurtosis:                       2.262   Cond. No.                         3.73
==============================================================================



Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
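
The const and X coefficients (2.0032 and 0.3229) match the analytical alpha and beta computed earlier, as they should: both solve the same least-squares problem exactly. The SGD estimates [1.938, 0.336] are approximations; with more epochs or a smaller (or decaying) learning rate they should converge toward these values.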