I am currently following along with Andrew Ng's Machine Learning course on Coursera and wanted to implement the gradient descent algorithm in Python 3 using numpy and pandas.

This is what I came up with:
import os
import numpy as np
import pandas as pd

def get_training_data(path):  # path to read data from
    raw_panda_data = pd.read_csv(path)

    # append a column of ones to the front of the data set
    raw_panda_data.insert(0, 'Ones', 1)
    num_columns = raw_panda_data.shape[1]                       # (num_rows, num_columns)

    panda_X = raw_panda_data.iloc[:,0:num_columns-1]            # [ slice_of_rows, slice_of_columns ]
    panda_y = raw_panda_data.iloc[:,num_columns-1:num_columns]  # [ slice_of_rows, slice_of_columns ]

    X = np.matrix(panda_X.values)  # pandas.DataFrame -> numpy.ndarray -> numpy.matrix
    y = np.matrix(panda_y.values)  # pandas.DataFrame -> numpy.ndarray -> numpy.matrix

    return X, y

def compute_mean_square_error(X, y, theta):
    summands = np.power(X * theta.T - y, 2)
    return np.sum(summands) / (2 * len(X))

def gradient_descent(X, y, learning_rate, num_iterations):
    num_parameters = X.shape[1]                                 # dim theta
    theta = np.matrix([0.0 for i in range(num_parameters)])     # init theta
    cost = [0.0 for i in range(num_iterations)]

    for it in range(num_iterations):
        # repeat the residual column so it can be multiplied element-wise with each column of X
        error = np.repeat((X * theta.T) - y, num_parameters, axis=1)
        error_derivative = np.sum(np.multiply(error, X), axis=0)
        theta = theta - (learning_rate / len(y)) * error_derivative
        cost[it] = compute_mean_square_error(X, y, theta)

    return theta, cost
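The loop is intended to implement the standard batch update for linear regression: with $m$ training examples, learning rate $\alpha$, and $\theta$ stored as a row vector, the cost and the update computed above are

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( x^{(i)} \theta^T - y^{(i)} \right)^2, \qquad \theta := \theta - \frac{\alpha}{m} \left( X \theta^T - y \right)^T X.$$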
This is how one could use the code:
X, y = get_training_data(os.getcwd() + '/data/data_set.csv')
theta, cost = gradient_descent(X, y, 0.008, 10000)
print('Theta: ', theta)
print('Cost: ', cost[-1])
Where data/data_set.csv could contain data (model used: 2 + x1 - x2 = y) looking like this:
x1, x2, y
0, 1, 1
1, 1, 2
1, 0, 3
0, 0, 2
2, 4, 0
4, 2, 4
6, 0, 8
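(As a quick check, the last row fits the model: $2 + 1 \cdot 6 - 1 \cdot 0 = 8$.)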
Output:
Theta: [[ 2. 1. -1.]]
Cost: 9.13586056551e-26
I'd especially like to get the following aspects of my code reviewed:
- Overall python style. I'm relatively new to python, coming from a C background, and not sure if I'm misunderstanding some concepts here.
- numpy/pandas integration. Do I use these packages correctly?
- Correctness of the gradient descent algorithm.
- Efficiency. How can I further improve my code?
Comment: Consider using np.zeros to initialize theta and cost in your gradient descent function; in my opinion it is clearer. Also, why uppercase X and lowercase y? I would make them consistent and perhaps even give them descriptive names, e.g. input and output. Finally, you could look into exception handling, e.g. for bad input data from pandas or invalid values for learning_rate or num_iterations.
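A minimal sketch of how those suggestions could be folded in, keeping the loop body unchanged (the validation checks and their messages are illustrative, not part of the original code):

def gradient_descent(X, y, learning_rate, num_iterations):
    # illustrative input validation, as the comment suggests
    if learning_rate <= 0:
        raise ValueError('learning_rate must be positive')
    if num_iterations <= 0:
        raise ValueError('num_iterations must be a positive integer')

    num_parameters = X.shape[1]
    theta = np.matrix(np.zeros(num_parameters))  # row vector of zeros, shape (1, num_parameters)
    cost = np.zeros(num_iterations)              # per-iteration cost, also via np.zeros

    for it in range(num_iterations):
        error = np.repeat((X * theta.T) - y, num_parameters, axis=1)
        error_derivative = np.sum(np.multiply(error, X), axis=0)
        theta = theta - (learning_rate / len(y)) * error_derivative
        cost[it] = compute_mean_square_error(X, y, theta)

    return theta, cost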
Comment: There is also theta = np.zeros_like(X) if you would like to initialize theta with an array of zeros with the dimensions of X.
Reply: theta doesn't have the same dimensions as X, though. Regardless, I'll keep the np.zeros_like(...) function in the back of my head.
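To illustrate the shape mismatch from that reply (using the example data set above, so the numbers are specific to it):

X, y = get_training_data(os.getcwd() + '/data/data_set.csv')

print(X.shape)                           # (7, 3): one row per example, one column per parameter
print(np.zeros_like(X).shape)            # (7, 3): matches X, not theta
print(np.zeros((1, X.shape[1])).shape)   # (1, 3): the shape theta actually needs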