
I'm working on a simple statistics problem with pandas and sklearn. I'm aware that my code is ugly; how can I improve it?

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression

df = pd.read_csv("sphist.csv")
df["Date"] = pd.to_datetime(df["Date"])
df.sort_values(["Date"], inplace=True)
df["day_5"] = np.nan
df["day_30"] = np.nan
df["std_5"] = np.nan


for i in range(30, len(df)):
    last_5 = df.iloc[i-5:i, 4]
    last_30 = df.iloc[i-30:i, 4]
    df.iloc[i, -3] = last_5.mean()
    df.iloc[i, -2] = last_30.mean()
    df.iloc[i, -1] = last_5.std()

df = df.iloc[30:]
df.dropna(axis=0, inplace=True)

train = df[df["Date"] < datetime(2013, 1, 1)]
test = df[df["Date"] >= datetime(2013, 1, 1)]
# print(train.head(), test.head())

X_cols = ["day_5", "day_30", "std_5"]
y_col = "Close"

lr = LinearRegression()
lr.fit(train[X_cols], train[y_col])
yhat = lr.predict(test[X_cols])
mse = mean_squared_error(yhat, test[y_col])
rmse = mse/len(yhat)
score = lr.score(test[X_cols], test[y_col])

print(rmse, score)

plt.scatter(yhat, test[y_col], c="k", s=1)
plt.plot([.95*yhat.min(), 1.05*yhat.max()], [.95*yhat.min(), 1.05*yhat.max()], c="r")
plt.show()
  1. It relies on hard-coded iloc indices, which are hard to read and maintain. How can I switch to column/row names instead?
  2. The code looks messy. Any advice on how to improve it?
Comment (Jan 23, 2019 at 8:50): It's hard to provide guidance on using column names without knowing what columns your CSV contains. Can you include an example (like the first 10 lines, for instance) of your dataset?

1 Answer


functions

This is one long script. Partition the code into logical blocks, for example:

  • get the raw data
  • summarize the data
  • split the data into train and test sets
  • get the result from the regression
  • plot the results
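A possible skeleton (get_raw_data, split_train_test and plot_results are suggested names; summarize and regression are worked out below):

def get_raw_data(path):
    """Read the CSV, parse the dates and sort by date."""

def summarize(df, date_label, x_label, y_label="Close"):
    """Build a new DataFrame with the rolling 5- and 30-day statistics."""

def split_train_test(df, date_label, threshold):
    """Split the rows into train and test sets on a date threshold."""

def regression(train, test, x_cols, y_col):
    """Fit a LinearRegression and return the predictions, RMSE and score."""

def plot_results(yhat, y_true):
    """Scatter the predictions against the actual values."""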

magical values

There are some magical values in your code, for example 4 as the column index and datetime(2013, 1, 1) as the threshold for splitting the data. Define them as variables (or as parameters to the functions).
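Concretely, near the top of the script (assuming column 4 of your CSV is the Close column):

CLOSE_COLUMN = "Close"               # instead of the bare column index 4
SPLIT_DATE = datetime(2013, 1, 1)    # threshold between train and test data
SHORT_WINDOW = 5                     # rolling windows used for the summary columns
LONG_WINDOW = 30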

dummy data

To illustrate this, I use the following dummy data:

def generate_dummy_data(
    x_label="x",
    date_label="date",
    size=100,
    seed=0,
    start="20120101",
    freq="7d",
):
    np.random.seed(seed)

    return pd.DataFrame(
        {
            "Close": np.random.randint(100, 200, size=size),
            x_label: np.random.randint(1000, 2000, size=size),
            # pd.date_range replaces the deprecated pd.DatetimeIndex(start=...) constructor
            date_label: pd.date_range(start=start, freq=freq, periods=size),
        }
    )

summarize

The rolling mean and standard deviation you compute in the loop can be done with the built-in pandas rolling functionality.
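On your own data this also removes the hard-coded iloc positions; assuming column 4 of your CSV is the Close column, the whole loop collapses to:

day_5 = df["Close"].rolling(5).mean()     # replaces the manual window df.iloc[i-5:i, 4]
day_30 = df["Close"].rolling(30).mean()
std_5 = df["Close"].rolling(5).std()

One subtlety: rolling(5) includes the current row, while your loop used the 5 rows before it (iloc[i-5:i]), so append .shift(1) to the rolling results if you want to reproduce your windows exactly.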

You also modify the raw data in place. It would be better to put this summary in a separate DataFrame and leave the original data untouched:

def summarize(df, date_label, x_label, y_label="Close"):
    return pd.DataFrame(
        {
            y_label: df[y_label],
            date_label: df[date_label],
            "day_5": df[x_label].rolling(5).mean(),
            "std_5": df[x_label].rolling(5).std(),
            "day_30": df[x_label].rolling(30).mean(),
        }
    ).dropna()
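A quick check with the dummy data shows that the input frame is left untouched:

data = generate_dummy_data()
summary = summarize(data, date_label="date", x_label="x")
assert "day_5" not in data.columns    # the raw data is not modified
print(summary.head())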

regression

Here I followed PEP 8 and renamed X_cols to x_cols. I also changed the RMSE calculation: mse/len(yhat) is not the root mean squared error; you need the square root of the MSE.

def regression(train, test, x_cols, y_col):
    lr = LinearRegression()
    lr.fit(train[x_cols], train[y_col])
    yhat = lr.predict(test[x_cols])
    mse = mean_squared_error(yhat, test[y_col])
    rmse = mse ** 0.5  # the root mean squared error is the square root of the MSE
    score = lr.score(test[x_cols], test[y_col])

    return yhat, rmse, score

main guard

If you put the calling code behind if __name__ == "__main__":, you can import this script from other code without running the analysis, and reuse the functions.

if __name__ == "__main__":

    x_label = "x"
    date_label = "date"
    y_label = "Close"
    # generate_dummy_data has no y_label parameter; the dummy frame always uses "Close"
    data = generate_dummy_data(x_label=x_label, date_label=date_label)

    summary = summarize(
        data, date_label=date_label, x_label=x_label, y_label=y_label
    )

    threshold = "20130101"

    train = summary.loc[summary[date_label] < threshold]
    test = summary.loc[summary[date_label] >= threshold]

    x_cols = ["day_5", "std_5", "day_30"]

    yhat, rmse, score = regression(train, test, x_cols, y_label)

    print(x_cols, rmse, score)

    plt.scatter(yhat, test[y_label], c="k", s=1)
    plt.plot(
        [0.95 * yhat.min(), 1.05 * yhat.max()],
        [0.95 * yhat.min(), 1.05 * yhat.max()],
        c="r",
    )
    plt.show()
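The plotting could also live in its own function (plot_results is a suggested name, using the plt import from your script), matching the last bullet at the top; the loop below could then call it instead of repeating the plt calls:

def plot_results(yhat, y_true):
    plt.scatter(yhat, y_true, c="k", s=1)
    low, high = 0.95 * yhat.min(), 1.05 * yhat.max()
    plt.plot([low, high], [low, high], c="r")    # y = x reference line
    plt.show()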

If you want to compare how each of the 3 features performs individually, you can do something like this:

for x_label in x_cols:
    yhat, rmse, score = regression(train, test, [x_label], y_label)

    print(x_label, rmse, score)

    plt.scatter(yhat, test[y_label], c="k", s=1)
    plt.plot([.95*yhat.min(), 1.05*yhat.max()], [.95*yhat.min(), 1.05*yhat.max()], c="r")
    plt.show()
