
I'm working on a simple statistics problem with pandas and sklearn. I'm aware that my code is ugly; how can I improve it?

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression

df = pd.read_csv("sphist.csv")
df["Date"] = pd.to_datetime(df["Date"])
df.sort_values(["Date"], inplace=True)
df["day_5"] = np.nan
df["day_30"] = np.nan
df["std_5"] = np.nan


for i in range(30, len(df)):
    last_5 = df.iloc[i-5:i, 4]
    last_30 = df.iloc[i-30:i, 4]
    df.iloc[i, -3] = last_5.mean()
    df.iloc[i, -2] = last_30.mean()
    df.iloc[i, -1] = last_5.std()

df = df.iloc[30:]
df.dropna(axis=0, inplace=True)

train = df[df["Date"] < datetime(2013, 1, 1)]
test = df[df["Date"] >= datetime(2013, 1, 1)]
# print(train.head(), test.head())

X_cols = ["day_5", "day_30", "std_5"]
y_col = "Close"

lr = LinearRegression()
lr.fit(train[X_cols], train[y_col])
yhat = lr.predict(test[X_cols])
mse = mean_squared_error(yhat, test[y_col])
rmse = mse/len(yhat)
score = lr.score(test[X_cols], test[y_col])

print(rmse, score)

plt.scatter(yhat, test[y_col], c="k", s=1)
plt.plot([.95*yhat.min(), 1.05*yhat.max()], [.95*yhat.min(), 1.05*yhat.max()], c="r")
plt.show()
  1. It relies on hard-coded iloc indices, which are hard to read and maintain. How can I switch to column/row names instead?
  2. The code looks messy. Any advice on how to improve it?
Comment (Jan 23, 2019 at 8:50): It's hard to provide guidance on using column names without knowing what columns your CSV contains. Can you include an example (like the first 10 lines, for instance) of your dataset?

1 Answer


functions

This is one long script. Partition the code into logical blocks, for example:

  • get the raw data
  • summarize the data
  • split the data into train and test sets
  • get the result from the regression
  • plot the results
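A possible skeleton (get_raw_data, split_train_test and plot_results are suggested names; summarize and regression are worked out below):

def get_raw_data(path):
    """Read the CSV, parse the dates and sort by date."""

def summarize(df, date_label, x_label, y_label="Close"):
    """Build a new DataFrame with the rolling 5- and 30-day statistics."""

def split_train_test(df, date_label, threshold):
    """Split the rows into train and test sets on a date threshold."""

def regression(train, test, x_cols, y_col):
    """Fit a LinearRegression and return the predictions, RMSE and score."""

def plot_results(yhat, y_true):
    """Scatter the predictions against the actual values."""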

magical values

There are some magical values in your code, for example 4 as the column index and datetime(2013, 1, 1) as the threshold for splitting the data. Define them as variables (or as parameters to the functions).
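Concretely, near the top of the script (assuming column 4 of your CSV is the Close column):

CLOSE_COLUMN = "Close"               # instead of the bare column index 4
SPLIT_DATE = datetime(2013, 1, 1)    # threshold between train and test data
SHORT_WINDOW = 5                     # rolling windows used for the summary columns
LONG_WINDOW = 30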

dummy data

To illustrate this, I use the following dummy data:

def generate_dummy_data(
    x_label="x",
    date_label="date",
    size=100,
    seed=0,
    start="20120101",
    freq="7d",
):
    np.random.seed(seed)

    return pd.DataFrame(
        {
            "Close": np.random.randint(100, 200, size=size),
            x_label: np.random.randint(1000, 2000, size=size),
            # pd.date_range replaces the deprecated pd.DatetimeIndex(start=...) constructor
            date_label: pd.date_range(start=start, freq=freq, periods=size),
        }
    )

summarize

The rolling mean and standard deviation you compute in the loop can be done with the built-in pandas rolling functionality.
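On your own data this also removes the hard-coded iloc positions; assuming column 4 of your CSV is the Close column, the whole loop collapses to:

day_5 = df["Close"].rolling(5).mean()     # replaces the manual window df.iloc[i-5:i, 4]
day_30 = df["Close"].rolling(30).mean()
std_5 = df["Close"].rolling(5).std()

One subtlety: rolling(5) includes the current row, while your loop used the 5 rows before it (iloc[i-5:i]), so append .shift(1) to the rolling results if you want to reproduce your windows exactly.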

You also modify the raw data in place. It would be better to put this summary in a separate DataFrame and leave the original data untouched:

def summarize(df, date_label, x_label, y_label="Close"):
    return pd.DataFrame(
        {
            y_label: df[y_label],
            date_label: df[date_label],
            "day_5": df[x_label].rolling(5).mean(),
            "std_5": df[x_label].rolling(5).std(),
            "day_30": df[x_label].rolling(30).mean(),
        }
    ).dropna()
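A quick check with the dummy data shows that the input frame is left untouched:

data = generate_dummy_data()
summary = summarize(data, date_label="date", x_label="x")
assert "day_5" not in data.columns    # the raw data is not modified
print(summary.head())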

regression

Here I followed PEP 8 and renamed X_cols to x_cols. I also changed the RMSE calculation: mse/len(yhat) is not the root mean squared error; you need the square root of the MSE.

def regression(train, test, x_cols, y_col):
    lr = LinearRegression()
    lr.fit(train[x_cols], train[y_col])
    yhat = lr.predict(test[x_cols])
    mse = mean_squared_error(yhat, test[y_col])
    rmse = mse ** 0.5  # the root mean squared error is the square root of the MSE
    score = lr.score(test[x_cols], test[y_col])

    return yhat, rmse, score

main guard

If you put the calling code behind if __name__ == "__main__":, you can import this script from other code without running the analysis, and reuse the functions.

if __name__ == "__main__":

    x_label = "x"
    date_label = "date"
    y_label = "Close"
    # generate_dummy_data has no y_label parameter; the dummy frame always uses "Close"
    data = generate_dummy_data(x_label=x_label, date_label=date_label)

    summary = summarize(
        data, date_label=date_label, x_label=x_label, y_label=y_label
    )

    threshold = "20130101"

    train = summary.loc[summary[date_label] < threshold]
    test = summary.loc[summary[date_label] >= threshold]

    x_cols = ["day_5", "std_5", "day_30"]

    yhat, rmse, score = regression(train, test, x_cols, y_label)

    print(x_cols, rmse, score)

    plt.scatter(yhat, test[y_label], c="k", s=1)
    plt.plot(
        [0.95 * yhat.min(), 1.05 * yhat.max()],
        [0.95 * yhat.min(), 1.05 * yhat.max()],
        c="r",
    )
    plt.show()
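The plotting could also live in its own function (plot_results is a suggested name, using the plt import from your script), matching the last bullet at the top; the loop below could then call it instead of repeating the plt calls:

def plot_results(yhat, y_true):
    plt.scatter(yhat, y_true, c="k", s=1)
    low, high = 0.95 * yhat.min(), 1.05 * yhat.max()
    plt.plot([low, high], [low, high], c="r")    # y = x reference line
    plt.show()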

If you want to compare how each of the 3 features performs individually, you can do something like this:

for x_label in x_cols:
    yhat, rmse, score = regression(train, test, [x_label], y_label)

    print(x_label, rmse, score)

    plt.scatter(yhat, test[y_label], c="k", s=1)
    plt.plot([.95*yhat.min(), 1.05*yhat.max()], [.95*yhat.min(), 1.05*yhat.max()], c="r")
    plt.show()
