Learning/multi_linear_regression.py at main · Geekymedic-codes/Learning · GitHub

# -*- coding: utf-8 -*-
"""Multi linear regression

Automatically generated by Colab.

Original file is located at
    https://colab.research.google.com/drive/1Q8HGgky2YH0FNS8drOXsKKOWOHw2RwYK
"""

# Commented out IPython magic to ensure Python compatibility.
# %matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D

"""Next, load the data as a Pandas DataFrame. We use the well-known mtcars dataset, extracted from the 1974 Motor Trend US magazine, which records fuel consumption and 10 aspects of automobile design and performance for 32 automobiles."""

#read the dataset
df = pd.read_csv('https://raw.githubusercontent.com/Explore-AI/Public-Data/master/Data/regression_sprint/mtcars.csv', index_col=0)
df.head()

#Explore the shape of the dataset
df.shape

"""# **Modelling miles per gallon**"""

# create figure and 3D axes
fig = plt.figure(figsize=(8,7))
ax = fig.add_subplot(111, projection='3d')

# set axis labels
ax.set_zlabel('MPG')
ax.set_xlabel('No. of Cylinders')
ax.set_ylabel('Weight (1000 lbs)')

# scatter plot with response variable and 2 predictors
ax.scatter(df['cyl'], df['wt'], df['mpg'], c='red', marker='o', s=50)
ax.set_title('3D Scatter Plot: MPG vs Cylinders and Weight')
plt.show()

"""# **Fitting a multivariate regression model**"""

# import regression module
from sklearn.linear_model import LinearRegression

# split predictors and response
X = df.drop(['mpg'], axis=1)
y = df['mpg']

# create model object
lm = LinearRegression()

# import train/test split module
from sklearn.model_selection import train_test_split

# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.20,
                                                    random_state=1)

# train model
lm.fit(X_train, y_train)

# extract model intercept
beta_0 = float(lm.intercept_)
print("Intercept:", beta_0)

# extract model coefficients
beta_js = pd.DataFrame(lm.coef_, X.columns, columns=['Coefficient'])

beta_js
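
"""The fitted model predicts mpg as the intercept plus a weighted sum of the predictors: mpg_hat = beta_0 + beta_1*x_1 + ... + beta_p*x_p. As a minimal sketch of that equation, the hypothetical helper `manual_predict` below (not part of scikit-learn) rebuilds predictions from an intercept and a coefficient vector. On the model above, `manual_predict(lm.intercept_, lm.coef_, X_test)` should match `lm.predict(X_test)` up to floating-point error."""

import numpy as np

def manual_predict(intercept, coefs, X):
    """Return intercept + X @ coefs, the linear-regression prediction."""
    X = np.asarray(X, dtype=float)
    coefs = np.asarray(coefs, dtype=float)
    return intercept + X @ coefs

# hand-checkable example: y_hat = 1 + 2*x1 - 3*x2 gives 3, -2 and 2 for the three rows
demo_pred = manual_predict(1.0, [2.0, -3.0], [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0]])
print(demo_pred)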

"""Let's see what our model looks like in a few two-dimensional plots by plotting wt, disp, cyl, and hp vs. mpg, respectively (top-left to bottom-right)."""

fig, axs = plt.subplots(2, 2, figsize=(9,7))

# (column, panel title) for each subplot, top-left to bottom-right
panels = [('wt', 'Weight (wt) vs. mpg'),
          ('disp', 'Engine displacement (disp) vs. mpg'),
          ('cyl', 'Number of cylinders (cyl) vs. mpg'),
          ('hp', 'Horsepower (hp) vs. mpg')]

for ax, (col, title) in zip(axs.flat, panels):
    # look up the coefficient by column name rather than by hard-coded position
    coef = lm.coef_[X.columns.get_loc(col)]
    ax.scatter(df[col], df['mpg'])
    # the line uses the intercept and this one coefficient only
    ax.plot(df[col], lm.intercept_ + coef*df[col], color='red')
    ax.set_title(title)

fig.tight_layout(pad=3.0)
plt.show()
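
"""A line built from the intercept and a single coefficient implicitly sets every other predictor to zero, so it can sit well away from the points. As a hedged alternative sketch, the hypothetical helper `partial_line` below varies one predictor over its observed range while holding the remaining predictors at their column means, and returns the model's predictions along that grid. For example, `xs, ys = partial_line(lm, X_train, 'wt')` gives values that can be passed to a subplot's `plot` method."""

import numpy as np
import pandas as pd

def partial_line(model, X, feature, n_points=50):
    """Predictions as `feature` varies, with other columns fixed at their means."""
    xs = np.linspace(X[feature].min(), X[feature].max(), n_points)
    # start from a grid in which every row equals the column means
    grid = pd.DataFrame(np.tile(X.mean().to_numpy(), (n_points, 1)),
                        columns=X.columns)
    # overwrite the chosen predictor with the evenly spaced values
    grid[feature] = xs
    return xs, model.predict(grid)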

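"""The 2×2 grid of scatter plots shows how each of the four predictors relates to miles per gallon (mpg). Each panel overlays a red line built from the model's intercept and that predictor's coefficient alone, so the line shows the fitted slope for the feature rather than a best fit through the points in that panel.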

# **Assessing model accuracy**

Let's assess the fit of our multivariate model. For a rudimentary comparison, let's measure model accuracy against a simple linear regression model that uses only disp as a predictor variable for mpg.
"""

# comparison linear model
slr = LinearRegression()

slr.fit(X_train[['disp']], y_train)

from sklearn import metrics
import math

"""Let's calculate the training mean squared error (MSE), the test MSE, and the test root mean squared error (RMSE) for both our simple linear regression (SLR) and multiple linear regression (MLR) models."""

# dictionary of results
results_dict = {'Training MSE':
                    {
                        "SLR": metrics.mean_squared_error(y_train, slr.predict(X_train[['disp']])),
                        "MLR": metrics.mean_squared_error(y_train, lm.predict(X_train))
                    },
                'Test MSE':
                    {
                        "SLR": metrics.mean_squared_error(y_test, slr.predict(X_test[['disp']])),
                        "MLR": metrics.mean_squared_error(y_test, lm.predict(X_test))
                    },
                'Test RMSE':
                    {
                        "SLR": math.sqrt(metrics.mean_squared_error(y_test, slr.predict(X_test[['disp']]))),
                        "MLR": math.sqrt(metrics.mean_squared_error(y_test, lm.predict(X_test)))
                    }
                }

# create DataFrame from dictionary
results_df = pd.DataFrame(data=results_dict)

results_df
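
"""For reference, MSE is the mean of the squared residuals, and RMSE is its square root, which puts the error back in the units of the response (miles per gallon here). A minimal NumPy sketch of both follows, using the hypothetical helper names `mse` and `rmse`; `mse(y_test, lm.predict(X_test))` should agree with `metrics.mean_squared_error` as used above."""

import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared differences between observed and predicted values."""
    residuals = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(residuals ** 2))

def rmse(y_true, y_pred):
    """Square root of the MSE, in the same units as the response."""
    return mse(y_true, y_pred) ** 0.5

# hand-checkable example: residuals are 1, -1 and 2, so MSE = (1 + 1 + 4) / 3 = 2
demo_mse = mse([3.0, 5.0, 7.0], [2.0, 6.0, 5.0])
demo_rmse = rmse([3.0, 5.0, 7.0], [2.0, 6.0, 5.0])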

"""On this split, the multiple linear regression model has lower error than the model that uses only disp to predict mpg, which suggests that a single predictor does not adequately capture the relationship between the dependent variable (mpg) and the independent variables. Because the test set holds only 7 of the 32 cars, the size of the gap depends on the particular split.

By incorporating several predictors at once, the multiple linear regression model can account for their combined influence on mpg, which here improves predictive performance.
"""
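
"""How much this comparison depends on one particular train/test split can be checked with k-fold cross-validation. As a hedged sketch (not part of the original notebook), the hypothetical helper `cv_rmse` below averages the held-out RMSE over shuffled folds using scikit-learn's `cross_val_score`. Calling `cv_rmse(X[['disp']], y)` and `cv_rmse(X, y)` would give an SLR vs. MLR comparison that rests on every car being held out once."""

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

def cv_rmse(X, y, n_splits=5, seed=1):
    """Mean held-out RMSE of a LinearRegression over shuffled k-fold splits."""
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    # scikit-learn reports negated MSE so that larger scores are always better
    neg_mse = cross_val_score(LinearRegression(), X, y,
                              scoring='neg_mean_squared_error', cv=folds)
    return float(np.mean(np.sqrt(-neg_mse)))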