Faisal MH
updated on 30 Sep 2021
Here an effort is made to clean the 1985 Automobile dataset, perform descriptive analytics, and build a predictive model using the Random Forest classifier.
Let's import all the required libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
import seaborn as sns
Let's read the data into Python.
df = pd.read_csv('auto.csv',header=None)
df.head()
Let's check each of the columns; it turns out the dataset comes without column names.
So let's name each column accordingly.
column_names = ['symboling','normalized_losses', 'CarName','fueltype','aspiration','doornumber','carbody','drivewheel','enginelocation',
'wheelbase','carlength','carwidth','carheight','curbweight','enginetype','cylindernumber','enginesize','fuelsystem',
'boreratio','stroke','compressionratio','horsepower','peakrpm','citympg','highwaympg','price']
df.columns = column_names
df.head()
Now let's check each column and find whether there are any NULL values or "rubbish" data to be removed before we proceed with the descriptive analysis.
#check missing values
plt.figure(figsize=(12,4))
sns.heatmap(df.isnull(),cbar=False,cmap='Wistia',yticklabels=False)
plt.title('Missing value in the dataset');
So there is no NULL data in any of the columns.
Let's check for "rubbish" data by inspecting the unique values of each column. If a numeric column contains rubbish, we replace it with the mean when the data is normally distributed, or with the median when it is skewed. If the column is categorical, we replace rubbish values with the mode. We can also simply drop the affected records when the rubbish data accounts for less than 0.5% of the total data. A generic helper implementing this strategy is sketched below.
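As a minimal sketch of this strategy (the helper name impute_rubbish and the skewness threshold of 1 are assumptions, not from the original notebook):
def impute_rubbish(series, rubbish='?'):
    # Replace a rubbish marker with the mean, median, or mode, depending on the column type.
    clean = series[series != rubbish]
    numeric = pd.to_numeric(clean, errors='coerce')
    if numeric.notna().all():
        # numeric column: use the median when the distribution is clearly skewed
        fill = numeric.median() if abs(numeric.skew()) > 1 else numeric.mean()
        return pd.to_numeric(series.replace(rubbish, np.nan)).fillna(fill)
    # categorical column: fall back to the most frequent value
    return series.replace(rubbish, clean.mode()[0])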
Let's check the "normalized_losses" column first; it turns out that about 20% of its values are rubbish.
def removenotnum(list1):
    notnum = []
    for x in list1:
        try:
            float(x)
        except ValueError:
            notnum.append(x)
    return notnum
notnumtable = removenotnum(df['normalized_losses'])
print(notnumtable)
print('Percent of identified rubbish data in Table → {:.3f}%'.format(len(notnumtable) / len(df['normalized_losses'])*100))
So we replace the '?' values with either the mean or the median of the remaining column values, depending upon the distribution.
Before we do, we build an array of the remaining values of the column to inspect its distribution and compute the mean and median.
rubbish=['?']
n_l = df['normalized_losses'][~df['normalized_losses'].isin(rubbish)]
n_l = n_l.astype('int64')
sns.displot(x=n_l, kind="hist",kde=True, bins = 100, aspect = 5.5)
It is found that the data is positively skewed, so we replace the rubbish values with the median.
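To back up the visual impression, we can quantify the skewness directly (an added check; a positive value confirms the right skew):
print('skewness: {:.2f}'.format(n_l.skew()))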
print('median:{}'.format(n_l.median()))
median: 115.0
So we replace each '?' with 115.
df['normalized_losses'].replace({'?':'115'}, inplace=True)
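Note that after this replacement the column still holds strings, so it would later be picked up by the object-dtype loop and one-hot encoded by mistake; casting it back to a numeric type (an added step, not in the original notebook) avoids that:
# cast back to integers so the column is treated as numeric downstream
df['normalized_losses'] = df['normalized_losses'].astype('int64')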
Similarly, the "doornumber" column is categorical; we found rubbish data there and replace it with the mode of the column.
For the "boreratio", "stroke", "horsepower", "price" and "peakrpm" columns, we follow the same procedure as for "normalized_losses" above; a compact loop for this is sketched below.
Now all the rubbish data is replaced with the respective mean, median, or mode. Next we will encode a few categorical columns with OHE (one-hot encoding).
Before encoding, we adjust values so the resulting dummy-column names stay unambiguous: "cylindernumber" and "doornumber" share values such as "two", which may confuse us, so we replace the "two" in the door column with "doortwo".
df['doornumber']=df['doornumber'].map('door{}'.format)
df.head()
We cast the "symboling" and "CarName" columns to type category, since we don't want to perform OHE on them.
df['symboling'] = df.symboling.astype('category')
df['CarName'] = df.CarName.astype('category')
df.dtypes
Now let's perform OHE only on the columns with dtype "object".
First we collect the columns with dtype object.
cat_cols=[]
for col in df.columns:
    if df[col].dtypes == 'object':
        cat_cols.append(col)
cat_cols
Now let's perform OHE on those columns, dropping the first dummy column per feature to avoid the dummy variable trap.
df = pd.get_dummies(data = df, prefix = 'OHE', prefix_sep='_',
columns = cat_cols,
drop_first =True,
dtype='int8')
df.columns
So we have 46 columns in total on which to perform the descriptive analytics. Note that the plots below still reference the original categorical columns (e.g. "fueltype"), so they should be run before the encoding step or on a copy of the pre-encoded frame.
The car name includes both the brand and the model, so we will keep only the brand name. We split the column using the delimiters '-' and ' ', and then fix the misspelled brand names.
df['CarName']=df['CarName'].str.split('-| ').str[0]
replace_values = {'maxda':'mazda','Nissan':'nissan','porcshce':'porsche','toyouta':'toyota','vokswagen':'volkswagen','vw':'volkswagen'}
df=df.replace({'CarName':replace_values})
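A quick sanity check (an added step) confirms the replacements took effect:
print(df['CarName'].value_counts())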
Now let's check how the data is distributed across car brands.
target_counts= df['CarName'].value_counts()
fig, ax = plt.subplots(1, 2, figsize=(15,7))
target_counts_barplot = sns.barplot(y = target_counts.index,x = target_counts.values, ax = ax[0])
target_counts_barplot.set_ylabel('Models of Car Brand')
ax[0].set_facecolor('white')
#colors = ['#8d99ae','#ffe066', '#f77f00','#348aa7','#bce784','#ffcc99', '#f25f5c']
target_counts.plot.pie(autopct="%1.1f%%", ax=ax[1])
Classification of Fuel Type:
target_counts= df['fueltype'].value_counts()
fig, ax = plt.subplots(1, 2, figsize=(15,7))
target_counts_barplot = sns.barplot(x = target_counts.index,y = target_counts.values, ax = ax[0])
target_counts_barplot.set_ylabel('Count')
#colors = ['#8d99ae','#ffe066', '#f77f00','#348aa7','#bce784','#ffcc99', '#f25f5c']
target_counts.plot.pie(autopct="%1.1f%%", ax=ax[1])
fig.suptitle('Fuel Type')
Classification of Aspiration:
target_counts= df['aspiration'].value_counts()
fig, ax = plt.subplots(1, 2, figsize=(15,7))
target_counts_barplot = sns.barplot(x = target_counts.index,y = target_counts.values, ax = ax[0])
target_counts_barplot.set_ylabel('Count')
#colors = ['#8d99ae','#ffe066', '#f77f00','#348aa7','#bce784','#ffcc99', '#f25f5c']
target_counts.plot.pie(autopct="%1.1f%%", ax=ax[1])
fig.suptitle('Aspiration')
Classification of Number of Doors:
target_counts= df['doornumber'].value_counts()
fig, ax = plt.subplots(1, 2, figsize=(15,7))
target_counts_barplot = sns.barplot(x = target_counts.index,y = target_counts.values, ax = ax[0])
target_counts_barplot.set_ylabel('Count')
#colors = ['#8d99ae','#ffe066', '#f77f00','#348aa7','#bce784','#ffcc99', '#f25f5c']
target_counts.plot.pie(autopct="%1.1f%%", ax=ax[1])
fig.suptitle('Number of Doors')
Classification of Body Type:
target_counts= df['carbody'].value_counts()
fig, ax = plt.subplots(1, 2, figsize=(15,7))
target_counts_barplot = sns.barplot(x = target_counts.index,y = target_counts.values, ax = ax[0])
target_counts_barplot.set_ylabel('Count')
#colors = ['#8d99ae','#ffe066', '#f77f00','#348aa7','#bce784','#ffcc99', '#f25f5c']
target_counts.plot.pie(autopct="%1.1f%%", ax=ax[1])
fig.suptitle('Body Type')
Classification of Drive Wheel:
target_counts= df['drivewheel'].value_counts()
fig, ax = plt.subplots(1, 2, figsize=(15,7))
target_counts_barplot = sns.barplot(x = target_counts.index,y = target_counts.values, ax = ax[0])
target_counts_barplot.set_ylabel('Count')
#colors = ['#8d99ae','#ffe066', '#f77f00','#348aa7','#bce784','#ffcc99', '#f25f5c']
target_counts.plot.pie(autopct="%1.1f%%", ax=ax[1])
fig.suptitle('Drive Wheel')
Classification of Engine Location:
target_counts= df['enginelocation'].value_counts()
fig, ax = plt.subplots(1, 2, figsize=(15,7))
target_counts_barplot = sns.barplot(x = target_counts.index,y = target_counts.values, ax = ax[0])
target_counts_barplot.set_ylabel('Count')
#colors = ['#8d99ae','#ffe066', '#f77f00','#348aa7','#bce784','#ffcc99', '#f25f5c']
target_counts.plot.pie(autopct="%1.1f%%", ax=ax[1])
fig.suptitle('Engine Location')
Classification of Engine Type:
target_counts= df['enginetype'].value_counts()
fig, ax = plt.subplots(1, 2, figsize=(15,7))
target_counts_barplot = sns.barplot(x = target_counts.index,y = target_counts.values, ax = ax[0])
target_counts_barplot.set_ylabel('Count')
#colors = ['#8d99ae','#ffe066', '#f77f00','#348aa7','#bce784','#ffcc99', '#f25f5c']
target_counts.plot.pie(autopct="%1.1f%%", ax=ax[1])
fig.suptitle('Engine Type')
Classification of Number of Cylinders:
target_counts= df['cylindernumber'].value_counts()
fig, ax = plt.subplots(1, 2, figsize=(15,7))
target_counts_barplot = sns.barplot(x = target_counts.index,y = target_counts.values, ax = ax[0])
target_counts_barplot.set_ylabel('Count')
#colors = ['#8d99ae','#ffe066', '#f77f00','#348aa7','#bce784','#ffcc99', '#f25f5c']
target_counts.plot.pie(autopct="%1.1f%%", pctdistance=0.6, ax=ax[1])
fig.suptitle('Number of Cylinders')
Classification of Fuel System:
target_counts= df['fuelsystem'].value_counts()
fig, ax = plt.subplots(1, 2, figsize=(15,7))
target_counts_barplot = sns.barplot(x = target_counts.index,y = target_counts.values, ax = ax[0])
target_counts_barplot.set_ylabel('Count')
#colors = ['#8d99ae','#ffe066', '#f77f00','#348aa7','#bce784','#ffcc99', '#f25f5c']
target_counts.plot.pie(autopct="%1.1f%%",pctdistance=0.8, ax=ax[1])
fig.suptitle('Fuel System')
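The nine count-plot blocks above are identical except for the column and the title, so they could be collapsed into a small helper (a refactoring sketch, not part of the original notebook):
def plot_counts(df, col, title):
    # bar chart and pie chart of the value counts for one categorical column
    counts = df[col].value_counts()
    fig, ax = plt.subplots(1, 2, figsize=(15, 7))
    sns.barplot(x=counts.index, y=counts.values, ax=ax[0]).set_ylabel('Count')
    counts.plot.pie(autopct="%1.1f%%", ax=ax[1])
    fig.suptitle(title)

plot_counts(df, 'fuelsystem', 'Fuel System')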
Let us look at the distribution plot of compression ratio.
sns.displot(df, x="compressionratio", hue='fueltype',kde=True)
plt.gcf().set_size_inches(11.7, 8.27)
Looking at the data, the compression ratio is cleanly separated by fuel type, so we can drop either one of the two columns and still retain the information for our analysis.
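We can verify this separation numerically (an added check, assuming df still carries the raw "fueltype" column as in the KDE plot above):
print(df.groupby('fueltype')['compressionratio'].describe())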
Only the numerical columns are considered here; for each pair we overlay a linear regression and a third-order polynomial fit to look at the relationships in the data.
The green line represents the linear regression and the red line represents the polynomial fit of order 3, drawn with a 90% confidence interval.
g=sns.PairGrid(data = df,vars = ['wheelbase','horsepower','carlength','carwidth','price'])
g.map_upper(sns.regplot,ci=None, scatter_kws={'s':15}, line_kws={"color": "green"})
g.map_upper(sns.regplot,ci=90,line_kws={"color": "red"},order=3,scatter=False)
g.map_diag(sns.histplot)
g.map_lower(sns.scatterplot)
g=sns.PairGrid(data = df,vars = ['carheight','curbweight','enginesize','boreratio','price'])
g.map_upper(sns.regplot,ci=None, scatter_kws={'s':15}, line_kws={"color": "green"})
g.map_upper(sns.regplot,ci=90, line_kws={"color": "red"},order=3,scatter=False)
g.map_diag(sns.histplot)
g.map_lower(sns.scatterplot)
g=sns.PairGrid(data = df,vars = ['stroke','peakrpm','citympg','highwaympg','price'])
g.map_upper(sns.regplot,ci=None, scatter_kws={'s':15}, line_kws={"color": "green"})
g.map_upper(sns.regplot,ci=90, line_kws={"color": "red"},order=3,scatter=False)
g.map_diag(sns.histplot)
g.map_lower(sns.scatterplot)
Now let's start with the K-means algorithm.
Before we go further, let us perform OHE (one-hot encoding) on the categorical data and scaling on the numerical data.
First, let's drop the compression ratio column, since its information can be recovered from the fuel type column.
df_encode = pd.read_csv('CarPrice_Assignment.csv')
df_encode=df_encode.drop(columns=['car_ID'])
df_encode['CarName']=df_encode['CarName'].str.split('-| ').str[0]
replace_values = {'maxda':'mazda','Nissan':'nissan','porcshce':'porsche','toyouta':'toyota','vokswagen':'volkswagen','vw':'volkswagen'}
df_encode=df_encode.replace({'CarName':replace_values})
df_encode=df_encode.drop(columns=['compressionratio'])
df_encode['symboling'] = df_encode.symboling.astype('category')
df_encode['CarName'] = df_encode.CarName.astype('category')
df_encode.dtypes
from sklearn.preprocessing import MinMaxScaler
sc = MinMaxScaler(feature_range =(0,1))
for col in df_encode.columns:
    if df_encode[col].dtypes == 'float64' or df_encode[col].dtypes == 'int64':
        df_encode[col] = sc.fit_transform(df_encode[[col]])
df_encode.head(10)
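Since MinMaxScaler scales each column independently anyway, the loop above can also be written as a single call over all numeric columns (an equivalent, slightly more idiomatic sketch):
num_cols = df_encode.select_dtypes(include=['float64', 'int64']).columns
df_encode[num_cols] = sc.fit_transform(df_encode[num_cols])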
Moreover, "doornumber" holds the same categorical values ("two", "four") as "cylindernumber", so let's prefix the door values to avoid identically named columns after OHE.
df_encode['doornumber']=df_encode['doornumber'].map('door{}'.format)
df_encode.head()
Now collecting all the column names with categorical data.
cat_cols=[]
for col in df_encode.columns:
    if df_encode[col].dtypes == 'object':
        cat_cols.append(col)
cat_cols
Encoding the categorical data and dropping one dummy column per feature to avoid the dummy variable trap.
df_encode = pd.get_dummies(data = df_encode, prefix = 'OHE', prefix_sep='_',
columns = cat_cols,
drop_first =True,
dtype='int8')
df_encode.head(10)
Now we will collect the data required for applying the clusters.
X=df_encode.drop(columns=['price','CarName'])
We drop CarName and price, since we want to cluster on the remaining features and then interpret the clusters in terms of these variables.
Let's identify the optimum K value for K-means using the elbow method.
from sklearn.cluster import KMeans

k_rng = range(1, 10)
sse1 = []
for k1 in k_rng:
    km1 = KMeans(n_clusters=k1)
    km1.fit(X)
    sse1.append(km1.inertia_)
plt.plot(k_rng,sse1,marker='*')
plt.ylabel('SSE')
plt.xticks(k_rng)
plt.xlabel('K_values')
Looking at the elbow plot, we take 4 as the optimum K value for our analysis. We will name the clusters 'Economy', 'Budget', 'Premium' and 'Luxury'; which name belongs to which cluster number is decided by looking at the prices within each cluster.
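As an added cross-check (not in the original notebook), the silhouette score can corroborate the elbow choice; higher is better:
from sklearn.metrics import silhouette_score

for k in range(2, 8):
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(X)
    print('k={}: silhouette={:.3f}'.format(k, silhouette_score(X, labels)))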
Performing our clustering
km = KMeans(n_clusters = 4)
km.fit(X)
y_predict = km.predict(X)
df_encode['cluster_Number']=y_predict
We have added the cluster number to our data frame.
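To decide which name belongs to which cluster number, we can inspect the average (scaled) price per cluster; MinMax scaling preserves the ordering (this inspection step is an addition):
print(df_encode.groupby('cluster_Number')['price'].mean().sort_values())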
Using the dictionary shown below, we map these cluster numbers to a new column.
cluster_category = {0:'Budget',1:'Luxury',2:'Premium',3:'Economy'}
df_encode['cluster'] = df_encode['cluster_Number'].map(cluster_category)
Now let us look at the results:
sns.scatterplot(data=df_encode, x="symboling", y="price", hue="cluster", palette=['green','orange','brown','dodgerblue'])
plt.legend(bbox_to_anchor=(1.25, 1),
borderaxespad=0)
Looking at the symboling data, all the symbol-'3' cars are Premium cars. High-end cars with high prices belong to the Luxury cluster, while the low-cost ones are classified into Premium and Budget.
You can look into further categorical columns; they give similar information across the clusters.
The complete code for this clustering exercise is consolidated below.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
df_encode = pd.read_csv('CarPrice_Assignment.csv')
df_encode=df_encode.drop(columns=['car_ID'])
df_encode['CarName']=df_encode['CarName'].str.split('-| ').str[0]
replace_values = {'maxda':'mazda','Nissan':'nissan','porcshce':'porsche','toyouta':'toyota','vokswagen':'volkswagen','vw':'volkswagen'}
df_encode=df_encode.replace({'CarName':replace_values})
df_encode=df_encode.drop(columns=['compressionratio'])
df_encode['symboling'] = df_encode.symboling.astype('category')
df_encode['CarName'] = df_encode.CarName.astype('category')
df_encode.dtypes
from sklearn.preprocessing import MinMaxScaler
sc = MinMaxScaler(feature_range =(0,1))
for col in df_encode.columns:
    if df_encode[col].dtypes == 'float64' or df_encode[col].dtypes == 'int64':
        df_encode[col] = sc.fit_transform(df_encode[[col]])
df_encode.head(10)
df_encode['doornumber']=df_encode['doornumber'].map('door{}'.format)
df_encode.head()
cat_cols=[]
for col in df_encode.columns:
    if df_encode[col].dtypes == 'object':
        cat_cols.append(col)
cat_cols
df_encode = pd.get_dummies(data = df_encode, prefix = 'OHE', prefix_sep='_',
columns = cat_cols,
drop_first =True,
dtype='int8')
df_encode.head(10)
X=df_encode.drop(columns=['price','CarName'])
Obtaining the optimum number of clusters.
k_rng = range(1,10)
sse1 = []
for k1 in k_rng:
    km1 = KMeans(n_clusters=k1)
    km1.fit(X)
    sse1.append(km1.inertia_)
plt.plot(k_rng,sse1,marker='*')
plt.ylabel('SSE')
plt.xticks(k_rng)
plt.xlabel('K_values')
I am selecting 4 clusters, as the SSE barely changes beyond 4 clusters.
Now fitting K-means with 4 clusters and mapping the cluster numbers to categorical labels.
km = KMeans(n_clusters=4)
km.fit(X)
df_encode['cluster_Number'] = km.predict(X)
cluster_category = {0:'Budget',1:'Luxury',2:'Premium',3:'Economy'}
df_encode['cluster'] = df_encode['cluster_Number'].map(cluster_category)
Now training the random forest classifier on the data.
df = df_encode.copy()
X = df.drop(['cluster_Number', 'cluster', 'CarName'], axis=1)
y = df['cluster']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8, random_state = 0)
# Let's perform the random forest classification to identify the appropriate cluster.
# Random Forest
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
RFC= RandomForestClassifier()
RFC.fit(X_train, y_train)
y_pred_RFC = RFC.predict(X_test)
# Summary of the predictions made by the classifier
print(classification_report(y_test, y_pred_RFC))
print(confusion_matrix(y_test, y_pred_RFC))
# Accuracy score
from sklearn.metrics import accuracy_score
print('accuracy is', accuracy_score(y_test, y_pred_RFC))
cm = confusion_matrix(y_test, y_pred_RFC)
plt.figure(figsize=(10,7))
categories = np.unique(y)
df_cm = pd.DataFrame(cm, index=categories, columns=categories)
sns.heatmap(df_cm,annot=True,cmap='Reds')
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()
The accuracy of the model is found to be about 95%.
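As an added inspection (not in the original writeup), the forest's feature importances show which columns drive the cluster assignment:
importances = pd.Series(RFC.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances.head(10))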
# score the model on the train and test sets
print('Train score: {}\n'.format(RFC.score(X_train, y_train)))
print('Test score: {}\n'.format(RFC.score(X_test, y_test)))
The model seems to overfit slightly. To check this further, we can validate it with K-fold cross-validation, as sketched below.
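A minimal sketch of that check (assuming 5 folds; cross_val_score refits a fresh forest on each split):
from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
print('CV accuracy: {:.3f} +/- {:.3f}'.format(cv_scores.mean(), cv_scores.std()))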