Faisal MH
updated on 30 Sep 2021
Here an effort is made to clean the 1985 Automobile dataset, perform descriptive analytics, and build a predictive model using the Random Forest classifier.
Let's import all the required libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
import seaborn as sns
Let's read the data into Python.
df = pd.read_csv('auto.csv',header=None)
df.head()
Let's check each of the columns; it turns out the dataset comes without column names.
So let's name each column accordingly.
column_names = ['symboling','normalized_losses', 'CarName','fueltype','aspiration','doornumber','carbody','drivewheel','enginelocation',
'wheelbase','carlength','carwidth','carheight','curbweight','enginetype','cylindernumber','enginesize','fuelsystem',
'boreratio','stroke','compressionratio','horsepower','peakrpm','citympg','highwaympg','price']
df.columns = column_names
df.head()
Now let's check each column and find whether there are any NULL values or "rubbish" data to be removed before we proceed with the descriptive analysis.
#check missing values
plt.figure(figsize=(12,4))
sns.heatmap(df.isnull(),cbar=False,cmap='Wistia',yticklabels=False)
plt.title('Missing value in the dataset');
So there is no NULL data in any of the columns.
Let's check for "rubbish" data by inspecting the unique values of each column. If a numeric column contains rubbish, we replace it with the mean when the data is normally distributed, or with the median when it is skewed. If the column is categorical, we replace rubbish values with the mode. We can also simply drop the affected records when the rubbish data accounts for less than 0.5% of the total data. A generic helper implementing this strategy is sketched below.
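As a minimal sketch of this strategy (the helper name impute_rubbish and the skewness threshold of 1 are assumptions, not from the original notebook):
def impute_rubbish(series, rubbish='?'):
    # Replace a rubbish marker with the mean, median, or mode, depending on the column type.
    clean = series[series != rubbish]
    numeric = pd.to_numeric(clean, errors='coerce')
    if numeric.notna().all():
        # numeric column: use the median when the distribution is clearly skewed
        fill = numeric.median() if abs(numeric.skew()) > 1 else numeric.mean()
        return pd.to_numeric(series.replace(rubbish, np.nan)).fillna(fill)
    # categorical column: fall back to the most frequent value
    return series.replace(rubbish, clean.mode()[0])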
Let's check the "normalized_losses" column first; it turns out that about 20% of its values are rubbish.
def removenotnum(list1):
    notnum = []
    for x in list1:
        try:
            float(x)
        except ValueError:
            notnum.append(x)
    return notnum
notnumtable = removenotnum(df['normalized_losses'])
print(notnumtable)
print('Percent of identified rubbish data in Table → {:.3f}%'.format(len(notnumtable) / len(df['normalized_losses'])*100))
So we replace the '?' values with either the mean or the median of the remaining column values, depending upon the distribution.
Before we do, we build an array of the remaining values of the column to inspect its distribution and compute the mean and median.
rubbish=['?']
n_l = df['normalized_losses'][~df['normalized_losses'].isin(rubbish)]
n_l = n_l.astype('int64')
sns.displot(x=n_l, kind="hist",kde=True, bins = 100, aspect = 5.5)
It is found that the data is positively skewed, so we replace the rubbish values with the median.
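To back up the visual impression, we can quantify the skewness directly (an added check; a positive value confirms the right skew):
print('skewness: {:.2f}'.format(n_l.skew()))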
print('median:{}'.format(n_l.median()))
median: 115.0
So we replace each '?' with 115.
df['normalized_losses'].replace({'?':'115'}, inplace=True)
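Note that after this replacement the column still holds strings, so it would later be picked up by the object-dtype loop and one-hot encoded by mistake; casting it back to a numeric type (an added step, not in the original notebook) avoids that:
# cast back to integers so the column is treated as numeric downstream
df['normalized_losses'] = df['normalized_losses'].astype('int64')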
Similarly, the "doornumber" column is categorical; we found rubbish data there and replace it with the mode of the column.
For the "boreratio", "stroke", "horsepower", "price" and "peakrpm" columns, we follow the same procedure as for "normalized_losses" above; a compact loop for this is sketched below.
Now all the rubbish data is replaced with the respective mean, median, or mode. Next we will encode a few categorical columns with OHE (one-hot encoding).
Before encoding, we adjust values so the resulting dummy-column names stay unambiguous: "cylindernumber" and "doornumber" share values such as "two", which may confuse us, so we replace the "two" in the door column with "doortwo".
df['doornumber']=df['doornumber'].map('door{}'.format)
df.head()
We cast the "symboling" and "CarName" columns to type category, since we don't want to perform OHE on them.
df['symboling'] = df.symboling.astype('category')
df['CarName'] = df.CarName.astype('category')
df.dtypes
Now let's perform OHE only on the columns with dtype "object".
First we collect the columns with dtype object.
cat_cols=[]
for col in df.columns:
    if df[col].dtypes == 'object':
        cat_cols.append(col)
cat_cols
Now let's perform OHE on those columns, dropping the first dummy column per feature to avoid the dummy variable trap.
df = pd.get_dummies(data = df, prefix = 'OHE', prefix_sep='_',
columns = cat_cols,
drop_first =True,
dtype='int8')
df.columns
So we have 46 columns in total on which to perform the descriptive analytics. Note that the plots below still reference the original categorical columns (e.g. "fueltype"), so they should be run before the encoding step or on a copy of the pre-encoded frame.
The car name includes both the brand and the model, so we will keep only the brand name. We split the column using the delimiters '-' and ' ', and then fix the misspelled brand names.
df['CarName']=df['CarName'].str.split('-| ').str[0]
replace_values = {'maxda':'mazda','Nissan':'nissan','porcshce':'porsche','toyouta':'toyota','vokswagen':'volkswagen','vw':'volkswagen'}
df=df.replace({'CarName':replace_values})
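A quick sanity check (an added step) confirms the replacements took effect:
print(df['CarName'].value_counts())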
Now let's check how the data is distributed across car brands.
target_counts= df['CarName'].value_counts()
fig, ax = plt.subplots(1, 2, figsize=(15,7))
target_counts_barplot = sns.barplot(y = target_counts.index,x = target_counts.values, ax = ax[0])
target_counts_barplot.set_ylabel('Models of Car Brand')
ax[0].set_facecolor('white')
#colors = ['#8d99ae','#ffe066', '#f77f00','#348aa7','#bce784','#ffcc99', '#f25f5c']
target_counts.plot.pie(autopct="%1.1f%%", ax=ax[1])
Classification of Fuel Type:
target_counts= df['fueltype'].value_counts()
fig, ax = plt.subplots(1, 2, figsize=(15,7))
target_counts_barplot = sns.barplot(x = target_counts.index,y = target_counts.values, ax = ax[0])
target_counts_barplot.set_ylabel('Count')
#colors = ['#8d99ae','#ffe066', '#f77f00','#348aa7','#bce784','#ffcc99', '#f25f5c']
target_counts.plot.pie(autopct="%1.1f%%", ax=ax[1])
fig.suptitle('Fuel Type')
Classification of Aspiration:
target_counts= df['aspiration'].value_counts()
fig, ax = plt.subplots(1, 2, figsize=(15,7))
target_counts_barplot = sns.barplot(x = target_counts.index,y = target_counts.values, ax = ax[0])
target_counts_barplot.set_ylabel('Count')
#colors = ['#8d99ae','#ffe066', '#f77f00','#348aa7','#bce784','#ffcc99', '#f25f5c']
target_counts.plot.pie(autopct="%1.1f%%", ax=ax[1])
fig.suptitle('Aspiration')
Classification of Number of Doors:
target_counts= df['doornumber'].value_counts()
fig, ax = plt.subplots(1, 2, figsize=(15,7))
target_counts_barplot = sns.barplot(x = target_counts.index,y = target_counts.values, ax = ax[0])
target_counts_barplot.set_ylabel('Count')
#colors = ['#8d99ae','#ffe066', '#f77f00','#348aa7','#bce784','#ffcc99', '#f25f5c']
target_counts.plot.pie(autopct="%1.1f%%", ax=ax[1])
fig.suptitle('Number of Doors')
Classification of Body Type:
target_counts= df['carbody'].value_counts()
fig, ax = plt.subplots(1, 2, figsize=(15,7))
target_counts_barplot = sns.barplot(x = target_counts.index,y = target_counts.values, ax = ax[0])
target_counts_barplot.set_ylabel('Count')
#colors = ['#8d99ae','#ffe066', '#f77f00','#348aa7','#bce784','#ffcc99', '#f25f5c']
target_counts.plot.pie(autopct="%1.1f%%", ax=ax[1])
fig.suptitle('Body Type')
Classification of Drive Wheel:
target_counts= df['drivewheel'].value_counts()
fig, ax = plt.subplots(1, 2, figsize=(15,7))
target_counts_barplot = sns.barplot(x = target_counts.index,y = target_counts.values, ax = ax[0])
target_counts_barplot.set_ylabel('Count')
#colors = ['#8d99ae','#ffe066', '#f77f00','#348aa7','#bce784','#ffcc99', '#f25f5c']
target_counts.plot.pie(autopct="%1.1f%%", ax=ax[1])
fig.suptitle('Drive Wheel')
Classification of Engine Location:
target_counts= df['enginelocation'].value_counts()
fig, ax = plt.subplots(1, 2, figsize=(15,7))
target_counts_barplot = sns.barplot(x = target_counts.index,y = target_counts.values, ax = ax[0])
target_counts_barplot.set_ylabel('Count')
#colors = ['#8d99ae','#ffe066', '#f77f00','#348aa7','#bce784','#ffcc99', '#f25f5c']
target_counts.plot.pie(autopct="%1.1f%%", ax=ax[1])
fig.suptitle('Engine Location')
Classification of Engine Type:
target_counts= df['enginetype'].value_counts()
fig, ax = plt.subplots(1, 2, figsize=(15,7))
target_counts_barplot = sns.barplot(x = target_counts.index,y = target_counts.values, ax = ax[0])
target_counts_barplot.set_ylabel('Count')
#colors = ['#8d99ae','#ffe066', '#f77f00','#348aa7','#bce784','#ffcc99', '#f25f5c']
target_counts.plot.pie(autopct="%1.1f%%", ax=ax[1])
fig.suptitle('Engine Type')
Classification of Number of Cylinders:
target_counts= df['cylindernumber'].value_counts()
fig, ax = plt.subplots(1, 2, figsize=(15,7))
target_counts_barplot = sns.barplot(x = target_counts.index,y = target_counts.values, ax = ax[0])
target_counts_barplot.set_ylabel('Count')
#colors = ['#8d99ae','#ffe066', '#f77f00','#348aa7','#bce784','#ffcc99', '#f25f5c']
target_counts.plot.pie(autopct="%1.1f%%", pctdistance=0.6, ax=ax[1])
fig.suptitle('Number of Cylinders')
Classification of Fuel System:
target_counts= df['fuelsystem'].value_counts()
fig, ax = plt.subplots(1, 2, figsize=(15,7))
target_counts_barplot = sns.barplot(x = target_counts.index,y = target_counts.values, ax = ax[0])
target_counts_barplot.set_ylabel('Count')
#colors = ['#8d99ae','#ffe066', '#f77f00','#348aa7','#bce784','#ffcc99', '#f25f5c']
target_counts.plot.pie(autopct="%1.1f%%",pctdistance=0.8, ax=ax[1])
fig.suptitle('Fuel System')
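The nine count-plot blocks above are identical except for the column and the title, so they could be collapsed into a small helper (a refactoring sketch, not part of the original notebook):
def plot_counts(df, col, title):
    # bar chart and pie chart of the value counts for one categorical column
    counts = df[col].value_counts()
    fig, ax = plt.subplots(1, 2, figsize=(15, 7))
    sns.barplot(x=counts.index, y=counts.values, ax=ax[0]).set_ylabel('Count')
    counts.plot.pie(autopct="%1.1f%%", ax=ax[1])
    fig.suptitle(title)

plot_counts(df, 'fuelsystem', 'Fuel System')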
Let us look at the distribution plot of compression ratio.
sns.displot(df, x="compressionratio", hue='fueltype',kde=True)
plt.gcf().set_size_inches(11.7, 8.27)
Looking at the data, the compression ratio is cleanly separated by fuel type, so we can drop either one of the two columns and still retain the information for our analysis.
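We can verify this separation numerically (an added check, assuming df still carries the raw "fueltype" column as in the KDE plot above):
print(df.groupby('fueltype')['compressionratio'].describe())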
Only the numerical columns are considered here; for each pair we overlay a linear regression and a third-order polynomial fit to look at the relationships in the data.
The green line represents the linear regression and the red line represents the polynomial fit of order 3, drawn with a 90% confidence interval.
g=sns.PairGrid(data = df,vars = ['wheelbase','horsepower','carlength','carwidth','price'])
g.map_upper(sns.regplot,ci=None, scatter_kws={'s':15}, line_kws={"color": "green"})
g.map_upper(sns.regplot,ci=90,line_kws={"color": "red"},order=3,scatter=False)
g.map_diag(sns.histplot)
g.map_lower(sns.scatterplot)
g=sns.PairGrid(data = df,vars = ['carheight','curbweight','enginesize','boreratio','price'])
g.map_upper(sns.regplot,ci=None, scatter_kws={'s':15}, line_kws={"color": "green"})
g.map_upper(sns.regplot,ci=90, line_kws={"color": "red"},order=3,scatter=False)
g.map_diag(sns.histplot)
g.map_lower(sns.scatterplot)
g=sns.PairGrid(data = df,vars = ['stroke','peakrpm','citympg','highwaympg','price'])
g.map_upper(sns.regplot,ci=None, scatter_kws={'s':15}, line_kws={"color": "green"})
g.map_upper(sns.regplot,ci=90, line_kws={"color": "red"},order=3,scatter=False)
g.map_diag(sns.histplot)
g.map_lower(sns.scatterplot)
Now let's start with the K-means algorithm.
Before we go further, let us perform OHE (one-hot encoding) on the categorical data and scaling on the numerical data.
First, let's drop the compression ratio column, since its information can be recovered from the fuel type column.
df_encode = pd.read_csv('CarPrice_Assignment.csv')
df_encode=df_encode.drop(columns=['car_ID'])
df_encode['CarName']=df_encode['CarName'].str.split('-| ').str[0]
replace_values = {'maxda':'mazda','Nissan':'nissan','porcshce':'porsche','toyouta':'toyota','vokswagen':'volkswagen','vw':'volkswagen'}
df_encode=df_encode.replace({'CarName':replace_values})
df_encode=df_encode.drop(columns=['compressionratio'])
df_encode['symboling'] = df_encode.symboling.astype('category')
df_encode['CarName'] = df_encode.CarName.astype('category')
df_encode.dtypes
from sklearn.preprocessing import MinMaxScaler
sc = MinMaxScaler(feature_range =(0,1))
for col in df_encode.columns:
    if df_encode[col].dtypes == 'float64' or df_encode[col].dtypes == 'int64':
        df_encode[col] = sc.fit_transform(df_encode[[col]])
df_encode.head(10)
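Since MinMaxScaler scales each column independently anyway, the loop above can also be written as a single call over all numeric columns (an equivalent, slightly more idiomatic sketch):
num_cols = df_encode.select_dtypes(include=['float64', 'int64']).columns
df_encode[num_cols] = sc.fit_transform(df_encode[num_cols])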
Moreover, "doornumber" holds the same categorical values ("two", "four") as "cylindernumber", so let's prefix the door values to avoid identically named columns after OHE.
df_encode['doornumber']=df_encode['doornumber'].map('door{}'.format)
df_encode.head()
Now collecting all the column names with categorical data.
cat_cols=[]
for col in df_encode.columns:
    if df_encode[col].dtypes == 'object':
        cat_cols.append(col)
cat_cols
Encoding the categorical data and dropping one dummy column per feature to avoid the dummy variable trap.
df_encode = pd.get_dummies(data = df_encode, prefix = 'OHE', prefix_sep='_',
columns = cat_cols,
drop_first =True,
dtype='int8')
df_encode.head(10)
Now we will collect the data required for applying the clusters.
X=df_encode.drop(columns=['price','CarName'])
We drop CarName and price, since we want to cluster on the remaining features and then interpret the clusters in terms of these variables.
Let's identify the optimum K value for K-means using the elbow method.
from sklearn.cluster import KMeans

k_rng = range(1, 10)
sse1 = []
for k1 in k_rng:
    km1 = KMeans(n_clusters=k1)
    km1.fit(X)
    sse1.append(km1.inertia_)
plt.plot(k_rng,sse1,marker='*')
plt.ylabel('SSE')
plt.xticks(k_rng)
plt.xlabel('K_values')
Looking at the elbow plot, we take 4 as the optimum K value for our analysis. We will name the clusters 'Economy', 'Budget', 'Premium' and 'Luxury'; which name belongs to which cluster number is decided by looking at the prices within each cluster.
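As an added cross-check (not in the original notebook), the silhouette score can corroborate the elbow choice; higher is better:
from sklearn.metrics import silhouette_score

for k in range(2, 8):
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(X)
    print('k={}: silhouette={:.3f}'.format(k, silhouette_score(X, labels)))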
Performing our clustering
km = KMeans(n_clusters = 4)
km.fit(X)
y_predict = km.predict(X)
df_encode['cluster_Number']=y_predict
We have added the cluster number to our data frame.
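To decide which name belongs to which cluster number, we can inspect the average (scaled) price per cluster; MinMax scaling preserves the ordering (this inspection step is an addition):
print(df_encode.groupby('cluster_Number')['price'].mean().sort_values())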
Using the dictionary shown below, we map these cluster numbers to a new column.
cluster_category = {0:'Budget',1:'Luxury',2:'Premium',3:'Economy'}
df_encode['cluster'] = df_encode['cluster_Number'].map(cluster_category)
Now let us look at the results:
sns.scatterplot(data=df_encode, x="symboling", y="price", hue="cluster", palette=['green','orange','brown','dodgerblue'])
plt.legend(bbox_to_anchor=(1.25, 1),
borderaxespad=0)
Looking at the symboling data, all the symbol-'3' cars are Premium cars. High-end cars with high prices belong to the Luxury cluster, while the low-cost ones are classified into Premium and Budget.
You can look into further categorical columns; they give similar information across the clusters.
The complete code for this clustering exercise is consolidated below.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
df_encode = pd.read_csv('CarPrice_Assignment.csv')
df_encode=df_encode.drop(columns=['car_ID'])
df_encode['CarName']=df_encode['CarName'].str.split('-| ').str[0]
replace_values = {'maxda':'mazda','Nissan':'nissan','porcshce':'porsche','toyouta':'toyota','vokswagen':'volkswagen','vw':'volkswagen'}
df_encode=df_encode.replace({'CarName':replace_values})
df_encode=df_encode.drop(columns=['compressionratio'])
df_encode['symboling'] = df_encode.symboling.astype('category')
df_encode['CarName'] = df_encode.CarName.astype('category')
df_encode.dtypes
from sklearn.preprocessing import MinMaxScaler
sc = MinMaxScaler(feature_range =(0,1))
for col in df_encode.columns:
    if df_encode[col].dtypes == 'float64' or df_encode[col].dtypes == 'int64':
        df_encode[col] = sc.fit_transform(df_encode[[col]])
df_encode.head(10)
df_encode['doornumber']=df_encode['doornumber'].map('door{}'.format)
df_encode.head()
cat_cols=[]
for col in df_encode.columns:
    if df_encode[col].dtypes == 'object':
        cat_cols.append(col)
cat_cols
df_encode = pd.get_dummies(data = df_encode, prefix = 'OHE', prefix_sep='_',
columns = cat_cols,
drop_first =True,
dtype='int8')
df_encode.head(10)
X=df_encode.drop(columns=['price','CarName'])
Obtaining the optimum number of clusters.
k_rng = range(1,10)
sse1 = []
for k1 in k_rng:
    km1 = KMeans(n_clusters=k1)
    km1.fit(X)
    sse1.append(km1.inertia_)
plt.plot(k_rng,sse1,marker='*')
plt.ylabel('SSE')
plt.xticks(k_rng)
plt.xlabel('K_values')
I am selecting 4 clusters, as the SSE barely changes beyond 4 clusters.
Now fitting K-means with 4 clusters and mapping the cluster numbers to categorical labels.
km = KMeans(n_clusters=4)
km.fit(X)
df_encode['cluster_Number'] = km.predict(X)
cluster_category = {0:'Budget',1:'Luxury',2:'Premium',3:'Economy'}
df_encode['cluster'] = df_encode['cluster_Number'].map(cluster_category)
Now training the random forest classifier on the data.
df = df_encode.copy()
X = df.drop(['cluster_Number', 'cluster', 'CarName'], axis=1)
y = df['cluster']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8, random_state = 0)
# Let's perform the random forest classification to identify the appropriate cluster.
# Random Forest
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
RFC= RandomForestClassifier()
RFC.fit(X_train, y_train)
y_pred_RFC = RFC.predict(X_test)
# Summary of the predictions made by the classifier
print(classification_report(y_test, y_pred_RFC))
print(confusion_matrix(y_test, y_pred_RFC))
# Accuracy score
from sklearn.metrics import accuracy_score
print('accuracy is', accuracy_score(y_test, y_pred_RFC))
cm = confusion_matrix(y_test, y_pred_RFC)
plt.figure(figsize=(10,7))
categories = np.unique(y)
df_cm = pd.DataFrame(cm, index=categories, columns=categories)
sns.heatmap(df_cm,annot=True,cmap='Reds')
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()
The accuracy of the model is found to be about 95%.
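As an added inspection (not in the original writeup), the forest's feature importances show which columns drive the cluster assignment:
importances = pd.Series(RFC.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances.head(10))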
# score the model on the train and test sets
print('Train score: {}\n'.format(RFC.score(X_train, y_train)))
print('Test score: {}\n'.format(RFC.score(X_test, y_test)))
The model seems to overfit slightly. To check this further, we can validate it with K-fold cross-validation, as sketched below.
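A minimal sketch of that check (assuming 5 folds; cross_val_score refits a fresh forest on each split):
from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
print('CV accuracy: {:.3f} +/- {:.3f}'.format(cv_scores.mean(), cv_scores.std()))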