import pandas as pd
import numpy as np
import plotly.express as px
import seaborn as sns
from sklearn import tree
import matplotlib.pyplot as plt # for graphs
from lets_plot import *
LetsPlot.setup_html()
Course DS 250
Gabriel Guerrero
The Clean Air Act of 1970 was the beginning of the end for the use of asbestos in home building. By 1976, the U.S. Environmental Protection Agency (EPA) was given authority to restrict the use of asbestos in paint. Homes built during and before this period are known to have materials containing asbestos. You can read more about this ban.
The state of Colorado is missing the year built for a large portion of its residential dwelling data, and they would like you to build a predictive model that can classify whether a house was built before 1980.
Colorado gave you home sales data for the city of Denver from 2013 on which to train your model. They said all the column names should be descriptive enough for your modeling and that they would like you to use the latest machine learning methods.
https://github.com/byuidatascience/data4dwellings/blob/master/data.md
The client is a state agency in Colorado responsible for the health and safety of its residents. As described above, much of their residential dwelling data is missing the year built, which is why they need a model that can classify whether a house was built before 1980.
This project aims to predict whether a house was built before 1980 based on features such as the number of bedrooms, bathrooms, square footage, and more. The dataset used for this project contains information on homes sold in Denver, Colorado.
# Read the Denver dwellings data (machine-learning version with one-hot encoded categories) from GitHub
df = pd.read_csv("https://github.com/byuidatascience/data4dwellings/raw/master/data-raw/dwellings_ml/dwellings_ml.csv")
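The chart and model code below uses df_ready, X_train, X_test, y_train, and y_test, which are never defined in the code shown. A minimal sketch of how they might be prepared, assuming the parcel identifier column is dropped, yrbuilt is excluded from the model features (before1980 is derived directly from it), and a 70/30 train/test split, which matches the 16,039 / 6,874 row counts reported later:

from sklearn.model_selection import train_test_split

# Assumed preparation steps -- not shown in the original code
df_ready = df.drop(columns=['parcel'])                  # drop the parcel identifier (assumed column name)
X = df_ready.drop(columns=['before1980', 'yrbuilt'])    # features; yrbuilt excluded because before1980 derives from it
y = df_ready['before1980']                              # target: 1 = built before 1980
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)              # 70/30 split; random_state is an assumption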
Create 2-3 charts that evaluate potential relationships between the home variables and before1980. Explain what you learn from the charts that could help a machine learning algorithm.
From the correlation check I learned that the variables most strongly correlated with the target variable are yrbuilt, quality_B, nocars, and gartype_None, to name a few.
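The correlation claim above can be reproduced with a short check; the original code is not shown, so this is a sketch using the df_ready frame prepared earlier:

# Correlation of each feature with the target (sketch; original code not shown)
corr_with_target = (
    df_ready.corr(numeric_only=True)['before1980']
    .drop('before1980')
    .sort_values(key=abs, ascending=False)
)
print(corr_with_target.head(10))    # e.g. yrbuilt, quality_B, nocars, gartype_None, ...

# Optional heatmap of the full correlation matrix
plt.figure(figsize=(12, 10))
sns.heatmap(df_ready.corr(numeric_only=True), cmap='coolwarm', center=0)
plt.title('Correlation matrix of dwelling features')
plt.show()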
from plotnine import ggplot, aes, geom_boxplot, geom_point, geom_bar, labs, theme_minimal, theme, element_text

# 1. Scatter plot: year built vs. living area
chart_livearea = (
    ggplot(df_ready, aes(x='yrbuilt', y='livearea')) +
    geom_point() +
    labs(title='Living Area by Year Built',
         subtitle='Comparison of year built and living area',
         x='Year Built',
         y='Living Area (sq ft)')
)

# 2. Scatter plot: year built vs. basement area, colored by before1980
chart_basement = (
    ggplot(df_ready, aes(x='yrbuilt', y='basement', color='before1980')) +
    geom_point(alpha=0.6) +
    labs(title='Year Built and Basement Area',
         subtitle='Colored by Homes Built Before and After 1980',
         x='Year Built',
         y='Basement Area (sq ft)') +
    theme(plot_title=element_text(color='black', size=16, face='bold'))
)

# 3. Bar chart: garage car spaces, split by before1980
chart_nocars = (
    ggplot(df_ready, aes(x='nocars', fill='before1980')) +
    geom_bar(position='dodge') +
    labs(title='Comparison of nocars and before 1980',
         subtitle='Built Before and After 1980',
         x='Number of Car Spaces',
         y='Count')
)

# Display and save the charts
print(chart_livearea)
print(chart_basement)
print(chart_nocars)
chart_livearea.save('plot1.png')
chart_basement.save('plot2.png')
chart_nocars.save('plot3.png')
Build a classification model labeling houses as being built “before 1980” or “during or after 1980”. Your goal is to reach or exceed 90% accuracy. Explain your final model choice (algorithm, tuning parameters, etc) and describe what other models you tried.
Justify your classification model by discussing the most important features selected by your model. This discussion should include a feature importance chart and a description of the features.
The most important features selected by the model are condition_Excel, gartype_None, quality_B, arcstyle_TWO-STORY, and arcstyle_END UNIT. The model appears to rely on these features because architectural styles, garage types, and quality grades shifted noticeably around 1980, which helps separate homes from the asbestos era from newer construction.
| | importance | feature |
|---|---|---|
| 5 | 0.000167 | condition_Excel |
| 8 | 0.002010 | gartype_None |
| 11 | 0.009865 | quality_B |
| 3 | 0.010496 | arcstyle_TWO-STORY |
| 1 | 0.023401 | arcstyle_END UNIT |
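The construction of this table is not shown. Below is a sketch of how a feature-importance table and chart can be produced from a fitted tree-based model (here the rf_clf random forest fitted later in the report is used for illustration, so it will not reproduce the exact rows above):

# Extract and chart feature importances from a fitted tree-based classifier (sketch)
feat_imp = (
    pd.DataFrame({'feature': X_train.columns,
                  'importance': rf_clf.feature_importances_})
    .sort_values('importance', ascending=False)
    .head(15)
)
feat_imp.plot.barh(x='feature', y='importance', figsize=(8, 6), legend=False)
plt.gca().invert_yaxis()            # most important feature at the top
plt.title('Top 15 feature importances')
plt.tight_layout()
plt.show()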
Describe the quality of your classification model using 2-3 different evaluation metrics. You also need to explain how to interpret each of the evaluation metrics you use.
I have used the classification report to evaluate the model. The classification report provides a detailed breakdown of the model’s performance on each class. It includes metrics such as precision, recall, and F1-score for each class.
Precision is the ratio of true positives to the sum of true positives and false positives. It measures the accuracy of positive predictions, indicating the proportion of predictions labeled as positive that are actually correct.
Recall is the ratio of true positives to the sum of true positives and false negatives. It measures the coverage of actual positives, showing the proportion of actual positive cases that are correctly identified by the model. The F1-score is the harmonic mean of precision and recall, combining both into a single number that is high only when both are high.
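As a concrete check, these metrics can be computed by hand from the random forest's test confusion matrix shown later in this report:

# Worked example using the random forest test confusion matrix reported below
cm = np.array([[2209, 363],
               [368, 3934]])                    # rows = actual class, columns = predicted class
tp, fp, fn = cm[1, 1], cm[0, 1], cm[1, 0]       # for the "before 1980" class (label 1)
precision = tp / (tp + fp)                      # 3934 / 4297 ≈ 0.916
recall = tp / (tp + fn)                         # 3934 / 4302 ≈ 0.914
f1 = 2 * precision * recall / (precision + recall)   # ≈ 0.915
print(precision, recall, f1)                    # matches the class-1 column of the report below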
Repeat the classification model using 3 different algorithms. Display their Feature Importance, and Decision Matrix. Explain the differences between the models and which one you would recommend to the Client.
The models differ in how they handle the data and how they make predictions. The first model uses a logistic regression classifier, a linear model that uses a logistic function to model the probability of a binary outcome. The second model uses a k-nearest neighbors classifier, a non-parametric method that classifies data points based on the majority class of their k nearest neighbors. The third model uses a decision tree classifier, a supervised learning algorithm that builds a tree-like model of decisions and their possible consequences. The last model is a random forest classifier, an ensemble method that combines multiple decision trees to make predictions.
I would recommend the random forest classifier: it ties the decision tree on training accuracy (94.1%) but has the highest testing accuracy of the four models (89.4% vs. 88.9% for the decision tree), and therefore the lowest error rate on unseen data.
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

def print_score(clf, X_train, y_train, X_test, y_test, train=True):
    """Print accuracy, classification report, and confusion matrix for a fitted classifier."""
    if train:
        pred = clf.predict(X_train)
        clf_report = pd.DataFrame(classification_report(y_train, pred, output_dict=True))
        print("Train Result:\n================================================")
        print(f"Accuracy Score: {accuracy_score(y_train, pred) * 100:.2f}%")
        print("_______________________________________________")
        print(f"CLASSIFICATION REPORT:\n{clf_report}")
        print("_______________________________________________")
        print(f"Confusion Matrix: \n {confusion_matrix(y_train, pred)}\n")
    else:
        pred = clf.predict(X_test)
        clf_report = pd.DataFrame(classification_report(y_test, pred, output_dict=True))
        print("Test Result:\n================================================")
        print(f"Accuracy Score: {accuracy_score(y_test, pred) * 100:.2f}%")
        print("_______________________________________________")
        print(f"CLASSIFICATION REPORT:\n{clf_report}")
        print("_______________________________________________")
        print(f"Confusion Matrix: \n {confusion_matrix(y_test, pred)}\n")
Train Result:
================================================
Accuracy Score: 83.79%
_______________________________________________
CLASSIFICATION REPORT:
0 1 accuracy macro avg weighted avg
precision 0.793583 0.863187 0.837895 0.828385 0.837053
recall 0.768017 0.879904 0.837895 0.823961 0.837895
f1-score 0.780591 0.871465 0.837895 0.826028 0.837346
support 6022.000000 10017.000000 0.837895 16039.000000 16039.000000
_______________________________________________
Confusion Matrix:
[[4625 1397]
[1203 8814]]
Test Result:
================================================
Accuracy Score: 84.23%
_______________________________________________
CLASSIFICATION REPORT:
0 1 accuracy macro avg weighted avg
precision 0.799758 0.866348 0.842304 0.833053 0.841432
recall 0.771773 0.884472 0.842304 0.828123 0.842304
f1-score 0.785516 0.875316 0.842304 0.830416 0.841716
support 2572.000000 4302.000000 0.842304 6874.000000 6874.000000
_______________________________________________
Confusion Matrix:
[[1985 587]
[ 497 3805]]
Train Result:
================================================
Accuracy Score: 88.66%
_______________________________________________
CLASSIFICATION REPORT:
0 1 accuracy macro avg weighted avg
precision 0.849028 0.909164 0.886589 0.879096 0.886585
recall 0.848887 0.909254 0.886589 0.879071 0.886589
f1-score 0.848958 0.909209 0.886589 0.879083 0.886587
support 6022.000000 10017.000000 0.886589 16039.000000 16039.000000
_______________________________________________
Confusion Matrix:
[[5112 910]
[ 909 9108]]
Test Result:
================================================
Accuracy Score: 85.38%
_______________________________________________
CLASSIFICATION REPORT:
0 1 accuracy macro avg weighted avg
precision 0.803565 0.883997 0.853797 0.843781 0.853902
recall 0.806376 0.882148 0.853797 0.844262 0.853797
f1-score 0.804968 0.883072 0.853797 0.844020 0.853848
support 2572.000000 4302.000000 0.853797 6874.000000 6874.000000
_______________________________________________
Confusion Matrix:
[[2074 498]
[ 507 3795]]
Train Result:
================================================
Accuracy Score: 94.12%
_______________________________________________
CLASSIFICATION REPORT:
0 1 accuracy macro avg weighted avg
precision 0.924453 0.951173 0.941206 0.937813 0.941141
recall 0.918466 0.954877 0.941206 0.936671 0.941206
f1-score 0.921449 0.953021 0.941206 0.937235 0.941167
support 6022.000000 10017.000000 0.941206 16039.000000 16039.000000
_______________________________________________
Confusion Matrix:
[[5531 491]
[ 452 9565]]
Test Result:
================================================
Accuracy Score: 88.94%
_______________________________________________
CLASSIFICATION REPORT:
0 1 accuracy macro avg weighted avg
precision 0.847126 0.915338 0.889438 0.881232 0.889816
recall 0.859642 0.907252 0.889438 0.883447 0.889438
f1-score 0.853338 0.911277 0.889438 0.882308 0.889599
support 2572.000000 4302.000000 0.889438 6874.000000 6874.000000
_______________________________________________
Confusion Matrix:
[[2211 361]
[ 399 3903]]
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
rf_clf = RandomForestClassifier(n_estimators=1000, random_state=42)
rf_clf.fit(X_train, y_train)
print_score(rf_clf, X_train, y_train, X_test, y_test, train=True)
print_score(rf_clf, X_train, y_train, X_test, y_test, train=False)
Train Result:
================================================
Accuracy Score: 94.12%
_______________________________________________
CLASSIFICATION REPORT:
0 1 accuracy macro avg weighted avg
precision 0.927453 0.949297 0.941206 0.938375 0.941095
recall 0.914978 0.956973 0.941206 0.935976 0.941206
f1-score 0.921174 0.953120 0.941206 0.937147 0.941125
support 6022.000000 10017.000000 0.941206 16039.000000 16039.000000
_______________________________________________
Confusion Matrix:
[[5510 512]
[ 431 9586]]
Test Result:
================================================
Accuracy Score: 89.37%
_______________________________________________
CLASSIFICATION REPORT:
0 1 accuracy macro avg weighted avg
precision 0.857198 0.915522 0.893657 0.886360 0.893700
recall 0.858865 0.914458 0.893657 0.886662 0.893657
f1-score 0.858031 0.914990 0.893657 0.886510 0.893678
support 2572.000000 4302.000000 0.893657 6874.000000 6874.000000
_______________________________________________
Confusion Matrix:
[[2209 363]
[ 368 3934]]
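RandomizedSearchCV is imported alongside the random forest above but its use is not shown. A sketch of how the model could be tuned with it (the parameter ranges, iteration count, and scoring choice are assumptions, not the values actually used):

# Randomized hyperparameter search for the random forest (sketch)
param_dist = {
    'n_estimators': [200, 500, 1000],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}
search = RandomizedSearchCV(RandomForestClassifier(random_state=42),
                            param_distributions=param_dist,
                            n_iter=20, cv=5, scoring='accuracy',
                            random_state=42, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")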
test_score = accuracy_score(y_test, rf_clf.predict(X_test)) * 100
train_score = accuracy_score(y_train, rf_clf.predict(X_train)) * 100
results_df_2 = pd.DataFrame(data=[["Random Forest Classifier", train_score, test_score]],
columns=['Model', 'Training Accuracy %', 'Testing Accuracy %'])
results_df = pd.concat([results_df, results_df_2], ignore_index=True)
results_df
| | Model | Training Accuracy % | Testing Accuracy % |
|---|---|---|---|
| 0 | Logistic Regression | 83.789513 | 84.230434 |
| 1 | K-nearest neighbors | 88.658894 | 85.379692 |
| 2 | Decision Tree Classifier | 94.120581 | 88.943846 |
| 3 | Random Forest Classifier | 94.120581 | 89.365726 |