import pandas as pd
import numpy as np
import plotly.express as px
import seaborn as sns
from sklearn import tree
import matplotlib.pyplot as plt # for graphs
from lets_plot import *
LetsPlot.setup_html()
Course DS 250
Gabriel Guerrero
The Clean Air Act of 1970 was the beginning of the end for the use of asbestos in home building. By 1976, the U.S. Environmental Protection Agency (EPA) was given authority to restrict the use of asbestos in paint. Homes built during and before this period are known to have materials containing asbestos. You can read more about this ban.
The state of Colorado is missing the year built for a large portion of its residential dwelling data, and they would like you to build a predictive model that can classify whether a house was built before 1980.
Colorado gave you home sales data for the city of Denver from 2013 on which to train your model. They said all the column names should be descriptive enough for your modeling and that they would like you to use the latest machine learning methods.
https://github.com/byuidatascience/data4dwellings/blob/master/data.md
The client is a state agency in Colorado responsible for the health and safety of its residents. As described above, much of their residential dwelling data is missing the year built, which is why they need a model that can classify whether a house was built before 1980.
This project aims to predict whether a house was built before 1980 based on features such as the number of bedrooms, bathrooms, square footage, and more. The dataset used for this project contains information on homes sold in Denver, Colorado.
# Read the Denver dwellings data (machine-learning version with one-hot encoded categories) from GitHub
df = pd.read_csv("https://github.com/byuidatascience/data4dwellings/raw/master/data-raw/dwellings_ml/dwellings_ml.csv")
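The chart and model code below uses df_ready, X_train, X_test, y_train, and y_test, which are never defined in the code shown. A minimal sketch of how they might be prepared, assuming the parcel identifier column is dropped, yrbuilt is excluded from the model features (before1980 is derived directly from it), and a 70/30 train/test split, which matches the 16,039 / 6,874 row counts reported later:

from sklearn.model_selection import train_test_split

# Assumed preparation steps -- not shown in the original code
df_ready = df.drop(columns=['parcel'])                  # drop the parcel identifier (assumed column name)
X = df_ready.drop(columns=['before1980', 'yrbuilt'])    # features; yrbuilt excluded because before1980 derives from it
y = df_ready['before1980']                              # target: 1 = built before 1980
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)              # 70/30 split; random_state is an assumption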
Create 2-3 charts that evaluate potential relationships between the home variables and before1980. Explain what you learn from the charts that could help a machine learning algorithm.
From the correlation check I learned that the variables most strongly correlated with the target variable are yrbuilt, quality_B, nocars, and gartype_None, to name a few.
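The correlation claim above can be reproduced with a short check; the original code is not shown, so this is a sketch using the df_ready frame prepared earlier:

# Correlation of each feature with the target (sketch; original code not shown)
corr_with_target = (
    df_ready.corr(numeric_only=True)['before1980']
    .drop('before1980')
    .sort_values(key=abs, ascending=False)
)
print(corr_with_target.head(10))    # e.g. yrbuilt, quality_B, nocars, gartype_None, ...

# Optional heatmap of the full correlation matrix
plt.figure(figsize=(12, 10))
sns.heatmap(df_ready.corr(numeric_only=True), cmap='coolwarm', center=0)
plt.title('Correlation matrix of dwelling features')
plt.show()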
from plotnine import ggplot, aes, geom_boxplot, geom_point, geom_bar, labs, theme_minimal, theme, element_text

# 1. Scatter plot: year built vs. living area
chart_livearea = (
    ggplot(df_ready, aes(x='yrbuilt', y='livearea')) +
    geom_point() +
    labs(title='Living Area by Year Built',
         subtitle='Comparison of year built and living area',
         x='Year Built',
         y='Living Area (sq ft)')
)

# 2. Scatter plot: year built vs. basement area, colored by before1980
chart_basement = (
    ggplot(df_ready, aes(x='yrbuilt', y='basement', color='before1980')) +
    geom_point(alpha=0.6) +
    labs(title='Year Built and Basement Area',
         subtitle='Colored by Homes Built Before and After 1980',
         x='Year Built',
         y='Basement Area (sq ft)') +
    theme(plot_title=element_text(color='black', size=16, face='bold'))
)

# 3. Bar chart: garage car spaces, split by before1980
chart_nocars = (
    ggplot(df_ready, aes(x='nocars', fill='before1980')) +
    geom_bar(position='dodge') +
    labs(title='Comparison of nocars and before 1980',
         subtitle='Built Before and After 1980',
         x='Number of Car Spaces',
         y='Count')
)

# Display and save the charts
print(chart_livearea)
print(chart_basement)
print(chart_nocars)
chart_livearea.save('plot1.png')
chart_basement.save('plot2.png')
chart_nocars.save('plot3.png')
Build a classification model labeling houses as being built “before 1980” or “during or after 1980”. Your goal is to reach or exceed 90% accuracy. Explain your final model choice (algorithm, tuning parameters, etc) and describe what other models you tried.
Justify your classification model by discussing the most important features selected by your model. This discussion should include a feature importance chart and a description of the features.
The most important features selected by the model are condition_Excel, gartype_None, quality_B, arcstyle_TWO-STORY, and arcstyle_END UNIT. The model appears to rely on these features because architectural styles, garage types, and quality grades shifted noticeably around 1980, which helps separate homes from the asbestos era from newer construction.
| | importance | feature |
|---|---|---|
| 5 | 0.000167 | condition_Excel |
| 8 | 0.002010 | gartype_None |
| 11 | 0.009865 | quality_B |
| 3 | 0.010496 | arcstyle_TWO-STORY |
| 1 | 0.023401 | arcstyle_END UNIT |
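The construction of this table is not shown. Below is a sketch of how a feature-importance table and chart can be produced from a fitted tree-based model (here the rf_clf random forest fitted later in the report is used for illustration, so it will not reproduce the exact rows above):

# Extract and chart feature importances from a fitted tree-based classifier (sketch)
feat_imp = (
    pd.DataFrame({'feature': X_train.columns,
                  'importance': rf_clf.feature_importances_})
    .sort_values('importance', ascending=False)
    .head(15)
)
feat_imp.plot.barh(x='feature', y='importance', figsize=(8, 6), legend=False)
plt.gca().invert_yaxis()            # most important feature at the top
plt.title('Top 15 feature importances')
plt.tight_layout()
plt.show()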
Describe the quality of your classification model using 2-3 different evaluation metrics. You also need to explain how to interpret each of the evaluation metrics you use.
I have used the classification report to evaluate the model. The classification report provides a detailed breakdown of the model’s performance on each class. It includes metrics such as precision, recall, and F1-score for each class.
Precision is the ratio of true positives to the sum of true positives and false positives. It measures the accuracy of positive predictions, indicating the proportion of predictions labeled as positive that are actually correct.
Recall is the ratio of true positives to the sum of true positives and false negatives. It measures the coverage of actual positives, showing the proportion of actual positive cases that are correctly identified by the model. The F1-score is the harmonic mean of precision and recall, combining both into a single number that is high only when both are high.
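As a concrete check, these metrics can be computed by hand from the random forest's test confusion matrix shown later in this report:

# Worked example using the random forest test confusion matrix reported below
cm = np.array([[2209, 363],
               [368, 3934]])                    # rows = actual class, columns = predicted class
tp, fp, fn = cm[1, 1], cm[0, 1], cm[1, 0]       # for the "before 1980" class (label 1)
precision = tp / (tp + fp)                      # 3934 / 4297 ≈ 0.916
recall = tp / (tp + fn)                         # 3934 / 4302 ≈ 0.914
f1 = 2 * precision * recall / (precision + recall)   # ≈ 0.915
print(precision, recall, f1)                    # matches the class-1 column of the report below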
Repeat the classification model using 3 different algorithms. Display their Feature Importance, and Decision Matrix. Explain the differences between the models and which one you would recommend to the Client.
The models differ in how they handle the data and how they make predictions. The first model uses a logistic regression classifier, a linear model that uses a logistic function to model the probability of a binary outcome. The second model uses a k-nearest neighbors classifier, a non-parametric method that classifies data points based on the majority class of their k nearest neighbors. The third model uses a decision tree classifier, a supervised learning algorithm that builds a tree-like model of decisions and their possible consequences. The last model is a random forest classifier, an ensemble method that combines multiple decision trees to make predictions.
I would recommend the random forest classifier: it ties the decision tree on training accuracy (94.1%) but has the highest testing accuracy of the four models (89.4% vs. 88.9% for the decision tree), and therefore the lowest error rate on unseen data.
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

def print_score(clf, X_train, y_train, X_test, y_test, train=True):
    """Print accuracy, classification report, and confusion matrix for a fitted classifier."""
    if train:
        pred = clf.predict(X_train)
        clf_report = pd.DataFrame(classification_report(y_train, pred, output_dict=True))
        print("Train Result:\n================================================")
        print(f"Accuracy Score: {accuracy_score(y_train, pred) * 100:.2f}%")
        print("_______________________________________________")
        print(f"CLASSIFICATION REPORT:\n{clf_report}")
        print("_______________________________________________")
        print(f"Confusion Matrix: \n {confusion_matrix(y_train, pred)}\n")
    else:
        pred = clf.predict(X_test)
        clf_report = pd.DataFrame(classification_report(y_test, pred, output_dict=True))
        print("Test Result:\n================================================")
        print(f"Accuracy Score: {accuracy_score(y_test, pred) * 100:.2f}%")
        print("_______________________________________________")
        print(f"CLASSIFICATION REPORT:\n{clf_report}")
        print("_______________________________________________")
        print(f"Confusion Matrix: \n {confusion_matrix(y_test, pred)}\n")
Train Result:
================================================
Accuracy Score: 83.79%
_______________________________________________
CLASSIFICATION REPORT:
0 1 accuracy macro avg weighted avg
precision 0.793583 0.863187 0.837895 0.828385 0.837053
recall 0.768017 0.879904 0.837895 0.823961 0.837895
f1-score 0.780591 0.871465 0.837895 0.826028 0.837346
support 6022.000000 10017.000000 0.837895 16039.000000 16039.000000
_______________________________________________
Confusion Matrix:
[[4625 1397]
[1203 8814]]
Test Result:
================================================
Accuracy Score: 84.23%
_______________________________________________
CLASSIFICATION REPORT:
0 1 accuracy macro avg weighted avg
precision 0.799758 0.866348 0.842304 0.833053 0.841432
recall 0.771773 0.884472 0.842304 0.828123 0.842304
f1-score 0.785516 0.875316 0.842304 0.830416 0.841716
support 2572.000000 4302.000000 0.842304 6874.000000 6874.000000
_______________________________________________
Confusion Matrix:
[[1985 587]
[ 497 3805]]
Train Result:
================================================
Accuracy Score: 88.66%
_______________________________________________
CLASSIFICATION REPORT:
0 1 accuracy macro avg weighted avg
precision 0.849028 0.909164 0.886589 0.879096 0.886585
recall 0.848887 0.909254 0.886589 0.879071 0.886589
f1-score 0.848958 0.909209 0.886589 0.879083 0.886587
support 6022.000000 10017.000000 0.886589 16039.000000 16039.000000
_______________________________________________
Confusion Matrix:
[[5112 910]
[ 909 9108]]
Test Result:
================================================
Accuracy Score: 85.38%
_______________________________________________
CLASSIFICATION REPORT:
0 1 accuracy macro avg weighted avg
precision 0.803565 0.883997 0.853797 0.843781 0.853902
recall 0.806376 0.882148 0.853797 0.844262 0.853797
f1-score 0.804968 0.883072 0.853797 0.844020 0.853848
support 2572.000000 4302.000000 0.853797 6874.000000 6874.000000
_______________________________________________
Confusion Matrix:
[[2074 498]
[ 507 3795]]
Train Result:
================================================
Accuracy Score: 94.12%
_______________________________________________
CLASSIFICATION REPORT:
0 1 accuracy macro avg weighted avg
precision 0.924453 0.951173 0.941206 0.937813 0.941141
recall 0.918466 0.954877 0.941206 0.936671 0.941206
f1-score 0.921449 0.953021 0.941206 0.937235 0.941167
support 6022.000000 10017.000000 0.941206 16039.000000 16039.000000
_______________________________________________
Confusion Matrix:
[[5531 491]
[ 452 9565]]
Test Result:
================================================
Accuracy Score: 88.94%
_______________________________________________
CLASSIFICATION REPORT:
0 1 accuracy macro avg weighted avg
precision 0.847126 0.915338 0.889438 0.881232 0.889816
recall 0.859642 0.907252 0.889438 0.883447 0.889438
f1-score 0.853338 0.911277 0.889438 0.882308 0.889599
support 2572.000000 4302.000000 0.889438 6874.000000 6874.000000
_______________________________________________
Confusion Matrix:
[[2211 361]
[ 399 3903]]
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
rf_clf = RandomForestClassifier(n_estimators=1000, random_state=42)
rf_clf.fit(X_train, y_train)
print_score(rf_clf, X_train, y_train, X_test, y_test, train=True)
print_score(rf_clf, X_train, y_train, X_test, y_test, train=False)
Train Result:
================================================
Accuracy Score: 94.12%
_______________________________________________
CLASSIFICATION REPORT:
0 1 accuracy macro avg weighted avg
precision 0.927453 0.949297 0.941206 0.938375 0.941095
recall 0.914978 0.956973 0.941206 0.935976 0.941206
f1-score 0.921174 0.953120 0.941206 0.937147 0.941125
support 6022.000000 10017.000000 0.941206 16039.000000 16039.000000
_______________________________________________
Confusion Matrix:
[[5510 512]
[ 431 9586]]
Test Result:
================================================
Accuracy Score: 89.37%
_______________________________________________
CLASSIFICATION REPORT:
0 1 accuracy macro avg weighted avg
precision 0.857198 0.915522 0.893657 0.886360 0.893700
recall 0.858865 0.914458 0.893657 0.886662 0.893657
f1-score 0.858031 0.914990 0.893657 0.886510 0.893678
support 2572.000000 4302.000000 0.893657 6874.000000 6874.000000
_______________________________________________
Confusion Matrix:
[[2209 363]
[ 368 3934]]
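RandomizedSearchCV is imported alongside the random forest above but its use is not shown. A sketch of how the model could be tuned with it (the parameter ranges, iteration count, and scoring choice are assumptions, not the values actually used):

# Randomized hyperparameter search for the random forest (sketch)
param_dist = {
    'n_estimators': [200, 500, 1000],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}
search = RandomizedSearchCV(RandomForestClassifier(random_state=42),
                            param_distributions=param_dist,
                            n_iter=20, cv=5, scoring='accuracy',
                            random_state=42, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")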
test_score = accuracy_score(y_test, rf_clf.predict(X_test)) * 100
train_score = accuracy_score(y_train, rf_clf.predict(X_train)) * 100
results_df_2 = pd.DataFrame(data=[["Random Forest Classifier", train_score, test_score]],
columns=['Model', 'Training Accuracy %', 'Testing Accuracy %'])
results_df = pd.concat([results_df, results_df_2], ignore_index=True)
results_df
| | Model | Training Accuracy % | Testing Accuracy % |
|---|---|---|---|
| 0 | Logistic Regression | 83.789513 | 84.230434 |
| 1 | K-nearest neighbors | 88.658894 | 85.379692 |
| 2 | Decision Tree Classifier | 94.120581 | 88.943846 |
| 3 | Random Forest Classifier | 94.120581 | 89.365726 |