import pandas as pd
import numpy as np
import plotly.express as px
import seaborn as sns
from sklearn import tree
import matplotlib.pyplot as plt  # for graphs
from lets_plot import *
LetsPlot.setup_html()
Course DS 250
Gabriel Guerrero
Survey data is notoriously difficult to munge. Even when the data is recorded cleanly, the options for ‘write-in questions’, ‘choose from multiple answers’, ‘pick all that are right’, and ‘multiple choice questions’ make storing the data in a tidy format difficult.
In 2014, FiveThirtyEight surveyed over 1000 people to write the article titled, America’s Favorite ‘Star Wars’ Movies (And Least Favorite Characters). They have provided the data on GitHub.
For this project, your client would like to use the Star Wars survey data to figure out if they can predict an interviewing job candidate’s current income based on a few responses about Star Wars movies.
The client performed the survey but outsourced the analytics to a third party. They want you to clean up the data so you can:
a. Validate that the data provided on GitHub lines up with the article by recreating 2 of the visuals from the article.
b. Determine whether you can predict if a person from the survey makes more than $50k.
This project aims to create a machine learning model that can predict the income of a person based on their responses to a survey about Star Wars movies. The dataset used for this project contains information on various people who have taken the survey, including their income, gender, education level, and more. The goal is to build a predictive model that can accurately classify whether a person makes more than $50,000 per year based on their responses to the survey questions.
Shorten the column names and clean them up for easier use with pandas. Provide a table or list that exemplifies how you fixed the names.
yes_no={
'Yes': True,
'No': False
}
yes_no_cols = [
    'Have you seen any of the 6 films in the Star Wars franchise?',
    'Do you consider yourself to be a fan of the Star Wars film franchise?',
    'Do you consider yourself to be a fan of the Expanded Universe?æ',
    'Do you consider yourself to be a fan of the Star Trek franchise?',
]
# Convert each Yes/No answer column to booleans
for col in yes_no_cols:
    star_wars[col] = star_wars[col].map(yes_no)
seen_notseen = {
'seen_notseen_1': {
star_wars.iloc[0,3]: True,
np.nan: False
},
'seen_notseen_2': {
star_wars.iloc[0,4]: True,
np.nan: False
},
'seen_notseen_3': {
star_wars.iloc[0,5]: True,
np.nan: False
},
'seen_notseen_4': {
star_wars.iloc[0,6]: True,
np.nan: False
},
'seen_notseen_5': {
star_wars.iloc[0,7]: True,
np.nan: False
},
'seen_notseen_6': {
star_wars.iloc[0,8]: True,
np.nan: False
},
}
# Map each "seen" column: movie title -> True, missing -> False
for movie in range(1, 7):
    star_wars['seen_' + str(movie)] = star_wars['seen_' + str(movie)].map(seen_notseen['seen_notseen_' + str(movie)])
star_wars[star_wars.columns[9:15]] = star_wars[star_wars.columns[9:15]].astype(float)
cols_rank = {
'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.': 'ranking_1',
'Unnamed: 10': 'ranking_2',
'Unnamed: 11': 'ranking_3',
'Unnamed: 12': 'ranking_4',
'Unnamed: 13': 'ranking_5',
'Unnamed: 14': 'ranking_6'
}
star_wars = star_wars.rename(columns=cols_rank)
male_female={
'Male': 1,
'Female': 0
}
ages={
'18-29': 1,
'30-44': 2,
'45-60': 3,
'> 60': 4
}
income = {
'$0 - $24,999': (24999),
'$25,000 - $49,999': (49999),
'$50,000 - $99,999': (99999),
'$100,000 - $149,999': (149999),
'$150,000+': (200000) # Upper limit for simulation
}
education={
'Less than high school degree': 1,
'High school degree': 2,
'Some college or Associate degree': 3,
'Bachelor degree': 4,
'Graduate degree': 5
}
star_wars['Gender'] = star_wars['Gender'].map(male_female)
star_wars['Age'] = star_wars['Age'].map(ages)
star_wars['Household Income'] = star_wars['Household Income'].map(income) # Upper bound of each income bracket
star_wars['Education'] = star_wars['Education'].map(education)
sta_wars_names = star_wars_drop.rename(columns={
    'Have you seen any of the 6 films in the Star Wars franchise?': 'Seen_any_film',
    'Do you consider yourself to be a fan of the Star Wars film franchise?': 'Are_you_fan',
    'Do you consider yourself to be a fan of the Expanded Universe?æ': 'fan_expanded_universe',
    'Do you consider yourself to be a fan of the Star Trek franchise?': 'fan_star_trek',
    'Household Income': 'Household_Income',
    'Location (Census Region)': 'location',
})
sta_wars_names.head(5)
index | RespondentID | Seen_any_film | Are_you_fan | seen_1 | seen_2 | seen_3 | seen_4 | seen_5 | seen_6 | ranking_1 | ranking_2 | ranking_3 | ranking_4 | ranking_5 | ranking_6 | fan_expanded_universe | fan_star_trek | Gender | Age | Household_Income | Education | location |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 3.292880e+09 | True | True | True | True | True | True | True | True | 3.0 | 2.0 | 1.0 | 4.0 | 5.0 | 6.0 | False | False | 1.0 | 1.0 | NaN | NaN | South Atlantic |
2 | 3.292880e+09 | False | NaN | False | False | False | False | False | False | NaN | NaN | NaN | NaN | NaN | NaN | NaN | True | 1.0 | 1.0 | 24999.0 | 4.0 | West South Central |
3 | 3.292765e+09 | True | False | True | True | True | False | False | False | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 | 6.0 | NaN | False | 1.0 | 1.0 | 24999.0 | NaN | West North Central |
4 | 3.292763e+09 | True | True | True | True | True | True | True | True | 5.0 | 6.0 | 1.0 | 2.0 | 4.0 | 3.0 | NaN | True | 1.0 | 1.0 | 149999.0 | 3.0 | West North Central |
5 | 3.292731e+09 | True | True | True | True | True | True | True | True | 5.0 | 4.0 | 6.0 | 2.0 | 1.0 | 3.0 | False | False | 1.0 | 1.0 | 149999.0 | 3.0 | West North Central |
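The task also asks for a table or list that exemplifies how the names were fixed. A minimal sketch of one way to produce it, assuming the rename mapping is kept in a dict; `rename_map` and `rename_summary` are illustrative names, not part of the original code, and only a subset of the mapping is shown:

```python
import pandas as pd

# Hypothetical subset of the rename mapping used above.
rename_map = {
    "Have you seen any of the 6 films in the Star Wars franchise?": "Seen_any_film",
    "Do you consider yourself to be a fan of the Star Wars film franchise?": "Are_you_fan",
    "Household Income": "Household_Income",
    "Location (Census Region)": "location",
}


def rename_summary(mapping: dict) -> pd.DataFrame:
    """Return a two-column table of original and shortened column names."""
    return pd.DataFrame(
        {
            "original_name": list(mapping.keys()),
            "new_name": list(mapping.values()),
        }
    )


summary = rename_summary(rename_map)
print(summary.to_string(index=False))
```

Keeping the mapping in one dict means the same object can drive both `rename(columns=...)` and the before/after table, so the two cannot drift apart.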
Clean and format the data so that it can be used in a machine learning model. As you format the data, you should complete each item listed below. In your final report provide example(s) of the reformatted data with a short description of the changes made.
- Filter the dataset to respondents that have seen at least one film
- Create a new column that converts the age ranges to a single number. Drop the age range categorical column
- Create a new column that converts the education groupings to a single number. Drop the school categorical column
- Create a new column that converts the income ranges to a single number. Drop the income range categorical column
- Create your target (also known as “y” or “label”) column based on the new income range column
- One-hot encode all remaining categorical columns
index | RespondentID | Seen_any_film | Are_you_fan | seen_1 | seen_2 | seen_3 | seen_4 | seen_5 | seen_6 | ranking_1 | ranking_2 | ranking_3 | ranking_4 | ranking_5 | ranking_6 | fan_expanded_universe | fan_star_trek | Gender | Age | Household_Income | Education | location | seen_any_real |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 3.292880e+09 | True | True | True | True | True | True | True | True | 3.0 | 2.0 | 1.0 | 4.0 | 5.0 | 6.0 | False | False | 1.0 | 1.0 | NaN | NaN | South Atlantic | True |
3 | 3.292765e+09 | True | False | True | True | True | False | False | False | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 | 6.0 | NaN | False | 1.0 | 1.0 | 24999.0 | NaN | West North Central | True |
4 | 3.292763e+09 | True | True | True | True | True | True | True | True | 5.0 | 6.0 | 1.0 | 2.0 | 4.0 | 3.0 | NaN | True | 1.0 | 1.0 | 149999.0 | 3.0 | West North Central | True |
5 | 3.292731e+09 | True | True | True | True | True | True | True | True | 5.0 | 4.0 | 6.0 | 2.0 | 1.0 | 3.0 | False | False | 1.0 | 1.0 | 149999.0 | 3.0 | West North Central | True |
6 | 3.292719e+09 | True | True | True | True | True | True | True | True | 1.0 | 4.0 | 3.0 | 6.0 | 5.0 | 2.0 | False | True | 1.0 | 1.0 | 49999.0 | 4.0 | Middle Atlantic | True |
Items 2, 3, and 4 were completed in the earlier code cells labeled cleaning_1, hot_encoding1, and hot_encoding.
index | RespondentID | Seen_any_film | Are_you_fan | seen_1 | seen_2 | seen_3 | seen_4 | seen_5 | seen_6 | ranking_1 | ranking_2 | ranking_3 | ranking_4 | ranking_5 | ranking_6 | fan_expanded_universe | fan_star_trek | Gender | Age | y | Education | location | seen_any_real | y_target |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 3.292880e+09 | True | True | True | True | True | True | True | True | 3.0 | 2.0 | 1.0 | 4.0 | 5.0 | 6.0 | False | False | 1.0 | 1.0 | NaN | NaN | South Atlantic | True | NaN |
3 | 3.292765e+09 | True | False | True | True | True | False | False | False | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 | 6.0 | NaN | False | 1.0 | 1.0 | 24999.0 | NaN | West North Central | True | 24999.0 |
4 | 3.292763e+09 | True | True | True | True | True | True | True | True | 5.0 | 6.0 | 1.0 | 2.0 | 4.0 | 3.0 | NaN | True | 1.0 | 1.0 | 149999.0 | 3.0 | West North Central | True | 149999.0 |
5 | 3.292731e+09 | True | True | True | True | True | True | True | True | 5.0 | 4.0 | 6.0 | 2.0 | 1.0 | 3.0 | False | False | 1.0 | 1.0 | 149999.0 | 3.0 | West North Central | True | 149999.0 |
6 | 3.292719e+09 | True | True | True | True | True | True | True | True | 1.0 | 4.0 | 3.0 | 6.0 | 5.0 | 2.0 | False | True | 1.0 | 1.0 | 49999.0 | 4.0 | Middle Atlantic | True | 49999.0 |
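Because a NaN income compares as False against any threshold, rows with a missing income would silently land in the "not over $50k" class if the target were built with a plain comparison. A minimal sketch of a binary target that keeps missing incomes as missing; the series values mirror the rows above, while the name `over_50k` and the use of pandas' nullable `Int64` dtype are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

# Encoded incomes (upper bound of each bracket); NaN = respondent gave no income.
income = pd.Series(
    [np.nan, 24999.0, 149999.0, 49999.0, 99999.0], name="Household_Income"
)

# Compare against the threshold, then restore missing values explicitly.
over_50k = (income > 50_000).astype("Int64")
over_50k[income.isna()] = pd.NA

# Rows with a known label, suitable for training.
labeled = over_50k.dropna()

print(over_50k)
```

Dropping or separately handling the unlabeled rows avoids training on labels that were never observed.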
Validate that the data provided on GitHub lines up with the article by recreating 2 of the visuals from the article.
from plotnine import ggplot, aes, geom_bar, labs, geom_text, theme, element_text
plot = (
ggplot(grouped_counts, aes(x='movies', y='percentage')) +
geom_bar(stat='identity', fill='darkblue') +
geom_text(aes(label='percentage'), va='bottom', ha='center', color='Black', size=10) + # Adding percentage labels
labs(
title='Unique Movies Seen by Respondents',
x='Movies',
y='Percentage of Respondents'
) +
theme(
axis_text_x=element_text(rotation=90, hjust=1), # Rotate x-axis labels
plot_title=element_text(size=16, face='bold'),
plot_subtitle=element_text(size=12)
)
)
print(plot)
plot.save('plot2.png')
[Figure: bar chart "Unique Movies Seen by Respondents" (x: Movies, y: Percentage of Respondents)]
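The table behind a chart like this one is a per-film share of respondents. A minimal sketch of that calculation on toy data, assuming boolean `seen_*` columns as in the cleaned frame; the three-film frame and the names `seen_any` and `seen_pct` are made up for illustration:

```python
import pandas as pd

# Toy data: one row per respondent, one boolean column per film.
seen = pd.DataFrame(
    {
        "RespondentID": [1, 2, 3, 4],
        "seen_1": [True, False, True, False],
        "seen_2": [True, False, False, False],
        "seen_3": [True, False, True, True],
    }
)
seen_cols = ["seen_1", "seen_2", "seen_3"]

# Keep respondents who have seen at least one film.
seen_any = seen[seen[seen_cols].any(axis=1)]

# The mean of a boolean column is the share of True values.
seen_pct = (seen_any[seen_cols].mean() * 100).round(0)

print(seen_pct)
```

Taking the denominator from the filtered frame keeps the percentages consistent with whichever respondents were retained.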
import pandas as pd
# Melt seen columns
df_meltedq = filtered_df.melt(
id_vars=['RespondentID'],
value_vars=['seen_1', 'seen_2', 'seen_3', 'seen_4', 'seen_5', 'seen_6'],
var_name='movies',
value_name='test'
)
# Melt ranking columns
df_meltedq1 = filtered_df.melt(
id_vars=['RespondentID'],
value_vars=['ranking_1', 'ranking_2', 'ranking_3', 'ranking_4', 'ranking_5', 'ranking_6'],
var_name='movies',
value_name='ranking'
)
# Extract the numeric part of the 'movies' column
df_meltedq['movies'] = df_meltedq['movies'].str.extract(r'(\d+)', expand=False)
df_meltedq1['movies'] = df_meltedq1['movies'].str.extract(r'(\d+)', expand=False)
# Merge on RespondentID and movies
result = pd.merge(df_meltedq, df_meltedq1, on=['RespondentID', 'movies'])
#print("\nMerged Melted DataFrame:")
#print(result)
from plotnine import ggplot, aes, geom_bar, labs, theme_minimal
# Survey columns 1-6 follow episode order (Episode I through Episode VI)
name_movies={
'1': 'The Phantom Menace',
'2': 'Attack of the Clones',
'3': 'Revenge of the Sith',
'4': 'A New Hope',
'5': 'The Empire Strikes Back',
'6': 'Return of the Jedi'
}
filtered['movies'] = filtered['movies'].map(name_movies)
grouped_counts = filtered.groupby('movies')['ranking'].count().reset_index()
grouped_counts.columns = ['movies', 'count']
total_count = grouped_counts['count'].sum()
# 471 = respondents who have seen all 6 movies (the denominator named in the plot subtitle)
grouped_counts['percentage'] = ((grouped_counts['count'] / 471) * 100).round(0)
from plotnine import ggplot, aes, geom_bar, labs, geom_text, theme, element_text
plot = (
ggplot(grouped_counts, aes(x='movies', y='percentage')) +
geom_bar(stat='identity', fill='darkblue') +
geom_text(aes(label='percentage'), va='bottom', ha='center', color='Black', size=10) + # Adding percentage labels
labs(
title='Which Is the Best Star Wars Movie?',
subtitle='Of 471 respondents who have seen all 6 movies',
x='Movies',
y='Percentage of Respondents'
) +
theme(
axis_text_x=element_text(rotation=90, hjust=1), # Rotate x-axis labels
plot_title=element_text(size=16, face='bold'),
plot_subtitle=element_text(size=12)
)
)
print(plot)
plot.save('plot1.png')
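[Figure: bar chart of the best Star Wars movie among the 471 respondents who have seen all 6 movies (x: Movies, y: Percentage of Respondents)]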
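The code above plots a frame named `filtered` whose construction is not shown. One plausible way to get from the wide survey frame to per-film "favorite" percentages is sketched below on toy data. The two-film frame and the names `wide`, `seen_all`, `favorites`, and `best_pct` are assumptions, as is defining "favorite" as a ranking of 1:

```python
import pandas as pd

# Toy data: two films only, one row per respondent.
wide = pd.DataFrame(
    {
        "RespondentID": [1, 2, 3, 4],
        "seen_1": [True, True, True, True],
        "seen_2": [True, True, False, True],
        "ranking_1": [1.0, 2.0, 1.0, 2.0],
        "ranking_2": [2.0, 1.0, 2.0, 1.0],
    }
)
seen_cols = ["seen_1", "seen_2"]
rank_cols = ["ranking_1", "ranking_2"]

# Keep only respondents who have seen every film.
seen_all = wide[wide[seen_cols].all(axis=1)]

# Reshape rankings to long format: one row per respondent and film.
long = seen_all.melt(
    id_vars="RespondentID",
    value_vars=rank_cols,
    var_name="movies",
    value_name="ranking",
)
long["movies"] = long["movies"].str.extract(r"(\d+)", expand=False)

# A film is a respondent's favorite when it is ranked 1.
favorites = long[long["ranking"] == 1]

# Share of respondents naming each film as their favorite.
best_pct = (
    (favorites["movies"].value_counts(normalize=True) * 100)
    .round(0)
    .sort_index()
)

print(best_pct)
```

With `normalize=True` the denominator is the number of favorite votes, which equals the number of retained respondents when each one ranks exactly one film first.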
Build a machine learning model that predicts whether a person makes more than $50k. Describe your model and report the accuracy.
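I used a classification model to predict whether a person makes more than $50k: a random forest classifier with 100 trees. It scores 100.00% accuracy on the training set and 80.62% on the held-out 30% test set; the gap suggests the model overfits the training data. The feature matrix also includes Household_Income, the column the target is derived from, so these scores likely overstate how well the Star Wars answers alone predict income.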
index | RespondentID | Seen_any_film | Are_you_fan | seen_1 | seen_2 | seen_3 | seen_4 | seen_5 | seen_6 | ranking_1 | ranking_2 | ranking_3 | ranking_4 | ranking_5 | ranking_6 | fan_expanded_universe | fan_star_trek | Gender | Age | Household_Income | Education | location | seen_any_real | seen_all_true | ml_prep |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 3.292880e+09 | True | True | True | True | True | True | True | True | 3.0 | 2.0 | 1.0 | 4.0 | 5.0 | 6.0 | False | False | 1.0 | 1.0 | NaN | NaN | South Atlantic | True | True | 0 |
2 | 3.292880e+09 | False | NaN | False | False | False | False | False | False | NaN | NaN | NaN | NaN | NaN | NaN | NaN | True | 1.0 | 1.0 | 24999.0 | 4.0 | West South Central | False | False | 0 |
3 | 3.292765e+09 | True | False | True | True | True | False | False | False | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 | 6.0 | NaN | False | 1.0 | 1.0 | 24999.0 | NaN | West North Central | True | False | 0 |
4 | 3.292763e+09 | True | True | True | True | True | True | True | True | 5.0 | 6.0 | 1.0 | 2.0 | 4.0 | 3.0 | NaN | True | 1.0 | 1.0 | 149999.0 | 3.0 | West North Central | True | True | 1 |
5 | 3.292731e+09 | True | True | True | True | True | True | True | True | 5.0 | 4.0 | 6.0 | 2.0 | 1.0 | 3.0 | False | False | 1.0 | 1.0 | 149999.0 | 3.0 | West North Central | True | True | 1 |
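In the table above, `ml_prep` follows directly from `Household_Income`, and `RespondentID` is only a row identifier. A feature matrix meant to test whether the survey answers themselves predict income would usually leave both out. A minimal sketch of that selection on a toy frame; the values and the list name `leaky_or_id` are made up, and this is an alternative setup, not the feature set used in this report:

```python
import pandas as pd

# Toy frame with the same column roles as the table above; values are made up.
toy = pd.DataFrame(
    {
        "RespondentID": [101, 102, 103, 104],
        "seen_1": [True, False, True, True],
        "ranking_1": [3.0, 1.0, 5.0, 2.0],
        "Age": [1.0, 2.0, 3.0, 4.0],
        "Household_Income": [24999.0, 99999.0, 149999.0, 49999.0],
        "ml_prep": [0, 1, 1, 0],
    }
)

# Columns that identify the row or encode the answer directly.
leaky_or_id = ["RespondentID", "Household_Income", "ml_prep"]

X = toy.drop(columns=leaky_or_id)  # features: survey answers only
y = toy["ml_prep"]                 # target: 1 if income is above $50k

print(list(X.columns))
```

Any accuracy measured on such a feature set reflects only what the remaining answers carry about income.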
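for column in sta_wars_names.columns:
    most_common_value = sta_wars_names[column].mode()[0]  # Get the mode (most frequent value) of the column
    # Assign back instead of using inplace=True on a column selection, which may not update the frame
    sta_wars_names[column] = sta_wars_names[column].fillna(most_common_value)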
# replace nan values
#star_wars_names['Household_Income'] = star_wars_names['Household_Income'].fillna(star_wars_names['Household_Income'].median())
X = sta_wars_names[['RespondentID', 'Seen_any_film', 'seen_1', 'seen_2',
'seen_3', 'seen_4', 'seen_5', 'seen_6', 'ranking_1', 'ranking_2',
'ranking_3', 'ranking_4', 'ranking_5', 'ranking_6',
'Gender', 'Age',
'Household_Income', 'Education']]
y=sta_wars_names['ml_prep']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=1)
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
def print_score(clf, X_train, y_train, X_test, y_test, train=True):
    """Print accuracy, classification report, and confusion matrix for the train or test split."""
    if train:
        pred = clf.predict(X_train)
        clf_report = pd.DataFrame(classification_report(y_train, pred, output_dict=True))
        print("Train Result:\n================================================")
        print(f"Accuracy Score: {accuracy_score(y_train, pred) * 100:.2f}%")
        print("_______________________________________________")
        print(f"CLASSIFICATION REPORT:\n{clf_report}")
        print("_______________________________________________")
        print(f"Confusion Matrix: \n {confusion_matrix(y_train, pred)}\n")
    else:
        pred = clf.predict(X_test)
        clf_report = pd.DataFrame(classification_report(y_test, pred, output_dict=True))
        print("Test Result:\n================================================")
        print(f"Accuracy Score: {accuracy_score(y_test, pred) * 100:.2f}%")
        print("_______________________________________________")
        print(f"CLASSIFICATION REPORT:\n{clf_report}")
        print("_______________________________________________")
        print(f"Confusion Matrix: \n {confusion_matrix(y_test, pred)}\n")
'''
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier(
n_neighbors=2, # Number of nearest neighbors to consider (default is 5)
weights='uniform', # Can be 'uniform' or 'distance' (closer neighbors have more influence)
algorithm='auto', # 'auto', 'ball_tree', 'kd_tree', or 'brute'
p=2, # Power parameter for the Minkowski metric (2 = Euclidean distance)
metric='minkowski' # Distance metric to use
)
knn_clf.fit(X_train, y_train)
'''
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
print_score(clf, X_train, y_train, X_test, y_test, train=True)
print_score(clf, X_train, y_train, X_test, y_test, train=False)
Train Result:
================================================
Accuracy Score: 100.00%
_______________________________________________
CLASSIFICATION REPORT:
0 1 accuracy macro avg weighted avg
precision 1.0 1.0 1.0 1.0 1.0
recall 1.0 1.0 1.0 1.0 1.0
f1-score 1.0 1.0 1.0 1.0 1.0
support 459.0 371.0 1.0 830.0 830.0
_______________________________________________
Confusion Matrix:
[[459 0]
[ 0 371]]
Test Result:
================================================
Accuracy Score: 80.62%
_______________________________________________
CLASSIFICATION REPORT:
0 1 accuracy macro avg weighted avg
precision 0.829787 0.779762 0.80618 0.804775 0.806882
recall 0.808290 0.803681 0.80618 0.805986 0.806180
f1-score 0.818898 0.791541 0.80618 0.805219 0.806372
support 193.000000 163.000000 0.80618 356.000000 356.000000
_______________________________________________
Confusion Matrix:
[[156 37]
[ 32 131]]
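The test-set metrics above follow directly from the confusion matrix. As a worked check, using scikit-learn's convention that rows are actual classes and columns are predicted classes, the figures can be recomputed with NumPy alone (the variable names here are illustrative):

```python
import numpy as np

# Test-set confusion matrix reported above.
# Rows = actual class (0, 1); columns = predicted class (0, 1).
cm = np.array([[156, 37],
               [32, 131]])

# Accuracy: correct predictions (the diagonal) over all predictions.
accuracy = np.trace(cm) / cm.sum()

# Precision: correct predictions per class over everything predicted as that class.
precision = np.diag(cm) / cm.sum(axis=0)

# Recall: correct predictions per class over everything actually in that class.
recall = np.diag(cm) / cm.sum(axis=1)

# F1: harmonic mean of precision and recall, per class.
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy:  {accuracy:.4f}")
print(f"precision: {np.round(precision, 6)}")
print(f"recall:    {np.round(recall, 6)}")
print(f"f1-score:  {np.round(f1, 6)}")
```

For example, accuracy is (156 + 131) / 356 ≈ 0.8062, which matches the reported 80.62%.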