Client Report - The War with Star Wars – Portfolio Cases- Gabriel Guerrero

Background

Survey data is notoriously difficult to munge. Even when the data is recorded cleanly, the options for ‘write in questions’, ‘choose from multiple answers’, ‘pick all that are right’, and ‘multiple choice questions’ make storing the data in a tidy format difficult.

In 2014, FiveThirtyEight surveyed over 1000 people to write the article titled, America’s Favorite ‘Star Wars’ Movies (And Least Favorite Characters). They have provided the data on GitHub.

For this project, your client would like to use the Star Wars survey data to find out whether a job candidate’s current income can be predicted from a few responses about Star Wars movies.

Client Request

The client performed the survey but outsourced the analytics to a third party. They want you to clean up the data so you can:

a. Validate that the data provided on GitHub lines up with the article by recreating 2 of the visuals from the article
b. Determine whether you can predict if a person from the survey makes more than $50k

Elevator pitch

This project aims to create a machine learning model that can predict the income of a person based on their responses to a survey about Star Wars movies. The dataset used for this project contains information on various people who have taken the survey, including their income, gender, education level, and more. The goal is to build a predictive model that can accurately classify whether a person makes more than $50,000 per year based on their responses to the survey questions.

Libraries and Tidy Data


import pandas as pd
import numpy as np
import plotly.express as px
import seaborn as sns
from sklearn import tree
import matplotlib.pyplot as plt  # for graphs
from lets_plot import *
LetsPlot.setup_html()
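The cells that follow assume a DataFrame named star_wars already exists, but the report never shows it being loaded. A minimal sketch of that step is given here. It assumes the survey is the StarWars.csv file in FiveThirtyEight's public data repository and that the file is Latin-1 encoded (the stray æ in one of the column names used later suggests so). The load_star_wars helper and the STAR_WARS_URL constant are illustrative names, not part of the original report.

```python
import io
import pandas as pd

# Assumed location of the survey file; adjust if the repository layout differs.
STAR_WARS_URL = (
    "https://raw.githubusercontent.com/fivethirtyeight/data/"
    "master/star-wars-survey/StarWars.csv"
)

def load_star_wars(source=STAR_WARS_URL):
    """Read the raw survey; `source` may be a URL, a path, or a file-like object."""
    # Latin-1 is an assumption; it decodes any byte sequence without raising.
    return pd.read_csv(source, encoding="ISO-8859-1")

# Offline demonstration with a two-column stand-in for the real file.
sample = "RespondentID,Gender\n,Response\n3292879998,Male\n".encode("latin-1")
demo = load_star_wars(io.BytesIO(sample))
print(demo.shape)
```

With the real file, star_wars = load_star_wars() would give the frame the cleaning cells operate on. Those cells read the film titles from the first data row (star_wars.iloc[0, ...]), so that sub-header row should not be dropped before they run.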

Question 1

Shorten the column names and clean them up for easier use with pandas. Provide a table or list that exemplifies how you fixed the names.

mapping

yes_no={
    'Yes': True,
    'No': False
}

yes_no_cols = [
    'Have you seen any of the 6 films in the Star Wars franchise?',
    'Do you consider yourself to be a fan of the Star Wars film franchise?',
    'Do you consider yourself to be a fan of the Expanded Universe?æ',
    'Do you consider yourself to be a fan of the Star Trek franchise?'
]

# Map each Yes/No answer to a boolean; anything else (including missing answers) becomes NaN
for col in yes_no_cols:
    star_wars[col] = star_wars[col].map(yes_no)

cleaning

cols_seen = {
    'Which of the following Star Wars films have you seen? Please select all that apply.': 'seen_1',
    'Unnamed: 4': 'seen_2',
    'Unnamed: 5': 'seen_3',
    'Unnamed: 6': 'seen_4',
    'Unnamed: 7': 'seen_5',
    'Unnamed: 8': 'seen_6'    
}

star_wars = star_wars.rename(columns=cols_seen)

cleaning_1

seen_notseen = {
    
    'seen_notseen_1': {
        star_wars.iloc[0,3]: True,
        np.nan: False
    },

    'seen_notseen_2': {
        star_wars.iloc[0,4]: True,
        np.nan: False
    },

    'seen_notseen_3': {
        star_wars.iloc[0,5]: True,
        np.nan: False
    },
    
    'seen_notseen_4': {
        star_wars.iloc[0,6]: True,
        np.nan: False
    },
    
    'seen_notseen_5': {
        star_wars.iloc[0,7]: True,
        np.nan: False
    },

    'seen_notseen_6': {
        star_wars.iloc[0,8]: True,
        np.nan: False
    },
}


for movie in range(1,7):
    star_wars['seen_' + str(movie)] = star_wars['seen_' + str(movie)].map(seen_notseen['seen_notseen_' + str(movie)])
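Because each seen_* cell holds either the film title or NaN, the same True/False flags can also be derived without a per-column dictionary. A minimal sketch on toy data follows; seen_flags is a hypothetical helper and the shortened titles are placeholders, not the exact strings in the survey file.

```python
import numpy as np
import pandas as pd

def seen_flags(df, cols):
    """Return boolean columns: True where a film title is present, False where NaN."""
    return df[cols].notna()

toy = pd.DataFrame({
    "seen_1": ["Star Wars: Episode I", np.nan, "Star Wars: Episode I"],
    "seen_2": [np.nan, np.nan, "Star Wars: Episode II"],
})
flags = seen_flags(toy, ["seen_1", "seen_2"])
print(flags)
```

This relies on the cells containing nothing other than a title or a missing value. If a column could hold other strings, the explicit dictionary above is the safer choice.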

hot_encoding1

star_wars[star_wars.columns[9:15]] = star_wars[star_wars.columns[9:15]].astype(float)

cols_rank = {
    'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.': 'ranking_1',
    'Unnamed: 10': 'ranking_2',
    'Unnamed: 11': 'ranking_3',
    'Unnamed: 12': 'ranking_4',
    'Unnamed: 13': 'ranking_5',
    'Unnamed: 14': 'ranking_6'    
}

star_wars = star_wars.rename(columns=cols_rank)

hot_encoding

male_female={
    'Male': 1,
    'Female': 0
}

ages={
    '18-29': 1,
    '30-44': 2,
    '45-60': 3,
    '> 60': 4
}

# Map each income bracket to its upper bound
income = {
    '$0 - $24,999': 24999,
    '$25,000 - $49,999': 49999,
    '$50,000 - $99,999': 99999,
    '$100,000 - $149,999': 149999,
    '$150,000+': 200000  # Open-ended bracket; 200000 is an assumed upper limit
}


education={
    'Less than high school degree  ': 1,
    'High school degree ': 2,
    'Some college or Associate degree': 3,
    'Bachelor degree': 4,
    'Graduate degree': 5
}


star_wars['Gender'] = star_wars['Gender'].map(male_female)
star_wars['Age'] = star_wars['Age'].map(ages)
star_wars['Household Income'] = star_wars['Household Income'].map(income)  # Upper bound of each income bracket
star_wars['Education'] = star_wars['Education'].map(education)

name_clean

sta_wars_names = star_wars_drop.rename(columns={'Have you seen any of the 6 films in the Star Wars franchise?': 'Seen_any_film','Do you consider yourself to be a fan of the Star Wars film franchise?':'Are_you_fan','Do you consider yourself to be a fan of the Expanded Universe?æ':'fan_expanded_universe','Do you consider yourself to be a fan of the Star Trek franchise?':'fan_star_trek','Household Income':'Household_Income','Location (Census Region)':'location'})


sta_wars_names.head(5)

	RespondentID	Seen_any_film	Are_you_fan	seen_1	seen_2	seen_3	seen_4	seen_5	seen_6	ranking_1	ranking_2	ranking_3	ranking_4	ranking_5	ranking_6	fan_expanded_universe	fan_star_trek	Gender	Age	Household_Income	Education	location
1	3.292880e+09	True	True	True	True	True	True	True	True	3.0	2.0	1.0	4.0	5.0	6.0	False	False	1.0	1.0	NaN	NaN	South Atlantic
2	3.292880e+09	False	NaN	False	False	False	False	False	False	NaN	NaN	NaN	NaN	NaN	NaN	NaN	True	1.0	1.0	24999.0	4.0	West South Central
3	3.292765e+09	True	False	True	True	True	False	False	False	1.0	2.0	3.0	4.0	5.0	6.0	NaN	False	1.0	1.0	24999.0	NaN	West North Central
4	3.292763e+09	True	True	True	True	True	True	True	True	5.0	6.0	1.0	2.0	4.0	3.0	NaN	True	1.0	1.0	149999.0	3.0	West North Central
5	3.292731e+09	True	True	True	True	True	True	True	True	5.0	4.0	6.0	2.0	1.0	3.0	False	False	1.0	1.0	149999.0	3.0	West North Central
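Question 1 asks for a table that shows how the names were fixed. One way to produce it, sketched here, is to turn a rename mapping into a two-column DataFrame. The rename_table helper is hypothetical, and only a few of the mappings from this report are repeated to keep the example short.

```python
import pandas as pd

def rename_table(mapping):
    """Turn an {old_name: new_name} dict into a two-column lookup table."""
    return pd.DataFrame(
        {"original_name": list(mapping.keys()), "new_name": list(mapping.values())}
    )

# A subset of the mappings used in this report.
example_map = {
    "Have you seen any of the 6 films in the Star Wars franchise?": "Seen_any_film",
    "Do you consider yourself to be a fan of the Star Wars film franchise?": "Are_you_fan",
    "Unnamed: 4": "seen_2",
    "Unnamed: 10": "ranking_2",
    "Household Income": "Household_Income",
    "Location (Census Region)": "location",
}

names = rename_table(example_map)
print(names.to_string(index=False))
```

Merging cols_seen, cols_rank, and the final rename dictionary into one mapping before calling the helper would list every renamed column in a single table.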

Question 2

Clean and format the data so that it can be used in a machine learning model. As you format the data, you should complete each item listed below. In your final report provide example(s) of the reformatted data with a short description of the changes made.

- Filter the dataset to respondents that have seen at least one film
- Create a new column that converts the age ranges to a single number. Drop the age range categorical column
- Create a new column that converts the education groupings to a single number. Drop the school categorical column
- Create a new column that converts the income ranges to a single number. Drop the income range categorical column
- Create your target (also known as “y” or “label”) column based on the new income range column
- One-hot encode all remaining categorical columns

seen_one_film

#Filter the dataset to respondents that have seen at least one film
sta_wars_names['seen_any_real'] = sta_wars_names[['seen_1', 'seen_2', 'seen_3', 'seen_4', 'seen_5', 'seen_6']].any(axis=1)
filtered_df = sta_wars_names[sta_wars_names['seen_any_real'] == True]
filtered_df.head(5)

	RespondentID	Seen_any_film	Are_you_fan	seen_1	seen_2	seen_3	seen_4	seen_5	seen_6	ranking_1	ranking_2	ranking_3	ranking_4	ranking_5	ranking_6	fan_expanded_universe	fan_star_trek	Gender	Age	Household_Income	Education	location	seen_any_real
1	3.292880e+09	True	True	True	True	True	True	True	True	3.0	2.0	1.0	4.0	5.0	6.0	False	False	1.0	1.0	NaN	NaN	South Atlantic	True
3	3.292765e+09	True	False	True	True	True	False	False	False	1.0	2.0	3.0	4.0	5.0	6.0	NaN	False	1.0	1.0	24999.0	NaN	West North Central	True
4	3.292763e+09	True	True	True	True	True	True	True	True	5.0	6.0	1.0	2.0	4.0	3.0	NaN	True	1.0	1.0	149999.0	3.0	West North Central	True
5	3.292731e+09	True	True	True	True	True	True	True	True	5.0	4.0	6.0	2.0	1.0	3.0	False	False	1.0	1.0	149999.0	3.0	West North Central	True
6	3.292719e+09	True	True	True	True	True	True	True	True	1.0	4.0	3.0	6.0	5.0	2.0	False	True	1.0	1.0	49999.0	4.0	Middle Atlantic	True

Items 2, 3, and 4 of the list above (the age, education, and income conversions) were completed in the earlier code blocks labeled cleaning_1, hot_encoding1, and hot_encoding.

target_y

filtered_df = filtered_df.rename(columns={'Household_Income': 'y'})  # Rename the column
filtered_df['y_target'] = filtered_df['y']  # Assign the renamed column to 'y_target'
filtered_df.head(5)

	RespondentID	Seen_any_film	Are_you_fan	seen_1	seen_2	seen_3	seen_4	seen_5	seen_6	ranking_1	ranking_2	ranking_3	ranking_4	ranking_5	ranking_6	fan_expanded_universe	fan_star_trek	Gender	Age	y	Education	location	seen_any_real	y_target
1	3.292880e+09	True	True	True	True	True	True	True	True	3.0	2.0	1.0	4.0	5.0	6.0	False	False	1.0	1.0	NaN	NaN	South Atlantic	True	NaN
3	3.292765e+09	True	False	True	True	True	False	False	False	1.0	2.0	3.0	4.0	5.0	6.0	NaN	False	1.0	1.0	24999.0	NaN	West North Central	True	24999.0
4	3.292763e+09	True	True	True	True	True	True	True	True	5.0	6.0	1.0	2.0	4.0	3.0	NaN	True	1.0	1.0	149999.0	3.0	West North Central	True	149999.0
5	3.292731e+09	True	True	True	True	True	True	True	True	5.0	4.0	6.0	2.0	1.0	3.0	False	False	1.0	1.0	149999.0	3.0	West North Central	True	149999.0
6	3.292719e+09	True	True	True	True	True	True	True	True	1.0	4.0	3.0	6.0	5.0	2.0	False	True	1.0	1.0	49999.0	4.0	Middle Atlantic	True	49999.0
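The y_target column above still holds the income figure itself. A binary label and the one-hot step from the task list could look like the sketch below, which runs on toy data. It assumes the column names produced earlier (y, location) and the $50,000 threshold from the client request; add_target_and_dummies is a hypothetical helper. Rows with unknown income keep a missing label rather than being counted as 0.

```python
import numpy as np
import pandas as pd

def add_target_and_dummies(df, income_col="y", cat_cols=("location",)):
    """Add a binary >$50k label and one-hot encode the remaining categoricals."""
    out = df.copy()
    # 1.0 if income is above $50k, 0.0 otherwise, NaN where income is unknown.
    above = (out[income_col] > 50000).astype(float)
    out["y_target"] = above.where(out[income_col].notna())
    # One indicator column per category; rows with a missing category get all zeros.
    return pd.get_dummies(out, columns=list(cat_cols), dtype=int)

toy = pd.DataFrame({
    "y": [24999.0, 99999.0, np.nan],
    "location": ["Pacific", "Mountain", "Pacific"],
})
encoded = add_target_and_dummies(toy)
print(encoded)
```

Keeping the label missing where income is missing makes it possible to drop those rows before training instead of treating them as low earners.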

Question 3

Validate that the data provided on GitHub lines up with the article by recreating 2 of the visuals from the article.

question_3

# Reshape the seen_* columns from wide to long (one row per respondent and film)
df_melted_q1 = filtered_df.melt(
    id_vars=['RespondentID'],
    value_vars=['seen_1', 'seen_2', 'seen_3', 'seen_4', 'seen_5', 'seen_6'],
    var_name='movies',
    value_name= 'test'
)

question_3-1

grouped_counts = df_melted_q1[df_melted_q1['test'] == True].groupby('movies')['RespondentID'].count().reset_index()
grouped_counts.columns = ['movies', 'count']

question_3-2

# Share of the 835 respondents who have seen at least one film
grouped_counts['percentage'] = ((grouped_counts['count'] / 835) * 100).round(0)
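The 835 in the denominator is the number of respondents who have seen at least one film. Both that count and the per-film percentages can be derived directly from the boolean seen_* columns, as in this sketch on toy data. The seen_share helper is hypothetical and is not the code that produced the chart.

```python
import pandas as pd

def seen_share(df, seen_cols):
    """Percent of respondents, among those who saw any film, who saw each film."""
    seen = df[seen_cols].astype(bool)
    watchers = seen[seen.any(axis=1)]  # drop respondents who saw nothing
    return len(watchers), (watchers.mean() * 100).round(0)

toy = pd.DataFrame({
    "seen_1": [True, False, True, False],
    "seen_2": [True, True, False, False],
})
n_watchers, share = seen_share(toy, ["seen_1", "seen_2"])
print(n_watchers)
print(share)
```

Computing the denominator this way keeps the percentages correct if the filtering rules change, since no respondent count is typed in by hand.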

question_graph2

from plotnine import ggplot, aes, geom_bar, labs, geom_text, theme, element_text

plot = (
    ggplot(grouped_counts, aes(x='movies', y='percentage')) +
    geom_bar(stat='identity', fill='darkblue') +
    geom_text(aes(label='percentage'), va='bottom', ha='center', color='Black', size=10) +  # Adding percentage labels
    labs(
        title='Unique Movies Seen by Respondents',
        x='Movies',
        y='Percentage of Respondents'
    ) +
    theme(
        axis_text_x=element_text(rotation=90, hjust=1),  # Rotate x-axis labels
        plot_title=element_text(size=16, face='bold'),
        plot_subtitle=element_text(size=12)
    )
)

print(plot)
plot.save('plot2.png')

<ggplot: (672 x 480)>

que_graph-2

sta_wars_names['seen_all_true'] = sta_wars_names[['seen_1', 'seen_2', 'seen_3', 'seen_4', 'seen_5', 'seen_6']].all(axis=1)

filtered_df = sta_wars_names[sta_wars_names['seen_all_true'] == True]
#filtered_df.count()

import pandas as pd

# Melt seen columns
df_meltedq = filtered_df.melt(
    id_vars=['RespondentID'],
    value_vars=['seen_1', 'seen_2', 'seen_3', 'seen_4', 'seen_5', 'seen_6'],
    var_name='movies',
    value_name='test'

)
# Melt ranking columns
df_meltedq1 = filtered_df.melt(
    id_vars=['RespondentID'],
    value_vars=['ranking_1', 'ranking_2', 'ranking_3', 'ranking_4', 'ranking_5', 'ranking_6'],
    var_name='movies',
    value_name='ranking'
)

# Extract the numeric part of the 'movies' column
df_meltedq['movies'] = df_meltedq['movies'].str.extract(r'(\d+)', expand=False)
df_meltedq1['movies'] = df_meltedq1['movies'].str.extract(r'(\d+)', expand=False)

# Merge on RespondentID and movies
result = pd.merge(df_meltedq, df_meltedq1, on=['RespondentID', 'movies'])

#print("\nMerged Melted DataFrame:")
#print(result)


# Keep each respondent's favorite film (a ranking of 1 means most preferred)
filtered = result[result['ranking'] == 1].copy()

#print("\nFiltered DataFrame (Ranking = 1):")
#print(filtered)


from plotnine import ggplot, aes, geom_bar, labs, theme_minimal

# Column suffixes follow episode order: 1 = Episode I, ..., 6 = Episode VI
name_movies={
    '1': 'The Phantom Menace',
    '2': 'Attack of the Clones',
    '3': 'Revenge of the Sith',
    '4': 'A New Hope',
    '5': 'The Empire Strikes Back',
    '6': 'Return of the Jedi'
}

filtered['movies'] = filtered['movies'].map(name_movies)

grouped_counts = filtered.groupby('movies')['ranking'].count().reset_index()
grouped_counts.columns = ['movies', 'count']

# Share of the 471 respondents who have seen all six films
grouped_counts['percentage'] = ((grouped_counts['count'] / 471) * 100).round(0)

question_graph3

from plotnine import ggplot, aes, geom_bar, labs, geom_text, theme, element_text

plot = (
    ggplot(grouped_counts, aes(x='movies', y='percentage')) +
    geom_bar(stat='identity', fill='darkblue') +
    geom_text(aes(label='percentage'), va='bottom', ha='center', color='Black', size=10) +  # Adding percentage labels
    labs(
        title='What is the best Star Wars movie?',
        subtitle='Of 471 respondents who have seen all 6 movies',
        x='Movies',
        y='Percentage of Respondents'
    ) +
    theme(
        axis_text_x=element_text(rotation=90, hjust=1),  # Rotate x-axis labels
        plot_title=element_text(size=16, face='bold'),
        plot_subtitle=element_text(size=12)
    )
)

print(plot)
plot.save('plot1.png')

<ggplot: (672 x 480)>

Question 4

Build a machine learning model that predicts whether a person makes more than $50k. Describe your model and report the accuracy.

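I used a classification model to predict whether a person makes more than $50k. The model is a random forest classifier (RandomForestClassifier with 100 trees), trained on 70% of the respondents and evaluated on the remaining 30%. It reaches 100.00% accuracy on the training set and 80.62% on the test set.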

training

# Label incomes above $50k as '1'; everything else, including missing incomes, becomes '0'
sta_wars_names['ml_prep'] = sta_wars_names['Household_Income'].apply(lambda x: '1' if x > 50000 else '0')

sta_wars_names.head(5)

	RespondentID	Seen_any_film	Are_you_fan	seen_1	seen_2	seen_3	seen_4	seen_5	seen_6	ranking_1	ranking_2	ranking_3	ranking_4	ranking_5	ranking_6	fan_expanded_universe	fan_star_trek	Gender	Age	Household_Income	Education	location	seen_any_real	seen_all_true	ml_prep
1	3.292880e+09	True	True	True	True	True	True	True	True	3.0	2.0	1.0	4.0	5.0	6.0	False	False	1.0	1.0	NaN	NaN	South Atlantic	True	True	0
2	3.292880e+09	False	NaN	False	False	False	False	False	False	NaN	NaN	NaN	NaN	NaN	NaN	NaN	True	1.0	1.0	24999.0	4.0	West South Central	False	False	0
3	3.292765e+09	True	False	True	True	True	False	False	False	1.0	2.0	3.0	4.0	5.0	6.0	NaN	False	1.0	1.0	24999.0	NaN	West North Central	True	False	0
4	3.292763e+09	True	True	True	True	True	True	True	True	5.0	6.0	1.0	2.0	4.0	3.0	NaN	True	1.0	1.0	149999.0	3.0	West North Central	True	True	1
5	3.292731e+09	True	True	True	True	True	True	True	True	5.0	4.0	6.0	2.0	1.0	3.0	False	False	1.0	1.0	149999.0	3.0	West North Central	True	True	1

replace

for column in sta_wars_names.columns:
    most_common_value = sta_wars_names[column].mode()[0]  # Get the mode (most frequent value) of the column
    # Assign the result back instead of calling fillna(inplace=True) on a column selection
    sta_wars_names[column] = sta_wars_names[column].fillna(most_common_value)

# replace nan values
#star_wars_names['Household_Income'] = star_wars_names['Household_Income'].fillna(star_wars_names['Household_Income'].median())

building_model

X = sta_wars_names[['RespondentID', 'Seen_any_film', 'seen_1', 'seen_2',
       'seen_3', 'seen_4', 'seen_5', 'seen_6', 'ranking_1', 'ranking_2',
       'ranking_3', 'ranking_4', 'ranking_5', 'ranking_6',
        'Gender', 'Age',
       'Household_Income', 'Education']]

y=sta_wars_names['ml_prep']

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
          X, y, test_size=0.3, random_state=1)
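One caveat about the feature matrix: ml_prep is computed from Household_Income, so keeping that column in X hands the model information about the answer, and RespondentID is an identifier rather than a survey response. A minimal sketch of a split-ready selection that leaves both out is shown on toy data. The split_features helper and its leak_cols default are assumptions for illustration, not the configuration that produced the results reported below.

```python
import pandas as pd

def split_features(df, target_col="ml_prep",
                   leak_cols=("RespondentID", "Household_Income")):
    """Return (X, y) with the target and any leaky or identifier columns removed."""
    drop = [c for c in (target_col, *leak_cols) if c in df.columns]
    X = df.drop(columns=drop)
    y = df[target_col]
    return X, y

toy = pd.DataFrame({
    "RespondentID": [1, 2, 3, 4],
    "Age": [1, 2, 3, 4],
    "Education": [2, 3, 4, 5],
    "Household_Income": [24999, 49999, 99999, 149999],
    "ml_prep": ["0", "0", "1", "1"],
})
X, y = split_features(toy)
print(list(X.columns))
```

The resulting X and y can be passed to train_test_split in the same way as above. Accuracy measured on such a split reflects only what the Star Wars answers and demographics can predict.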

function

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

def print_score(clf, X_train, y_train, X_test, y_test, train=True):
    if train:
        pred = clf.predict(X_train)
        clf_report = pd.DataFrame(classification_report(y_train, pred, output_dict=True))
        print("Train Result:\n================================================")
        print(f"Accuracy Score: {accuracy_score(y_train, pred) * 100:.2f}%")
        print("_______________________________________________")
        print(f"CLASSIFICATION REPORT:\n{clf_report}")
        print("_______________________________________________")
        print(f"Confusion Matrix: \n {confusion_matrix(y_train, pred)}\n")
        
    else:
        pred = clf.predict(X_test)
        clf_report = pd.DataFrame(classification_report(y_test, pred, output_dict=True))
        print("Test Result:\n================================================")        
        print(f"Accuracy Score: {accuracy_score(y_test, pred) * 100:.2f}%")
        print("_______________________________________________")
        print(f"CLASSIFICATION REPORT:\n{clf_report}")
        print("_______________________________________________")
        print(f"Confusion Matrix: \n {confusion_matrix(y_test, pred)}\n")

running_model1

'''
from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier(
    n_neighbors=2,  # Number of nearest neighbors to consider (default is 5)
    weights='uniform',  # Can be 'uniform' or 'distance' (closer neighbors have more influence)
    algorithm='auto',  # 'auto', 'ball_tree', 'kd_tree', or 'brute'
    p=2,  # Power parameter for the Minkowski metric (2 = Euclidean distance)
    metric='minkowski'  # Distance metric to use
)



knn_clf.fit(X_train, y_train)
'''
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

print_score(clf, X_train, y_train, X_test, y_test, train=True)
print_score(clf, X_train, y_train, X_test, y_test, train=False)

Train Result:
================================================
Accuracy Score: 100.00%
_______________________________________________
CLASSIFICATION REPORT:
               0      1  accuracy  macro avg  weighted avg
precision    1.0    1.0       1.0        1.0           1.0
recall       1.0    1.0       1.0        1.0           1.0
f1-score     1.0    1.0       1.0        1.0           1.0
support    459.0  371.0       1.0      830.0         830.0
_______________________________________________
Confusion Matrix: 
 [[459   0]
 [  0 371]]

Test Result:
================================================
Accuracy Score: 80.62%
_______________________________________________
CLASSIFICATION REPORT:
                    0           1  accuracy   macro avg  weighted avg
precision    0.829787    0.779762   0.80618    0.804775      0.806882
recall       0.808290    0.803681   0.80618    0.805986      0.806180
f1-score     0.818898    0.791541   0.80618    0.805219      0.806372
support    193.000000  163.000000   0.80618  356.000000    356.000000
_______________________________________________
Confusion Matrix: 
 [[156  37]
 [ 32 131]]
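The test metrics can be checked by hand from the confusion matrix above. The sketch below recomputes accuracy, precision, and recall for class 0, and adds a majority-class baseline (always predicting the more common class) for comparison. It assumes the scikit-learn convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Test-set confusion matrix from the report: rows = true class, columns = predicted class
cm = np.array([[156, 37],
               [32, 131]])

accuracy = np.trace(cm) / cm.sum()           # (156 + 131) / 356
precision_0 = cm[0, 0] / cm[:, 0].sum()      # 156 / (156 + 32)
recall_0 = cm[0, 0] / cm[0, :].sum()         # 156 / (156 + 37)
baseline = cm.sum(axis=1).max() / cm.sum()   # 193 / 356, always predicting class 0

print(f"accuracy={accuracy:.4f} precision_0={precision_0:.4f} "
      f"recall_0={recall_0:.4f} baseline={baseline:.4f}")
```

The recomputed values match the classification report (0.8062, 0.8298, 0.8083), and the baseline of about 0.54 gives a reference point for judging the 80.62% test accuracy.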