import pandas as pd
import numpy as np
import plotly.express as px
from lets_plot import *
LetsPlot.setup_html()
from datetime import datetime
Course DS 250
Gabriel Guerrero
Early in prehistory, some descriptive names began to be used again and again until they formed a name pool for a particular culture. Parents would choose names from the pool of existing names rather than invent new ones for their children.
With the rise of Christianity, certain trends in naming practices manifested. Christians were encouraged to name their children after saints and martyrs of the church. These early Christian names can be found in many cultures today, in various forms. These were spread by early missionaries throughout the Mediterranean basin and Europe.
By the Middle Ages, the Christian influence on naming practices was pervasive. Each culture had its pool of names, which were a combination of native names and early Christian names that had been in the language long enough to be considered native.
# Data source: local copy (commented out) or the online CSV
# df = pd.read_csv(r"C:\Users\Gabriel Guerrero\OneDrive - AVASA\BYU-I\DS 250 Data Coding\db-projects\names_year.csv")
df = pd.read_csv('https://raw.githubusercontent.com/byuidatascience/data4names/refs/heads/master/data-raw/names_year/names_year.csv')
This project explores how the use of a name changes over time. The data are grouped by name and year, and the counts are summed to show each name's use over time.
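The group-and-sum step described above can be sketched as follows. This is a minimal illustration on a tiny made-up table, not the project's data. The column names `name`, `year`, and `Total` mirror the ones used in this report, and the lowercase `total` output column is an assumption based on how the charts reference it.

```python
import pandas as pd

# Toy data with made-up counts, using the columns this report works with
toy = pd.DataFrame({
    "name": ["Ana", "Ana", "Ana", "Ben"],
    "year": [2000, 2000, 2001, 2000],
    "Total": [10, 5, 7, 3],
})

# Group by name and year, then sum the counts into a 'total' column
toy_grouped = (
    toy.groupby(["name", "year"], as_index=False)
       .agg(total=("Total", "sum"))
)
print(toy_grouped)
```

The two rows for Ana in 2000 collapse into a single row, so each name and year pair appears once in the result.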
For Project 1, the answer to each question should include a chart and a written response. The year labels on your charts should not include a comma. At least two of your charts must include reference marks.
How does your name at your birth year compare to its use historically?
The following graph shows that the use of my name has increased over the years, from fewer than 1,000 to more than 10,000 in 40 years.
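# df_grouped is not defined in the code shown; it is assumed here to hold the yearly totals per name
df_grouped = df.groupby(['name', 'year'], as_index=False).agg(total=('Total', 'sum'))
df_grouped1 = df_grouped[df_grouped['name'] == 'Gabriel']
trend_text = 'Increasing, accelerating \n after 1980'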
(
ggplot(df_grouped1, aes(x='year', y='total', color='name')) +
geom_line(color='blue') +
geom_point(color='blue') +
labs(title='Names from 1900 to 2022',
subtitle='Use of Gabriel is increasing',
x='Year',
y='Total Numbers') +
theme_bw() + # Base theme
# Customized x-axis
scale_x_continuous(breaks=np.arange(1900, 2030, 20).astype(int),
limits=(1900, 2020)) + # Set start and end for x-axis
# Customized y-axis
scale_y_continuous(breaks=np.arange(0, int(df_grouped1['total'].max()) + 1, 2000),
limits=(0, int(df_grouped1['total'].max()))) + # Set start and end for y-axis
# Title and subtitle styling
theme(plot_title=element_text(color='black', size=18, face='bold'),
plot_subtitle=element_text(color='blue', size=14, face='italic')) + # Subtitle styling
geom_label(x=1980, y=6000, label=trend_text, color="black", size=6, fill='#D3D3D3')
)
Source: Names from years csv
If you talked to someone named Brittany on the phone, what is your guess of his or her age? What ages would you not guess?
Based on the graph, I would guess that I am talking to a person between 20 and 30 years old. I would not guess that I am talking to a person over 40 or under 10 years old.
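The reasoning behind an age guess can be sketched numerically: convert each birth year to an age and weight it by how many babies received the name that year. The counts below are made up for illustration, and `REFERENCE_YEAR` is a hypothetical fixed stand-in for the current year so the result stays reproducible.

```python
import pandas as pd

REFERENCE_YEAR = 2024  # hypothetical "today"; the report itself uses datetime.now().year

# Made-up yearly counts for one name (illustration only)
births = pd.DataFrame({
    "year": [1985, 1990, 1995, 2000],
    "Total": [100, 300, 100, 0],
})

# Age that someone born in each year would have in the reference year
births["age"] = REFERENCE_YEAR - births["year"]

# Count-weighted mean age: years with more births pull the guess toward them
weighted_mean_age = (births["age"] * births["Total"]).sum() / births["Total"].sum()

# Age for the peak birth year, i.e. the single most likely age
peak_age = births.loc[births["Total"].idxmax(), "age"]
print(weighted_mean_age, peak_age)
```

Ages whose birth years have a count near zero are the ones not to guess.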
today = datetime.now() # Get the current date and time
trend_text = "Concentration of Brittany \n between 1980 and 1990"
# Age today of a person born in each year
df['actual_year'] = today.year - df['year']
df2 = df[df['name']=='Brittany']
'''
df_grouped2 = df2.groupby(['name', 'year']).agg(
Total=('Total', 'sum'),
Age=('actual_year', 'mean')
).reset_index()
'''
#keeping name, year
df2 = df2[['name', 'year', 'Total', 'actual_year']]
(
ggplot(df2, aes(x='year', y='actual_year', size='Total')) +
geom_line(color='black') +
geom_point(color='#103d85') +
labs(title='Use of the name Brittany from 1960 to 2010',
subtitle='Highest density in the 1980s and 1990s',
x='Year',
y='Age Today (years)') +
theme_bw() + # Base theme
# Customized x-axis
scale_x_continuous(breaks=np.arange(1960, 2018, 10).astype(int),
limits=(1960, 2020)) + # Set start and end for x-axis
# Customized y-axis
scale_y_continuous(breaks=np.arange(0, int(df2['actual_year'].max()) + 1, 10),
limits=(0, int(df2['actual_year'].max()))) + # Set start and end for y-axis
# Title and subtitle styling
theme(plot_title=element_text(color='black', size=18, face='bold'),
plot_subtitle=element_text(color='#4287f5', size=14, face='italic')
) + # Subtitle styling
# Adding a label
geom_label(x=2005, y=40, label=trend_text, color="black", size=6, fill='#D3D3D3')
#geom_text(data=last_points, aes(x='year', y='total', label=name))
)
Source: Names from years csv
Mary, Martha, Peter, and Paul are all Christian names. From 1920 - 2000, compare the name usage of each of the four names in a single chart. What trends do you notice?
The trend indicates that the use of the names Mary and Paul has declined over the years more than the use of Martha and Peter.
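A comparison like this can be backed with a number by computing the relative change per name between two years. The sketch below uses placeholder names and made-up totals, not values from the dataset; the `total` column name is assumed to match the grouped data used for the charts.

```python
import pandas as pd

# Made-up totals for two placeholder names at two points in time
usage = pd.DataFrame({
    "name": ["A", "A", "B", "B"],
    "year": [1920, 2000, 1920, 2000],
    "total": [1000, 250, 400, 300],
})

# One row per name, one column per year
wide = usage.pivot(index="name", columns="year", values="total")

# Relative change from the first to the last year, in percent
pct_change = (wide[2000] - wide[1920]) / wide[1920] * 100
print(pct_change.round(1))
```

A more negative percentage means a steeper decline, which lets names with very different starting levels be compared on the same scale.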
df_grouped3 = df_grouped[df_grouped['name'].isin(['Mary', 'Martha', 'Peter', 'Paul'])]
#Checking the unique names
# df_grouped3['name'].unique()
# Define your label text
trend_text = "Significant decline \nafter 2000"
last_points = df_grouped3.groupby('name').last().reset_index()
(
ggplot(df_grouped3, aes(x='year', y='total', color='name')) + # Color by 'name'
geom_line() +
geom_line(aes(size=(df_grouped3['name'].map({'Mary': 2, 'Martha': 1, 'Peter': 1, 'Paul': 2})))) +
labs(title='Selected Names Analysis',
subtitle='Mary, Martha, Peter, Paul through the years',
x='Year',
y='Total Numbers') +
# Base theme
theme_bw() +
# Customized x-axis
scale_x_continuous(breaks=np.arange(1970, 2020, 10).astype(int),
limits=(1970, 2015)) + # Set start and end for x-axis
# Customized y-axis
scale_y_continuous(breaks=np.arange(0, int(df_grouped3['total'].max()) + 1, 5000),
limits=(0, 20000)) + # Set start and end for y-axis
# Title and subtitle styling
theme(
plot_title=element_text(color='black', size=20, face='bold'),
plot_subtitle=element_text(color='#4287f5', size=16, face='plain'),
#legend_position="none" # Remove legends,
) +
# Adding a label
geom_label(x=2005, y=8000, label=trend_text, color="black", size=6, fill='#D3D3D3')
)
Source: Names from years csv
Think of a unique name from a famous movie. Plot the usage of that name and see how changes line up with the movie release. Does it look like the movie had an effect on usage?
I picked the name Olivia, after an actress in the movie “Grease”. The movie was released in 1978. The usage of the name Olivia increased after the movie was released, beginning around 1990.
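One simple way to check whether usage changed around a release is to compare the average yearly total before and after the release year. The sketch below uses made-up totals; `RELEASE_YEAR` is taken from the text above, and the column names are assumptions that match the grouped data used for the charts.

```python
import pandas as pd

RELEASE_YEAR = 1978  # release year stated in the text

# Made-up yearly totals (illustration only)
series = pd.DataFrame({
    "year": [1976, 1977, 1978, 1979, 1980],
    "total": [100, 110, 120, 180, 200],
})

# Average usage in the years strictly before and strictly after the release
before = series.loc[series["year"] < RELEASE_YEAR, "total"].mean()
after = series.loc[series["year"] > RELEASE_YEAR, "total"].mean()
print(before, after, after / before)
```

A higher average after the release is consistent with an effect, but a rise that starts many years later may have other causes.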
df_grouped4 = df_grouped[df_grouped['name'].isin(['Olivia'])]
(
ggplot(df_grouped4, aes(x='year', y='total', color='name')) + # Color by 'name'
geom_line()+
labs(title='Unique name from a famous movie Analysis',
subtitle='Movie Grease released in 1978',
x='Year',
y='Total Numbers') +
# Base theme
theme_classic() +
# Customized x-axis
scale_x_continuous(breaks=np.arange(1970, 2020, 10).astype(int),
limits=(1970, 2015)) + # Set start and end for x-axis
# Customized y-axis
scale_y_continuous(breaks=np.arange(0, int(df_grouped4['total'].max()) + 1, 5000),
limits=(0, 20000)) + # Set start and end for y-axis
# Title and subtitle styling
theme(
plot_title=element_text(color='black', size=20, face='bold'),
plot_subtitle=element_text(color='#4287f5', size=16, face='plain'),
#legend_position="none" # Remove legends,
) +
# Adding a label
geom_label(x=2005, y=8000, label="Olivia has increased \nafter 1990", color="black", size=6, fill='#D3D3D3')
)
Source: Names from years csv