Hey Habr!
Today we will work on the skill of using data grouping and visualization tools in Python. In the provided
By tradition, at the beginning, we define the goals:
- Group data by sex and year and visualize the overall dynamics of the birth rate of both sexes;
- Find the most popular names in history;
- Divide the entire time period in the data into 10 parts and for each find the most popular name of each gender. For each name found, visualize its dynamics over time;
- For each year, calculate how many names are covered by 50% of the people and visualize (we will see the variety of names for each year);
- Select 4 years from the total interval and display for each year the distribution by the first letter in the name and by the last letter in the name;
- Make a list of several famous people (presidents, singers, actors, movie heroes) and evaluate their influence on the dynamics of names. Build visualization.
Less words, more code!
Let's group the data by sex and year and visualize the overall dynamics of the birth rate of both sexes:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
years = np.arange(1880, 2011, 3)
datalist = 'https://raw.githubusercontent.com/wesm/pydata-book/2nd-edition/datasets/babynames/yob{year}.txt'
dataframes = []
for year in years:
dataset = datalist.format(year=year)
dataframe = pd.read_csv(dataset, names=['name', 'sex', 'count'])
dataframes.append(dataframe.assign(year=year))
result = pd.concat(dataframes)
sex = result.groupby('sex')
births_men = sex.get_group('M').groupby('year', as_index=False)
births_women = sex.get_group('F').groupby('year', as_index=False)
births_men_list = births_men.aggregate(np.sum)['count'].tolist()
births_women_list = births_women.aggregate(np.sum)['count'].tolist()
fig, ax = plt.subplots()
fig.set_size_inches(25,15)
index = np.arange(len(years))
stolb1 = ax.bar(index, births_men_list, 0.4, color='c', label='ΠΡΠΆΡΠΈΠ½Ρ')
stolb2 = ax.bar(index + 0.4, births_women_list, 0.4, alpha=0.8, color='r', label='ΠΠ΅Π½ΡΠΈΠ½Ρ')
ax.set_title('Π ΠΎΠΆΠ΄Π°Π΅ΠΌΠΎΡΡΡ ΠΏΠΎ ΠΏΠΎΠ»Ρ ΠΈ Π³ΠΎΠ΄Π°ΠΌ')
ax.set_xlabel('ΠΠΎΠ΄Π°')
ax.set_ylabel('Π ΠΎΠΆΠ΄Π°Π΅ΠΌΠΎΡΡΡ')
ax.set_xticklabels(years)
ax.set_xticks(index + 0.4)
ax.legend(loc=9)
fig.tight_layout()
plt.show()
Let's find the most popular names in history:
years = np.arange(1880, 2011)
dataframes = []
for year in years:
dataset = datalist.format(year=year)
dataframe = pd.read_csv(dataset, names=['name', 'sex', 'count'])
dataframes.append(dataframe)
result = pd.concat(dataframes)
names = result.groupby('name', as_index=False).sum().sort_values('count', ascending=False)
names.head(10)
Let's divide the entire time period in the data into 10 parts and for each we find the most popular name of each gender. For each found name, we visualize its dynamics over the entire time:
years = np.arange(1880, 2011)
part_size = int((years[years.size - 1] - years[0]) / 10) + 1
parts = {}
def GetPart(year):
return int((year - years[0]) / part_size)
for year in years:
index = GetPart(year)
r = years[0] + part_size * index, min(years[years.size - 1], years[0] + part_size * (index + 1))
parts[index] = str(r[0]) + '-' + str(r[1])
dataframe_parts = []
dataframes = []
for year in years:
dataset = datalist.format(year=year)
dataframe = pd.read_csv(dataset, names=['name', 'sex', 'count'])
dataframe_parts.append(dataframe.assign(years=parts[GetPart(year)]))
dataframes.append(dataframe.assign(year=year))
result_parts = pd.concat(dataframe_parts)
result = pd.concat(dataframes)
result_parts_sums = result_parts.groupby(['years', 'sex', 'name'], as_index=False).sum()
result_parts_names = result_parts_sums.iloc[result_parts_sums.groupby(['years', 'sex'], as_index=False).apply(lambda x: x['count'].idxmax())]
result_sums = result.groupby(['year', 'sex', 'name'], as_index=False).sum()
for groupName, groupLabels in result_parts_names.groupby(['name', 'sex']).groups.items():
group = result_sums.groupby(['name', 'sex']).get_group(groupName)
fig, ax = plt.subplots(1, 1, figsize=(18,10))
ax.set_xlabel('ΠΠΎΠ΄Π°')
ax.set_ylabel('Π ΠΎΠΆΠ΄Π°Π΅ΠΌΠΎΡΡΡ')
label = group['name']
ax.plot(group['year'], group['count'], label=label.aggregate(np.max), color='b', ls='-')
ax.legend(loc=9, fontsize=11)
plt.show()
For each year, we calculate how many names are covered by 50% of people and visualize this data:
dataframe = pd.DataFrame({'year': [], 'count': []})
years = np.arange(1880, 2011)
for year in years:
dataset = datalist.format(year=year)
csv = pd.read_csv(dataset, names=['name', 'sex', 'count'])
names = csv.groupby('name', as_index=False).aggregate(np.sum)
names['sum'] = names.sum()['count']
names['percent'] = names['count'] / names['sum'] * 100
names = names.sort_values(['percent'], ascending=False)
names['cum_perc'] = names['percent'].cumsum()
names_filtered = names[names['cum_perc'] <= 50]
dataframe = dataframe.append(pd.DataFrame({'year': [year], 'count': [names_filtered.shape[0]]}))
fig, ax1 = plt.subplots(1, 1, figsize=(22,13))
ax1.set_xlabel('ΠΠΎΠ΄Π°', fontsize = 12)
ax1.set_ylabel('Π Π°Π·Π½ΠΎΠΎΠ±ΡΠ°Π·ΠΈΠ΅ ΠΈΠΌΠ΅Π½', fontsize = 12)
ax1.plot(dataframe['year'], dataframe['count'], color='r', ls='-')
ax1.legend(loc=9, fontsize=12)
plt.show()
Let's select 4 years from the entire interval and display for each year the distribution by the first letter in the name and by the last letter in the name:
from string import ascii_lowercase, ascii_uppercase
fig_first, ax_first = plt.subplots(1, 1, figsize=(14,10))
fig_last, ax_last = plt.subplots(1, 1, figsize=(14,10))
index = np.arange(len(ascii_uppercase))
years = [1944, 1978, 1991, 2003]
colors = ['r', 'g', 'b', 'y']
n = 0
for year in years:
dataset = datalist.format(year=year)
csv = pd.read_csv(dataset, names=['name', 'sex', 'count'])
names = csv.groupby('name', as_index=False).aggregate(np.sum)
count = names.shape[0]
dataframe = pd.DataFrame({'letter': [], 'frequency_first': [], 'frequency_last': []})
for letter in ascii_uppercase:
countFirst = (names[names.name.str.startswith(letter)].count()['count'])
countLast = (names[names.name.str.endswith(letter.lower())].count()['count'])
dataframe = dataframe.append(pd.DataFrame({
'letter': [letter],
'frequency_first': [countFirst / count * 100],
'frequency_last': [countLast / count * 100]}))
ax_first.bar(index + 0.3 * n, dataframe['frequency_first'], 0.3, alpha=0.5, color=colors[n], label=year)
ax_last.bar(index + bar_width * n, dataframe['frequency_last'], 0.3, alpha=0.5, color=colors[n], label=year)
n += 1
ax_first.set_xlabel('ΠΡΠΊΠ²Π° Π°Π»ΡΠ°Π²ΠΈΡΠ°')
ax_first.set_ylabel('Π§Π°ΡΡΠΎΡΠ°, %')
ax_first.set_title('ΠΠ΅ΡΠ²Π°Ρ Π±ΡΠΊΠ²Π° Π² ΠΈΠΌΠ΅Π½ΠΈ')
ax_first.set_xticks(index)
ax_first.set_xticklabels(ascii_uppercase)
ax_first.legend()
ax_last.set_xlabel('ΠΡΠΊΠ²Π° Π°Π»ΡΠ°Π²ΠΈΡΠ°')
ax_last.set_ylabel('Π§Π°ΡΡΠΎΡΠ°, %')
ax_last.set_title('ΠΠΎΡΠ»Π΅Π΄Π½ΡΡ Π±ΡΠΊΠ²Π° Π² ΠΈΠΌΠ΅Π½ΠΈ')
ax_last.set_xticks(index)
ax_last.set_xticklabels(ascii_uppercase)
ax_last.legend()
fig_first.tight_layout()
fig_last.tight_layout()
plt.show()
Let's make a list of several famous people (presidents, singers, actors, movie heroes) and evaluate their influence on the dynamics of names:
celebrities = {'Frank': 'M', 'Britney': 'F', 'Madonna': 'F', 'Bob': 'M'}
dataframes = []
for year in years:
dataset = datalist.format(year=year)
dataframe = pd.read_csv(dataset, names=['name', 'sex', 'count'])
dataframes.append(dataframe.assign(year=year))
result = pd.concat(dataframes)
for celebrity, sex in celebrities.items():
names = result[result.name == celebrity]
dataframe = names[names.sex == sex]
fig, ax = plt.subplots(1, 1, figsize=(16,8))
ax.set_xlabel('ΠΠΎΠ΄Π°', fontsize = 10)
ax.set_ylabel('Π ΠΎΠΆΠ΄Π°Π΅ΠΌΠΎΡΡΡ', fontsize = 10)
ax.plot(dataframe['year'], dataframe['count'], label=celebrity, color='r', ls='-')
ax.legend(loc=9, fontsize=12)
plt.show()
For practice, you can add the period of a celebrity's life to the visualization from the last example in order to visually evaluate their influence on the dynamics of names.
On this, all our goals were achieved and fulfilled. We have mastered the use of data grouping and visualization tools in Python, and we will work with data further. Everyone can draw conclusions on ready-made, visualized data.
All knowledge!
Source: habr.com