Practicing data grouping and visualization skills in Python

Hello, Habr!

Today we will practice using the tools for grouping and visualizing data in Python. Working with the dataset provided on Github, we will analyze several of its characteristics and build a set of visualizations.

By tradition, let's start by stating the goals:

  • Group the data by sex and year and visualize the overall birth dynamics for both sexes;
  • Find the most popular names of all time;
  • Split the whole time span of the data into 10 parts and, for each part, find the most popular name of each sex. For every name found, visualize its dynamics over the whole period;
  • For each year, count how many names cover 50% of the people and visualize the result (this shows the diversity of names in each year);
  • Pick 4 years from the whole interval and, for each of them, show the distribution of names by first letter and by last letter;
  • Make a list of several famous people (presidents, singers, actors, film characters) and assess their influence on the dynamics of names. Build a visualization.

Fewer words, more code!

So, let's go.

Let's group the data by sex and year and visualize the overall birth dynamics for both sexes:

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt

# Every third year from 1880 to 2009
years = np.arange(1880, 2011, 3)
# URL template of the yearly files; each line holds name, sex, count
datalist = 'https://raw.githubusercontent.com/wesm/pydata-book/2nd-edition/datasets/babynames/yob{year}.txt'
dataframes = []
for year in years:
    dataset = datalist.format(year=year)
    dataframe = pd.read_csv(dataset, names=['name', 'sex', 'count'])
    # Tag every row with its year before concatenating
    dataframes.append(dataframe.assign(year=year))

result = pd.concat(dataframes)
sex = result.groupby('sex')
births_men = sex.get_group('M').groupby('year')
births_women = sex.get_group('F').groupby('year')
# Sum only the numeric 'count' column for each year
births_men_list = births_men['count'].sum().tolist()
births_women_list = births_women['count'].sum().tolist()

fig, ax = plt.subplots()
fig.set_size_inches(25,15)

index = np.arange(len(years))
stolb1 = ax.bar(index, births_men_list, 0.4, color='c', label='Men')
stolb2 = ax.bar(index + 0.4, births_women_list, 0.4, alpha=0.8, color='r', label='Women')

ax.set_title('Births by sex and year')
ax.set_xlabel('Year')
ax.set_ylabel('Births')
# Set tick positions first, then their labels; 0.2 centers the tick between the two bars
ax.set_xticks(index + 0.2)
ax.set_xticklabels(years)
ax.legend(loc=9)

fig.tight_layout()
plt.show()
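
The same grouping can also be written as a single table with one column per sex. Below is a minimal sketch on made-up numbers; the `toy` frame and its values are an assumption for illustration, not the real dataset.

```python
import pandas as pd

# Toy stand-in for the concatenated yearly files (values are made up)
toy = pd.DataFrame({
    'name':  ['Mary', 'Anna', 'John', 'Mary', 'John', 'James'],
    'sex':   ['F', 'F', 'M', 'F', 'M', 'M'],
    'count': [70, 30, 90, 80, 60, 50],
    'year':  [1880, 1880, 1880, 1881, 1881, 1881],
})

# One row per year, one column per sex, total births in each cell
births = toy.groupby(['year', 'sex'])['count'].sum().unstack('sex')
print(births)
```

A table of this shape could be handed to `births.plot.bar()` to draw grouped bars without computing the bar offsets by hand.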

Let's find the most popular names in history:

years = np.arange(1880, 2011)

dataframes = []
for year in years:
    dataset = datalist.format(year=year)
    dataframe = pd.read_csv(dataset, names=['name', 'sex', 'count'])
    dataframes.append(dataframe)

result = pd.concat(dataframes)
# Total count per name over all years (both sexes combined), most popular first
names = result.groupby('name', as_index=False)['count'].sum().sort_values('count', ascending=False)
names.head(10)
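
Note that grouping by name alone merges boys and girls who share a name. A small sketch of the difference, using invented counts (the `toy` frame is hypothetical) and `nlargest` as a shortcut for sorting and taking the head:

```python
import pandas as pd

# Made-up counts; 'Leslie' appears for both sexes
toy = pd.DataFrame({
    'name':  ['Leslie', 'Leslie', 'John', 'Mary'],
    'sex':   ['F', 'M', 'M', 'F'],
    'count': [30, 25, 50, 40],
})

# Both sexes merged: Leslie's two rows are added together
combined = toy.groupby('name')['count'].sum().nlargest(2)

# Sexes kept apart: each (name, sex) pair is ranked on its own
by_sex = toy.groupby(['name', 'sex'])['count'].sum().nlargest(2)

print(combined)
print(by_sex)
```

Which of the two rankings is the "right" one depends on the question being asked; the article's code uses the merged variant.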

Let's split the whole time span of the data into 10 parts and find the most popular name of each sex in every part. For each name found, we visualize its dynamics over the whole period:

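years = np.arange(1880, 2011)
# Length of one part in years, chosen so that the whole span splits into 10 parts
part_size = int((years[years.size - 1] - years[0]) / 10) + 1
parts = {}
def GetPart(year):
    # Index (0-9) of the part that a given year falls into
    return int((year - years[0]) / part_size)
for year in years:
    index = GetPart(year)
    # Inclusive first and last year of the part, e.g. '1880-1893'
    r = years[0] + part_size * index, min(years[years.size - 1], years[0] + part_size * (index + 1) - 1)
    parts[index] = str(r[0]) + '-' + str(r[1])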

dataframe_parts = []
dataframes = []
for year in years:
    dataset = datalist.format(year=year)
    dataframe = pd.read_csv(dataset, names=['name', 'sex', 'count'])
    dataframe_parts.append(dataframe.assign(years=parts[GetPart(year)]))
    dataframes.append(dataframe.assign(year=year))
    
result_parts = pd.concat(dataframe_parts)
result = pd.concat(dataframes)

result_parts_sums = result_parts.groupby(['years', 'sex', 'name'], as_index=False).sum()
# Row with the largest count within every (period, sex) pair
result_parts_names = result_parts_sums.loc[result_parts_sums.groupby(['years', 'sex'])['count'].idxmax()]
result_sums = result.groupby(['year', 'sex', 'name'], as_index=False).sum()

# Group the yearly totals once, then plot every name that topped at least one period
groups = result_sums.groupby(['name', 'sex'])
for groupName in result_parts_names.groupby(['name', 'sex']).groups:
    group = groups.get_group(groupName)
    fig, ax = plt.subplots(1, 1, figsize=(18,10))

    ax.set_xlabel('Year')
    ax.set_ylabel('Births')
    ax.plot(group['year'], group['count'], label=groupName[0], color='b', ls='-')
    ax.legend(loc=9, fontsize=11)

    plt.show()
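
To check how the years actually fall into the 10 parts, the part index can be computed for all years at once. This is a sketch of the same arithmetic as `GetPart`, vectorised with NumPy; the names `part_index` and `in_part` are introduced here only for illustration.

```python
import numpy as np

years = np.arange(1880, 2011)
part_size = (years[-1] - years[0]) // 10 + 1   # 14 years per part

# Part index for every year at once (integer division, as in GetPart)
part_index = (years - years[0]) // part_size

# Inclusive first and last year that really belong to each part
for i in np.unique(part_index):
    in_part = years[part_index == i]
    print(i, in_part[0], in_part[-1])
```

With these numbers the first nine parts span 14 years each, while the last one (starting in 2006) holds only the remaining five years, which is worth keeping in mind when comparing totals between parts.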

For each year, we count how many names cover 50% of the people and visualize this data:

rows = []
years = np.arange(1880, 2011)
for year in years:
    dataset = datalist.format(year=year)
    csv = pd.read_csv(dataset, names=['name', 'sex', 'count'])
    # Births per name (both sexes combined), most popular first
    counts = csv.groupby('name')['count'].sum().sort_values(ascending=False)
    # Cumulative share of all births, in percent
    cum_perc = counts.cumsum() / counts.sum() * 100
    # Number of top names whose cumulative share stays within 50%
    rows.append({'year': year, 'count': int((cum_perc <= 50).sum())})

# DataFrame.append was removed in pandas 2.0, so the frame is built from a list of rows
dataframe = pd.DataFrame(rows)

fig, ax1 = plt.subplots(1, 1, figsize=(22,13))
ax1.set_xlabel('Year', fontsize = 12)
ax1.set_ylabel('Name diversity', fontsize = 12)
# The label gives the legend below an entry to show
ax1.plot(dataframe['year'], dataframe['count'], color='r', ls='-', label='Names within 50% of births')
ax1.legend(loc=9, fontsize=12)

plt.show()
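
The "50% of people" threshold can be read in two slightly different ways. A minimal sketch on four invented names (the `counts` values are an assumption for illustration) shows both:

```python
import pandas as pd

# Made-up counts for a single year
counts = pd.Series({'Mary': 40, 'John': 30, 'Anna': 20, 'James': 10})

running = counts.sort_values(ascending=False).cumsum()  # running totals: 40, 70, 90, 100
half = counts.sum() / 2                                  # half of all births: 50.0

# Reading 1 (as in the article's code): top names whose running total stays at or below 50%
within_half = int((running <= half).sum())

# Reading 2: how many top names are needed to reach at least 50%
needed_for_half = int((running < half).sum()) + 1

print(within_half, needed_for_half)
```

On real data with thousands of names the two readings usually differ by at most one, so the shape of the diversity curve stays the same.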

Let's pick 4 years from the whole interval and, for each of them, show the distribution of names by first letter and by last letter:

from string import ascii_uppercase

fig_first, ax_first = plt.subplots(1, 1, figsize=(14,10))
fig_last, ax_last = plt.subplots(1, 1, figsize=(14,10))

index = np.arange(len(ascii_uppercase))
years = [1944, 1978, 1991, 2003]
colors = ['r', 'g', 'b', 'y']
bar_width = 0.2  # four bars per letter fit into one slot without overlapping
for n, year in enumerate(years):
    dataset = datalist.format(year=year)
    csv = pd.read_csv(dataset, names=['name', 'sex', 'count'])
    # Distinct names of the year; the frequencies are shares of distinct names, not of births
    names = csv.groupby('name', as_index=False)['count'].sum()
    count = names.shape[0]

    # Percentage of names starting / ending with each letter
    frequency_first = [names.name.str.startswith(letter).sum() / count * 100 for letter in ascii_uppercase]
    frequency_last = [names.name.str.endswith(letter.lower()).sum() / count * 100 for letter in ascii_uppercase]

    ax_first.bar(index + bar_width * n, frequency_first, bar_width, alpha=0.5, color=colors[n], label=str(year))
    ax_last.bar(index + bar_width * n, frequency_last, bar_width, alpha=0.5, color=colors[n], label=str(year))

ax_first.set_xlabel('Letter of the alphabet')
ax_first.set_ylabel('Frequency, %')
ax_first.set_title('First letter of the name')
ax_first.set_xticks(index)
ax_first.set_xticklabels(ascii_uppercase)
ax_first.legend()

ax_last.set_xlabel('Letter of the alphabet')
ax_last.set_ylabel('Frequency, %')
ax_last.set_title('Last letter of the name')
ax_last.set_xticks(index)
ax_last.set_xticklabels(ascii_uppercase)
ax_last.legend()

fig_first.tight_layout()
fig_last.tight_layout()

plt.show()
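
The letter frequencies can also be obtained without an explicit loop over the alphabet. A short sketch on five invented names (the `names` series is hypothetical), using string slicing and `value_counts`:

```python
from string import ascii_uppercase

import pandas as pd

# Made-up list of distinct names
names = pd.Series(['Mary', 'Maria', 'John', 'Anna', 'James'])

# Share of names per first letter and per (upper-cased) last letter, in percent
first = names.str[0].value_counts(normalize=True) * 100
last = names.str[-1].str.upper().value_counts(normalize=True) * 100

# Align to the full alphabet so that missing letters show up as 0
first_full = first.reindex(list(ascii_uppercase), fill_value=0)
last_full = last.reindex(list(ascii_uppercase), fill_value=0)

print(first_full[first_full > 0])
print(last_full[last_full > 0])
```

As in the article's code, every distinct name counts once here, regardless of how many children received it.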

Let's make a list of several famous people (presidents, singers, actors, film characters) and assess their influence on the dynamics of names:

celebrities = {'Frank': 'M', 'Britney': 'F', 'Madonna': 'F', 'Bob': 'M'}
# Reload the full range: `years` still holds the four years from the previous example
years = np.arange(1880, 2011)
dataframes = []
for year in years:
    dataset = datalist.format(year=year)
    dataframe = pd.read_csv(dataset, names=['name', 'sex', 'count'])
    dataframes.append(dataframe.assign(year=year))

result = pd.concat(dataframes)

for celebrity, sex in celebrities.items():
    names = result[result.name == celebrity]
    dataframe = names[names.sex == sex]
    fig, ax = plt.subplots(1, 1, figsize=(16,8))

    ax.set_xlabel('Year', fontsize = 10)
    ax.set_ylabel('Births', fontsize = 10)
    ax.plot(dataframe['year'], dataframe['count'], label=celebrity, color='r', ls='-')
    ax.legend(loc=9, fontsize=12)
        
    plt.show()

As an exercise, you can add the celebrities' lifetimes to the visualization from the last example to see their influence on the dynamics of names more clearly.

With that, all our goals have been reached. We have practiced using the tools for grouping and visualizing data in Python and will keep working with data. Everyone can draw their own conclusions from the finished, visualized data.

Knowledge to everyone!
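
One possible way to approach this exercise is to shade a period on the plot with `axvspan`. The sketch below uses made-up counts and a placeholder period; `data`, `period_from` and `period_to` are hypothetical and not real biography data.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up yearly counts for one name; in the article this would be the filtered `dataframe`
data = pd.DataFrame({'year': [1990, 1995, 2000, 2005, 2010],
                     'count': [10, 40, 120, 80, 30]})

# Placeholder period to highlight (assumed values for illustration only)
period_from, period_to = 1998, 2004

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(data['year'], data['count'], color='r', label='Example name')
# Shade the chosen period behind the line
span = ax.axvspan(period_from, period_to, color='gray', alpha=0.3, label='Highlighted period')
ax.set_xlabel('Year')
ax.set_ylabel('Births')
ax.legend(loc=9)
```

In an interactive session the backend line can be dropped and `plt.show()` called at the end, as in the other examples.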

Source: www.habr.com
