Practicing grouping and data visualization skills in Python

Hello, Habr!

Today we will practice using grouping and data visualization tools in Python. Using the dataset provided on Github, we will analyze several characteristics and build a set of visualizations.

As usual, let's start by defining the goals:

  • Group the data by sex and year and visualize the overall trend in the number of births for both sexes;
  • Find the most popular names of all time;
  • Split the whole time span in the data into 10 parts and, for each part, find the most popular name of each sex. For every name found, visualize its trend over the whole period;
  • For each year, calculate how many names cover 50% of people and visualize the result (this shows how the variety of names changes from year to year);
  • Pick 4 years from the whole span and, for each of them, plot the distribution of the first letter and of the last letter of names;
  • Make a list of several famous people (presidents, singers, actors, movie characters) and assess their influence on name trends. Build a visualization.

Fewer words, more code!

So, let's go.

Let's group the data by sex and year and visualize the overall trend in the number of births for both sexes:

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt

years = np.arange(1880, 2011, 3)
datalist = 'https://raw.githubusercontent.com/wesm/pydata-book/2nd-edition/datasets/babynames/yob{year}.txt'
dataframes = []
for year in years:
    dataset = datalist.format(year=year)
    dataframe = pd.read_csv(dataset, names=['name', 'sex', 'count'])
    dataframes.append(dataframe.assign(year=year))

result = pd.concat(dataframes)
sex = result.groupby('sex')
births_men = sex.get_group('M').groupby('year')
births_women = sex.get_group('F').groupby('year')
# Sum only the numeric 'count' column; summing the whole frame would also touch the text columns
births_men_list = births_men['count'].sum().tolist()
births_women_list = births_women['count'].sum().tolist()

fig, ax = plt.subplots()
fig.set_size_inches(25,15)

index = np.arange(len(years))
stolb1 = ax.bar(index, births_men_list, 0.4, color='c', label='Men')
stolb2 = ax.bar(index + 0.4, births_women_list, 0.4, alpha=0.8, color='r', label='Women')

ax.set_title('Births by sex and year')
ax.set_xlabel('Year')
ax.set_ylabel('Births')
# Set the tick positions first, then the labels, so every label matches a tick
ax.set_xticks(index + 0.4)
ax.set_xticklabels(years)
ax.legend(loc=9)

fig.tight_layout()
plt.show()
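
As a compact alternative for the same grouping step, the sketch below builds a year-by-sex table with `pivot_table`. The data is a tiny made-up sample in the dataset's (name, sex, count) layout plus a year column, not real figures.

```python
import pandas as pd

# Made-up sample in the same layout as the concatenated `result` frame
sample = pd.DataFrame({
    'name':  ['Mary', 'John', 'Anna', 'Mary', 'John', 'Anna'],
    'sex':   ['F', 'M', 'F', 'F', 'M', 'F'],
    'count': [70, 90, 30, 60, 80, 40],
    'year':  [1880, 1880, 1880, 1881, 1881, 1881],
})

# One row per year, one column per sex, each cell holding the total births
births = sample.pivot_table(index='year', columns='sex', values='count', aggfunc='sum')
print(births)
```

A table shaped like this can be passed straight to `births.plot(kind='bar')` to get grouped bars without computing bar offsets by hand.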

Let's find the most popular names of all time:

years = np.arange(1880, 2011)

dataframes = []
for year in years:
    dataset = datalist.format(year=year)
    dataframe = pd.read_csv(dataset, names=['name', 'sex', 'count'])
    dataframes.append(dataframe)

result = pd.concat(dataframes)
# Sum only 'count'; otherwise the text column 'sex' would be aggregated as well
names = result.groupby('name', as_index=False)['count'].sum().sort_values('count', ascending=False)
names.head(10)
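
The same ranking can be sketched on a small made-up sample with `nlargest`, which avoids sorting the whole table. Note that grouping by `name` alone merges boys and girls who share a name; grouping by `['name', 'sex']` instead would keep them apart.

```python
import pandas as pd

# Made-up sample: 'Mary' appears for both sexes on purpose
sample = pd.DataFrame({
    'name':  ['Mary', 'John', 'Mary', 'Anna', 'John', 'Mary'],
    'sex':   ['F', 'M', 'F', 'F', 'M', 'M'],
    'count': [70, 90, 60, 30, 80, 5],
})

# Total births per name across all rows, then the two largest totals
totals = sample.groupby('name')['count'].sum()
top = totals.nlargest(2)
print(top)
```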

Let's split the whole time span in the data into 10 parts and, for each part, find the most popular name of each sex. For every name found, we visualize its trend over the whole period:

years = np.arange(1880, 2011)
part_size = int((years[years.size - 1] - years[0]) / 10) + 1
parts = {}
def GetPart(year):
    return int((year - years[0]) / part_size)
for year in years:
    index = GetPart(year)
    # The last year of a part is one less than the first year of the next part
    r = years[0] + part_size * index, min(years[years.size - 1], years[0] + part_size * (index + 1) - 1)
    parts[index] = str(r[0]) + '-' + str(r[1])

dataframe_parts = []
dataframes = []
for year in years:
    dataset = datalist.format(year=year)
    dataframe = pd.read_csv(dataset, names=['name', 'sex', 'count'])
    dataframe_parts.append(dataframe.assign(years=parts[GetPart(year)]))
    dataframes.append(dataframe.assign(year=year))
    
result_parts = pd.concat(dataframe_parts)
result = pd.concat(dataframes)

result_parts_sums = result_parts.groupby(['years', 'sex', 'name'], as_index=False)['count'].sum()
# idxmax gives the row label of the largest count within each (period, sex) group
result_parts_names = result_parts_sums.loc[result_parts_sums.groupby(['years', 'sex'])['count'].idxmax()]
result_sums = result.groupby(['year', 'sex', 'name'], as_index=False)['count'].sum()

# Build the per-name lookup once instead of regrouping on every iteration
names_by_sex = result_sums.groupby(['name', 'sex'])
for groupName in result_parts_names.groupby(['name', 'sex']).groups:
    group = names_by_sex.get_group(groupName)
    fig, ax = plt.subplots(1, 1, figsize=(18,10))

    ax.set_xlabel('Year')
    ax.set_ylabel('Births')
    # groupName is a (name, sex) tuple; the name serves as the legend label
    ax.plot(group['year'], group['count'], label=groupName[0], color='b', ls='-')
    ax.legend(loc=9, fontsize=11)

    plt.show()
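
The binning and the "most popular name per part" steps can also be sketched with `pd.cut` and `idxmax`. This is a minimal illustration on made-up rows with only two periods; the bin edges and labels are assumptions chosen for the example, not values from the article.

```python
import pandas as pd

# Made-up sample rows spanning two periods
sample = pd.DataFrame({
    'name':  ['Mary', 'Anna', 'John', 'Mary', 'Linda', 'James'],
    'sex':   ['F', 'F', 'M', 'F', 'F', 'M'],
    'count': [50, 20, 60, 30, 90, 70],
    'year':  [1880, 1885, 1890, 1950, 1952, 1955],
})

# Two half-open bins: [1880, 1945) and [1945, 2011)
sample['period'] = pd.cut(sample['year'], bins=[1880, 1945, 2011], right=False,
                          labels=['1880-1944', '1945-2010'])

# Total births per (period, sex, name), keeping only combinations that occur
sums = sample.groupby(['period', 'sex', 'name'], as_index=False, observed=True)['count'].sum()

# For each (period, sex) pick the row with the largest total
winners = sums.loc[sums.groupby(['period', 'sex'], observed=True)['count'].idxmax()]
print(winners)
```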

For each year, we calculate how many names cover 50% of people and visualize the result:

rows = []
years = np.arange(1880, 2011)
for year in years:
    dataset = datalist.format(year=year)
    csv = pd.read_csv(dataset, names=['name', 'sex', 'count'])
    # Total births per name (both sexes combined), most popular first
    names = csv.groupby('name')['count'].sum().sort_values(ascending=False)
    cum_perc = names.cumsum() / names.sum() * 100
    # Number of top names whose cumulative share stays within 50%
    rows.append({'year': year, 'count': int((cum_perc <= 50).sum())})

# DataFrame.append was removed in pandas 2.0, so the rows are collected in a list first
dataframe = pd.DataFrame(rows)

fig, ax1 = plt.subplots(1, 1, figsize=(22,13))
ax1.set_xlabel('Year', fontsize = 12)
ax1.set_ylabel('Name diversity', fontsize = 12)
ax1.plot(dataframe['year'], dataframe['count'], color='r', ls='-', label='Names covering 50% of births')
ax1.legend(loc=9, fontsize=12)

plt.show()
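
The counting rule is worth a closer look. Filtering on a cumulative share of at most 50% counts the names that fit under the threshold, which can be one fewer than the number of names needed to reach it. A small sketch of the "needed to reach" variant, using `np.searchsorted` and a hypothetical helper `names_to_cover` on made-up counts:

```python
import numpy as np

def names_to_cover(counts, fraction=0.5):
    """Smallest number of top names whose combined share is at least `fraction`."""
    ordered = np.sort(np.asarray(counts))[::-1]   # most popular first
    share = ordered.cumsum() / ordered.sum()      # cumulative share, ends at 1.0
    # Index of the first cumulative share >= fraction, converted to a count
    return int(np.searchsorted(share, fraction) + 1)

print(names_to_cover([10, 40, 20, 30]))  # cumulative shares: 0.4, 0.7, 0.9, 1.0
```

Which definition to use is a judgment call; both lead to the same overall trend, they only differ by at most one name per year.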

Let's pick 4 years from the whole span and, for each of them, plot the distribution of the first letter and of the last letter of names:

from string import ascii_lowercase, ascii_uppercase

fig_first, ax_first = plt.subplots(1, 1, figsize=(14,10))
fig_last, ax_last = plt.subplots(1, 1, figsize=(14,10))

index = np.arange(len(ascii_uppercase))
years = [1944, 1978, 1991, 2003]
colors = ['r', 'g', 'b', 'y']
bar_width = 0.3
for n, year in enumerate(years):
    dataset = datalist.format(year=year)
    csv = pd.read_csv(dataset, names=['name', 'sex', 'count'])
    # One row per distinct name, so the shares below are shares of distinct names
    names = csv.groupby('name', as_index=False)['count'].sum()
    count = names.shape[0]

    # DataFrame.append was removed in pandas 2.0, so the rows are collected in a list first
    rows = []
    for letter in ascii_uppercase:
        countFirst = names.name.str.startswith(letter).sum()
        countLast = names.name.str.endswith(letter.lower()).sum()
        rows.append({
            'letter': letter,
            'frequency_first': countFirst / count * 100,
            'frequency_last': countLast / count * 100})
    dataframe = pd.DataFrame(rows)

    ax_first.bar(index + bar_width * n, dataframe['frequency_first'], bar_width, alpha=0.5, color=colors[n], label=str(year))
    ax_last.bar(index + bar_width * n, dataframe['frequency_last'], bar_width, alpha=0.5, color=colors[n], label=str(year))

ax_first.set_xlabel('Letter of the alphabet')
ax_first.set_ylabel('Frequency, %')
ax_first.set_title('First letter of the name')
ax_first.set_xticks(index)
ax_first.set_xticklabels(ascii_uppercase)
ax_first.legend()

ax_last.set_xlabel('Letter of the alphabet')
ax_last.set_ylabel('Frequency, %')
ax_last.set_title('Last letter of the name')
ax_last.set_xticks(index)
ax_last.set_xticklabels(ascii_uppercase)
ax_last.legend()

fig_first.tight_layout()
fig_last.tight_layout()

plt.show()
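
The per-letter loop can also be expressed with vectorized string accessors. The sketch below, on a handful of made-up names, computes the share of distinct names by first and by last letter and pads the result to the full alphabet so that every year would yield 26 values.

```python
from string import ascii_uppercase

import pandas as pd

# Made-up list of distinct names
names = pd.Series(['Anna', 'Alice', 'Bob', 'Brenda', 'Carl'])

# Share of names (in percent) per first letter and per upper-cased last letter
first = names.str[0].value_counts(normalize=True) * 100
last = names.str[-1].str.upper().value_counts(normalize=True) * 100

# Align both results to A..Z, filling letters that never occur with 0
first = first.reindex(list(ascii_uppercase), fill_value=0)
last = last.reindex(list(ascii_uppercase), fill_value=0)
print(first.head(3))
```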

Let's make a list of several famous people (presidents, singers, actors, movie characters) and assess their influence on name trends:

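celebrities = {'Frank': 'M', 'Britney': 'F', 'Madonna': 'F', 'Bob': 'M'}
# Restore the full range: 'years' was narrowed to four values in the previous step
years = np.arange(1880, 2011)
dataframes = []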
for year in years:
    dataset = datalist.format(year=year)
    dataframe = pd.read_csv(dataset, names=['name', 'sex', 'count'])
    dataframes.append(dataframe.assign(year=year))

result = pd.concat(dataframes)

for celebrity, sex in celebrities.items():
    names = result[result.name == celebrity]
    dataframe = names[names.sex == sex]
    fig, ax = plt.subplots(1, 1, figsize=(16,8))

    ax.set_xlabel('Year', fontsize = 10)
    ax.set_ylabel('Births', fontsize = 10)
    ax.plot(dataframe['year'], dataframe['count'], label=celebrity, color='r', ls='-')
    ax.legend(loc=9, fontsize=12)
        
    plt.show()

As an exercise, you can add the celebrities' lifetimes to the visualization from the last example to show their influence on name trends more clearly.
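
One possible way to approach this exercise is to shade a period on the plot with `axvspan`. The sketch below uses made-up counts for a hypothetical name and a hypothetical period (`span_start`, `span_end`); real values would come from the `result` frame and from biographical data.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up yearly counts for a hypothetical name
data = pd.DataFrame({'year': [1950, 1960, 1970, 1980, 1990],
                     'count': [10, 15, 60, 40, 20]})

# Hypothetical period to highlight, e.g. a celebrity's lifetime or active years
span_start, span_end = 1965, 1985

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(data['year'], data['count'], color='r', label='Example name')
# Shade the period behind the curve so peaks can be compared with it
ax.axvspan(span_start, span_end, color='gray', alpha=0.2, label='Highlighted period')
ax.set_xlabel('Year')
ax.set_ylabel('Births')
ax.legend(loc=9)
```

Calling `plt.show()` afterwards displays the figure, as in the examples above.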

With this, all our goals have been achieved. We have practiced using grouping and data visualization tools in Python, and we will keep working with data. Everyone can draw their own conclusions from the finished visualizations.

Knowledge to everyone!

Source: www.habr.com