Practicing data grouping and visualization skills in Python

Hello, Habr!

Today we will practice using the tools for grouping and visualizing data in Python. Using the dataset provided on Github, we will analyze several characteristics and build a set of visualizations.

As tradition dictates, let's start by defining the goals:

  • Group the data by sex and year and visualize the overall birth dynamics for both sexes;
  • Find the most popular names of all time;
  • Split the whole time span of the data into 10 parts and, for each part, find the most popular name of each sex. For every name found, visualize its dynamics over the whole period;
  • For each year, count how many names cover 50% of people and visualize it (we will see the diversity of names for each year);
  • Pick 4 years from the whole interval and show, for each year, the distribution by first letter of the name and by last letter of the name;
  • Make a list of several famous people (presidents, singers, actors, movie characters) and assess their influence on the dynamics of names. Build a visualization.

Fewer words, more code!

So, let's go.

Let's group the data by sex and year and visualize the overall birth dynamics for both sexes:

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt

years = np.arange(1880, 2011, 3)
datalist = 'https://raw.githubusercontent.com/wesm/pydata-book/2nd-edition/datasets/babynames/yob{year}.txt'
dataframes = []
for year in years:
    dataset = datalist.format(year=year)
    dataframe = pd.read_csv(dataset, names=['name', 'sex', 'count'])
    dataframes.append(dataframe.assign(year=year))

result = pd.concat(dataframes)
sex = result.groupby('sex')
births_men = sex.get_group('M').groupby('year', as_index=False)
births_women = sex.get_group('F').groupby('year', as_index=False)
# Sum only the 'count' column so the text columns are not aggregated
births_men_list = births_men['count'].sum()['count'].tolist()
births_women_list = births_women['count'].sum()['count'].tolist()

fig, ax = plt.subplots()
fig.set_size_inches(25,15)

index = np.arange(len(years))
stolb1 = ax.bar(index, births_men_list, 0.4, color='c', label='Men')
stolb2 = ax.bar(index + 0.4, births_women_list, 0.4, alpha=0.8, color='r', label='Women')

ax.set_title('Births by sex and year')
ax.set_xlabel('Year')
ax.set_ylabel('Births')
# Set tick positions before tick labels so each label attaches to the right tick;
# index + 0.2 is the midpoint between the two bars of a pair
ax.set_xticks(index + 0.2)
ax.set_xticklabels(years)
ax.legend(loc=9)

fig.tight_layout()
plt.show()
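
The same grouping by sex and year can also be expressed as a pivot table. The snippet below is only a minimal sketch on a tiny synthetic sample (the names and counts are made up for illustration, not taken from the dataset); it assumes the same `name`, `sex`, `count`, `year` column layout as above.

```python
import pandas as pd

# Tiny synthetic sample in the same layout as the yobYYYY.txt files plus a year column
sample = pd.DataFrame({
    'name':  ['Mary', 'Anna', 'John', 'Mary', 'John', 'James'],
    'sex':   ['F', 'F', 'M', 'F', 'M', 'M'],
    'count': [10, 5, 8, 12, 9, 4],
    'year':  [1880, 1880, 1880, 1881, 1881, 1881],
})

# One row per year, one column per sex, each cell holds the total number of births
births_by_sex = sample.pivot_table(index='year', columns='sex', values='count', aggfunc='sum')
print(births_by_sex)
```

With the data in this shape, `births_by_sex['M']` and `births_by_sex['F']` can be passed to `ax.bar` directly instead of building two separate lists.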

Let's find the most popular names in history:

years = np.arange(1880, 2011)

dataframes = []
for year in years:
    dataset = datalist.format(year=year)
    dataframe = pd.read_csv(dataset, names=['name', 'sex', 'count'])
    dataframes.append(dataframe)

result = pd.concat(dataframes)
# Sum only the 'count' column; the 'sex' column holds text and should not be summed
names = result.groupby('name', as_index=False)['count'].sum().sort_values('count', ascending=False)
names.head(10)
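
Note that grouping by `name` alone merges the counts of a name that was given to both sexes. The following sketch, on made-up numbers, shows the difference between the two groupings; `nlargest` is used here as a shorter alternative to sorting and taking the head.

```python
import pandas as pd

# Synthetic sample: 'Leslie' appears under both sexes
sample = pd.DataFrame({
    'name':  ['Leslie', 'Leslie', 'Mary', 'John'],
    'sex':   ['F', 'M', 'F', 'M'],
    'count': [7, 6, 10, 8],
})

# Totals per name regardless of sex
top_names = sample.groupby('name')['count'].sum().nlargest(2)

# Totals per (name, sex) pair, which keeps the two 'Leslie' rows apart
top_names_by_sex = sample.groupby(['name', 'sex'])['count'].sum().nlargest(2)

print(top_names)
print(top_names_by_sex)
```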

Let's split the whole time span of the data into 10 parts and, for each part, find the most popular name of each sex. For every name found, we visualize its dynamics over the whole period:

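years = np.arange(1880, 2011)
# Length of one part in years, chosen so that the whole span fits into 10 parts
part_size = int((years[-1] - years[0]) / 10) + 1
parts = {}
def GetPart(year):
    # Zero-based index of the part that a given year falls into
    return int((year - years[0]) / part_size)
for year in years:
    index = GetPart(year)
    # Inclusive first and last year of the part, used as a readable label
    r = years[0] + part_size * index, min(years[-1], years[0] + part_size * (index + 1) - 1)
    parts[index] = str(r[0]) + '-' + str(r[1])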
dataframe_parts = []
dataframes = []
for year in years:
    dataset = datalist.format(year=year)
    dataframe = pd.read_csv(dataset, names=['name', 'sex', 'count'])
    dataframe_parts.append(dataframe.assign(years=parts[GetPart(year)]))
    dataframes.append(dataframe.assign(year=year))
    
result_parts = pd.concat(dataframe_parts)
result = pd.concat(dataframes)

result_parts_sums = result_parts.groupby(['years', 'sex', 'name'], as_index=False).sum()
# idxmax returns the row label of the largest count in each (years, sex) group
result_parts_names = result_parts_sums.loc[result_parts_sums.groupby(['years', 'sex'])['count'].idxmax()]
result_sums = result.groupby(['year', 'sex', 'name'], as_index=False).sum()

# Build the (name, sex) grouping once instead of on every iteration
name_groups = result_sums.groupby(['name', 'sex'])
for groupName in result_parts_names.groupby(['name', 'sex']).groups:
    group = name_groups.get_group(groupName)
    fig, ax = plt.subplots(1, 1, figsize=(18,10))

    ax.set_xlabel('Year')
    ax.set_ylabel('Births')
    # groupName is a (name, sex) tuple; show both in the legend
    ax.plot(group['year'], group['count'], label='{} ({})'.format(*groupName), color='b', ls='-')
    ax.legend(loc=9, fontsize=11)

    plt.show()
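
The key step above is picking the row with the largest count inside each (part, sex) group. Here is a minimal sketch of that step in isolation, on synthetic numbers; the period labels and counts are invented for illustration only.

```python
import pandas as pd

# Synthetic per-period totals, already summed per (years, sex, name)
sample = pd.DataFrame({
    'years': ['1880-1893'] * 4 + ['1894-1907'] * 4,
    'sex':   ['F', 'F', 'M', 'M'] * 2,
    'name':  ['Mary', 'Anna', 'John', 'James', 'Mary', 'Anna', 'John', 'James'],
    'count': [30, 20, 25, 10, 15, 40, 5, 35],
})

# For each (years, sex) group, idxmax gives the index label of the top row
winner_rows = sample.groupby(['years', 'sex'])['count'].idxmax()

# .loc selects by label, which is what idxmax returns
winners = sample.loc[winner_rows]
print(winners)
```

Using `.loc` rather than `.iloc` matters here: `idxmax` returns index labels, and the two only coincide while the frame still has its default integer index.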

For each year, we count how many names cover 50% of people and visualize the result:

# Collect one row per year in a list; DataFrame.append was removed in pandas 2.0
rows = []
years = np.arange(1880, 2011)
for year in years:
    dataset = datalist.format(year=year)
    csv = pd.read_csv(dataset, names=['name', 'sex', 'count'])
    names = csv.groupby('name', as_index=False)['count'].sum()
    # Share of each name in the year's total, in percent
    names['percent'] = names['count'] / names['count'].sum() * 100
    names = names.sort_values(['percent'], ascending=False)
    names['cum_perc'] = names['percent'].cumsum()
    # Keep the most popular names while their cumulative share stays within 50%
    names_filtered = names[names['cum_perc'] <= 50]
    rows.append({'year': year, 'count': names_filtered.shape[0]})
dataframe = pd.DataFrame(rows)

fig, ax1 = plt.subplots(1, 1, figsize=(22,13))
ax1.set_xlabel('Year', fontsize = 12)
ax1.set_ylabel('Name diversity', fontsize = 12)
# A label is required for the legend to have something to show
ax1.plot(dataframe['year'], dataframe['count'], color='r', ls='-', label='Names covering 50% of births')
ax1.legend(loc=9, fontsize=12)

plt.show()
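
The counting rule deserves a closer look. The loop above counts the names whose cumulative share stays at or below 50%; a slightly different reading is "how many names are needed to reach 50%". The sketch below implements that second reading on made-up counts. The helper name `names_covering_half` is hypothetical and not part of the original code.

```python
import numpy as np
import pandas as pd

def names_covering_half(counts):
    """Return how many of the most popular names are needed to reach half of all births."""
    cumulative = counts.sort_values(ascending=False).cumsum().to_numpy()
    # searchsorted finds the first position where the running total reaches half of the sum
    return int(np.searchsorted(cumulative, counts.sum() / 2) + 1)

# Synthetic counts for one year
counts = pd.Series({'Mary': 40, 'John': 30, 'Anna': 20, 'James': 10})
needed = names_covering_half(counts)
print(needed)
```

On this sample the running totals are 40, 70, 90, 100, so two names are needed to reach 50 out of 100, whereas the `<= 50` filter would count only one. Both definitions show the same trend over the years; they differ by at most one name per year.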

Let's pick 4 years from the whole interval and show, for each year, the distribution of names by first letter and by last letter:

from string import ascii_uppercase

fig_first, ax_first = plt.subplots(1, 1, figsize=(14,10))
fig_last, ax_last = plt.subplots(1, 1, figsize=(14,10))

index = np.arange(len(ascii_uppercase))
years = [1944, 1978, 1991, 2003]
colors = ['r', 'g', 'b', 'y']
# Four bars of width 0.2 fit side by side within one letter slot
bar_width = 0.2
for n, year in enumerate(years):
    dataset = datalist.format(year=year)
    csv = pd.read_csv(dataset, names=['name', 'sex', 'count'])
    names = csv.groupby('name', as_index=False)['count'].sum()
    count = names.shape[0]

    # Share (in %) of distinct names that start / end with each letter
    frequency_first = [names.name.str.startswith(letter).sum() / count * 100
                       for letter in ascii_uppercase]
    frequency_last = [names.name.str.endswith(letter.lower()).sum() / count * 100
                      for letter in ascii_uppercase]

    ax_first.bar(index + bar_width * n, frequency_first, bar_width, alpha=0.5, color=colors[n], label=str(year))
    ax_last.bar(index + bar_width * n, frequency_last, bar_width, alpha=0.5, color=colors[n], label=str(year))

ax_first.set_xlabel('Letter of the alphabet')
ax_first.set_ylabel('Frequency, %')
ax_first.set_title('First letter of the name')
ax_first.set_xticks(index)
ax_first.set_xticklabels(ascii_uppercase)
ax_first.legend()

ax_last.set_xlabel('Letter of the alphabet')
ax_last.set_ylabel('Frequency, %')
ax_last.set_title('Last letter of the name')
ax_last.set_xticks(index)
ax_last.set_xticklabels(ascii_uppercase)
ax_last.legend()

fig_first.tight_layout()
fig_last.tight_layout()

plt.show()
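
The per-letter loop can also be replaced by pandas string accessors. The following is a small sketch on an invented list of names, showing one possible way to get the same kind of distribution with `value_counts`; it counts distinct names, just like the code above.

```python
from string import ascii_uppercase

import pandas as pd

# Synthetic list of distinct names
names = pd.Series(['Mary', 'Maria', 'John', 'James', 'Anna'])

# Share of names per first letter, in percent, with all 26 letters present
first_share = (names.str[0].value_counts(normalize=True) * 100) \
    .reindex(list(ascii_uppercase), fill_value=0.0)

# Same for the last letter, upper-cased so both results share one index
last_share = (names.str[-1].str.upper().value_counts(normalize=True) * 100) \
    .reindex(list(ascii_uppercase), fill_value=0.0)

print(first_share[first_share > 0])
print(last_share[last_share > 0])
```

The `reindex` call guarantees one value per letter in alphabetical order, so the result can be passed to `ax.bar` together with `np.arange(26)` without further alignment.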

Let's make a list of several famous people (presidents, singers, actors, movie characters) and assess their influence on the dynamics of names:

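celebrities = {'Frank': 'M', 'Britney': 'F', 'Madonna': 'F', 'Bob': 'M'}
# Reload the full range: `years` was narrowed to four years in the previous example
years = np.arange(1880, 2011)
dataframes = []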
for year in years:
    dataset = datalist.format(year=year)
    dataframe = pd.read_csv(dataset, names=['name', 'sex', 'count'])
    dataframes.append(dataframe.assign(year=year))

result = pd.concat(dataframes)

for celebrity, sex in celebrities.items():
    names = result[result.name == celebrity]
    dataframe = names[names.sex == sex]
    fig, ax = plt.subplots(1, 1, figsize=(16,8))

    ax.set_xlabel('Year', fontsize = 10)
    ax.set_ylabel('Births', fontsize = 10)
    ax.plot(dataframe['year'], dataframe['count'], label=celebrity, color='r', ls='-')
    ax.legend(loc=9, fontsize=12)
        
    plt.show()

For practice, you can add the lifetimes of the celebrities to the visualization from the last example to assess their influence on the dynamics of names more clearly.
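
One possible starting point for this exercise is matplotlib's `axvspan`, which shades a vertical band between two x values. The sketch below uses synthetic counts and a hypothetical period (`span_start`, `span_end`); neither the numbers nor the dates refer to a real person, so substitute the real series and dates yourself.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Synthetic yearly counts for a placeholder name (not real data)
series = pd.DataFrame({
    'year': list(range(1900, 1910)),
    'count': [5, 6, 8, 12, 20, 30, 28, 22, 15, 10],
})

# Hypothetical period to highlight, for example a lifetime or a career
span_start, span_end = 1903, 1907

fig, ax = plt.subplots(1, 1, figsize=(16, 8))
ax.plot(series['year'], series['count'], label='Placeholder name', color='r', ls='-')
# Shade the period behind the curve
span = ax.axvspan(span_start, span_end, color='gray', alpha=0.2, label='Highlighted period')
ax.set_xlabel('Year')
ax.set_ylabel('Births')
ax.legend(loc=9, fontsize=12)

fig.savefig('name_with_period.png')
```

In a notebook, drop the `matplotlib.use('Agg')` line and call `plt.show()` instead of saving to a file.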

With that, all our goals have been reached. We have practiced using the tools for grouping and visualizing data in Python, and we will keep working with data. Everyone can draw their own conclusions from the finished visualizations.

Knowledge to all!

Source: www.hab.com