Practicing grouping and data visualization skills in Python

Hello, Habr!

Today we will practice using tools for grouping and visualizing data in Python. Using the dataset provided on GitHub, we will analyze several characteristics and build a set of visualizations.

As usual, let's start by defining the goals:

  • Group the data by sex and year and visualize the overall birth dynamics for both sexes;
  • Find the most popular names of all time;
  • Split the whole time span of the data into 10 parts and, for each part, find the most popular name of each sex. For every name found, visualize its dynamics over the whole time span;
  • For each year, count how many names cover 50% of the people and visualize the result (this shows the diversity of names in each year);
  • Pick 4 years from the whole interval and, for each year, show the distribution by the first letter and by the last letter of the name;
  • Make a list of several famous people (presidents, singers, actors, film characters) and assess their influence on the dynamics of names. Build a visualization.

Fewer words, more code!

So, let's go.

Let's group the data by sex and year and visualize the overall birth dynamics for both sexes:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Take every third year so that the bar chart stays readable.
years = np.arange(1880, 2011, 3)
datalist = 'https://raw.githubusercontent.com/wesm/pydata-book/2nd-edition/datasets/babynames/yob{year}.txt'
dataframes = []
for year in years:
    dataset = datalist.format(year=year)
    dataframe = pd.read_csv(dataset, names=['name', 'sex', 'count'])
    dataframes.append(dataframe.assign(year=year))

result = pd.concat(dataframes)
# Total births per year, separately for each sex.
# Summing only the 'count' column avoids aggregating the text columns.
births_men_list = result[result.sex == 'M'].groupby('year')['count'].sum().tolist()
births_women_list = result[result.sex == 'F'].groupby('year')['count'].sum().tolist()

fig, ax = plt.subplots()
fig.set_size_inches(25, 15)

index = np.arange(len(years))
stolb1 = ax.bar(index, births_men_list, 0.4, color='c', label='Men')
stolb2 = ax.bar(index + 0.4, births_women_list, 0.4, alpha=0.8, color='r', label='Women')

ax.set_title('Births by sex and year')
ax.set_xlabel('Year')
ax.set_ylabel('Births')
# Set the tick positions first, then the labels, so that they stay aligned.
ax.set_xticks(index + 0.2)
ax.set_xticklabels(years)
ax.legend(loc=9)

fig.tight_layout()
plt.show()
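
The grouping step can be checked on a small in-memory table before downloading anything. The sketch below assumes an invented `sample` frame with the same columns as the dataset (the numbers are made up for illustration) and shows one possible way to get per-sex totals for every year in a single `groupby` call.

```python
import pandas as pd

# Invented rows in the same layout as the yobYYYY.txt files, plus a year column.
sample = pd.DataFrame({
    'name':  ['Mary', 'Anna', 'John', 'Mary', 'John', 'James'],
    'sex':   ['F', 'F', 'M', 'F', 'M', 'M'],
    'count': [10, 5, 8, 12, 9, 4],
    'year':  [1880, 1880, 1880, 1881, 1881, 1881],
})

# Sum births per (year, sex), then move sex into columns: one row per year.
births = sample.groupby(['year', 'sex'])['count'].sum().unstack('sex')

births_women_list = births['F'].tolist()
births_men_list = births['M'].tolist()
print(births)
```

The two lists have the same meaning as `births_women_list` and `births_men_list` in the code above, so they could be passed to `ax.bar` in the same way.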

Let's find the most popular names of all time:

years = np.arange(1880, 2011)

dataframes = []
for year in years:
    dataset = datalist.format(year=year)
    dataframe = pd.read_csv(dataset, names=['name', 'sex', 'count'])
    dataframes.append(dataframe)

result = pd.concat(dataframes)
# Sum only the 'count' column; the text column 'sex' must not be aggregated.
names = result.groupby('name', as_index=False)['count'].sum().sort_values('count', ascending=False)
names.head(10)
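
As a small illustration of the same idea, the sketch below totals an invented `sample` table by name and takes the largest values with `nlargest`, which is an alternative to sorting the whole table and calling `head`. The data is made up and only the column names follow the dataset.

```python
import pandas as pd

# Invented rows; a name may appear for both sexes and in several years.
sample = pd.DataFrame({
    'name':  ['Mary', 'John', 'Mary', 'John', 'Leslie', 'Leslie'],
    'sex':   ['F', 'M', 'F', 'M', 'F', 'M'],
    'count': [10, 8, 12, 9, 3, 2],
})

# Total per name across all rows, then keep the two largest totals.
top = sample.groupby('name')['count'].sum().nlargest(2)
print(top)
```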

Let's split the whole time span of the data into 10 parts and, for each part, find the most popular name of each sex. For every name found, we visualize its dynamics over the whole time span:

years = np.arange(1880, 2011)
part_size = int((years[-1] - years[0]) / 10) + 1
parts = {}
def GetPart(year):
    # Index (0..9) of the part that contains the given year.
    return int((year - years[0]) / part_size)
for year in years:
    index = GetPart(year)
    start = years[0] + part_size * index
    # The end year is inclusive, so neighbouring parts do not overlap.
    end = min(years[-1], start + part_size - 1)
    parts[index] = str(start) + '-' + str(end)

dataframe_parts = []
dataframes = []
for year in years:
    dataset = datalist.format(year=year)
    dataframe = pd.read_csv(dataset, names=['name', 'sex', 'count'])
    dataframe_parts.append(dataframe.assign(years=parts[GetPart(year)]))
    dataframes.append(dataframe.assign(year=year))

result_parts = pd.concat(dataframe_parts)
result = pd.concat(dataframes)

result_parts_sums = result_parts.groupby(['years', 'sex', 'name'], as_index=False)['count'].sum()
# For every (part, sex) pair keep the row with the largest total.
# idxmax returns index labels, so the rows are selected with .loc, not .iloc.
result_parts_names = result_parts_sums.loc[result_parts_sums.groupby(['years', 'sex'])['count'].idxmax()]
result_sums = result.groupby(['year', 'sex', 'name'], as_index=False)['count'].sum()

# Group the yearly totals once, outside the loop.
yearly_groups = result_sums.groupby(['name', 'sex'])
for groupName in result_parts_names.groupby(['name', 'sex']).groups:
    group = yearly_groups.get_group(groupName)
    fig, ax = plt.subplots(1, 1, figsize=(18, 10))

    ax.set_xlabel('Year')
    ax.set_ylabel('Births')
    # groupName is a (name, sex) tuple; the name is used as the legend label.
    ax.plot(group['year'], group['count'], label=groupName[0], color='b', ls='-')
    ax.legend(loc=9, fontsize=11)

    plt.show()
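
The two key steps here, assigning each year to one of 10 parts and picking the top name per part and sex, can be tried without the dataset. The sketch below is a minimal version under stated assumptions: `period_label` is a hypothetical helper that is not part of the original code, and the `sample` table is invented.

```python
import numpy as np
import pandas as pd

years = np.arange(1880, 2011)
part_size = (years[-1] - years[0]) // 10 + 1   # 14 years per part

def period_label(year):
    """Return the 'start-end' label (inclusive) of the part containing `year`."""
    start = years[0] + (year - years[0]) // part_size * part_size
    end = min(years[-1], start + part_size - 1)
    return f'{start}-{end}'

labels = sorted({period_label(y) for y in years})

# Invented totals per (part, sex, name), in the shape of result_parts_sums.
sample = pd.DataFrame({
    'years': ['1880-1893'] * 3 + ['1894-1907'] * 3,
    'sex':   ['F', 'F', 'M', 'F', 'M', 'M'],
    'name':  ['Mary', 'Anna', 'John', 'Mary', 'John', 'William'],
    'count': [30, 20, 25, 40, 35, 15],
})

# idxmax gives the index label of the largest count in every (part, sex) group.
winners = sample.loc[sample.groupby(['years', 'sex'])['count'].idxmax()]
print(labels)
print(winners)
```

With a part size of 14 years, the last part is shorter than the others (2006-2010), because 131 years do not divide evenly into 10 parts.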

For each year, we count how many names cover 50% of the people and visualize this data:

rows = []
years = np.arange(1880, 2011)
for year in years:
    dataset = datalist.format(year=year)
    csv = pd.read_csv(dataset, names=['name', 'sex', 'count'])
    names = csv.groupby('name', as_index=False)['count'].sum()
    # Share of each name in the total number of births of the year, in percent.
    names['percent'] = names['count'] / names['count'].sum() * 100
    names = names.sort_values('percent', ascending=False)
    names['cum_perc'] = names['percent'].cumsum()
    names_filtered = names[names['cum_perc'] <= 50]
    # Rows are collected in a list: DataFrame.append was removed in pandas 2.0.
    rows.append({'year': year, 'count': names_filtered.shape[0]})
dataframe = pd.DataFrame(rows)

fig, ax1 = plt.subplots(1, 1, figsize=(22, 13))
ax1.set_xlabel('Year', fontsize=12)
ax1.set_ylabel('Name diversity', fontsize=12)
# The line needs a label, otherwise the legend has nothing to show.
ax1.plot(dataframe['year'], dataframe['count'], color='r', ls='-', label='Names covering 50% of births')
ax1.legend(loc=9, fontsize=12)

plt.show()
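
The counting step can be isolated into a small function. The sketch below is one possible formulation, not the code used above: `names_to_cover` is a hypothetical helper that returns the smallest number of top names whose combined births reach the given share. It differs slightly from the `cum_perc <= 50` filter, which counts only the names that stay at or below the 50% mark.

```python
import numpy as np

def names_to_cover(counts, share=0.5):
    """Smallest number of top names whose births reach `share` of the total."""
    ordered = np.sort(np.asarray(counts))[::-1]     # largest first
    cumulative = ordered.cumsum() / ordered.sum()   # running share of births
    # First position where the running share is at least `share`.
    return int(np.searchsorted(cumulative, share) + 1)

print(names_to_cover([50, 30, 10, 10]))   # 1: the top name alone covers half
print(names_to_cover([40, 30, 20, 10]))   # 2: the top two names are needed
```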

Let's pick 4 years from the whole interval and, for each year, show the distribution by the first letter and by the last letter of the name:

from string import ascii_uppercase

fig_first, ax_first = plt.subplots(1, 1, figsize=(14, 10))
fig_last, ax_last = plt.subplots(1, 1, figsize=(14, 10))

index = np.arange(len(ascii_uppercase))
years = [1944, 1978, 1991, 2003]
colors = ['r', 'g', 'b', 'y']
# Four bars per letter; a width of 0.2 keeps them inside one letter slot.
bar_width = 0.2
for n, year in enumerate(years):
    dataset = datalist.format(year=year)
    csv = pd.read_csv(dataset, names=['name', 'sex', 'count'])
    names = csv.groupby('name', as_index=False)['count'].sum()
    count = names.shape[0]

    # Share of distinct names (not births) starting or ending with each letter.
    rows = []
    for letter in ascii_uppercase:
        countFirst = names.name.str.startswith(letter).sum()
        countLast = names.name.str.endswith(letter.lower()).sum()
        rows.append({
            'letter': letter,
            'frequency_first': countFirst / count * 100,
            'frequency_last': countLast / count * 100})
    dataframe = pd.DataFrame(rows)

    ax_first.bar(index + bar_width * n, dataframe['frequency_first'], bar_width, alpha=0.5, color=colors[n], label=str(year))
    ax_last.bar(index + bar_width * n, dataframe['frequency_last'], bar_width, alpha=0.5, color=colors[n], label=str(year))

ax_first.set_xlabel('Letter')
ax_first.set_ylabel('Frequency, %')
ax_first.set_title('First letter of the name')
ax_first.set_xticks(index + 1.5 * bar_width)
ax_first.set_xticklabels(ascii_uppercase)
ax_first.legend()

ax_last.set_xlabel('Letter')
ax_last.set_ylabel('Frequency, %')
ax_last.set_title('Last letter of the name')
ax_last.set_xticks(index + 1.5 * bar_width)
ax_last.set_xticklabels(ascii_uppercase)
ax_last.legend()

fig_first.tight_layout()
fig_last.tight_layout()

plt.show()
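
The per-letter loop can also be written without an explicit loop. The sketch below is an alternative under stated assumptions: the `names` series is invented, and the frequencies are shares of distinct names, as in the code above. It uses the pandas string accessor to take the first and last character and `value_counts(normalize=True)` to turn counts into shares.

```python
import pandas as pd
from string import ascii_uppercase

# Invented list of distinct names for a single year.
names = pd.Series(['Mary', 'Anna', 'Margaret', 'John', 'Amelia'])

# Share of names per first letter and per (upper-cased) last letter, in percent.
first = names.str[0].value_counts(normalize=True) * 100
last = names.str[-1].str.upper().value_counts(normalize=True) * 100

# Reindex to the full alphabet so that letters with no names get 0.
letters = list(ascii_uppercase)
frequency_first = first.reindex(letters, fill_value=0)
frequency_last = last.reindex(letters, fill_value=0)
print(frequency_first[frequency_first > 0])
```

Reindexing to all 26 letters keeps the result aligned with the `index` positions used for the bars, even when some letters never occur.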

Let's make a list of several famous people (presidents, singers, actors, film characters) and assess their influence on the dynamics of names:

celebrities = {'Frank': 'M', 'Britney': 'F', 'Madonna': 'F', 'Bob': 'M'}
# Reset years to the full range; the previous example left only 4 years in it.
years = np.arange(1880, 2011)
dataframes = []
for year in years:
    dataset = datalist.format(year=year)
    dataframe = pd.read_csv(dataset, names=['name', 'sex', 'count'])
    dataframes.append(dataframe.assign(year=year))

result = pd.concat(dataframes)

for celebrity, sex in celebrities.items():
    # Keep only the rows for this name and the matching sex.
    names = result[result.name == celebrity]
    dataframe = names[names.sex == sex]
    fig, ax = plt.subplots(1, 1, figsize=(16, 8))

    ax.set_xlabel('Year', fontsize=10)
    ax.set_ylabel('Births', fontsize=10)
    ax.plot(dataframe['year'], dataframe['count'], label=celebrity, color='r', ls='-')
    ax.legend(loc=9, fontsize=12)

    plt.show()

As an exercise, you can add the celebrities' lifetimes to the visualization from the last example to assess their influence on the dynamics of names more clearly.
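
One possible starting point for this exercise is matplotlib's `axvspan`, which shades a vertical band between two x values. The sketch below is only an illustration: `plot_with_lifespan` is a hypothetical helper, the counts are invented, and the years 1915 and 1998 assume that 'Frank' refers to Frank Sinatra, which the list above does not state.

```python
import matplotlib.pyplot as plt

def plot_with_lifespan(ax, years, counts, name, born, died):
    """Plot yearly counts for `name` and shade the span from `born` to `died`."""
    ax.plot(years, counts, label=name, color='r', ls='-')
    ax.axvspan(born, died, color='gray', alpha=0.2, label='Lifetime')
    ax.set_xlabel('Year')
    ax.set_ylabel('Births')
    ax.legend(loc=9)

# Invented counts, only to make the sketch runnable without the dataset.
years = list(range(1900, 2011, 10))
counts = [100, 120, 150, 200, 400, 380, 300, 220, 150, 110, 90, 80]

fig, ax = plt.subplots(figsize=(16, 8))
# Assumed lifetime: Frank Sinatra, 1915 to 1998.
plot_with_lifespan(ax, years, counts, 'Frank', born=1915, died=1998)
```

With the real data, `years` and `counts` would come from the filtered `dataframe` in the loop above, followed by `plt.show()`.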

With this, all of our goals have been reached. We have practiced using tools for grouping and visualizing data in Python, and we will keep working with data. Everyone can draw their own conclusions from the finished visualizations.

Knowledge to everyone!

Source: www.habr.com