Practicing grouping and data visualization skills in Python


Hello, Habr!

Today we will practice using the tools for grouping and visualizing data in Python. Using the dataset provided on GitHub, let's analyze several of its features and build a set of visualizations.

As usual, let's define the goals first:

  • Group the data by sex and year and visualize the overall birth dynamics for both sexes;
  • Find the most popular names of all time;
  • Split the whole time span of the data into 10 parts and, for each part, find the most popular name of each sex. For every name found, visualize its dynamics over the whole period;
  • For each year, compute how many names cover 50% of people and visualize the result (this shows the diversity of names year by year);
  • Pick 4 years from the whole range and, for each of them, show the distribution of names by first letter and by last letter;
  • Make a list of several famous people (presidents, singers, actors, film characters) and assess their influence on the dynamics of names. Build a visualization.

Fewer words, more code!

So, let's go.

Let's group the data by sex and year and visualize the overall birth dynamics for both sexes:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Every third year from 1880 to 2009
years = np.arange(1880, 2011, 3)
datalist = 'https://raw.githubusercontent.com/wesm/pydata-book/2nd-edition/datasets/babynames/yob{year}.txt'
dataframes = []
for year in years:
    dataset = datalist.format(year=year)
    dataframe = pd.read_csv(dataset, names=['name', 'sex', 'count'])
    dataframes.append(dataframe.assign(year=year))

result = pd.concat(dataframes, ignore_index=True)
sex = result.groupby('sex')
# Sum only the numeric 'count' column; summing 'name' would concatenate strings
births_men = sex.get_group('M').groupby('year')['count'].sum()
births_women = sex.get_group('F').groupby('year')['count'].sum()
births_men_list = births_men.tolist()
births_women_list = births_women.tolist()

fig, ax = plt.subplots()
fig.set_size_inches(25, 15)

index = np.arange(len(years))
stolb1 = ax.bar(index, births_men_list, 0.4, color='c', label='Men')
stolb2 = ax.bar(index + 0.4, births_women_list, 0.4, alpha=0.8, color='r', label='Women')

ax.set_title('Births by sex and year')
ax.set_xlabel('Years')
ax.set_ylabel('Births')
# Set tick positions before tick labels, centred between the two bars
ax.set_xticks(index + 0.2)
ax.set_xticklabels(years)
ax.legend(loc=9)

fig.tight_layout()
plt.show()
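
The same per-sex totals can also be obtained in one pass by grouping on both keys and unstacking `sex` into columns. The snippet below is a minimal sketch on a small made-up table (the numbers are invented for illustration and are not taken from the dataset); the helper name `births_by_sex_and_year` is hypothetical.

```python
import pandas as pd

def births_by_sex_and_year(frame):
    """Return a year x sex table of total births."""
    return frame.groupby(['year', 'sex'])['count'].sum().unstack('sex')

# Made-up rows in the same layout as the yob{year}.txt files plus a 'year' column
toy = pd.DataFrame({
    'name':  ['Mary', 'Anna', 'John', 'Mary', 'John', 'James'],
    'sex':   ['F', 'F', 'M', 'F', 'M', 'M'],
    'count': [10, 5, 8, 12, 9, 4],
    'year':  [1880, 1880, 1880, 1881, 1881, 1881],
})
totals = births_by_sex_and_year(toy)
print(totals)
```

With the data in this shape, each column is one sex and each row is one year, so the two bar series no longer need to be built separately.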


Let's find the most popular names of all time:

years = np.arange(1880, 2011)

dataframes = []
for year in years:
    dataset = datalist.format(year=year)
    dataframe = pd.read_csv(dataset, names=['name', 'sex', 'count'])
    dataframes.append(dataframe)

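result = pd.concat(dataframes, ignore_index=True)
# Total births per name over all years (both sexes combined), largest first
names = result.groupby('name', as_index=False)['count'].sum().sort_values('count', ascending=False)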
names.head(10)
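
To see what this grouping does without downloading anything, here is a small sketch on invented data. The helper `top_names` is a hypothetical name, and `nlargest` is used in place of sorting followed by `head`. Note that a name given to both sexes is counted once, with both totals added together.

```python
import pandas as pd

def top_names(frame, n=3):
    """Total count per name across all rows, largest first."""
    return frame.groupby('name')['count'].sum().nlargest(n)

# Made-up counts; 'Leslie' appears for both sexes on purpose
toy = pd.DataFrame({
    'name':  ['Mary', 'John', 'Leslie', 'Leslie', 'Mary', 'John'],
    'sex':   ['F', 'M', 'F', 'M', 'F', 'M'],
    'count': [10, 8, 3, 4, 12, 9],
})
print(top_names(toy, n=2))
```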


Let's split the whole time span of the data into 10 parts and, for each part, find the most popular name of each sex. For every name found, we visualize its dynamics over the whole period:

years = np.arange(1880, 2011)
# 14 years per part, so the 131 years fall into 10 parts
part_size = int((years[-1] - years[0]) / 10) + 1
parts = {}
def GetPart(year):
    return int((year - years[0]) / part_size)
for year in years:
    index = GetPart(year)
    # Inclusive bounds of the part, e.g. 1880-1893
    r = years[0] + part_size * index, min(years[-1], years[0] + part_size * (index + 1) - 1)
    parts[index] = str(r[0]) + '-' + str(r[1])

dataframe_parts = []
dataframes = []
for year in years:
    dataset = datalist.format(year=year)
    dataframe = pd.read_csv(dataset, names=['name', 'sex', 'count'])
    dataframe_parts.append(dataframe.assign(years=parts[GetPart(year)]))
    dataframes.append(dataframe.assign(year=year))

result_parts = pd.concat(dataframe_parts, ignore_index=True)
result = pd.concat(dataframes, ignore_index=True)

result_parts_sums = result_parts.groupby(['years', 'sex', 'name'], as_index=False)['count'].sum()
# Row with the largest total in every (part, sex) group
result_parts_names = result_parts_sums.loc[result_parts_sums.groupby(['years', 'sex'])['count'].idxmax()]
result_sums = result.groupby(['year', 'sex', 'name'], as_index=False)['count'].sum()

# One chart per distinct (name, sex) winner, covering all years
for name, sex in result_parts_names[['name', 'sex']].drop_duplicates().itertuples(index=False):
    group = result_sums[(result_sums['name'] == name) & (result_sums['sex'] == sex)]
    fig, ax = plt.subplots(1, 1, figsize=(18, 10))

    ax.set_xlabel('Years')
    ax.set_ylabel('Births')
    ax.plot(group['year'], group['count'], label=name, color='b', ls='-')
    ax.legend(loc=9, fontsize=11)

    plt.show()
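
The two key steps here, assigning each year to one of 10 parts and picking the top name per group, can be checked on a tiny invented table. This is only a sketch: `part_index` is a hypothetical helper that mirrors the arithmetic of `GetPart` with integer division, and the rows are made up.

```python
import numpy as np
import pandas as pd

years = np.arange(1880, 2011)
part_size = (years[-1] - years[0]) // 10 + 1   # 14 years per part

def part_index(year):
    """Zero-based number of the part that a year (or a Series of years) falls into."""
    return (year - years[0]) // part_size

# Made-up rows: three years land in part 0, the rest in part 1
toy = pd.DataFrame({
    'year':  [1880, 1885, 1893, 1894, 1900, 1900],
    'sex':   ['F', 'F', 'F', 'F', 'F', 'F'],
    'name':  ['Mary', 'Anna', 'Anna', 'Mary', 'Ruth', 'Mary'],
    'count': [10, 7, 6, 9, 11, 5],
})
toy['part'] = part_index(toy['year'])

# Total per (part, sex, name), then the row with the largest total in each (part, sex)
sums = toy.groupby(['part', 'sex', 'name'], as_index=False)['count'].sum()
winners = sums.loc[sums.groupby(['part', 'sex'])['count'].idxmax()]
print(winners)
```

`idxmax` returns the index label of the maximum in each group, which is why the result is passed to `.loc` rather than `.iloc`.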


For each year, let's compute how many names cover 50% of people and visualize the result:

rows = []
years = np.arange(1880, 2011)
for year in years:
    dataset = datalist.format(year=year)
    csv = pd.read_csv(dataset, names=['name', 'sex', 'count'])
    names = csv.groupby('name', as_index=False)['count'].sum()
    # Share of each name in that year's births, in percent
    names['percent'] = names['count'] / names['count'].sum() * 100
    names = names.sort_values('percent', ascending=False)
    names['cum_perc'] = names['percent'].cumsum()
    # Most popular names whose combined share stays within 50%
    names_filtered = names[names['cum_perc'] <= 50]
    rows.append({'year': year, 'count': names_filtered.shape[0]})

# DataFrame.append was removed in pandas 2.0, so build the frame in one go
dataframe = pd.DataFrame(rows)

fig, ax1 = plt.subplots(1, 1, figsize=(22, 13))
ax1.set_xlabel('Years', fontsize=12)
ax1.set_ylabel('Name diversity', fontsize=12)
ax1.plot(dataframe['year'], dataframe['count'], color='r', ls='-', label='Names covering 50% of births')
ax1.legend(loc=9, fontsize=12)

plt.show()
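
A slightly different way to phrase the same question is "what is the smallest number of names needed to reach 50% of births?". The sketch below answers that with a cumulative sum and `np.searchsorted`. It is an alternative formulation, not the code used above: the filter `cum_perc <= 50` counts the names whose running share stays within 50%, so the two results can differ by one. The function name `names_to_cover` is hypothetical.

```python
import numpy as np

def names_to_cover(counts, share=0.5):
    """Smallest number of names whose births add up to at least `share` of the total."""
    ordered = np.sort(np.asarray(counts))[::-1]   # most popular first
    cumulative = ordered.cumsum()
    threshold = share * cumulative[-1]
    # Index of the first running total that reaches the threshold, plus one
    return int(np.searchsorted(cumulative, threshold) + 1)

# Made-up counts for four names
print(names_to_cover([40, 30, 20, 10]))
```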


Let's pick 4 years from the whole range and, for each of them, show the distribution of names by first letter and by last letter:

from string import ascii_uppercase

fig_first, ax_first = plt.subplots(1, 1, figsize=(14, 10))
fig_last, ax_last = plt.subplots(1, 1, figsize=(14, 10))

index = np.arange(len(ascii_uppercase))
years = [1944, 1978, 1991, 2003]
colors = ['r', 'g', 'b', 'y']
bar_width = 0.2  # four bars per letter fit inside one unit-wide slot
for n, year in enumerate(years):
    dataset = datalist.format(year=year)
    csv = pd.read_csv(dataset, names=['name', 'sex', 'count'])
    names = csv.groupby('name', as_index=False)['count'].sum()
    count = names.shape[0]  # number of distinct names that year

    # Share of distinct names that start / end with each letter, in percent
    frequency_first = [names.name.str.startswith(letter).sum() / count * 100
                       for letter in ascii_uppercase]
    frequency_last = [names.name.str.endswith(letter.lower()).sum() / count * 100
                      for letter in ascii_uppercase]

    ax_first.bar(index + bar_width * n, frequency_first, bar_width, alpha=0.5, color=colors[n], label=str(year))
    ax_last.bar(index + bar_width * n, frequency_last, bar_width, alpha=0.5, color=colors[n], label=str(year))

ax_first.set_xlabel('Letter of the alphabet')
ax_first.set_ylabel('Frequency, %')
ax_first.set_title('First letter of the name')
ax_first.set_xticks(index + 1.5 * bar_width)  # centre of the four bars
ax_first.set_xticklabels(ascii_uppercase)
ax_first.legend()

ax_last.set_xlabel('Letter of the alphabet')
ax_last.set_ylabel('Frequency, %')
ax_last.set_title('Last letter of the name')
ax_last.set_xticks(index + 1.5 * bar_width)
ax_last.set_xticklabels(ascii_uppercase)
ax_last.legend()

fig_first.tight_layout()
fig_last.tight_layout()

plt.show()
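
The per-letter loop can also be replaced with vectorized string operations. The following is a minimal sketch under the same definition of frequency (share of distinct names, not weighted by births); `letter_shares` is a hypothetical helper and the four names are an invented example.

```python
import pandas as pd
from string import ascii_uppercase

def letter_shares(names):
    """Percentage of distinct names starting / ending with each letter A-Z."""
    names = pd.Series(names).drop_duplicates()
    letters = list(ascii_uppercase)
    first = names.str[0].value_counts(normalize=True)
    last = names.str[-1].str.upper().value_counts(normalize=True)
    return pd.DataFrame({
        # Letters that never occur get 0 instead of NaN
        'frequency_first': first.reindex(letters, fill_value=0) * 100,
        'frequency_last': last.reindex(letters, fill_value=0) * 100,
    })

shares = letter_shares(['Anna', 'Amy', 'Bob', 'Eva'])
print(shares.loc[['A', 'B', 'E', 'Y']])
```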


Let's make a list of several famous people (presidents, singers, actors, film characters) and assess their influence on the dynamics of names:

celebrities = {'Frank': 'M', 'Britney': 'F', 'Madonna': 'F', 'Bob': 'M'}
# Reload the full range: 'years' was reduced to four values in the previous step
years = np.arange(1880, 2011)
dataframes = []
for year in years:
    dataset = datalist.format(year=year)
    dataframe = pd.read_csv(dataset, names=['name', 'sex', 'count'])
    dataframes.append(dataframe.assign(year=year))

result = pd.concat(dataframes, ignore_index=True)

for celebrity, sex in celebrities.items():
    # Rows for this name and sex only, one per year
    names = result[result.name == celebrity]
    dataframe = names[names.sex == sex]
    fig, ax = plt.subplots(1, 1, figsize=(16, 8))

    ax.set_xlabel('Years', fontsize=10)
    ax.set_ylabel('Births', fontsize=10)
    ax.plot(dataframe['year'], dataframe['count'], label=celebrity, color='r', ls='-')
    ax.legend(loc=9, fontsize=12)

    plt.show()


As an exercise, to assess the influence on name dynamics more precisely, you can add the celebrity's lifetime to the visualization in the last example.
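
One possible way to approach this exercise is to shade the relevant years with matplotlib's `axvspan`. The sketch below is only an illustration: the helper `plot_with_span`, the name, the counts and the shaded interval are all made up, and real dates would have to be looked up for each person.

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so the sketch runs without a display; drop this in a notebook
import matplotlib.pyplot as plt
import pandas as pd

def plot_with_span(ax, frame, label, start, end):
    """Plot births per year and shade the years from `start` to `end`."""
    ax.plot(frame['year'], frame['count'], label=label, color='r', ls='-')
    span = ax.axvspan(start, end, color='gray', alpha=0.2, label='Lifetime')
    ax.set_xlabel('Years')
    ax.set_ylabel('Births')
    ax.legend(loc=9)
    return span

# Made-up numbers purely for illustration
toy = pd.DataFrame({
    'year': range(1900, 1910),
    'count': [5, 6, 8, 12, 20, 26, 24, 19, 15, 11],
})
fig, ax = plt.subplots(figsize=(8, 4))
span = plot_with_span(ax, toy, 'Example name', 1903, 1907)
```

In the loop over `celebrities`, the same call could be made with a per-person pair of years stored next to the sex in the dictionary.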

With that, all of our goals have been met. We have practiced using the tools for grouping and visualizing data in Python, and we will keep working with data. Everyone can draw their own conclusions from the finished visualizations.

Knowledge to all!

Source: www.habr.com
