The first step when starting to work with a new dataset is to understand it. To do that you need, for example, to find out the ranges of values the variables take, their types, and the number of missing values.
The pandas library gives us many useful tools for exploratory data analysis (EDA). Before using them, though, you usually start with more general functions such as df.describe(). Keep in mind that the capabilities of such functions are limited, and that the first steps of EDA look very much alike for almost any dataset.
The author of the article we are publishing today says he is no fan of repetitive work. As a result, while looking for tools that make exploratory data analysis fast and effective, he found the pandas-profiling library.
Here we look at how the pandas-profiling library is used, with the Titanic dataset as an example.
Exploratory data analysis with pandas
I decided to experiment with pandas-profiling on the Titanic dataset because it contains several different data types and has missing values. I believe pandas-profiling is especially interesting when the data has not been cleaned yet and needs further processing that depends on its characteristics. To do that processing well, you need to know where to start and what to pay attention to. This is where the capabilities of pandas-profiling come in handy.
First, we import the data and use pandas to get descriptive statistics:
# import the required packages
import pandas as pd
import pandas_profiling
import numpy as np
# load the data
df = pd.read_csv('/Users/lukas/Downloads/titanic/train.csv')
# compute descriptive statistics
df.describe()
After running this piece of code, you get what is shown in the following figure.
Descriptive statistics obtained with standard pandas tools
Although there is a lot of useful information here, it does not include everything you would want to know about the data under study. For example, you might assume that the data frame, the DataFrame structure, has 891 rows. If you need to check that, one more line of code is required to determine the size of the frame. Such calculations are not demanding, but the time spent repeating them would be better used for cleaning the data.
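As a minimal sketch of those extra checks, the lines below use plain pandas on a small made-up frame (the values are illustrative stand-ins, not the real Titanic data) to get the size, the column types, and the number of missing values per column.

```python
import pandas as pd

# Made-up stand-in for a few Titanic-like rows (not the real data).
df = pd.DataFrame({
    "Age": [22.0, 38.0, None, 35.0],
    "Sex": ["male", "female", "female", "male"],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

n_rows, n_cols = df.shape             # number of rows and columns
column_types = df.dtypes              # dtype of every column
missing_per_column = df.isna().sum()  # missing values in every column

print(n_rows, n_cols)
print(column_types)
print(missing_per_column)
```

Each of these is a separate call, which is exactly the kind of repetition a single profiling report is meant to remove.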
Exploratory data analysis with pandas-profiling
Now let's do the same thing with pandas-profiling:
pandas_profiling.ProfileReport(df)
Running the line of code above generates a report with exploratory data analysis indicators. The code shown above displays the result inline, but you can also have it produce an HTML file that you can, for example, show to someone else.
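A possible way to produce such an HTML file is sketched below. It assumes the report object's to_file method accepts the output path as its first argument; the helper name save_profile_report and the file name titanic_report.html are hypothetical and chosen only for illustration.

```python
import pandas as pd


def save_profile_report(df: pd.DataFrame, path: str = "titanic_report.html") -> None:
    """Write a pandas-profiling report for df to an HTML file (hypothetical helper)."""
    # Imported inside the function so the sketch can be read and loaded
    # even where pandas-profiling is not installed.
    import pandas_profiling

    report = pandas_profiling.ProfileReport(df)
    # Assumption: to_file takes the output path as its first positional argument.
    report.to_file(path)
```

Calling save_profile_report(df) on the frame loaded earlier should leave a standalone HTML file that can be opened in a browser or sent to a colleague.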
The first part of the report is the Overview section, which gives basic information about the data (number of observations, number of variables, and so on). It also contains a list of warnings that tell the analyst what deserves special attention. These warnings can hint at where to focus your data-cleaning efforts.
Overview section of the report
Exploring the variables
Below the Overview section of the report you can find useful information about each variable. Among other things, it includes small charts that describe the distribution of each variable.
Information about the Age numeric variable
As you can see from the previous example, pandas-profiling gives us several useful indicators, such as the percentage and number of missing values, along with the descriptive statistics we have already seen. Since Age is a numeric variable, its distribution is shown as a histogram, from which we can conclude that the distribution is skewed to the right.
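The same indicators can be reproduced by hand. The sketch below uses a made-up Age series (illustrative values, not the Titanic column) to compute the number and share of missing values and the sample skewness, where a positive skewness corresponds to a longer right tail.

```python
import pandas as pd

# Made-up ages with two missing values (not the real Titanic column).
age = pd.Series([18.0, 20.0, 21.0, 22.0, 24.0, 71.0, None, None], name="Age")

n_missing = int(age.isna().sum())  # count of missing values
p_missing = n_missing / len(age)   # share of missing values
skewness = age.skew()              # NaN values are skipped by default

print(f"missing: {n_missing} ({p_missing:.1%}), skewness: {skewness:.2f}")
```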
For a categorical variable, the results differ slightly from those for a numeric variable.
Information about the Sex categorical variable
Namely, instead of the mean, minimum, and maximum, pandas-profiling reports the number of classes. Since Sex is a binary variable, its values are represented by two classes.
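For comparison, here is a minimal plain-pandas sketch of the same kind of summary for a categorical column. The Sex values below are made up for illustration.

```python
import pandas as pd

# Made-up categorical column (not the real Titanic data).
sex = pd.Series(["male", "female", "female", "male", "male"], name="Sex")

n_classes = sex.nunique()                        # number of distinct classes
class_counts = sex.value_counts()                # observations per class
class_shares = sex.value_counts(normalize=True)  # share of each class

print(n_classes)
print(class_counts)
print(class_shares)
```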
If you like reading code as much as I do, you may be curious how exactly pandas-profiling calculates these metrics. Since the library's code is open and available on GitHub, finding out is not hard. I am not a big fan of black boxes in my projects, so I took a look at the library's source code. For example, this is how numeric variables are processed, in the function describe_numeric_1d:
def describe_numeric_1d(series, **kwargs):
    """Compute summary statistics of a numerical (`TYPE_NUM`) variable (a Series).

    Also create histograms (mini an full) of its distribution.

    Parameters
    ----------
    series : Series
        The variable to describe.

    Returns
    -------
    Series
        The description of the variable as a Series with index being stats keys.
    """
    # Format a number as a percentage. For example 0.25 will be turned to 25%.
    _percentile_format = "{:.0%}"
    stats = dict()
    stats['type'] = base.TYPE_NUM
    stats['mean'] = series.mean()
    stats['std'] = series.std()
    stats['variance'] = series.var()
    stats['min'] = series.min()
    stats['max'] = series.max()
    stats['range'] = stats['max'] - stats['min']
    # To avoid to compute it several times
    _series_no_na = series.dropna()
    for percentile in np.array([0.05, 0.25, 0.5, 0.75, 0.95]):
        # The dropna() is a workaround for https://github.com/pydata/pandas/issues/13098
        stats[_percentile_format.format(percentile)] = _series_no_na.quantile(percentile)
    stats['iqr'] = stats['75%'] - stats['25%']
    stats['kurtosis'] = series.kurt()
    stats['skewness'] = series.skew()
    stats['sum'] = series.sum()
    stats['mad'] = series.mad()
    stats['cv'] = stats['std'] / stats['mean'] if stats['mean'] else np.NaN
    stats['n_zeros'] = (len(series) - np.count_nonzero(series))
    stats['p_zeros'] = stats['n_zeros'] * 1.0 / len(series)
    # Histograms
    stats['histogram'] = histogram(series, **kwargs)
    stats['mini_histogram'] = mini_histogram(series, **kwargs)
    return pd.Series(stats, name=series.name)
Although this piece of code may look large and complicated, it is actually very easy to follow. The library's source code contains a function that determines the types of the variables. If the library encounters a numeric variable, the function above computes the metrics we have been looking at. It uses standard pandas operations on Series objects, such as series.mean(). The results are stored in the stats dictionary. The histograms are generated with an adapted version of the matplotlib.pyplot.hist function; the adaptation lets the function work with different kinds of datasets.
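To experiment with these metrics outside the library, a simplified standalone sketch follows. It is not the library's code: the name describe_numeric_sketch is hypothetical, the histograms are left out, and two calls are swapped for equivalents because Series.mad() was removed in pandas 2.0 and np.NaN was removed in NumPy 2.0.

```python
import numpy as np
import pandas as pd


def describe_numeric_sketch(series: pd.Series) -> pd.Series:
    """Simplified, hypothetical re-creation of the numeric summary (no histograms)."""
    stats = {}
    stats["mean"] = series.mean()
    stats["std"] = series.std()
    stats["min"] = series.min()
    stats["max"] = series.max()
    stats["range"] = stats["max"] - stats["min"]

    no_na = series.dropna()  # computed once and reused below
    for q in (0.05, 0.25, 0.5, 0.75, 0.95):
        stats[f"{q:.0%}"] = no_na.quantile(q)  # keys such as "5%", "25%", "50%"
    stats["iqr"] = stats["75%"] - stats["25%"]

    # Mean absolute deviation, written out because Series.mad() no longer exists.
    stats["mad"] = (no_na - no_na.mean()).abs().mean()
    # Coefficient of variation; np.nan replaces the removed np.NaN alias.
    stats["cv"] = stats["std"] / stats["mean"] if stats["mean"] else np.nan
    stats["n_zeros"] = int((series == 0).sum())
    stats["p_zeros"] = stats["n_zeros"] / len(series)
    return pd.Series(stats, name=series.name)


summary = describe_numeric_sketch(pd.Series([0.0, 1.0, 2.0, 3.0, 4.0], name="x"))
print(summary)
```

As in the original function, the result is a Series indexed by the statistic names, so individual values can be read with, for example, summary["iqr"].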
Correlation indicators and a sample of the data under study
After the results of the variable analysis, pandas-profiling shows the Pearson and Spearman correlation matrices in the Correlations section.
Pearson correlation matrix
If necessary, you can set the threshold values used in the correlation calculation in the line of code that triggers report generation. This lets you specify what strength of correlation counts as important for your analysis.
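The two matrices can also be computed directly with pandas, which makes it easier to see how they differ. The sketch below uses made-up numeric columns, and the 0.9 cut-off is an arbitrary illustrative threshold applied by hand, not a pandas-profiling setting.

```python
import pandas as pd

# Made-up numeric data: y grows monotonically but not linearly with x.
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [1, 4, 9, 16, 25],
    "z": [5, 3, 4, 1, 2],
})

pearson = df.corr(method="pearson")    # linear association
spearman = df.corr(method="spearman")  # rank-based (monotonic) association

# Flag pairs whose absolute Pearson correlation exceeds the illustrative threshold.
strong_pairs = pearson.abs() > 0.9

print(pearson)
print(spearman)
print(strong_pairs)
```

Because y is a monotonic function of x, the Spearman coefficient for that pair is 1, while the Pearson coefficient stays below 1.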
Finally, in the Sample section, the pandas-profiling report shows a piece of data taken from the beginning of the dataset as an example. This approach can lead to unpleasant surprises, because the first few observations may form a sample that does not reflect the characteristics of the whole dataset.
Section containing a sample of the data under study
For that reason, I do not recommend paying attention to this last section. It is better to use df.sample(5), which randomly selects 5 observations from the dataset.
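The difference is easy to see on an ordered frame. In this sketch the data is made up so that the first rows all belong to one class, and random_state is fixed only to make the draw repeatable.

```python
import pandas as pd

# Made-up frame sorted by class: the first ten rows are all Pclass 1.
df = pd.DataFrame({
    "PassengerId": range(1, 21),
    "Pclass": [1] * 10 + [3] * 10,
})

first_rows = df.head(5)                      # what a "from the start" sample shows
random_rows = df.sample(5, random_state=42)  # 5 rows drawn at random

print(first_rows)
print(random_rows)
```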
Summary
To sum up, the pandas-profiling library gives the analyst some useful capabilities that come in handy when you need to get a quick, rough idea of the data or hand an exploratory analysis report to someone else. The real work with the data, which takes its characteristics into account, is still done by hand, just as it is without pandas-profiling.
If you want to see what a complete exploratory data analysis looks like in a single Jupyter notebook, take a look at
Dear readers! Where do you start when you analyze a new dataset?
Source: www.habr.com