Which language should you choose for working with data: R or Python? Both! Migrating from pandas to tidyverse and data.table and back

If you search the internet for R or Python, you will find millions of articles and kilometers of discussion about which of them is better, faster, and more convenient for working with data. Unfortunately, all these articles and arguments are of little practical use.

The purpose of this article is to compare the basic data processing techniques in the most popular packages of both languages, and to help readers quickly pick up what they do not yet know: those who write in Python can find out how to do the same thing in R, and vice versa.

In the article we will look at the syntax of the most popular packages in R. These are the packages included in the tidyverse library, and also the data.table package. We will compare their syntax with pandas, the most popular data analysis package in Python.

We will go step by step through the whole data analysis workflow, from loading data to applying window functions, using both Python and R.

Contents

This article can be used as a cheat sheet if you have forgotten how to perform some data processing operation in one of the packages under consideration.

  1. Main syntax differences between R and Python
    1.1. Accessing package functions
    1.2. Assignment
    1.3. Indexing
    1.4. Methods and OOP
    1.5. Pipelines
    1.6. Data structures
  2. A few words about the packages we will use
    2.1. tidyverse
    2.2. data.table
    2.3. pandas
  3. Installing packages
  4. Loading data
  5. Creating dataframes
  6. Selecting the columns you need
  7. Filtering rows
  8. Grouping and aggregation
  9. Vertical union of tables (UNION)
  10. Horizontal join of tables (JOIN)
  11. Basic window functions and calculated columns
  12. Correspondence table between data processing methods in R and Python
  13. Conclusion
  14. A short survey about which package you use

If you are interested in data analysis, you may find my Telegram and YouTube channels useful. Most of the content is devoted to the R language.

Main syntax differences between R and Python

To make it easier for you to move from Python to R, or vice versa, I will list a few main points that you need to pay attention to.

Accessing package functions

Once a package has been loaded in R, you do not need to specify the package name to call its functions. Qualifying every call with the package name is not customary in R, but it is allowed. You can even skip loading the package altogether if you need only one of its functions in your code: just call it by giving the package name and the function name. The separator between the package and function names in R is a double colon: package_name::function_name().

In Python, by contrast, it is standard practice to call a package's functions by explicitly specifying its name. When a package is imported, it is usually given a short alias; for example, pandas is usually imported under the alias pd. A package function is called through a dot: package_name.function_name().
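
As a minimal sketch of the Python convention just described: the standard-library statistics module stands in for any package here, and the alias st is an arbitrary choice made only for this illustration.

```python
# Import a module under a short alias, then reach its functions through a dot.
import statistics as st

values = [2, 4, 6, 8]
avg = st.mean(values)          # alias.function_name()

# Importing a single name is also possible; this is loosely comparable to
# calling one function from an R package without attaching the whole package.
from statistics import median

mid = median(values)
print(avg, mid)
```

The alias form keeps it obvious where each function comes from, which is why it dominates in Python code.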

Assignment

In R it is customary to use an arrow to assign a value to an object: obj_name <- value. A single equals sign is also allowed, but in R it is mostly used to pass values to function arguments.

In Python, assignment is done only with a single equals sign: obj_name = value.

Indexing

There are also significant differences here. In R, indexing starts at one, and a range includes every element specified in it, so x[1:3] returns the first three elements.

In Python, indexing starts at zero, and the selected range does not include the last element specified in the index. So the construct x[i:j] in Python does not include element j.

Negative indexing differs as well: in R, the notation x[-1] returns all elements of the vector except the first one. In Python, the same notation returns only the last element.
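
The Python side of these rules can be checked with a short sketch on a plain list (the values are arbitrary and chosen only for illustration):

```python
x = [10, 20, 30, 40, 50]

first = x[0]            # indexing starts at zero
middle = x[1:3]         # positions 1 and 2; position 3 is excluded
last = x[-1]            # a negative index counts from the end
all_but_first = x[1:]   # rough counterpart of R's x[-1]

print(first, middle, last, all_but_first)
```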

Methods and OOP

R implements OOP in its own way; I wrote about this in the article "OOP in the R language (part 1): S3 classes". In general, R is a functional language, and everything in it is built on functions. Therefore, for Excel users, for example, moving to tidyverse will be easier than moving to pandas. Although this may be just my subjective opinion.

In short, objects in R do not have methods (if we are talking about S3 classes; there are other OOP implementations, but they are much less common). There are only generic functions that handle objects differently depending on their class.

Pipelines

Perhaps this name is not entirely correct for pandas, but I will try to explain what I mean.

To avoid saving intermediate calculations and cluttering the working environment with unnecessary objects, you can use a kind of pipeline. That is, you pass the result of one function straight to the next one without saving intermediate results.

Take the following example, in which we save intermediate calculations in separate objects:

temp_object <- func1()
temp_object2 <- func2(temp_object)
obj <- func3(temp_object2)

We performed 3 operations in sequence, and the result of each was saved in a separate object. But in fact we do not need these intermediate objects.

Or even worse, though more familiar to Excel users:

obj <- func3(func2(func1()))

In this case we did not save intermediate results, but code with nested functions is extremely inconvenient to read.

We will look at several approaches to data processing in R, and they perform similar operations in different ways.

Pipelines in the tidyverse library are implemented with the %>% operator.

obj <- func1() %>% 
            func2() %>%
            func3()

Here we take the result of func1() and pass it as the first argument to func2(), then pass the result of that calculation as the first argument to func3(). At the end, the result of the whole chain is written to the object obj.

All of the above is illustrated better than words by this meme:
[Image: meme about the pipe operator]

In data.table, chains are used in a similar way.

newDT <- DT[where, select|update|do, by][where, select|update|do, by][where, select|update|do, by]

In each pair of square brackets you can use the result of the previous operation.

In pandas, such operations are separated by a dot.

obj = df.fun1().fun2().fun3()

That is, we take our table df and apply its method fun1(), then apply the method fun2() to the result, and then fun3(). The final result is saved to the object obj.
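
To make the pandas chain less abstract, here is a small sketch with real methods in place of fun1(), fun2() and fun3(). The table and its values are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate method chaining.
df = pd.DataFrame({"source": ["google", "bing", "google"],
                   "sessions": [10, 3, 7]})

# Wrapping the chain in parentheses lets it span several lines.
obj = (
    df.query("sessions > 5")                       # step 1: keep busy rows
      .groupby("source", as_index=False)           # step 2: group by source
      .agg(total_sessions=("sessions", "sum"))     # step 3: aggregate
)
print(obj)
```

Each method returns a new object, so no intermediate variables appear in the working environment.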

Data structures

Data structures in R and Python are similar, but have different names.

| Description | Name in R | Name in Python/pandas |
|---|---|---|
| Tabular structure | data.frame, data.table, tibble | DataFrame |
| One-dimensional list of values | Vector | Series in pandas or list in pure Python |
| Multi-level non-tabular structure | List | Dictionary (dict) |

We will look at some other features and syntax differences below.
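
For readers coming from R, the Python column of the data structures table can be sketched like this. All names and values below are made up for the illustration.

```python
import pandas as pd

vector_like = [1, 2, 3]                          # pure Python list
series = pd.Series(vector_like)                  # pandas Series, closest to an R vector
list_like = {"name": "Ann", "scores": [5, 4]}    # dict, closest to an R list
table = pd.DataFrame({"id": [1, 2],
                      "gender": ["female", "male"]})  # tabular structure

print(type(series).__name__, type(table).__name__, table.shape)
```

One difference worth keeping in mind: a Python list may hold values of mixed types, while an R vector holds values of a single type.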

A few words about the packages we will use

First, I will tell you a little about the packages you will get to know in this article.

tidyverse

Official website: tidyverse.org
The tidyverse library was written by Hadley Wickham, Senior Research Scientist at RStudio. tidyverse consists of an impressive set of packages that simplify data processing, 5 of which are among the top 10 downloads from the CRAN repository.

The core of the library consists of the following packages: ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, forcats. Each of these packages is aimed at solving a specific problem. For example, dplyr is designed for data manipulation, tidyr brings data into a tidy form, stringr simplifies working with strings, and ggplot2 is one of the most popular data visualization tools.

The advantage of tidyverse is its simple and readable syntax, which in many ways resembles the SQL query language.

data.table

Official website: r-datatable.com

The author of data.table is Matt Dowle of H2O.ai.

The first release of the library took place in 2006.

The package syntax is not as convenient as in tidyverse and is more reminiscent of classic dataframes in R, but it is significantly extended in functionality.

All manipulations with a table in this package are described in square brackets, and if you translate the data.table syntax into SQL, you get something like this: data.table[ WHERE, SELECT, GROUP BY ]

The strength of this package is the speed with which it processes large volumes of data.

pandas

Official website: pandas.pydata.org

The name of the library comes from the econometric term "panel data", used to describe multidimensional structured datasets.

The author of pandas is the American Wes McKinney.

When it comes to data analysis in Python, pandas has no equal. It is a highly multifunctional, high-level package that lets you perform any manipulation with data, from loading it from any source to visualizing it.

Installing additional packages

The packages discussed in this article are not included in the base distributions of R and Python. There is one small caveat: if you installed the Anaconda distribution, you do not need to install pandas separately.

Installing packages in R

If you have opened the RStudio development environment at least once, you probably already know how to install a package in R. To install packages, use the standard install.packages() command, running it directly in R itself.

# install the packages
install.packages("vroom")
install.packages("readr")
install.packages("dplyr")
install.packages("data.table")

After installation, the packages need to be attached, for which the library() command is used in most cases.

# attach (import) the packages into the working environment
library(vroom)
library(readr)
library(dplyr)
library(data.table)

Installing packages in Python

If you have plain Python, you need to install pandas manually. Open a command prompt or terminal, depending on your operating system, and enter the following command.

pip install pandas

Then we return to Python and import the installed package with the import command.

import pandas as pd

Loading Data

Loading data is one of the most important steps in data analysis. Both Python and R give you ample opportunities to get data from almost any source: local files, files from the internet, websites, and all kinds of databases.

Throughout the article we will use several datasets:

  1. Two exports from Google Analytics.
  2. The Titanic passenger dataset.

All the data is on my GitHub in the form of csv and tsv files, and that is where we will request it from.

Loading data in R: tidyverse, vroom, readr

For loading data, the tidyverse library has two packages: vroom and readr. vroom is the more modern one, but in the future the packages may be merged.

A quote from the official vroom documentation:

vroom vs readr
What does the release of vroom mean for readr? For now we plan to let the two packages evolve separately, but likely we will unite the packages in the future. One disadvantage to vroom's lazy reading is certain data problems can't be reported up front, so how best to unify them requires some thought.

In this article we will look at both packages for loading data:

Loading data in R: the vroom package

# install.packages("vroom")
library(vroom)

# Read the data
## vroom
ga_nov  <- vroom("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/ga_nowember.csv")
ga_dec  <- vroom("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/ga_december.csv")
titanic <- vroom("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/titanic.csv")

Loading data in R: readr

# install.packages("readr")
library(readr)

# Read the data
## readr
ga_nov  <- read_tsv("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/ga_nowember.csv")
ga_dec  <- read_tsv("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/ga_december.csv")
titanic <- read_csv("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/titanic.csv")

In the vroom package, regardless of the csv / tsv data format, loading is done by the function of the same name, vroom(). In the readr package we use a different function for each format: read_tsv() and read_csv().

Loading data in R: data.table

In data.table, the fread() function is used for loading data.

Loading data in R: the data.table package

# install.packages("data.table")
library(data.table)

## data.table
ga_nov  <- fread("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/ga_nowember.csv")
ga_dec  <- fread("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/ga_december.csv")
titanic <- fread("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/titanic.csv")

Loading data in Python: pandas

If we compare with the R packages, the closest in syntax to pandas is readr, because pandas can request data from anywhere and has a whole family of read_*() functions.

  • read_csv()
  • read_excel()
  • read_sql()
  • read_json()
  • read_html()

And many other functions designed to read data from various formats. But for our purposes read_table() is enough, or read_csv() with the sep argument to specify the column separator.

Loading data in Python: pandas

import pandas as pd

ga_nov  = pd.read_csv("https://raw.githubusercontent.com/selesnow/publications/master/data_example/russian_text_in_r/ga_nowember.csv", sep = "\t")
ga_dec  = pd.read_csv("https://raw.githubusercontent.com/selesnow/publications/master/data_example/russian_text_in_r/ga_december.csv", sep = "\t")
titanic = pd.read_csv("https://raw.githubusercontent.com/selesnow/publications/master/data_example/russian_text_in_r/titanic.csv")
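
If you want to try the sep argument without downloading anything, the sketch below reads tab-separated text from memory. The content of raw is hypothetical and only imitates the shape of the Google Analytics export.

```python
import io
import pandas as pd

# Hypothetical tab-separated content standing in for a downloaded file.
raw = "date\tsource\tsessions\n2019-11-01\tgoogle\t10\n2019-11-02\tbing\t3\n"

# sep="\t" tells read_csv that the columns are separated by tabs.
ga = pd.read_csv(io.StringIO(raw), sep="\t")
print(ga.dtypes)
```

read_csv() accepts a path, a URL, or any file-like object, which is why the in-memory buffer works here.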

Creating dataframes

The titanic table that we loaded has a Sex field, which stores the passenger's gender identifier.

But for a more convenient presentation of the data by gender, it is better to use the name rather than the gender code.

To do this, we will create a small lookup table with just 2 columns (code and gender name) and, accordingly, 2 rows.

Creating a dataframe in R: tidyverse, dplyr

In the code example below, we create the dataframe we need using the tibble() function.

Creating a dataframe in R: dplyr

## dplyr
### create the lookup table
gender <- tibble(id = c(1, 2),
                 gender = c("female", "male"))

Creating a dataframe in R: data.table

Creating a dataframe in R: data.table

## data.table
### create the lookup table
gender <- data.table(id = c(1, 2),
                    gender = c("female", "male"))

Creating a dataframe in Python: pandas

In pandas, a dataframe is created in several steps: first we create a dictionary, and then we convert the dictionary into a dataframe.

Creating a dataframe in Python: pandas

# create the dictionary
gender_dict = {'id': [1, 2],
               'gender': ["female", "male"]}
# convert the dictionary into a dataframe
gender = pd.DataFrame.from_dict(gender_dict)
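
As a side note, here is a sketch showing that, for a dictionary of columns like this one, the plain DataFrame constructor gives the same result as from_dict(). The variable names via_from_dict and via_constructor are mine and used only for the comparison.

```python
import pandas as pd

gender_dict = {"id": [1, 2], "gender": ["female", "male"]}

via_from_dict = pd.DataFrame.from_dict(gender_dict)
via_constructor = pd.DataFrame(gender_dict)   # same table in one call

print(via_constructor)
```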

Selecting columns

The tables you work with may contain dozens or even hundreds of columns of data. But for analysis, as a rule, you do not need all the columns available in the source table.

Therefore, one of the first operations you will perform on the source table is to clear it of unnecessary information and free the memory that this information occupies.

Selecting columns in R: tidyverse, dplyr

The dplyr syntax is very similar to the SQL query language; if you are familiar with SQL, you will master this package quickly.

To select columns, use the select() function.

Below are code examples that select columns in the following ways:

  • By listing the names of the required columns
  • By matching column names against a regular expression
  • By data type or any other property of the data contained in the column

Selecting columns in R: dplyr

# Selecting the required columns
## dplyr
### select by column name
select(ga_nov, date, source, sessions)
### exclude by column name
select(ga_nov, -medium, -bounces)
### select by regular expression: columns whose names end with s
select(ga_nov, matches("s$"))
### select by condition: integer columns only
select_if(ga_nov, is.integer)

Selecting columns in R: data.table

The same operations in data.table are performed slightly differently. At the beginning of the article I described which arguments go inside the square brackets in data.table.

DT[i,j,by]

Where:
i - where, i.e. filtering by rows
j - select|update|do, i.e. selecting columns and transforming them
by - data grouping

Selecting columns in R: data.table

## data.table
### select by column name
ga_nov[ , .(date, source, sessions) ]
### exclude by column name
ga_nov[ , .SD, .SDcols = ! names(ga_nov) %like% "medium|bounces" ]
### select by regular expression
ga_nov[, .SD, .SDcols = patterns("s$")]

The .SD variable lets you access all columns, and .SDcols filters the required columns using regular expressions or other functions applied to the column names.

Selecting columns in Python: pandas

To select columns by name in pandas, it is enough to pass a list of their names. To select or exclude columns by name using regular expressions, use the drop() and filter() functions with the argument axis=1, which indicates that columns rather than rows should be processed.

To select columns by data type, use the select_dtypes() function, passing to the include or exclude argument a list of data types that correspond to the columns you need.

Selecting columns in Python: pandas

# Select columns by name
ga_nov[['date', 'source', 'sessions']]
# Exclude by name
ga_nov.drop(['medium', 'bounces'], axis=1)
# Select by regular expression
ga_nov.filter(regex="s$", axis=1)
# Select numeric columns
ga_nov.select_dtypes(include=['number'])
# Select text columns
ga_nov.select_dtypes(include=['object'])
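
The same calls can be tried on a tiny table without loading anything. The ga dataframe below is a hypothetical miniature of the Google Analytics export, with made-up values.

```python
import pandas as pd

# Hypothetical miniature of the Google Analytics export.
ga = pd.DataFrame({"date": ["2019-11-01", "2019-11-02"],
                   "source": ["google", "bing"],
                   "medium": ["organic", "cpc"],
                   "sessions": [10, 3],
                   "bounces": [4, 1]})

by_name = ga[["date", "source", "sessions"]]
dropped = ga.drop(columns=["medium", "bounces"])   # equivalent to axis=1
ends_with_s = ga.filter(regex="s$", axis=1)        # names ending with s
numeric = ga.select_dtypes(include=["number"])

print(list(ends_with_s.columns), list(numeric.columns))
```

None of these calls modifies ga; each returns a new dataframe, so the result has to be assigned if you want to keep it.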

Filtering rows

For example, the source table may contain several years of data, while you only need to analyze the last month. Again, the extra rows will slow down data processing and clog up your PC's memory.

Filtering rows in R: tidyverse, dplyr

In dplyr, the filter() function is used to filter rows. It takes a dataframe as the first argument, and then you list the filtering conditions.

When writing logical expressions to filter a table in this case, specify column names without quotes and without the table name.

When using several logical expressions for filtering, use the following operators:

  • & or a comma - logical AND
  • | - logical OR

Filtering rows in R: dplyr

# filtering rows
## dplyr
### filter rows by one condition
filter(ga_nov, source == "google")
### filter by two conditions joined with logical AND
filter(ga_nov, source == "google" & sessions >= 10)
### filter by two conditions joined with logical OR
filter(ga_nov, source == "google" | sessions >= 10)

Filtering rows in R: data.table

As I already wrote above, in data.table the data transformation syntax is enclosed in square brackets.

DT[i,j,by]

Where:
i - where, i.e. filtering by rows
j - select|update|do, i.e. selecting columns and transforming them
by - data grouping

The i argument, which has the first position in the square brackets, is used to filter rows.

Columns are referenced in logical expressions without quotes and without specifying the table name.

Logical expressions are combined in the same way as in dplyr, through the & and | operators.

Filtering rows in R: data.table

## data.table
### filter rows by one condition
ga_nov[source == "google"]
### filter by two conditions joined with logical AND
ga_nov[source == "google" & sessions >= 10]
### filter by two conditions joined with logical OR
ga_nov[source == "google" | sessions >= 10]

Filtering rows in Python: pandas

Filtering rows in pandas is similar to filtering in data.table, and is done in square brackets.

In this case, columns must be accessed through the dataframe name; the column name can then be given either in quotes in square brackets (for example df['col_name']) or without quotes after a dot (for example df.col_name).

If you need to filter a dataframe by several conditions, each condition must be placed in parentheses. Logical conditions are combined with the & and | operators.

Filtering rows in Python: pandas

# Filtering table rows
### filter rows by one condition
ga_nov[ ga_nov['source'] == "google" ]
### filter by two conditions joined with logical AND
ga_nov[(ga_nov['source'] == "google") & (ga_nov['sessions'] >= 10)]
### filter by two conditions joined with logical OR
ga_nov[(ga_nov['source'] == "google") | (ga_nov['sessions'] >= 10)]
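
Readers used to dplyr may also like the query() method, which is not used elsewhere in this article but takes the condition as a string, so column names need no dataframe prefix. A small sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical rows in the shape of the Google Analytics export.
ga = pd.DataFrame({"source": ["google", "bing", "google", "yandex"],
                   "sessions": [12, 15, 4, 2]})

# Boolean masks: every condition in its own parentheses.
mask_and = ga[(ga["source"] == "google") & (ga["sessions"] >= 10)]
mask_or = ga[(ga["source"] == "google") | (ga["sessions"] >= 10)]

# query(): the same AND condition written as a string expression.
query_and = ga.query('source == "google" and sessions >= 10')

print(len(mask_and), len(mask_or))
```

The parentheses around each condition matter because & and | bind more tightly than comparison operators in Python.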

Grouping and aggregating data

Grouping and aggregation are among the most commonly used operations in data analysis.

The syntax for performing these operations differs across all the packages we are reviewing.

In this case, we will take the titanic dataframe as an example and calculate the number and average cost of tickets depending on the cabin class.

Grouping and aggregating data in R: tidyverse, dplyr

In dplyr, the group_by() function is used for grouping, and summarise() for aggregation. In fact, dplyr has a whole family of summarise_*() functions, but the purpose of this article is to compare basic syntax, so we will not go that deep.

Basic aggregation functions:

  • sum() - sum
  • min() / max() - minimum and maximum value
  • mean() - mean
  • median() - median
  • length() - count

Grouping and aggregation in R: dplyr

## dplyr
### group and aggregate rows
group_by(titanic, Pclass) %>%
  summarise(passangers = length(PassengerId),
            avg_price  = mean(Fare))

We passed the titanic table to group_by() as the first argument, and then specified the Pclass field by which we group the table. The result of this operation was passed through the %>% operator as the first argument to summarise(), where we added 2 more fields: passangers and avg_price. In the first, we used length() to count the tickets, and in the second, we used mean() to get the average ticket price.

Grouping and aggregating data in R: data.table

In data.table, the j argument, which has the second position in the square brackets, is used for aggregation, and by or keyby, which have the third position, are used for grouping.

The list of aggregation functions in this case is identical to the one described for dplyr, because these are functions from base R.

Grouping and aggregation in R: data.table

## data.table
### group and aggregate rows
titanic[, .(passangers = length(PassengerId),
            avg_price  = mean(Fare)),
        by = Pclass]

Grouping and aggregating data in Python: pandas

Grouping in pandas is similar to dplyr, but aggregation is unlike either dplyr or data.table.

To group, use the groupby() method, passing it a list of the columns by which the dataframe will be grouped.

For aggregation you can use the agg() method, which accepts a dictionary. The dictionary keys are the columns to which the aggregation functions are applied, and the values are the names of the aggregation functions.

Aggregation functions:

  • sum() - sum
  • min() / max() - minimum and maximum value
  • mean() - mean
  • median() - median
  • count() - count

The reset_index() function in the example below resets the nested indexes that pandas sets by default after aggregating data.

The \ symbol lets you continue the expression on the next line.

Grouping and aggregation in Python: pandas

# group and aggregate the data
titanic.groupby(["Pclass"]) \
    .agg({'PassengerId': 'count', 'Fare': 'mean'}) \
    .reset_index()
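
With the dictionary form the result columns keep their source names (PassengerId, Fare). If you prefer result names like those in the dplyr example, pandas also supports named aggregation. Below is a sketch on hypothetical rows that only borrow the Titanic column names; the output names passengers and avg_price are my choice.

```python
import pandas as pd

# Hypothetical rows with the same column names as the Titanic dataset.
titanic = pd.DataFrame({"PassengerId": [1, 2, 3, 4],
                        "Pclass": [1, 1, 3, 3],
                        "Fare": [80.0, 60.0, 8.0, 12.0]})

# Named aggregation: new_column=(source_column, function_name).
summary = (
    titanic.groupby("Pclass")
           .agg(passengers=("PassengerId", "count"),
                avg_price=("Fare", "mean"))
           .reset_index()
)
print(summary)
```

Wrapping the chain in parentheses is an alternative to the \ line continuation.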

Vertical union of tables

This is an operation in which you combine two or more tables of the same structure. The data we loaded contains the tables ga_nov and ga_dec. These tables are identical in structure, i.e. they have the same columns and the same data types in those columns.

These are exports from Google Analytics for November and December; in this section we will combine this data into one table.

Vertical union of tables in R: tidyverse, dplyr

In dplyr, you can combine 2 tables into one using the bind_rows() function, passing the tables as its arguments.

Vertical union of tables in R: dplyr

# Vertical union of tables
## dplyr
bind_rows(ga_nov, ga_dec)

Vertical union of tables in R: data.table

Nothing complicated here either; we use rbind().

Vertical union of tables in R: data.table

## data.table
rbind(ga_nov, ga_dec)

Vertical union of tables in Python: pandas

In pandas, the concat() function is used to combine tables; you pass it a list of the dataframes to combine.

Vertical union of tables in Python: pandas

# vertical union of tables
pd.concat([ga_nov, ga_dec])
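
One detail that is easy to miss: by default concat() keeps the original row labels of each table, so they repeat in the result. A sketch on hypothetical data, using the ignore_index argument to renumber the rows:

```python
import pandas as pd

# Hypothetical miniatures of the two monthly exports.
ga_nov = pd.DataFrame({"source": ["google", "bing"], "sessions": [10, 3]})
ga_dec = pd.DataFrame({"source": ["google"], "sessions": [7]})

# ignore_index=True renumbers rows 0..n-1 instead of keeping each table's own labels.
ga_all = pd.concat([ga_nov, ga_dec], ignore_index=True)
print(ga_all)
```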

Horizontal join of tables

This is an operation in which columns from a second table are added to the first table by key. It is often used to enrich a fact table (for example, a table with sales data) with reference data (for example, the cost of a product).

There are several types of joins:

[Image: diagram of join types]

In the titanic table we loaded earlier there is a Sex column, which holds the gender code:

1 - female
2 - male

We also created a lookup table, gender. For a more convenient presentation of the data by passenger gender, we need to add the gender name from the gender lookup table to the titanic table.

Horizontal join of tables in R: tidyverse, dplyr

dplyr has a whole family of functions for horizontal joins:

  • inner_join()
  • left_join()
  • right_join()
  • full_join()
  • semi_join()
  • nest_join()
  • anti_join()

The one I use most often in practice is left_join().

As the first two arguments, the functions listed above take the two tables to join, and as the third argument, by, you specify the columns to join on.

Horizontal join of tables in R: dplyr

# join the tables
left_join(titanic, gender,
          by = c("Sex" = "id"))

Horizontal join of tables in R: data.table

In data.table, tables are joined by key using the merge() function.

Arguments of the merge() function in data.table

  • x, y - the tables to join
  • by - the key column, if it has the same name in both tables
  • by.x, by.y - the names of the key columns, if they have different names in the tables
  • all, all.x, all.y - the join type: all returns all rows from both tables, all.x corresponds to a LEFT JOIN (keeps all rows of the first table), all.y corresponds to a RIGHT JOIN (keeps all rows of the second table).

Horizontal join of tables in R: data.table

# join the tables
merge(titanic, gender, by.x = "Sex", by.y = "id", all.x = TRUE)

Horizontal join of tables in Python: pandas

As in data.table, in pandas the merge() function is used to join tables.

Arguments of the merge() function in pandas

  • how - the join type: left, right, outer, inner
  • on - the key column, if it has the same name in both tables
  • left_on, right_on - the names of the key columns, if they have different names in the tables

Horizontal join of tables in Python: pandas

# join by key
titanic.merge(gender, how = "left", left_on = "Sex", right_on = "id")
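
To see what a left join does with a key that is missing from the lookup table, here is a sketch on hypothetical rows. The code 3 in the Sex column is deliberately invented; it has no match in gender.

```python
import pandas as pd

# Hypothetical rows; Sex code 3 is invented to show an unmatched key.
titanic = pd.DataFrame({"PassengerId": [1, 2, 3], "Sex": [2, 1, 3]})
gender = pd.DataFrame({"id": [1, 2], "gender": ["female", "male"]})

joined = titanic.merge(gender, how="left", left_on="Sex", right_on="id")
print(joined)
```

All rows of the left table survive, and the unmatched row gets missing values in the columns that come from gender.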

Basic window functions and calculated columns

Window functions are similar in meaning to aggregate functions, and are also often used in data analysis. But unlike aggregate functions, window functions do not change the number of rows in the output dataframe.

In essence, with a window function we split the incoming dataframe into parts according to some criterion, i.e. by the value of one or several fields, and perform arithmetic operations on each window. The result of these operations is returned in every row, i.e. without changing the total number of rows in the table.

For example, take the titanic table. We can calculate what share of its cabin class total each ticket's cost makes up.

To do this, we need to get in each row the total cost of tickets for the cabin class to which the ticket in that row belongs, and then divide the cost of each ticket by the total cost of all tickets of the same cabin class.

Window functions in R: tidyverse, dplyr

To add new columns without collapsing rows, dplyr provides the mutate() function.

You can solve the problem described above by grouping the data by the Pclass field and summing the Fare field into a new column. Then ungroup the table and divide the Fare values by the result of the previous step.

Window functions in R: dplyr

group_by(titanic, Pclass) %>%
  mutate(Pclass_cost = sum(Fare)) %>%
  ungroup() %>%
  mutate(ticket_fare_rate = Fare / Pclass_cost)

Window functions in R: data.table

The solution algorithm remains the same as in dplyr: we need to split the table into windows by the Pclass field, output in a new column the group total that corresponds to each row, and add a column in which we calculate the share of each ticket's cost in its group.

To add new columns, data.table has the := operator. Below is an example of solving the problem using the data.table package.

Window functions in R: data.table

# Pclass_cost does not exist yet inside the same := call, so the class total is computed again
titanic[, c("Pclass_cost", "ticket_fare_rate") := .(sum(Fare), Fare / sum(Fare)),
        by = Pclass]

Window functions in Python: pandas

One way to add a new column in pandas is to use the assign() function. To sum the ticket cost by cabin class without collapsing rows, we will use the transform() function.

Below is an example of a solution in which we add the same 2 columns to the titanic table.

Window functions in Python: pandas

titanic.assign(Pclass_cost      = titanic.groupby('Pclass').Fare.transform('sum'),
               ticket_fare_rate = lambda x: x['Fare'] / x['Pclass_cost'])
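
The effect is easiest to see on a few hypothetical rows, where the class totals can be checked by hand. The fares below are made up.

```python
import pandas as pd

# Hypothetical fares: class 1 totals 100, class 3 totals 20.
titanic = pd.DataFrame({"Pclass": [1, 1, 3, 3],
                        "Fare": [80.0, 20.0, 5.0, 15.0]})

result = titanic.assign(
    # transform() returns one value per original row, not one per group
    Pclass_cost=titanic.groupby("Pclass")["Fare"].transform("sum"),
    # the lambda receives the dataframe that already contains Pclass_cost
    ticket_fare_rate=lambda d: d["Fare"] / d["Pclass_cost"],
)
print(result)
```

The number of rows stays the same, which is exactly what distinguishes a window function from an aggregation.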

Correspondence table of functions and methods

Below is a table of correspondence between the methods for performing various data operations in the packages we have reviewed.

| Description | tidyverse | data.table | pandas |
|---|---|---|---|
| Loading data | vroom() / readr::read_csv() / readr::read_tsv() | fread() | read_csv() |
| Creating dataframes | tibble() | data.table() | dict() + from_dict() |
| Selecting columns | select() | argument j, second position in square brackets | pass a list of the required columns in square brackets / drop() / filter() / select_dtypes() |
| Filtering rows | filter() | argument i, first position in square brackets | list the filter conditions in square brackets / filter() |
| Grouping and aggregation | group_by() + summarise() | arguments j + by | groupby() + agg() |
| Vertical union of tables (UNION) | bind_rows() | rbind() | concat() |
| Horizontal join of tables (JOIN) | left_join() / *_join() | merge() | merge() |
| Basic window functions and adding calculated columns | group_by() + mutate() | argument j using the := operator + argument by | transform() + assign() |

Conclusion

Perhaps the implementations of data processing I described in this article are not the most optimal ones, so I will be glad if you correct my mistakes in the comments, or simply supplement the information given here with other techniques for working with data in R / Python.

As I wrote above, the purpose of the article was not to impose an opinion about which language is better, but to make it easier to learn both languages or, if necessary, to migrate between them.

If you liked the article, I will be glad to have new subscribers on my YouTube and Telegram channels.

Poll

Which of the following packages do you use in your work?

In the comments you can write the reason for your choice.

Which data processing package do you use (you can choose several)?

  • tidyverse: 45.2% (19 votes)

  • data.table: 33.3% (14 votes)

  • pandas: 54.8% (23 votes)

42 users voted. 9 users abstained.

Source: www.habr.com
